Do you recommend OS or machines for large corpuses?
Datashare was created with scalability in mind which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use Apache Tesseract) so if your documents don't contain many images, it might be worth deactivating it (--ocr false
).
Last updated