Do you recommend an OS or machines for large corpora?
Last updated
Datashare was designed with scalability in mind, which has allowed ICIJ to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most resource-intensive operation is OCR, so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
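To judge whether a corpus "contains many images", one rough approach is to measure the share of image files before indexing. The sketch below is only an illustration and not part of Datashare: the function name image_share and the extension list are assumptions, and it only looks at file extensions, so scanned PDFs (which also need OCR) are not detected.

```python
from pathlib import Path

# Hypothetical extension list (an assumption); adjust it to your corpus.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif"}


def image_share(root):
    """Return the fraction of files under `root` whose extension looks like an image."""
    # Walk the directory tree recursively and keep regular files only.
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    if not files:
        return 0.0
    # Compare extensions case-insensitively, so ".JPG" counts as well.
    images = sum(1 for p in files if p.suffix.lower() in IMAGE_EXTENSIONS)
    return images / len(files)
```

If image_share("/path/to/documents") returns a small value and you know the PDFs are not scans, running the indexing with --ocr false is probably worth trying on a sample first.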