Performance considerations

Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from many file types and from images is a heavy process: even though we aim for the best possible performance with Apache Tika and Tesseract OCR, it can be expensive. Below are some tips to improve the speed and performance of your Datashare setup.

Separate Processing Stages

Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.

Examples:

datashare --mode CLI --stage SCAN --redisAddress redis://redis:6379 --busType REDIS
datashare --mode CLI --stage INDEX --redisAddress redis://redis:6379 --busType REDIS
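
The two commands above can be chained so that INDEX only starts once SCAN has finished successfully. The following is a minimal sketch, not an official script: run_stage, REDIS_URL and DRY_RUN are hypothetical names introduced for illustration, and by default the commands are only printed.

```shell
#!/bin/sh
# Redis address taken from the examples above; adjust to your setup.
REDIS_URL="${REDIS_URL:-redis://redis:6379}"
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# run_stage is a hypothetical helper, not part of Datashare.
run_stage() {
  # Build the command line for the requested stage (SCAN or INDEX).
  set -- datashare --mode CLI --stage "$1" --redisAddress "$REDIS_URL" --busType REDIS
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# INDEX is only launched if SCAN exits successfully.
run_stage SCAN && run_stage INDEX
```

Set DRY_RUN=0 to actually execute the stages.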

Distribute the INDEX Stage

Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.

For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.

Leverage Parallelism

Datashare offers the --parallelism and --parserParallelism options (and --nlpParallelism for the NLP stage) to enhance processing speed.

Example (for g4dn.8xlarge with 32 CPUs):

datashare --mode CLI --stage INDEX --parallelism 14 --parserParallelism 14
datashare --mode CLI --stage NLP --parallelism 14 --nlpParallelism 14
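
If you run the same command on machines of different sizes, the value can be derived from the CPU count. This is only a sketch: suggest_parallelism is a hypothetical helper, and its "keep 4 cores free, split the rest in two" rule is an assumption chosen because it gives 14 for 32 CPUs as in the example above. It is not an official Datashare recommendation.

```shell
#!/bin/sh
# suggest_parallelism is a hypothetical helper, not part of Datashare.
# Assumption: reserve 4 cores for the OS and other services, then split the
# remaining cores evenly between the two parallelism options.
suggest_parallelism() {
  cpus="$1"
  value=$(( (cpus - 4) / 2 ))
  # Never go below one worker on small machines.
  if [ "$value" -lt 1 ]; then
    value=1
  fi
  echo "$value"
}

# nproc is a GNU coreutils command; fall back to 4 if it is unavailable.
CPUS="$(nproc 2>/dev/null || echo 4)"
P="$(suggest_parallelism "$CPUS")"

# Print the resulting command rather than running it.
echo "datashare --mode CLI --stage INDEX --parallelism $P --parserParallelism $P"
```

Treat the result as a starting point and adjust it after watching CPU and memory usage during a real run.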

Optimize ElasticSearch

ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.

Adjust JAVA_OPTS

You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.

Example (for g4dn.8xlarge with 120 GB memory):

JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX
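
In these options, -Xms sets the initial JVM heap size and -Xmx the maximum heap size. The sketch below builds the value from two sizes in GB and rejects an initial heap larger than the maximum. heap_opts is a hypothetical helper introduced for illustration, and the sizes are the ones from the example above, not a general recommendation.

```shell
#!/bin/sh
# heap_opts is a hypothetical helper, not part of Datashare.
# Usage: heap_opts <initial heap in GB> <maximum heap in GB>
heap_opts() {
  min_gb="$1"
  max_gb="$2"
  # The JVM refuses to start when the initial heap exceeds the maximum heap.
  if [ "$min_gb" -gt "$max_gb" ]; then
    echo "initial heap (${min_gb} GB) exceeds maximum heap (${max_gb} GB)" >&2
    return 1
  fi
  echo "-Xms${min_gb}g -Xmx${max_gb}g"
}

# Same values as the example above.
JAVA_OPTS="$(heap_opts 10 50)"
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"
```

Leave enough memory outside the heap for the operating system and for any other service running on the same machine.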

Specify Document Language

If the document language is known, explicitly setting it can save processing time.

  • Use --language for general language setting (e.g., FRENCH, ENGLISH).

  • Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).

Example:

datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell
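
When the language is chosen per batch, a small lookup keeps the two options in sync. This is a sketch that covers only the pairs listed above: ocr_language_for is a hypothetical helper, and any other language has to be added by hand with the matching Tesseract model name.

```shell
#!/bin/sh
# ocr_language_for is a hypothetical helper, not part of Datashare.
# It maps a --language value to the --ocrLanguage value used in the examples above.
ocr_language_for() {
  case "$1" in
    FRENCH)  echo "fra" ;;
    ENGLISH) echo "eng" ;;
    CHINESE) echo "chi_sim" ;;
    GREEK)   echo "ell" ;;
    *)
      echo "no OCR model known for language: $1" >&2
      return 1
      ;;
  esac
}

LANGUAGE="${LANGUAGE:-FRENCH}"
if OCR_LANGUAGE="$(ocr_language_for "$LANGUAGE")"; then
  # Print the resulting command rather than running it.
  echo "datashare --mode CLI --stage INDEX --language $LANGUAGE --ocrLanguage $OCR_LANGUAGE"
fi
```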

Manage OCR Tasks Wisely

OCR tasks are resource-intensive. If OCR is not needed, disabling it can significantly improve processing speed. You can disable OCR with --ocr false.

Example:

datashare --mode CLI --stage INDEX --ocr false

Efficient Handling of Large Files

Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing them with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.

Example to split Outlook PST files into multiple .eml files with readpst:

readpst -reD <Filename>.pst
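
To apply the same command to every PST file in a folder, the call can be generated in a loop. The sketch below only prints the commands so they can be reviewed first. extract_commands is a hypothetical helper, and writing each mailbox to a sibling directory with readpst's -o option is an assumption about the desired layout. Paths containing single quotes are not handled.

```shell
#!/bin/sh
# extract_commands is a hypothetical helper, not part of Datashare or readpst.
# It prints one extraction command per .pst file found under the given directory.
extract_commands() {
  src_dir="$1"
  find "$src_dir" -type f -name '*.pst' | while IFS= read -r pst; do
    # Output directory: same path as the PST file, without the .pst extension.
    out_dir="${pst%.pst}"
    echo "mkdir -p '$out_dir' && readpst -reD -o '$out_dir' '$pst'"
  done
}

# Usage (prints the commands; pipe the output to sh to execute them):
#   extract_commands /path/to/documents
```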

Datashare is an open source project by the International Consortium of Investigative Journalists