Improving the performance of Datashare involves several techniques and configuration choices. Extracting text from many file types and images is a heavy process: even though we work to get the best possible performance out of Apache Tika and Tesseract OCR, it can be expensive. Below are some tips to improve the speed and performance of your Datashare setup.
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
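As a minimal sketch, the two stages can be launched as separate commands that share a Redis queue. The path and Redis URL are placeholders, and the --dataDir, --queueType and --redisAddress options are assumptions about your Datashare version; confirm them with datashare --help. The snippet only prints the commands so they can be reviewed before being run.

```shell
# Placeholder values (assumptions): adjust to your environment.
DATA_DIR="/path/to/documents"
REDIS_URL="redis://redis:6379"
QUEUE_OPTS="--queueType REDIS --redisAddress $REDIS_URL"

# Stage 1: walk the data directory and queue every file path.
SCAN_CMD="datashare --mode CLI --stage SCAN --dataDir $DATA_DIR $QUEUE_OPTS"

# Stage 2: consume the queue, extract text and index it.
INDEX_CMD="datashare --mode CLI --stage INDEX $QUEUE_OPTS"

# Print both commands for review; paste them into a shell to run them.
echo "$SCAN_CMD"
echo "$INDEX_CMD"
```

Because the queue lives in Redis rather than in memory, the INDEX command can be started later, or on another machine, once the SCAN has finished.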
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
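A sketch of the command each indexing server might run, assuming all servers share one Redis queue filled by a single SCAN and write to the same ElasticSearch instance. The host names are hypothetical, and the --queueType, --redisAddress and --elasticsearchAddress options should be checked against datashare --help for your version. It also assumes every server can read the documents at the same path, for example through a shared volume.

```shell
# Hypothetical endpoints shared by every indexing server.
REDIS_URL="redis://redis.internal:6379"
ES_URL="http://elasticsearch.internal:9200"

# Run the same command on each server: workers pull different files
# from the shared queue, so the workload spreads across machines.
CMD="datashare --mode CLI --stage INDEX --queueType REDIS --redisAddress $REDIS_URL --elasticsearchAddress $ES_URL"

# Printed for review; paste it into a shell on each server to run it.
echo "$CMD"
```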
Datashare offers the --parallelism and --parserParallelism options to enhance processing speed.
Example (for g4dn.8xlarge with 32 CPUs):
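As an illustration, the sketch below derives both values from the CPU count by reserving a few cores for the system and splitting the rest evenly. This heuristic and the split_cpus helper are assumptions made for the example, not an official recommendation; measure on your own workload before settling on values.

```shell
# Hypothetical heuristic: keep `reserved` cores free, split the rest in two.
split_cpus() {
  cpus=$1
  reserved=${2:-4}
  half=$(( (cpus - reserved) / 2 ))
  # Never go below one thread.
  if [ "$half" -lt 1 ]; then
    half=1
  fi
  echo "$half"
}

CPUS=32   # g4dn.8xlarge; on Linux you could use "$(nproc)" instead
THREADS=$(split_cpus "$CPUS")

CMD="datashare --mode CLI --stage INDEX --parallelism $THREADS --parserParallelism $THREADS"

# Printed for review; paste it into a shell to run it.
echo "$CMD"
```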
ElasticSearch can consume significant CPU and memory and may become a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 120 GB of memory):
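A sketch that sizes the JVM heap from the memory available, using the standard -Xms (initial heap) and -Xmx (maximum heap) JVM options. The proportions are assumptions chosen for illustration; leave enough memory for the operating system and for anything else running on the machine.

```shell
TOTAL_GB=120                  # memory available on the instance
XMX_GB=$(( TOTAL_GB / 2 ))    # assumption: cap the heap at half of it
XMS_GB=$(( XMX_GB / 4 ))      # assumption: start with a quarter of the cap

# Export the variable so that a Datashare process started from this
# shell inherits it.
export JAVA_OPTS="-Xms${XMS_GB}g -Xmx${XMX_GB}g"
echo "JAVA_OPTS=$JAVA_OPTS"
```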
If the document language is known, explicitly setting it can save processing time.
Use --language for general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
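A minimal sketch, assuming every document in the batch is in French and that the matching Tesseract model (fra) is installed on the machine. The --mode and --stage values are assumptions about how you run the INDEX stage.

```shell
# Assumption: the whole batch is French, so both language options are fixed.
CMD="datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra"

# Printed for review; paste it into a shell to run it.
echo "$CMD"
```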
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
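A minimal sketch of an INDEX run with OCR turned off, which suits batches that contain no scanned pages or images with text. The --mode and --stage values are assumptions about how you run the INDEX stage.

```shell
# Skip OCR entirely: text embedded in documents is still extracted,
# but images and scanned pages are not read.
CMD="datashare --mode CLI --stage INDEX --ocr false"

# Printed for review; paste it into a shell to run it.
echo "$CMD"
```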
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing them with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
Example to split Outlook PST files into multiple .eml files with readpst:
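A sketch using readpst from the libpst package. The file and directory names are hypothetical. The options used are -r (recreate the folder hierarchy), -e (one file per message, with an .eml extension), -D (include deleted items) and -o (output directory).

```shell
PST_FILE="archive.pst"   # hypothetical input file
OUT_DIR="extracted"      # hypothetical output directory, which must already exist

CMD="readpst -reD -o $OUT_DIR $PST_FILE"

# Printed for review; create the output directory (mkdir -p "$OUT_DIR"),
# then paste the command into a shell to run it.
echo "$CMD"
```

The resulting directory of .eml files can then be passed to Datashare as the data directory instead of the original PST file.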