Performance considerations
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even though we aim for the best possible performance with Apache Tika and Tesseract OCR, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Separate Processing Stages
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
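A minimal sketch of running the two stages as separate commands. The --mode CLI, --stage and --dataDir flags and the data path are assumptions that do not appear on this page, so check them against datashare --help. The commands are printed rather than executed, which lets you try the sketch without Datashare installed.

```shell
#!/bin/sh
# Dry run: the commands are printed, not executed.
# Assumptions: the --mode, --stage and --dataDir flag names, and the path below.
DATA_DIR="/path/to/documents"   # hypothetical path, replace with your own

# First pass: SCAN only.
SCAN_CMD="datashare --mode CLI --stage SCAN --dataDir $DATA_DIR"
# Second pass: INDEX only, which can run later or on another machine.
INDEX_CMD="datashare --mode CLI --stage INDEX --dataDir $DATA_DIR"

echo "$SCAN_CMD"
echo "$INDEX_CMD"
```

To run the stages for real, execute the two printed commands one after the other.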
Distribute the INDEX Stage
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
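As a rough sketch, each worker server could run the same INDEX command pointed at the shared Redis and ElasticSearch instances. The hostnames are hypothetical, and the --queueType, --redisAddress and --elasticsearchAddress flag names are assumptions that are not confirmed on this page, so verify them with datashare --help before relying on them.

```shell
#!/bin/sh
# Dry run: prints the command a worker server would run.
# Assumptions: hostnames are hypothetical; flag names must be checked
# against `datashare --help`.
REDIS_ADDRESS="redis://redis.example.internal:6379"
ES_ADDRESS="http://elasticsearch.example.internal:9200"

WORKER_CMD="datashare --mode CLI --stage INDEX --queueType REDIS"
WORKER_CMD="$WORKER_CMD --redisAddress $REDIS_ADDRESS"
WORKER_CMD="$WORKER_CMD --elasticsearchAddress $ES_ADDRESS"

echo "$WORKER_CMD"
```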
Leverage Parallelism
Datashare offers the --parallelism and --parserParallelism options to enhance processing speed.
Example (for g4dn.8xlarge with 32 CPUs):
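A sketch of how the two options might be set on a 32-CPU machine. The rule of thumb used here (reserve a few cores, then split the rest evenly between the two options) is an assumption for illustration, not a tuned recommendation, and --mode CLI and --stage are assumed flag names.

```shell
#!/bin/sh
# Dry run: the command is printed, not executed.
CPUS=32      # CPU count of a g4dn.8xlarge
RESERVED=4   # hypothetical headroom left for the OS and other services

# Split the remaining cores evenly between the two options: (32 - 4) / 2 = 14.
THREADS=$(( (CPUS - RESERVED) / 2 ))

INDEX_CMD="datashare --mode CLI --stage INDEX --parallelism $THREADS --parserParallelism $THREADS"
echo "$INDEX_CMD"
```

Measure throughput on a sample of your own documents before settling on values, since the best split depends on the file types involved.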
Optimize ElasticSearch
ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.
Adjust JAVA_OPTS
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 120 GB of memory):
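A minimal sketch of setting the JVM heap through JAVA_OPTS. -Xms and -Xmx are the standard JVM options for initial and maximum heap size. The sizes below are illustrative assumptions, not a recommendation for this instance type.

```shell
#!/bin/sh
# Illustrative heap sizes only: 10 GB initial heap, 50 GB maximum heap.
# Leave memory free for the OS and for any other service on the machine.
JAVA_OPTS="-Xms10g -Xmx50g"
export JAVA_OPTS

# Any datashare process started from this shell inherits the variable.
echo "JAVA_OPTS=$JAVA_OPTS"
```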
Specify Document Language
If the document language is known, explicitly setting it can save processing time.
Use --language for the general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
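A sketch for a corpus known to be in French, combining both options described above. The --mode CLI and --stage flags are assumptions not shown on this page, and the command is printed rather than executed.

```shell
#!/bin/sh
# Dry run: the command is printed, not executed.
DOC_LANGUAGE="FRENCH"   # value for --language
OCR_LANGUAGE="fra"      # matching Tesseract model for --ocrLanguage

INDEX_CMD="datashare --mode CLI --stage INDEX --language $DOC_LANGUAGE --ocrLanguage $OCR_LANGUAGE"
echo "$INDEX_CMD"
```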
Manage OCR Tasks Wisely
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
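A sketch of an INDEX run with OCR turned off, for example when the documents already contain a text layer. Only --ocr false comes from this page; --mode CLI and --stage are assumed flag names. The command is printed rather than executed.

```shell
#!/bin/sh
# Dry run: the command is printed, not executed.
OCR_ENABLED="false"   # skip Tesseract entirely

INDEX_CMD="datashare --mode CLI --stage INDEX --ocr $OCR_ENABLED"
echo "$INDEX_CMD"
```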
Efficient Handling of Large Files
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing them with Datashare. If there are too many of them to extract by hand, keep in mind that Datashare can still extract them itself.
Example to split Outlook PST files into multiple .eml files with readpst:
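A sketch of one possible readpst invocation, assuming the libpst package is installed. The file and directory names are hypothetical. The options used are -r (recreate the folder hierarchy), -e (one file per message, with an extension such as .eml), -D (include deleted items) and -o (output directory). The commands are printed rather than executed.

```shell
#!/bin/sh
# Dry run: the commands are printed, not executed.
PST_FILE="archive.pst"   # hypothetical input file
OUT_DIR="extracted"      # hypothetical output directory

# Create the output directory first, then extract into it.
MKDIR_CMD="mkdir -p $OUT_DIR"
READPST_CMD="readpst -r -e -D -o $OUT_DIR $PST_FILE"

echo "$MKDIR_CMD"
echo "$READPST_CMD"
```

Point Datashare at the output directory afterwards so it indexes the individual messages instead of the original PST file.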