This page lists all the concepts implemented by Datashare that users may want to understand before starting to search within documents.
To report a bug, please open an issue on our GitHub repository, detailing your logs along with:
your Operating System (Mac, Windows or Linux)
the version of your Operating System
the version of Datashare
screenshots of your issue
a description of your issue.
If for confidentiality reasons you don't want to open an issue on GitHub, please write to datashare@icij.org and our team will do its best to answer you in a timely manner.
In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
Datashare allows you to search within your files, regardless of their format. It is free and open-source software developed by the International Consortium of Investigative Journalists (ICIJ).
Welcome to Datashare, a self-hosted document search application. It is free and open-source software developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, this tool is now a fully-featured search interface to dig into your documents. With the help of several open source tools (Extract, Apache Tika, Tesseract, CoreNLP, OpenNLP, Elasticsearch, etc.), Datashare can be used on one single personal computer as well as on 100 interconnected servers.
Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested in investigations like the Panama Papers or the Luanda Leaks. Seeing the growing interest in ICIJ's technology, we decided to open source this key component of our investigations so that a single journalist as well as big media organizations could use it on their own documents.
Curious to know more about how we use Datashare?
We set up a demo instance of Datashare with a small set of documents from the Luxleaks investigation (2014). When using this instance, you will be assigned a temporary user account which can star, tag, recommend and explore documents.
Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.
When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non-tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly with our API.
We implemented the possibility to create plugins to make this kind of customization more accessible. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.
This project is currently available in English, French, Spanish and Japanese. You can help us to improve and complete translations on Crowdin.
When running Datashare from the command-line, you can pick which "stage" to apply to analyze your documents.
The CLI stages are primarily intended to be run for an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.
This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (the next step). Once this task is done, you can move to the next step. This stage cannot be distributed.
Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entity mentions in Elasticsearch, it will also mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can be parallelized across several servers.
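For instance, a minimal sketch of running these stages from the command-line (the --stages and --dataDir flag names and the addresses are assumptions to check against datashare --help):

```bash
# scan and index first, then extract named entities
datashare --mode CLI --stages SCAN,INDEX --dataDir /path/to/documents \
  --elasticsearchAddress http://localhost:9200 --redisAddress redis://localhost:6379
datashare --mode CLI --stages NLP --nlpp CORENLP \
  --elasticsearchAddress http://localhost:9200 --redisAddress redis://localhost:6379
```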
Datashare runs using different modes, each with its own specificities.
In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Each running mode has its own advantages and limitations. This matrix summarizes the differences:
When running Datashare in local mode, users can choose to use embedded services (like Elasticsearch, SQLite, an in-memory key/value store) in the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
In CLI mode, Datashare starts without a web server and allows users to perform tasks on their documents. This mode can be used in conjunction with both local and server modes, and allows users to distribute heavy tasks between several servers.
These modes are intended to be used for actions that require waiting for pending tasks.
In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search and bundles them into a zip file.
In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form in Datashare where they can upload a list of search terms (in CSV format). The daemon will then start a task to search for all matching documents and store every occurrence in the database.
Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. Yet when starting Datashare in the command line, you can explicitly specify the running mode. For instance on Ubuntu/Debian:
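A minimal sketch (SERVER is one possible value; without a --mode flag, the embedded mode is used):

```bash
datashare --mode SERVER
```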
The INDEX stage is probably the most important (and heaviest!) one. It pulls the documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text, metadata and OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command on several servers.
If you want to learn more about which tasks you can execute in this mode, check out the CLI documentation.
LOCAL (Web): to run Datashare on a single computer for a single user.
SERVER (Web): to run Datashare on a server for multiple users.
CLI (CLI): to index documents and analyze them directly in the command-line.
TASK_RUNNER (Daemon): to execute async tasks (batch searches, batch downloads, scan, index, NER extraction).
|  | local | server |
| --- | --- | --- |
| Multi-users | ✗ | ✓ |
| Multi-projects | ✗ | ✓ |
| Access-control | ✗ | ✓ |
| Indexing UI | ✓ | ✗ |
| Plugins UI | ✓ | ✗ |
| Extension UI | ✓ | ✗ |
| HTTP API | ✓ | ✓ |
| API Key | ✗ | ✓ |
| Single JVM | ✓ | ✗ |
| Tasks execution | ✓ | ✗ |
It will help you set up and install Datashare on your computer.
The installer will setup:
MacPorts (if neither Homebrew nor MacPorts is installed)
Tesseract OCR with MacPorts or Homebrew
Java JRE 17
Datashare executable
Go to your "Downloads" directory in Finder and double-click "datashare-X.Y.Z.pkg":
Click 'Continue', 'Install', enter your password and 'Install Software':
The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing Xcode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.
You can see what it actually does by pressing Command+L, which opens a window that logs every action performed.
In the end, you should see this screen:
This guide will explain how to install Datashare on Mac. The installer will take care of checking that your system has all the dependencies needed to run Datashare. Because this software uses Tesseract (to perform Optical Character Recognition) and Mac doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.
Xcode Command Line Tools (if neither Xcode nor the Command Line Tools are installed)
Note: previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore, but Datashare is still available and supported with Docker.
Go to datashare.icij.org, scroll down and click 'Download for Mac'.
You can now install Datashare!
Datashare provides a folder on your computer where you can collect the documents to index in Datashare.
When you open your desktop, you will see a folder called 'Datashare Data'. Move or copy and paste the documents you want to add to Datashare to this folder:
Now open Datashare, which you will find in your main menu (see above: 'Open Datashare')
Once Datashare has opened, click on 'Analyze documents' on the top navigation bar in Datashare:
You're now ready to analyze your documents in Datashare.
Datashare provides a folder on your computer where you can collect the documents to index in Datashare.
Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':
On the menu bar at the top of your computer, click 'Go'. Click on 'Home' (the house icon).
You will see a folder called 'Datashare':
If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:
Copy or place the documents you want to have in Datashare in this Datashare folder.
Open your Applications. You should see Datashare. Double click on it:
Datashare opens in your default internet browser. Click 'Tasks':
Click the 3rd tab 'Analyze your documents':
You can now analyze your documents in Datashare.
Find the application on your computer and have it running locally in your browser.
Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)
A window called 'Terminal' will have opened, showing the progress of opening Datashare. Do not close this black window as long as you use Datashare.
Keep this Terminal window open as long as you use Datashare.
Find the application on your computer and run it locally on your browser.
Once Datashare is installed, go to "Finder", then "Applications", and double-click on "Datashare".
A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening.
Keep this Terminal window open as long as you use Datashare.
It will help you set up the software on your computer.
The file "datashare-X.Y.Z.exe" is now downloaded. Double click on the name of the file in order to execute it.
As Datashare is not signed, this popup asks for your permission. Don't click 'Don't run' but click 'More info':
Click 'Run anyway':
It asks if you want to allow the app to make changes to your device. Click 'Yes':
On the Installer Wizard, as you need to download and install OpenJDK11 if it is not installed on your device, click 'Install':
The following windows with progress bars will be displayed:
Choose a language and click 'OK':
To install Tesseract OCR, click the following buttons on the Installer Wizard's windows:
Untick 'Show README' and click 'Finish':
Finally, click "Close" to close the installer of TesseractOCR.
It now downloads the back end and the front end, Datashare.jar:
When it is finished, click 'Close':
Datashare should now automatically open in your default internet browser. If it doesn't, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an Internet connection (see FAQ: Can I use Datashare with no internet connection?).
It's now time to add documents to Datashare.
Datashare should now automatically open in your default internet browser. If it doesn't, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an Internet connection (see FAQ: Can I use Datashare with no internet connection?).
You can now add documents to Datashare.
Before we start, please uninstall any prior standard version of Datashare if you had already installed it. You can follow these steps:
Go to datashare.icij.org, scroll down and click 'Download for free'.
You can now install Datashare!
Install Datashare will help you set up the software on your computer.
Currently, only a .deb package for Debian/Ubuntu is provided.
If you want to run it with another Linux distribution, you can download the latest version of the Datashare jar here: https://github.com/ICIJ/datashare/releases/latest
And adapt the following launch script to your environment: https://github.com/ICIJ/datashare/blob/master/datashare-dist/src/main/deb/bin/datashare
Go to datashare.icij.org, scroll down and click 'Download .deb'
Save the Debian package as a file
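You can then install it from a terminal; a minimal sketch (the file name is illustrative):

```bash
# install the package, then pull in any missing dependencies it reports
sudo dpkg -i datashare-X.Y.Z.deb
sudo apt-get install -f
```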
You can now start Datashare!
This page explains how to start Datashare within a Docker container.
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system.
To start Datashare within a Docker container, you can use this command:
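A minimal sketch of such a command (the image tag and running mode are assumptions to adapt to your setup):

```bash
docker run -ti -p 8080:8080 \
  -v "$HOME/Datashare:/home/datashare/Datashare" \
  icij/datashare:latest --mode EMBEDDED
```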
Make sure the Datashare folder exists in your home directory or this command will fail. This is an example of how to use Datashare with Docker; data will not be persisted.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).
By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare with Docker Compose, you can use the following docker-compose.yml file:
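The original file is not reproduced here; below is a minimal sketch of what such a file might look like, built from the options documented in this guide (image tags, credentials and the LOCAL mode command are assumptions to adapt):

```yaml
version: "3.7"
services:
  datashare:
    image: icij/datashare:latest
    command: >
      --mode LOCAL
      --dataDir /home/datashare/Datashare
      --elasticsearchAddress http://elasticsearch:9200
      --redisAddress redis://redis:6379
      --messageBusAddress redis://redis:6379
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare&password=password
    ports:
      - "8080:8080"
    volumes:
      - ${HOME}/Datashare:/home/datashare/Datashare
    depends_on: [postgresql, elasticsearch, redis]
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
    environment:
      - discovery.type=single-node
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
  postgresql:
    image: postgres:15
    environment:
      - POSTGRES_USER=datashare
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=datashare
    volumes:
      - postgresql-data:/var/lib/postgresql/data
  redis:
    image: redis:7
    volumes:
      - redis-data:/data
volumes:
  elasticsearch-data:
  postgresql-data:
  redis-data:
```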
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
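With Docker Compose V1 (use docker compose instead of docker-compose with V2):

```bash
docker-compose up -d
```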
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:
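For instance, by following the logs of the datashare service:

```bash
docker-compose logs -f datashare
```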
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
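With the same V1/V2 caveat as above:

```bash
docker-compose down
```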
This will stop and remove the containers, freeing up system resources.
Find the application on your computer and run it locally on your browser.
Start Datashare by launching it from the command-line:
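Assuming the Debian package put the datashare launcher on your PATH:

```bash
datashare
```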
Datashare should now automatically open in your default internet browser. If it doesn't, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an Internet connection (see: Can I use Datashare with no internet connection?).
It's now time to add documents to Datashare.
This page will explain to you how to install language packages to support Optical Character Recognition (OCR) on more languages.
To be able to perform OCR, Datashare uses an open source technology called Tesseract. When Tesseract extracts text from images, it uses "language packages" specially trained for each specific language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare the installer doesn't install them all by default. In case Datashare informs you of a missing package, this guide explains how to manually install it on your system.
To add OCR languages on Linux, simply use the following command:
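On Debian/Ubuntu, the language packs are distributed as system packages (the package name pattern below follows Debian's tesseract-ocr packaging):

```bash
sudo apt install tesseract-ocr-[lang]
```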
Where `[lang]` can be `all` if you want to install all languages, or a specific language code.
First, you must check that MacPorts is installed on your computer. Please run this in a Terminal:
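For instance:

```bash
port version
```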
You should see an output similar to this:
If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):
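Assuming MacPorts follows its tesseract-<lang> port naming:

```bash
sudo port install tesseract-deu
```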
Once the installation is done, simply close and restart Datashare to be able to use the newly installed packages.
If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!
If you want to check if Homebrew is installed, run the following command in a Terminal:
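For instance:

```bash
brew --version
```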
You should see an output similar to this:
*Additional languages can also be added during Tesseract's installation.
The list of installed languages can be checked with the Windows command prompt or PowerShell with the command tesseract --list-langs.
Datashare has to be restarted after the language installation.
It will help you index your documents and make them available in Datashare. This step is required in order to explore your documents.
1. To add your documents in Datashare, click 'Tasks' in the left menu:
2. Click 'Analyze your documents':
3. Click 'Add documents' so Datashare can extract the texts from your files:
You can:
Select the specific folder or sub-folder containing the documents you want to add.
Extract text also from images/PDFs (OCR). Be aware that indexing can then take up to 10 times longer.
Skip already indexed files.
Two extraction tasks are now running: the first is the scanning of your Datashare folder which sees if there are new documents to analyze (ScanTask). The second is the indexing of these files (IndexTask):
It is not possible to 'Find people, organizations and locations' while one of these two tasks is still running.
When tasks are done, you can start exploring documents by clicking 'Search' in the left menu but you won't have the named entities (names of people, organizations and locations) yet. To have these, follow the steps below.
1. After the text is extracted, you can launch named entities recognition by clicking the button 'Find people, organizations and locations'.
2. In the window below, you are asked to choose between finding Named Entities or finding email addresses (you cannot do both simultaneously, you need to do one after the other, no matter the order):
You can now see running tasks and their progress. After they are done, you can click 'Clear done tasks' to stop displaying tasks that are completed.
3. You can search your indexed documents without having to wait for all tasks to be done. To access your documents, click 'Search':
To extract email addresses in your documents:
Click again on 'Find people, organizations, locations and email addresses' (in Tasks (left menu) > Analyze your documents)
Click the second radio button 'Find email addresses':
a language code (e.g. fra for French); the list of available languages can be found in the Tesseract documentation
The Datashare Installer for Mac checks for the existence of either MacPorts or Homebrew, the package managers used for the installation of Tesseract. If none of those two package managers is present, the Datashare Installer will install MacPorts by default.
If you get a command not found: port error, this either means you are using Homebrew (see next section) or you have not installed MacPorts yet.
The full list of supported language packages can be found in Tesseract's documentation.
If you get a command not found: brew error, this means Homebrew is not installed on your system. You might either use MacPorts (see previous section) or install Homebrew on your computer.
Language packages are available on Tesseract's GitHub repositories. Trained data files have to be downloaded and added into the tessdata folder in Tesseract's installation folder.
Select the language of your document if you don't want Datashare to guess it automatically. Note: if you choose to also extract text from images (previous option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to learn how to install it.
You can now analyze your documents.
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your computer.
Follow the instructions of the dedicated FAQ page to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click on the 'Other platforms and versions' button when downloading to access other versions if necessary.
If it's not done yet, analyze your documents and extract both the names of people, organizations and locations, as well as email addresses.
If your project contains email documents, make sure to also extract email addresses.
You can now run Datashare with the neo4j plugin!
It will help you locally add plugins and extensions to Datashare.
Plugins are small programs that you can add to Datashare's front-end to get new features (the front-end is the interface, "the part of the software with which the user interacts directly" - source).
Extensions are small programs that you can add to Datashare's back-end to add new features (the back-end is "the part of the software that is not directly accessed by the user, typically responsible for storing and manipulating data" - source).
Go to "Settings":
Click "Plugins":
Choose the plugin you want to add and click "Install now":
If you want to install a plugin from a URL, click "Install plugin from URL".
Your plugin is installed.
Refresh your page to see your new plugin activated in Datashare.
Go to "Settings":
Click "Extensions":
Choose the extension you want to add and click "Install now":
If you want to install an extension from a URL, click "Install extension from URL".
Your extension is installed.
Restart Datashare to see your new extension activated in Datashare.
When a newer version of a plugin or extension is available, you can click on the "Update" button to get the latest version.
After that, if it is a plugin, refresh your page to activate the latest version.
If it is an extension, restart Datashare to activate the latest version.
People who code can create their own plugins and extensions by following these steps:
Datashare provides a folder on your computer where you can collect the documents to index in Datashare.
You can find a folder called 'Datashare' in your home directory.
Move the documents you want to add to Datashare into this folder.
Open Datashare to extract text and, optionally, find people, organizations and locations in your documents.
You can now analyze your documents.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Datashare is launched with --mode SERVER and you have to provide:
the external elasticsearch index address elasticsearchAddress
a Redis store address redisAddress
a Redis data bus address messageBusAddress
a database JDBC URL dataSourceUrl
the host of Datashare (used to generate batch search results URLs) rootHost
an authentication mechanism and its parameters
Example:
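A sketch of such a command (addresses, credentials and the auth filter class are illustrative; pick the authentication mechanism described in the authentication pages):

```bash
datashare --mode SERVER \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379 \
  --messageBusAddress redis://redis:6379 \
  --dataSourceUrl "jdbc:postgresql://postgresql/datashare?user=datashare&password=password" \
  --rootHost https://datashare.example.org \
  --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter
```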
This page describes how to create your neo4j graph and keep it up to date with your computer's Datashare projects.
Open the 'Projects' page and select your project:
Create the graph by clicking on the 'Create graph' button inside the neo4j widget:
You will see a new import task running:
When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:
When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.
To update the graph click on the 'Update graph' button inside the neo4j widget:
To detect whether a graph update is needed you can compare the number of documents found inside Datashare to the number found in the 'Graph statistics' and run an update in case of mismatch:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
explore your graph using your favorite visualization tool
1. Go to "Settings":
2. Make sure the following settings are properly set:
Neo4j Host should be localhost or the address where your neo4j instance is running
Neo4j Port should be the port where your neo4j instance is running (7687 by default)
Neo4j User should be set to your neo4j user name (neo4j by default)
Neo4j Password should only be set if your neo4j user is using password authentication
3. When running Neo4j Community Edition, set the Neo4j Single Project value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project which will use the neo4j plugin. Other projects won't be able to use the neo4j plugin.
4. Restart Datashare to apply the changes
5. You should be able to see the neo4j widget in your project page. After a little while its status should be RUNNING:
This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.
Here is a simple command to scan a directory and index its files:
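A sketch, assuming the Docker Compose setup from the installation page and that the image's entrypoint is the Datashare launcher (the --stages and --dataDir flag names are assumptions to check against datashare --help):

```bash
docker-compose run --rm datashare \
  --mode CLI --stages SCAN,INDEX \
  --dataDir /home/datashare/Datashare \
  --elasticsearchAddress http://elasticsearch:9200
```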
What's happening here:
The SCAN stage feeds an in-memory queue with the files to add
The INDEX stage pulls files from the queue to add them to Elasticsearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/, a directory mounted from the host machine
Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:
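A sketch of the first phase (same assumptions as above):

```bash
# phase 1: scan only, queueing the file paths in Redis
docker-compose run --rm datashare \
  --mode CLI --stages SCAN \
  --dataDir /home/datashare/Datashare \
  --redisAddress redis://redis:6379
```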
Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in the datashare:queue:
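For instance with redis-cli:

```bash
docker-compose exec redis redis-cli lrange datashare:queue 0 19
```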
Once the indexing is done, Datashare will exit gracefully and your documents will already be visible in Datashare.
Sometimes you will face the case where you have an existing index and you want to index additional documents inside your working directory without processing every document again. It can be done in two steps:
Scan the existing Elasticsearch index and gather document paths to store them inside a report queue
Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths inside of it will be skipped (see the sketch after this list)
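A sketch of these two steps (the SCANIDX stage name and the --reportName flag are assumptions to check against datashare --help):

```bash
# step 1: fill a report queue with the paths already present in the index
docker-compose run --rm datashare --mode CLI --stages SCANIDX \
  --reportName extract:report \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200

# step 2: scan and index with OCR, skipping the paths found in the report
docker-compose run --rm datashare --mode CLI --stages SCAN,INDEX \
  --ocr true --dataDir /home/datashare/Datashare \
  --reportName extract:report \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200
```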
Install the neo4j plugin following the instructions available in the dedicated section.
You can now create the graph!
This document assumes you have installed Datashare in server mode within Docker.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
Datashare starts in "CLI" mode
We ask to process both SCAN and INDEX stages at the same time
The INDEX stage can now be executed in the same container:
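A sketch (same assumptions as the previous commands):

```bash
docker-compose run --rm datashare \
  --mode CLI --stages INDEX \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200
```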
This page explains how to start Datashare within a Docker container in server mode.
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file:
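The original file is not reproduced here; below is a partial sketch showing only the datashare service in SERVER mode (the elasticsearch, postgresql and redis services and volumes are the same as in the local mode example; image tag, credentials and auth filter class are assumptions):

```yaml
services:
  datashare:
    image: icij/datashare:latest
    command: >
      --mode SERVER
      --elasticsearchAddress http://elasticsearch:9200
      --redisAddress redis://redis:6379
      --messageBusAddress redis://redis:6379
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare&password=password
      --rootHost http://localhost:8080
      --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter
    ports:
      - "8080:8080"
    depends_on: [postgresql, elasticsearch, redis]
```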
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
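With Docker Compose V1 (use docker compose instead of docker-compose with V2):

```bash
docker-compose up -d
```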
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:
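For instance, by following the logs of the datashare service:

```bash
docker-compose logs -f datashare
```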
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
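With the same V1/V2 caveat as above:

```bash
docker-compose down
```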
This will stop and remove the containers, freeing up system resources.
If you reach that point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.
Datashare has the ability to detect email addresses and the names of people, organizations and locations. You must perform the named entities extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.
Install the neo4j plugin using the Datashare CLI so that users can access it from the frontend:
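A sketch using the CLI's plugin and extension install options (the flag names are assumptions to check against datashare --help):

```bash
docker-compose run --rm datashare \
  --pluginInstall datashare-plugin-neo4j-graph-widget \
  --extensionInstall datashare-extension-neo4j
```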
Installing the plugin installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and will also install the datashare-extension-neo4j backend extension inside /home/datashare/extensions. These locations can be changed by updating the docker-compose.yml.
Update the docker-compose.yml to reflect your neo4j docker service settings.
If you choose a different neo4j user or set a password for your neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.
When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set DS_DOCKER_NEO4J_SINGLE_PROJECT to the name of the project which will use the neo4j plugin. Other projects won't be able to use the neo4j plugin.
After installing the plugin, a restart might be needed for the plugin to display:
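For instance:

```bash
docker-compose restart datashare
```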
You can now create the graph!
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your server.
Follow the instructions of the dedicated FAQ page to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click on the 'Other platforms and versions' button when downloading to access other versions if necessary.
If it's not done yet, add entities to your project using the Datashare CLI.
If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the nlpp flag to --nlpp CORENLP,EMAIL.
You can now run Datashare with the neo4j plugin!
Authentication with Datashare in server mode is the most impactful choice that has to be made. It can be one of the following:
basic authentication with credentials stored in database (PostgreSQL)
basic authentication with credentials stored in Redis
OAuth2 with credentials provided by an identity provider (KeyCloak for example)
dummy basic auth to accept any user (⚠️ if the service is exposed to the internet, it will leak your documents)
This document assumes you have installed Datashare in server mode within Docker and already added documents to Datashare.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses and the names of people, organizations and locations. This process uses a Natural Language Processing pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entities extraction in the same fashion as the previous CLI stages:
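A sketch matching the explanation below (flag names are assumptions to check against datashare --help):

```bash
docker-compose run --rm datashare \
  --mode CLI --stages NLP \
  --nlpp CORENLP --parallelism 2 \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379
```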
What's happening here:
Datashare starts in "CLI" mode
We ask to process the NLP stage
We tell Datashare to use the elasticsearch service
Datashare will pull documents from ElasticSearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
The first time you run this command, you will have to wait a little because Datashare needs to download CORENLP's models, which can be big.
You can also chain the 3 stages together:
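A sketch (same assumptions as above):

```bash
docker-compose run --rm datashare \
  --mode CLI --stages SCAN,INDEX,NLP \
  --dataDir /home/datashare/Datashare \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379
```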
As for the previous stages, you may want to restore the output queue from the INDEX stage. You can do:
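A sketch (same assumptions as above):

```bash
docker-compose run --rm datashare \
  --mode CLI --stages ENQUEUEIDX,NLP \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379
```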
The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the ids of those documents into the extract:queue:nlp queue.
OAuth2 authentication with a third-party id service
This is the default authentication mode: if none is provided in the CLI, it will be selected. With OAuth2 you will need a third-party authorization service. The diagram below describes the workflow:
We made a small demo repository to show how it could be set up.
Basic authentication with a database.
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the Authorization header with user:password base64 encoded:
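For example, for the credentials user:password:

```
Authorization: Basic dXNlcjpwYXNzd29yZA==
```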
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a database user inventory. You can launch Datashare first with the full database URL; Datashare will then automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.
Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
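For instance, to hash the password mypassword:

```bash
echo -n "mypassword" | sha256sum
```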
Then you can insert the user like this in your database:
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example, if you use myindex, the value would look like {"datashare": ["local-datashare", "myindex"]}.
Or you can use PostgreSQL's CSV import (COPY statement) if you want to create them all at once.
Then when accessing Datashare, you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed in database:
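A sketch (the auth filter class is an assumption; depending on your version, an additional option may be needed to select the database-backed user store, so check datashare --help):

```bash
docker run -p 8080:8080 icij/datashare:latest \
  --mode SERVER \
  --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
  --dataSourceUrl "jdbc:postgresql://postgresql/datashare?user=datashare&password=password" \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379 \
  --messageBusAddress redis://redis:6379 \
  --rootHost http://localhost:8080
```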
Basic authentication with Redis
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the Authorization header with user:password base64 encoded:
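For example, for the credentials user:password:

```
Authorization: Basic dXNlcjpwYXNzd29yZA==
```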
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded. For example using bash:
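For instance, to hash the password mypassword:

```bash
echo -n "mypassword" | sha256sum
```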
Then insert the user like this in Redis:
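A hypothetical sketch (the user:<uid> key layout and the JSON fields are assumptions; adapt to your Datashare version):

```bash
# store the user as a JSON blob; the password is the sha256 hex from above
redis-cli set user:jsmith \
  '{"uid":"jsmith","password":"<sha256-hex-of-password>","group_by_applications":{"datashare":["local-datashare"]}}'
```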
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example, if you use myindex, the value would look like {"datashare": ["local-datashare", "myindex"]}.
Then you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed in Redis:
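A sketch (the auth filter class is an assumption; depending on your version, an additional option may be needed to select the Redis-backed user store, so check datashare --help):

```bash
docker run -p 8080:8080 icij/datashare:latest \
  --mode SERVER \
  --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379 \
  --messageBusAddress redis://redis:6379
```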
You can search with the main search bar, with operators, and also within a document thanks to Control or Command + F.
1. To see all your documents (you need to have added documents to Datashare and have analyzed them before), click 'Search in documents':
If not collapsed yet, to collapse the left menu in order to gain room, click the 'hamburger menu':
2. Search for specific documents. Type terms in the search bar, press Enter or click 'Search':
IMPORTANT:
To make your searches more precise, you can search with operators (AND, OR, ...): read more here.
If you get a message "Your search query is wrong", it is probably because you are misusing one or some reserved characters (like ^ " ? ( [ * OR AND etc). Please refer to this page.
3. You can search in specific fields like tags, title, author, recipient, content, path or thread ID. Click 'All fields' and select your choice in the dropdown menu:
Select the view on the top right.
List:
Grid:
Table:
Once a document is opened, you can search for terms in this document:
Press Command (⌘) + F (on Mac) or Control + F (on Windows and Linux) or click on the search bar above your Extracted Text
Type what you search for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
(To know all the shortcuts in Datashare, please read 'Use keyboard shortcuts'.)
This also counts the number of occurrences of your searched terms in this document:
If you ran email extraction and searched for one or several email addresses, and the email addresses are in the email's metadata (recipient, sender or another field), an 'in metadata' label will be attached to the email addresses:
To make your searches more precise, you can use operators in the main search bar.
To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).
Example: "Alicia Martinez's bank account in Portugal"
To have all documents mentioning all or one of the queried terms, you can use a simple space between your queries or 'OR'. You need to write 'OR' with all letters uppercase.
Example: Alicia Martinez
Same search: Alicia OR Martinez
To have all documents mentioning all the queried terms, you can use 'AND' between your queried words. You need to write 'AND' with all letters uppercase.
Example: Alicia AND Martinez
Same search: +Alicia +Martinez
To have all documents NOT mentioning some queried terms, you can use 'NOT' before each word you don't want. You need to write 'NOT' with all letters uppercase.
Example: NOT Martinez
Same search: !Martinez
Same search: -Martinez
Parentheses should be used whenever multiple operators are used together and you want to give priority to some.
Example: ((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"
If you search faithf?l, the search engine will look for all words with any possible single character between the second f and the l in this word. It also works with * to replace multiple characters.
Example: Alicia Martin?z
Example: Alicia Mar*z
You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:
Example: quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:
Example: "john smith"^2 (foo bar)^4
1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between 2 slashes.
Example: /.*\..*@.*\..*/
The example above will search for any expression structured like an email address, with a dot between two expressions before the @ and a dot between two expressions after the @, as in 'first.lastname@email.com' for instance.
2. Regex can be combined with standard queries in Datashare:
Example: ("Ada Lovelace" OR "Ado Lavelace") AND paris AND /.*\..*@.*\..*/
3. You need to escape the following characters by typing a backslash just before them (without space): # @ & < > ~
Example: /.*\..*\@.*\..*/ (the @ was escaped by a backslash \ just before it)
We encourage you to use the AND operator to work around this limitation and make sure you can make your search.
If you're looking for a French International Bank Account Number (IBAN), which may or may not contain spaces and which contains FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:
Example: /FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)
Here are a few examples of useful Regex:
You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.
You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with either y nor u.
You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.
Other common Regex examples:
phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/
credit cards: /(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35[0-9]{3})[0-9]{11})/
To find the list of existing metadata fields, go to a document's 'Tags and details' tab, click 'Show more details'.
When you hover the lines, you see a magnifying glass on each line. Click on it and Datashare will look for this field. Here is the one for content language:
Here is the one for 'indexing date' (also called extraction date here) for instance:
So for example, if you are looking for documents that:
contains term1, term2 and term3
and were created after 2010
you can use the 'Date' filter or type in the search bar:
term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01
Explanations:
'metadata.tika_metadata_creation_date:' means that we filter by creation date
'>=' means 'since January 1st, 2010, included'
'2010-01-01' stands for January 1st, 2010, and the search will include that date
For other searches:
'>' will mean 'strictly after (with January 1st excluded)'
nothing will mean 'at this exact date'
You can search for numbers in a range. Ranges can be specified for date, numeric or string fields among the ones you can find by clicking the magnifying glass when you hover the fields in a document's 'Tags and Details' tab.
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
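A sketch of running the two stages separately (flag names are assumptions to check against datashare --help):

```bash
# scan on one machine, filling a shared Redis queue
datashare --mode CLI --stages SCAN --dataDir /path/to/documents \
  --redisAddress redis://redis-host:6379
# index on one or several machines, pulling from the same queue
datashare --mode CLI --stages INDEX \
  --redisAddress redis://redis-host:6379 \
  --elasticsearchAddress http://es-host:9200
```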
Datashare offers --parallelism and --parserParallelism options to enhance processing speed.
Example (for g4dn.8xlarge with 32 CPUs):
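A sketch with illustrative values (tune them to your own hardware):

```bash
datashare --mode CLI --stages INDEX \
  --parallelism 16 --parserParallelism 8 \
  --redisAddress redis://redis-host:6379 \
  --elasticsearchAddress http://es-host:9200
```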
Elasticsearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying Elasticsearch on a remote server to improve performance.
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 120 GB of memory):
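A sketch with an illustrative heap size:

```bash
# give the JVM a large fixed heap on a machine with 120 GB of memory
export JAVA_OPTS="-Xms80g -Xmx80g"
datashare --mode CLI --stages INDEX --dataDir /path/to/documents
```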
If the document language is known, explicitly setting it can save processing time.
Use --language for general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
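A sketch (values are illustrative):

```bash
datashare --mode CLI --stages INDEX --dataDir /path/to/documents \
  --language FRENCH --ocrLanguage fra
```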
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
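A sketch:

```bash
datashare --mode CLI --stages SCAN,INDEX --dataDir /path/to/documents --ocr false
```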
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
This page describes how to create your neo4j graph and keep it up to date with your server's Datashare projects.
The neo4j-related features are added to the Datashare CLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/datashare/extensions, so the CLI will be run as follows:
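A hypothetical sketch (the jar paths and the main class name are assumptions to adapt to your install):

```bash
# extend the classpath with the neo4j extension jar before starting the CLI
java -cp "/home/datashare/dist/*:/home/datashare/extensions/datashare-extension-neo4j.jar" \
  org.icij.datashare.cli.DatashareCli --mode CLI
```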
In order to create the graph, run the --fullImport command for your project:
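Continuing the hypothetical sketch above (the project flag is an assumption):

```bash
java -cp "/home/datashare/dist/*:/home/datashare/extensions/datashare-extension-neo4j.jar" \
  org.icij.datashare.cli.DatashareCli --mode CLI \
  --fullImport --defaultProject my-project
```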
The CLI will display the import task progress and log import-related information.
When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.
To update the graph, you can just re-run the full import:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
To detect whether a graph update is needed, open the 'Projects' page and select your project:
compare the number of documents and entities found inside Datashare:
to the numbers found in the 'Graph statistics' and run an update in case of mismatch:
You can use several filters on the left of the main screen. Applied filters are reminded on the top of the results' column. You can also 'contextualize' and reset the filters.
On the left column, you can apply filters by ticking them, like 'Portable Document Format (PDF)' in File Types and 'English' in Languages in the example below:
A reminder of the currently applied filters, as well as your queried terms, are displayed at the top of the results' column. You can easily unselect these filters from there by clicking them or clear all of them:
The currently available filters are:
Projects: if you have more than one project, you can select several of them and run searches in multiple projects at once.
Starred: If you have starred documents, you can easily find them again.
Tags: If you wrote some tags, you will be able to select and search for them.
Recommended by: available only in server (collaborative) mode, this functionality helps you find the documents recommended by you and/or others.
File type: This is the 'Content type' of the file (Word, PDF, JPEG image, etc.) as you can read it in a document's 'Tags & Details'.
Creation dates: the calendar allows you to select a single creation date or a date range. This is when the document was created, as noted in its properties. You can find this in a document's 'Tags & Details'.
Languages: Datashare detects the main language of each document.
People / Organizations / Locations: you can select these named entities and search for them.
Path: This is the location of your documents as it is indicated in your original files (ex: desktop/importantdocuments/mypictures). You can find this in a document's 'Tags & Details'.
Indexing date: This date corresponds to when you indexed the documents in Datashare.
Extraction level: This regards embedded documents. The file on disk is level zero. If a document (pictures, etc) is attached or contained in a file on disk, extraction level is β1stβ. If a document is attached or contained in a document itself contained in a file on disk, extraction level is β2ndβ, etc.
Filters can be combined together and combined with searches in order to refine results.
If you have asked Datashare to 'Find people, organizations and locations', you can see names of individuals, organizations and locations in the filters. These are the named entities automatically detected by Datashare.
Search for named entities in the filter's search bar:
Select all of them, one or several of them to filter the documents that mention them:
If you want to select all items except one or several of them, you can use the 'Exclude button'.
It allows you to search for all documents which do not correspond to the filter(s) you selected, that is to say to the currently strikethrough filters.
In several filters, you can tick 'Contextualize': this will update the number of documents indicated in the filters in order to reflect the results. The filter will only count what you selected.
In the example below, the 'Contextualize' checkboxes are not ticked:
After the 'Contextualize' checkbox in the Tags filter is ticked:
After the 'Contextualize' checkbox in the Languages filter is ticked:
To reset all filters at the same time, click 'Clear all':
It allows you to get the results of each query of a list, all at once.
If you want to search a list of queries in Datashare, instead of doing each of them one by one, you can upload the list directly in Datashare. To do so, you will:
Create a list of terms that you want to search in the first column of a spreadsheet
Export the spreadsheet as a CSV (a special format available in any spreadsheet software)
Upload this CSV in the "new Batch Search" form in Datashare
Get the results for each query in Datashare - or in a CSV.
Write your queries, one per line and per cell, in the first column of a spreadsheet (Excel, Google Sheets, Numbers, Framacalc, etc.). In the example below, there are 4 queries:
Do not put line break(s) in any of your cells.
To delete line break(s) in your spreadsheet, you can use the "Find and replace all" functionality. Find all "\n" and replace them all by nothing or a space.
Write 2 characters minimum in the cells. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to 'failure'.
If you have blank cells in your spreadsheet...
...the CSV (which stands for 'Comma-separated values') will keep these blank cells. It will separate them with semicolons (the 'commas'). You will thus have semicolons in your batch search results (see screenshot below). To avoid that, you need to remove blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in "1,8 million" in our example above), the CSV will formally put the content of the cell in double quotes in your results and search for the exact phrase in double quotes.
For instance, if you want to search only in some documents with certain tag(s), you can write your queries like this: "Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)".
Please also note that searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
Export your spreadsheet in a CSV format like this:
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
Other spreadsheet softwares: please refer to each software's user guide.
Open Datashare, click 'Batch searches' in the left menu and click 'New batch search' on the top right:
Type a name for your batch search:
Upload your CSV:
Add a description (optional):
Set the advanced filters ('Do phrase matches', 'Fuzziness' or 'Proximity searches', 'File types' and 'Path') according to your preferences:
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness.
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
Click 'Add'. Your batch search will appear in the table of batch searches.
Open your batch search by clicking its name:
You see your results and you can sort them by clicking the column's name. 'Rank' means the order by which each query would be sorted if run in Datashare's main search bar. Results are thus sorted by relevance score by default.
You can click on a document's name and it will open it in a new tab:
You can filter your results by query and read how many documents there are for each query:
You can search for specific queries:
You can also download your results in a CSV format:
If you add more and more files in Datashare, you might want to relaunch an existing batch search on your new documents too.
Notes:
In the server collaborative mode, you can only relaunch your own batch searches, not others'.
The relaunched batch search will apply to your whole corpus, newly indexed documents and previously indexed documents (not only the newly indexed ones).
To do so, open the batch search that you'd like to relaunch and click 'Relaunch':
Edit the name and the description of your batch search if needed:
You can choose to delete the current batch search after relaunching it:
Note: if you're worried about losing your previous results because of an error, we recommend keeping your current batch search (turn off this toggle button) and deleting it only after the relaunch is a success.
Click 'Submit':
You can see your relaunched batch search running in the batch search's list:
Failures in batch searches can be due to several causes.
Click the 'See error' button to open the error window:
The first query containing an error makes the batch search fail and stop.
Check this first failure-generating query in the error window:
We recommend removing the slash, as well as any reserved characters, and re-running the batch search.
If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't reach Elasticsearch, its search engine.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
One of your queries can lead to a 'Data too large' error.
It means that this query had too many results, or that some documents among its results were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the error and restarting your batch search without it.
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators (see below).
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries, as some errors can remain after correcting the first one.
They are more likely to happen when 'do phrase matches' toggle button is turned off:
When 'Do phrase matches' is on, syntax errors can still happen though:
Here are the most common errors:
Open your batch search and click the trash icon:
Then click 'Yes':
You can also combine these with 'regular expressions' (Regex) between two slashes.
If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: ).
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: ).
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elasticsearch documentation).
β"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." ().
4. Important: Datashare relies on Elastic's Regex syntax, as explained in Elastic's documentation; it uses Lucene's regular expression engine. A consequence of this is that spaces cannot be searched as such in Regex.
emails: /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+/
You can find many other examples online. More generally, if you use a regex found on the internet, beware that the syntax is not necessarily compatible with Elasticsearch's. For example \d, \S and the like are not supported.
Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to Elasticsearch's documentation.
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even if we take care of getting the best performance possible with Apache Tika and Tesseract, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
Example to split Outlook PST files into multiple .eml files, for instance with readpst:
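A sketch (readpst is packaged as pst-utils on Debian/Ubuntu and libpst with Homebrew):

```bash
mkdir -p eml-output
# -e writes one .eml file per message, -o sets the output directory
readpst -e -o eml-output mailbox.pst
```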