This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.
About the local mode
In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines.
The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure.
This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
Install on Mac
These pages will help you set up and install Datashare on your computer.
Running modes
Datashare can run in different modes, each with its own features.
Mode | Category | Description
LOCAL | Web | To run Datashare on a single computer for a single user.
SERVER | Web | To run Datashare on a server for multiple users.
CLI | CLI | To index documents and analyze them directly from the command line.
TASK_RUNNER | Daemon | To execute async tasks (batch searches, batch downloads, scan, index, NER extraction, ...)
Web modes
There are two modes:
In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Comparison between modes
The running modes offer different advantages and limitations. This matrix summarizes the differences:

Feature | LOCAL | SERVER
Plugin UI | ✅ | ❌
Extension UI | ✅ | ❌
HTTP API | ✅ | ✅
API Key | ✅ | ✅
Single JVM | ✅ | ❌
Tasks execution | ✅ | ❌
When running Datashare in local mode, users can choose to use embedded services (like Elasticsearch, SQLite or an in-memory key/value store) running in the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
CLI mode
In CLI mode, Datashare starts without a web server and allows users to perform tasks over their documents. This mode can be used in conjunction with both local and server modes, allowing users to distribute heavy tasks between several servers.
If you want to learn more about which tasks you can execute in this mode, check out the CLI stages section.
Daemon modes
These modes are intended for actions that require waiting for pending tasks.
In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search, and bundles them into a zip file.
In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form on Datashare where they can upload a list of search terms (in CSV format). The daemon will then start a task to search all matching documents and store every occurrence in the database.
How to change modes
Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. Yet when starting Datashare from the command line, you can explicitly specify the running mode. For instance, on Ubuntu/Debian:
# --mode SERVER: switch to SERVER mode
# --authFilter ...YesCookieAuthFilter: dummy session filter to create ephemeral users
# --defaultProject local-datashare: name of the default project for every user
# --elasticsearchAddress: URI of Elasticsearch
# --redisAddress: URI of Redis
# --sessionStoreType REDIS: store user sessions in Redis
datashare \
  --mode SERVER \
  --authFilter org.icij.datashare.session.YesCookieAuthFilter \
  --defaultProject local-datashare \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379 \
  --sessionStoreType REDIS
Start Datashare
Find the Datashare application on your computer and run it locally on your browser.
Once Datashare is installed, go to 'Finder' > 'Applications', and double-click on 'Datashare':
A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening:
⇒ Important: Keep this Terminal window open as long as you use Datashare.
Once the process is done, Datashare should automatically open in your default internet browser. If it doesn't, type 'localhost:8080' as a URL in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
Install on Windows
These pages will help you set up and install Datashare on your computer.
When running Datashare from the command line, pick which 'stage' to apply to analyze your documents.
The CLI stages are primarily intended to be run against an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.
1. SCAN
This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.
# --stage SCAN: select the SCAN stage
# --dataDir: where the documents are located
# --dataBusType REDIS: store the queued files in Redis
# --redisAddress: URI of Redis
datashare --mode CLI \
  --stage SCAN \
  --dataDir /path/to/documents \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379
2. INDEX
The INDEX stage is probably the most important (and heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract OCR to extract text, metadata and OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command on several servers.
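For example:
# --stage INDEX: select the INDEX stage
# --dataDir: where the documents are located
# --dataBusType REDIS: pull the queued files from Redis
# --elasticsearchAddress: URI of Elasticsearch
# --ocr true: enable OCR
# --redisAddress: URI of Redis
datashare --mode CLI \
  --stage INDEX \
  --dataDir /path/to/documents \
  --dataBusType REDIS \
  --elasticsearchAddress http://elasticsearch:9200 \
  --ocr true \
  --redisAddress redis://redis:6379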
3. NLP
Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entity mentions in Elasticsearch, it will also mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized on several servers.
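For example:
# --stage NLP: select the NLP stage
# --nlpp CORENLP: use CoreNLP to detect named entities
# --elasticsearchAddress: URI of Elasticsearch
datashare --mode CLI \
  --stage NLP \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200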
Add documents to Datashare
Datashare provides a folder on your Mac to collect documents you want to have in Datashare.
1
Find your Datashare folder on your Mac
Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':
On the menu bar at the top of your computer, click 'Go' and 'Home' (the house icon):
You will see a folder called 'Datashare':
If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:
2
Add documents to your Datashare folder on your Mac
Copy or drop the documents that you want to add to Datashare in this Datashare folder.
3
Launch Datashare
Open your Applications. You should see Datashare. Double-click on it:
4
In the menu, in 'Tasks', open 'Documents'
Expand the menu on the left:
In 'Tasks', open 'Documents':
5
Choose your options
Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.
6
Watch the progress of your document addition
Two extraction tasks are now running:
The first is the scanning of your Datashare folder: it checks whether there are documents to analyze. The second is the indexing of these files.
You can now start searching your documents.
About Datashare
Datashare allows you to search in your files, regardless of their format. It is a free open-source software developed by the International Consortium of Investigative Journalists (ICIJ).
With the help of several open-source tools (Extract, Apache Tika, Tesseract OCR, CoreNLP, Elasticsearch and more), Datashare can be used on one single personal computer, as well as on 100 interconnected servers.
Who uses it?
Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested in investigations like the Panama Papers or the Paradise Papers.
Seeing the growing interest for ICIJ's technology, we decided to open source this key component of our investigations so a single journalist as well as big media organizations could use it for their own documents.
Datashare is free, so anyone who finds it useful can use it.
Curious to know more about how we use Datashare?
Where can I see Datashare in action?
We set up a demo instance with a small set of documents from the Luxembourg Leaks investigation (2014). When using this instance, you will be assigned a temporary user which can star, tag, recommend and explore documents.
Can I run Datashare on my server?
Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server mode documentation to know how it works.
Can I customize Datashare?
When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non-tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly.
We implemented the possibility to create plugins, to make this process more accessible. Instead of modifying Datashare directly, you could isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the plugins documentation.
In which languages is Datashare available?
This project is currently available in English, French and Spanish. You can help improve and complete the translations.
Install Datashare
The installer will take care of checking that your system has all the dependencies needed to run Datashare. Because this software uses Apache Tesseract (to perform Optical Character Recognition, OCR) and macOS doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.
The installer will set up:
Xcode Command Line Tools (if neither Xcode nor the Command Line Tools are installed)
MacPorts (if neither MacPorts nor Homebrew is installed)
Tesseract OCR
Java Runtime Environment
Datashare
On the top right, click the 'Plus' button:
Click the 'Plus' button
Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.
Choose the language of your documents if you don't want Datashare to guess it automatically.
Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.
Skip already indexed documents if you'd like.
Click 'Add'
Form for adding documents
The first is the scanning of your Datashare folder - it checks whether there are documents to analyze. It is called 'Scan folders'.
The second is the indexing of these files. It is called 'Index documents'.
Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.
But you can start searching in your documents without having to wait for all tasks to be done.
Note: Previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore but Datashare is still published on the Docker Hub and supported with Docker.
Installation fails:
Error while installing Homebrew or MacPorts: you can manually install Homebrew first and then restart the installer.
"System Software from application was blocked from loading" : Check in your Mac's "System Settings" > "privacy & security" if you have a section with this mention "System software from application 'Datashare' was blocked from loading" or something similar related to Datashare. If you have this section you'll have to click "Allow" to be able to install datashare.
For any other issue, check our GitHub issues or create a new one with your setup (macOS version) and installer logs (press Command+L when the installer is launched and has failed).
Find the application on your computer and run it locally in your browser.
Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)
A window called 'Terminal' opens, showing the progress of opening Datashare. Do not close this black window as long as you use Datashare.
Keep this Terminal window open as long as you use Datashare.
Datashare should now automatically open in your default internet browser.
If it doesn’t, type 'localhost:8080' in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
You can now add documents to Datashare.
Add documents to Datashare
Datashare provides a folder to collect documents on your computer to index in Datashare.
1
Add documents in 'Datashare Data' folder
When you open your desktop in Windows on your computer, you will see a folder called 'Datashare Data'.
Move or copy and paste the documents you want to add to Datashare to this folder:
2
Launch Datashare
You will find it in your main menu:
3
In the menu, in 'Tasks', open 'Documents'
Expand the menu on the left:
In 'Tasks', open 'Documents':
4
Choose your options
Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.
5
Watch the progress of your document addition
Two extraction tasks are now running:
The first is the scanning of your 'Datashare Data' folder: it checks whether there are documents to analyze. The second is the indexing of these files.
You can now start searching your documents.
Add documents to Datashare
Datashare provides a folder to collect documents on your computer to index in Datashare.
1
Add documents to your 'Datashare' folder
You can find a folder called 'Datashare' in your home directory.
Move the documents you want to add to Datashare into this folder.
2
Launch Datashare
Launch Datashare and see the interface opening in your default browser.
3
In the menu, in 'Tasks', open 'Documents'
Expand the menu on the left:
In 'Tasks', open 'Documents':
4
Choose your options
Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.
5
Watch the progress of your document addition
Two extraction tasks are now running:
The first is the scanning of your 'Datashare' folder: it checks whether there are documents to analyze. The second is the indexing of these files.
You can now start searching your documents.
Start Datashare
Find the application on your computer and run it locally on your browser.
Start Datashare by launching it from the command-line:
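For example, assuming the datashare executable is on your PATH:
datashare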
Datashare should now automatically open in your default internet browser. If it doesn't, type 'localhost:8080' in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
It's now time to add documents to Datashare.
Install plugins and extensions
This page explains how to locally add plugins and extensions to Datashare.
Plugins are front-end modules to add new features in Datashare's user interface.
Extensions are back-end modules to add new features to store and manipulate data with Datashare.
Add plugins to Datashare's front-end
1
Install with Docker
This page will help you set up and install Datashare with Docker.
Prerequisites
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Add more languages
This page explains how to install language packages to support Optical Character Recognition (OCR) on more languages.
To be able to perform OCR, Datashare uses an open-source technology called Apache Tesseract. When Tesseract extracts text from images, it uses 'language packages' specially trained for each specific language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare, the installer doesn't install them all by default. If Datashare informs you of a missing package, this guide explains how to manually install it on your system.
Install packages on Linux
To add OCR languages on Linux, simply use the following command:
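# on Debian/Ubuntu (apt-based distributions),
# replace <lang> with a Tesseract language code, e.g. fra for French
sudo apt install tesseract-ocr-<lang>
# for example:
sudo apt install tesseract-ocr-fra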
Find entities
This page helps you find entities (people, organizations, locations, e-mail addresses) in your documents.
Prerequisite: Your documents must be added to Datashare. Check how for Mac, Windows and Linux.
To start Datashare within a Docker container, you can use this command:
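Here is a minimal sketch, assuming the icij/datashare image from the Docker Hub (adapt the tag to the version you want):
docker run -p 8080:8080 \
  --mount src="$HOME/Datashare",target=/home/datashare/Datashare,type=bind \
  icij/datashare:latest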
Make sure the Datashare folder exists in your home directory or this command will fail. This is an example of how to use Datashare with Docker; data will not be persisted.
Starting Datashare with multiple containers
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).
By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store, will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare with Docker Compose, you can use the following docker-compose.yml file:
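Below is a minimal sketch of such a file, assuming the icij/datashare image and an Elasticsearch 7.x single node; image tags, credentials and volumes are illustrative and should be adapted to your setup:

version: "3.7"
services:
  datashare:
    image: icij/datashare:latest
    command: >
      --mode LOCAL
      --dataDir /home/datashare/Datashare
      --elasticsearchAddress http://elasticsearch:9200
      --redisAddress redis://redis:6379
      --dataSourceUrl jdbc:postgresql://postgres/datashare?user=dstest&password=test
    ports:
      - "8080:8080"
    volumes:
      - ${HOME}/Datashare:/home/datashare/Datashare
    depends_on:
      - elasticsearch
      - redis
      - postgres
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
    environment:
      - discovery.type=single-node
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
  redis:
    image: redis:6
    volumes:
      - redis-data:/data
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=dstest
      - POSTGRES_PASSWORD=test
      - POSTGRES_DB=datashare
    volumes:
      - postgres-data:/var/lib/postgresql/data
volumes:
  elasticsearch-data:
  redis-data:
  postgres-data: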
Apple Silicon (M1/M2/M3) users:
If you encounter the error Error response from daemon: no matching manifest for linux/arm64/v8 in the manifest list entries when pulling the redis Docker image, add the following line to the redis service in your docker-compose.yml:
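    platform: linux/amd64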
This forces Docker to use the x86_64 image, which is necessary because some Redis images do not provide ARM64 builds.
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
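docker compose up -d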
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
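docker compose logs -f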
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
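docker compose down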
This will stop and remove the containers, freeing up system resources.
The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing Xcode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.
You can see what it actually does by typing Command+L: it will open a window which logs every action performed.
Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.
Choose the language of your documents if you don't want Datashare to guess it automatically.
Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.
Skip already indexed documents if you'd like.
Click 'Add'
Form for adding documents
The first is the scanning of your Datashare folder - it checks whether there are documents to analyze. It is called 'ScanTask'.
The second is the indexing of these files. It is called 'IndexTask'.
Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.
But you can start searching in your documents without having to wait for all tasks to be done.
Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.
Choose the language of your documents if you don't want Datashare to guess it automatically.
Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.
Skip already indexed documents if you'd like.
Click 'Add'
Form for adding documents
The first is the scanning of your Datashare folder - it checks whether there are documents to analyze. It is called 'ScanTask'.
The second is the indexing of these files. It is called 'IndexTask'.
Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.
But you can start searching in your documents without having to wait for all tasks to be done.
Language packages are named after a language code (e.g. fra for French); the full list of language codes is available in the Tesseract documentation.
Install packages on Mac
The Datashare installer for Mac checks for the existence of either MacPorts or Homebrew, the package managers used to install Tesseract. If neither of those two package managers is present, the Datashare installer will install MacPorts by default.
With MacPorts (default)
First, you must check that MacPorts is installed on your computer. Please run in a Terminal:
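port version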
You should see an output similar to this:
If you get a command not found: port, this either means you are using Homebrew (see next section) or you did not run the Datashare installer for Mac yet.
If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):
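sudo port install tesseract-deu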
The full list of supported language packages can be found on MacPorts website.
Once the installation is done, close and restart Datashare to be able to use the newly installed packages.
With Homebrew
If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!
If you want to check if Homebrew is installed, run the following command in a Terminal:
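brew --version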
You should see an output similar to this:
If you get a command not found: brew error, this means Homebrew is not installed on your system. You are either using MacPorts (see previous section) or you have not run the Datashare installer for Mac on your computer yet.
Install languages on Windows
Language packages are available in the Tesseract GitHub repository. Trained data files have to be downloaded and added to the 'tessdata' folder inside Tesseract's installation folder.
Additional languages can also be added during Tesseract's installation.
Download and add French into tessdata
The list of installed languages can be checked in the Windows command prompt or PowerShell with the command tesseract --list-langs.
French is listed in installed languages
Datashare has to be restarted after the language installation. Check how for Mac, Windows and Linux.
2
In the menu or on the top right, click the 'Plus' button, or click 'Find entities' on the page:
3
Select your options
Select a project where you want to find entities
Choose between finding names of people, organizations and locations, or finding email addresses. You cannot do both simultaneously: run one after the other, in any order.
Choose a Natural Language Processing model, that is to say the software which will run the entity recognition. More models can also be added.
4
In 'Tasks' > 'Entities', watch the progress of your entity recognition:
Once they are done, you can click 'Delete done tasks' to stop displaying tasks that are completed.
5
Explore your entities in the documents
You can now start searching your entities in the documents without having to wait for all tasks to be done.
In the menu, click 'Search' > 'Documents' and open the 'Entities' tab of your documents or use the Entities filters.
1. At the bottom of the menu, click on the 'Settings' icon:
2. Make sure the following settings are properly set:
Neo4j Host should be localhost or the address where your Neo4j instance is running
Neo4j Port should be the port where your Neo4j instance is running (7687 by default)
3. When running Neo4j Community Edition, set the 'Neo4j Single Project' value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set 'Neo4j Single Project' to the name of the project which will use the Neo4j plugin. Other projects won't be able to use the Neo4j plugin.
4. Restart Datashare to apply the changes. Check how for Mac, Windows or Linux.
5. Go to 'Projects' > your project's page > the Graph tab. You should see the Neo4j widget. After a little while, its status should be RUNNING:
You can now create the graph.
About the server mode
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers.
This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Launch configuration
Datashare is launched with --mode SERVER and you have to provide:
The external Elasticsearch index address (elasticsearchAddress)
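For example, a minimal sketch:
datashare \
  --mode SERVER \
  --elasticsearchAddress http://elasticsearch:9200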
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles or permissions in Datashare's data model, there is no way to differentiate users in order to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.
Here is a simple command to scan a directory and index its files:
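A sketch of such a command, matching the explanation below:
datashare --mode CLI \
  --stage SCAN,INDEX \
  --dataDir /home/datashare/Datashare/ \
  --elasticsearchAddress http://elasticsearch:9200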
What's happening here:
Datashare starts in "CLI"
We ask to process both SCAN and INDEX at the same time
The SCAN stage feeds a queue in memory with file to add
The INDEX stage pulls files from the queue to add them to ElasticSearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine
Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:
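First, the SCAN stage with the queue stored in Redis:
datashare --mode CLI \
  --stage SCAN \
  --dataDir /home/datashare/Datashare/ \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379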
Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in datashare:queue:
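redis-cli lrange datashare:queue 0 19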
The INDEX stage can now be executed in the same container:
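datashare --mode CLI \
  --stage INDEX \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200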
Once the indexing is done, Datashare will exit gracefully and your documents will be visible in Datashare.
Sometimes you will face the case where you have an existing index and you want to index additional documents inside your working directory without processing every document again. It can be done in two steps, as sketched below:
Scan the existing Elasticsearch index and gather document paths to store them inside a report queue
Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths it contains will be skipped
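A sketch of those two steps; the SCANIDX stage and the --reportName flag are assumptions about the Datashare CLI, so check datashare --help to confirm them for your version:
# 1. Build a report of the paths already present in the index
datashare --mode CLI \
  --stage SCANIDX \
  --reportName extract:report \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379
# 2. Scan and index (with OCR), skipping the paths recorded in the report
datashare --mode CLI \
  --stage SCAN,INDEX \
  --dataDir /home/datashare/Datashare/ \
  --reportName extract:report \
  --ocr true \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379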
Neo4j
This page explains how to setup Neo4j, install the Neo4j plugin and create a graph on your computer.
Prerequisites
Get Neo4j up and running
Follow the instructions of the Neo4j documentation to get Neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click on the 'Other platforms and versions' button when downloading to access other versions if necessary.
Add entities
If it's not done yet, find entities to extract names of people, organizations and locations, as well as email addresses.
If your project contains emails, make sure to also extract email addresses.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles or permissions in Datashare's data model, there is no way to differentiate users in order to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses, names of people, organizations and locations. This process uses a Natural Language Processing (NLP) pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entity extraction in the same fashion as the previous CLI stages:
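datashare --mode CLI \
  --stage NLP \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379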
What's happening here:
Datashare starts in "CLI"
We ask to process the NLP
We tell Datashare to use the elasticsearch service
Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
The first time you run this command you will have to wait a little, because Datashare needs to download CoreNLP's models, which can be large.
You can also chain the three stages together:
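datashare --mode CLI \
  --stage SCAN,INDEX,NLP \
  --dataDir /home/datashare/Datashare/ \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379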
As with the previous stages, you may want to rebuild the output queue from the INDEX stage. You can do:
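datashare --mode CLI \
  --stage ENQUEUEIDX,NLP \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379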
The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the IDs of those documents into the extract:queue:nlp queue.
Install with Docker
This page explains how to start Datashare within Docker in server mode.
Prerequisites
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file for version 20.1.4 (check the latest version on Docker Hub):
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
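docker compose up -d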
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
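docker compose logs -f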
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
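docker compose down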
This will stop and remove the containers, freeing up system resources.
Add documents to Datashare
If you reach that point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: add documents to Datashare.
Extract named entities
Datashare has the ability to detect email addresses, names of people, organizations and locations. You must perform the named entity extraction in the same fashion as the previous commands.
Dummy
Dummy authentication provider to disable authentication
You can have a dummy authentication that always accepts basic auth. So you should see this popup:
Then, whatever user or password you type, you will enter Datashare.
Example
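A minimal sketch, reusing the dummy filter shown elsewhere in this documentation:
datashare \
  --mode SERVER \
  --authFilter org.icij.datashare.session.YesBasicAuthFilter \
  --elasticsearchAddress http://elasticsearch:9200 \
  --redisAddress redis://redis:6379 \
  --sessionStoreType REDIS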
Basic with a database
Basic authentication with a database.
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
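Authorization: Basic dXNlcjpwYXNzd29yZA==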
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a database user inventory. You can launch datashare first with the full database URL, then Datashare will automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.
Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
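$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -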
Authentication providers
Authentication with Datashare in server mode is the most impactful choice that has to be made. It can be one of the following:
Basic authentication with credentials stored in database (PostgreSQL)
Basic authentication with credentials stored in Redis
Create and update Neo4j graph
This page describes how to create your Neo4j graph and keep it up to date with your computer's Datashare projects.
Create the graph
Go to 'All projects' and click on your project's name:
Install Neo4j plugin
Install the Neo4j plugin
Install the Neo4j plugin using the Datashare CLI so that users can access it from the frontend.
Installing the plugin installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and also installs the datashare-extension-neo4j backend extension inside the extensions directory.
Basic with Redis
Basic authentication with Redis
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
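Authorization: Basic dXNlcjpwYXNzd29yZA==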
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded. For example, using bash:
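$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -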
If you choose a different Neo4j user or set a password for your Neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.
When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In community edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set the DS_DOCKER_NEO4J_SINGLE_PROJECT with the name of the project which will use Neo4j plugin. Other projects won't be able to use the Neo4j plugin.
Restart Datashare
After installing the plugin a restart might be needed for the plugin to display:
...
services:
datashare_web:
...
environment:
- DS_DOCKER_NEO4J_HOST=neo4j
- DS_DOCKER_NEO4J_PORT=7687
- DS_DOCKER_NEO4J_SINGLE_PROJECT=secret-project # This is for community edition only
docker compose restart datashare_web
Then you can insert the user like this in your database:
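A sketch, assuming the user_inventory table created by Datashare's schema migration and the database URL used elsewhere in this documentation; column names and the JSON layout may differ between versions, so check your actual schema first:
psql postgresql://dstest:test@postgres/datashare <<'SQL'
insert into user_inventory (id, email, name, provider, details)
values ('jdoe', 'jdoe@example.org', 'Jane Doe', 'local',
        '{"uid": "jdoe", "password": "<sha256 of the password>", "group_by_applications": {"datashare": ["local-datashare"]}}');
SQL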
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example if you use myindex:
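For example, using the same JSON layout:
"group_by_applications": {"datashare": ["local-datashare", "myindex"]}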
Or you can use a PostgreSQL COPY statement to import a CSV if you want to create them all at once.
Then when accessing Datashare, you should see this popup:
basic auth popup
Example
Here is an example of launching Datashare with Docker and the basic auth provider filter backed by a database:
Go to the Graph tab and in the first step 'Import', click on the 'Import' button:
You will then see a new import task running.
When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:
Update the graph
If new documents or entities are added or modified in Datashare, you will need to update the Neo4j graph to reflect these changes.
Go to 'All projects' > one project's page > the 'Graph' tab. In the first step, click on the 'Update graph' button:
To detect whether a graph update is needed, go to the 'Projects' page and open your project:
Open your project
Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...
Statistics of one project
...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example if you use myindex:
Then you should see this popup:
basic auth popup
Example
Here is an example of launching Datashare with Docker and the basic auth provider filter backed by Redis:
docker run -ti icij/datashare -m SERVER \
--dataDir /home/dev/data \
--batchQueueType REDIS \
--dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=dstest&password=test'\
--sessionStoreType REDIS \
--authFilter org.icij.datashare.session.YesBasicAuthFilter
basic auth popup
Create and update Neo4j graph
This page describes how to create your Neo4j graph and keep it up to date with your server's Datashare projects.
Run the Neo4j extension CLI
The Neo4j related features are added to the DatashareCLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/.local/share/datashare/extensions/*, so the CLI will be run as follows:
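A sketch, assuming the jar is in the default location (the --ext flag is the one used elsewhere in this documentation):
CLASSPATH="/home/.local/share/datashare/extensions/*" datashare \
  --mode CLI \
  --ext neo4j \
  ...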
Create the graph
In order to create the graph, run the --fullImport command for your project:
The CLI will display the import task progress and log import related information.
Update the graph
When new documents or entities are added or modified inside Datashare, you will need to update the Neo4j graph to reflect these changes.
To update the graph, you can just re-run the full export:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
To detect whether a graph update is needed, go to the 'Projects' page and open your project:
Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...
...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:
You can now explore the graph using your favorite visualization tool.
Neo4j
This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your server.
Prerequisites
Get Neo4j up and running
Follow the instructions of the Neo4j documentation to get Neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click on the 'All platforms and versions' button when downloading to access other versions if necessary.
Add entities
If it's not done yet, add entities to your project.
If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the nlpp flag to --nlpp CORENLP,EMAIL.
Next step
You can now install the Neo4j plugin.
Search projects
Projects are collections of documents. Datashare displays statistics about each project.
Expand the menu to go to 'Projects' > 'All projects':
Search in projects' names using the search bar on the right:
Sort your projects by clicking the top right Settings icon:
In the Page settings, choose a sort by option, change the number of projects per page or the layout:
To explore a project, close the Settings and click on the name of the project:
You can now search your documents.
Search documents
Search with the main search bar and configure settings to display your search's results.
You must have added documents in Datashare before. Check how for Mac, Windows and Linux.
Search bar
Expand the menu to go to 'Search' > 'Documents':
Make room by closing the menu:
Type terms in the search bar and press Enter:
Default operator is OR
If you type several terms separated by space, as the default operator is OR, Datashare will search for all documents containing at least one of the searched terms.
For instance, Datashare finds documents containing either 'ikea' or 'paris' or both terms here:
Linked entities
As you type a term, Datashare suggests linked entities - only if a task to find entities in this project was completed.
Press Esc on your keyboard to close the dropdown or click on one of the entities to replace your term in the search bar:
Search in a field
Search within a specific field only, by using the dropdown 'All fields':
Search breadcrumb
To see your queries in the search breadcrumb, click on the icon on the left of the search bar:
If you'd like to remove all searched terms from the search bar, click 'Clear query':
Results settings
To change the page settings, click the Settings icon on the top right:
You can change Sort by, Documents per page, Layout and also Properties:
Ticking these properties changes which document metadata are displayed in the results, in the document cards, in all 3 layouts (List, Grid, Table):
You can now make your search more precise with operators.
OAuth2
OAuth2 authentication with a third-party id service
This is the default authentication mode: if no auth filter is provided on the CLI, it will be selected. With OAuth2 you will need a third-party authorization service. The diagram below describes the workflow:
We made a small demo to show how it could be set up.
Keyboard shortcuts
Shortcuts help do some actions faster.
Open the menu > 'Search' > 'Documents' and click the keyboard icon at the bottom of the menu:
It opens a window with the shortcuts for your OS (Mac, Windows, Linux):
Click on 'See all shortcuts' to reach the full page view:
Search with operators or Regex
To make your searches more precise, use operators in the main search bar.
Double quotes for exact phrase
To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).
"Alicia Martinez’s bank account in Portugal"
Create a Neo4j graph and explore it
This page explains how to leverage Neo4j to explore your Datashare projects.
Prerequisites
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. To download a specific version, click on 'All platforms and versions'.
If you are not familiar with graphs and Neo4j, take a look at the following resources:
Explore a project
A project is a collection of documents. Datashare displays statistics about each project.
Expand the menu, open 'All projects' and click on the name of the project that you want to explore:
If you'd like to pin this project in the menu for easy access, click 'Pin to menu':
Your project is now pinned in the menu:
In a project page, in the first tab called 'Insights', you find statistics and a bar chart about the project's documents.
Performance considerations
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even if we take care of getting the best performance possible, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Separate Processing Stages
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
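A sketch of the two stages run separately, reusing the flags introduced earlier in this documentation (run each stage on whichever server suits it best):
# Stage 1: scan the documents directory and queue files in Redis
datashare --mode CLI \
  --stage SCAN \
  --dataDir /path/to/documents \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379
# Stage 2, possibly on another server: index the queued files
datashare --mode CLI \
  --stage INDEX \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200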
Filter documents
Filters are on the left of the search bar. You can contextualize, exclude and reset them. Active filters are displayed in the search breadcrumb.
Filters
Open 'Filters' on the left of the search bar:
'Indexing dates' are the dates when the documents were added to Datashare.
Star, tag and recommend
Star documents, tag them or, in server mode, recommend them to the project's other members.
Star documents
In server collaborative mode, starring documents is private. Other members of your projects can't see your starred documents.
# If you are not using the default extensions directory, you have to
# specify it by extending the CLASSPATH variable, e.g.:
#   -e CLASSPATH=/home/datashare/extensions/*
docker compose exec \
  datashare_web /entrypoint.sh \
  --mode CLI \
  --ext neo4j \
  ...
👷‍♀️ This page is currently being written by the Datashare team.
FAQ
👷‍♀️ This page is currently being written by the Datashare team.
Do you recommend an OS or specific machines for large corpuses?
Datashare was created with scalability in mind which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use Apache Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
Can I use Datashare with no internet connection?
You need an internet connection to install Datashare.
You also need an internet connection to find people, organizations and locations in documents the first time you use any new NLP option, because the models which find these named entities are downloaded the first time you ask for them. After that, you don't need an internet connection to find named entities.
You don't need internet connection to:
Add documents to Datashare
Find named entities (except for the first time you use a new NLP option - see above)
Search and explore documents
Download documents
This allows you to work safely on your documents. No third-party should be able to intercept your data and files while you're working offline on your computer.
To have all documents mentioning at least one of the queried terms, you can use a simple space between your terms (as OR is the default operator in Datashare) or the operator OR. You need to write OR with all letters uppercase.
Alicia Martinez
Alicia OR Martinez
AND (or +)
To have all documents mentioning all the queried terms, you can use AND between your queried words. You need to write AND with all letters uppercase.
Alicia AND Martinez
+Alicia +Martinez
NOT (or ! or -)
To have all documents NOT mentioning some queried terms, you can use NOT before each word you don't want. You need to write NOT with all letters uppercase.
NOT Martinez
!Martinez
-Martinez
Combine operators
Parentheses should be used whenever multiple operators are used together and you want to give priority to some.
((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"
You can also combine these with regular expressions (regex) between two slashes (see below).
Wildcards
If you search faithf?l, the search engine will look for all words with any possible single character between the second f and the l in this word. It also works with * to replace multiple characters.
Alicia Martin?z
Alicia Mar*z
Fuzziness
You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
Proximity searches
When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
"fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elastic).
Boosting operators
Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:
quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:
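For example, boosting a phrase and a group:
"Alicia Martinez"^2 (Delaware Pekin)^4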
"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." (Wikipedia).
1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between 2 slashes and starting with the field (content, name, author, recipients, etc):
content: /.*..*@.*..*/
The example above will search in the content of the document for any expression which is structured like an email address with a dot between two expressions before the @ and a dot between two expressions after the @ like in 'first.lastname@email.com' for instance.
2. Regex can be combined with standard queries in Datashare:
("Ada Lovelace" OR "Ado Lavelace") AND paris AND content:/.*..*@.*..*/
3. You need to escape the following characters by typing a backslash just before them (without space): . ? + * | { } [ ] ( ) " \ # @ & < > ~
/.*..*\@.*..*/ (the @ was escaped by a backslash \ just before it)
4. Important: Datashare relies on Elastic's Regex syntax, as explained in the Elasticsearch documentation. Datashare uses the Standard tokenizer. A consequence of this is that spaces cannot be searched as such in Regex.
We encourage you to use the AND operator to work around this limitation and make sure you can make your search.
If you're looking for French International Bank Account Numbers (IBAN) that may or may not contain spaces and contain FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:
/FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)
Here are a few examples of useful Regex:
You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.
You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with either y nor u.
You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.
Other common Regex examples:
phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/
You can find many other examples online. More generally, if you use a regex found on the internet, beware that the syntax is not necessarily compatible with Elasticsearch's. For example \d, \S and the like are not understood.
Search with metadata fields
1
In 'Search' > 'Documents', open a document and go to the 'Metadata' tab:
2
Click a metadata field's search icon to search documents with the same properties:
3
See the query in the main search bar. It contains the field name, a colon, and the searched value:
So for example, if you are looking for documents that:
contain term1, term2 and term3
And were created after 2010
you can use the 'Date' filter or type in the search bar:
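The exact field name depends on your documents' metadata; copy it from the 'Metadata' tab as shown above. A hypothetical example:
term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01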
Neo4j is a graph database technology which lets you represent your data as a graph.
Inside Datashare, Neo4j lets you connect entities between them through documents in which they appear.
After creating a graph from your Datashare project, you will be able to explore this graph and visualize these kinds of relationships between your project entities:
In the above graph, we can see 3 e-mail document nodes in orange, 3 e-mail address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:
shapp@caiso.com emailed 20participants@caiso.com, the sent email has an ID starting with f4db344...
One person named vincent is mentioned inside this email, as well as the california location
Finally, the e-mail also mentions the dle@caiso.com e-mail address which is also mentioned in 2 other e-mail documents (with ID starting with 11df197... and 033b4a2...)
Graph nodes
The Neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.
The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...
Graph relationships
In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of e-mail documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the e-mail metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the e-mail.
When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the e-mail document body.
When a document is embedded inside another document (as an e-mail attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.
Create your Datashare project's graph
The creation of a Neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:
After the graph is created, open the menu, go to the 'Projects' page, select your project and go to the Graph tab.
You should be able to visualize a new Neo4j widget displaying the number of documents and entities found inside the graph:
Access your project's graph
Depending on your access to the Neo4j database behind Datashare, you might need to export the Neo4j graph and import it locally to access it from visualization tools.
Exporting and importing the graph into your own database is also useful when you want to perform write operations on your graph without any consequences on Datashare.
With read access to Datashare's Neo4j database
If you have read access to the Neo4j database (it should be the case if you are running Datashare on your computer), you will be able to plug visualization tools to it and start exploring.
Without read access to Datashare's Neo4j database
If you can't have read access to the database, you will need to export it and import it into your own Neo4j instance (running on your laptop for instance).
In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare itself. Be aware that limits might apply to the size of the exported graph.
To export the graph, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Cypher shell' export format and at the end of the form, click the 'Export' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3 ('Filters'), the 'Paths' and 'File types' filters.
Copy the graph dump inside your Neo4j container import directory:
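For instance, assuming the export file is named graph.dump and the default import directory of the official Neo4j image (adjust names and paths to your setup):
docker ps | grep neo4j # Should display your running neo4j container ID
docker cp graph.dump <neo4j-container-id>:/var/lib/neo4j/import/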
Import the dumped file using the cypher-shell command:
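A minimal sketch, assuming the dump was copied as above and the default neo4j user:
docker exec -it <neo4j-container-id> cypher-shell -u neo4j -p <password> -f /var/lib/neo4j/import/graph.dump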
Neo4j Desktop import
Open 'Cypher shell':
Copy the graph dump inside your Neo4j instance import directory:
Import the dumped file using the cypher-shell command:
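A minimal sketch, assuming a dump named graph.dump in the instance's import directory (add -a <server.bolt.listen_address> if the installer changed the default ports):
cypher-shell -u neo4j -p <password> -f import/graph.dump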
You will now be able to explore the graph imported in your own Neo4j instance.
Explore and visualize entity links
Once your graph is created and you can access it (see this section if you can't access Datashare's Neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.
Neo4j Bloom is a simple and powerful tool developed by Neo4j to quickly visualize and query graphs, available if you run Neo4j Enterprise Edition. Bloom lets you navigate and explore the graph through a user interface similar to the one below:
Neo4j Bloom is accessible from inside Neo4j Desktop app.
Find out more information about how to use Neo4j Bloom to explore your graph with:
The Neo4j Browser lets you run Cypher queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs; running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:
The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:
Inside the Neo4j Desktop app when running Neo4j from the Desktop app
Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare into the GraphML File Format and import them into Gephi.
To export the graph in the GraphML file format, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Graph ML' export format and at the end of the form, click the 'Export' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3 ('Filters'), the 'Paths' and 'File types' filters.
Filter this chart by path by clicking 'Select path':
Click on one bar for a year or month to see all the corresponding documents:
On the 'Languages', 'File Types' and 'Authors' widgets, you see stats:
Search all documents matching a specific criterion, for instance here the French language:
Finally, in the server collaborative mode, you see the Latest recommended documents, that is to say the documents marked as recommended by other members of the project:
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
Leverage Parallelism
Datashare offers --parallelism and --parserParallelism options to enhance processing speed.
Example (for g4dn.8xlarge with 32 CPUs):
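A minimal sketch; the right values depend on your machine and document formats (these numbers are assumptions, not ICIJ's exact settings):
datashare --mode CLI --stage INDEX --parallelism 32 --parserParallelism 8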
Optimize ElasticSearch
ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.
Adjust JAVA_OPTS
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 128 GB of memory):
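JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX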
Specify Document Language
If the document language is known, explicitly setting it can save processing time.
Use --language for general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
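datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell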
Manage OCR Tasks Wisely
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
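datashare --mode CLI --stage INDEX --ocr false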
Efficient Handling of Large Files
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
Example of splitting Outlook PST files into multiple .eml files with readpst:
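A minimal sketch, assuming an archive.pst file and readpst from the pst-utils package (file names are illustrative):
readpst -e -o ./eml-output archive.pst # -e writes each message to its own .eml file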
If a document is attached to (or contained in) a file on disk, its extraction level is '1st'
If a document is attached to (or contained in) a document itself contained in a file on disk, its extraction level is '2nd'
And so on
Filter by entities
If you asked Datashare to 'Find entities' and the task was complete, you will see names of people, organizations, locations and e-mail addresses in the filters. These are the entities automatically detected by Datashare:
Exclude filters
Tick the 'Exclude' checkbox to search for all documents except those matching the selected items.
In the search breadcrumb, you see that the excluded filters appear struck through:
Contextualize filters
In most filters, tick 'Contextualize' to update the document counts indicated in the filters so they reflect your current results.
The filter will then only count documents matching your current selection:
Clear all filters
To reset all filters at the same time, open the search breadcrumb:
Click 'Clear filters':
Star a single document
Click the star icon either at the right of the document's card or at the top right of the document:
Click on the same icons to unstar.
Star multiple documents
Open the selection mode by clicking the multiple cards icon on the left of the pagination:
Select the documents you want to star:
Click the filled star icon:
To unstar documents, click the three-dot icon if necessary and click Unstar:
Filter starred documents
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Starred' and tick the 'Starred' checkbox:
Tag documents
Tags are always in lowercase letters. They can contain numbers, hyphens and special characters, but not commas nor semicolons (which are the keyboard shortcuts used to validate a tag).
In server collaborative mode, tags are public to the project's other members: you can see their tags and they can see yours.
Tag a single document
In 'Search' > 'Documents', open a document and, above the document's name, click the hashtag icon:
It opens the Tags panel on the left:
Type your tag and press Enter or click 'Add':
Your tag is now displayed in the 'Added by you' category:
Remove your tag, or others' tags, by clicking their cross icon:
Tag multiple documents
Open the selection mode by clicking the multiple cards icon on the left of the pagination:
Select the documents you want to tag:
Click the three-dot icon if necessary and click 'Tag':
Type your tag, or multiple tags separated by commas, and click 'Add':
Remove your tag, or others' tags, by clicking their cross icon on each single document (you cannot untag multiple documents at once):
Filter tagged documents
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Tags' and tick the 'Tag' checkboxes for tagged documents you want to filter:
Recommend a document
In server collaborative mode, recommending documents is public to the project's other members. All members can see who recommended some documents.
In 'Search' > 'Documents', open a document and, above the document's name, click the eyes icon:
It opens the Recommendations panel on the left:
Click on the 'Mark as recommended' button:
The document is now marked as recommended by you:
Click 'Unmark as recommended' to unmark it.
Filter recommended documents
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Recommended by' and tick the 'Username' checkboxes for documents recommended by the users you want to filter:
Explore a document
Explore the document's data through different tabs.
See a document in full-screen view
In 'Search' > 'Documents', open a document and click the icon with in and out arrows (this applies to the List layout; in Grid and Table layouts, documents always open in full-screen view):
You now see the document in full screen view and can go to the next document in your results by using the pagination carousel on the top of the screen:
Search in a document
Open a document in 'Search' > 'Documents'.
Stay on the first tab called 'Text'. This tab shows the text as extracted from your document by Datashare.
Click on the search bar or press Command (⌘) / Control + F
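Type the terms you're searching for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one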
To see all the keyboard shortcuts in Datashare, please read ''.
See original document
Go to the 'View' tab to see the original document.
Note: this visualization of the document is available only for some file types: images, PDF, CSV, xlsx and tiff but not other file types like Word documents or e-mails for instance.
Search for attachments and documents in the same folder
Attachments are called 'children documents' in Datashare.
Go to the 'Metadata' tab and click on 'X documents in the same folder' or 'Y children documents':
You see the list of documents. To open all the documents in the same folder or all the children documents, click 'Search all' below. There is no 'Search all' button if there are no documents, as for the children documents below:
Explore metadata
Go to the 'Metadata' tab to explore all the properties of the document:
If a metadata field is interesting to you and you'd like to know whether other documents in your project share the same value, click the search icon:
You can also copy or pin a metadata field.
Entities
In the 'Entities' tab, only if you previously ran a 'Find entities' task in Datashare, you can read the names of people, organizations, locations and e-mail addresses, along with the number of their occurrences in the document:
Hover over an entity to see a popover with all its mentions in the document; click the arrows to browse each mention in context:
Go to the 'Info' tab to check how the entity was extracted:
Batch search documents
Batch searches allow you to get the results of each query of a list all at once: instead of searching each query one by one, upload a list, set options/filters and see the matching documents.
1. Prepare a CSV list
Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)
Write your queries in the first column of the spreadsheet, typing one query per line:
Do not put line break(s) in any of your cells.
To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them by nothing or a space.
Write at least 2 characters in each query. If one cell contains only one character but at least one other cell contains more than one, the one-character cell will be ignored. If all cells contain only one character, the batch search will lead to a 'failure'.
If you have blank cells in your spreadsheet...
...the CSV, which stands for 'Comma-separated values', will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:
To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:
Remove all commas in your spreadsheet if you want to avoid exact phrase search.
Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:
Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)
Use operators in your CSV: AND, NOT, *, ?, !, +, - and other operators work in batch searches as they do in the regular search bar, but only if 'Do phrase matches' at step 3 is turned off. You can thus turn it off and write your queries, for instance, like the examples below:
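These queries are illustrative only:
Paris AND Dakar
Paris NOT Madrid
Par*s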
2. Export the list as a CSV
Export your spreadsheet of queries in a CSV format:
Important: use the UTF-8 encoding in your spreadsheet software's settings.
3. Create the batch search
Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:
Alternatively, in the menu next to 'Batch searches', click the 'Plus' button:
4. Explore your results
In the menu, click 'Batch searches' and click the name of the batch search to open it:
See the number of matching documents per query:
5. Relaunch a batch search (optional)
If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.
The relaunched batch search will apply to both newly and previously indexed documents (not only the newly indexed ones).
6. Failures
Failures in batch searches can be due to several causes.
How can I contact ICIJ for help, bug reporting or suggestions?
You can send an email to datashare@icij.org.
When reporting a bug, please share:
Your OS (Mac, Windows or Linux) and version
The problem, with screenshots
How can we use Datashare in collaborative mode on a server?
You can use Datashare with multiple users accessing a centralized database on a server.
Warning: putting the server mode in place and maintaining it requires some technical knowledge.
Please find the .
Can I remove document(s) from Datashare?
In local mode, you cannot remove a single document or a selection of documents from Datashare. But you can remove all your projects and documents from Datashare.
Open the menu and on the bottom of the menu, click the trash icon:
A confirmation window opens. The action cannot be undone. It removes all the projects and their documents from Datashare. Click 'Yes' if you are sure:
For advanced users - if you'd like to do it with the Terminal, here are the instructions:
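If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index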
Can I download a document from Datashare?
Yes, you can download a document from Datashare.
Download a document
Open the menu > 'Search' > 'Documents' and click on the download icon on the right of documents' cards:
...or on the top right of an opened document:
What should I do if I get more than 10,000 results?
In Datashare, for technical reasons, it is not possible to open results beyond the 10,000th document.
Example: you search for "Paris" and get 15,634 results. You'd be able to see the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Use filters to narrow down your results and ensure you have fewer than 10,000 matching documents
Change the sorting of your results: use 'creation date' or 'alphabetical order' for instance, instead of the default sorting, which corresponds to a relevance scoring
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
👷♀️ This page is currently being written by Datashare team.
What is an entity?
An entity in Datashare is the name of a person, organization or location, or an e-mail address.
Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically detect entities in your documents.
You can filter documents by their entities and see all the entities mentioned in a document.
What if the 'View' of my documents is 'not available'?
Datashare can display 'View' for some file types only: images, PDF, CSV, xlsx and tiff. Other document types are not supported yet.
Paris NOT Barcelona AND Taipei
Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.
Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained here.
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
The form to create a batch search opens:
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query. In that case, it won't apply any operators (AND, OR, etc.) present in the queries. If 'Do phrase matches' is off, queries are searched without double quotes and with potential operators.
What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
Once you have filled in all the steps, click 'Create' and wait for the batch search to complete.
Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). Sorting by query position restores the original order of the queries in your CSV.
To explore a query's matching documents, click its name and see the list of matching documents:
Click a document's name to open it. Use the page settings or the column's names to sort documents.
In 'Batch searches', go at the end of the table and click the 'Relaunch' icon:
Or click 'Relaunch' in the batch search page below its name on the right panel:
Change its name and description, and decide whether to delete the current batch search after the relaunch:
See your relaunched batch search in the list of batch searches:
The first query containing an error makes the batch search fail and stop.
Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:
Check the first failure-generating query in the error window:
Here it says:
The first line contained a comma while it shouldn't. Datashare interpreted this query as a syntax error; the query failed and the batch search stopped.
We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software, and re-creating the batch search.
'elasticsearch: Name does not resolve'
If you get a message containing 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-open Datashare: check how for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
'Data too large'
One of your queries can lead to a 'Data too large' error.
It means that this query had too many results, or that some documents among its results were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the error and restarting your batch search without it.
Batch download documents
You can also batch download all the documents that match a search. It is limited to 100 MB.
Open the menu > 'Search' > 'Documents', run your queries and apply filters. Once all the results of a specific search are relevant to you, click the download icon on the right of the results:
Find your batch downloads as zip files in the menu > 'Tasks' > 'Batch downloads':
Click on a batch download's name to download it:
Can't download?
If you can't download a document, it means that:
either Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's Github.
or you are using the server collaborative mode and the admins prevented users from downloading documents
How to run Neo4j?
This page explains how to run a Neo4j instance inside Docker. For any additional information, please refer to the [Neo4j documentation](https://neo4j.com/docs/getting-started/).
Run Neo4j inside Docker
1. Enrich the services section of the docker-compose.yml from the install with Docker page with the following neo4j service:
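A minimal sketch of such a service; the image tag, password and port mappings below are assumptions to adapt to your setup:
neo4j:
  image: neo4j:4.4
  environment:
    NEO4J_PLUGINS: '["apoc"]'
    NEO4J_AUTH: neo4j/<your-password>
  ports:
    - "7474:7474"
    - "7687:7687"
  volumes:
    - neo4j-data:/data
    - neo4j-import:/var/lib/neo4j/import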
Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').
2. Enrich the volumes section of the same docker-compose.yml with the following neo4j volumes:
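Matching the volume names used in the sketch above:
volumes:
  neo4j-data:
  neo4j-import: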
3. Start the neo4j service using:
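For instance, assuming the service is named neo4j as above:
docker-compose up -d neo4j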
Run Neo4j Desktop
Install Neo4j Desktop, following the installation instructions
Save your password for later
If the installer notifies you of any port modification, check the configuration and save the server.bolt.listen_address for later
Additional options
Additional options to install Neo4j are available.
Why can results from a simple search and a batch search be slightly different?
If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.
Why?
For technical reasons, Datashare processes both queries in 2 different ways:
a. Search bar (a simple search processed in the browser):
The search query is processed in your browser by Datashare's client, then sent to Elasticsearch through the Datashare server, which forwards your query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries
Each query is sent to Elasticsearch. The results are saved into a database
When the batch search is finished, you get the results from Datashare
Datashare's team tries to keep both results similar, but slight differences can happen between the two queries.
Advanced: how can I do bulk actions with Tarentula?
Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:
Please find all the use cases in Datashare Tarentula's .
How can I uninstall Datashare?
Mac
1. Go to Applications
2. Right-click on 'Datashare' and click 'Move to Bin'
Windows
Follow the steps here:
Linux
Use the following command:
sudo apt remove datashare-dist
What are proximity searches?
As a search operator
In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)
the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: ).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: ).
In batch searches
When you run a batch search, if you turn 'Do phrase matches' on, you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)
the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
What are NLP pipelines?
Pipelines of Natural Language Processing are tools that automatically identify entities in your documents. You can only choose one model at a time for one entity detection task.
Open the menu > 'Tasks' > 'Entities'. Select 'CoreNLP' if you want to use the model with the highest probability of working in most documents.
What is fuzziness?
As a search operator
In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
In batch searches
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It is the same as explained above, it will apply to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
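List of common errors leading to "failure" in Batch Searches
SearchException: query='AND ada'
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way, using characters that are reserved as operators.
You need to correct the error(s) in your CSV and re-launch a new batch search with a CSV that does not contain errors.
'We were unable to perform your search.' What should I do?
This can be due to some syntax errors in the way you wrote your query.
Here are the most common errors that you should correct:
The query starts with AND
You cannot start a query with AND all uppercase.
Unexpected char 106 at (line no=1, column no=81, offset=80)
The query contains a single forward slash
You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).
The query starts with or contains tilde: ~
You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.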
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries, as some errors can remain after correcting the first one.
Example of a syntax error message:
SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
elasticsearch: Name does not resolve
If you get a message containing 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-start Datashare: check how for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
"Old" named entities can remain in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later.
In the future, removing the documents from Datashare before indexing new ones will remove the entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can follow these instructions.
If you see a progress of less than 100%, please wait.
If the progress is 100% but the tasks failed to complete, an error has occurred, which may have various causes. If you're an advanced user, you can create an issue on Datashare's Github with the application logs.
What do I do if Datashare opens a blank screen in my browser?
If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:
First wait 30 seconds and reload the page.
If the screen remains blank, restart Datashare following the instructions for Mac, Windows or Linux.
If you still see a blank screen, please uninstall and reinstall Datashare.
To uninstall Datashare:
On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.
On Windows, please follow .
On Linux, please delete the 3 containers: Datashare, Redis and Elasticsearch, and the script.
To reinstall Datashare, see 'Install Datashare' for , or .
What if Datashare says 'No documents found'?
If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, as no document may correspond to your current search. You can check your URL for active filters. If you're comfortable with the possibility of losing your previously selected filters, open the menu > 'Search' > 'Documents', open the search breadcrumb on the left of the search bar and click 'Clear filters'.
You may not have added documents to Datashare yet. Check how to add documents for Mac, Windows or Linux.
In 'Tasks' > 'Documents', in the Progress column, if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you added, it can take multiple hours.
Write extensions
What if you want to add features to Datashare backend?
Unlike plugins, which provide a way to modify the Datashare frontend, extensions were created to extend the backend functionalities. Two extension points have been defined:
NLP pipelines: you can add a new Java NLP pipeline to Datashare
HTTP API: you can add HTTP endpoints to Datashare and call the Java API you need in those endpoints
Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the extensions they need or want, and have a fully customized installation of our search platform.
Getting started
When starting, Datashare can receive an extensionsDir option, pointing to your extensions' directory. In this example, let's call it /home/user/extensions:
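datashare --extensionsDir /home/user/extensions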
Installing and Removing registered extensions
Listing
You can list official Datashare extensions like this:
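$ datashare -m CLI --extensionList
2020-08-29 09:27:51,219 [main] INFO Main - Running datashare
extension datashare-extension-nlp-opennlp
OPENNLP Pipeline
7.0.0
https://github.com/ICIJ/datashare-extension-nlp-opennlp/releases/download/7.0.0/datashare-nlp-opennlp-7.0.0-jar-with-dependencies.jar
Extension to extract NER entities with OPENNLP
NLP
...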
You can pass a regular expression to --extensionList to filter the extension list if you know what you are looking for.
Installing
You can install an extension by its id, providing where the Datashare extensions are stored:
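$ datashare -m CLI --extensionInstall datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions"
2020-08-29 09:34:30,927 [main] INFO Main - Running datashare
2020-08-29 09:34:32,632 [main] INFO Extension - downloading from url https://github.com/ICIJ/datashare-extension-nlp-mitie/releases/download/7.0.0/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
2020-08-29 09:34:36,324 [main] INFO Extension - installing extension from file /tmp/tmp218535941624710718.jar into /home/user/extensions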
Then if you launch Datashare with the same extension location, the extension will be loaded.
Removing
When you want to stop using an extension, you can either remove the jar by hand from the extensions folder or remove it with datashare --extensionDelete:
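$ datashare -m CLI --extensionDelete datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions/"
2020-08-29 09:40:11,033 [main] INFO Main - Running datashare
2020-08-29 09:40:11,249 [main] INFO Extension - removing extension datashare-extension-nlp-mitie jar /home/user/extensions/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar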
Create your first extension
NLP extension
You can create a "simple" java project like (as simple as a java project can be right), with you preferred build tool.
You will have to add a dependency to the latest version of the datashare API to be able to implement your NLP pipeline.
With the datashare API dependency, you can then create a class implementing or extending the Pipeline interface. When Datashare loads the jar, it will look for a Pipeline implementation.
Unfortunately, you'll also have to make a pull request to datashare-api to add a new type of pipeline. We hope to remove this step in the future.
Build the jar with its dependencies and install it in /home/user/extensions, then start Datashare with extensionsDir set to /home/user/extensions. Your extension will be loaded by Datashare.
Finally, your pipeline will be listed among the available pipelines in the UI.
HTTP extension
Making an HTTP extension is the same as for NLP: you'll have to make a Java project that builds a jar. The only dependency you will need is fluent-http, because Datashare will look for its HTTP annotations @Get, @Post, @Put...
For example, we can create a small class like this:
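package org.myorg;
import net.codestory.http.annotations.Get;
import net.codestory.http.annotations.Prefix;
@Prefix("myorg")
public class FooResource {
  @Get("foo")
  public String getFoo() {
    return "hello from foo extension";
  }
}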
Build the jar, copy it to /home/user/extensions, then start Datashare:
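$ datashare --extensionsDir /home/user/extensions/
# ... starting logs
2020-08-29 11:03:59,776 [Thread-0] INFO ExtensionLoader - loading jar /home/user/extensions/my-extension.jar
2020-08-29 11:03:59,779 [Thread-0] INFO CorsFilter - adding Cross-Origin Request filter allows *
2020-08-29 11:04:00,314 [Thread-0] INFO Fluent - Production mode
2020-08-29 11:04:00,331 [Thread-0] INFO Fluent - Server started on port 8080
$ curl localhost:8080/myorg/foo
hello from foo extension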
et voilà 🔮 ! You can query your new endpoint. Easy, right?
Installing and Removing your custom Extension
You can also install and remove extensions with the Datashare CLI.
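$ datashare -m CLI --extensionInstall /home/user/src/my-extension/dist/my-extension.jar --extensionsDir "/home/user/extensions"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
2020-07-27 10:02:32,596 [main] INFO ExtensionService - installing extension from file /home/user/src/my-extension/dist/my-extension.jar into /home/user/extensions
$ datashare -m CLI --extensionDelete my-extension.jar --extensionsDir "/home/user/extensions"
2020-08-29 10:45:37,363 [main] INFO Main - Running datashare
2020-08-29 10:45:37,579 [main] INFO Extension - removing extension my-extension jar /home/user/extensions/my-extension.jar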
The Datashare API is fully defined using the OpenAPI 3.0 specification and automatically generated after every Datashare release.
The OpenAPI spec is a language-agnostic, machine-readable document that describes all of the API’s endpoints, parameter and response schemas, security schemes, and metadata. It empowers developers to discover available operations, validate requests and responses, generate client libraries, and power interactive documentation tools.
Datashare doesn't open. What should I do?
It can be due to previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them. To do so, you can open your Terminal and copy and paste the text below:
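The exact paths depend on your installation; assuming the default local-mode data directories, something like:
# Mac (assumed default extensions directory)
rm -Rf ~/Library/Datashare/extensions
# Linux (assumed default extensions directory)
rm -Rf ~/.local/share/datashare/extensions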
Datashare Playground delivers a collection of Bash scripts (free of external dependencies) that streamline interaction with a Datashare instance’s Elasticsearch index and Redis queue.
From cloning or replacing whole indices and reindexing specific directories, to adjusting replica settings, monitoring or cancelling long-running tasks, and queuing files for processing, Playground implements each capability through intuitive shell scripts organized under the elasticsearch/ and redis/ directories.
To get started, set ELASTICSEARCH_URL and REDIS_URL in your environment (or add them to a .env file at the repo root). For a comprehensive guide to script options, directory layout, and example workflows, see the full documentation on Github:
Use Playground to update an index's mappings and settings
Some Datashare updates bring fixes and improvements to the index mappings and settings. The index then has to be reindexed accordingly; a rough sketch of the equivalent Elasticsearch calls follows the steps below.
1. Create a temporary empty index and specify the desired Datashare version number:
2. Reindex all documents (under "/" path) from the original index under a temporary one:
This step can take some time if your index has plenty of documents.
3. Replace the old index by the new one:
4. Delete the temporary index:
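For illustration only, here is roughly what these steps do against the Elasticsearch REST API. The Playground scripts wrap similar calls; the index names below are assumptions:
# 1. Create a temporary empty index
curl -XPUT "$ELASTICSEARCH_URL/datashare-tmp"
# 2. Reindex all documents from the original index into the temporary one
curl -XPOST "$ELASTICSEARCH_URL/_reindex" -H 'Content-Type: application/json' -d '{"source": {"index": "local-datashare"}, "dest": {"index": "datashare-tmp"}}'
# 3. Replace the old index: delete it, recreate it with the new mappings and settings, then reindex back from the temporary index
# 4. Delete the temporary index
curl -XDELETE "$ELASTICSEARCH_URL/datashare-tmp"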
Write plugins
What if you want to integrate text translations into Datashare's interface? Or make it display tweets scraped with Twint? Ask no more: there are plugins for that!
Since version 5.6.1, instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user could pick the plugins they need or want, and have a fully customized installation of our search platform.
Getting started
When starting, Datashare can receive a pluginsDir option, pointing to your plugins' directory. In this example, this directory is called ~/Datashare Plugins:
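datashare --pluginsDir "~/Datashare Plugins"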
Installing and Removing registered plugins
Listing
You can list official Datashare plugins like this:
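$ datashare -m CLI --pluginList ".*"
2020-07-24 10:04:59,767 [main] INFO Main - Running datashare
plugin datashare-plugin-site-alert
Site Alert
v1.2.0
https://github.com/ICIJ/datashare-plugin-site-alert
A plugin to display an alert banner on the Datashare demo instance.
...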
The string given to --pluginList is a regular expression. You can use it to filter the plugin list if you know what you are looking for.
Installing
You can install a plugin with its id and providing where the Datashare plugins are stored:
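$ datashare -m CLI --pluginInstall datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:15:46,732 [main] INFO Main - Running datashare
2020-07-24 10:15:50,202 [main] INFO PluginService - downloading from url https://github.com/ICIJ/datashare-plugin-site-alert/archive/v1.2.0.tar.gz
2020-07-24 10:15:50,503 [main] INFO PluginService - installing plugin from file /tmp/tmp7747128158158548092.gz into /home/dev/Datashare Plugins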
Then if you launch Datashare with the same plugin location, the plugin will be loaded.
Removing
When you want to stop using a plugin, you can either remove the directory by hand from the plugins folder or remove it with datashare --pluginDelete:
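$ datashare -m CLI --pluginDelete datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:20:43,431 [main] INFO Main - Running datashare
2020-07-24 10:20:43,640 [main] INFO PluginService - removing plugin base directory /home/dev/Datashare Plugins/datashare-plugin-site-alert-1.2.0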
Create your first plugin
To inject plugins, Datashare will look for a Node-compatible module in ~/Datashare Plugins. This way we can rely on NPM/Yarn to handle built packages. It can be:
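* A folder with a package.json file containing a "main" field.
* A folder with an index.js file in it.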
Datashare will read the content of each module in the plugins directory to automatically inject them in the user interface. The backend will serve the plugin files. The entrypoint of each plugin (usually the main property of package.json) is injected with a <script> tag, right before the closing </body> tag.
Create a hello-world directory with a single index.js:
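A minimal sketch of such a file (the message is illustrative):
// index.js - logs a message so you can check the plugin was injected
console.log('Hello world from my first Datashare plugin!')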
Reload the page, open the console: et voilà 🔮! Easy, right?
Installing and Removing your custom Plugin
Now you would like to develop your plugin in your repository and not necessarily in Datashare Plugins folder.
You can have your code under, say ~/src/my-plugin and deploy it into Datashare with the plugin API. To do so, you'll need to make a zip or a tarball, for example in ~/src/my-plugin/dist/my-plugin.tgz.
The tarball could contain:
Then you can install it with:
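$ datashare -m CLI --pluginInstall ~/src/my-plugin/dist/my-plugin.tgz --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
2020-07-27 10:02:32,596 [main] INFO PluginService - installing plugin from file ~/src/my-plugin/dist/my-plugin.tgz into ~/Datashare Plugins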
And remove it:
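$ datashare -m CLI --pluginDelete my-plugin --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
2020-07-27 10:02:32,596 [main] INFO PluginService - removing plugin base directory ~/Datashare Plugins/my-plugin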
In that case my-plugin is the base directory of the plugin (the one that is in the tarball).
Adding elements to the Datashare user interface
To allow external developers to add their own components, we added markers in strategic locations of the user interface where a user can define new components. These markers are called "hooks":
To register a new component to a hook, use the following method:
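// `datashare` is a global variable
datashare.registerHook({ target: 'app-sidebar.menu:before', definition: 'This is a message written with a plugin' })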
Or with a more complex example:
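// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', ({ detail }) => {
  // Alert is a Vue component, meaning it can have computed properties, methods, etc.
  const Alert = {
    computed: {
      weekday () {
        const today = new Date()
        return today.toLocaleDateString('en-US', { weekday: 'long' })
      }
    },
    template: `<div class="text-center bg-info p-2 width-100">
      It's {{ weekday }}, have a lovely day!
    </div>`
  }
  // This is the most important part of this snippet:
  // we register the component on a given `target`
  // using the core method `registerHook`.
  detail.core.registerHook({ target: 'landing.form:before', definition: Alert })
})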
CLI with Tarentula
Datashare Tarentula is a powerful command-line toolbelt designed to streamline bulk operations against any Datashare instance.
Whether you need to count indexed files, download large datasets, batch-tag records, or run complex Elasticsearch aggregations, Tarentula provides a consistent, scriptable interface with flexible query support, and Docker compatibility.
It also exposes a Python API for embedding automated workflows directly into your data pipelines.
With commands like count, download, aggregate, and tagging-by-query, you can handle millions of records in a single invocation, or integrate Tarentula into CI/CD pipelines for reproducible data tasks.
You can install Tarentula with your favorite package manager:
pip3 install --user tarentula
Or alternatively with Docker:
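The image name below is an assumption; check Datashare Tarentula's repository for the exact name:
docker run -it icij/datashare-tarentula tarentula --help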
For the complete list of commands, options, and examples, read the documentation on Github:
Design System
Datashare's frontend is built with Vue 3 and Bootstrap 5. We document all components of the interface on a dedicated Storybook:
To facilitate the creation of plugins, each component can be imported directly from the core:
// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', async () => {
// This load the ButtonIcon component asynchronously
const ButtonIcon = await datashare.findComponent('Button/ButtonIcon')
// Then we create a dummy component. For the sake of simplicity we use
// Vue 3's options API, but we strongly encourage you to build your plugins
// with Vite and use the composition API.
const definition = {
components: {
ButtonIcon,
},
methods: {
sayHi() {
alert('Hi!')
}
},
template: `
<button-icon @click="sayHi()" icon-left="hand-waving">
Say hi
</button-icon>
`
}
// Finally, we register the component's definition in a hook.
datashare.registerHook({ target: 'app-sidebar-sections:before', definition })
})
In this example you learn that:
Datashare launch must be awaited with the "datashare:ready" event
You can asynchronously import components with datashare.findComponent
Components can be registered at targeted locations with a "hook"
All icons from the bundled icon library are available and loaded automatically