Datashare allows you to search in your files, regardless of their format. It is a free open-source software developed by the International Consortium of Investigative Journalists (ICIJ).
Welcome to Datashare - a self-hosted documents search software. It is a free and open-source software developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, this tool is now a fully-featured search interface to dig into your documents.
With the help of several open-source tools (Extract, Apache Tika, Apache Tesseract, CoreNLP, OpenNLP, Elasticsearch, and more), Datashare can be used on one single personal computer, as well as on 100 interconnected servers.
Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested in investigations like the Panama Papers or the Luanda Leaks.
Seeing the growing interest in ICIJ's technology, we decided to open-source this key component of our investigations so that a single journalist as well as a big media organization can use it for their own documents.
Datashare is free so anyone can use it and find it useful.
Curious to know more about how we use Datashare?
We set up a demo instance of Datashare with a small set of documents from the LuxLeaks investigation (2014). When using this instance, you will be assigned a temporary user who can star, tag, recommend and explore documents.
Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.
When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non-tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly with our API.
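As an illustration (not an official API reference), assuming a local setup where Elasticsearch listens on localhost:9200 and the default project index is named local-datashare, an advanced user could query the index directly with a standard Elasticsearch search request:

curl -s 'http://localhost:9200/local-datashare/_search?q=Shakespeare&size=5'

This returns the raw Elasticsearch JSON response for the first 5 documents matching "Shakespeare".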
To make this process more accessible, we implemented the possibility to create plugins. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.
This project is currently available in English, French and Spanish. You can help improve and complete translations on Crowdin.


These pages will help you set up and install Datashare on your computer.
Install the Neo4j plugin following these instructions.
1. At the bottom of the menu, click on the 'Settings' icon:
2. Make sure the following settings are properly set:
Neo4j Host should be localhost or the address where your Neo4j instance is running
Neo4j Port should be the port where your Neo4j instance is running (7687 by default)
Neo4j User should be set to your Neo4j user name (neo4j by default)
Neo4j Password should only be set if your Neo4j user is using password authentication
3. When running Neo4j Community Edition, set the Neo4j Single Project value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project which will use the Neo4j plugin. Other projects won't be able to use the Neo4j plugin.
4. Restart Datashare to apply the changes. Check how for Mac, Windows or Linux.
5. Go to 'Projects' > your project's page > the Graph tab. You should see the Neo4j widget. After a little while, its status should be RUNNING:
You can now create the graph.


If you search "Shakespeare" in the search bar and also run a query containing "Shakespeare" in a batch search, you can get slightly different documents in the two sets of results.
Why?
For technical reasons, Datashare processes the two queries in different ways:
a. Search bar (a simple search processed in the browser):
The search query is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare's server, which forwards your query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries
Each query is sent to Elasticsearch. The results are saved into a database.
When the batch search is finished, you get the results from Datashare.
Datashare sends back the results stored in the database.
Datashare's team tries to make both sets of results as similar as possible, but slight differences can occur between the two types of queries.

This page explains how to run a neo4j instance inside Docker. For any additional information please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/)
1. Enrich the services section of the docker-compose.yml from the 'Install with Docker' page with the following neo4j service:
...
services:
  neo4j:
    image: neo4j:5-community
    environment:
      NEO4J_AUTH: none
      NEO4J_PLUGINS: '["apoc"]'
    ports:
      - 7474:7474
      - 7687:7687
    volumes:
      - neo4j_conf:/var/lib/neo4j/conf
      - neo4j_data:/var/lib/neo4j/data

Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').
2. Enrich the volumes section of the docker-compose.yml from the 'Install with Docker' page with the following neo4j volumes:
volumes:
  ...
  neo4j_data:
    driver: local
  neo4j_conf:
    driver: local

3. Start the neo4j service using:

docker compose up -d neo4j

To install with Neo4j Desktop instead, follow the installation instructions found here:
create a new local DBMS and save your password for later
if the installer notifies you of any ports modification, check the DBMS settings and save the server.bolt.listen_address for later
make sure to install the APOC Plugin
Additional options to install neo4j are listed here.
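If you installed neo4j with Docker Compose as described above, you can check that the container is up before moving on (these are standard Docker Compose commands, nothing Datashare-specific is assumed):

docker compose ps neo4j
docker compose logs -f neo4j

The logs should eventually show that Bolt is enabled on port 7687, which is the port Datashare's Neo4j plugin connects to.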
👷♀️ This page is currently being written by the Datashare team.
An entity in Datashare is the name of a person, organization or location, or an email address.
Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically detect entities in your documents.
You can filter documents by their entities and see all the entities mentioned in a document.
You started tasks, and they are running as you can see on 'http://localhost:8080/#/indexing' but they are not completing.
There are two possible causes:
If you see a progress of less than 100%, please wait.
If the progress is 100%, an error has occurred, and the tasks failed to complete, which may be caused by various reasons. If you're an advanced user, you can create an issue on Datashare Github with the application logs.
If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:
First wait 30 seconds and reload the page.
If you still see a blank screen, please uninstall and reinstall Datashare
To uninstall Datashare:
On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.
On Windows, please follow these steps.
On Linux, please delete the 3 containers: Datashare, Redis and Elastic Search, and the script.
To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.
Datashare's filters keep the entities (people, organizations, locations, e-mail addresses) previously found.
"Old" named entities can remain in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later.
In the future, removing the documents from Datashare before indexing new ones will remove the entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can follow .

It can be due to previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them. To do so, open your Terminal and copy and paste the commands below:
On Mac:

rm -rf ~/Library/datashare/plugins ~/Library/datashare/extensions

On Linux:

rm -rf ~/.local/share/datashare/plugins ~/.local/share/datashare/extensions

On Windows:

del /S %APPDATA%\Datashare\Extensions %APPDATA%\Datashare\Plugins

Press Enter. Open Datashare again.
Datashare runs using different modes with their own features.
LOCAL
Web
To run Datashare on a single computer for a single user.
SERVER
Web
To run Datashare on a server for multiple users.
CLI
CLI
To index and analyze documents directly from the command line.
TASK_RUNNER
Daemon
To execute async tasks (batch downloads, scan, index, NER extraction, ...)
There are two modes:
In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
The running modes offer advantages and limitations. This matrix summarizes the differences:
|                 | local | server |
|-----------------|-------|--------|
| Multi-users     | ❌    | ✅     |
| Multi-projects  | ✅    | ✅     |
| Access-control  | ❌    | ✅     |
| Indexing UI     | ✅    | ❌     |
| Plugins UI      | ✅    | ❌     |
| Extension UI    | ✅    | ❌     |
| HTTP API        | ✅    | ✅     |
| API Key         | ✅    | ✅     |
| Single JVM      | ✅    | ❌     |
| Tasks execution | ✅    | ❌     |
When running Datashare in local mode, users can choose to use embedded services (like ElasticSearch, SQLITE, an in-memory key/value store) on the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
In cli mode, Datashare starts without a web server and allows users to perform tasks on their documents. This mode can be used in conjunction with both local and server modes, while allowing users to distribute heavy tasks between several servers.
If you want to learn more about which tasks you can execute in this mode, checkout the stages documentation.
Those modes are intended to be used for actions that require waiting for pending tasks.
In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search, and bundles them into a zip file.
In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form on Datashare where they can upload a list of search terms (in CSV format). The daemon will then start a task to search all matching documents and store every occurrence in the database.
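As a minimal sketch of how such a daemon could be started, here is a TASK_RUNNER invocation pointed at the same shared Redis and Elasticsearch services used in the other examples of this documentation (the exact set of required options may differ depending on your setup and version):

datashare \
  --mode TASK_RUNNER \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200 \
  --defaultProject local-datashare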
Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. Yet when starting Datashare in command line, you can explicitly specify the running mode. For instance on Ubuntu/Debian:
datashare \
# Switch to SERVER mode
--mode SERVER \
# Dummy session filter to create ephemeral users
--authFilter org.icij.datashare.session.YesCookieAuthFilter \
# Name of the default project for every user
--defaultProject local-datashare \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200 \
# URI of Redis
--redisAddress redis://redis:6379 \
# store user sessions in Redis.
--sessionStoreType REDIS

When running Datashare from the command-line, pick which "stage" to apply to analyze your documents.
The CLI stages are primarily intended to be run for an instance of Datashare that uses non-embedded resources (ElasticSearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.
This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.
datashare --mode CLI \
# Select the SCAN stage
--stage SCAN \
# Where the documents are located
--dataDir /path/to/documents \
# Store the queued files in Redis
--dataBusType REDIS \
# URI of Redis
--redisAddress redis://redis:6379

The INDEX stage is probably the most important (and heaviest!) one. It pulls the documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text and metadata and to OCR images. The resulting documents are stored in ElasticSearch. The queue used to store the documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command on several servers.
datashare --mode CLI \
# Select the INDEX stage
--stage INDEX \
# Where the documents are located
--dataDir /path/to/documents \
# Store the queued files in Redis
--dataBusType REDIS \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200 \
# Enable OCR
--ocr true \
# URI of Redis
--redisAddress redis://redis:6379

Once a document is available for search (stored in ElasticSearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entity mentions in ElasticSearch, it will also mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can be parallelized on several servers as well.
datashare --mode CLI \
# Select the NLP stage
--stage NLP \
# Use CORENLP to detect named entities
--nlpp CORENLP \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200

To report a bug, please post an issue on our GitHub, detailing your logs and including:
Your Operating System (Mac, Windows or Linux)
The version of your Operating System
The version of Datashare
Screenshots of your issue
A description of your issue
If, for confidentiality reasons, you don't want to open an issue on Github, please write to datashare@icij.org.
This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.
In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines.
The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure.
This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
Datashare provides a folder on your Mac to collect documents you want to have in Datashare.
Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':
On the menu bar at the top of your computer, click 'Go' and 'Home' (the house icon):
You will see a folder called 'Datashare':
If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:
Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.
Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.
Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.
Skip already indexed documents if you'd like.
Click 'Add'
Two extraction tasks are now running:
The first is the scanning of your Datashare folder - it sees if there are documents to analyze. It is called 'Scan folders'.
The second is the indexing of these files. It is called 'Index documents'.
Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.
But you can start searching in your documents without having to wait for all tasks to be done.
You can now search documents in Datashare.










These pages will help you set up and install Datashare on your computer.
You must have Windows 7 Service Pack 2 or any newer version.
Before we start, please uninstall any prior standard version of Datashare if you had already installed it. You can follow these steps: https://www.laptopmag.com/articles/uninstall-programs-windows-10
Go to datashare.icij.org and click 'Download for Windows':
The file 'datashare-X.Y.Z.exe' is now downloaded. You can find it in your Downloads.
Double-click on the name of the file in order to execute it.
You can now start Datashare.
The installer will take care of checking that your system has all the dependencies to run Datashare. Because this software uses Apache Tesseract (to perform Optical Character Recognition, OCR) and macOS doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.
The installer will set up:
Xcode Command Line Tools (if neither Xcode nor the Xcode Command Line Tools is installed)
Homebrew (if neither Homebrew nor MacPorts is installed)
Apache Tesseract with MacPorts or Homebrew
Java JRE 17
Datashare executable
Note: Previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore but Datashare is still published on the Docker Hub and supported with Docker.
Installation fails:
Error while installing Homebrew or MacPorts: you can manually install Homebrew first and then restart the installer.
"System Software from application was blocked from loading" : Check in your Mac's "System Settings" > "privacy & security" if you have a section with this mention "System software from application 'Datashare' was blocked from loading" or something similar related to Datashare. If you have this section you'll have to click "Allow" to be able to install datashare.
For any other issue check our Github issues or create a new one with your setup (macOs version) and installer logs (Command+L when the installer is launched and failed).
Click 'Continue', 'Install', enter your password and 'Install Software':
The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing the Xcode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.
You can see what it actually does by typing command+L: it will open a window which logs every action made.
In the end, you should see this screen:
You can now safely close this window.
You can now start Datashare.

























This page will help you set up and install Datashare within a Docker container.
The Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system
To start Datashare within a Docker container, you can use this command:
docker run --mount src=$HOME/Datashare,target=/home/datashare/data,type=bind -p 8080:8080 icij/datashare:11.1.9 --mode EMBEDDED

Make sure the Datashare folder exists in your home directory or this command will fail. This is an example of how to use Datashare with Docker; data will not be persisted.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).
By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store, will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare with Docker Compose, you can use the following docker-compose.yml file:
version: "3.7"
services:
datashare:
image: icij/datashare:18.1.3
hostname: datashare
ports:
- 8080:8080
environment:
- DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
volumes:
- type: bind
source: ${HOME}/Datashare
target: /home/datashare/data
- type: volume
source: datashare-models
target: /home/datashare/dist
command: >-
--dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password
--mode LOCAL
--tcpListenPort 8080
depends_on:
- postgresql
- redis
- elasticsearch
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
restart: on-failure
volumes:
- type: volume
source: elasticsearch-data
target: /usr/share/elasticsearch/data
read_only: false
environment:
- "http.host=0.0.0.0"
- "transport.host=0.0.0.0"
- "cluster.name=datashare"
- "discovery.type=single-node"
- "discovery.zen.minimum_master_nodes=1"
- "xpack.license.self_generated.type=basic"
- "http.cors.enabled=true"
- "http.cors.allow-origin=*"
- "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
redis:
image: redis:4.0.1-alpine
restart: on-failure
postgresql:
image: postgres:12-alpine
environment:
- POSTGRES_USER=datashare
- POSTGRES_PASSWORD=password
- POSTGRES_DB=datashare
volumes:
- type: volume
source: postgresql-data
target: /var/lib/postgresql/data
volumes:
datashare-models:
elasticsearch-data:
postgresql-data:Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
docker-compose logs -f datashare

Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.
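If you prefer to verify from the command line that the web server is responding before opening a browser, a plain HTTP request against the mapped port is enough (no Datashare-specific endpoint is assumed here):

curl -I http://localhost:8080

An HTTP status line in the response confirms the container is reachable.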
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
docker-compose down

This will stop and remove the containers, freeing up system resources.
Currently, only a .deb package for Debian/Ubuntu is provided.
If you want to run it with another Linux distribution, you can download the latest version of the Datashare jar here: https://github.com/ICIJ/datashare/releases/latest
And adapt the following launch script to your environment: https://github.com/ICIJ/datashare/blob/master/datashare-dist/src/main/deb/bin/datashare.
Go to datashare.icij.org and click 'Download for Linux':
Save the Debian package as a file:
You can now start Datashare.
Find the application on your computer and run it locally in your browser.
Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)
A window called 'Terminal' will have opened, showing the progress of opening Datashare. Do not close this black window as long as you use Datashare.
Keep this Terminal window open as long as you use Datashare.
Datashare should now automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
You can now add documents to Datashare.
Find the application on your computer and run it locally on your browser.
Start Datashare by launching it from the command-line:
datashare

Datashare should now automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see: Can I use Datashare with no internet connection?).
It's now time to add documents to Datashare.






This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your computer.
Follow the instructions of the dedicated FAQ page to get Neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. Click on the 'Other platforms and versions' button when downloading to access other versions if necessary.
If it's not done yet, find entities to extract the names of people, organizations and locations, as well as email addresses.
If your project contains emails, make sure to also extract email addresses.
You can now run Datashare with the Neo4j plugin.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers.
This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Datashare is launched with --mode SERVER and you have to provide:
The external elasticsearch index address elasticsearchAddress
A Redis store address redisAddress
A Redis data bus address messageBusAddress
A database JDBC URL dataSourceUrl
The host of Datashare (used to generate batch search results URLs) rootHost
An authentication mechanism and its parameters
Example:
docker run -ti ICIJ/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password \
--rootHost https://my.datashare-server.org
# ... + auth parameters (see authentication providers section)

This page explains how to locally add plugins and extensions to Datashare.
Plugins are front-end modules to add new features in Datashare's user interface.
Extensions are back-end modules to add new features to store and manipulate data with Datashare.
At the bottom of the menu, click the 'Settings' icon:
Open the 'Plugins' tab:
Choose the plugin you want to add and click 'Install':
If you want to install a plugin from a URL, click 'Install from a URL':
Your plugin is now installed:
Refresh your page to see your new plugin activated in Datashare.
At the bottom of the menu, click the 'Settings' icon:
Open the 'Extensions' tab:
Choose the extension you want to add and click 'Install':
If you want to install an extension from a URL, click 'Install from a URL':
Your extension is now installed:
When a newer version of a plugin or extension is available, get the latest version.
If it is a plugin, refresh your page to activate the latest version.
If it is an extension, restart Datashare to activate the latest version. Check how for Mac, Windows and Linux.
People who can code can create their own plugins and extensions by following these steps:
Datashare provides a folder to collect documents on your computer to index in Datashare.
Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.
Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.
Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.
Skip already indexed documents if you'd like.
Click 'Add'
Two extraction tasks are now running:
The first is the scanning of your Datashare folder - it sees if there are documents to analyze. It is called 'ScanTask'.
The second is the indexing of these files. It is called 'IndexTask'.
Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.
But you can start searching in your documents without having to wait for all tasks to be done.
You can now search documents in Datashare.
This page helps you find entities (people, organizations, locations, e-mail addresses) in your documents.
In the menu, in 'Tasks', click 'Entities'
In the menu or on the top right, click the 'Plus' button or on the page, click 'Find entities':
Select your options
Select a project where you want to find entities
Choose between finding names of people, organizations and locations or finding email addresses. You cannot do both simultaneously; you need to do one after the other, in either order.
Choose a Natural Language Processing model, that is to say the software which will run the entity recognition. If you want to add more models, you can check how to add them as extensions.
In 'Tasks' > 'Entities', watch the progress of your entity recognition:
Once they are done, you can click 'Delete done tasks' to stop displaying tasks that are completed.
Explore your entities in the documents
You can now start searching your entities in the documents without having to wait for all tasks to be done.
In the menu, click 'Search' > 'Documents' and open the 'Entities' tab of your documents or use the Entities filters.
This page describes how to create your Neo4j graph and keep it up to date with your computer's Datashare projects.
Go to 'All projects' and click on your project's name:
Go to the Graph tab and in the first step 'Import', click on the 'Import' button:
You will then see a new import task running.
When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:
If new documents or entities are added or modified in Datashare, you will need to update the Neo4j graph to reflect these changes.
Go to 'All projects' > one project's page > the 'Graph' tab. In the first step, click on the 'Update graph' button:
To detect whether a graph update is needed, go to the 'Projects' page and open your project:
Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...
...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
You can now explore your graph using your favorite visualization tool.
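As a quick sanity check from the command line, any Neo4j client can count the imported nodes. Here is a sketch using cypher-shell, assuming the connection details from the FAQ page (Bolt on localhost:7687, authentication disabled):

cypher-shell -a bolt://localhost:7687 'MATCH (n) RETURN labels(n) AS label, count(*) AS nodes ORDER BY nodes DESC;'

The counts should roughly match the 'Graph statistics' shown in the Graph tab.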
Find the Datashare application on your computer and run it locally on your browser.
Once Datashare is installed, go to 'Finder' > 'Applications', and double-click on 'Datashare':
A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening:
⇒ Important: Keep this Terminal window open as long as you use Datashare.
Once the process is done, Datashare should automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' as a URL in your browser.
Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
You can now add documents to Datashare.




































This document assumes that you have installed Datashare in server mode within Docker and already added documents to Datashare.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses and names of people, organizations and locations. This process uses a Natural Language Processing (NLP) pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entity extraction in the same fashion as the previous CLI stages:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage NLP \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--nlpParallelism 2 \
--nlpp CORENLP

What's happening here:
Datashare starts in "CLI" mode
We ask to process the NLP stage
We tell Datashare to use the elasticsearch service
Datashare will pull documents from ElasticSearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
The first time you run this command, you will have to wait a little because Datashare needs to download CORENLP's models, which can be large.
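If you want to check how many document ids are waiting in that queue, you can inspect it with redis-cli from the Redis container of the Docker Compose setup (llen is a standard Redis command; the queue name is the default mentioned above):

docker compose exec redis redis-cli llen extract:queue:nlp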
You can also chain the 3 stages together:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage SCAN,INDEX,NLP \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--nlpParallelism 2 \
--nlpp CORENLP \
--dataDir /home/datashare/Datashare/

As for the previous stages, you may want to restore the output queue from the INDEX stage. You can do:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage ENQUEUEIDX,NLP \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--nlpParallelism 2 \
--nlpp CORENLP

The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the IDs of those documents into the extract:queue:nlp queue.
This page explains how to start Datashare within Docker in server mode.
The Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file for version 20.1.4 (check latest version on https://datashare.icij.org/):
version: "3.7"
services:
datashare:
image: icij/datashare:20.1.4
hostname: datashare
ports:
- 8080:8080
environment:
- DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
volumes:
- type: bind
source: ${HOME}/Datashare
target: /home/datashare/data
- type: volume
source: datashare-models
target: /home/datashare/dist
command: >-
--dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password
--mode LOCAL
--tcpListenPort 8080
depends_on:
- postgresql
- redis
- elasticsearch
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
restart: on-failure
volumes:
- type: volume
source: elasticsearch-data
target: /usr/share/elasticsearch/data
read_only: false
environment:
- "http.host=0.0.0.0"
- "transport.host=0.0.0.0"
- "cluster.name=datashare"
- "discovery.type=single-node"
- "discovery.zen.minimum_master_nodes=1"
- "xpack.license.self_generated.type=basic"
- "http.cors.enabled=true"
- "http.cors.allow-origin=*"
- "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
redis:
image: redis:4.0.1-alpine
restart: on-failure
postgresql:
image: postgres:12-alpine
environment:
- POSTGRES_USER=datashare
- POSTGRES_PASSWORD=password
- POSTGRES_DB=datashare
volumes:
- type: volume
source: postgresql-data
target: /var/lib/postgresql/data
volumes:
datashare-models:
elasticsearch-data:
postgresql-data:Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
docker-compose logs -f datashare_web

Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
docker-compose down

This will stop and remove the containers, freeing up system resources.
If you reach this point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.
Datashare has the ability to detect email addresses and names of people, organizations and locations. You must perform the named entity extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.
Authentication is the most impactful choice you have to make when running Datashare in server mode. It can be one of the following:
Basic authentication with credentials stored in database (PostgreSQL)
Basic authentication with credentials stored in Redis
OAuth2 with credentials provided by an identity provider (KeyCloak for example)
Dummy basic auth to accept any user (⚠️ if the service is exposed to the internet, it will leak your documents)
This page explains how to install language packages to support Optical Character Recognition (OCR) on more languages.
To perform OCR, Datashare uses an open-source technology called Apache Tesseract. When Tesseract extracts text from images, it uses 'language packages' specially trained for each specific language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare, the installer doesn't include them all by default. In case Datashare informs you of a missing package, this guide explains how to manually install it on your system.
To add OCR languages on Linux, simply use the following command:
sudo apt install tesseract-ocr-[lang]

Where `[lang]` can be:
all if you want to install all languages
a language code (e.g. fra for French); the list of languages is available here
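For example, to install the French language package mentioned above:

sudo apt install tesseract-ocr-fra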
The Datashare installer for Mac checks for the existence of either MacPorts or Homebrew, the package managers used for the installation of Tesseract. If neither of those two package managers is present, the Datashare installer will install MacPorts by default.
First, you must check that MacPorts is installed on your computer. Please run in a Terminal:

port version

You should see an output similar to this:
If you get a 'command not found: port' error, this either means you are using Homebrew (see next section) or you have not run the Datashare installer for Mac yet.
If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):

port install tesseract-deu

The full list of supported language packages can be found on the MacPorts website.
Once the installation is done, close and restart Datashare to be able to use the newly installed packages.
If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!
If you want to check if Homebrew is installed, run the following command in a Terminal:
brew -v

You should see an output similar to this:
If you get a 'command not found: brew' error, this means Homebrew is not installed on your system. This either means you are using MacPorts (see previous section) or you have not run the Datashare installer for Mac on your computer yet.
Language packages are available in the Tesseract GitHub repository. Trained data files have to be downloaded and added to the tessdata folder in Tesseract's installation folder.
Additional languages can also be added during Tesseract's installation.
The list of installed languages can be checked with Windows command prompt or Powershell with the command tesseract --list-langs.
Datashare has to be restarted after the language installation. Check how for Mac, Windows and Linux.
Basic authentication with Redis
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
Authorization: Basic dXNlcjpwYXNzd29yZA==

It is secure as long as the communication to the server is encrypted (with SSL for example).
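You can reproduce the encoded value yourself with standard shell tools, which is handy for checking a credential pair (nothing Datashare-specific here):

echo -n 'user:password' | base64
dXNlcjpwYXNzd29yZA==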
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded. For example using bash:
$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -

Then insert the user like this in Redis:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}'

If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example, if you use myindex:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex","local-datashare"]}}'

Then you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed in Redis:
docker run -ti ICIJ/datashare --mode SERVER \
--batchQueueType REDIS \
--dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
--sessionStoreType REDIS \
--authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
--authUsersProvider org.icij.datashare.session.UsersInRedis



Install the Neo4j plugin using the Datashare CLI so that users can access it from the frontend:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--pluginInstall datashare-plugin-neo4j-graph-widget

Installing the plugin installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and will also install the datashare-extension-neo4j backend extension inside /home/datashare/extensions. These locations can be changed by updating the docker-compose.yml.
Update the docker-compose.yml to reflect your Neo4j docker service settings.
...
services:
  datashare_web:
    ...
    environment:
      - DS_DOCKER_NEO4J_HOST=neo4j
      - DS_DOCKER_NEO4J_PORT=7687
      - DS_DOCKER_NEO4J_SINGLE_PROJECT=secret-project # This is for community edition only

If you choose a different Neo4j user or set a password for your Neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.
When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set DS_DOCKER_NEO4J_SINGLE_PROJECT to the name of the project which will use the Neo4j plugin. Other projects won't be able to use the Neo4j plugin.
After installing the plugin a restart might be needed for the plugin to display:
docker compose restart datashare_web

You can now create the graph.
This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your server.
Follow the instructions of the dedicated FAQ page to get Neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. Click on the 'All platforms and versions' button when downloading to access other versions if necessary.
If it's not done yet, add entities to your project using the Datashare CLI.
If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the nlpp flag to --nlpp CORENLP,EMAIL, as shown in the example below.
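Concretely, reusing the NLP command from the 'Add named entities from the CLI' page, that would look like this (secret-project is just the example project name used elsewhere in this documentation):

docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage NLP \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--nlpp CORENLP,EMAIL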
You can now run Datashare with the Neo4j plugin.
This document assumes that you have installed Datashare in server mode within Docker.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.
Here is a simple command to scan a directory and index its files:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage SCAN,INDEX \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--dataDir /home/datashare/Datashare/

What's happening here:
Datashare starts in "CLI" mode
We ask to process both SCAN and INDEX stages at the same time
The SCAN stage feeds a queue in memory with the files to add
The INDEX stage pulls files from the queue to add them to ElasticSearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine
Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage SCAN \
--queueType REDIS \
--queueName "datashare:queue" \
--redisAddress redis://redis:6379 \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--dataDir /home/datashare/Datashare/

Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in the datashare:queue:
docker compose exec redis redis-cli lrange datashare:queue 0 20

The INDEX stage can now be executed in the same container:
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage INDEX \
--queueType REDIS \
--queueName "datashare:queue" \
--redisAddress redis://redis:6379 \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--dataDir /home/datashare/Datashare/

Once the indexing is done, Datashare will exit gracefully and your documents will already be visible in Datashare.
Sometimes you will face the case where you have an existing index, and you want to index additional documents inside your working directory without processing every document again. It can be done in two steps:
Scan the existing ElasticSearch index and gather document paths to store it inside a report queue
Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths already listed in it will be skipped
docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage SCANIDX \
--queueType REDIS \
--reportName "report:queue" \
--redisAddress redis://redis:6379 \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--dataDir /home/datashare/Datashare/

docker compose exec datashare_web /entrypoint.sh \
--mode CLI \
--stage SCAN,INDEX \
--ocr true \
--queueType REDIS \
--queueName "datashare:queue" \
--reportName "report:queue" \
--redisAddress redis://redis:6379 \
--defaultProject secret-project \
--elasticsearchAddress http://elasticsearch:9200 \
--dataDir /home/datashare/Datashare/

Basic authentication with a database.
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
Authorization: Basic dXNlcjpwYXNzd29yZA==

It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a database user inventory. You can launch Datashare first with the full database URL; Datashare will then automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.
Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -

Then you can insert the user like this in your database:
$ psql datashare
datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}');

If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example, if you use myindex:
$ psql datashare
datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex", "local-datashare"]}}');

Or you can use PostgreSQL's CSV COPY statement if you want to create them all at once, as sketched below.
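A sketch of that bulk approach (the CSV file name and layout are hypothetical; the column list mirrors the insert statements above), run from psql:

$ psql datashare
datashare=> \copy user_inventory (id, email, name, provider, details) from 'users.csv' with (format csv)

Each CSV row would then carry the same five values as the insert statements, with the details column containing the JSON password and groups_by_applications object.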
Then when accessing Datashare, you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed in database:
docker run -ti ICIJ/datashare --mode SERVER \
--batchQueueType REDIS \
--dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
--sessionStoreType REDIS \
--authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
--authUsersProvider org.icij.datashare.session.UsersInDb

Dummy authentication provider to disable authentication
You can use a dummy authentication filter that always accepts basic auth. You will still see this popup:
But whatever user or password you type, it will let you into Datashare.
docker run -ti ICIJ/datashare -m SERVER \
--dataDir /home/dev/data \
--batchQueueType REDIS \
--dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=dstest&password=test' \
--sessionStoreType REDIS \
--authFilter org.icij.datashare.session.YesBasicAuthFilter
This page describes how to create your Neo4j graph and keep it up to date with your server's Datashare projects.
The Neo4j-related features are added to the DatashareCLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/.local/share/datashare/extensions/*, so the CLI is run as follows:
docker compose exec \
# if you are not using the default extensions directory
# you have to specify it extending the CLASSPATH variable ex:
# -e CLASSPATH=/home/datashare/extensions/* \
datashare_web /entrypoint.sh \
--mode CLI \
--ext neo4j \
...
In order to create the graph, run the --full-import command for your project:
docker compose exec \
datashare_web /entrypoint.sh \
--mode CLI \
--ext neo4j \
--full-import \
--project secret-project
The CLI will display the import task progress and log import-related information.
When new documents or entities are added or modified inside Datashare, you will need to update the Neo4j graph to reflect these changes.
To update the graph, you can simply re-run the full import:
docker compose exec \
datashare_web /entrypoint.sh \
--mode CLI \
--ext neo4j \
--full-import \
--project secret-project
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
To detect whether a graph update is needed, go to the 'Projects' page and open your project:
Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...
...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:
You can now explore your graph using your favorite visualization tool.



You need an internet connection to install Datashare.
You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models that detect these named entities are downloaded on first use. After that, you don't need an internet connection to find named entities.
You don't need an internet connection to:
Add documents to Datashare
Find named entities (except for the first time you use an NLP option - see above)
Search and explore documents
Download documents
This allows you to work safely on your documents. No third-party should be able to intercept your data and files while you're working offline on your computer.
👷♀️ This page is currently being written by the Datashare team.
👷♀️ This page is currently being written by the Datashare team.
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even if we take care of getting the best performance possible with Apache Tika and Tesseract OCR, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
datashare --mode CLI --stage SCAN --redisAddress redis://redis:6379 --busType REDIS
datashare --mode CLI --stage INDEX --redisAddress redis://redis:6379 --busType REDIS
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote Elasticsearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
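As a sketch, the command run on each indexing server could look like this (host names and the data directory are placeholders; every server points to the same remote Redis queue and Elasticsearch cluster, so the workers share the load):
datashare --mode CLI --stage INDEX --ocr true --queueType REDIS --queueName "datashare:queue" --redisAddress redis://my-redis-host:6379 --elasticsearchAddress http://my-elasticsearch-host:9200 --dataDir /path/to/documents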
Datashare offers --parallelism and --parserParallelism options to enhance processing speed.
Example (for g4dn.8xlarge with 32 CPUs):
datashare --mode CLI --stage INDEX --parallelism 14 --parserParallelism 14
datashare --mode CLI --stage NLP --parallelism 14 --nlpParallelism 14
Elasticsearch can consume significant CPU and memory, potentially becoming a bottleneck. For a production instance of Datashare, we recommend deploying Elasticsearch on a remote server to improve performance.
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 120 GB of memory):
JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX
If the document language is known, explicitly setting it can save processing time.
Use --language for general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
datashare --mode CLI --stage INDEX --ocr false
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
Example of splitting Outlook PST files in multiple .eml files with readpst:
readpst -reD <Filename>.pst
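If you have many PST files, here is a hedged sketch to split them all in one pass (paths are placeholders; -o sets readpst's output directory, and -r, -e and -D are the same flags as in the example above):
find /path/to/your/documents -type f -name '*.pst' | while read -r pst; do
  readpst -reD -o "$(dirname "$pst")" "$pst"
done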
Projects are collections of documents. Datashare displays statistics about each project.
Expand the menu to go to 'Projects' > 'All projects':
Search in projects' names using the search bar on the right:
Sort your projects by clicking the top right Settings icon:
In the Page settings, choose a sort by option, change the number of projects per page or the layout:
To explore a project, close the Settings and click on the name of the project:
You can now .
Filters are on the left of the search bar. You can contextualize, exclude and reset them. Active filters are displayed in the search breadcrumb.
Open 'Filters' on the left of the search bar:
'Indexing dates' are the dates when the documents were added to Datashare.
'Extraction levels' regard embedded documents:
The 'file on disk' is level zero
If a document is attached to (or contained in) a file on disk, its extraction level is '1st'
If a document is attached to (or contained in) a document itself contained in a file on disk, its extraction level is '2nd'
And so on
If you asked Datashare to 'Find entities' and the task is complete, you will see names of people, organizations, locations and e-mail addresses in the filters. These are the entities automatically detected by Datashare:
Tick the 'Exclude' checkbox to select all items except those selected.
In the search breadcrumb, you see that the excluded filters are strikethrough:
In most filters, tick 'Contextualize' to update the number of documents indicated in the filter so it reflects the results.
The filter will then only count documents matching your current selection:
To reset all filters at the same time, open the search breadcrumb:
Click 'Clear filters':
Search with the main search bar and configure settings to display your search's results.
You must have added documents in Datashare before. Check how for , and .
Expand the menu to go to 'Search' > 'Documents':
Make room by closing the menu:
Type terms in the search bar and press Enter:
If you type several terms separated by space, as the default operator is OR, Datashare will search for all documents containing at least one of the searched terms.
For instance, Datashare finds documents containing either 'ikea' or 'paris' or both terms here:
As you type a term, Datashare suggests linked entities - only if a task to find entities in this project has been completed.
Press Esc on your keyboard to close the dropdown or click on one of the entities to replace your term in the search bar:
Search within a specific field only, by using the dropdown 'All fields':
To see your queries in the search breadcrumb, click on the icon on the left of the search bar:
If you'd like to remove all searched terms from the search bar, click 'Clear query':
To change the page settings, click the Settings icon on the top right:
You can change Sort by, Documents per page, Layout and also Properties:
Ticking these properties will change which document's metadata are displayed in the results, in the document cards, in all 3 layouts (List, Grid, Table):
You can now make your search more precise .
To make your searches more precise, use operators in the main search bar.
To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).
"Alicia Martinez’s bank account in Portugal"
To have all documents mentioning at least one of the queried terms, you can use a simple space between your queries (as OR is the default operator in Datashare) or OR. You need to write OR with all letters uppercase.
Alicia Martinez
Alicia OR Martinez
To have all documents mentioning all the queried terms, you can use AND between your queried words. You need to write AND with all letters uppercase.
Alicia AND Martinez
+Alicia +Martinez
To have all documents NOT mentioning some queried terms, you can use NOT before each word you don't want. You need to write NOT with all letters uppercase.
NOT Martinez
!Martinez
-Martinez
Parentheses should be used whenever multiple operators are used together and you want to give priority to some.
((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"
You can also combine these with regular expressions (regex) written between two slashes.
If you search faithf?l, the search engine will look for all words with any possible single character between the second f and the l. It also works with * to replace multiple characters.
Alicia Martin?z
Alicia Mar*z
You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: ).
quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: ).
"fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase quick fox would be considered more relevant than quick brown fox (source: ).
Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:
quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
(source: )
"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." ().
1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between 2 slashes.
/.*..*@.*..*/
The example above will search for any expression which is structured like an email address with a dot between two expressions before the @ and a dot between two expressions after the @ like in 'first.lastname@email.com' for instance.
2. Regex can be combined with standard queries in Datashare:
("Ada Lovelace" OR "Ado Lavelace") AND paris AND /.*..*@.*..*/
3. You need to escape the following characters by typing a backslash just before them (without space): # @ & < > ~
/.*..*\@.*..*/ (the @ was escaped by a backslash \ just before it)
4. Important: Datashare relies on Elasticsearch's regex syntax. A consequence of this is that spaces cannot be searched as such in regex.
We encourage you to use the AND operator to work around this limitation and make sure you can run your search.
If you're looking for a French International Bank Account Number (IBAN), which may or may not contain spaces and contains FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:
/FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)
Here are a few examples of useful Regex:
You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.
You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with either y nor u.
You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.
Other common Regex examples:
phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/
emails: /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+/
credit cards: /(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35[0-9]{3})[0-9]{11})/
You can find many other examples online. More generally, if you use a regex found on the internet, beware that the syntax is not necessarily compatible with Elasticsearch's. For example, \d, \S and the like are not supported.
In 'Search' > 'Documents', open a document and go to the 'Metadata' tab:
Click a metadata's search icon to search documents with same properties:
See the query in the main search bar. It contains the field name, a colon and the searched value:
So for example, if you are looking for documents that:
Contains term1, term2 and term3
And were created after 2010
you can use the 'Date' filter or type in the search bar:
term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01
Explanations:
'metadata.tika_metadata_creation_date:' means that we filter by creation date
'>=' means 'since January 1st included'
'2010-01-01' stands for January 1st, 2010, and the search will include that date
For other searches:
'<' will mean 'strictly before (with January 1st excluded)'
no character will mean 'at this exact date'
Ranges: You can also search for numbers in a range. Ranges can be specified for date, numeric or string fields among the ones you can find by clicking the magnifying glass when you hover the fields in a document's 'Metadata' tab. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to the Elasticsearch documentation.
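For instance, using the creation date field from the example above (a sketch; adjust the field name to your documents), the two bracket styles look like this:
metadata.tika_metadata_creation_date:[2010-01-01 TO 2010-12-31] (documents created in 2010, both dates included)
metadata.tika_metadata_creation_date:{2010-01-01 TO 2011-01-01} (documents created strictly after January 1st, 2010 and strictly before January 1st, 2011)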
OAuth2 authentication with a third-party id service
Explore the document's data through different tabs.
Open a document in 'Search' > 'Documents' > one document and click the icon with in and out arrows (this applies to the List layout while in Grid and Table layout, the documents always open in full-screen view):
You now see the document in full screen view and can go to the next document in your results by using the pagination carousel on the top of the screen:
Open a document in 'Search' > 'Documents' > one document
Stay on the first tab called 'Text'. This tab shows the text as extracted from your document by Datashare.
Click on the search bar or press Command (⌘) / Control + F
Type the terms you're searching for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
Go to the 'View' tab to see the original document.
Note: this visualization of the document is available only for some file types: images, PDF, CSV, xlsx and tiff but not other file types like Word documents or e-mails for instance.
Go to the 'Metadata' tab and click on 'X documents in the same folder' or 'Y children documents':
You see the list of documents. To open all the documents in the same folder or all the children documents, click 'Search all' below. There is no 'Search all' button if there are no documents, as for the children documents below:
Go to the 'Metadata' tab to explore all the properties of the document:
If a metadata is interesting to you and you'd like to know if other documents in your project share the same metadata, click the search icon:
You can also copy or pin a metadata.
In the 'Entities' tab, only if you previously ran tasks to find entities in Datashare, you can read the names of people, organizations, locations and e-mail addresses, along with the number of their occurrences in the document:
Hover over an entity to see a popover; click the arrows to browse all its mentions in context in the document:
Go to the 'Info' tab to check how the entity was extracted:
This page explains how to leverage Neo4j to explore your Datashare projects.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. To download a specific version, click on 'All platforms and versions' .
If you are not familiar with graph and Neo4j, take a look at the following resources:
Find out
Learn
Check out
Neo4j is a graph database technology which lets you represent your data as a graph.
Inside Datashare, Neo4j lets you connect entities to one another through the documents in which they appear.
After creating a graph from your Datashare project, you will be able to explore it and visualize these kinds of relationships between your project's entities:
In the above graph, we can see 3 e-mail document nodes in orange, 3 e-mail address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:
shapp@caiso.com emailed 20participants@caiso.com, the sent email has an ID starting with f4db344...
One person named vincent is mentioned inside this email, as well as the california location
Finally, the e-mail also mentions the dle@caiso.com e-mail address which is also mentioned in 2 other e-mail documents (with ID starting with 11df197... and 033b4a2...)
The Neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.
The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...
In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of e-mail documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the e-mail metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the e-mail.
When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the e-mail document body.
When a document is embedded inside another document (as an e-mail attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.
The creation of a Neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:
When using Datashare
When Datashare is running
After the graph is created, open the menu, go to the 'Projects' page, select your project and go to the Graph tab.
You should be able to visualize a new Neo4j widget displaying the number of documents and entities found inside the graph:
Depending on your access to the Neo4j database behind Datashare, you might need to export the Neo4j graph and import it locally to access it from .
Exporting and importing the graph into your own database is also useful when you want to perform write operations on your graph without any consequences on Datashare.
If you have read access to the Neo4j database (it should be the case if you are running Datashare on your computer), you will be able to plug to it and start exploring.
If you can't have read access to the database, you will need to export it and import it into your own Neo4j instance (running on your laptop for instance).
If possible, ask your system administrator for a DB dump obtained with the neo4j-admin dump tool.
In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare. Be aware that limits might apply to the size of the exported graph.
To export the graph, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Cypher shell' export format and at the end of the form, click the 'Export' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Filters' 'Paths' and 'File types'.
DB import
Depending on , use one of the following ways to import your graph into your DB:
Docker
Identify your Neo4j instance container ID:
Copy the graph dump into your Neo4j container's import directory:
Import the dumped file using the command:
Neo4j Desktop import
Open 'Cypher shell':
Copy the graph dump into your Neo4j instance's import directory:
Import the dumped file using the command:
You will now be able to explore the graph imported in your own Neo4j instance.
Once your graph is created and you can access it (see if you can't access the Datashare's Neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.
Once you , you can use different tools to visualize and explore it. You can start by connecting the to your DB.
Neo4j Bloom is a simple and powerful tool developed by Neo4j to quickly visualize and query graphs, if you run Neo4j Enterprise Edition. Bloom lets you navigate and explore the graph through a user interface similar to the one below:
Neo4j Bloom is accessible from inside the Neo4j Desktop app.
Find out more information about how to use Neo4j Bloom to explore your graph with:
Bloom's
Bloom's
about graph exploration with Bloom
The Neo4j Browser lets you run queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs; running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:
The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:
Inside the Neo4j Desktop app when running Neo4j from the
At when running Neo4j
Linkurious is proprietary software which, similarly to Neo4j Bloom, lets you visualize and query your graph through a powerful UI.
Find out more information about Linkurious:
Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare in the GraphML format and import them into Gephi.
Find out more information about:
How to
Gephi
How to with Gephi
To export the graph in the GraphML format, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2, called 'Format', select the 'Graph ML' export format and at the end of the form, click the 'Export' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Filters' 'Paths' and 'File types'.
You will now be able to explore your graph with Gephi by opening the exported GraphML file in it.
A project is a collection of documents. Datashare displays statistics about each project.
Expand the menu, open 'All projects' and click on the name of the project that you want to explore:
If you'd like to pin this project in the menu for an easy access, click 'Pin to menu':
Your project is now pinned in the menu:
In a project page, in the first tab called 'Insights', you find statistics and a bar chart displaying the number of documents by creation date.
Filter this chart by path by clicking 'Select path':
Click on one bar for a year or month to see all the corresponding documents:
On the 'Languages', 'File Types' and 'Authors' widgets, you see stats:
Search all documents matching a specific criterion, for instance here the French language:
Finally, in the server collaborative mode, you see the Latest recommended documents, that is to say the documents marked as recommended by other members of the project:
You can now .
Batch searches allow you to get the results of each query of a list all at once: instead of searching each query one by one, upload a list, set options and filters, and see the matching documents.
Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)
Write your queries in the first column of the spreadsheet, typing one query per line:
Do not put line break(s) in any of your cells.
To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them by nothing or a space.
Write 2 characters minimum in each query. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to a 'failure'.
If you have blank cells in your spreadsheet...
...the CSV, which stands for 'Comma-separated values', will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:
To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:
Remove all commas in your spreadsheet if you want to avoid exact phrase search.
Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:
Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)
Use operators in your CSV: AND NOT * ? ! + - and other operators do work in batch searches as they do in the regular search bar but only if "Do phrase match" at step 3 is turned off. You can thus turn it off and write your queries like this for instance:
Paris NOT Barcelona AND Taipei
Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.
Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
Export your spreadsheet of queries in a CSV format:
Important: Use the UTF-8 encoding in your spreadsheet software's settings (you can verify it as shown after the list below).
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Microsoft Excel: if it is not set by default, select "CSV UTF-8" as the format when saving.
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
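If you want to double-check the encoding before uploading, here is a hedged sketch from a terminal (the file name queries.csv is a placeholder):
file -i queries.csv    # on Linux (use file -I on macOS); it should report charset=utf-8 or us-ascii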
Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:
Alternatively, in the menu next to 'Batch searches', click the 'Plus' button :
The form to create a batch search opens:
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query. In that case, it won't apply any operators (AND, OR, etc.) that would be in the queries. If 'Do phrase matches' is off, queries are searched without double quotes and with potential operators.
What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: ).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
Once you have filled in all the steps, click 'Create' and wait for the batch search to complete.
In the menu, click 'Batch searches' and click the name of the batch search to open it:
See the number of matching documents per query:
Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). Sorting by query position puts the queries back in the original order of your CSV.
To explore a query's matching documents, click its name and see the list of matching documents:
Click a document's name to open it. Use the page settings or the column's names to sort documents.
If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.
The relaunched batch search will apply to newly indexed documents and previously indexed documents (not only the newly indexed ones).
In 'Batch searches', go at the end of the table and click the 'Relaunch' icon:
Or click 'Relaunch' in the batch search page below its name on the right panel:
Change its name, description and decide to delete current batch search after relaunch or not:
See your relaunched batch search in the list of batch searches:
Failures in batch searches can be due to several causes.
Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:
Check the first failure-generating query in the error window:
Here it says:
The first line contained a comma while it shouldn't. Datashare interpreted this query as a syntax error; it thus failed and the batch search stopped.
Check .
We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software and re-creating the batch search.
If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't reach Elasticsearch, its search engine.
In that case, you need to re-open Datashare: check how for , or .
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
One of your queries can lead to a 'Data too large' error.
It means that this query returned too many results, or that some of the matching documents were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the 'Data too large' error and re-starting your batch search without it.
Star documents, tag them or, in server mode, recommend them to the project's other members.
Click the star icon either at the right of the document's card or at the top right of the document:
Click on the same icons to unstar.
Open the selection mode by clicking the multiple cards icon on the left of the pagination:
Select the documents you want to star:
Click the star filled icon:
To unstar documents, click the three-dot icon if necessary and click Unstar:
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Starred' and tick the 'Starred' checkbox:
Open a document in 'Search' > 'Documents' and, above the document's name, click the hashtag icon:
It opens the Tags panel on the left:
Type your tag and press Enter or click 'Add':
Your tag is now displayed in the 'Added by you' category:
Remove your tag, or others' tags, by clicking their cross icon:
Open the selection mode by clicking the multiple cards icon on the left of the pagination:
Select the documents you want to tag:
Click the three-dot icon if necessary and click 'Tag':
Type your tag, or type multiple tags by separating them with commas, and click 'Add':
Remove your tag, or others' tags, by clicking their cross icon on each single document (you cannot untag multiple documents):
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Tags' and tick the 'Tag' checkboxes for tagged documents you want to filter:
Open a document in 'Search' > 'Documents' and, above the document's name, click the eyes icon:
It opens the Recommendations panel on the left:
Click on the 'Mark as recommended' button:
The document is now marked as recommended by you:
Click 'Unmark as recommended' to unmark it as recommended.
Open the filters by clicking the 'Filters' button on the left of the search bar:
In the 'User data' category, open 'Recommended by' and tick the 'Username' checkboxes for documents recommended by the users you want to filter:
In local mode, you cannot remove a single document or a selection of documents from Datashare. But you can remove all your projects and documents from Datashare.
Open the menu and on the bottom of the menu, click the trash icon:
A confirmation window opens. The action cannot be undone. It removes all the projects and their documents from Datashare. Click 'Yes' if you are sure:
For advanced users - if you'd like to do it with the Terminal, here are the instructions:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
Datashare was created with scalability in mind which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).











































docker ps | grep neo4j # Should display your running neo4j container ID
docker cp \
<export-path> \
<neo4j-container-id>:/var/lib/neo4j/imports/datashare-graph.dump
docker exec -it <neo4j-container-id> /bin/bash
./bin/cypher-shell -f imports/datashare-graph.dump
cp <export-path> imports
./bin/cypher-shell -f imports/datashare-graph.dump














Unexpected char 106 at (line no=1, column no=81, offset=80)



















































In Datashare, for technical reasons, it is not possible to open results beyond the 10,000th document.
Example: you search for "Paris" and you get 15,634 results. You will be able to see the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Refine your search: use filters to narrow down your results and ensure you have less than 10,000 matching documents
Change the sorting of your results: use 'creation date' or 'alphabetical order' for instance, instead of the sorting by default which corresponds to a relevance scoring
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
You can send an email to datashare@icij.org.
When reporting a bug, please share:
Your OS (Mac, Windows or Linux) and version
The problem, with screenshots
The actions that led to the problem
Or you can post an issue with your logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues
docker run -ti icij/datashare:version --mode SERVER \
--oauthClientId 30045255030c6740ce4c95c \
--oauthClientSecret 10af3d46399a8143179271e6b726aaf63f20604092106 \
--oauthAuthorizeUrl https://my.oauth-server.org/oauth/authorize \
--oauthTokenUrl https://my.oauth-server.org/oauth/token \
--oauthApiUrl https://my.oauth-server.org/api/v1/me.json \
--oauthCallbackPath /auth/callback
Yes, you can download a document from Datashare.
Open the menu > 'Search' > 'Documents' and click on the download icon on the right of documents' cards:
...or on the top right of an opened document:
You can also batch download all the documents that match a search. It is limited to 100 MB.
Open the menu > 'Search' > 'Documents', make queries and apply filter. Once all the results of a specific search are relevant to you, click on the download icon on the right of results:
Find your batch downloads as zip files in the menu > 'Tasks' > 'Batch downloads':
Click on a batch download's name to download it:
If you can't download a document, it means that:
either Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs (see the example after this list) and create an issue on GitHub.
or you are using the server collaborative mode and the admins prevented users from downloading documents
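For instance, with the Docker Compose setup used elsewhere in this documentation, here is a hedged sketch to capture the logs into a file you can attach to the issue (the service name datashare_web is an assumption; check yours with docker compose ps):
docker compose logs datashare_web > datashare.log 2>&1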





You can use Datashare with multiple users accessing a centralized database on a server.
Warning: to put the server mode in place and to maintain it requires some technical knowledge.
Please find the documentation here.
Warning: this requires some technical knowledge.
You can make Datashare follow soft links: add --followSymlinks when Datashare is launched.
If you're on Mac or Windows, you must change the launch script.
If you're on Linux, you can add the option after the Datashare command.
Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:
Please find all the use cases in Datashare Tarentula's GitHub documentation.
In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It is the same as explained above, it will apply to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
👷♀️ This page is currently being written by the Datashare team.
Pipelines of Natural Language Processing are tools that automatically identify entities in your documents. You can only choose one model at a time for one entity detection task.
Open the menu > 'Tasks' > 'Entities' and follow these instructions. Select 'CoreNLP' if you want to use the model with the highest probability of working in most documents.
In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)
the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than quick brown fox (source: Elastic).
When you run a batch search, if you turn 'Do phrase matches' on, you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)
the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators: read the list of syntax errors by clicking here.
You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors. Check how to create a batch search.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries, as some errors can remain after correcting the first one.
Example of a syntax error message:
SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't reach Elasticsearch, its search engine.
In that case, you need to re-start Datashare: check how for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
Datashare can display 'View' for some file types only: images, PDF, CSV, xlsx and tiff. Other document types are not supported yet.
Shortcuts help do some actions faster.
Open the menu > 'Search' > 'Documents' and click the keyboard icon at the bottom of the menu:
It opens a window with the shortcuts for your OS (Mac, Windows, Linux):
Click on 'See all shortcuts' to reach the full page view:
1. Go to Applications
2. Right-click 'Datashare' and click 'Move to Bin'
Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98
Use the following command:
sudo apt remove datashare-dist
This can be due to some syntax errors in the way you wrote your query.
Here are the most common errors that you should correct:
You cannot start a query with AND all uppercase. AND is reserved as a search operator.
You cannot start a query with OR all uppercase. OR is reserved as a search operator.
You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrase.
You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.
You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).
You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.
You cannot start a query with caret (^) or write one which contains caret. Caret is reserved as a boosting operator.
You cannot use square brackets except for searching for ranges.
If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, as no document might correspond to your current search. You can check in your URL if you see active filters. If you're comfortable with the possibility of losing your previously selected filters, open the menu > 'Search' > 'Documents', open the search breadcrumb on the left of the search bar and click 'Clear filters'.
In 'Tasks' > 'Documents', in the Progress column, if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you added, it can take multiple hours.



















[Screenshot: Datashare's search page with '[ikea]' in the search bar and the message 'We were unable to perform your search. This might be due to a server error or a syntax error in your query']






The Datashare API is fully defined using the OpenAPI 3.0 specification and automatically generated after every Datashare release.
The OpenAPI spec is a language-agnostic, machine-readable document that describes all of the API’s endpoints, parameter and response schemas, security schemes, and metadata. It empowers developers to discover available operations, validate requests and responses, generate client libraries, and power interactive documentation tools.
You can download the latest version of the API definition in JSON or explore an instantly browsable, developer-friendly interface with Redoc.
What if you want to add features to Datashare backend?
Unlike plugins, which provide a way to modify the Datashare frontend, extensions have been created to extend backend functionality. Two extension points have been defined:
NLP pipelines: you can add a new Java NLP pipeline to Datashare
HTTP API: you can add HTTP endpoints to Datashare and call the Java API you need in those endpoints
Since version 7.5.0, instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user could pick the extensions they need or want, and have a fully customized installation of our search platform.
When starting, Datashare can receive an extensionsDir option, pointing to your extensions' directory. In this example, let's call it /home/user/extensions:
mkdir /home/user/extensions
datashare --extensionsDir=/home/user/extensions
You can list official Datashare extensions like this:
$ datashare -m CLI --extensionList
2020-08-29 09:27:51,219 [main] INFO Main - Running datashare
extension datashare-extension-nlp-opennlp
OPENNLP Pipeline
7.0.0
https://github.com/ICIJ/datashare-extension-nlp-opennlp/releases/download/7.0.0/datashare-nlp-opennlp-7.0.0-jar-with-dependencies.jar
Extension to extract NER entities with OPENNLP
NLP
...
You can add a regular expression to --extensionList to filter the extension list if you know what you are looking for.
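For instance, here is a hedged sketch (assuming the pattern is passed as the value of --extensionList) to only list NLP-related extensions:
$ datashare -m CLI --extensionList ".*nlp.*"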
You can install an extension by providing its id and the directory where the Datashare extensions are stored:
$ datashare -m CLI --extensionInstall datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions"
2020-08-29 09:34:30,927 [main] INFO Main - Running datashare
2020-08-29 09:34:32,632 [main] INFO Extension - downloading from url https://github.com/ICIJ/datashare-extension-nlp-mitie/releases/download/7.0.0/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
2020-08-29 09:34:36,324 [main] INFO Extension - installing extension from file /tmp/tmp218535941624710718.jar into /home/user/extensions
Then if you launch Datashare with the same extension location, the extension will be loaded.
When you want to stop using an extension, you can either remove the jar by hand from the extensions folder or remove it with datashare --extensionDelete:
$ datashare -m CLI --extensionDelete datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions/"
2020-08-29 09:40:11,033 [main] INFO Main - Running datashare
2020-08-29 09:40:11,249 [main] INFO Extension - removing extension datashare-extension-nlp-mitie jar /home/user/extensions/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
You can create a "simple" Java project like https://github.com/ICIJ/datashare-extension-nlp-opennlp (as simple as a Java project can be, right?), with your preferred build tool.
You will have to add a dependency on the latest version of datashare-api.jar to be able to implement your NLP pipeline.
With the datashare-api dependency you can then create a class implementing Pipeline or extending AbstractPipeline. When Datashare loads the jar, it will look for the Pipeline interface.
Unfortunately, you'll also have to make a pull request to datashare-api to add a new type of pipeline. We will remove this step in the future.
Build the jar with its dependencies and install it in /home/user/extensions, then start Datashare with extensionsDir set to /home/user/extensions. Your extension will be loaded by Datashare.
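For instance, here is a hedged sketch assuming a Maven project configured to build a jar with its dependencies (the jar name my-nlp-pipeline is hypothetical):
$ mvn package
$ cp target/my-nlp-pipeline-jar-with-dependencies.jar /home/user/extensions/
$ datashare --extensionsDir=/home/user/extensions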
Finally, your pipeline will be listed in the available pipelines in the UI, when doing NER.
Making an HTTP extension is similar to making an NLP one: you'll have to make a Java project that builds a jar. The only dependency that you will need is fluent-http, because Datashare will look for fluent-http annotations @Get, @Post, @Put...
For example, we can create a small class like:
package org.myorg;
import net.codestory.http.annotations.Get;
import net.codestory.http.annotations.Prefix;
@Prefix("myorg")
public class FooResource {
@Get("foo")
public String getFoo() {
return "hello from foo extension";
}
}
Build the jar, copy it to /home/user/extensions, then start Datashare:
$ datashare --extensionsDir /home/user/extensions/
# ... starting logs
2020-08-29 11:03:59,776 [Thread-0] INFO ExtensionLoader - loading jar /home/user/extensions/my-extension.jar
2020-08-29 11:03:59,779 [Thread-0] INFO CorsFilter - adding Cross-Origin Request filter allows *
2020-08-29 11:04:00,314 [Thread-0] INFO Fluent - Production mode
2020-08-29 11:04:00,331 [Thread-0] INFO Fluent - Server started on port 8080
Et voilà 🔮! You can query your new endpoint. Easy, right?
$ curl localhost:8080/myorg/foo
hello from foo extension
You can also install and remove extensions with the Datashare CLI.
Then you can install it with:
$ datashare -m CLI --extensionInstall /home/user/src/my-extension/dist/my-extension.jar --extensionsDir "/home/user/extensions"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
2020-07-27 10:02:32,596 [main] INFO ExtensionService - installing extension from file /home/user/src/my-extension/dist/my-extension.jar into /home/user/extensions
And remove it:
$ datashare -m CLI --extensionDelete my-extension.jar --extensionsDir "/home/user/extensions"
2020-08-29 10:45:37,363 [main] INFO Main - Running datashare
2020-08-29 10:45:37,579 [main] INFO Extension - removing extension my-extension jar /home/user/extensions/my-extension.jar
👷♀️ This page is currently being written by the Datashare team.
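The tables below describe Datashare's database schema. If you use PostgreSQL as the back-end database, you can inspect any of them yourself with psql, for example:
$ psql datashare
datashare=> \d api_key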
api_keyid
character varying(96)
not null
user_id
character varying(96)
not null
creation_date
timestamp without time zone
not null
api_key_pkey PRIMARY KEY, btree (id)
api_key_user_id_key UNIQUE CONSTRAINT, btree (user_id)
Table batch_search
  uuid                        character(36)                  not null
  name                        character varying(255)
  description                 character varying(4096)
  user_id                     character varying(96)          not null
  batch_date                  timestamp without time zone    not null
  state                       character varying(8)           not null
  published                   integer                        not null   default 0
  phrase_matches              integer                        not null   default 0
  fuzziness                   integer                        not null   default 0
  file_types                  text
  paths                       text
  error_message               text
  batch_results               integer                                   default 0
  error_query                 text
  query_template              text
  nb_queries                  integer                                   default 0
  uri                         text
  nb_queries_without_results  integer
Indexes:
  batch_search_pkey PRIMARY KEY, btree (uuid)
  batch_search_date btree (batch_date)
  batch_search_nb_queries btree (nb_queries)
  batch_search_published btree (published)
  batch_search_user_id btree (user_id)
Referenced by:
  TABLE batch_search_project CONSTRAINT batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)
Table batch_search_project
  search_uuid  character(36)           not null
  prj_id       character varying(96)   not null
Indexes:
  batch_search_project_unique UNIQUE, btree (search_uuid, prj_id)
Foreign-key constraints:
  batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)
Table batch_search_query
  search_uuid    character(36)   not null
  query_number   integer         not null
  query          text            not null
  query_results  integer                    default 0
Indexes:
  batch_search_query_search_id btree (search_uuid)
  idx_query_result_batch_unique UNIQUE, btree (search_uuid, query)
Table batch_search_result
  search_uuid     character(36)                  not null
  query           text                           not null
  doc_nb          integer                        not null
  doc_id          character varying(96)          not null
  root_id         character varying(96)          not null
  doc_path        character varying(4096)        not null
  creation_date   timestamp without time zone
  content_type    character varying(255)
  content_length  bigint
  prj_id          character varying(96)
Indexes:
  batch_search_result_prj_id btree (prj_id)
  batch_search_result_query btree (query)
  batch_search_result_uuid btree (search_uuid)
Table document
  id                character varying(96)          not null
  path              character varying(4096)        not null
  project_id        character varying(96)          not null
  content           text
  metadata          text
  status            smallint
  extraction_level  smallint
  language          character(2)
  extraction_date   timestamp without time zone
  parent_id         character varying(96)
  root_id           character varying(96)
  content_type      character varying(256)
  content_length    bigint
  charset           character varying(32)
  ner_mask          smallint
Indexes:
  document_pkey PRIMARY KEY, btree (id)
  document_parent_id btree (parent_id)
  document_status btree (status)
Table document_tag
  doc_id         character varying(96)          not null
  label          character varying(64)          not null
  prj_id         character varying(96)
  user_id        character varying(255)
  creation_date  timestamp without time zone    not null   default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
  document_tag_doc_id btree (doc_id)
  document_tag_label btree (label)
  document_tag_project_id btree (prj_id)
  idx_document_tag_unique UNIQUE, btree (doc_id, label)
Table document_user_recommendation
  doc_id         character varying(96)          not null
  user_id        character varying(96)          not null
  prj_id         character varying(96)
  creation_date  timestamp without time zone               default now()
Indexes:
  document_user_mark_read_doc_id btree (doc_id)
  document_user_mark_read_project_id btree (prj_id)
  document_user_mark_read_user_id btree (user_id)
  idx_document_mark_read_unique UNIQUE, btree (doc_id, user_id, prj_id)
Table document_user_star
  doc_id   character varying(96)   not null
  user_id  character varying(96)   not null
  prj_id   character varying(96)
Indexes:
  document_user_star_doc_id btree (doc_id)
  document_user_star_project_id btree (prj_id)
  document_user_star_user_id btree (user_id)
  idx_document_star_unique UNIQUE, btree (doc_id, user_id, prj_id)
Table named_entity
  id                  character varying(96)   not null
  mention             text                    not null
  offsets             text                    not null
  extractor           smallint                not null
  category            character varying(8)
  doc_id              character varying(96)   not null
  root_id             character varying(96)
  extractor_language  character(2)
  hidden              boolean
Indexes:
  named_entity_pkey PRIMARY KEY, btree (id)
  named_entity_doc_id btree (doc_id)
Table note
  project_id            character varying(96)     not null
  path                  character varying(4096)
  note                  text
  variant               character varying(16)
  blur_sensitive_media  boolean                   not null   default false
Indexes:
  idx_unique_note_path_project UNIQUE, btree (project_id, path)
  note_project btree (project_id)
Table project
  id               character varying(255)         not null
  path             character varying(4096)
  allow_from_mask  character varying(64)
  label            character varying(255)
  publisher_name   character varying(255)                    default ''::character varying
  maintainer_name  character varying(255)                    default ''::character varying
  source_url       character varying(2048)                   default ''::character varying
  logo_url         character varying(2048)                   default ''::character varying
  creation_date    timestamp without time zone               default now()
  update_date      timestamp without time zone               default now()
  description      character varying(4096)                   default ''::character varying
Indexes:
  project_pkey PRIMARY KEY, btree (id)
Table task
  id            character varying(96)          not null
  name          character varying(128)         not null
  state         character varying(16)          not null
  user_id       character varying(96)
  group_id      character varying(128)
  progress      double precision                          default 0
  created_at    timestamp without time zone    not null
  completed_at  timestamp without time zone
  retries_left  integer
  max_retries   integer
  args          text
  result        text
  error         text
Indexes:
  task_pkey PRIMARY KEY, btree (id)
  task_created_at btree (created_at)
  task_group btree (group_id)
  task_name btree (name)
  task_state btree (state)
  task_user_id btree (user_id)
Table user_history
  id                 integer                        not null   generated by default as identity
  creation_date      timestamp without time zone    not null
  modification_date  timestamp without time zone    not null
  user_id            character varying(96)          not null
  type               smallint                       not null
  name               text
  uri                text                           not null
Indexes:
  user_history_pkey PRIMARY KEY, btree (id)
  idx_user_history_unique UNIQUE, btree (user_id, uri)
  user_history_creation_date btree (creation_date)
  user_history_type btree (type)
  user_history_user_id btree (user_id)
Referenced by:
  TABLE user_history_project CONSTRAINT user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)
Table user_history_project
  user_history_id  integer                 not null
  prj_id           character varying(96)   not null
Indexes:
  user_history_project_unique UNIQUE, btree (user_history_id, prj_id)
Foreign-key constraints:
  user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)
Table user_inventory
  id        character varying(96)    not null
  email     text
  name      character varying(255)
  provider  character varying(255)
  details   text                                default '{}'::text
Indexes:
  user_inventory_pkey PRIMARY KEY, btree (id)
What if you want to integrate text translations into Datashare's interface? Or make it display tweets scraped with Twint? Ask no more: there are plugins for that!
Since version 5.6.1, instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform.
When starting, Datashare can receive a pluginsDir option, pointing to your plugins' directory. In this example, this directory is called ~/Datashare Plugins:
mkdir ~/Datashare\ Plugins
datashare --pluginsDir=~/Datashare\ Plugins
You can list official Datashare plugins like this:
$ datashare -m CLI --pluginList ".*"
2020-07-24 10:04:59,767 [main] INFO Main - Running datashare
plugin datashare-plugin-site-alert
Site Alert
v1.2.0
https://github.com/ICIJ/datashare-plugin-site-alert
A plugin to display an alert banner on the Datashare demo instance.
...
The string given to --pluginList is a regular expression. You can filter the plugin list if you know what you are looking for.
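For instance, to narrow the listing down to plugins whose ids start with datashare-plugin-, you could run something like this (the regular expression is only an illustration):
$ datashare -m CLI --pluginList "datashare-plugin-.*"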
You can install a plugin by giving its id and the directory where the Datashare plugins are stored:
$ datashare -m CLI --pluginInstall datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:15:46,732 [main] INFO Main - Running datashare
2020-07-24 10:15:50,202 [main] INFO PluginService - downloading from url https://github.com/ICIJ/datashare-plugin-site-alert/archive/v1.2.0.tar.gz
2020-07-24 10:15:50,503 [main] INFO PluginService - installing plugin from file /tmp/tmp7747128158158548092.gz into /home/dev/Datashare Plugins
Then if you launch Datashare with the same plugin location, the plugin will be loaded.
When you want to stop using a plugin, you can either remove its directory from the plugins folder by hand or remove it with datashare --pluginDelete:
$ datashare -m CLI --pluginDelete datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:20:43,431 [main] INFO Main - Running datashare
2020-07-24 10:20:43,640 [main] INFO PluginService - removing plugin base directory /home/dev/Datashare Plugins/datashare-plugin-site-alert-1.2.0
To inject plugins, Datashare will look for Node-compatible modules in ~/Datashare Plugins. This way we can rely on NPM/Yarn to handle built packages. As described in the NPM documentation, a module can be:
* A folder with a package.json file containing a "main" field.
* A folder with an index.js file in it.
Datashare will read the content of each module in the plugins directory to automatically inject them into the user interface. The backend serves the plugin files, and the entrypoint of each plugin (usually the main property of package.json) is injected with a <script> tag, right before the closing </body> tag.
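As an illustration, a minimal plugin module could ship a package.json like the one below. The plugin name and entrypoint file are hypothetical, but they match the my-plugin layout used in the tarball example further down:
{
  "name": "my-plugin",
  "version": "1.0.0",
  "main": "main.js"
}
With such a layout, Datashare serves the module and injects main.js into the interface.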
Create a hello-world directory with a single index.js:
mkdir ~/Datashare\ Plugins/hello-world
echo "console.log('Welcome to %s', datashare.config.get('app.name'))" > ~/Datashare\ Plugins/hello-world/index.jsReload the page, open the console: et voilà 🔮! Easy, right?
Now you might want to develop your plugin in its own repository, and not necessarily in the Datashare Plugins folder.
You can keep your code under, say, ~/src/my-plugin and deploy it into Datashare with the CLI. To do so, you'll need to make a zip or a tarball, for example ~/src/my-plugin/dist/my-plugin.tgz.
The tarball could contain:
$ tar tvzf ~/src/my-plugin/dist/my-plugin.tgz
drwxr-xr-x dev/dev 0 2020-07-22 11:51 my-plugin/
-rw-r--r-- dev/dev 31 2020-07-21 14:07 my-plugin/main.js
-rw-r--r-- dev/dev 19 2020-07-21 14:07 my-plugin/package.json
Then you can install it with:
$ datashare -m CLI --pluginInstall ~/src/my-plugin/dist/my-plugin.tgz --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
2020-07-27 10:02:32,596 [main] INFO PluginService - installing plugin from file ~/src/my-plugin/dist/my-plugin.tgz into ~/Datashare Plugins
And remove it:
$ datashare -m CLI --pluginDelete my-plugin --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO Main - Running datashare
In that case my-plugin is the base directory of the plugin (the one that is in the tarball).
To allow external developers to add their own components, we added markers at strategic locations of the user interface where a user can register new Vue components. These markers are called "hooks". To see where the hooks are placed, you can activate the hooks debug mode with datashare.config.set('hooksDebug', true).
To register a new component to a hook, use the following method:
// `datashare` is a global variable
datashare.registerHook({ target: 'app-sidebar.menu:before', definition: 'This is a message written with a plugin' })
Or with a more complex example:
// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', ({ detail }) => {
// Alert is a Vue component meaning it can have computed properties, methods, etc...
const Alert = {
computed: {
weekday () {
const today = new Date()
return today.toLocaleDateString('en-US', { weekday: 'long' })
}
},
template: `<div class="text-center bg-info p-2 width-100">
It's {{ weekday }}, have a lovely day!
</div>`
}
// This is the most important part of this snippet:
// we register the component on a given `target`
// using the core method `registerHook`.
detail.core.registerHook({ target: 'landing.form:before', definition: Alert })
})
Datashare Playground delivers a collection of Bash scripts (free of external dependencies) that streamline interaction with a Datashare instance's Elasticsearch index and Redis queue.
From cloning or replacing whole indices and reindexing specific directories, to adjusting replica settings, monitoring or cancelling long-running tasks, and queuing files for processing, Playground implements each capability through intuitive shell scripts organized under the elasticsearch/ and redis/ directories.
To get started, set ELASTICSEARCH_URL and REDIS_URL in your environment (or add them to a .env file at the repository root). For a comprehensive guide to script options, directory layout, and example workflows, see the full documentation on GitHub.
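For example, a .env file for a local setup might contain the following (the hostnames and ports below are assumptions based on a default local installation; adjust them to your deployment):
ELASTICSEARCH_URL=http://localhost:9200
REDIS_URL=redis://localhost:6379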
Some Datashare updates bring fixes and improvements to the index. When that happens, the index has to be reindexed accordingly.
1. Create a temporary empty index and specify the desired Datashare version number:
./elasticsearch/index/create.sh <temporary_index> <ds_version_number>
2. Reindex all documents (under the "/" path) from the original index into the temporary one:
This step can take some time if your index has plenty of documents.
./elasticsearch/documents/reindex.sh <original_index> <temporary_index> /
3. Replace the old index with the new one:
./elasticsearch/index/replace.sh <temporary_index> <original_index>
4. Delete the temporary index:
./elasticsearch/index/delete.sh <temporary_index>
Datashare Tarentula is a powerful command-line toolbelt designed to streamline bulk operations against any Datashare instance.
Whether you need to count indexed files, download large datasets, batch-tag records, or run complex Elasticsearch aggregations, Tarentula provides a consistent, scriptable interface with flexible query support and Docker compatibility.
It also exposes a Python API for embedding automated workflows directly into your data pipelines.
With commands like count, download, aggregate, and tagging-by-query, you can handle millions of records in a single invocation, or integrate Tarentula into CI/CD pipelines for reproducible data tasks.
You can install Tarentula with your favorite package manager:
pip3 install --user tarentula
Or alternatively with Docker:
docker run icij/datashare-tarentula
For the complete list of commands, options, and examples, read the documentation on GitHub.
Datashare's frontend is built with Vue 3 and Bootstrap 5. We document all components of the interface in a dedicated Storybook.
To facilitate the creation of plugins, each component can be imported directly from the core:
// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', async () => {
// This loads the ButtonIcon component asynchronously
const ButtonIcon = await datashare.findComponent('Button/ButtonIcon')
// Then we create a dummy component. For the sake of simplicity we use
// Vue 3's Options API here, but we strongly encourage you to build your
// plugins with Vite.
const definition = {
components: {
ButtonIcon,
},
methods: {
sayHi() {
alert('Hi!')
}
},
template: `
<button-icon @click="sayHi()" icon-left="hand-waving">
Say hi
</button-icon>
`
}
// Finally, we register the component's definition in a hook.
datashare.registerHook({ target: 'app-sidebar-sections:before', definition })
})
In this example, you learn that:
Datashare's launch must be awaited with the "datashare:ready" event
You can asynchronously import components with datashare.findComponent
Components can be registered at targeted locations with a "hook"
All icons from Phosphor are available and loaded automatically