Datashare
About Datashare

Datashare allows you to search in your files, regardless of their format. It is free, open-source software developed by the International Consortium of Investigative Journalists (ICIJ).

What is Datashare?

Welcome to Datashare, a self-hosted document search application. It is free, open-source software developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, the tool is now a fully-featured search interface to dig into your documents.

With the help of several open-source tools (Extract, Apache Tika, Tesseract, CoreNLP, OpenNLP, Elasticsearch, and more), Datashare can be used on a single personal computer as well as on 100 interconnected servers.

Who uses it?

Datashare is developed by ICIJ, a collective of investigative journalists. It is built on top of technologies and methods already tested in investigations like the Panama Papers or the Luanda Leaks.

Seeing the growing interest in ICIJ's technology, we decided to open-source this key component of our investigations so that a single journalist, as well as a big media organization, can use it on their own documents.

Datashare is free, so anyone can use it and find it useful.

Curious to know more about how we use Datashare?

  • How ICIJ analysed 715,000 Luanda Leaks records

  • Help test and improve our latest journalism tool

  • How Datashare project will help journalists breach borders

Where can I see Datashare in action?

We set up a demo instance of Datashare with a small set of documents from the LuxLeaks investigation (2014). When using this instance, you are assigned a temporary user who can star, tag, recommend and explore documents.

You can also launch your own batch searches on Datashare's demo instance.

Can I run Datashare on my server?

Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.

Can I customize Datashare?

When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way, we ensure that Datashare is usable by non-tech-savvy reporters, yet robust enough to satisfy data analysts and developers who want to query the index directly with our API.
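For instance, advanced users can query the index with any HTTP client. The sketch below is illustrative only: it assumes a local instance, the default local-datashare project, and Datashare's Elasticsearch proxy endpoint, whose exact path may vary between versions:

# Search the default project's index through Datashare's Elasticsearch proxy
# (the /api/index/search/... path is an assumption, check your version's API)
curl -s 'http://localhost:8080/api/index/search/local-datashare/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"content": "shakespeare"}}}'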

To make this customization more accessible, we implemented a plugin system. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.

In which languages is Datashare available?

This project is currently available in English, French and Spanish. You can help improve and complete translations on Crowdin.


Install on Mac

These pages will help you set up and install Datashare on your computer.

Install Neo4j plugin

Install the Neo4j plugin

Install the Neo4j plugin following these instructions.

Configure the Neo4j plugin

1. At the bottom of the menu, click on the 'Settings' icon:

2. Make sure the following settings are properly set:

  • Neo4j Host should be localhost or the address where your Neo4j instance is running

  • Neo4j Port should be the port where your Neo4j instance is running (7687 by default)

  • Neo4j User should be set to your Neo4j user name (neo4j by default)

  • Neo4j Password should only be set if your Neo4j user is using password authentication

3. When running Neo4j Community Edition, set the Neo4j Single Project value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project that will use the Neo4j plugin. Other projects won't be able to use the Neo4j plugin.

4. Restart Datashare to apply the changes. Check how for Mac, Windows or Linux.

5. Go to 'Projects' > your project's page > the Graph tab. You should see the Neo4j widget. After a little while, its status should be RUNNING:

You can now create the graph.


Why can results from a simple search and a batch search be slightly different?

If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.

Why?

For technical reasons, Datashare processes the two queries in two different ways:

a. Search bar (a simple search processed in the browser):

The search query is processed in your browser by Datashare's client, then sent to Elasticsearch through Datashare's server, which simply forwards it.

b. Batch search (several searches processed by the server):

  1. Datashare's server processes each of the batch search's queries

  2. Each query is sent to Elasticsearch and the results are saved into a database

  3. When the batch search is finished, Datashare sends back the results stored in the database

Datashare's team tries to keep both results consistent, but slight differences can occur between the two queries.


How to run Neo4j?

This page explains how to run a Neo4j instance inside Docker. For any additional information, please refer to the Neo4j documentation: https://neo4j.com/docs/getting-started/

Run Neo4j inside docker

1. Enrich the services section of the docker-compose.yml from the install with Docker page with the following neo4j service:

...
services:
  neo4j:
    image: neo4j:5-community
    environment:
      NEO4J_AUTH: none
      NEO4J_PLUGINS: '["apoc"]'
    ports:
      - 7474:7474
      - 7687:7687
    volumes:
      - neo4j_conf:/var/lib/neo4j/conf
      - neo4j_data:/var/lib/neo4j/data

Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').

2. Enrich the volumes section of the docker-compose.yml from the install with Docker page with the following neo4j volumes:

volumes:
  ...
  neo4j_data:
    driver: local
  neo4j_conf:
    driver: local

3. Start the neo4j service using:

docker compose up -d neo4j
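To verify that the instance is up and that the APOC plugin is loaded, you can query it from inside the container. A minimal check, assuming the neo4j service defined above (with NEO4J_AUTH set to none, no credentials are needed):

# Should print the APOC version if the plugin was loaded correctly
docker compose exec neo4j cypher-shell "RETURN apoc.version();"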

Run Neo4j Desktop

  1. Install Neo4j Desktop following the installation instructions found here

  2. Create a new local DBMS and save your password for later

  3. If the installer notifies you of any port modification, check the DBMS settings and save the server.bolt.listen_address for later

  4. Make sure to install the APOC Plugin

Additional options

Additional options to install Neo4j are listed here.

Definitions

👷‍♀️ This page is currently being written by the Datashare team.

What is an entity?

An entity in Datashare is the name of a person, an organization or a location, or an email address.

Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically detect entities in your documents.

You can filter documents by their entities and see all the entities mentioned in a document.

What if tasks are 'running' but not completing?

You started tasks and they appear as running on 'http://localhost:8080/#/indexing', but they are not completing.

There are two possible causes:

  • If you see a progress of less than 100%, please wait.

  • If the progress is 100%, an error has occurred and the tasks failed to complete, which may have various causes. If you're an advanced user, you can create an issue on Datashare's GitHub with the application logs.

What do I do if Datashare opens a blank screen in my browser?

If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:

  1. First wait 30 seconds and reload the page.

  2. If the screen remains blank, restart Datashare following instructions for Mac, Windows or Linux.

  3. If you still see a blank screen, please uninstall and reinstall Datashare.

To uninstall Datashare:

On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.

On Windows, please follow these steps.

On Linux, please delete the 3 containers (Datashare, Redis and Elasticsearch) and the script.

To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.

I see entities in the filters but not in the documents

Datashare's filters keep the entities (people, organizations, locations, e-mail addresses) previously found.

"Old" named entities can remain in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later.

In the future, removing the documents from Datashare before indexing new ones will remove the entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can follow .

these instructions

Datashare doesn't open. What should I do?

This can be caused by previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them: open your Terminal, then copy and paste the command below:

  • On Mac

rm -rf ~/Library/datashare/plugins ~/Library/datashare/extensions

  • On Linux

rm -rf ~/.local/share/datashare/plugins ~/.local/share/datashare/extensions

  • On Windows

del /S %APPDATA%\Datashare\Extensions %APPDATA%\Datashare\Plugins

Press Enter. Open Datashare again.

Running modes

Datashare can run in different modes, each with its own features.

Mode          Category   Description
LOCAL         Web        To run Datashare on a single computer for a single user.
SERVER        Web        To run Datashare on a server for multiple users.
CLI           CLI        To index documents and analyze them directly in the command line.
TASK_RUNNER   Daemon     To execute async tasks (batch searches, batch downloads, scan, index, NER extraction, ...).

Web modes

There are two modes:

In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own machines. The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on the server. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.

Comparison between modes

The running modes come with their own advantages and limitations. This matrix summarizes the differences:

                   local   server
Multi-users        ❌       ✅
Multi-projects     ✅       ✅
Access-control     ❌       ✅
Indexing UI        ✅       ❌
Plugins UI         ✅       ❌
Extension UI       ✅       ❌
HTTP API           ✅       ✅
API Key            ✅       ✅
Single JVM         ✅       ❌
Tasks execution    ✅       ❌

When running Datashare in local mode, users can choose to use embedded services (like Elasticsearch, SQLite and an in-memory key/value store) in the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
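As a minimal illustration, both commands below start the same embedded instance; --mode and --dataDir are the same options used by the CLI examples elsewhere in this documentation:

# Start Datashare in the default embedded mode
datashare

# Same thing, with the mode and the documents folder made explicit
datashare --mode EMBEDDED --dataDir $HOME/Datashare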

CLI mode

In CLI mode, Datashare starts without a web server and lets users perform tasks over their documents. This mode can be used in conjunction with both the local and server modes, allowing users to distribute heavy tasks between several servers.

If you want to learn more about which tasks you can execute in this mode, check out the stages documentation.

Daemon modes

These modes are intended for actions that require "waiting" for pending tasks.

In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search and bundles them into a zip file.

In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users go through the dedicated form in Datashare where they can upload a list of search queries (in CSV format, as sketched below). The daemon then starts a task to search all matching documents and stores every occurrence in the database.
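For illustration, such a CSV is simply one search query per line. The queries below are made up; wrapping a query in double quotes usually makes Datashare search for the exact phrase:

Shakespeare
"Ada Lovelace"
banking AND offshore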

How to change modes

Datashare is shipped as a single executable with all modes available. As previously mentioned, the default mode is the embedded mode, yet when starting Datashare from the command line you can explicitly specify the running mode. For instance, on Ubuntu/Debian:

datashare \
  # Switch to SERVER mode
  --mode SERVER \
  # Dummy session filter that creates ephemeral users
  --authFilter org.icij.datashare.session.YesCookieAuthFilter \
  # Name of the default project for every user
  --defaultProject local-datashare \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # URI of Redis
  --redisAddress redis://redis:6379 \
  # Store user sessions in Redis
  --sessionStoreType REDIS

CLI stages

When running Datashare from the command line, you pick which "stage" to apply to analyse your documents.

The CLI stages are primarily intended to be run against an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.

1. SCAN

This is the first step to add documents to Datashare from the command line. The SCAN stage queues all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.

datashare --mode CLI \
  # Select the SCAN stage
  --stage SCAN \
  # Where the documents are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Redis
  --redisAddress redis://redis:6379

2. INDEX

The INDEX stage is probably the most important (and heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text and metadata and to OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command across several servers.

datashare --mode CLI \
  # Select the INDEX stage
  --stage INDEX \
  # Where the documents are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # Enable OCR
  --ocr true \
  # URI of Redis
  --redisAddress redis://redis:6379

3. NLP

Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process not only creates named entity mentions in Elasticsearch, it also marks every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized across several servers.

datashare --mode CLI \
  # Select the NLP stage
  --stage NLP \
  # Use CORENLP to detect named entities
  --nlpp CORENLP \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 

Ask for help

To report a bug, please open an issue on our GitHub, detailing your logs along with:

  • Your Operating System (Mac, Windows or Linux)

  • The version of your Operating System

  • The version of Datashare

  • Screenshots of your issue

  • A description of your issue

If, for confidentiality reasons, you don't want to open an issue on GitHub, please write to datashare@icij.org.

Concepts

This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.

Install on Windows

These pages will help you set up and install Datashare on your computer.

About the local mode

In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines.

The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure.

This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

Add documents to Datashare

Datashare provides a folder on your Mac to collect documents you want to have in Datashare.

1. Find your Datashare folder on your Mac

Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':

On the menu bar at the top of your computer, click 'Go' and 'Home' (the house icon):

You will see a folder called 'Datashare':

If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:

2. Add documents to your Datashare folder on your Mac

Copy or drop the documents that you want to add to Datashare in this Datashare folder.

3. Launch Datashare

Open your Applications. You should see Datashare. Double-click on it:

4. In the menu, in 'Tasks', open 'Documents'

Expand the menu on the left. In 'Tasks', open 'Documents'. Then, on the top right, click the 'Plus' button.

5. Choose your options

  • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

  • Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

6. Watch the progress of your document addition

Two extraction tasks are now running:

  • The first is the scanning of your Datashare folder: it checks whether there are documents to analyze. It is called 'Scan folders'.

  • The second is the indexing of these files. It is called 'Index documents'.

Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

But you can start searching in your documents without having to wait for all tasks to be done.

You can now search documents in Datashare.


Install on Linux

These pages will help you set up and install Datashare on your computer.

Install Datashare

You must have Windows 7 Service Pack 2 or any newer version.

1. Uninstall any prior standard version

Before we start, please uninstall any prior standard version of Datashare if you had already installed it. You can follow these steps: https://www.laptopmag.com/articles/uninstall-programs-windows-10

2. Download Datashare

Go to datashare.icij.org and click 'Download for Windows':

The file 'datashare-X.Y.Z.exe' is now downloaded. You can find it in your Downloads.

Double-click the file to run it.

3. Allow Datashare

As Datashare is not signed, this popup asks for your permission. Don't click 'Don't run'; click 'More info':

Click 'Run anyway':

It asks if you want to allow the app to make changes to your device. Click 'Yes':

4. Install Datashare

On the Installer Wizard, click 'Install'; it will download and install OpenJDK 11 if it is not already installed on your device:

The following windows with progress bars will be displayed:

Choose a language and click 'OK':

5. Install Tesseract OCR

To install Tesseract OCR, click the following buttons on the Installer Wizard's windows:

Untick 'Show README' and click 'Finish':

Finally, click 'Close' to close the Tesseract OCR installer.

6. Install Datashare.jar

The installer now downloads the back-end and the front-end, Datashare.jar:

When it is finished, click 'Close':

You can now start Datashare.

Install Datashare

The installer will take care of checking that your system has all the dependencies to run Datashare. Because this software uses Tesseract (to perform Optical Character Recognition, OCR) and macOS doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of these dependencies, the first installation of Datashare can take up to 30 minutes.

The installer will set up:

  • Xcode Command Line Tools (if neither Xcode nor the Xcode Command Line Tools are installed)

  • Homebrew (if neither Homebrew nor MacPorts is installed)

  • Tesseract with MacPorts or Homebrew

  • Java JRE 17

  • Datashare executable

Note: Previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore but Datashare is still published on the Docker Hub and supported with Docker.

If the installation fails:

  • Error while installing Homebrew or MacPorts: you can manually install Homebrew first and then restart the installer.

  • "System Software from application was blocked from loading" : Check in your Mac's "System Settings" > "privacy & security" if you have a section with this mention "System software from application 'Datashare' was blocked from loading" or something similar related to Datashare. If you have this section you'll have to click "Allow" to be able to install datashare.

  • For any other issue, check our GitHub issues or create a new one with your setup (macOS version) and the installer logs (press Command+L once the installer has launched and failed).

1. Download Datashare

Go to datashare.icij.org and click 'Download for Mac'.

2. Start the installer

In Finder, go to your 'Downloads' directory and double-click 'datashare-X.Y.Z.pkg':

3. Go through the Datashare Installer

Click 'Continue', 'Install', enter your password and 'Install Software':

The installation begins and you see a progress bar. It stays a long time on "Running package scripts" because it is installing the Xcode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.

You can see what it is actually doing by typing Command+L: this opens a window which logs every action performed.

In the end, you should see this screen:

You can now safely close this window.

You can now start Datashare.


Install with Docker

This page will help you set up and install Datashare within a Docker container.

Prerequisites

Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.

Read more about how to install Docker on your system.

Starting Datashare with a single container

To start Datashare within a Docker container, you can use this command:

docker run --mount src=$HOME/Datashare,target=/home/datashare/data,type=bind -p 8080:8080 icij/datashare:11.1.9 --mode EMBEDDED

Make sure the Datashare folder exists in your home directory or this command will fail. This is only an example of how to run Datashare with Docker: data will not be persisted.

Starting Datashare with multiple containers

Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).

By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store, will be retained even if the containers are restarted or redeployed.

This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.

To start Datashare with Docker Compose, you can use the following docker-compose.yml file:

version: "3.7"
services:

  datashare:
    image: icij/datashare:18.1.3
    hostname: datashare
    ports:
      - 8080:8080
    environment:
      - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
    volumes:
      - type: bind
        source: ${HOME}/Datashare
        target: /home/datashare/data
      - type: volume
        source: datashare-models
        target: /home/datashare/dist
    command: >-
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
      --mode LOCAL
      --tcpListenPort 8080
    depends_on:
      - postgresql
      - redis
      - elasticsearch

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    restart: on-failure
    volumes:
      - type: volume
        source: elasticsearch-data
        target: /usr/share/elasticsearch/data
        read_only: false
    environment:
      - "http.host=0.0.0.0"
      - "transport.host=0.0.0.0"
      - "cluster.name=datashare"
      - "discovery.type=single-node"
      - "discovery.zen.minimum_master_nodes=1"
      - "xpack.license.self_generated.type=basic"
      - "http.cors.enabled=true"
      - "http.cors.allow-origin=*"
      - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"

  redis:
    image: redis:4.0.1-alpine
    restart: on-failure

  postgresql:
    image: postgres:12-alpine
    environment:
      - POSTGRES_USER=datashare
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=datashare
    volumes:
      - type: volume
        source: postgresql-data
        target: /var/lib/postgresql/data

volumes:
  datashare-models:
  elasticsearch-data:
  postgresql-data:

Apple Silicon (M1/M2/M3) users:

If you encounter the error Error response from daemon: no matching manifest for linux/arm64/v8 in the manifest list entries when pulling the redis Docker image, add the following line to the redis service in your docker-compose.yml:

platform: linux/x86_64

This forces Docker to use the x86_64 image, which is necessary because some Redis images do not provide ARM64 builds.
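Applied to the compose file above, the redis service becomes:

  redis:
    image: redis:4.0.1-alpine
    restart: on-failure
    platform: linux/x86_64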

Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.

Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:

docker-compose logs -f datashare

Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.

That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.

To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

docker-compose down

This will stop and remove the containers, freeing up system resources.

Install Datashare

Currently, only a .deb package for Debian/Ubuntu is provided.

If you want to run it with another Linux distribution, you can download the latest version of the Datashare jar here: https://github.com/ICIJ/datashare/releases/latest

And adapt the following launch script to your environment: https://github.com/ICIJ/datashare/blob/master/datashare-dist/src/main/deb/bin/datashare.
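If you go that route, launching Datashare essentially boils down to invoking the jar with Java, which is what the launch script does. A sketch, where the exact jar name depends on the release you downloaded:

# Run the Datashare distribution jar (starts in the default embedded mode)
java -jar datashare-dist-X.Y.Z-all.jar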

1. Download Datashare

Go to datashare.icij.org and click 'Download for Linux':

Save the Debian package as a file.

2. Install the package

$ sudo apt install /dir/to/debian/package/datashare-dist_7.2.0_all.deb
3. Run Datashare

$ datashare

You can now start Datashare.

Start Datashare

Find the application on your computer and run it locally in your browser.

Open the Windows main menu at the bottom left of your screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)

A Terminal window opens, showing the progress of Datashare starting up. Keep this window open as long as you use Datashare.

Datashare should now automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).


You can now add documents to Datashare.

Start Datashare

Find the application on your computer and run it locally in your browser.

Start Datashare by launching it from the command-line:

datashare

Datashare should now automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see: Can I use Datashare with no internet connection?).


It's now time to add documents to Datashare.


Neo4j

This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your computer.

Prerequisites

Get Neo4j up and running

Follow the instructions of the dedicated FAQ page to get Neo4j up and running.

We recommend using a recent release of Datashare (>= 14.0.0) for this feature; click the 'Other platforms and versions' button when downloading to access other versions if necessary.

Add entities

If it's not done yet, find entities to extract the names of people, organizations and locations, as well as email addresses.

If your project contains emails, make sure to also extract email addresses.

Next step

You can now run Datashare with the Neo4j plugin.

About the server mode

In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on the server.

This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.

Launch configuration

Datashare is launched with --mode SERVER and you have to provide:

  • The external elasticsearch index address elasticsearchAddress

  • A Redis store address redisAddress

  • A Redis data bus address messageBusAddress

  • A database JDBC URL dataSourceUrl

  • The host of Datashare (used to generate batch search result URLs) rootHost

  • An authentication mechanism and its parameters

Example:

docker run -ti icij/datashare:<version> --mode SERVER \
    --redisAddress redis://my.redis-server.org:6379 \
    --elasticsearchAddress https://my.elastic-server.org:9200 \
    --messageBusAddress my.redis-server.org \
    --dataSourceUrl "jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password" \
    --rootHost https://my.datashare-server.org
    # ... plus auth parameters (see the authentication providers section)

Install plugins and extensions

This page explains how to locally add plugins and extensions to Datashare.

Plugins are front-end modules to add new features in Datashare's user interface.

Extensions are back-end modules to add new features to store and manipulate data with Datashare.

Add plugins to Datashare's front-end

1. At the bottom of the menu, click the 'Settings' icon:

2. Open the 'Plugins' tab:

3. Choose the plugin you want to add and click 'Install':

If you want to install a plugin from a URL, click 'Install from a URL':

4. Your plugin is now installed:

5. Refresh your page to see your new plugin activated in Datashare.

Add extensions to Datashare's back-end

1. At the bottom of the menu, click the 'Settings' icon:

2. Open the 'Extensions' tab:

3. Choose the extension you want to add and click 'Install':

If you want to install an extension from a URL, click 'Install from a URL':

4. Your extension is now installed:

5. Restart Datashare to see your new extension activated in Datashare. Check how for Mac, Windows and Linux.

Update plugin or extension with latest version

When a newer version of a plugin or extension is available, get the latest version.

If it is a plugin, refresh your page to activate the latest version.

If it is an extension, restart Datashare to activate the latest version. Check how for Mac, Windows and Linux.

Create your own plugin or extension

People who can code can create their own plugins and extensions by following these steps:

  • Plugins

  • Extensions

Add documents to Datashare

Datashare provides a folder to collect documents on your computer to index in Datashare.

1. Add documents to the 'Datashare Data' folder

On your Windows desktop, you will see a folder called 'Datashare Data'.

Move or copy and paste the documents you want to add to Datashare to this folder:

2. Launch Datashare

You will find it in your main menu:

3. In the menu, in 'Tasks', open 'Documents'

Expand the menu on the left. In 'Tasks', open 'Documents'. Then, on the top right, click the 'Plus' button.

4. Choose your options

  • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

  • Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

5. Watch the progress of your document addition

Two extraction tasks are now running:

  • The first is the scanning of your Datashare folder: it checks whether there are documents to analyze. It is called 'ScanTask'.

  • The second is the indexing of these files. It is called 'IndexTask'.

Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

But you can start searching in your documents without having to wait for all tasks to be done.

You can now search documents in Datashare.

Add documents to Datashare

Datashare provides a folder to collect documents on your computer to index in Datashare.

1. Add documents to your 'Datashare' folder

You can find a folder called 'Datashare' in your home directory.

Move the documents you want to add to Datashare into this folder.

2. Launch Datashare

Launch Datashare; the interface opens in your default browser.

3. In the menu, in 'Tasks', open 'Documents'

Expand the menu on the left. In 'Tasks', open 'Documents'. Then, on the top right, click the 'Plus' button.

4. Choose your options

  • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

  • Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

5. Watch the progress of your document addition

Two extraction tasks are now running:

  • The first is the scanning of your Datashare folder: it checks whether there are documents to analyze. It is called 'ScanTask'.

  • The second is the indexing of these files. It is called 'IndexTask'.

Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

But you can start searching in your documents without having to wait for all tasks to be done.

You can now search documents in Datashare.

Find entities

This page helps you find entities (people, organizations, locations, e-mail addresses) in your documents.

Prerequisite: Your documents must be added to Datashare. Check how for Mac, Windows and Linux.

1. In the menu, in 'Tasks', click 'Entities'

2. Click the 'Plus' button (in the menu or at the top right), or click 'Find entities' on the page:

3. Select your options

  • Select a project where you want to find entities

  • Choose between finding names of people, organizations and locations, or finding email addresses. You cannot do both simultaneously; you need to run one after the other, in any order.

  • Choose a Natural Language Processing model, that is to say the software which will run the entity recognition. If you want to add more models, you can check how to add them as extensions.

4. In 'Tasks' > 'Entities', watch the progress of your entity recognition:

Once they are done, you can click 'Delete done tasks' to stop displaying tasks that are completed.

5. Explore your entities in the documents

You can now start searching your entities in the documents without having to wait for all tasks to be done.

In the menu, click 'Search' > 'Documents' and open the 'Entities' tab of your documents or use the Entities filters.

Create and update Neo4j graph

This page describes how to create your Neo4j graph and keep it up to date with your computer's Datashare projects.

Create the graph

  1. Go to 'All projects' and click on your project's name:

  2. Go to the Graph tab and, in the first step 'Import', click on the 'Import' button:

You will then see a new import task running.

When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:

Update the graph

If new documents or entities are added or modified in Datashare, you will need to update the Neo4j graph to reflect these changes.

Go to 'All projects' > one project's page > the 'Graph' tab. In the first step, click on the 'Update graph' button:

To detect whether a graph update is needed, go to the 'Projects' page and open your project:

Open your project

Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...

Statistics of one project

...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:

The update will always add missing nodes and relationships and update existing ones if they were modified, but it will never delete graph nodes or relationships.

You can now explore your graph using your favorite visualization tool.
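For instance, from cypher-shell or Neo4j Browser, a query along these lines returns a sample of the graph. This is only a sketch: the node labels and relationship type (:Document, :NamedEntity, APPEARS_IN) are assumptions about the plugin's schema and may differ in your version:

// Show 25 entity-document relationships from the graph
MATCH (e:NamedEntity)-[r:APPEARS_IN]->(d:Document)
RETURN e, r, d
LIMIT 25;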

Start Datashare

Find the Datashare application on your computer and run it locally on your browser.

Once Datashare is installed, go to 'Finder' > 'Applications', and double-click on 'Datashare':

A Terminal window called 'Datashare.command' opens and describes the technical operations going on during startup:

⇒ Important: Keep this Terminal window open as long as you use Datashare.

Once the process is done, Datashare should now automatically open in your default internet browser. If it doesn’t, type 'localhost:8080' as a URL in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).


You can now add documents to Datashare.


Add entities from the CLI

This document assumes that you have installed Datashare in server mode within Docker and already added documents to Datashare.

In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.

This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.

Datashare has the ability to detect email addresses and the names of people, organizations and locations. This process uses a Natural Language Processing (NLP) pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entity extraction in the same fashion as the previous CLI stages:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP

What's happening here:

  • Datashare starts in "CLI" mode

  • We ask to process the NLP stage

  • We tell Datashare to use the elasticsearch service

  • Datashare will pull documents from Elasticsearch directly

  • Up to 2 documents will be analyzed in parallel

  • Datashare will use the CORENLP pipeline

Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis), which contains all the document ids to be analyzed.
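To see how many documents are still waiting in that queue, you can ask Redis directly. A sketch, assuming the redis service from the compose file and that the queue is stored as a Redis list:

# Number of document ids waiting in the NLP queue
docker compose exec redis redis-cli llen extract:queue:nlp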

The first time you run this command, you will have to wait a little because Datashare needs to download CORENLP's models, which can be big.

You can also chain the 3 stages altogether:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP \
  --dataDir /home/datashare/Datashare/

As for the previous stages, you may want to restore the output queue from the INDEX stage. You can do:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage ENQUEUEIDX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP

The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the IDs of those documents into the extract:queue:nlp queue.

Install with Docker

This page explains how to start Datashare within a Docker container in server mode.

Prerequisites

Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.

Read more about how to install Docker on your system.

Starting Datashare with multiple containers

Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.

This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.

To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file for version 20.1.4 (check the latest version on https://datashare.icij.org/):

version: "3.7"
services:

  datashare_web:
    image: icij/datashare:20.1.4
    hostname: datashare
    ports:
      - 8080:8080
    environment:
      - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
    volumes:
      - type: bind
        source: ${HOME}/Datashare
        target: /home/datashare/data
      - type: volume
        source: datashare-models
        target: /home/datashare/dist
    command: >-
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
      --mode LOCAL
      --tcpListenPort 8080
    depends_on:
      - postgresql
      - redis
      - elasticsearch

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    restart: on-failure
    volumes:
      - type: volume
        source: elasticsearch-data
        target: /usr/share/elasticsearch/data
        read_only: false
    environment:
      - "http.host=0.0.0.0"
      - "transport.host=0.0.0.0"
      - "cluster.name=datashare"
      - "discovery.type=single-node"
      - "discovery.zen.minimum_master_nodes=1"
      - "xpack.license.self_generated.type=basic"
      - "http.cors.enabled=true"
      - "http.cors.allow-origin=*"
      - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"

  redis:
    image: redis:4.0.1-alpine
    restart: on-failure

  postgresql:
    image: postgres:12-alpine
    environment:
      - POSTGRES_USER=datashare
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=datashare
    volumes:
      - type: volume
        source: postgresql-data
        target: /var/lib/postgresql/data

volumes:
  datashare-models:
  elasticsearch-data:
  postgresql-data:

Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.

Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:

docker-compose logs -f datashare_web

Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
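
If you prefer the command line, a quick way to check that the web server answers (assuming the default port mapping above and curl installed on the host) is:

curl -I http://localhost:8080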

To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

docker-compose down

This will stop and remove the containers, freeing up system resources.

Add documents to Datashare

If you reach this point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.

Extract named entities

Datashare has the ability to detect email addresses, names of people, organizations and locations. You must perform named entity extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.

Authentication providers

Authentication with Datashare in server mode is the most impactful choice you have to make. It can be one of the following:

  • Basic authentication with credentials stored in database (PostgreSQL)

  • Basic authentication with credentials stored in Redis

  • OAuth2 with credentials provided by an identity provider (KeyCloak for example)

  • Dummy basic auth to accept any user (⚠️ if the service is exposed to the internet, it will leak your documents)

Add more languages

This page explains how to install language packages to support Optical Character Recognition (OCR) on more languages.

To be able to perform OCR, Datashare uses an open-source technology called Apache Tesseract. When Tesseract extracts text from images, it uses 'language packages' specifically trained for each language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare, the installer doesn't include them all by default. In case Datashare informs you of a missing package, this guide explains how to install it manually on your system.

Install packages on Linux

To add OCR languages on Linux, simply use the following command:

sudo apt install tesseract-ocr-[lang]

Where `[lang]` can be:

  • all if you want to install all languages

  • a language code (ex: fra, for French), the list of languages is available here
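
For example, to install the French package and check that Tesseract picked it up (assuming the tesseract binary is on your PATH):

sudo apt install tesseract-ocr-fra
tesseract --list-langs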

Install packages on Mac

The Datashare Installer for Mac checks for the existence of either MacPorts or Homebrew; these package managers are used to install Tesseract. If neither of them is present, the Datashare Installer will install MacPorts by default.

With MacPorts (default)

First, you must check that MacPorts is installed on your computer. Please run in a Terminal:

port version

You should see an output similar to this:

If you get a command not found: port, this either means you are using Homebrew (see next section) or you have not yet run the Datashare installer for Mac.

If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):

port install tesseract-deu

The full list of supported language packages can be found on the MacPorts website.

Once the installation is done, close and restart Datashare to be able to use the newly installed packages.

With Homebrew

If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported on your system. In other words, you have nothing to do!

If you want to check if Homebrew is installed, run the following command in a Terminal:

brew -v

You should see an output similar to this:

If you get a command not found: brew error, this means Homebrew is not installed on your system. You may either use MacPorts (see previous section) or run the Datashare installer for Mac on your computer.

Install languages on Windows

Language packages are available on the Tesseract GitHub repository. Trained data files have to be downloaded and added to the tessdata folder inside Tesseract's installation folder.

Additional languages can also be added during Tesseract's installation.

Download and add French into tessdata
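
For example, with PowerShell (the tessdata path below is the default installation location and may differ on your system):

Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata" -OutFile "C:\Program Files\Tesseract-OCR\tessdata\fra.traineddata"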

The list of installed languages can be checked with Windows command prompt or Powershell with the command tesseract --list-langs.

French is listed in installed languages

Datashare has to be restarted after the language installation. Check how for Mac, Windows and Linux.

Basic with Redis

Basic authentication with Redis

Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:

Authorization: Basic dXNlcjpwYXNzd29yZA==

It is secure as long as the communication to the server is encrypted (with SSL for example).

On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.

So you have to provision users. The passwords are sha256 hex encoded. For example using bash:

$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9  -

Then insert the user like this in Redis:

$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}'

If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example, if you use myindex:

$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex","local-datashare"]}}'

Then you should see this popup:

basic auth popup

Example

Here is an example of launching Datashare with Docker and the basic auth provider filter backed by Redis:

docker run -ti icij/datashare --mode SERVER \
    --batchQueueType REDIS \
    --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
    --sessionStoreType REDIS \
    --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
    --authUsersProvider org.icij.datashare.session.UsersInRedis

Install Neo4j plugin

Install the Neo4j plugin

Install the Neo4j plugin using the Datashare CLI so that users can access it from the frontend:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --pluginInstall datashare-plugin-neo4j-graph-widget 

This installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and the datashare-extension-neo4j backend extension inside /home/datashare/extensions. These locations can be changed by updating the docker-compose.yml.
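
For instance, here is a minimal sketch that binds both directories to host folders so installed plugins and extensions survive a container re-creation (the host paths are illustrative):

services:
  datashare_web:
    volumes:
      - ./plugins:/home/datashare/plugins
      - ./extensions:/home/datashare/extensions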

Configure the Neo4j extension

Update the docker-compose.yml to reflect your Neo4j docker service settings.

...
services:
    datashare_web:
      ...
      environment:
        - DS_DOCKER_NEO4J_HOST=neo4j
        - DS_DOCKER_NEO4J_PORT=7687
        - DS_DOCKER_NEO4J_SINGLE_PROJECT=secret-project  # This is for community edition only

If you choose a different Neo4j user or set a password for your Neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.

When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set DS_DOCKER_NEO4J_SINGLE_PROJECT to the name of the project that will use the Neo4j plugin. Other projects won't be able to use it.

Restart Datashare

After installing the plugin, a restart might be needed for the plugin to show up:

docker compose restart datashare_web

Next step

You can now create the graph.

Neo4j

This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your server.

Prerequisites

Get Neo4j up and running

Follow the instructions of the dedicated FAQ page to get Neo4j up and running.

We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click the 'All platforms and versions' button when downloading to access specific versions if necessary.

Add entities

If you haven't done so yet, add entities to your project using the Datashare CLI.

If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the --nlpp flag to --nlpp CORENLP,EMAIL, as in the sketch below.
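
A sketch, reusing the NLP command from the previous pages with both pipelines enabled:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpp CORENLP,EMAIL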

Next step

You can now run Datashare with the Neo4j plugin.

Add documents from the CLI

This document assumes that you have installed Datashare in server mode within Docker.

In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.

This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.

Here is a simple command to scan a directory and index its files:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

What's happening here:

  • Datashare starts in "CLI" mode

  • We ask to process both SCAN and INDEX stages at the same time

  • The SCAN stage feeds an in-memory queue with the files to add

  • The INDEX stage pulls files from the queue to add them to ElasticSearch

  • We tell Datashare to use the elasticsearch service

  • Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine

Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example, we only display the first 20 files in datashare:queue:

docker compose exec redis redis-cli lrange datashare:queue 0 20

The INDEX stage can now be executed in the same container:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage INDEX \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the indexing is done, Datashare will exit gracefully and your documents will be visible in Datashare.

Sometimes you will face the case where you have an existing index and you want to index additional documents inside your working directory without processing every document again. This can be done in two steps:

  • Scan the existing ElasticSearch index and gather document paths to store them inside a report queue

  • Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths already inside it will be skipped

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCANIDX \
  --queueType REDIS \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/
docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --ocr true \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Basic with a database

Basic authentication with a database.

Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:

Authorization: Basic dXNlcjpwYXNzd29yZA==

It is secure as long as the communication to the server is encrypted (with SSL for example).

On the server side, you have to provide a database user inventory. You can launch Datashare first with the full database URL; Datashare will then automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' SQL requests.

Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):

$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9  -

Then you can insert the user like this in your database:

$ psql datashare
datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}');

If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example, if you use myindex:

$ psql datashare
datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex", "local-datashare"]}}');

Or you can use PostgreSQL's COPY statement to import a CSV if you want to create them all at once.
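
A minimal sketch, assuming a users.csv file whose columns match the insert statement above (the file name and path are illustrative):

$ psql datashare
datashare=> \copy user_inventory (id, email, name, provider, details) from 'users.csv' with (format csv);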

Then when accessing Datashare, you should see this popup:

basic auth popup

Example

Here is an example of launching Datashare with Docker and the basic auth provider filter backed by the database:

docker run -ti icij/datashare --mode SERVER \
    --batchQueueType REDIS \
    --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
    --sessionStoreType REDIS \
    --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
    --authUsersProvider org.icij.datashare.session.UsersInDb

Dummy

Dummy authentication provider to disable authentication

You can use a dummy authentication filter that always accepts basic auth. You should see this popup:

basic auth popup

Then, whatever username or password you type, you will enter Datashare.

Example

docker run -ti icij/datashare -m SERVER \
    --dataDir /home/dev/data \
    --batchQueueType REDIS \
    --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=dstest&password=test' \
    --sessionStoreType REDIS \
    --authFilter org.icij.datashare.session.YesBasicAuthFilter

Create and update Neo4j graph

This page describes how to create your Neo4j graph and keep it up to date with your server's Datashare projects.

Run the Neo4j extension CLI

The Neo4j-related features are added to the DatashareCLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/.local/share/datashare/extensions/*, so the CLI will be run as follows:

docker compose exec \
  # if you are not using the default extensions directory  
  # you have to specify it extending the CLASSPATH variable ex:
  # -e CLASSPATH=/home/datashare/extensions/* \ 
  datashare_web /entrypoint.sh \
  --mode CLI \
  --ext neo4j \
  ... 

Create the graph

In order to create the graph, run the --full-import command for your project:

docker compose exec \
  datashare_web /entrypoint.sh \
  --mode CLI \
  --ext neo4j \
  --full-import \
  --project secret-project

The CLI will display the import task progress and log import related information.

Update the graph

When new documents or entities are added or modified inside Datashare, you will need to update the Neo4j graph to reflect these changes.

To update the graph, you can just re-run the full import:

docker compose exec \
  datashare_web /entrypoint.sh \
  --mode CLI \
  --ext neo4j \
  --full-import \
  --project secret-project

The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.

To detect whether a graph update is needed, go to the 'Projects' page and open your project:

Open your project

Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...

Statistics of one project

...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:


You can now explore your graph using your favorite visualization tool.


Can I use Datashare with no internet connection?

You need an internet connection to install Datashare.

You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents: the models that find these named entities are downloaded the first time you ask for them. After that, you don't need an internet connection to find named entities.

You don't need internet connection to:

  • Add documents to Datashare

  • Find named entities (except for the first time you use an NLP option, see above)

  • Search and explore documents

  • Download documents

This allows you to work safely on your documents. No third party should be able to intercept your data and files while you're working offline on your computer.

FAQ

👷‍♀️ This page is currently being written by the Datashare team.

General

👷‍♀️ This page is currently being written by the Datashare team.

Performance considerations

Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even though we take care of getting the best performance possible out of Apache Tika and Tesseract OCR, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.

Separate Processing Stages

Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.

Examples:

datashare --mode CLI --stage SCAN --redisAddress redis://redis:6379 --busType REDIS
datashare --mode CLI --stage INDEX --redisAddress redis://redis:6379 --busType REDIS

Distribute the INDEX Stage

Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.

For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
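
In practice, each indexing server runs the same INDEX command against the shared Redis queue and the remote ElasticSearch; a sketch with placeholder addresses:

datashare --mode CLI --stage INDEX \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://my.redis-server.org:6379 \
  --elasticsearchAddress http://my.elasticsearch-server.org:9200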

Leverage Parallelism

Datashare offers --parallelism and --parserParallelism options to enhance processing speed.

Example (for g4dn.8xlarge with 32 CPUs):

datashare --mode CLI --stage INDEX --parallelism 14 --parserParallelism 14
datashare --mode CLI --stage NLP --parallelism 14 --nlpParallelism 14

Optimize ElasticSearch

ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.

Adjust JAVA_OPTS

You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage. Example (for g4dn.8xlarge with 128 GB of memory):

JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX

Specify Document Language

If the document language is known, explicitly setting it can save processing time.

  • Use --language for general language setting (e.g., FRENCH, ENGLISH).

  • Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).

Example:

datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell

Manage OCR Tasks Wisely

OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.

Example:

datashare --mode CLI --stage INDEX --ocr false

Efficient Handling of Large Files

Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.

Example of splitting Outlook PST files into multiple .eml files with readpst:

readpst -reD <Filename>.pst

Search projects

Projects are collections of documents. Datashare displays statistics about each project.

Expand the menu to go to 'Projects' > 'All projects':

Search in projects' names using the search bar on the right:

Sort your projects by clicking the top right Settings icon:

In the Page settings, choose a sort by option, change the number of projects per page or the layout:

To explore a project, close the Settings and click on the name of the project:

You can now explore a project.

Filter documents

Filters are on the left of the search bar. You can contextualize, exclude and reset them. Active filters are displayed in the search breadcrumb.

Filters

Open 'Filters' on the left of the search bar:

'Indexing dates' are the dates when the documents were added to Datashare.

'Extraction levels' regard embedded documents:

  • The 'file on disk' is level zero

  • If a document is attached to (or contained in) a file on disk, its extraction level is '1st'

  • If a document is attached to (or contained in) a document itself contained in a file on disk, its extraction level is '2nd'

  • And so on

Filter by entities

If you asked Datashare to 'Find entities' and the task completed, you will see names of people, organizations, locations and e-mail addresses in the filters. These are the entities automatically detected by Datashare:

Exclude filters

Tick the 'Exclude' checkbox to select all items except those selected.

In the search breadcrumb, you can see that the excluded filters are struck through:

Contextualize filters

In most filters, tick 'Contextualize' to update the number of documents indicated in the filters so they reflect the results.

The filter will only count what you selected, so it reflects the results of your current selection:

Clear all filters

To reset all filters at the same time, open the search breadcrumb:

Click 'Clear filters':

Search documents

Search with the main search bar and configure settings to display your search's results.

You must have added documents to Datashare before. Check how for Mac, Windows and Linux.

Search bar

Expand the menu to go to 'Search' > 'Documents':

Make room by closing the menu:

Type terms in the search bar and press Enter:

Default operator is OR

If you type several terms separated by spaces, as the default operator is OR, Datashare will search for all documents containing at least one of the searched terms.

For instance, Datashare finds documents containing either 'ikea' or 'paris' or both terms here:

Linked entities

As you type a term, Datashare suggests linked entities (only if a task to find entities in this project was completed).

Press Esc on your keyboard to close the dropdown or click on one of the entities to replace your term in the search bar:

Search in a field

Search within a specific field only, by using the dropdown 'All fields':

Search breadcrumb

To see your queries in the search breadcrumb, click on the icon on the left of the search bar:

If you'd like to remove all searched terms from the search bar, click 'Clear query':

Results settings

To change the page settings, click the Settings icon on the top right:

You can change Sort by, Documents per page, Layout and also Properties:

Ticking these properties will change which document metadata are displayed in the results, in the document cards, in all 3 layouts (List, Grid, Table):

You can now make your search more precise.

Search with operators or Regex

To make your searches more precise, use operators in the main search bar.

Double quotes for exact phrase

To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).

"Alicia Martinez’s bank account in Portugal"

OR (or space)

To have all documents mentioning at least one of the queried terms, you can use a simple space between your queries (as OR is the default operator in Datashare) or OR. You need to write OR with all letters uppercase.

Alicia Martinez

Alicia OR Martinez

AND (or +)

To have all documents mentioning all the queried terms, you can use AND between your queried words. You need to write AND with all letters uppercase.

Alicia AND Martinez

+Alicia +Martinez

NOT (or ! or -)

To have all documents NOT mentioning some queried terms, you can use NOT before each word you don't want. You need to write NOT with all letters uppercase.

NOT Martinez

!Martinez

-Martinez

Combine operators

Parentheses should be used whenever multiple operators are used together and you want to give priority to some.

((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"

You can also combine these with regular expressions (regex) between two slashes.

Wildcards

If you search faithf?l, the search engine will look for all words with any possible single character between the second f and the l in this word. It also works with * to replace multiple characters.

Alicia Martin?z

Alicia Mar*z

Fuzziness

You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: ).

quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

Proximity searches

When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

Examples:

"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)

"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)

"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: ).

"fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"

The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase quick fox would be considered more relevant than quick brown fox (source: Elasticsearch documentation).

Boosting operators

Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:

quick^2 fox

The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:

"john smith"^2 (foo bar)^4

(source: Elasticsearch documentation)

Regular expressions (Regex)

‌"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." ().

1. You can use Regex in Datashare. Regular expressions need to be written between two slashes.

/.*\..*\@.*\..*/

The example above will search for any expression structured like an email address, with a dot between two expressions before the @ and a dot between two expressions after the @, as in 'first.lastname@email.com' for instance.

2. Regex can be combined with standard queries in Datashare :

("Ada Lovelace" OR "Ado Lavelace") AND paris AND /.*..*@.*..*/

3. You need to escape the following characters by typing a backslash just before them (without space): # @ & < > ~

/.*\..*\@.*\..*/ (the @ was escaped by a backslash \ just before it)

4. Important: Datashare relies on Elasticsearch's Regex syntax. A consequence of this is that spaces cannot be searched as such in Regex.

We encourage you to use the AND operator to work around this limitation and make sure you can make your search.

If you're looking for a French International Bank Account Number (IBAN), which may or may not contain spaces and contains FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:

/FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)

Here are a few examples of useful Regex:

  • You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.

  • You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with y or u.

  • You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.

Other common Regex examples:

  • phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/

  • emails: /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+/

  • credit cards: /(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35[0-9]{3})[0-9]{11})/

You can find many other examples online. More generally, if you use a regex found on the internet, beware that its syntax is not necessarily compatible with Elasticsearch's. For example, \d, \S and the like are not supported.

Search with metadata fields

1

In 'Search' > 'Documents', open a document and go to the 'Metadata' tab:

2

Click a metadata's search icon to search documents with same properties:

3

See the query in the main search bar. It contains the field name, a colon and the searched value:

So for example, if you are looking for documents that:

  • Contain term1, term2 and term3

  • And were created after 2010

you can use the 'Date' filter or type in the search bar:

term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01

Explanations:

  • 'metadata.tika_metadata_creation_date:' means that we filter by creation date

  • '>=' means 'on or after the given date (included)'

  • '2010-01-01' stands for January 1st, 2010, and the search will include that date

For other searches:

  • '<' will mean 'strictly before (with January 1st excluded)'

  • no character will mean 'at this exact date'

Ranges: You can also search for numbers in a range. Ranges can be specified for date, numeric or string fields among the ones you can find by clicking the magnifying glass when you hover the fields in a document's 'Metadata' tab. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to the Elasticsearch documentation.
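
For example, to find documents mentioning term1 that were created in 2010 only, combining a term with an inclusive date range on the same Tika field as above:

term1 AND metadata.tika_metadata_creation_date:[2010-01-01 TO 2010-12-31]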

OAuth2

OAuth2 authentication with a third-party id service

This is the default authentication mode: if no authentication filter is provided on the CLI, this one is selected. With OAuth2, you will need a third-party authorization service. The diagram below describes the workflow:
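
As an illustration only, here is a hedged sketch of what launching Datashare with OAuth2 can look like; the oauth* option names and the filter class are assumptions to double-check against your version's --help output:

# option names below are assumptions; check datashare --help for your version
docker run -ti icij/datashare --mode SERVER \
    --sessionStoreType REDIS \
    --authFilter org.icij.datashare.session.OAuth2CookieFilter \
    --oauthClientId <client-id> \
    --oauthClientSecret <client-secret> \
    --oauthAuthorizeUrl https://keycloak.example.org/auth \
    --oauthTokenUrl https://keycloak.example.org/token \
    --oauthApiUrl https://keycloak.example.org/userinfo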

Example

Integration with KeyCloak

We made a small demo to show how it could be set up.

Explore a document

Explore the document's data through different tabs.

See a document in full-screen view

Open a document in 'Search' > 'Documents' > one document and click the icon with in and out arrows (this applies to the List layout; in the Grid and Table layouts, documents always open in full-screen view):

You now see the document in full screen view and can go to the next document in your results by using the pagination carousel on the top of the screen:

Search in a document

  • Open a document in 'Search' > 'Documents' > one document

  • Stay on the first tab called 'Text'. This tab shows the text as extracted from your document by Datashare.

  • Click on the search bar or press Command (⌘) / Control + F

  • Type the terms you're searching for

  • Press ENTER to go from one occurrence to the next one

  • Press SHIFT + ENTER to go from one occurrence to the previous one

To see all the keyboard shortcuts in Datashare, please read 'Use keyboard shortcuts'.

See original document

Go to the 'View' tab to see the original document.

Note: this visualization of the document is available only for some file types (images, PDF, CSV, xlsx and tiff) but not for other file types like Word documents or e-mails for instance.

Search for attachments and documents in the same folder

Attachments are called 'children documents' in Datashare.

Go to the 'Metadata' tab and click on 'X documents in the same folder' or 'Y children documents':

You see the list of documents. To open all the documents in the same folder or all the children documents, click 'Search all' below. There is no 'Search all' button if there are no documents, as for the children documents below:

Explore metadata

Go to the 'Metadata' tab to explore all the properties of the document:

If a metadata field is interesting to you and you'd like to know whether other documents in your project share the same value, click the search icon:

You can also copy or pin a metadata.

Entities

In the 'Entities' tab (only if you previously ran tasks to find entities in Datashare), you can read the names of people, organizations, locations and e-mail addresses, along with the number of their occurrences in the document:

Hover over an entity to see a popover with all its mentions in context in the document; move between them by clicking the arrows:

Go to the 'Info' tab to check how the entity was extracted:

Create a Neo4j graph and explore it

This page explains how to leverage Neo4j to explore your Datashare projects.

Prerequisites

We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. To download a specific version, click on 'All platforms and versions'.

If you are not familiar with graphs and Neo4j, take a look at Neo4j's introductory resources first.

The documents and entities graph

Neo4j is a graph database technology which lets you represent your data as a graph.

Inside Datashare, Neo4j lets you connect entities between them through documents in which they appear.

After creating a graph from your Datashare project, you will be able to explore this graph and visualize these kinds of relationships between your project's entities:

In the above graph, we can see 3 e-mail document nodes in orange, 3 e-mail address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:

  • shapp@caiso.com emailed 20participants@caiso.com; the sent email has an ID starting with f4db344...

  • One person named vincent is mentioned inside this email, as well as the california location

  • Finally, the e-mail also mentions the dle@caiso.com e-mail address which is also mentioned in 2 other e-mail documents (with ID starting with 11df197... and 033b4a2...)

Graph nodes

The Neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.

The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...

Graph relationships

In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of e-mail documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the e-mail metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the e-mail.

When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the e-mail document body.

When a document is embedded inside another document (as an e-mail attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.
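
To make this concrete, here is a sketch of a Cypher query using only the labels and relationship types described above; it lists a few person entities together with the documents they appear in:

MATCH (p:NamedEntity:PERSON)-[:APPEARS_IN]->(d:Document)
RETURN p, d
LIMIT 25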

Create your Datashare project's graph

The creation of a Neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:

  • When using Datashare locally

  • When Datashare is running on a server

After the graph is created, open the menu, go to the 'Projects' page, select your project and go to the Graph tab.

You should be able to visualize a new Neo4j widget displaying the number of documents and entities found inside the graph:

Access your project's graph

Depending on your access to the Neo4j database behind Datashare, you might need to export the Neo4j graph and import it locally to access it from your own tools.

Exporting and importing the graph into your own database is also useful when you want to perform write operations on your graph without any consequences on Datashare.

With read access to Datashare's Neo4j database

If you have read access to the Neo4j database (this should be the case if you are running Datashare on your computer), you will be able to plug into it and start exploring.

Without read access to Datashare's Neo4j database

If you can't have read access to the database, you will need to export it and import it into your own Neo4j instance (running on your laptop for instance).

Ask for a DB dump

If possible, ask your system administrator for a DB dump obtained using the neo4j-admin dump command.

Export your graph from Datashare

In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare. Be aware that limits might be applied to the size of the exported graph.

To export the graph, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Cypher shell' export format and at the end of the form, click the 'Export' button:

In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Paths' and 'File types' filters.

DB import

Depending on how you run Neo4j, use one of the following ways to import your graph into your DB:

Docker

  • Identify your Neo4j instance container ID:

  • Copy the graph dump into your Neo4j container's import directory:

  • Import the dumped file using the command:
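
A sketch of these three steps, assuming a 'Cypher shell' export named graph.cypher and the neo4j user (the container ID, path and password are illustrative):

docker ps
docker cp graph.cypher <container-id>:/var/lib/neo4j/import/
docker exec -i <container-id> cypher-shell -u neo4j -p <password> -f /var/lib/neo4j/import/graph.cypher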

Neo4j Desktop import

  • Open 'Cypher shell':

  • Copy the graph dump into your Neo4j instance's import directory:

  • Import the dumped file using the command:
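
A sketch, assuming the dump file is named graph.cypher and was copied into the instance's import directory (credentials are illustrative):

cypher-shell -u neo4j -p <password> -f import/graph.cypher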

You will now be able to explore the graph imported in your own Neo4j instance.

Explore and visualize entity links

Once your graph is created and you can access it (see the export and import instructions above if you can't access Datashare's Neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.

Connect to your database

Once you can access your graph, you can use different tools to visualize and explore it, starting by connecting them to your DB.

Visualize and explore with Neo4j Bloom

Neo4j Bloom is a simple and powerful tool developed by Neo4j to quickly visualize and query graphs, available if you run Neo4j Enterprise Edition. Bloom lets you navigate and explore the graph through a user interface similar to the one below:

Neo4j Bloom is accessible from inside the Neo4j Desktop app.

Find out more about how to use Neo4j Bloom to explore your graph in Bloom's official documentation and tutorials.

Query the graph with Neo4j Browser

The Neo4j Browser lets you run queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs; running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:

The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:

  • Inside the Neo4j Desktop app when running Neo4j from the app

  • At http://localhost:7474 when running Neo4j with Docker

Visualize and explore with Linkurious Enterprise Explorer

Linkurious Enterprise Explorer is proprietary software which, similarly to Neo4j Bloom, lets you visualize and query your graph through a powerful UI.

Find out more information on the Linkurious website.

Visualize with Gephi

Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare in the GraphML format and import them into Gephi.

Find out more about how to install Gephi and how to import and visualize graphs with it in Gephi's documentation.

Export your graph in the GraphML format

To export the graph in the GraphML format, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Graph ML' export format and at the end of the form, click the 'Export' button:

In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Paths' and 'File types' filters.

You will now be able to visualize your graph with Gephi by opening the exported GraphML file in it.

Explore a project

A project is a collection of documents. Datashare displays statistics about each project.

Expand the menu, open 'All projects' and click on the name of the project that you want to explore:

If you'd like to pin this project in the menu for easy access, click 'Pin to menu':

Your project is now pinned in the menu:

In a project page, in the first tab called 'Insights', you find statistics and a bar chart displaying the number of documents by creation date.

Filter this chart by path by clicking 'Select path':

Click on one bar for a year or month to see all the corresponding documents:

On the 'Languages', 'File Types' and 'Authors' widgets, you see stats:

Search all documents matching a specific criterion, for instance here the French language:

Finally, in the server collaborative mode, you see the Latest recommended documents, that is to say the documents marked as recommended by other members of the project:

You can now search documents.

Batch search documents

Batch searches allow you to get the results of every query in a list all at once: instead of running each query one by one, upload a list, set options and filters, and see the matching documents.

1

Prepare a CSV list

Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)

Write your queries in the first column of the spreadsheet, typing one query per line:

  • Do not put line break(s) in any of your cells.

To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them by nothing or a space.

  • Write 2 characters minimum in each query. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to a 'failure'.

  • If you have blank cells in your spreadsheet...

...the CSV, which stands for 'Comma-separated values', will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:

To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.

  • If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:

Remove all commas in your spreadsheet if you want to avoid exact phrase search.

  • Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:

    Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)

  • Use operators in your CSV: AND NOT * ? ! + - and other operators work in batch searches as they do in the regular search bar, but only if "Do phrase matches" at step 3 is turned off. You can thus turn it off and write your queries like this, for instance:

    Paris NOT Barcelona AND Taipei

  • Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.

  • Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.

2

Export the list as a CSV

Export your spreadsheet of queries in a CSV format:

Important: Use the UTF-8 encoding in your spreadsheet software's settings.

  • LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':

  • Microsoft Excel: if it is not set by default, select "CSV UTF-8" as the format.

  • Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".

3

Create the batch search

Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:

Alternatively, in the menu next to 'Batch searches', click the 'Plus' button:

The form to create a batch search opens:

  • 'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query. In that case, it won't apply any operators (AND, OR, etc.) that would be in the queries. If 'Do phrase matches' is off, queries are searched without double quotes and with potential operators.

  • What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), use fuzziness.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: ).

Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

  • What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)

“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)

Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"

Once you have filled in all steps, click 'Create' and wait for the batch search to complete.

4

Explore your results

In the menu, click 'Batch searches' and click the name of the batch search to open it:

See the number of matching documents per query:

Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). The query position puts the queries back in their original order from the CSV.

To explore a query's matching documents, click its name and see the list of matching documents:

Click a document's name to open it. Use the page settings or the column's names to sort documents.

5

Relaunch a batch search (optional)

If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.

The relaunched batch search will apply to both newly indexed and previously indexed documents (not only the newly indexed ones).

In 'Batch searches', go at the end of the table and click the 'Relaunch' icon:

Or click 'Relaunch' in the batch search page below its name on the right panel:

Change its name and description, and decide whether to delete the current batch search after relaunch:

See your relaunched batch search in the list of batch searches:

6

Failures

Failures in batch searches can be due to several causes.

The first query containing an error makes the batch search fail and stop.

Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:

Check the first failure-generating query in the error window:

Here it says:

The first line contained a comma while it shouldn't. Datashare interpreted this query as a syntax error, so it failed and the batch search stopped.

Check the recommendations for preparing your CSV above.

We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software, and re-creating the batch search.

'elasticsearch: Name does not resolve'

If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't reach ElasticSearch, its search engine.

In that case, you need to re-open Datashare: check how for Mac, Windows or Linux.

Example of a message regarding a problem with ElasticSearch:

SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'

'Data too large'

One of your queries can lead to a 'Data too large' error.

It means that this query had too many results, or that some documents among its results were too big for Datashare to process. This makes the search engine fail.

We recommend removing the query responsible for the error and re-starting your batch search without it.

Star, tag and recommend

Star documents, tag them or, in server mode, recommend them to the project's other members.

Star documents

In server collaborative mode, starring documents is private. Other members of your projects can't see your starred documents.

Star a single document

Click the star icon either at the right of the document's card or at the top right of the document:

Click on the same icons to unstar.

Star multiple documents

Open the selection mode by clicking the multiple cards icon on the left of the pagination:

Select the documents you want to star:

Click the star filled icon:

To unstar documents, click the three-dot icon if necessary and click Unstar:

Filter starred documents

Open the filters by clicking the 'Filters' button on the left of the search bar:

In the 'User data' category, open 'Starred' and tick the 'Starred' checkbox:

Tag documents

Tags are always in lower case letters. They can contain numbers, hyphens and special characters but not commas nor semicolons (which are the keyboard shortcuts to add the tags).

In server collaborative mode, tags are public to the project's other members. You can see their tags and they can see yours.

Tag a single document

Open a document in 'Search' > 'Documents' and, above the document's name, click the hashtag icon:

It opens the Tags panel on the left:

Type your tag and press Enter or click 'Add':

Your tag is now displayed in the 'Added by you' category:

Remove your tag, or others' tags, by clicking their cross icon:

Tag multiple documents

Open the selection mode by clicking the multiple cards icon on the left of the pagination:

Select the documents you want to tag:

Click the three-dot icon if necessary and click 'Tag':

Type your tag, or type multiple tags separated by commas, and click 'Add':

Remove your tag, or others' tags, by clicking their cross icon on each single document (you cannot untag multiple documents):

Filter tagged documents

Open the filters by clicking the 'Filters' button on the left of the search bar:

In the 'User data' category, open 'Tags' and tick the checkboxes of the tags you want to filter by:

Recommend a document

In server collaborative mode, recommending documents is public to the project's other members. All members can see who recommended which documents.

Open the menu > 'Search' > 'Documents', open a document and, above the document's name, click the eyes icon:

It opens the Recommendations panel on the left:

Click on the 'Mark as recommended' button:

The document is now marked as recommended by you:

Click 'Unmark as recommended' to remove your recommendation.

Filter recommended documents

Open the filters by clicking the 'Filters' button on the left of the search bar:

In the 'User data' category, open 'Recommended by' and tick the 'Username' checkboxes for documents recommended by the users you want to filter:

Can I remove document(s) from Datashare?

In local mode, you cannot remove a single document or a selection of documents from Datashare. But you can remove all your projects and documents from Datashare.

Open the menu and on the bottom of the menu, click the trash icon:

A confirmation window opens. The action cannot be undone. It removes all the projects and their documents from Datashare. Click 'Yes' if you are sure:

For advanced users - if you'd like to do it with the Terminal, here are the instructions:

  • If you're using Mac: rm -Rf ~/Library/Datashare/index

  • If you're using Windows: rd /s /q "%APPDATA%\Datashare\index"

  • If you're using Linux: rm -Rf ~/.local/share/datashare/index

Do you recommend OS or machines for large corpuses?

Datashare was created with scalability in mind, which gave ICIJ the ability to index terabytes of documents.

To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).

The most complex operation is OCR (we use Apache Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
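For instance, an indexing run with OCR disabled could look like this sketch (--ocr comes from above; the -m CLI, --stages and --dataDir options are assumed from the Datashare CLI):

# Scan and index a corpus without running OCR on embedded images
datashare -m CLI --stages SCAN,INDEX --dataDir /path/to/corpus --ocr false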

What should I do if I get more than 10,000 results?

In Datashare, for technical reasons, it is not possible to open the 10,000th result or any result beyond it.

Example: you search for "Paris" and get 15,634 results. You'd be able to open the first 9,999 results but no more. The same limit applies even if you didn't run any search.

As this limit cannot be lifted, here are some tips:

  • Refine your search: use filters to narrow down your results and ensure you have fewer than 10,000 matching documents

  • Change the sorting of your results: use 'creation date' or 'alphabetical order', for instance, instead of the default sorting, which ranks by relevance score

  • Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents

How can I contact ICIJ for help, bug reporting or suggestions?

You can send an email to datashare@icij.org.

When reporting a bug, please share:

  • Your OS (Mac, Windows or Linux) and version

  • The problem, with screenshots

  • The actions that led to the problem

Or you can post an issue with your logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues

Example of launching Datashare in server mode with an OAuth provider:

docker run -ti icij/datashare:version --mode SERVER \
    --oauthClientId 30045255030c6740ce4c95c \
    --oauthClientSecret 10af3d46399a8143179271e6b726aaf63f20604092106 \
    --oauthAuthorizeUrl https://my.oauth-server.org/oauth/authorize \
    --oauthTokenUrl https://my.oauth-server.org/oauth/token \
    --oauthApiUrl https://my.oauth-server.org/api/v1/me.json \
    --oauthCallbackPath /auth/callback

Can I download a document from Datashare?

Yes, you can download a document from Datashare.

Download a document

Open the menu > 'Search' > 'Documents' and click on the download icon on the right of documents' cards:

...or on the top right of an opened document:

Batch download documents

You can also batch download all the documents that match a search. Batch downloads are limited to 100 MB.

Open the menu > 'Search' > 'Documents', run your queries and apply filters. Once the results of a search are the ones you want, click the download icon on the right of the results:

Find your batch downloads as zip files in the menu > 'Tasks' > 'Batch downloads':

Click on a batch download's name to download it:

Can't download?

If you can't download a document, it means that:

  • either Datashare has been badly initialized: restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's GitHub (https://github.com/ICIJ/datashare/issues).

  • or you are using the server collaborative mode and the admins prevented users from downloading documents.

How can we use Datashare on a collaborative mode on a server?

You can use Datashare with multiple users accessing a centralized database on a server.

Warning: setting up and maintaining the server mode requires some technical knowledge.

Please find the documentation here.

Can I use an external drive as data source?

Warning: this requires some technical knowledge.

You can make Datashare follow soft links: add --followSymlinks when Datashare is launched.

If you're on Mac or Windows, you must change the launch script.

If you're on Linux, you can add the option after the Datashare command.
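For example, on Linux, a minimal sketch could be (the paths are hypothetical, and ~/Datashare is assumed to be your data directory):

# Link the external drive into the data directory, then launch with symlinks enabled
ln -s /media/my-external-drive ~/Datashare/external-drive
datashare --followSymlinks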

Advanced: how can I do bulk actions with Tarentula?

Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:

  • Clean Tags by Query

  • Download

  • Export by Query

  • Tagging

  • CSV formats

  • Tagging by Query

Please find all the use cases in Datashare Tarentula's GitHub documentation.
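As an illustration only, a Tarentula session could look like the sketch below; the commands and flags here are placeholders, so check the GitHub documentation for the exact syntax:

# Install the Python CLI, then run a bulk action against a Datashare instance
pip install tarentula
tarentula tagging --datashare-url http://localhost:8080 --datashare-project local-datashare tags.csv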

What is fuzziness?

As a search operator

In the main search bar, you can add the tilde (~) search operator followed by a number at the end of any word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).

Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

In batch searches

When you run a batch search, you can set the fuzziness to 0, 1 or 2. It works exactly as described above: it applies to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

Common errors

👷‍♀️ This page is currently being written by the Datashare team.

What are NLP pipelines?

Natural Language Processing (NLP) pipelines are tools that automatically identify entities in your documents. You can only choose one model at a time for a given entity detection task.

Open the menu > 'Tasks' > 'Entities' and follow these instructions. Select 'CoreNLP' if you want the model most likely to work well on most documents.

What are proximity searches?

As a search operator

In the main search bar, you can write an exact query in double quotes followed by the tilde (~) search operator and a number at the end of your query. The number corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

Examples:

the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)

the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)

"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).

Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"

The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elastic).

In batch searches

When you run a batch search, if you turn 'Do phrase matches' on, you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

List of common errors leading to "failure" in Batch Searches

SearchException: query='AND ada'

One or several of your queries contain syntax errors.

It means that you wrote one or more of your queries the wrong way, with some characters that are reserved as operators: see the list of common syntax errors below.

You need to correct the error(s) in your CSV and relaunch a new batch search with a CSV that does not contain errors. Check how to create a batch search.

Datashare stops at the first syntax error and reports only that first error. You might need to check all your queries, as some errors can remain after correcting the first one.

Example of a syntax error message:

SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'

elasticsearch: Name does not resolve

If you get a message containing 'elasticsearch: Name does not resolve', it means that Datashare can't reach Elasticsearch, its search engine.

In that case, you need to restart Datashare: check how to do so for Mac, Windows or Linux.

Example of a message regarding a problem with ElasticSearch:

SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'

What if the 'View' of my documents is 'not available'?

Datashare can display the 'View' tab for some file types only: images, PDF, CSV, XLSX and TIFF. Other document types are not supported yet.

Keyboard shortcuts

Shortcuts help you perform some actions faster.

Open the menu > 'Search' > 'Documents' and click the keyboard icon at the bottom of the menu:

It opens a window with the shortcuts for your OS (Mac, Windows, Linux):

Click on 'See all shortcuts' to reach the full page view:

How can I uninstall Datashare?

Mac

1. Go to Applications

2. Right-click 'Datashare' and click 'Move to Bin'

Windows

Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98

Linux

Use the following command:

sudo apt remove datashare-dist

'We were unable to perform your search.' What should I do?

This can be due to syntax errors in the way you wrote your query.

Here are the most common errors that you should correct:

The query starts with AND

You cannot start a query with AND all uppercase. AND is reserved as a search operator.

The query starts with OR

You cannot start a query with OR all uppercase. OR is reserved as a search operator.

The query contains only one double-quote: "

You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrases.

The query contains only one parenthesis: ( or )

You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.

The query contains only one forward slash: /

You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).

The query starts with or contains a tilde: ~

You cannot start a query with a tilde (~) or write one which contains a tilde. The tilde is reserved as a search operator for fuzziness or proximity searches.

The query ends with an exclamation mark: !

You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.

The query starts with or contains a caret: ^

You cannot start a query with a caret (^) or write one which contains a caret. The caret is reserved as a boosting operator.

The query contains square brackets: [ or ]

You cannot use square brackets except for searching for ranges.

What if Datashare says 'No documents found'?

  • If you were able to see documents earlier in your current session, you might have active filters that prevent Datashare from displaying documents: no document may match your current search. Check your URL for active filters and, if you don't mind losing your previously selected filters, open the menu > 'Search' > 'Documents', open the search breadcrumb on the left of the search bar and click 'Clear filters'.

  • You may not have added documents to Datashare yet. Check how to add documents for Mac, Windows or Linux.

  • In 'Tasks' > 'Documents', in the Progress column, if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you added, it can take multiple hours.

Frontend

API

The Datashare API is fully defined using the OpenAPI 3.0 specification and automatically generated after every Datashare release.

The OpenAPI spec is a language-agnostic, machine-readable document that describes all of the API’s endpoints, parameter and response schemas, security schemes, and metadata. It empowers developers to discover available operations, validate requests and responses, generate client libraries, and power interactive documentation tools.

You can download the latest version of the API definition in JSON or explore an instantly browsable, developer-friendly interface with Redoc.
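For instance, to pull the spec from a running local instance, something like this sketch may work (the /api/openapi path and the port are assumptions, not confirmed by this page):

# Download the OpenAPI definition as JSON from a local Datashare
curl -s http://localhost:8080/api/openapi -o datashare-openapi.json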

Backend

Write extensions

What if you want to add features to Datashare backend?

Unlike plugins, which provide a way to modify the Datashare frontend, extensions were created to extend backend functionality. Two extension points have been defined:

  • NLP pipelines: you can add a new Java NLP pipeline to Datashare

  • HTTP API: you can add HTTP endpoints to Datashare and call the Java API you need in those endpoints

Since version 7.5.0, instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user could pick the extensions they need or want, and have a fully customized installation of our search platform.

Getting started

When starting, Datashare can receive an extensionsDir option, pointing to your extensions' directory. In this example, let's call it /home/user/extensions:

mkdir /home/user/extensions
datashare --extensionsDir=/home/user/extensions

Installing and Removing registered extensions

Listing

You can list official Datashare extensions like this:

$ datashare -m CLI --extensionList
2020-08-29 09:27:51,219 [main] INFO  Main - Running datashare 
extension datashare-extension-nlp-opennlp
        OPENNLP Pipeline
        7.0.0
        https://github.com/ICIJ/datashare-extension-nlp-opennlp/releases/download/7.0.0/datashare-nlp-opennlp-7.0.0-jar-with-dependencies.jar
        Extension to extract NER entities with OPENNLP
        NLP
...

You can pass a regular expression to --extensionList to filter the extension list if you know what you are looking for.
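For instance, to list only NLP-related extensions:

$ datashare -m CLI --extensionList ".*nlp.*"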

Installing

You can install an extension by providing its id and the directory where the Datashare extensions are stored:

$ datashare -m CLI --extensionInstall datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions"
2020-08-29 09:34:30,927 [main] INFO  Main - Running datashare 
2020-08-29 09:34:32,632 [main] INFO  Extension - downloading from url https://github.com/ICIJ/datashare-extension-nlp-mitie/releases/download/7.0.0/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
2020-08-29 09:34:36,324 [main] INFO  Extension - installing extension from file /tmp/tmp218535941624710718.jar into /home/user/extensions

Then if you launch Datashare with the same extension location, the extension will be loaded.

Removing

When you want to stop using an extension, you can either remove the jar by hand from the extensions folder or remove it with datashare --extensionDelete:

$ datashare -m CLI --extensionDelete datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions/"
2020-08-29 09:40:11,033 [main] INFO  Main - Running datashare 
2020-08-29 09:40:11,249 [main] INFO  Extension - removing extension datashare-extension-nlp-mitie jar /home/user/extensions/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar

Create your first extension

NLP extension

You can create a "simple" java project like https://github.com/ICIJ/datashare-extension-nlp-opennlp (as simple as a java project can be right), with you preferred build tool.

You will have to add a dependency on the latest version of datashare-api.jar to be able to implement your NLP pipeline.

With the datashare-api dependency you can then create a class implementing Pipeline or extending AbstractPipeline. When Datashare loads the jar, it will look for a Pipeline implementation.

Unfortunately, you'll also have to make a pull request to datashare-api to add the new type of pipeline. We will remove this step in the future.

Build the jar with its dependencies and install it in /home/user/extensions, then start Datashare with extensionsDir set to /home/user/extensions. Your extension will be loaded by Datashare.

Finally, your pipeline will be listed among the available pipelines in the UI when running named-entity recognition (NER).

HTTP extension

Making an HTTP extension works the same way as an NLP one: you'll have to make a Java project that builds a jar. The only dependency you will need is fluent-http, because Datashare will look for fluent-http annotations (@Get, @Post, @Put...).

For example, we can create a small class like:

package org.myorg;

import net.codestory.http.annotations.Get;
import net.codestory.http.annotations.Prefix;

@Prefix("myorg")
public class FooResource {
    @Get("foo")
    public String getFoo() {
        return "hello from foo extension";
    }
}

Build the jar, copy it to /home/user/extensions, then start Datashare:

$ datashare --extensionsDir /home/user/extensions/
# ... starting logs
2020-08-29 11:03:59,776 [Thread-0] INFO  ExtensionLoader - loading jar /home/user/extensions/my-extension.jar
2020-08-29 11:03:59,779 [Thread-0] INFO  CorsFilter - adding Cross-Origin Request filter allows *
2020-08-29 11:04:00,314 [Thread-0] INFO  Fluent - Production mode
2020-08-29 11:04:00,331 [Thread-0] INFO  Fluent - Server started on port 8080

et voilà 🔮! You can query your new endpoint. Easy, right?

$ curl localhost:8080/myorg/foo
hello from foo extension

Installing and Removing your custom Extension

You can also install and remove extensions with the Datashare CLI.

Build your extension jar, then install it with:

$ datashare -m CLI --extensionInstall /home/user/src/my-extension/dist/my-extension.jar --extensionsDir "/home/user/extensions"
2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
2020-07-27 10:02:32,596 [main] INFO  ExtensionService - installing extension from file /home/user/src/my-extension/dist/my-extension.jar into /home/user/extensions

And remove it:

$ datashare -m CLI --extensionDelete my-extension.jar --extensionsDir "/home/user/extensions"
2020-08-29 10:45:37,363 [main] INFO  Main - Running datashare 
2020-08-29 10:45:37,579 [main] INFO  Extension - removing extension my-extension jar /home/user/extensions/my-extension.jar

How to contribute

👷‍♀️ This page is currently being written by the Datashare team.

Database Schema

api_key

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(96) | not null |
user_id | character varying(96) | not null |
creation_date | timestamp without time zone | not null |

Constraints and indexes

  • api_key_pkey PRIMARY KEY, btree (id)

  • api_key_user_id_key UNIQUE CONSTRAINT, btree (user_id)
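To compare these definitions with a live database, a quick sketch with psql (the database name datashare is an assumption, it depends on your configuration):

# Show the live definition of the api_key table
psql datashare -c '\d api_key'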


batch_search

Column | Type | Nullable | Default
------ | ---- | -------- | -------
uuid | character(36) | not null |
name | character varying(255) | |
description | character varying(4096) | |
user_id | character varying(96) | not null |
batch_date | timestamp without time zone | not null |
state | character varying(8) | not null |
published | integer | not null | 0
phrase_matches | integer | not null | 0
fuzziness | integer | not null | 0
file_types | text | |
paths | text | |
error_message | text | |
batch_results | integer | | 0
error_query | text | |
query_template | text | |
nb_queries | integer | | 0
uri | text | |
nb_queries_without_results | integer | |

Constraints and indexes

  • batch_search_pkey PRIMARY KEY, btree (uuid)

  • batch_search_date btree (batch_date)

  • batch_search_nb_queries btree (nb_queries)

  • batch_search_published btree (published)

  • batch_search_user_id btree (user_id)

Referenced by

  • TABLE batch_search_project CONSTRAINT batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)


batch_search_project

Column | Type | Nullable | Default
------ | ---- | -------- | -------
search_uuid | character(36) | not null |
prj_id | character varying(96) | not null |

Constraints and indexes

  • batch_search_project_unique UNIQUE, btree (search_uuid, prj_id)

  • batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)


batch_search_query

Column | Type | Nullable | Default
------ | ---- | -------- | -------
search_uuid | character(36) | not null |
query_number | integer | not null |
query | text | not null |
query_results | integer | | 0

Constraints and indexes

  • batch_search_query_search_id btree (search_uuid)

  • idx_query_result_batch_unique UNIQUE, btree (search_uuid, query)


batch_search_result

Column | Type | Nullable | Default
------ | ---- | -------- | -------
search_uuid | character(36) | not null |
query | text | not null |
doc_nb | integer | not null |
doc_id | character varying(96) | not null |
root_id | character varying(96) | not null |
doc_path | character varying(4096) | not null |
creation_date | timestamp without time zone | |
content_type | character varying(255) | |
content_length | bigint | |
prj_id | character varying(96) | |

Constraints and indexes

  • batch_search_result_prj_id btree (prj_id)

  • batch_search_result_query btree (query)

  • batch_search_result_uuid btree (search_uuid)


document

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(96) | not null |
path | character varying(4096) | not null |
project_id | character varying(96) | not null |
content | text | |
metadata | text | |
status | smallint | |
extraction_level | smallint | |
language | character(2) | |
extraction_date | timestamp without time zone | |
parent_id | character varying(96) | |
root_id | character varying(96) | |
content_type | character varying(256) | |
content_length | bigint | |
charset | character varying(32) | |
ner_mask | smallint | |

Constraints and indexes

  • document_pkey PRIMARY KEY, btree (id)

  • document_parent_id btree (parent_id)

  • document_status btree (status)


document_tag

Column | Type | Nullable | Default
------ | ---- | -------- | -------
doc_id | character varying(96) | not null |
label | character varying(64) | not null |
prj_id | character varying(96) | |
user_id | character varying(255) | |
creation_date | timestamp without time zone | not null | '1970-01-01 00:00:00'::timestamp without time zone

Constraints and indexes

  • document_tag_doc_id btree (doc_id)

  • document_tag_label btree (label)

  • document_tag_project_id btree (prj_id)

  • idx_document_tag_unique UNIQUE, btree (doc_id, label)


document_user_recommendation

Column | Type | Nullable | Default
------ | ---- | -------- | -------
doc_id | character varying(96) | not null |
user_id | character varying(96) | not null |
prj_id | character varying(96) | |
creation_date | timestamp without time zone | | now()

Constraints and indexes

  • document_user_mark_read_doc_id btree (doc_id)

  • document_user_mark_read_project_id btree (prj_id)

  • document_user_mark_read_user_id btree (user_id)

  • idx_document_mark_read_unique UNIQUE, btree (doc_id, user_id, prj_id)


document_user_star

Column | Type | Nullable | Default
------ | ---- | -------- | -------
doc_id | character varying(96) | not null |
user_id | character varying(96) | not null |
prj_id | character varying(96) | |

Constraints and indexes

  • document_user_star_doc_id btree (doc_id)

  • document_user_star_project_id btree (prj_id)

  • document_user_star_user_id btree (user_id)

  • idx_document_star_unique UNIQUE, btree (doc_id, user_id, prj_id)


named_entity

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(96) | not null |
mention | text | not null |
offsets | text | not null |
extractor | smallint | not null |
category | character varying(8) | |
doc_id | character varying(96) | not null |
root_id | character varying(96) | |
extractor_language | character(2) | |
hidden | boolean | |

Constraints and indexes

  • named_entity_pkey PRIMARY KEY, btree (id)

  • named_entity_doc_id btree (doc_id)


note

Column | Type | Nullable | Default
------ | ---- | -------- | -------
project_id | character varying(96) | not null |
path | character varying(4096) | |
note | text | |
variant | character varying(16) | |
blur_sensitive_media | boolean | not null | false

Constraints and indexes

  • idx_unique_note_path_project UNIQUE, btree (project_id, path)

  • note_project btree (project_id)


project

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(255) | not null |
path | character varying(4096) | |
allow_from_mask | character varying(64) | |
label | character varying(255) | |
publisher_name | character varying(255) | | ''::character varying
maintainer_name | character varying(255) | | ''::character varying
source_url | character varying(2048) | | ''::character varying
logo_url | character varying(2048) | | ''::character varying
creation_date | timestamp without time zone | | now()
update_date | timestamp without time zone | | now()
description | character varying(4096) | | ''::character varying

Constraints and indexes

  • project_pkey PRIMARY KEY, btree (id)


task

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(96) | not null |
name | character varying(128) | not null |
state | character varying(16) | not null |
user_id | character varying(96) | |
group_id | character varying(128) | |
progress | double precision | | 0
created_at | timestamp without time zone | not null |
completed_at | timestamp without time zone | |
retries_left | integer | |
max_retries | integer | |
args | text | |
result | text | |
error | text | |

Constraints and indexes

  • task_pkey PRIMARY KEY, btree (id)

  • task_created_at btree (created_at)

  • task_group btree (group_id)

  • task_name btree (name)

  • task_state btree (state)

  • task_user_id btree (user_id)


user_history

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | integer | not null | generated by default as identity
creation_date | timestamp without time zone | not null |
modification_date | timestamp without time zone | not null |
user_id | character varying(96) | not null |
type | smallint | not null |
name | text | |
uri | text | not null |

Constraints and indexes

  • user_history_pkey PRIMARY KEY, btree (id)

  • idx_user_history_unique UNIQUE, btree (user_id, uri)

  • user_history_creation_date btree (creation_date)

  • user_history_type btree (type)

  • user_history_user_id btree (user_id)

Referenced by

  • TABLE user_history_project CONSTRAINT user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)


user_history_project

Column | Type | Nullable | Default
------ | ---- | -------- | -------
user_history_id | integer | not null |
prj_id | character varying(96) | not null |

Constraints and indexes

  • user_history_project_unique UNIQUE, btree (user_history_id, prj_id)

  • user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)


user_inventory

Column | Type | Nullable | Default
------ | ---- | -------- | -------
id | character varying(96) | not null |
email | text | |
name | character varying(255) | |
provider | character varying(255) | |
details | text | | '{}'::text

Constraints and indexes

  • user_inventory_pkey PRIMARY KEY, btree (id)


Write plugins

What if you want to integrate text translations into Datashare's interface? Or make it display tweets scraped with Twint? Ask no more: there are plugins for that!

Since version 5.6.1, instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user could pick the plugins they need or want, and have a fully customized installation of our search platform.

Getting started

When starting, Datashare can receive a pluginsDir option, pointing to your plugins' directory. In this example, this directory is called ~/Datashare Plugins:

mkdir ~/Datashare\ Plugins
datashare --pluginsDir=~/Datashare\ Plugins

Installing and Removing registered plugins

Listing

You can list official Datashare plugins like this:

$ datashare -m CLI --pluginList ".*"
2020-07-24 10:04:59,767 [main] INFO  Main - Running datashare 
plugin datashare-plugin-site-alert
        Site Alert
        v1.2.0
        https://github.com/ICIJ/datashare-plugin-site-alert
        A plugin to display an alert banner on the Datashare demo instance.
...

The string given to --pluginList is a regular expression. You can filter the plugin list if you know what you are looking for.

Installing

You can install a plugin by providing its id and the directory where the Datashare plugins are stored:

$ datashare -m CLI --pluginInstall datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:15:46,732 [main] INFO  Main - Running datashare 
2020-07-24 10:15:50,202 [main] INFO  PluginService - downloading from url https://github.com/ICIJ/datashare-plugin-site-alert/archive/v1.2.0.tar.gz
2020-07-24 10:15:50,503 [main] INFO  PluginService - installing plugin from file /tmp/tmp7747128158158548092.gz into /home/dev/Datashare Plugins

Then if you launch Datashare with the same plugin location, the plugin will be loaded.

Removing

When you want to stop using a plugin, you can either remove the directory by hand from the plugins folder or remove it with datashare --pluginDelete:

$ datashare -m CLI --pluginDelete datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
2020-07-24 10:20:43,431 [main] INFO  Main - Running datashare 
2020-07-24 10:20:43,640 [main] INFO  PluginService - removing plugin base directory /home/dev/Datashare Plugins/datashare-plugin-site-alert-1.2.0

Create your first plugin

To inject plugins, Datashare will look for a Node-compatible module in ~/Datashare Plugins. This way we can rely on NPM/Yarn to handle built packages. As described in the NPM documentation, it can be:

  • A folder with a package.json file containing a "main" field.

  • A folder with an index.js file in it.

Datashare will read the content of each module in the plugins directory to automatically inject them in the user interface. The backend will serve the plugin files. The entrypoint of each plugin (usually the main property of package.json) is injected with a <script> tag, right before the closing </body> tag.

Create a hello-world directory with a single index.js:

mkdir ~/Datashare\ Plugins/hello-world
echo "console.log('Welcome to %s', datashare.config.get('app.name'))" > ~/Datashare\ Plugins/hello-world/index.js

Reload the page, open the console: et voilà 🔮! Easy, right?

Installing and Removing your custom Plugin

Now you may want to develop your plugin in your own repository, and not necessarily in the Datashare Plugins folder.

You can keep your code under, say, ~/src/my-plugin and deploy it into Datashare with the plugin API. To do so, you'll need to make a zip or a tarball, for example in ~/src/my-plugin/dist/my-plugin.tgz.
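One way to build such a tarball, as a sketch matching the listing below:

# From the sources, pack the plugin files into dist/my-plugin.tgz
cd ~/src
mkdir -p my-plugin/dist
tar czvf my-plugin/dist/my-plugin.tgz my-plugin/main.js my-plugin/package.json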

The tarball could contain:

$ tar tvzf ~/src/my-plugin/dist/my-plugin.tgz 
drwxr-xr-x dev/dev           0 2020-07-22 11:51 my-plugin/
-rw-r--r-- dev/dev          31 2020-07-21 14:07 my-plugin/main.js
-rw-r--r-- dev/dev          19 2020-07-21 14:07 my-plugin/package.json
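One way to produce such a tarball, assuming main.js and package.json sit at the root of ~/src/my-plugin (the paths below follow this example):

cd ~/src
mkdir -p my-plugin/dist
tar czvf my-plugin/dist/my-plugin.tgz my-plugin/main.js my-plugin/package.json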

Then you can install it with:

$ datashare -m CLI --pluginInstall ~/src/my-plugin/dist/my-plugin.tgz --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
2020-07-27 10:02:32,596 [main] INFO  PluginService - installing plugin from file ~/src/my-plugin/dist/my-plugin.tgz into ~/Datashare Plugins

And remove it:

$ datashare -m CLI --pluginDelete my-plugin --pluginsDir "~/Datashare Plugins"
2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
2020-07-27 10:02:32,596 [main] INFO  PluginService - removing plugin base directory ~/Datashare Plugins/my-plugin

In this case, my-plugin is the base directory of the plugin (the one contained in the tarball).

Adding elements to the Datashare user interface

To allow external developers to add their own components, we added markers at strategic locations of the user interface where a user can define new Vue components. These markers are called "hooks".

Note: You can make all hooks visible by setting a config variable from your plugin: datashare.config.set('hooksDebug', true).

To register a new component to a hook, use the following method:

// `datashare` is a global variable
datashare.registerHook({ target: 'app-sidebar.menu:before', definition: 'This is a message written with a plugin' })

Or with a more complex example:

// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', ({ detail }) => {

  // Alert is a Vue component meaning it can have computed properties, methods, etc...
  const Alert = {
    computed: {
      weekday () {
        const today = new Date()
        return today.toLocaleDateString('en-US', { weekday: 'long' })  
      }
    },
  template: `<div class="text-center bg-info p-2 w-100">
      It's {{ weekday }}, have a lovely day!
    </div>`
  }

  // This is the most important part of this snippet: 
  // we register the component on a given `target`
  // using the core method `registerHook`. 
  detail.core.registerHook({ target: 'landing.form:before', definition: Alert })

})

Scripting with Playground

Datashare Playground delivers a collection of Bash scripts (free of external dependencies) that streamline interaction with a Datashare instance’s Elasticsearch index and Redis queue.

From cloning or replacing whole indices and reindexing specific directories, to adjusting replica settings, monitoring or cancelling long-running tasks, and queuing files for processing, Playground implements each capability through intuitive shell scripts organized under the elasticsearch/ and redis/ directories.

To get started, set ELASTICSEARCH_URL and REDIS_URL in your environment (or add them to a .env file at the repo root). For a comprehensive guide to script options, directory layout, and example workflows, see the full documentation on GitHub: https://github.com/ICIJ/datashare-playground
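As an example, a local setup could export something like this (both URLs below are illustrative defaults, adjust them to your own instance):

export ELASTICSEARCH_URL=http://localhost:9200
export REDIS_URL=redis://localhost:6379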

Use Playground to update an index's mappings and settings

Some Datashare updates bring fixes and improvements to the index mappings and settings. When this happens, the index has to be reindexed accordingly.

1. Create a temporary empty index and specify the desired Datashare version number:

./elasticsearch/index/create.sh <temporary_index> <ds_version_number>

2. Reindex all documents (under the "/" path) from the original index into the temporary one:

This step can take some time if your index holds a large number of documents.

./elasticsearch/documents/reindex.sh <original_index> <temporary_index> /

3. Replace the old index with the new one:

./elasticsearch/index/replace.sh <temporary_index> <original_index>

4. Delete the temporary index:

./elasticsearch/index/delete.sh <temporary_index>
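Putting the four steps together, a migration of the default local-datashare index could look like this (the temporary index name and version number below are only illustrative):

./elasticsearch/index/create.sh local-datashare-tmp 17.0.0
./elasticsearch/documents/reindex.sh local-datashare local-datashare-tmp /
./elasticsearch/index/replace.sh local-datashare-tmp local-datashare
./elasticsearch/index/delete.sh local-datashare-tmp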

CLI with Tarentula

Datashare Tarentula is a powerful command-line toolbelt designed to streamline bulk operations against any Datashare instance.

Whether you need to count indexed files, download large datasets, batch-tag records, or run complex Elasticsearch aggregations, Tarentula provides a consistent, scriptable interface with flexible query support and Docker compatibility.

It also exposes a Python API for embedding automated workflows directly into your data pipelines. With commands like count, download, aggregate, and tagging-by-query, you can handle millions of records in a single invocation, or integrate Tarentula into CI/CD pipelines for reproducible data tasks.

You can install Tarentula with your favorite package manager:

pip3 install --user tarentula

Or alternatively with Docker:

docker run icij/datashare-tarentula
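Once installed, a typical invocation could look like the following sketch, which counts the documents matching a query (the URL and project name below are placeholders, and the exact flags should be checked against the Tarentula documentation):

tarentula count --datashare-url http://localhost:8080 --datashare-project local-datashare --query "*"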

For the complete list of commands, options, and examples, read the documentation on GitHub: https://github.com/ICIJ/datashare-tarentula

Design System

Datashare's frontend is built with Vue 3 and Bootstrap 5. We document all components of the interface on a dedicated Storybook.

To ease the creation of plugins, each component can be imported directly from the core:

// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', async () => {
    // This loads the ButtonIcon component asynchronously
    const ButtonIcon = await datashare.findComponent('Button/ButtonIcon')
    // Then we create a dummy component. For the sake of simplicity we use
    // Vue 3's Options API, but we strongly encourage you to build your
    // plugins with Vite and single-file components.
    const definition = {
        components: {
            ButtonIcon,
        },
        methods: {
            sayHi() {
                alert('Hi!')
            }
        },
        template: `
            <button-icon @click="sayHi()" icon-left="hand-waving">
                Say hi
            </button-icon>
        `
    }
    
    // Finally, we register the component's definition in a hook.
    datashare.registerHook({ target: 'app-sidebar-sections:before', definition })
})

From this example you learn that:

  • You must wait for Datashare to be ready by listening to the "datashare:ready" event

  • You can asynchronously import components with datashare.findComponent

  • Components can be registered at targeted locations with a "hook"

  • All icons from Phosphor are available and loaded automatically
