Datashare
Ask for help

To report a bug, please open an issue on our GitHub detailing your logs with:

  • Your Operating System (Mac, Windows or Linux)

  • The version of your Operating System

  • The version of Datashare

  • Screenshots of your issue

  • A description of your issue

If, for confidentiality reasons, you don't want to open an issue on GitHub, please write to datashare@icij.org.

Concepts

This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.

About the local mode

In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines.

The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure.

This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

Install on Mac

These pages will help you set up and install Datashare on your computer.

Running modes

Datashare can run in different modes, each with its own features.

Mode        | Category | Description
LOCAL       | Web      | To run Datashare on a single computer for a single user.
SERVER      | Web      | To run Datashare on a server for multiple users.
CLI         | CLI      | To index documents and analyze them directly in the command-line.
TASK_RUNNER | Daemon   | To execute async tasks (batch searches, batch downloads, scan, index, NER extraction, ...).

Web modes

There are two web modes:

In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.

Comparison between modes

The running modes offer different advantages and limitations. This matrix summarizes the differences between the local and server web modes:

Feature         | local | server
Multi-users     |  ❌  |  ✅
Multi-projects  |  ✅  |  ✅
Access-control  |  ❌  |  ✅
Indexing UI     |  ✅  |  ❌
Plugins UI      |  ✅  |  ❌
Extension UI    |  ✅  |  ❌
HTTP API        |  ✅  |  ✅
API Key         |  ✅  |  ✅
Single JVM      |  ✅  |  ❌
Tasks execution |  ✅  |  ❌

When running Datashare in local mode, users can choose to use embedded services (such as Elasticsearch, SQLite and an in-memory key/value store) on the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.

CLI mode

In CLI mode, Datashare starts without a web server and lets users perform tasks over their documents. This mode can be used in conjunction with both local and server modes, and allows users to distribute heavy tasks between several servers.

If you want to learn more about which tasks you can execute in this mode, check out the stages documentation.

Daemon modes

These modes are intended for actions that require "waiting" for pending tasks. The TASK_RUNNER daemon mode executes async tasks (batch searches, batch downloads, scan, index, NER extraction, ...).

In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search and bundle them into a zip file.

In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users go through the dedicated form in Datashare, where they can upload a list of search terms (in CSV format). The daemon then starts a task to search all matching documents and stores every occurrence in the database.

How to change modes

Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. When starting Datashare from the command line, you can explicitly specify the running mode. For instance, on Ubuntu/Debian:

datashare \
  # Switch to SERVER mode
  --mode SERVER \
  # Dummy session filter to create ephemeral users
  --authFilter org.icij.datashare.session.YesCookieAuthFilter \
  # Name of the default project for every user
  --defaultProject local-datashare \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # URI of Redis
  --redisAddress redis://redis:6379 \
  # Store user sessions in Redis
  --sessionStoreType REDIS

Start Datashare

Find the Datashare application on your computer and run it locally on your browser.

Once Datashare is installed, go to 'Finder' > 'Applications', and double-click on 'Datashare':

A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening:

⇒ Important: Keep this Terminal window open as long as you use Datashare.

Once the process is done, Datashare should automatically open in your default internet browser. If it doesn't, type 'localhost:8080' as a URL in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).

You can now add documents to Datashare.

Install on Windows

These pages will help you set up and install Datashare on your computer.

CLI stages

When running Datashare from the command-line, pick which "stage" to apply to analyze your documents.

The CLI stages are primarily intended to be run for an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.

1. SCAN

This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.

datashare --mode CLI \
  # Select the SCAN stage
  --stage SCAN \
  # Where the documents are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Redis
  --redisAddress redis://redis:6379

2. INDEX

The INDEX stage is probably the most important (and heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text, metadata and OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command on several servers.

3. NLP

Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entity mentions in Elasticsearch, it will also mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized on several servers.

Add documents to Datashare

Datashare provides a folder on your Mac to collect documents you want to have in Datashare.

1. Find your Datashare folder on your Mac

Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':

On the menu bar at the top of your computer, click 'Go' and 'Home' (the house icon):

You will see a folder called 'Datashare':

If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:

2. Add documents to your Datashare folder on your Mac

Copy or drop the documents that you want to add to Datashare in this Datashare folder.

3. Launch Datashare

Open your Applications. You should see Datashare. Double-click on it:

4. In the menu, in 'Tasks', open 'Documents'

Expand the menu on the left:

In 'Tasks', open 'Documents':

5. Choose your options

  • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

6. Watch the progress of your document addition

Two extraction tasks are now running:

  • The first is the scanning of your Datashare folder - it checks if there are documents to analyze. It is called 'Scan folders'.

  • The second is the indexing of these files. It is called 'Index documents'.

You can now search documents in Datashare.

About Datashare

Datashare allows you to search in your files, regardless of their format. It is a free open-source software developed by the International Consortium of Investigative Journalists (ICIJ).

What is Datashare?

Welcome to Datashare - a self-hosted document search software. It is free and open-source, developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, this tool is now a fully-featured search interface to dig into your documents.

With the help of several open-source tools (Extract, Apache Tika, Apache Tesseract, CoreNLP, OpenNLP, Elasticsearch, and more), Datashare can be used on a single personal computer as well as on 100 interconnected servers.

Who uses it?

Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested in investigations like the Panama Papers or the Luanda Leaks.

Seeing the growing interest for ICIJ's technology, we decided to open source this key component of our investigations so a single journalist as well as big media organizations could use it for their own documents.

Datashare is free, so anyone can use it and find it useful.

Curious to know more about how we use Datashare?

Where can I see Datashare in action?

We set up a demo instance of Datashare with a small set of documents from the LuxLeaks investigation (2014). When using this instance, you will be assigned a temporary user who can star, tag, recommend and explore documents.

Can I run Datashare on my server?

Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.

Can I customize Datashare?

When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly with our API.
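As an illustration of querying the index directly, advanced users can hit Elasticsearch's standard search API. This is only a sketch: the port 9200 and the index name local-datashare are assumptions based on the defaults shown elsewhere on this page, and may differ in your setup.

```shell
# Full-text query against the Elasticsearch index that backs Datashare,
# using Elasticsearch's standard _search endpoint.
# Assumes Elasticsearch listens on localhost:9200 and the index is "local-datashare".
curl -s "http://localhost:9200/local-datashare/_search?q=content:paris&size=5"
```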

We implemented the possibility to create plugins to make this process more accessible. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.

In which languages is Datashare available?

This project is currently available in English, French and Spanish. You can help improve and complete translations on Crowdin.

Install Datashare

The installer will take care of checking that your system has all the dependencies to run Datashare. Because this software uses Apache Tesseract (to perform Optical Character Recognition, OCR) and Mac doesn't support it out-of-the-box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.

The installer will set up:

  • Xcode Command Line Tools (if neither Xcode nor the Command Line Tools are installed)

  • Homebrew (if neither Homebrew nor MacPorts are installed)

  • Apache Tesseract with MacPorts or Homebrew

  • Java JRE 17

  • Datashare executable

Read more about how we use Datashare:

  • How ICIJ analysed 715,000 Luanda Leaks records

  • Help test and improve our latest journalism tool

  • How Datashare project will help journalists breach borders

  • Launch your own batch search on Datashare's demo

Install on Linux

These pages will help you set up and install Datashare on your computer.

datashare --mode CLI \
  # Select the NLP stage
  --stage NLP \
  # Use CORENLP to detect named entities
  --nlpp CORENLP \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 
datashare --mode CLI \
  # Select the INDEX stage
  --stage INDEX \
  # Where the documents are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # Enable OCR
  --ocr true \
  # URI of Redis
  --redisAddress redis://redis:6379

On the top right, click the 'Plus' button:

Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

  • Note: It is not possible to 'Find entities' while the 'Scan folders' and 'Index documents' tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

    But you can start searching in your documents without having to wait for all tasks to be done.

  • Note: Previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore, but Datashare is still published on the Docker Hub and supported with Docker.

    Installation fails:

    • Error while installing Homebrew or MacPorts: you can manually install Homebrew first and then restart the installer.

    • "System Software from application was blocked from loading": check in your Mac's "System Settings" > "Privacy & Security" whether you have a section with the mention "System software from application 'Datashare' was blocked from loading" or something similar related to Datashare. If you have this section, you'll have to click "Allow" to be able to install Datashare.

    • For any other issue, check our GitHub issues or create a new one with your setup (macOS version) and installer logs (Command+L when the installer is launched and failed).

1. Download Datashare

Go to datashare.icij.org and click 'Download for Mac'.

2. Start the installer

    In Finder, go to your 'Downloads' directory and double-click 'datashare-X.Y.Z.pkg':

3. Go through the Datashare Installer

    Click 'Continue', 'Install', enter your password and 'Install Software':

    You can now start Datashare.


    Install Datashare

    You must have Windows 7 Service Pack 2 or any newer version.

1. Uninstall any prior standard version

Before we start, please uninstall any prior standard version of Datashare if you had already installed it. You can follow these steps: https://www.laptopmag.com/articles/uninstall-programs-windows-10

2. Download Datashare

Go to datashare.icij.org and click 'Download for Windows':

    The file 'datashare-X.Y.Z.exe' is now downloaded. You can find it in your Downloads.

    Double-click on the name of the file in order to execute it.

3. Allow Datashare

    As Datashare is not signed, this popup asks for your permission. Don't click 'Don't run' but click 'More info':

    Click 'Run anyway':

4. Install Datashare

On the Installer Wizard, click 'Install'; it will download and install OpenJDK 11 if it is not already installed on your device:

    The following windows with progress bars will be displayed:

5. Install Tesseract OCR

    To install Tesseract OCR, click the following buttons on the Installer Wizard's windows:

6. Install Datashare.jar

    It is now downloading the back-end and the front-end, Datashare.jar:

    When it is finished, click 'Close':

You can now start Datashare.

    Install Datashare

    Currently, only a .deb package for Debian/Ubuntu is provided.

If you want to run it with another Linux distribution, you can download the latest version of the Datashare jar here: https://github.com/ICIJ/datashare/releases/latest

And adapt the following launch script to your environment: https://github.com/ICIJ/datashare/blob/master/datashare-dist/src/main/deb/bin/datashare.

1. Download Datashare

Go to datashare.icij.org and click 'Download for Linux':

    Save the Debian package as a file:

2. Install the package

3. Run Datashare

You can now start Datashare.

    Start Datashare

    Find the application on your computer and run it locally in your browser.

    Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)

A window called 'Terminal' will have opened, showing the progress of opening Datashare. Keep this Terminal window open as long as you use Datashare.

Datashare should now automatically open in your default internet browser. If it doesn't, type 'localhost:8080' in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).

You can now add documents to Datashare.

    Add documents to Datashare

    Datashare provides a folder to collect documents on your computer to index in Datashare.

1. Add documents in the 'Datashare Data' folder

    When you open your desktop in Windows on your computer, you will see a folder called 'Datashare Data'.

    Move or copy and paste the documents you want to add to Datashare to this folder:

2. Launch Datashare

    You will find it in your main menu:

3. In the menu, in 'Tasks', open 'Documents'

    Expand the menu on the left:

    In 'Tasks', open 'Documents':

4. Choose your options

    • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

5. Watch the progress of your document addition

    Two extraction tasks are now running:

    • The first is the scanning of your 'Datashare Data' folder - it checks if there are documents to analyze. It is called 'Scan folders'.

    • The second is the indexing of these files. It is called 'Index documents'.

    You can now search documents in Datashare.

    Add documents to Datashare

    Datashare provides a folder to collect documents on your computer to index in Datashare.

1. Add documents to your 'Datashare' folder

    You can find a folder called 'Datashare' in your home directory.

    Move the documents you want to add to Datashare into this folder.

2. Launch Datashare

    Launch Datashare and see the interface opening in your default browser.

3. In the menu, in 'Tasks', open 'Documents'

    Expand the menu on the left:

    In 'Tasks', open 'Documents':

4. Choose your options

    • Select the project in Datashare where you want to add your documents. The Default project, which is automatically created, is selected by default.

5. Watch the progress of your document addition

    Two extraction tasks are now running:

    • The first is the scanning of your 'Datashare' folder - it checks if there are documents to analyze. It is called 'Scan folders'.

    • The second is the indexing of these files. It is called 'Index documents'.

    You can now search documents in Datashare.

    Start Datashare

    Find the application on your computer and run it locally on your browser.

    Start Datashare by launching it from the command-line:

Datashare should now automatically open in your default internet browser. If it doesn't, type 'localhost:8080' in your browser.

Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).

It's now time to add documents to Datashare.

    Install plugins and extensions

    This page explains how to locally add plugins and extensions to Datashare.

    Plugins are front-end modules to add new features in Datashare's user interface.

    Extensions are back-end modules to add new features to store and manipulate data with Datashare.

Add plugins to Datashare's front-end

    Install with Docker

This page will help you set up and install Datashare within a Docker container.

Prerequisites

    Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.

    Add more languages

    This page explains how to install language packages to support Optical Character Recognition (OCR) on more languages.

To be able to perform OCR, Datashare uses an open-source technology called Apache Tesseract. When Tesseract extracts text from images, it uses 'language packages' specially trained for each specific language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare, the installer doesn't install them all by default. If Datashare informs you of a missing package, this guide explains how to manually install it on your system.

Install packages on Linux

    To add OCR languages on Linux, simply use the following command:
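The command itself appears to be missing from this page. On Debian/Ubuntu, Tesseract language packages are distributed as tesseract-ocr-<lang> packages; the French package below is only an example, so adapt the language code to your documents:

```shell
# Install the French Tesseract language package; replace "fra" with your
# language's three-letter Tesseract code (e.g. "deu" for German, "spa" for Spanish)
sudo apt install -y tesseract-ocr-fra
```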

    Find entities

    This page helps you find entities (people, organizations, locations, e-mail addresses) in your documents.

Prerequisite: Your documents must be added to Datashare. Check how to do it for Mac, Windows and Linux.

1. In the menu, in 'Tasks', click 'Entities'

Starting Datashare with a single container

To start Datashare within a Docker container, you can use this command:

Make sure the Datashare folder exists in your home directory or this command will fail. This is an example of how to use Datashare with Docker; data will not be persisted.
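To avoid that failure, create the folder first. A minimal sketch, assuming the default $HOME/Datashare path used in the docker run example on this page:

```shell
# Create the Datashare folder in your home directory;
# -p makes this a no-op if the folder already exists
mkdir -p "$HOME/Datashare"
```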

Starting Datashare with multiple containers

    Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).

    By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

    Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store, will be retained even if the containers are restarted or redeployed.

    This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.

To start Datashare with Docker Compose, you can use the following docker-compose.yml file:


    Apple Silicon (M1/M2/M3) users:

    If you encounter the error Error response from daemon: no matching manifest for linux/arm64/v8 in the manifest list entries when pulling the redis Docker image, add the following line to the redis service in your docker-compose.yml:

    This forces Docker to use the x86_64 image, which is necessary because some Redis images do not provide ARM64 builds.

    Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

    The -d flag runs the containers in detached mode, allowing them to run in the background.

    Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:

    Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.

    That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.

    To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

    This will stop and remove the containers, freeing up system resources.

Read more about how to install Docker on your system.
    $ sudo apt install /dir/to/debian/package/datashare-dist_7.2.0_all.deb
    $ datashare
    platform: linux/x86_64
    docker run --mount src=$HOME/Datashare,target=/home/datashare/data,type=bind -p 8080:8080 icij/datashare:11.1.9 --mode EMBEDDED
    version: "3.7"
    services:
    
      datashare:
        image: icij/datashare:18.1.3
        hostname: datashare
        ports:
          - 8080:8080
        environment:
          - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
        volumes:
          - type: bind
            source: ${HOME}/Datashare
            target: /home/datashare/data
          - type: volume
            source: datashare-models
            target: /home/datashare/dist
        command: >-
          --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
          --mode LOCAL
          --tcpListenPort 8080
        depends_on:
          - postgresql
          - redis
          - elasticsearch
    
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
        restart: on-failure
        volumes:
          - type: volume
            source: elasticsearch-data
            target: /usr/share/elasticsearch/data
            read_only: false
        environment:
          - "http.host=0.0.0.0"
          - "transport.host=0.0.0.0"
          - "cluster.name=datashare"
          - "discovery.type=single-node"
          - "discovery.zen.minimum_master_nodes=1"
          - "xpack.license.self_generated.type=basic"
          - "http.cors.enabled=true"
          - "http.cors.allow-origin=*"
          - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
    
      redis:
        image: redis:4.0.1-alpine
        restart: on-failure
    
      postgresql:
        image: postgres:12-alpine
        environment:
          - POSTGRES_USER=datashare
          - POSTGRES_PASSWORD=password
          - POSTGRES_DB=datashare
        volumes:
          - type: volume
            source: postgresql-data
            target: /var/lib/postgresql/data
    
    volumes:
      datashare-models:
      elasticsearch-data:
      postgresql-data:
    docker-compose up -d
    docker-compose logs -f datashare
    docker-compose down

    The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing the Xcode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.

    You can see what it actually does by pressing Command+L: it opens a window which logs every action performed.

    In the end, you should see this screen:

    You can now safely close this window.

    datashare.icij.orgarrow-up-right
    Screenshot of the Downloads window on Mac showing the installer package of Datashare
    Screenshot of the Mac installer's first step to install Datashare: 'Introduction'
    Screenshot of the Mac installer's third step to install Datashare: 'Installation Type''
    Screenshot of the homepage of datashare.icij.org highlighting the 'Download for Mac' button

    It asks if you want to allow the app to make changes to your device. Click 'Yes':

    Choose a language and click 'OK':

    Untick 'Show README' and click 'Finish':

    Finally, click 'Close' to close the installer of TesseractOCR.

    datashare.icij.orgarrow-up-right
    start Datashare
    Screenshot of the homepage of datashare.icij.org highlighting the 'Download for Windows' button
    datashare.icij.orgarrow-up-right
    Screenshot of Windows' window saying 'Windows protected your PC' with the text "Windows Defender SmartScreen prevented an unrecognized app from starting. Running this app might put your PC at risk. More info (which is a link)" and a button 'Don't run'
    Screenshot of Windows' window with the title 'Welcome to the ICIJ Setup Wizard' with 2 buttons: 'Install' (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'Welcome to the Tesseract-OCR Setup Wizard' with 2 buttons: 'Next' (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'Licence agreement' with 3 buttons: 'Previous', 'Next (which is highlighted) and 'Cancel'
    Screenshot of Windows' window showing 2 radiobuttons: 'Install for anyone using this computer' (which is selected) and 'Install just for me' and with 3 buttons: 'Previous', 'Next (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'ICIJ Datashare Setup' with a progress bar and a 'Cancel' button
    Screenshot of Windows' window saying 'ICIJ Datashare Setup' with a completed progress bar with 3 buttons: 'Back', 'Close' (which is highlighted) and 'Cancel'
    Open the "Documents" page

    On the top right, click the "Plus" button:

    Click the "Plus" button

    Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

  • Form for adding documents

  • The first is the scanning of your Datashare folder - it sees if there are documents to analyze. It is called 'ScanTask'.
  • The second is the indexing of these files. It is called 'IndexTask'.

  • Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

    But you can start searching in your documents without having to wait for all tasks to be done.

    search documents in Datashare
    Screenshot of Windows' homepage with the Datashare folder icon highlighted
    Screenshot of Windows' homepage with the menu open with the entry 'ICIJ' > 'Datashare 1.3' highlighted
    Screenshot of Datashare's homepage highlighting the top icon in the left menu top to expand it
    Expand the menu
    Open the "Documents" page

    On the top right, click the 'Plus' button:

    Click the "Plus" button

    Select the folder or sub-folder on your computer in your 'Datashare' directory containing the documents you want to add. The entire 'Datashare' directory will be added by default.

  • Choose the language of your documents if you don't want Datashare to guess it automatically. Note: If you choose to also extract text from images (at the next option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.

  • Extract text from images/PDFs with Optical Character Recognition (OCR). Be aware the indexing can take up to 10 times longer.

  • Skip already indexed documents if you'd like.

  • Click 'Add'

  • Form for adding documents

  • The first is the scanning of your Datashare folder - it sees if there are documents to analyze. It is called 'ScanTask'.
  • The second is the indexing of these files. It is called 'IndexTask'.

  • Note: It is not possible to 'Find entities' while these two tasks are still running. You won't have the entities (names of people, organizations, locations and e-mail addresses) yet. To get these, once your document addition is finished, please follow the steps to 'Find entities'.

    But you can start searching in your documents without having to wait for all tasks to be done.

    search documents in Datashare
    Screenshot of Datashare's homepage highlighting the top icon in the left menu top to expand it
    Expand the menu

    At the bottom of the menu, click the 'Settings' icon:

    2

    Open the 'Plugins' tab:

    3

    Choose the plugin you want to add and click 'Install':

    If you want to install a plugin from a URL, click 'Install from a URL':

    4

    Your plugin is now installed:

    5

    Refresh your page to see your new plugin activated in Datashare.

    hashtag
    Add extensions to Datashare's back-end

    1

    At the bottom of the menu, click the 'Settings' icon:

    2

    Open the 'Extensions' tab:

    3

    Choose the extension you want to add and click 'Install':

    If you want to install an extension from a URL, click 'Install from a URL':

    4

    Your extension is now installed:

    5

    Restart Datashare to see your new extension activated in Datashare. Check how for Mac, Windows and Linux.

    hashtag
    Update plugin or extension with latest version

    When a newer version of a plugin or extension is available, get the latest version.

    If it is a plugin, refresh your page to activate the latest version.

    If it is an extension, restart Datashare to activate the latest version. Check how for Mac, Windows and Linux.

    hashtag
    Create your own plugin or extension

    People who can code can create their own plugins and extensions by following these steps:

    • Plugins

    • Extensions

    Where `[lang]` can be:
    • all if you want to install all languages

    • a language code (e.g. fra for French); the list of languages is available herearrow-up-right
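For example, to install the French language pack (code fra) on a Debian-based system, the pattern above expands to the following (a hypothetical session; the package name is derived from the tesseract-ocr-[lang] pattern):

```shell
# Install the French Tesseract language pack on Debian/Ubuntu
sudo apt install tesseract-ocr-fra

# Confirm that 'fra' now appears among the installed languages
tesseract --list-langs
```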

    hashtag
    Install packages on Mac

    The Datashare Installer for Mac checks for the existence of either MacPortsarrow-up-right or Homebrewarrow-up-right, the package managers used to install Tesseract. If neither package manager is present, the Datashare Installer will install MacPorts by default.

    hashtag
    With MacPorts (default)

    First, you must check that MacPorts is installed on your computer. Please run in a Terminal:

    You should see an output similar to this:

    If you get a command not found: port error, this either means you are using Homebrew (see next section) or you have not yet run the Datashare installer for Mac.

    If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):

    The full list of supported language packages can be found on MacPorts websitearrow-up-right.

    Once the installation is done, close and restart Datashare to be able to use the newly installed packages.

    hashtag
    With Homebrew

    If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!

    If you want to check if Homebrew is installed, run the following command in a Terminal:

    You should see an output similar to this:

    If you get a command not found: brew error, this means Homebrew is not installed on your system. You can either use MacPorts (see the previous section) or run the Datashare installer for Mac on your computer.

    hashtag
    Install languages on Windows

    Language packages are available on the Tesseract Github repositoryarrow-up-right. Trained data files have to be downloaded and added to the tessdata folder in Tesseract's installation folder.

    Additional languages can also be added during Tesseract's installation.

    Download and add French into tessdata

    The list of installed languages can be checked in the Windows Command Prompt or PowerShell with the command tesseract --list-langs.

    French is listed in installed languages

    Datashare has to be restarted after the language installation. Check how for Mac, Windows and Linux.

    2

    In the menu or at the top right, click the 'Plus' button, or on the page, click 'Find entities':

    3

    Select your options

    • Select a project where you want to find entities

    • Choose between finding names of people, organizations and locations, or finding email addresses. You cannot do both simultaneously; you need to do one after the other, in either order.

    • Choose a Natural Language Processing model, that is, the software which will run the entity recognition. If you want to add more models, you can check .

    4

    In 'Tasks' > 'Entities', watch the progress of your entity recognition:

    Once they are done, you can click 'Delete done tasks' to stop displaying tasks that are completed.

    5

    Explore your entities in the documents

    You can now start searching your entities in the documents without having to wait for all tasks to be done.

    In the menu, click 'Search' > 'Documents' and open the 'Entities' tab of your documents or use the Entities filters.

    Mac
    Windows
    Linux

    Install Neo4j plugin

    hashtag
    Install the Neo4j plugin

    Install the Neo4j plugin following these instructions.

    hashtag
    Configure the Neo4j plugin

    1. At the bottom of the menu, click on the 'Settings' icon:

    2. Make sure the following settings are properly set:

    • Neo4j Host should be localhost or the address where your Neo4j instance is running

    • Neo4j Port should be the port where your Neo4j instance is running (7687 by default)

    3. When running Neo4j Community Edition, set the Neo4j Single Project value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project which will use the Neo4j plugin. Other projects won't be able to use it.

    4. Restart Datashare to apply the changes. Check how for Mac, Windows or Linux.

    5. Go to 'Projects' > your project's page > the Graph tab. You should see the Neo4j widget. After a little while, its status should be RUNNING:

    You can now create the graph.

    About the server mode

    In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers.

    This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all documents and analyses are centralized.

    hashtag
    Launch configuration

    Datashare is launched with --mode SERVER and you have to provide:

    • The external elasticsearch index address elasticsearchAddress

    • A Redis store address redisAddress

    • A Redis data bus address messageBusAddress

    Example:

    Add documents from the CLI

    This document assumes that you have installed Datashare in server mode within Docker.

    In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, there is no way to differentiate users in order to offer admins additional tools.

    This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.

    Here is a simple command to scan a directory and index its files:

    What's happening here:

    • Datashare starts in CLI mode

    • We ask Datashare to run both the SCAN and INDEX stages at the same time

    • The SCAN stage feeds an in-memory queue with files to add

    • The INDEX stage pulls files from the queue to add them to ElasticSearch

    • We tell Datashare to use the elasticsearch service

    • Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine

    Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:

    Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in the datashare:queue queue:

    The INDEX stage can now be executed in the same container:

    Once the indexing is done, Datashare will exit gracefully and your documents will be visible in Datashare.

    Sometimes you will have an existing index and want to index additional documents inside your working directory without processing every document again. It can be done in two steps:

    • Scan the existing ElasticSearch index and gather document paths to store them inside a report queue

    • Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths it contains will be skipped

    Neo4j

    This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your computer.

    hashtag
    Prerequisites

    hashtag
    Get Neo4j up and running

    Follow the instructions of the dedicated FAQ page to get Neo4j up and running.

    We recommend using a recent release of Datashare (>= 14.0.0) to use this feature; click the 'Other platforms and versions' button when downloading to access other versions if necessary.

    hashtag
    Add entities

    If it's not done yet, find entities and extract names of people, organizations and locations as well as email addresses.

    If your project contains emails, make sure to also extract email addresses.

    hashtag
    Next step

    You can now run Datashare with the Neo4j plugin.

    Add entities from the CLI

    This document assumes that you have installed Datashare in server mode within Docker and already added documents to Datashare.

    In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, there is no way to differentiate users in order to offer admins additional tools.

    This is likely to change in the near future, but in the meantime, you can extract named entities using the command-line interface.

    Datashare has the ability to detect email addresses, names of people, organizations and locations. This process uses a Natural Language Processing (NLP) pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entity extraction in the same fashion as the previous CLI stages:

    What's happening here:

    • Datashare starts in CLI mode

    • We ask Datashare to run the NLP stage

    • We tell Datashare to use the elasticsearch service

    Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
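As with the document queue shown earlier, you can peek at this NLP queue with redis-cli. A sketch, assuming the same redis service name used in the compose file:

```shell
# Display the first document ids waiting in the default NLP queue
docker compose exec redis redis-cli lrange extract:queue:nlp 0 20
```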

    The first time you run this command you will have to wait a bit, because Datashare needs to download CORENLP's models, which can be big.

    You can also chain the 3 stages together:

    As in the previous example, you may want to restore the output queue from the INDEX stage. You can do:

    The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the IDs of those documents into the extract:queue:nlp queue.
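Following the pattern of the earlier CLI invocations, such a command might look like the sketch below. The stage list and flags are assumptions based on the examples above (ENQUEUEIDX could also be appended to a full SCAN,INDEX chain); adapt them to your setup:

```shell
# Re-enqueue unanalyzed documents from the index, then run NLP on them
docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage ENQUEUEIDX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpp CORENLP
```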

    Install with Docker

    This page explains how to start Datashare within Docker in server mode.

    hashtag
    Prerequisites

    Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.

    Read more about how to install Docker on your systemarrow-up-right.


    hashtag
    Starting Datashare with multiple containers

    Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

    Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.

    This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.

    To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file for version 20.1.4 (check the latest version on datashare.icij.org):

    Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

    The -d flag runs the containers in detached mode, allowing them to run in the background.

    Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:

    Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.

    To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

    This will stop and remove the containers, freeing up system resources.


    hashtag
    Add documents to Datashare

    If you reach that point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.


    hashtag
    Extract named entities

    Datashare has the ability to detect email addresses, names of people, organizations and locations. You must perform the named entity extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.

    Dummy

    Dummy authentication provider to disable authentication

    You can use a dummy authentication provider that always accepts basic auth. You should see this popup:

    Then, whatever user or password you type, you will enter Datashare.

    hashtag
    Example

    Basic with a database

    Basic authentication with a database.

    Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:

    It is secure as long as the communication to the server is encrypted (with SSL for example).
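For illustration, this is how the header value can be reproduced in a shell (user and password are placeholders):

```shell
# Base64-encode the user:password pair, as the browser does for basic auth
printf '%s' 'user:password' | base64
# → dXNlcjpwYXNzd29yZA==

# The resulting HTTP header sent by the browser is:
#   Authorization: Basic dXNlcjpwYXNzd29yZA==
```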

    On the server side, you have to provide a database user inventory. You can launch Datashare first with the full database URL; Datashare will automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.

    Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
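A minimal sketch of such a hashing command (the password value is a placeholder; printf avoids hashing a trailing newline):

```shell
# Compute the sha256 hex digest of a password
printf '%s' 'password' | sha256sum | cut -d' ' -f1
# → 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
```

On macOS, `shasum -a 256` can be used in place of `sha256sum`.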

    Authentication providers

    Authentication with Datashare in server mode is the most impactful choice that has to be made. It can be one of the following:

    • Basic authentication with credentials stored in database (PostgreSQL)

    • Basic authentication with credentials stored in Redis

    Create and update Neo4j graph

    This page describes how to create your Neo4j graph and keep it up to date with your computer's Datashare projects.

    hashtag
    Create the graph

    1. Go to 'All projects' and click on your project's name:

    Install Neo4j plugin

    hashtag
    Install the Neo4j plugin

    Install the Neo4j plugin using the Datashare CLI so that users can access it from the frontend:

    Installing the plugin installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and will also install the datashare-extension-neo4j backend extension inside

    Basic with Redis

    Basic authentication with Redis

    Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:

    It is secure as long as the communication to the server is encrypted (with SSL for example).

    On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.

    So you have to provision users. The passwords are sha256 hex encoded. For example, using bash:
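A minimal sketch of provisioning one user, assuming a reachable Redis and that Datashare reads users from user:&lt;login&gt; keys holding a JSON object — both the key layout and the JSON shape are assumptions, so check the documentation of your Datashare version:

```shell
# Hash the password (placeholder value 'password') as sha256 hex
PASSWORD_HASH=$(printf '%s' 'password' | sha256sum | cut -d' ' -f1)
echo "$PASSWORD_HASH"
# → 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8

# Store the user entry in Redis (assumed key name and JSON shape)
redis-cli SET "user:jane" "{\"uid\": \"jane\", \"password\": \"$PASSWORD_HASH\"}" \
  || echo "redis-cli not available here; run this on your Redis host"
```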

    sudo apt install tesseract-ocr-[lang]
    port version
    port install tesseract-deu
    brew -v
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage SCAN,INDEX \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --dataDir /home/datashare/Datashare/
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage NLP \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --nlpParallelism 2 \
      --nlpp CORENLP

    OAuth2 with credentials provided by an identity provider (KeyCloak for example)

  • Dummy basic auth to accept any user (⚠️ if the service is exposed to internet, it will leak your documents)

  • dedicated FAQ page
    find entities
    run Datashare with the Neo4j plugin
    Neo4j User should be set to your Neo4j user name (neo4j by default)
  • Neo4j Password should only be set if your Neo4j user is using password authentication

  • Mac
    Windows
    Linux
    create the graph
    Screenshot of Datashare's homepage with the Settings icon at the bottom of the menu highlighted
    Screenshot of a Project's page on the Graph tab with the Running status highlighted
    Screenshot of the Mac installer's step to install Datashare when username and password are asked
    Screenshot of the Mac installer's last step to install Datashare: 'Summary' saying 'The installation was successful.'with a blue 'Close' button
    Screenshot of Windows' window saying 'Windows protected your PC' with 2 buttons: 'Run anyway' and 'Don't run'
    Screenshot of Windows' window with the question 'Do you want to allow this app from an unknown producer to make changes to your device?' with 2 buttons: 'Yes' (which is highlighted) and 'No'
    Screenshot of Windows' window saying 'Please wait (...) Datashare is being installed' with a progress bar and a 'Cancel' button
    Screenshot of Windows' window saying 'Please wait (...) Tesseract is being installed' with a progress bar and a 'Cancel' button
    Screenshot of Windows' window saying 'Please wait (...) Datashare is being installed' and 'Please wait while Setup is loading'
    Screenshot of Windows' window saying 'Please wait (...) Datashare is being installed' containing another window which says 'Please select a language' with a dropdown with 'English' selected' with 2 buttons: 'Ok' (which is highlighted) and 'Cancel'
    Screenshot of Windows' window showing some pre-ticked options with 3 buttons: 'Previous', 'Next (which is highlighted) and 'Cancel'
    Screenshot of Windows' window showing a pre-ticked 'Destination Folder' (C:\Program Files (x86)\Tesseract-OCR) with 3 buttons: 'Previous', 'Next (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'Choose Start Menu Folder' with 3 buttons: 'Back', 'Install' (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'Installation Complete' with 3 buttons: 'Back', 'Install' (which is highlighted) and 'Cancel'
    Screenshot of Windows' window saying 'Completing the Tesseract-OCR Setup Wizard' with 3 buttons: 'Back', 'Finish' (which is highlighted) and 'Cancel'
    Screenshot of Datashare's homepage with the left menu open highlighting the 'Documents' entry in the 'Tasks' category
    Screenshot of Datashare's Documents page highlighting the 'Plus' button at the top right corner
    Screenshot of Datashare's 'Add Documents' page with the form showing 5 options, a 'Reset' and an 'Add' buttons
    Screenshot of Datashare's Documents page highlighting two lines in a table, one for 'Scan folders' and another one for 'Index documents'
    Screenshot of Datashare's homepage with the left menu open highlighting the 'Documents' entry in the 'Tasks' category
    Screenshot of Datashare's Documents page highlighting the 'Plus' button at the top right corner
    Screenshot of Datashare's 'Add Documents' page with the form showing 5 options, a 'Reset' and an 'Add' buttons
    Screenshot of Datashare's Documents page highlighting two lines in a table, one for 'Scan folders' and another one for 'Index documents'
    Mac
    Windows
    Linux
    Screenshot of Datashare's Settings page on the Extensions tab with a Extension's 'Install' button highlighted
    Screenshot of Datashare's Settings page on the Extensions tab with the field 'Install from a URL' highlighted
    Screenshot of Datashare's Settings page on the Extensions tab with the installed extension highlighted
    Screenshot of a Datashare's project page with the Settings icon at the bottom of the left menu highlighted
    Screenshot of a Datashare's settings page with the Plugins tab highlighted
    Screenshot of Datashare's Settings page on the Plugins tab with a Plugin's 'Install' button highlighted
    Screenshot of Datashare's Settings page on the Plugins tab with the field 'Install from a URL' highlighted
    Screenshot of Datashare's Settings page on the Plugins tab with the installed plugin highlighted
    Screenshot of a Datashare's project page with the Settings icon at the bottom of the left menu highlighted
    Screenshot of a Datashare's settings page with the Extensions tab highlighted
    Screenshot of a terminal window with the text: username % port version / version 2.8.0 / username %
    Screenshot of a terminal window with a text mentioning homebrew
    Screenshot of the Tessdata folder showing languages files
    Screenshot of the command tesseract --list-langs. with the result: 'List of available languages (3): eng fra osd'
    how to add them as extensions
    Screenshot of Datashare's 'Find Entities' page with the whole form highlighted
    Screenshot of Datashare's Entities page with the menu's Entities entry highlighted
    Screenshot of Datashare's Entities page with 3 highlights: the menu's 'Plus' button next to Entities entry, the central button 'Find entities' in the empty state and the top right 'Plus' button
    Screenshot of Datashare's Entities page with the table which lists tasks and the entity recognition task highlighted in one line
  • A database JDBC URLarrow-up-right dataSourceUrl

  • The host of Datashare (used to generate batch search results URLs) rootHost

  • An authentication mechanism and its parameters

  • mode
    stages
    stage

    Datashare will pull documents from ElasticSearch directly

  • Up to 2 documents will be analyzed in parallel

  • Datashare will use the CORENLP pipeline

  • mode
    stage
    stages
    Docker Composearrow-up-right
    https://datashare.icij.org/arrow-up-right
    Add documents from the CLI
    Add named entities from the CLI
    /home/datashare/extensions
    . These locations can be changed by updating the
    docker-compose.yml
    .

    hashtag
    Configure the Neo4j extension

    Update the docker-compose.yml to reflect your Neo4j docker service settings.

    If you choose a different Neo4j user or set a password for your Neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.

    When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In Community Edition, the Neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set DS_DOCKER_NEO4J_SINGLE_PROJECT to the name of the project which will use the Neo4j plugin. Other projects won't be able to use it.

    hashtag
    Restart Datashare

    After installing the plugin, a restart might be needed for the plugin to display:

    hashtag
    Next step

    You can now create the graph.

    docker run -ti icij/datashare:version --mode SERVER \
        --redisAddress redis://my.redis-server.org:6379 \
        --elasticsearchAddress https://my.elastic-server.org:9200 \
        --messageBusAddress my.redis-server.org \
        --dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
        --rootHost https://my.datashare-server.org
        # ... +auth parameters (see authentication providers section)
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage SCAN \
      --queueType REDIS \
      --queueName "datashare:queue" \
      --redisAddress redis://redis:6379 \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --dataDir /home/datashare/Datashare/
    docker compose exec redis redis-cli lrange datashare:queue 0 20
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage INDEX \
      --queueType REDIS \
      --queueName "datashare:queue" \
      --redisAddress redis://redis:6379 \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --dataDir /home/datashare/Datashare/
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage SCANIDX \
      --queueType REDIS \
      --reportName "report:queue" \
      --redisAddress redis://redis:6379 \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --dataDir /home/datashare/Datashare/
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage SCAN,INDEX \
      --ocr true \
      --queueType REDIS \
      --queueName "datashare:queue" \
      --reportName "report:queue" \
      --redisAddress redis://redis:6379 \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --dataDir /home/datashare/Datashare/
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage SCAN,INDEX,NLP \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --nlpParallelism 2 \
      --nlpp CORENLP \
      --dataDir /home/datashare/Datashare/
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --stage ENQUEUEIDX,NLP \
      --defaultProject secret-project \
      --elasticsearchAddress http://elasticsearch:9200 \
      --nlpParallelism 2 \
      --nlpp CORENLP
    version: "3.7"
    services:
    
      datashare:
        image: icij/datashare:20.1.4
        hostname: datashare
        ports:
          - 8080:8080
        environment:
          - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
        volumes:
          - type: bind
            source: ${HOME}/Datashare
            target: /home/datashare/data
          - type: volume
            source: datashare-models
            target: /home/datashare/dist
        command: >-
          --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
          --mode LOCAL
          --tcpListenPort 8080
        depends_on:
          - postgresql
          - redis
          - elasticsearch
    
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
        restart: on-failure
        volumes:
          - type: volume
            source: elasticsearch-data
            target: /usr/share/elasticsearch/data
            read_only: false
        environment:
          - "http.host=0.0.0.0"
          - "transport.host=0.0.0.0"
          - "cluster.name=datashare"
          - "discovery.type=single-node"
          - "discovery.zen.minimum_master_nodes=1"
          - "xpack.license.self_generated.type=basic"
          - "http.cors.enabled=true"
          - "http.cors.allow-origin=*"
          - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
    
      redis:
        image: redis:4.0.1-alpine
        restart: on-failure
    
      postgresql:
        image: postgres:12-alpine
        environment:
          - POSTGRES_USER=datashare
          - POSTGRES_PASSWORD=password
          - POSTGRES_DB=datashare
        volumes:
          - type: volume
            source: postgresql-data
            target: /var/lib/postgresql/data
    
    volumes:
      datashare-models:
      elasticsearch-data:
      postgresql-data:
    docker-compose up -d
    docker-compose logs -f datashare_web
    docker-compose down
    docker compose exec datashare_web /entrypoint.sh \
      --mode CLI \
      --pluginInstall datashare-plugin-neo4j-graph-widget 
    ...
    services:
        datashare_web:
          ...
          environment:
            - DS_DOCKER_NEO4J_HOST=neo4j
            - DS_DOCKER_NEO4J_PORT=7687
            - DS_DOCKER_NEO4J_SINGLE_PROJECT=secret-project  # This is for community edition only
    docker compose restart datashare_web
    Then you can insert the user like this in your database:

    If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example if you use myindex:

    Or you can use the PostgreSQL import CSVarrow-up-right COPY statement if you want to create them all at once.
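As a hedged sketch of the COPY approach (the users.csv file name and its column order are assumptions; the columns mirror the user_inventory insert statement shown on this page), a bulk load from the shell could look like:

```shell
# Sketch: bulk-create users from a CSV file with psql's \copy meta-command
# (users.csv and its column order are assumptions; each row must carry the
# JSON details column with the hashed password and groups_by_applications)
psql datashare -c "\copy user_inventory (id, email, name, provider, details) FROM 'users.csv' WITH (FORMAT csv)"
```

Unlike a server-side COPY, \copy reads the file from the machine running psql, so it does not require file access on the database server.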

    Then when accessing Datashare, you should see this popup:

    basic auth popup

    hashtag
    Example

    Here is an example of launching Datashare with Docker, with the basic auth provider filter backed by the database:

    Authorization: Basic dXNlcjpwYXNzd29yZA==
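    The token after Basic is simply the base64 encoding of user:password (hypothetical credentials used for this example). As a quick sketch, you can reproduce the header value above from a shell:

```shell
# base64-encode "user:password" to build the Basic auth header value
# (-n avoids encoding a trailing newline)
echo -n 'user:password' | base64
# → dXNlcjpwYXNzd29yZA==
```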
  • Go to the Graph tab and in the first step 'Import', click on the 'Import' button:

  • You will then see a new import task running.

    When the graph creation is complete, 'Graph statistics' will reflect the number of documents and entities nodes found in the graph:

    hashtag
    Update the graph

    If new documents or entities are added or modified in Datashare, you will need to update the Neo4j graph to reflect these changes.

    Go to 'All projects' > one project's page > the 'Graph' tab. In the first step, click on the 'Update graph' button:

    To detect whether a graph update is needed, go to the 'Projects' page and open your project:

    Open your project

    Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...

    Statistics of one project

    ...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:

    The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.

    You can now explore your graph using your favorite visualization tool.

    Screenshot of Datashare's 'All projects' page with the name of one project highlighted

    Then insert the user like this in Redis:

    If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example if you use myindex:

    Then you should see this popup:

    basic auth popup

    hashtag
    Example

    Here is an example of launching Datashare with Docker, with the basic auth provider filter backed by Redis:

    Authorization: Basic dXNlcjpwYXNzd29yZA==
    $ echo -n bar | sha256sum
    fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9  -
    docker run -ti icij/datashare -m SERVER \
        --dataDir /home/dev/data \
        --batchQueueType REDIS \
        --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=dstest&password=test' \
        --sessionStoreType REDIS \
        --authFilter org.icij.datashare.session.YesBasicAuthFilter
    basic auth popup

    Create and update Neo4j graph

    This page describes how to create your Neo4j graph and keep it up to date with your server's Datashare projects

    hashtag
    Run the Neo4j extension CLI

    The Neo4j related features are added to the DatashareCLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/.local/share/datashare/extensions/*, so the CLI is run as follows:

    hashtag
    Create the graph

    In order to create the graph, run the --full-import command for your project:

    The CLI will display the import task progress and log import related information.

    hashtag
    Update the graph

    When new documents or entities are added or modified inside Datashare, you will need to update the Neo4j graph to reflect these changes.

    To update the graph, you can just re-run the full import:

    The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.

    To detect whether a graph update is needed, go to the 'Projects' page and open your project:

    Compare the number of documents and entities found in Datashare in 'Projects' > 'Your project' > 'Insights'...

    ...with the numbers found in your project in the 'Graph' tab. Run an update in case of mismatch:

    You can now explore your graph using your favorite visualization tool.

    Neo4j

    This page explains how to set up Neo4j, install the Neo4j plugin and create a graph on your server

    hashtag
    Prerequisites

    hashtag
    Get Neo4j up and running

    Follow the instructions of the to get Neo4j up and running.

    We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. If necessary, click on the 'All platforms and versions' button when downloading to access specific versions.

    hashtag
    Add entities

    If you haven't done so yet, add entities to your project.

    If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the nlpp flag to --nlpp CORENLP,EMAIL.
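    As a sketch mirroring the ENQUEUEIDX,NLP example from the CLI documentation (the datashare_web service name, secret-project and the Elasticsearch address are assumptions taken from those examples), the entity extraction command could become:

```shell
# Sketch: run entity extraction with the EMAIL pipeline added to CORENLP
# (service name, project name and address are assumptions mirroring
# the CLI examples elsewhere in this documentation)
docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage ENQUEUEIDX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpp CORENLP,EMAIL
```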

    hashtag
    Next step

    You can now .

    Search projects

    Projects are collections of documents. Datashare displays statistics about each project.

    Expand the menu to go to 'Projects' > 'All projects':

    Search in projects' names using the search bar on the right:

    Sort your projects by clicking the top right Settings icon:

    In the Page settings, choose a sort by option, change the number of projects per page or the layout:

    To explore a project, close the Settings and click on the name of the project:

    You can now .

    Search documents

    Search with the main search bar and configure settings to display your search's results.

    You must have added documents in Datashare before. Check how for Mac, Windows and Linux.

    hashtag
    Search bar

    Expand the menu to go to 'Search' > 'Documents':

    Make room by closing the menu:

    Type terms in the search bar and press Enter:

    hashtag
    Default operator is OR

    If you type several terms separated by spaces, Datashare searches for all documents containing at least one of the searched terms, as the default operator is OR.

    For instance, Datashare finds documents containing either 'ikea' or 'paris' or both terms here:

    hashtag
    Linked entities

    As you type a term, Datashare suggests linked entities, provided a task to find entities in this project was completed.

    Press Esc on your keyboard to close the dropdown or click on one of the entities to replace your term in the search bar:

    hashtag
    Search in a field

    To search within a specific field only, use the 'All fields' dropdown:

    hashtag
    Search breadcrumb

    To see your queries in the search breadcrumb, click on the icon on the left of the search bar:

    If you'd like to remove all searched terms from the search bar, click 'Clear query':

    hashtag
    Results settings

    To change the page settings, click the Settings icon on the top right:

    You can change Sort by, Documents per page, Layout and also Properties:

    Ticking these properties changes which document metadata are displayed in the results' document cards, in all 3 layouts (List, Grid, Table):

    You can now make your search more precise.

    OAuth2

    OAuth2 authentication with a third-party identity service

    This is the default authentication mode: if no authentication mechanism is provided on the CLI, it is selected. With OAuth2 you will need a third-party authorization service. The diagram below describes the workflow:

    oauth

    hashtag
    Example

    docker run -ti icij/datashare:version --mode SERVER \
        --oauthClientId 30045255030c6740ce4c95c \
        --oauthClientSecret 10af3d46399a8143179271e6b726aaf63f20604092106 \
        --oauthAuthorizeUrl https://my.oauth-server.org/oauth/authorize \
        --oauthTokenUrl https://my.oauth-server.org/oauth/token \
        --oauthApiUrl https://my.oauth-server.org/api/v1/me.json \
        --oauthCallbackPath /auth/callback

    hashtag
    Integration with KeyCloak

    We made a small demo to show how it could be set up.

    Keyboard shortcuts

    Shortcuts help you perform some actions faster.

    Open the menu > 'Search' > 'Documents' and click the keyboard icon at the bottom of the menu:

    It opens a window with the shortcuts for your OS (Mac, Windows, Linux):

    Click on 'See all shortcuts' to reach the full page view:

    Search with operators or Regex

    To make your searches more precise, use operators in the main search bar.

    hashtag
    Double quotes for exact phrase

    To find all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).

    "Alicia Martinez’s bank account in Portugal"

    Create a Neo4j graph and explore it

    This page explains how to leverage Neo4j to explore your Datashare projects.

    hashtag
    Prerequisites

    We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. To download a specific version, click on 'All platforms and versions'.

    If you are not familiar with graphs and Neo4j, take a look at the following resources:

    Explore a project

    A project is a collection of documents. Datashare displays statistics about each project.

    Expand the menu, open 'All projects' and click on the name of the project that you want to explore:

    If you'd like to pin this project in the menu for easy access, click 'Pin to menu':

    Your project is now pinned in the menu:

    In a project page, in the first tab called 'Insights', you find statistics and a bar chart displaying the

    Performance considerations

    Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from many file types and images is a heavy process, so be aware that even though we take care of getting the best performance possible, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.

    hashtag
    Separate Processing Stages

    Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.

    Examples:
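    A minimal sketch of running the two stages separately, assuming the Docker Compose setup, queue names and addresses used in the CLI examples of this documentation:

```shell
# Sketch: scan first, then index, as two separate CLI runs
# (datashare_web, the queue name and addresses are assumptions
# mirroring the CLI examples elsewhere in this documentation)
docker compose exec datashare_web /entrypoint.sh \
  --mode CLI --stage SCAN \
  --queueType REDIS --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --dataDir /home/datashare/Datashare/

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI --stage INDEX \
  --queueType REDIS --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --dataDir /home/datashare/Datashare/
```

Running SCAN alone first lets you inspect the queue (for instance with redis-cli) before committing CPU time to the heavier INDEX stage.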

    Filter documents

    Filters are on the left of the search bar. You can contextualize, exclude and reset them. Active filters are displayed in the search breadcrumb.

    hashtag
    Filters

    Open 'Filters' on the left of the search bar:

    'Indexing dates' are the dates when the documents were added to Datashare.

    Star, tag and recommend

    Star documents, tag them or, in server mode, recommend them to a project's other members.

    hashtag
    Star documents

    circle-info

    In server collaborative mode, starring documents is private. Other members of your projects can't see your starred documents.

    $ echo -n bar | sha256sum
    fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9  -
    $ psql datashare
    datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}');
    $ psql datashare
    datashare=> insert into user_inventory (id, email, name, provider, details) values ('fbar', 'foo@bar.com', 'Foo Bar', 'my_company', '{"password": "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex", "local-datashare"]}}');
    docker run -ti icij/datashare --mode SERVER \
        --batchQueueType REDIS \
        --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
        --sessionStoreType REDIS \
        --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
        --authUsersProvider org.icij.datashare.session.UsersInDb
    $ redis-cli -h my.redis-server.org
    redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}'
    $ redis-cli -h my.redis-server.org
    redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex","local-datashare"]}}'
    docker run -ti icij/datashare --mode SERVER \
        --batchQueueType REDIS \
        --dataSourceUrl 'jdbc:postgresql://postgres/datashare?user=<username>&password=<password>' \
        --sessionStoreType REDIS \
        --authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
        --authUsersProvider org.icij.datashare.session.UsersInRedis
    # If you are not using the default extensions directory, you have to
    # specify it by extending the CLASSPATH variable, e.g. by adding:
    # -e CLASSPATH=/home/datashare/extensions/*
    docker compose exec \
      datashare_web /entrypoint.sh \
      --mode CLI \
      --ext neo4j \
      ...
    dedicated FAQ page
    using the Datashare CLI
    run Datashare with the Neo4j plugin
    Screenshot of Datashare's project page on the Graph tab with the 'Import' button highlighted on the first step of the form
    Screenshot of Datashare's project page on the Graph tab with the Graph statistics highligted
    Screenshot of Datashare's project page on the Graph tab with 'Update' button on the first step of the form highlighted
    Screenshot of Datashare's All projects page with the name of one project highlighted
    Screenshot of Datashare's project page on the Insights tab with statistics highlighted
    Screenshot of Datashare's project page on the Graph tab with statistics highlighted
    explore a project
    Screenshot of Datashare's 'All projects' page with the right panel 'Page settings' open and highlighted
    Screenshot of Datashare's 'All projects' page with LuxLeaks' project's name highlighted
    Screenshot of Datashare's homepage with the menu 'All projects' entry highlighted
    Screenshot of Datashare's 'All projects' page with the search bar 'Search projects' highlighted
    Screenshot of Datashare's 'All projects' page with the top right Settings icon highlighted
    with operators or Regex (Regular Expressions)
    Screenshot of a Datashare's search documents page with the menu open and its top right X icon highlighted
    Screenshot of a Datashare's search documents page with highlighted 'Ikea' typed in the search bar
    Screenshot of a Datashare's search documents page with highlighted 'Ikea paris' typed in the search bar
    Screenshot of a Datashare's search documents page with 'Ikea paris' typed in the search bar and a dropdown with linked entities below highligted
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and the 'All fields' dropdown button highlighted at the right of the search bar
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and the 'All fields' dropdown highlighted at the right of the search bar
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and the 'Your search' icon button highlighted at the left of the search bar
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and the 'Your search' breadcrumb open and highlighted below the search bar
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and the 'Your search' breadcrumb open and its 'Clear query' button highlighted
    Screenshot of a Datashare's search documents page with 'Ikea paris' typed in the search bar and the top right Settings icon button highlighted
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and and the 'Results settings' panel open and highlighted at the right of the page
    Screenshot of a Datashare's search documents page with 'Ikea' typed in the search bar and and the 'Results settings' panel open and its 'Properties' category highlighted at the right of the page
    Screenshot of a Datashare's search documents page with 'Ikea paris' typed in the search bar and the 'Results' column open and highlighted at the left of the page in a List layout
    Screenshot of a Datashare's search documents page with 'Ikea paris' typed in the search bar and the first document card highlighted at the left of the page in a Grid layout
    Screenshot of a Datashare's search documents page with 'Ikea paris' typed in the search bar and the first document card highlighted at the top of the results in a Table layout
    Screenshot of a Datashare's search documents page with the 'Documents' entry in the 'Search' category in the menu highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Keyboard' icon at the bottom of the left menu is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Keyboard' icon at the bottom of the left menu is hovered and the 'Keyboard shortcuts' popover is highlighted
    Screenshot of Datashare's keyboard shortcuts page
    Screenshot of an 'authentication required' window with username and password fields and 'Cancel' and 'OK' buttons

    General

    👷‍♀️ This page is currently being written by the Datashare team.

    FAQ

    👷‍♀️ This page is currently being written by the Datashare team.

    Do you recommend OS or machines for large corpuses?

    Datashare was created with scalability in mind, which gave ICIJ the ability to index terabytes of documents.

    To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).

    The most complex operation is OCR (we use Apache Tesseractarrow-up-right) so if your documents don't contain many images, it might be worth deactivating it (--ocr false).

    Can I use Datashare with no internet connection?

    You need an internet connection to install Datashare.

    You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models which find these named entities are downloaded on first use. After that, you don't need an internet connection to find named entities.

    You don't need an internet connection to:

    • Add documents to Datashare

    • Find named entities (except for the first time you use an NLP option - see above)

    • Search and explore documents

    • Download documents

    This allows you to work safely on your documents. No third party should be able to intercept your data and files while you're working offline on your computer.

    Screenshot of an 'authentication required' window with username and password fields and 'Cancel' and 'OK' buttons
    repositoryarrow-up-right
    A diagram of a workflow
    Screenshot of an 'authentication required' window with username and password fields and 'Cancel' and 'OK' buttons

    Can I use an external drive as data source?

    Warning: this requires some technical knowledge.

    You can make Datashare follow soft links (symlinks): add --followSymlinks when Datashare is launched.

    If you're on Mac or Windows, you must change the launch script.

    If you're on Linux, you can add the option after the Datashare command.

    explore your graph
    Screenshot of Datashare's All projects page with the name of one project highlighted
    Open your project
    Screenshot of Datashare's project page on the Insights tab with statistics highlighted
    Statistics of one project
    Screenshot of Datashare's project page on the Graph tab with statistics highlighted
    docker compose exec \
      datashare_web /entrypoint.sh \
      --mode CLI \
      --ext neo4j \
      --full-import \
      --project secret-project
    docker compose exec \
      datashare_web /entrypoint.sh \
      --mode CLI \
      --ext neo4j \
      --full-import \
      --project secret-project
    hashtag
    OR (or space)

    To find all documents mentioning at least one of the queried terms, you can use a simple space between your terms (as OR is the default operator in Datashare) or OR. You need to write OR in all uppercase letters.

    Alicia Martinez

    Alicia OR Martinez

    hashtag
    AND (or +)

    To find all documents mentioning all the queried terms, you can use AND between your queried words. You need to write AND in all uppercase letters.

    Alicia AND Martinez

    +Alicia +Martinez

    hashtag
    NOT (or ! or -)

    To find all documents NOT mentioning some queried terms, you can use NOT before each word you want to exclude. You need to write NOT in all uppercase letters.

    NOT Martinez

    !Martinez

    -Martinez

    hashtag
    Combine operators

    Parentheses should be used whenever multiple operators are used together and you want to give priority to some.

    ((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"

    You can also combine these with regular expressions (regex) between two slashes (see below).

    hashtag
    Wildcards

    If you search for faithf?l, the search engine will look for all words with any possible single character between the second f and the l. Use * instead to replace multiple characters.

    Alicia Martin?z

    Alicia Mar*z

    hashtag
    Fuzziness

    You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

    kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

    kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

    If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde symbolarrow-up-right at the end of the word to set the fuzziness to 1 or 2.

    "The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticarrow-up-right).

    quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

    Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

    hashtag
    Proximity searches

    When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

    Examples:

    "the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)

    "the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)

    "While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elasticarrow-up-right).

    "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"

    The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase quick fox would be considered more relevant than quick brown fox (source: Elasticarrow-up-right).

    hashtag
    Boosting operators

    Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:

    quick^2 fox

    The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:

    "john smith"^2 (foo bar)^4

    (source: Elasticarrow-up-right)

    hashtag
    Regular expressions (Regex)

    ‌"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." (Wikipediaarrow-up-right).

    1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between two slashes and to start with the field (content, name, author, recipients, etc.):

    content: /.*..*@.*..*/

    The example above will search in the content of the document for any expression which is structured like an email address with a dot between two expressions before the @ and a dot between two expressions after the @ like in 'first.lastname@email.com' for instance.

    2. Regex can be combined with standard queries in Datashare:

    ("Ada Lovelace" OR "Ado Lavelace") AND paris AND content:/.*..*@.*..*/

    3. You need to escape the following characters by typing a backslash just before them (without space):‌ . ? + * | { } [ ] ( ) " \ # @ & < > ~

    /.*..*\@.*..*/ (the @ was escaped by a backslash \ just before it)

    4. Important: Datashare relies on Elastic's Regex syntax as explained herearrow-up-right. Datashare uses the Standard tokenizerarrow-up-right. A consequence of this is that spaces cannot be searched as such in Regex.

    We encourage you to use the AND operator to work around this limitation.

    If you're looking for a French International Bank Account Number (IBAN), which may or may not contain spaces and consists of FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:

    /FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)

    Here are a few examples of useful Regex:

    • You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.

    • You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with either y or u.

    • You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.

    Other common Regex examples:

    • phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/

    • emails (simplifiedarrow-up-right): /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+/

    • credit cards: /(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35[0-9]{3})[0-9]{11})/

    You can find many other examples on this sitearrow-up-right. More generally, if you use a regex found on the internet, beware that its syntax is not necessarily compatible with Elasticsearch's. For example \d, \S and the like are not understoodarrow-up-right.

    hashtag
    Search with metadata fields

    1

    In 'Search' > 'Documents', open a document and go to the 'Metadata' tab:

    2

    Click a metadata's search icon to search documents with same properties:

    3

See the query in the main search bar. It contains the field name, a colon and the searched value:

    So for example, if you are looking for documents that:

    • Contains term1, term2 and term3

    • And were created after 2010

    you can use the 'Date' filter or type in the search bar:

Get started with Neo4j

• Find out what a graph database is

• Learn Neo4j fundamentals

• Check out how to use Neo4j for investigative journalism

hashtag
The documents and entities graph

Neo4j is a graph database technology which lets you represent your data as a graph.

Inside Datashare, Neo4j lets you connect entities to one another through the documents in which they appear.

After creating a graph from your Datashare project, you will be able to explore it and visualize these kinds of relationships between your project's entities:

    In the above graph, we can see 3 e-mail document nodes in orange, 3 e-mail address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:

• shapp@caiso.com emailed 20participants@caiso.com; the sent e-mail has an ID starting with f4db344...

    • One person named vincent is mentioned inside this email, as well as the california location

    • Finally, the e-mail also mentions the dle@caiso.com e-mail address which is also mentioned in 2 other e-mail documents (with ID starting with 11df197... and 033b4a2...)

    hashtag
    Graph nodes

    The Neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.

    The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...

    hashtag
    Graph relationships

    In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of e-mail documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the e-mail metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the e-mail.

When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the e-mail document's body.

    When a document is embedded inside another document (as an e-mail attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.

    hashtag
    Create your Datashare project's graph

    The creation of a Neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:

    • When using Datashare on your computer

    • When Datashare is running on your server

    After the graph is created, open the menu, go to the 'Projects' page, select your project and go to the Graph tab.

    You should be able to visualize a new Neo4j widget displaying the number of documents and entities found inside the graph:

    hashtag
    Access your project's graph

    Depending on your access to the Neo4j database behind Datashare, you might need to export the Neo4j graph and import it locally to access it from visualization tools.

    Exporting and importing the graph into your own database is also useful when you want to perform write operations on your graph without any consequences on Datashare.

    hashtag
    With read access to Datashare's Neo4j database

If you have read access to the Neo4j database (this should be the case if you are running Datashare on your computer), you will be able to plug visualization tools into it and start exploring.

    hashtag
    Without read access to Datashare's Neo4j database

If you can't get read access to the database, you will need to export it and import it into your own Neo4j instance (running on your laptop, for instance).

    hashtag
    Ask for a DB dump

If possible, ask your system administrator for a DB dump obtained with the neo4j-admin database dump command.

    hashtag
    Export your graph from Datashare

In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare. Be aware that limits might apply to the size of the exported graph.

    To export the graph, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2 called 'Format', select the 'Cypher shell' export format and at the end of the form, click the 'Export' button:

In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Paths' and 'File types' filters.

    DB import

    Depending on how you run Neo4j on your laptop, use one of the following ways to import your graph into your DB:

    Docker

    • Identify your Neo4j instance container ID:

• Copy the graph dump into your Neo4j container's import directory:

• Import the dumped file using the cypher-shell command:

    Neo4j Desktop import

    • Open 'Cypher shell':

    desktop-shell
• Copy the graph dump into your Neo4j instance's import directory:

• Import the dumped file using the cypher-shell command:

    You will now be able to explore the graph imported in your own Neo4j instance.

    hashtag
    Explore and visualize entity links

Once your graph is created and you can access it (see this section if you can't access Datashare's Neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.

    hashtag
    Connect to your database

Once you can access your Neo4j database, you can use different tools to visualize and explore it. You can start by connecting Neo4j Desktop to your DB.

    hashtag
    Visualize and explore with Neo4j Bloom

Neo4j Bloom is a simple and powerful tool developed by Neo4j to quickly visualize and query graphs; it requires Neo4j Enterprise Edition. Bloom lets you navigate and explore the graph through a user interface similar to the one below:

    bloom-viz

    Neo4j Bloom is accessible from inside Neo4j Desktop app.

    Find out more information about how to use Neo4j Bloom to explore your graph with:

• Bloom's User Guide

• Bloom's Quick Start

• This series of videos about graph exploration with Bloom

    hashtag
    Query the graph with Neo4j Browser

The Neo4j Browser lets you run Cypher queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs; running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:

    browser-viz

    The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:

    • Inside the Neo4j Desktop app when running Neo4j from the Desktop app

• At http://localhost:7474/browser/ when running Neo4j inside Docker

    hashtag
    Visualize and explore with Linkurious Enterprise Explorer

Linkurious is a proprietary software which, similarly to Neo4j Bloom, lets you visualize and query your graph through a powerful UI.

    Find out more information about Linkurious:

• Linkurious User Manual

• Configure Linkurious with Neo4j

• Run Linkurious inside Docker

    hashtag
    Visualize with Gephi

Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare in the GraphML file format and import them into Gephi.

    Find out more information about:

    • How to export your graph in the GraphML format

• Gephi features

• How to get started with Gephi

    hashtag
    Export your graph in the GraphML format

To export the graph in the GraphML file format, open the menu, click 'Projects' > 'All projects' > select your project > open the Graph tab. At step 2, called 'Format', select the 'GraphML' export format and, at the end of the form, click the 'Export' button:

In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using, at step 3, the 'Paths' and 'File types' filters.

    You will now be able to visualize the graph using Gephi by opening the exported GraphML file in it.

See the number of documents by creation date.

    Filter this chart by path by clicking 'Select path':

    Click on one bar for a year or month to see all the corresponding documents:

    On the 'Languages', 'File Types' and 'Authors' widgets, you see stats:

Search all documents matching a specific criterion, for instance here the French language:

    Finally, in the server collaborative mode, you see the Latest recommended documents, that is to say the documents marked as recommended by other members of the project:

    You can now search documents.

    Screenshot of Datashare's 'All projects' page with LuxLeaks' project's name highlighted
    Screenshot of Datashare's 'All projects' page with LuxLeaks' project's 'Pin to menu' top right button highlighted
    Screenshot of Datashare's 'All projects' page with LuxLeaks' project's name pinned in the left menu and highlighted
    hashtag
    Distribute the INDEX Stage

Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate the processing load.

For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.

    hashtag
    Leverage Parallelism

    Datashare offers --parallelism and --parserParallelism options to enhance processing speed.

    Example (for g4dn.8xlarge with 32 CPUs):

    hashtag
    Optimize ElasticSearch

ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.

    hashtag
    Adjust JAVA_OPTS

You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize the Java Virtual Machine's memory usage. Example (for g4dn.8xlarge with 120 GB of memory):

    hashtag
    Specify Document Language

    If the document language is known, explicitly setting it can save processing time.

    • Use --language for general language setting (e.g., FRENCH, ENGLISH).

    • Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).

    Example:

    hashtag
    Manage OCR Tasks Wisely

    OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.

    Example:

    hashtag
    Efficient Handling of Large Files

Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.

Example of splitting Outlook PST files into multiple .eml files with readpst:

Apache Tika
Tesseract OCR
'Extraction levels' describe embedded documents:
• The 'file on disk' is level zero

    • If a document is attached to (or contained in) a file on disk, its extraction level is '1st'

    • If a document is attached to (or contained in) a document itself contained in a file on disk, its extraction level is '2nd'

    • And so on

    hashtag
    Filter by entities

If you asked Datashare to 'Find entities' and the task completed, you will see the names of people, organizations, locations and e-mail addresses in the filters. These are the entities automatically detected by Datashare:

    hashtag
    Exclude filters

Tick the 'Exclude' checkbox to search all documents except those matching the selected items.

In the search breadcrumb, you see that the excluded filters are struck through:

    hashtag
    Contextualize filters

    In most filters, tick 'Contextualize' to update the number of documents indicated in the filters so they reflect the results.

The filter will only count what you selected; it will reflect the results of your current selection:

    hashtag
    Clear all filters

    To reset all filters at the same time, open the search breadcrumb:

    Click 'Clear filters':

    Screenshot of Datashare's page to search documents with the 'Filters' button at the left of the search bar highlighted
Screenshot of Datashare's page to search documents with the 'Filters' panel open and highlighted on the left of the page and on the right of the menu
    hashtag
    Star a single document

    Click the star icon either at the right of the document's card or at the top right of the document:

    Click on the same icons to unstar.

    hashtag
    Star multiple documents

    Open the selection mode by clicking the multiple cards icon on the left of the pagination:

    Select the documents you want to star:

Click the filled star icon:

    To unstar documents, click the three-dot icon if necessary and click Unstar:

    hashtag
    Filter starred documents

    Open the filters by clicking the 'Filters' button on the left of the search bar:

    In the 'User data' category, open 'Starred' and tick the 'Starred' checkbox:

    hashtag
    Tag documents

    circle-info

Tags are always in lower-case letters. They can contain numbers, hyphens and special characters, but not commas nor semicolons (which act as keyboard shortcuts when adding tags).

    circle-info

In server collaborative mode, tags are public to the project's other members. You can see their tags and they can see yours.

    hashtag
    Tag a single document

Go to 'Search' > 'Documents', open a document and, above the document's name, click the hashtag icon:

    It opens the Tags panel on the left:

    Type your tag and press Enter or click 'Add':

    Your tag is now displayed in the 'Added by you' category:

    Remove your tag, or others' tags, by clicking their cross icon:

    hashtag
    Tag multiple documents

    Open the selection mode by clicking the multiple cards icon on the left of the pagination:

    Select the documents you want to tag:

    Click the three-dot icon if necessary and click 'Tag':

Type your tag, or multiple tags separated by commas, and click 'Add':

    Remove your tag, or others' tags, by clicking their cross icon on each single document (you cannot untag multiple documents):

    hashtag
    Filter tagged documents

    Open the filters by clicking the 'Filters' button on the left of the search bar:

    In the 'User data' category, open 'Tags' and tick the 'Tag' checkboxes for tagged documents you want to filter:

    hashtag
    Recommend a document

    circle-info

    In server collaborative mode, recommending documents is public to the project's other members. All members can see who recommended some documents.

Go to 'Search' > 'Documents', open a document and, above the document's name, click the eyes icon:

    It opens the Recommendations panel on the left:

    Click on the 'Mark as recommended' button:

    The document is now marked as recommended by you:

Click 'Unmark as recommended' to unmark it.

    hashtag
    Filter recommended documents

    Open the filters by clicking the 'Filters' button on the left of the search bar:

    In the 'User data' category, open 'Recommended by' and tick the 'Username' checkboxes for documents recommended by the users you want to filter:

    Explore a document

    Explore the document's data through different tabs.

    hashtag
    See a document in full-screen view

Open a document in 'Search' > 'Documents' and click the icon with in and out arrows (this applies to the List layout; in the Grid and Table layouts, documents always open in full-screen view):

    You now see the document in full screen view and can go to the next document in your results by using the pagination carousel on the top of the screen:

    hashtag
    Search in a document

    • Open a document in 'Search' > 'Documents' > one document

    • Stay on the first tab called 'Text'. This tab shows the text as extracted from your document by Datashare.

    • Click on the search bar or press Command (⌘) / Control + F

    circle-info

    To see all the keyboard shortcuts in Datashare, please read ''.

    hashtag
    See original document

    Go to the 'View' tab to see the original document.

    Note: this visualization of the document is available only for some file types: images, PDF, CSV, xlsx and tiff but not other file types like Word documents or e-mails for instance.

    hashtag
    Search for attachments and documents in the same folder

    circle-info

    Attachments are called 'children documents' in Datashare.

    Go to the 'Metadata' tab and click on 'X documents in the same folder' or 'Y children documents':

You see the list of documents. To open all the documents in the same folder, or all the children documents, click 'Search all' below the list. There is no 'Search all' button if there are no documents, as for the children documents below:

    hashtag
    Explore metadata

Go to the 'Metadata' tab to explore all the properties of the document:

If a metadata field is interesting to you and you'd like to know whether other documents in your project share the same value, click the search icon:

    You can also copy or pin a metadata.

    hashtag
    Entities

In the 'Entities' tab, and only if you previously ran the 'Find entities' task in Datashare, you can read the names of people, organizations, locations and e-mail addresses, along with the number of their occurrences in the document:

Hover over an entity to see a popover with all its mentions in the document; click the arrows to browse them in context:

    Go to the 'Info' tab to check how the entity was extracted:

    Batch search documents

Batch searches allow you to get the results of each query of a list all at once: instead of running each query one by one, upload a list, set options and filters, and see the matching documents.

    1

    hashtag
    Prepare a CSV list

    Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)

    Write your queries in the first column of the spreadsheet, typing one query per line:

    • Do not put line break(s) in any of your cells.

    To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them by nothing or a space.

• Write at least 2 characters in each query. If one cell contains a single character but at least one other cell contains more, the single-character cell will be ignored. If all cells contain only one character, the batch search will result in a 'failure'.

    • If you have blank cells in your spreadsheet...

...the CSV, which stands for 'Comma-Separated Values', will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:

    To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.

    • If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:

    Remove all commas in your spreadsheet if you want to avoid exact phrase search.

    • Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:

      Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)

• Use operators in your CSV: AND, NOT, *, ?, !, +, - and other operators work in batch searches as they do in the regular search bar, but only if 'Do phrase match' at step 3 is turned off. You can thus turn it off and write your queries like this, for instance:
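Putting the rules above together, here is a minimal sketch of how such a CSV could be produced from the command line (the file name queries.csv is hypothetical):

```shell
# One query per line, at least 2 characters each, no line breaks inside
# cells; the double quotes around "Jane, Austen" force an exact-phrase
# search because the cell contains a comma.
cat > queries.csv <<'EOF'
"Jane, Austen"
Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)
Dimitri
EOF
wc -l < queries.csv
```

The wc -l check confirms the file contains three queries, one per line.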

    2

    hashtag
    Export the list as a CSV

    Export your spreadsheet of queries in a CSV format:

    Important: Use the in your spreadsheet software's settings.

    3

    hashtag
    Create the batch search

    Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:

Alternatively, in the menu next to 'Batch searches', click the 'Plus' button:

    4

    hashtag
    Explore your results

    In the menu, click 'Batch searches' and click the name of the batch search to open it:

    See the number of matching documents per query:

    5

    hashtag
    Relaunch a batch search (optional)

    If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.

    The relaunched batch search will apply to newly indexed documents and previously indexed documents (not only the newly indexed ones).

    6

    hashtag
    Failures

    Failures in batch searches can be due to several causes.

    circle-info

    How can I contact ICIJ for help, bug reporting or suggestions?

    You can send an email to datashare@icij.org.

    When reporting a bug, please share:

    • Your OS (Mac, Windows or Linux) and version

    • The problem, with screenshots

How can we use Datashare in collaborative mode on a server?

    You can use Datashare with multiple users accessing a centralized database on a server.

Warning: setting up and maintaining the server mode requires some technical knowledge.

    Please find the .

    Can I remove document(s) from Datashare?

    In local mode, you cannot remove a single document or a selection of documents from Datashare. But you can remove all your projects and documents from Datashare.

    Open the menu and on the bottom of the menu, click the trash icon:

    A confirmation window opens. The action cannot be undone. It removes all the projects and their documents from Datashare. Click 'Yes' if you are sure:

    For advanced users - if you'd like to do it with the Terminal, here are the instructions:

    Can I download a document from Datashare?

    Yes, you can download a document from Datashare.

    hashtag
    Download a document

    Open the menu > 'Search' > 'Documents' and click on the download icon on the right of documents' cards:

    ...or on the top right of an opened document:

    What should I do if I get more than 10,000 results?

In Datashare, for technical reasons, it is not possible to open results beyond the 10,000th document.

Example: you search for "Paris" and get 15,634 results. You will be able to see the first 9,999 results but no more. This also happens if you didn't run any search.

    As it is not possible to fix this, here are some tips:

• Use filters to narrow down your results and ensure you have fewer than 10,000 matching documents

    docker ps | grep neo4j # Should display your running neo4j container ID
    docker cp \
        <export-path> \
        <neo4j-container-id>:/var/lib/neo4j/imports/datashare-graph.dump
    docker exec -it <neo4j-container-id> /bin/bash
    ./bin/cypher-shell -f imports/datashare-graph.dump 
    cp <export-path> imports
    ./bin/cypher-shell -f imports/datashare-graph.dump 
    datashare --mode CLI --stage SCAN --redisAddress redis://redis:6379 --busType REDIS
    datashare --mode CLI --stage INDEX --redisAddress redis://redis:6379 --busType REDIS
    datashare --mode CLI --stage INDEX --parallelism 14 --parserParallelism 14
    datashare --mode CLI --stage NLP --parallelism 14 --nlpParallelism 14
    JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX
    datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
    datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
    datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell
    datashare --mode CLI --stage INDEX --ocr false
    readpst -reD <Filename>.pst

    The actions that led to the problem

Or you can post an issue with your logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues

    documentation here

Change the sorting of your results: use 'creation date' or 'alphabetical order', for instance, instead of the default sorting, which corresponds to a relevance score

  • Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents

  • Refine your search
    • If you're using Mac: rm -Rf ~/Library/Datashare/index

• If you're using Windows: rd /s /q "%APPDATA%\Datashare\index"

    • If you're using Linux: rm -Rf ~/.local/share/datashare/index

    Screenshot of Datashare's homepage with the menu and the trash icon at the bottom right of the menu highlighted
    Screenshot of Datashare's homepage with a confirmation modal to delete all projects and documents highlighted

    term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01

    Explanations:

• 'metadata.tika_metadata_creation_date:' means that we filter by creation date

• '>=' means 'since January 1st included'

• '2010-01-01' stands for January 1st, 2010, and the search will include that date

    For other searches:

    • '<' will mean 'strictly before (with January 1st excluded)'

    • no character will mean 'at this exact date'

Ranges: You can also search for values in a range. Ranges can be specified for date, numeric or string fields, among the ones you can find by clicking the magnifying glass when you hover over the fields in a document's 'Metadata' tab. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to Elastic's page on ranges.
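For instance, reusing the creation-date field from the example above (a sketch; adapt the field name to the metadata you actually see in your documents), inclusive and exclusive range queries would look like:

```shell
# Print two example range queries you could paste into the search bar:
# square brackets include the bounds, curly brackets exclude them.
echo 'metadata.tika_metadata_creation_date:[2010-01-01 TO 2015-12-31]'
echo 'metadata.tika_metadata_creation_date:{2010-01-01 TO 2015-12-31}'
```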

    Screenshot of Datashare's document search page with the search bar displaying 'contentTextLength:"26952"' highlighted
    Screenshot of Datashare's document page with the 'Metadata' tab highlighted
    Screenshot of Datashare's document page in the 'Metadata' tab at scroll level of 'Content text length' with the magnifying glass icon hovered with the tooltip 'Search this metadata value' highlighted
    Screenshot of a graph showing circles in different colors with arrows between them
    Screenshot of Datashare's project page on the 'Graph' tab with the 'Graph statistics' highlighted
    Screenshot of Datashare's project page on the 'Graph' tab with the form to export a graph open and its second step called 'Format' highlighted
    Screenshot of a window with the title 'Graph DBMS' with the three dot dropdown open and the entry 'Terminal' highlighted
    Screenshot of a window showing a graph with many points grouped in 1 big and 1 small circles
Screenshot of a Neo4j Browser with blue and orange circles with arrows between some of them
    Screenshot of Datashare's project page on the 'Graph' tab with the form to export a graph open at its second step called 'Format' and the 'GraphML' radiobutton selected and highlighted
    Screenshot of a Datashare's project page on the Insights tab at the level of the 'Documents per creation date' bar chart with 'Select path' button highlighted
    Screenshot of a Datashare's project page on the Insights tab at the level of the 'Documents per creation date' bar chart with one-year bar highlighted
    Screenshot of a Datashare's project page on the Insights tab with the Languages, File types and Authors' widgets highlighted
Screenshot of a Datashare's project page on the Insights tab at the level of the Languages, File types and Authors' widgets with the French documents' number '11' highlighted
    Screenshot of a Datashare's project page on the Insights tab with the 'Latest recommended documents' highlighted
    Screenshot of a bar chart showing the size in terabytes of the Panama Papers (2016) (2.6TB), the Paradise Papers (2017) (1.4TB) and the Pandora Papers (2021) (2.94TB)
    Screenshot of the Filters' entities
    Screenshot of Datashare's page to search documents with the 'People' filter open with 2 names ticked and the Exclude button ticked and highlighted as well as the two names in the search breadcrumb that are also strikethrough
    Screenshot of Datashare's page to search documents with a filter open and the Contextualize button at the bottom of this filter highlighted
    Screenshot of Datashare's page to search documents with the 'Language' filter open, the 'Contextualize' checkbox ticked and the whole filter highlighted
    Screenshot of Datashare's page to search documents with the 'Your search' button on the left of the search bar highlighted
    Screenshot of Datashare's page to search documents with search breadcrumb open and the 'Clear filter' button highlighted
    Screenshot of Datashare's search documents page in List layout with a document open and its star icon on the top right is highlighted
Screenshot of Datashare's search documents page in List layout where the selection mode button on the left of the results' pagination is highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 3 documents are ticked and their checkboxes are highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 3 documents are ticked the star filled icon is highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 2 documents are ticked, the 'Unstar' entry in the three-dot dropdown is highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 3 documents are ticked, the 'Unstar' button is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filter' button on the left of the search bar is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filters' are open on the left and the 'Starred' filter is open and highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and its 'Hashtag' (tag) button above its title is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and the 'Tags' floating panel on the left of the document is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open, the 'Tags' floating panel on the left of the document is open and the field to add filters is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open, the 'Tags' floating panel on the left of the document is open and the category 'Added by you' is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open, the 'Tags' floating panel on the left of the document is open and the Cross icons in 2 tags label are highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode button on the left of the results' pagination is highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 3 documents are ticked and their checkboxes are highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 2 documents are ticked, the 'Tag' entry in the three-dot dropdown is highlighted
    Screenshot of Datashare's search documents page in List layout where the selection mode is open and 3 documents are ticked, the 'Tag' button is highlighted
    Screenshot of Datashare's search documents page in List layout where a modal to add Tags is open
    Screenshot of Datashare's search documents page in List layout where a document is open, the 'Tags' floating panel on the left of the document is open and the Cross icons in 2 tags label are highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filter' button on the left of the search bar is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filters' are open on the left and the 'Tags' filter is open and highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and its 'Eyes' (recommend) button above its title is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and the 'Recommendations' floating panel on the left of the document is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and the 'Recommendations' floating panel on the left of the document is open and the button 'Mark as recommended' is highlighted
    Screenshot of Datashare's search documents page in List layout where a document is open and the 'Recommendations' floating panel on the left of the document is open and where the username (you) is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filter' button on the left of the search bar is highlighted
    Screenshot of Datashare's search documents page in List layout where the 'Filters' are open on the left and the 'Recommended by' filter is open and highlighted

    Type the terms you're searching for

  • Press ENTER to go from one occurrence to the next one

  • Press SHIFT + ENTER to go from one occurrence to the previous one

  • Use keyboard shortcuts
    Find entities
    Screenshot of Datashare's document full screen view with the pagination carousel on the top highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'Text' tab and the search bar to search within the document highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'View' tab which is highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'Metadata' tab and the dropdowns 'X documents in the same folder' and 'Y children documents' highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'Metadata' tab and the dropdowns 'X documents in the same folder' and 'Y children documents' and the 'Search all' button highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'Metadata' tab and list of metadata highlighted
    Screenshot of Datashare's page to search documents in List view with a document open on the 'Metadata' tab and the search buttons for one metadata with its tooltip 'Search this metadata value' highlighted
    Screenshot of Datashare's full-screen document view on the 'Entities' tab
    Screenshot of Datashare's full-screen document view on the 'Entities' tab with one entity and its popover on the 'Mentions' tab highlighted
    Screenshot of Datashare's full-screen document view on the 'Entities' tab with one entity and its popover on the 'Info' tab highlighted
    Screenshot of Datashare's search in document page in List view with a document opened and its in out icon on the top right highlighted

    Definitions

    👷‍♀️ This page is currently being written by Datashare team.

    What is an entity?

    An entity in Datashare is the name of a person, an organization or a location, or an email address.

    Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically detect entities in your documents.

    You can filter documents by their entities and see all the entities mentioned in a document.

    What if the 'View' of my documents is 'not available'?

    Datashare can display the 'View' tab for some file types only: images, PDF, CSV, XLSX and TIFF. Other document types are not supported yet.

    Paris NOT Barcelona AND Taipei
  • Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.

  • Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.

  • LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':

    • Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained herearrow-up-right.

    • Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".

    The form to create a batch search opens:

    • 'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare had added double quotes around each query. In that case, it won't apply any operators (AND, OR, etc.) that would be in the queries. If 'Do phrase matches' is off, queries are searched without double quotes and with potential operators.

    • What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

    kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

    kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

    If you search for similar terms (to catch typos for example), use fuzziness.

    "The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticarrow-up-right).

    Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

    Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

    • What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

    “the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)

    “the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)

    Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")

    Once you have filled in all the steps, click 'Create' and wait for the batch search to complete.

    Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). Query position puts the queries in the original order in which you entered them in the CSV.

    To explore a query's matching documents, click its name and see the list of matching documents:

    Click a document's name to open it. Use the page settings or the column's names to sort documents.

    In 'Batch searches', go to the end of the table and click the 'Relaunch' icon:

    Or click 'Relaunch' in the batch search page below its name on the right panel:

    Change its name and description, and decide whether to delete the current batch search after relaunch:

    See your relaunched batch search in the list of batch searches:

    The first query containing an error makes the batch search fail and stop.

    Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:

    Check the first failure-generating query in the error window:

    Here it says:

    The first line contained a comma when it shouldn't have. Datashare interpreted this query as a syntax error, so the query failed and the batch search stopped.

    Check the most common syntax errors.

    We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software, and re-creating the batch search.

    'elasticsearch: Name does not resolve'

    If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.

    In that case, you need to re-open Datashare: check how for Mac, Windows or Linux.

    Example of a message regarding a problem with ElasticSearch:

    SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'

    'Data too large'

    One of your queries can lead to a 'Data too large' error.

    It means that this query had too many results, or that some documents in its results were too big for Datashare to process. This makes the search engine fail.

    We recommend removing the query responsible for the error and re-starting your batch search without it.

    UTF-8 encodingarrow-up-right
    Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell
    One query per line in a spreadsheet
    Screenshot of a spreadsheet cell filled with a text containing a line break and a red cross indicates it is wrong
    This will lead to a "failure"
    Screenshot of a spreadsheet cell filled with a text not containing a line break and a green check indicates it is right
    This will lead to a "success"
    Screenshot of a spreadsheet software's 'Find and replace' window with the 'Replace all' button highlighted
    Use this functionality to delete all line break(s)
    Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell and other columns from B to H empty and highlighted
    Blank columns in a spreadsheet
    Screenshot of Datashare's batch search page where each query with the female personality's surname is followed by several semicolons which are highlighted
    Remove blank cells in your spreadsheet in order to avoid this.
    Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell and the second cell contains 'Jane, Austen' and is highlighted
    Screenshot of Datashare's batch search page where two queries are highlighted: one is 'Jane, Austen' and has 0 documents as results, and the second one is 'Jane Austen' and has 2 documents as results
    Screenshot of a window of 'Numbers' software where the menu's path File > Export to > CSV is selected
    Screenshot of Datashare's batch searches page where the 'Plus' button on the top right is highlighted
    Screenshot of Datashare's batch searches page where the first batch search's name is highlighted

    Batch download documents

    You can also batch download all the documents that match a search. It is limited to 100.00 MB.

    Open the menu > 'Search' > 'Documents', run queries and apply filters. Once all the results of a specific search are relevant to you, click the download icon on the right of the results:

    Find your batch downloads as zip files in the menu > 'Tasks' > 'Batch downloads':

    Click on a batch download's name to download it:

    Can't download?

    If you can't download a document, it means that:

    • either Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's Githubarrow-up-right.

    • or you are using the server collaborative mode and the admins prevented users from downloading documents

    Screenshot of Datashare's search page in List view with the download icons in 3 document cards highlighted

    How to run Neo4j?

    This page explains how to run a neo4j instance inside Docker. For any additional information, please refer to the neo4j documentation: https://neo4j.com/docs/getting-started/

    Run Neo4j inside docker

    1. Enrich the services section of the docker-compose.yml of the install with Docker page with the following neo4j service:

    Make sure not to forget the APOC pluginarrow-up-right (NEO4J_PLUGINS: '["apoc"]').

    2. Enrich the volumes section of the same docker-compose.yml with the following neo4j volumes:

    3. Start the neo4j service using:

    Run Neo4j Desktop

    1. Install Neo4j Desktop and follow the installation instructions

    2. Create a new local DBMS and save your password for later

    3. If the installer notifies you of any ports modification, check the DBMS settings and save the server.bolt.listen_address for later

    Additional options

    Additional options to install neo4j are listed in the neo4j documentation.

    Why can results from a simple search and a batch search be slightly different?

    If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.

    Why?

    For technical reasons, Datashare processes the two queries in two different ways:

    a. Search bar (a simple search processed in the browser):

    The search query is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare's server, which forwards your query.

    b. Batch search (several searches processed by the server):

    1. Datashare's server processes each of the batch search's queries

    2. Each query is sent to Elasticsearch. The results are saved into a database

    3. When the batch search is finished, you get the results from Datashare

    Datashare's team tries to keep both results consistent, but slight differences can occur between the two queries.

    Advanced: how can I do bulk actions with Tarentula?

    Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:

    • Clean Tags by Queryarrow-up-right

    • Downloadarrow-up-right

    Please find all the use cases in Datashare Tarentula's documentation.

    How can I uninstall Datashare?

    Mac

    1. Go to Applications

    2. Right-click 'Datashare' and click 'Move to Bin'

    Windows

    Follow the steps here:

    Linux

    Use the following command:

    sudo apt remove datashare-dist

    What are proximity searches?

    As a search operator

    In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) followed by a number at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

    Examples:

    the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)

    the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)

    "While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: ).

    Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")

    The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: ).

    In batch searches

    When you run a batch search, if you turn 'Do phrase matches' on, you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

    the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)

    the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)

    Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")

    What are NLP pipelines?

    Pipelines of Natural Language Processing are tools that automatically identify entities in your documents. You can only choose one model at a time for one entity detection task.

    Open the menu > 'Tasks' > 'Entities' and follow these instructions. Select 'CoreNLP' if you want to use the model with the highest probability of working in most documents.

    What is fuzziness?

    As a search operator

    In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

    kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

    List of common errors leading to "failure" in Batch Searches

    SearchException: query='AND ada'

    One or several of your queries contains syntax errors.

    It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators: .

    You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors.

    'We were unable to perform your search.' What should I do?

    This can be due to some syntax errors in the way you wrote your query.‌

    Here are the most common errors that you should correct: ‌

    The query starts with AND

    You cannot start a query with AND in all uppercase. AND is reserved as a search operator.

    Unexpected char 106 at (line no=1, column no=81, offset=80)
    ...
    services:
        neo4j:
          image: neo4j:5-community
          environment:
            NEO4J_AUTH: none
            NEO4J_PLUGINS: '["apoc"]'
          ports:
            - 7474:7474
            - 7687:7687
          volumes:
            - neo4j_conf:/var/lib/neo4j/conf
            - neo4j_data:/var/lib/neo4j/data

    kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

    If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbolarrow-up-right at the end of the word to set the fuzziness to 1 or 2.

    "The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticarrow-up-right).

    Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

    Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

    In batch searches

    When you run a batch search, you can set the fuzziness to 0, 1 or 2. It is the same as explained above: it will apply to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

    kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

    kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

    If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbolarrow-up-right at the end of the word to set the fuzziness to 1 or 2.

    "The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticarrow-up-right).

    Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

    Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)

    Export by Queryarrow-up-right
    Taggingarrow-up-right
    CSV formatsarrow-up-right
    Tagging by Queryarrow-up-right
    GitHub documentationarrow-up-right
    Elasticarrow-up-right
    Elasticarrow-up-right
    batch search
    follow these instructions

    Datashare stops at the first syntax error and reports only that first error. You might need to check all your queries, as other errors can remain after correcting the first one.

    Example of a syntax error message:

    SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'

    elasticsearch: Name does not resolve

    If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.

    In that case, you need to re-start Datashare: check how for Mac, Windows or Linux.

    Example of a message regarding a problem with ElasticSearch:

    SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'

    read the list of syntax errors by clicking here
    Check how to create a batch search
    Datashare sends back the results stored in the database.
    A diagram with the title 'Query from navigator'
    Screenshot of a window of LibreOffice software where the Export options with 'Character set: Unicode (UTF-8)' is highlighted
    Screenshot of Datashare's batch searches page where the 'Plus' button in the menu next to the entry 'Tasks > Batch searches' is highlighted
    Screenshot of Datashare's page with a form to create a new batch search
    Screenshot of Datashare's page for one batch search where the list of queries and their matching documents are highlighted
    Screenshot of Datashare's page for one batch search's matching documents
    Screenshot of Datashare's batch searches page where the last button with the 'Relaunch' icon is highlighted
    Screenshot of Datashare's page for one batch search where the 'Relaunch' button in the right panel describing the batch search is highlighted
    Screenshot of Datashare's page for one batch search where the 'Relaunch batch search' pop-in window is open
    Screenshot of Datashare's batch searches page where the two first batch searches (one normal, one relaunched) are highlighted
    Screenshot of Datashare's batch search page where the 'Failure' button in the right panel describing the batch search is highlighted
    Screenshot of Datashare's batch search page where a modal window shows 'The error is' with a description of the error 'Unexpected char 106 at (line no=1, column no=81, offset=80)'
    Screenshot of Datashare's search page in List view with a document open and the download icons on the top right of the document highlighted
    Screenshot of Datashare's search page in List view with the download icon on the top right of the result column highlighted
    Screenshot of Datashare's batch downloads page with the menu open and the Tasks' entry 'Batch downloads' highlighted
    Screenshot of Datashare's batch downloads page with the name of one batch download highlighted
    https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98arrow-up-right
    Screenshot of a Mac's 'Applications' window with the Datashare's logo highlighted
    Screenshot of a Mac's Applications window with the Datashare's logo selected and a dropdown menu with the entry 'Move to Bin' highlighted

    Common errors

    👷‍♀️ This page is currently being written by Datashare team.

    I see entities in the filters but not in the documents

    Datashare's filters keep the entities (people, organizations, locations, e-mail addresses) previously found.

    "Old" named entities can remain in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later.

    In the future, removing the documents from Datashare before indexing new ones will remove the entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can follow these instructions.

    make sure to install the APOC Pluginarrow-up-right

    install with Docker
    Neo4j Desktoparrow-up-right
    herearrow-up-right
    create a new local DBMSarrow-up-right
    DBMS settingsarrow-up-right
    listed herearrow-up-right
    volumes:
      ...
      neo4j_data:
        driver: local
      neo4j_conf:
        driver: local
    docker compose up -d neo4j
    The query starts with OR

    You cannot start a query with OR all uppercase. OR is reserved as a search operator.

    The query contains only one double-quote: "

    ‌You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrase.

    The query contains only one parenthesis: ( or )

    You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.

    The query contains only one forward slash: /

    ‌You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).

    The query starts with or contains tilde: ~

    ‌You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.

    The query ends with an exclamation mark: !

    You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.

    The query starts with or contains caret: ^

    ‌You cannot start a query with caret (^) or write one which contains caret. Caret is reserved as a boosting operator.

    The query contains square brackets: [ or ]

    You cannot use square brackets except for searching for ranges.

    AND is reserved as a search operator
    Screenshot of Datashare's search page with 'AND ikea' in the search bar and the message 'We were unable to perform your search. This might be due to a server error or a syntax error in your query'

    What if tasks are 'running' but not completing?

    You started tasks, and they are running as you can see on 'http://localhost:8080/#/indexingarrow-up-right' but they are not completing.

    There are two possible causes:

    • If you see a progress of less than 100%, please wait.

  • If the progress is 100%, an error has occurred and the tasks failed to complete, which can happen for various reasons. If you're an advanced user, you can create an issue on GitHub with the application logs.

    What do I do if Datashare opens a blank screen in my browser?

    If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:

    1. First wait 30 seconds and reload the page.

    2. If the screen remains blank, restart Datashare following instructions for Macarrow-up-right, Windowsarrow-up-right or Linuxarrow-up-right.

    3. If you still see a blank screen, please uninstall and reinstall Datashare.

    To uninstall Datashare:

    On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.

    On Windows, please follow .

    On Linux, please delete the 3 containers: Datashare, Redis and Elasticsearch, and the script.

    To reinstall Datashare, see 'Install Datashare' for , or .

    What if Datashare says 'No documents found'?

    • If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, as no document may match your current search. Check your URL for active filters. If you're comfortable with possibly losing your previously selected filters, open the menu > 'Search' > 'Documents', open the search breadcrumb on the left of the search bar, and click 'Clear filters'.

    • You may not have added documents to Datashare yet. Check how to add documents for Mac, Windows or Linux.

    • In 'Tasks' > 'Documents', in the Progress column, if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you added, it can take multiple hours.

    Write extensions

    What if you want to add features to Datashare backend?

    Unlike plugins, which provide a way to modify the Datashare frontend, extensions were created to extend the backend functionalities. Two extension points have been defined:

  • NLP pipelines: you can add a new Java NLP pipeline to Datashare

  • HTTP API: you can add HTTP endpoints to Datashare and call the Java API you need in those endpoints

    Since , instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the extensions they need or want, and have a fully customized installation of our search platform.

    Getting started

    When starting, Datashare can receive an extensionsDir option pointing to your extensions directory. In this example, let's call it /home/user/extensions:
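    For instance, a minimal sketch of such a launch, assuming the datashare launcher is installed and on your PATH (the directory path is just an example):

    ```shell
    # Start Datashare and tell it where to look for extension jars
    # (requires the datashare launcher to be installed):
    datashare --extensionsDir /home/user/extensions
    ```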

    hashtag
    Installing and Removing registered extensions

    hashtag
    Listing

You can list the official Datashare extensions like this:

You can pass a regular expression to --extensionList to filter the extension list if you already know what you are looking for.
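For instance, assuming the pattern is matched against extension ids, you can preview what a filter like `nlp` would select by testing it against a few ids (the third id below is made up for the example):

```shell
# Assumption: the regex is matched against extension ids.
# 'datashare-extension-example-other' is a hypothetical id for illustration.
printf '%s\n' \
  datashare-extension-nlp-opennlp \
  datashare-extension-nlp-mitie \
  datashare-extension-example-other \
  | grep -E 'nlp'
```

Only the two NLP extension ids match the pattern.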

    hashtag
    Installing

You can install an extension by its id, providing the directory where Datashare extensions are stored:

    Then if you launch Datashare with the same extension location, the extension will be loaded.

    hashtag
    Removing

When you want to stop using an extension, you can either remove the jar from the extensions folder by hand or remove it with datashare --extensionDelete:

    hashtag
    Create your first extension

    hashtag
    NLP extension

You can create a "simple" Java project (as simple as a Java project can be, right?) with your preferred build tool, following the example of datashare-extension-nlp-opennlp.

You will have to add a dependency on the latest version of datashare-api to be able to implement your NLP pipeline.

With the datashare-api dependency you can then create a class implementing Pipeline or extending AbstractPipeline. When Datashare loads the jar, it will look for a class implementing the Pipeline interface.

Unfortunately, you'll also have to make a pull request to datashare-api to add a new type of pipeline. We will remove this step in the future.

Build the jar with its dependencies and install it in /home/user/extensions, then start Datashare with extensionsDir set to /home/user/extensions. Your extension will be loaded by Datashare.

Finally, your pipeline will be listed among the available pipelines in the UI when doing NER.

    hashtag
    HTTP extension

Making an HTTP extension works the same way as an NLP one: you'll have to make a Java project that builds a jar. The only dependency you will need is fluent-http, because Datashare will look for its annotations @Get, @Post, @Put...

For example, we can create a small class like this:

Build the jar, copy it to /home/user/extensions, then start Datashare:

    et voilà 🔮 ! You can query your new endpoint. Easy, right?

    hashtag
    Installing and Removing your custom Extension

    You can also install and remove extensions with the Datashare CLI.

    Then you can install it with:

    And remove it:

    Datashare Githubarrow-up-right
    these stepsarrow-up-right
    Mac
    Windows
    Linux
    Linux

    API

    The Datashare API is fully defined using the OpenAPI 3.0 specification and automatically generated after every Datashare release.

    The OpenAPI spec is a language-agnostic, machine-readable document that describes all of the API’s endpoints, parameter and response schemas, security schemes, and metadata. It empowers developers to discover available operations, validate requests and responses, generate client libraries, and power interactive documentation tools.

    You can download the latest version of the API definitionarrow-up-right in JSON or explore an instantly browsable, developer-friendly interface with Redocarrow-up-right.
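As a quick illustration of what an OpenAPI document contains, here is a sketch with a minimal, made-up spec (not Datashare's real one) and a standard-library way to list its declared paths:

```shell
# A minimal, made-up OpenAPI 3.0 document (not Datashare's real spec):
cat > /tmp/openapi-sample.json <<'EOF'
{
  "openapi": "3.0.0",
  "info": { "title": "Sample API", "version": "0.0.1" },
  "paths": { "/api/ping": { "get": { "summary": "Health check" } } }
}
EOF

# List the declared paths with Python's standard-library JSON parser:
python3 -c 'import json; spec = json.load(open("/tmp/openapi-sample.json")); print("\n".join(spec["paths"]))'
# prints /api/ping
```

The same kind of traversal works on the real Datashare spec once downloaded, since any valid OpenAPI 3.0 document exposes the same top-level `info` and `paths` objects.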

    How to contribute

👷‍♀️ This page is currently being written by the Datashare team.

    Backend

    Frontend

    version 7.5.0arrow-up-right
    regular expressionarrow-up-right
    https://github.com/ICIJ/datashare-extension-nlp-opennlparrow-up-right
    datashare-api.jararrow-up-right
    Pipelinearrow-up-right
    AbstractPipelinearrow-up-right
    will removearrow-up-right
    doing NER
    fluent-httparrow-up-right
    mkdir /home/user/extensions
    datashare --extensionsDir=/home/user/extensions
    $ datashare -m CLI --extensionList
    2020-08-29 09:27:51,219 [main] INFO  Main - Running datashare 
    extension datashare-extension-nlp-opennlp
            OPENNLP Pipeline
            7.0.0
            https://github.com/ICIJ/datashare-extension-nlp-opennlp/releases/download/7.0.0/datashare-nlp-opennlp-7.0.0-jar-with-dependencies.jar
            Extension to extract NER entities with OPENNLP
            NLP
    ...
    $ datashare -m CLI --extensionInstall datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions"
    2020-08-29 09:34:30,927 [main] INFO  Main - Running datashare 
    2020-08-29 09:34:32,632 [main] INFO  Extension - downloading from url https://github.com/ICIJ/datashare-extension-nlp-mitie/releases/download/7.0.0/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
    2020-08-29 09:34:36,324 [main] INFO  Extension - installing extension from file /tmp/tmp218535941624710718.jar into /home/user/extensions
    $ datashare -m CLI --extensionDelete datashare-extension-nlp-mitie --extensionsDir "/home/user/extensions/"
    2020-08-29 09:40:11,033 [main] INFO  Main - Running datashare 
    2020-08-29 09:40:11,249 [main] INFO  Extension - removing extension datashare-extension-nlp-mitie jar /home/user/extensions/datashare-nlp-mitie-7.0.0-jar-with-dependencies.jar
    package org.myorg;
    
    import net.codestory.http.annotations.Get;
    import net.codestory.http.annotations.Prefix;
    
    @Prefix("myorg")
    public class FooResource {
        @Get("foo")
        public String getFoo() {
            return "hello from foo extension";
        }
    }
    $ datashare --extensionsDir /home/user/extensions/
    # ... starting logs
    2020-08-29 11:03:59,776 [Thread-0] INFO  ExtensionLoader - loading jar /home/user/extensions/my-extension.jar
    2020-08-29 11:03:59,779 [Thread-0] INFO  CorsFilter - adding Cross-Origin Request filter allows *
    2020-08-29 11:04:00,314 [Thread-0] INFO  Fluent - Production mode
    2020-08-29 11:04:00,331 [Thread-0] INFO  Fluent - Server started on port 8080
    $ curl localhost:8080/myorg/foo
    hello from foo extension
    $ datashare -m CLI --extensionInstall /home/user/src/my-extension/dist/my-extension.jar --extensionsDir "/home/user/extensions"
    2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
    2020-07-27 10:02:32,596 [main] INFO  ExtensionService - installing extension from file /home/user/src/my-extension/dist/my-extension.jar into /home/user/extensions
    $ datashare -m CLI --extensionDelete my-extension.jar --extensionsDir "/home/user/extensions"
    2020-08-29 10:45:37,363 [main] INFO  Main - Running datashare 
    2020-08-29 10:45:37,579 [main] INFO  Extension - removing extension my-extension jar /home/user/extensions/my-extension.jar

    Datashare doesn't open. What should I do?

This can be due to previously installed extensions. The tech team is fixing the issuearrow-up-right. In the meantime, you need to remove them. To do so, open your Terminal and copy and paste the text below:

    • On Mac

    rm -rf ~/Library/datashare/plugins ~/Library/datashare/extensions
    • On Linux

    rm -rf ~/.local/share/datashare/plugins ~/.local/share/datashare/extensions
    • On Windows

    Press Enter. Open Datashare again.

    Database Schema

    hashtag
    api_key

    Column
    Type
    Nullable
    Default

    id

    hashtag
    Constraints and indexes

    • api_key_pkey PRIMARY KEY, btree (id)

    • api_key_user_id_key UNIQUE CONSTRAINT, btree (user_id)


    hashtag
    batch_search

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • batch_search_pkey PRIMARY KEY, btree (uuid)

    • batch_search_date btree (batch_date)

    • batch_search_nb_queries btree (nb_queries)

    hashtag
    Referenced by

• TABLE batch_search_project CONSTRAINT batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)


    hashtag
    batch_search_project

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • batch_search_project_unique UNIQUE, btree (search_uuid, prj_id)

    • batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)


    hashtag
    batch_search_query

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • batch_search_query_search_id btree (search_uuid)

    • idx_query_result_batch_unique UNIQUE, btree (search_uuid, query)


    hashtag
    batch_search_result

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • batch_search_result_prj_id btree (prj_id)

    • batch_search_result_query btree (query)

    • batch_search_result_uuid btree (search_uuid)


    hashtag
    document

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • document_pkey PRIMARY KEY, btree (id)

    • document_parent_id btree (parent_id)

    • document_status btree (status)


    hashtag
    document_tag

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • document_tag_doc_id btree (doc_id)

    • document_tag_label btree (label)

    • document_tag_project_id btree (prj_id)


    hashtag
    document_user_recommendation

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • document_user_mark_read_doc_id btree (doc_id)

    • document_user_mark_read_project_id btree (prj_id)

    • document_user_mark_read_user_id btree (user_id)


    hashtag
    document_user_star

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • document_user_star_doc_id btree (doc_id)

    • document_user_star_project_id btree (prj_id)

    • document_user_star_user_id btree (user_id)


    hashtag
    named_entity

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • named_entity_pkey PRIMARY KEY, btree (id)

    • named_entity_doc_id btree (doc_id)


    hashtag
    note

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • idx_unique_note_path_project UNIQUE, btree (project_id, path)

    • note_project btree (project_id)


    hashtag
    project

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • project_pkey PRIMARY KEY, btree (id)


    hashtag
    task

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • task_pkey PRIMARY KEY, btree (id)

    • task_created_at btree (created_at)

    • task_group btree (group_id)


    hashtag
    user_history

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • user_history_pkey PRIMARY KEY, btree (id)

    • idx_user_history_unique UNIQUE, btree (user_id, uri)

    • user_history_creation_date btree (creation_date)

    hashtag
    Referenced by

• TABLE user_history_project CONSTRAINT user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)


    hashtag
    user_history_project

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • user_history_project_unique UNIQUE, btree (user_history_id, prj_id)

    • user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)


    hashtag
    user_inventory

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • user_inventory_pkey PRIMARY KEY, btree (id)


    hashtag
    user_policy

    Column
    Type
    Nullable
    Default

    hashtag
    Constraints and indexes

    • idx_user_policy_unique UNIQUE, btree (user_id, prj_id)


    Script with Playground

    Datashare Playground delivers a collection of Bash scripts (free of external dependencies) that streamline interaction with a Datashare instance’s Elasticsearch index and Redis queue.

    From cloning or replacing whole indices and reindexing specific directories, to adjusting replica settings, monitoring or cancelling long-running tasks, and queuing files for processing, Playground implements each capability through intuitive shell scripts organized under the elasticsearch/ and redis/ directories.

    To get started, set ELASTICSEARCH_URL and REDIS_URL in your environment (or add them to a .env file at the repo root). For a comprehensive guide to script options, directory layout, and example workflows, see the full documentation on Github:
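For example (the host and port values below are the usual Elasticsearch and Redis defaults; adjust them to your deployment):

```shell
# Point the playground scripts at your instance:
export ELASTICSEARCH_URL=http://localhost:9200
export REDIS_URL=redis://localhost:6379

# Or persist them in a .env file at the repository root:
cat > .env <<'EOF'
ELASTICSEARCH_URL=http://localhost:9200
REDIS_URL=redis://localhost:6379
EOF
```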

    hashtag
    Use playground to update index's mappings and settings

Some Datashare updates bring fixes and improvements to the index mappings and settings. The index then has to be reindexed accordingly.

    1. Create a temporary empty index and specify the desired Datashare version number:

2. Reindex all documents (under the "/" path) from the original index into the temporary one:

This step can take some time if your index contains many documents.

3. Replace the old index with the new one:

4. Delete the temporary index:
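The four steps above can be chained in a small script. This is a sketch only: the index names and version number are placeholders, and `DRY_RUN=echo` prints each command instead of running it against a live cluster (drop it to execute for real):

```shell
# Placeholder values; replace with your own index names and version.
ORIGINAL_INDEX=local-datashare
TEMPORARY_INDEX=local-datashare-tmp
DS_VERSION='<ds_version_number>'
DRY_RUN=echo

$DRY_RUN ./elasticsearch/index/create.sh "$TEMPORARY_INDEX" "$DS_VERSION"
$DRY_RUN ./elasticsearch/documents/reindex.sh "$ORIGINAL_INDEX" "$TEMPORARY_INDEX" /
$DRY_RUN ./elasticsearch/index/replace.sh "$TEMPORARY_INDEX" "$ORIGINAL_INDEX"
$DRY_RUN ./elasticsearch/index/delete.sh "$TEMPORARY_INDEX"
```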

    Write plugins

What if you want to integrate text translations into Datashare's interface? Or make it display tweets scraped with Twint? Ask no more: there are plugins for that!

Since version 5.6.1arrow-up-right, instead of modifying Datashare directly, you can now isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform.

    hashtag
    Getting started

    When starting, Datashare can receive a pluginsDir option, pointing to your plugins' directory. In this example, this directory is called ~/Datashare Plugins:

    hashtag
    Installing and Removing registered plugins

    hashtag
    Listing

You can list the official Datashare plugins like this:

The string given to --pluginList is a regular expression. You can filter the plugin list if you already know what you are looking for.

    hashtag
    Installing

You can install a plugin by its id, providing the directory where Datashare plugins are stored:

    Then if you launch Datashare with the same plugin location, the plugin will be loaded.

    hashtag
    Removing

When you want to stop using a plugin, you can either remove the plugin's directory from the plugins folder by hand or remove it with datashare --pluginDelete:

    hashtag
    Create your first plugin

To inject plugins, Datashare will look for Node-compatible modules in ~/Datashare Plugins. This way we can rely on NPM/Yarn to handle built packages. As described in the NPM documentation, it can be:

Datashare will read the content of each module in the plugins directory to automatically inject them into the user interface. The backend will serve the plugin files. The entrypoint of each plugin (usually the main property of package.json) is injected with a <script> tag, right before the closing </body> tag.

    Create a hello-world directory with a single index.js:

    Reload the page, open the console: et voilà 🔮! Easy, right?

    hashtag
    Installing and Removing your custom Plugin

Now you would probably like to develop your plugin in your own repository, and not necessarily in the Datashare Plugins folder.

You can keep your code under, say, ~/src/my-plugin and deploy it into Datashare with the plugin API. To do so, you'll need to make a zip or a tarball, for example in ~/src/my-plugin/dist/my-plugin.tgz.

The tarball could contain:

    Then you can install it with:

    And remove it:

    In that case my-plugin is the base directory of the plugin (the one that is in the tarball).
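As a sketch, assuming the ~/src/my-plugin layout used above, the tarball can be produced with standard tools (the file contents here are illustrative, not a real plugin):

```shell
# Create the plugin sources and the dist/ output directory:
mkdir -p ~/src/my-plugin/dist

# package.json declares the entrypoint Datashare will inject:
cat > ~/src/my-plugin/package.json <<'EOF'
{ "name": "my-plugin", "version": "1.0.0", "main": "main.js" }
EOF
echo "console.log('my-plugin loaded')" > ~/src/my-plugin/main.js

# Archive so that 'my-plugin/' is the base directory inside the tarball:
tar czf ~/src/my-plugin/dist/my-plugin.tgz -C ~/src \
  my-plugin/package.json my-plugin/main.js

# Inspect the result:
tar tzf ~/src/my-plugin/dist/my-plugin.tgz
```

Listing only the two source files keeps dist/ itself out of the archive.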


    hashtag
    Adding elements to the Datashare user interface

To allow external developers to add their own components, we added markers in strategic locations of the user interface where a user can register new Vue Components. These markers are called "hooks":

    To register a new component to a hook, use the following method:

    Or with a more complex example:

    CLI with Tarentula

    Datashare Tarentula is a powerful command-line toolbelt designed to streamline bulk operations against any Datashare instance.

    Whether you need to count indexed files, download large datasets, batch-tag records, or run complex Elasticsearch aggregations, Tarentula provides a consistent, scriptable interface with flexible query support, and Docker compatibility.

    It also exposes a Python API for embedding automated workflows directly into your data pipelines. With commands like count, download, aggregate, and tagging-by-query, you can handle millions of records in a single invocation, or integrate Tarentula into CI/CD pipelines for reproducible data tasks.

    You can install Tarentula with your favorite package manager:

    pip3 install --user tarentula

    Or alternatively with Docker:

For the complete list of commands, options, and examples, read the documentation on GitHub:

    Design System

Datashare's frontend is built with Vue 3 and Bootstrap 5. We document all components of the interface in a dedicated Storybook:

To ease the creation of plugins, each component can be imported directly from the core:

// It's usually safer to wait for the app to be ready
document.addEventListener('datashare:ready', async () => {
    // This loads the ButtonIcon component asynchronously
    const ButtonIcon = await datashare.findComponent('Button/ButtonIcon')
    // Then we create a dummy component. For the sake of simplicity we use
    // Vue 3's Options API, but we strongly encourage you to build your
    // plugins with Vite.
        const definition = {
            components: {
                ButtonIcon,
            },
            methods: {
                sayHi() {
                    alert('Hi!')
                }
            },
            template: `
                <button-icon @click="sayHi()" icon-left="hand-waving">
                    Say hi
                </button-icon>
            `
        }
        
        // Finally, we register the component's definition in a hook.
        datashare.registerHook({ target: 'app-sidebar-sections:before', definition })
    })

In this example, you learn that:

• Datashare's launch must be awaited via the "datashare:ready" event

    • You can asynchronously import components with datashare.findComponent

• Components can be registered at targeted locations with a "hook"

• All icons from Phosphor are available and loaded automatically

    user_id

    character varying(96)

    not null

    batch_date

    timestamp without time zone

    not null

    state

    character varying(8)

    not null

    published

    integer

    not null

    0

    phrase_matches

    integer

    not null

    0

    fuzziness

    integer

    not null

    0

    file_types

    text

    paths

    text

    error_message

    text

    batch_results

    integer

    0

    error_query

    text

    query_template

    text

    nb_queries

    integer

    0

    uri

    text

    nb_queries_without_results

    integer

  • batch_search_published btree (published)

  • batch_search_user_id btree (user_id)


  • Referenced by:

  • TABLE batch_search_project CONSTRAINT batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)

  • query_results

    integer

    0

    doc_id

    character varying(96)

    not null

    root_id

    character varying(96)

    not null

    doc_path

    character varying(4096)

    not null

    creation_date

    timestamp without time zone

    content_type

    character varying(255)

    content_length

    bigint

    prj_id

    character varying(96)

    content

    text

    metadata

    text

    status

    smallint

    extraction_level

    smallint

    language

    character(2)

    extraction_date

    timestamp without time zone

    parent_id

    character varying(96)

    root_id

    character varying(96)

    content_type

    character varying(256)

    content_length

    bigint

    charset

    character varying(32)

    ner_mask

    smallint

    user_id

    character varying(255)

    creation_date

    timestamp without time zone

    not null

    '1970-01-01 00:00:00'::timestamp without time zone

    idx_document_tag_unique UNIQUE, btree (doc_id, label)

    creation_date

    timestamp without time zone

    now()

  • idx_document_mark_read_unique UNIQUE, btree (doc_id, user_id, prj_id)

  • idx_document_star_unique UNIQUE, btree (doc_id, user_id, prj_id)

  • extractor

    smallint

    not null

    category

    character varying(8)

    doc_id

    character varying(96)

    not null

    root_id

    character varying(96)

    extractor_language

    character(2)

    hidden

    boolean

    variant

    character varying(16)

    blur_sensitive_media

    boolean

    not null

    false

    label

    character varying(255)

    publisher_name

    character varying(255)

    ''::character varying

    maintainer_name

    character varying(255)

    ''::character varying

    source_url

    character varying(2048)

    ''::character varying

    logo_url

    character varying(2048)

    ''::character varying

    creation_date

    timestamp without time zone

    now()

    update_date

    timestamp without time zone

    now()

    description

    character varying(4096)

    ''::character varying

    user_id

    character varying(96)

    group_id

    character varying(128)

    progress

    double precision

    0

    created_at

    timestamp without time zone

    not null

    completed_at

    timestamp without time zone

    retries_left

    integer

    max_retries

    integer

    args

    text

    result

    text

    error

    text

    task_name btree (name)

  • task_state btree (state)

  • task_user_id btree (user_id)

  • user_id

    character varying(96)

    not null

    type

    smallint

    not null

    name

    text

    uri

    text

    not null

  • user_history_type btree (type)

  • user_history_user_id btree (user_id)


  • Referenced by:

  • TABLE user_history_project CONSTRAINT user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)

  • provider

    character varying(255)

    details

    text

    '{}'::text

    write

    boolean

    not null

    admin

    boolean

    not null

    character varying(96)

    not null

    user_id

    character varying(96)

    not null

    creation_date

    timestamp without time zone

    not null

    uuid

    character(36)

    not null

    name

    character varying(255)

    description

    character varying(4096)

    search_uuid

    character(36)

    not null

    prj_id

    character varying(96)

    not null

    search_uuid

    character(36)

    not null

    query_number

    integer

    not null

    query

    text

    not null

    search_uuid

    character(36)

    not null

    query

    text

    not null

    doc_nb

    integer

    not null

    id

    character varying(96)

    not null

    path

    character varying(4096)

    not null

    project_id

    character varying(96)

    not null

    doc_id

    character varying(96)

    not null

    label

    character varying(64)

    not null

    prj_id

    character varying(96)

    doc_id

    character varying(96)

    not null

    user_id

    character varying(96)

    not null

    prj_id

    character varying(96)

    doc_id

    character varying(96)

    not null

    user_id

    character varying(96)

    not null

    prj_id

    character varying(96)

    id

    character varying(96)

    not null

    mention

    text

    not null

    offsets

    text

    not null

    project_id

    character varying(96)

    not null

    path

    character varying(4096)

    note

    text

    id

    character varying(255)

    not null

    path

    character varying(4096)

    allow_from_mask

    character varying(64)

    id

    character varying(96)

    not null

    name

    character varying(128)

    not null

    state

    character varying(16)

    not null

    id

    integer

    not null

    generated by default as identity

    creation_date

    timestamp without time zone

    not null

    modification_date

    timestamp without time zone

    not null

    user_history_id

    integer

    not null

    prj_id

    character varying(96)

    not null

    id

    character varying(96)

    not null

    email

    text

    name

    character varying(255)

    user_id

    character varying(96)

    not null

    prj_id

    character varying(96)

    not null

    read

    boolean

    not null

    del /S %APPDATA%\Datashare\Extensions  %APPDATA%\Datashare\Plugins
    ./elasticsearch/index/create.sh <temporary_index> <ds_version_number>
    ./elasticsearch/documents/reindex.sh <original_index> <temporary_index> /
    regular expressionarrow-up-right
    NPM documentationarrow-up-right
    package.jsonarrow-up-right
    Vue Componentarrow-up-right
Note: You can make all hooks visible by setting a config variable from a plugin: datashare.config.set('hooksDebug', true).
    docker run icij/datashare-tarentula
    Phosphorarrow-up-right
    ./elasticsearch/index/replace.sh <temporary_index> <original_index>
    ./elasticsearch/index/delete.sh <temporary_index>
    mkdir ~/Datashare\ Plugins
    datashare --pluginsDir=~/Datashare\ Plugins
    $ datashare -m CLI --pluginList ".*"
    2020-07-24 10:04:59,767 [main] INFO  Main - Running datashare 
    plugin datashare-plugin-site-alert
            Site Alert
            v1.2.0
            https://github.com/ICIJ/datashare-plugin-site-alert
            A plugin to display an alert banner on the Datashare demo instance.
    ...
    $ datashare -m CLI --pluginInstall datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
    2020-07-24 10:15:46,732 [main] INFO  Main - Running datashare 
    2020-07-24 10:15:50,202 [main] INFO  PluginService - downloading from url https://github.com/ICIJ/datashare-plugin-site-alert/archive/v1.2.0.tar.gz
    2020-07-24 10:15:50,503 [main] INFO  PluginService - installing plugin from file /tmp/tmp7747128158158548092.gz into /home/dev/Datashare Plugins
    $ datashare -m CLI --pluginDelete datashare-plugin-site-alert --pluginsDir "~/Datashare Plugins"
    2020-07-24 10:20:43,431 [main] INFO  Main - Running datashare 
    2020-07-24 10:20:43,640 [main] INFO  PluginService - removing plugin base directory /home/dev/Datashare Plugins/datashare-plugin-site-alert-1.2.0
    * A folder with a package.json file containing a "main" field.
    * A folder with an index.js file in it.
    mkdir ~/Datashare\ Plugins/hello-world
    echo "console.log('Welcome to %s', datashare.config.get('app.name'))" > ~/Datashare\ Plugins/hello-world/index.js
    $ tar tvzf ~/src/my-plugin/dist/my-plugin.tgz 
    drwxr-xr-x dev/dev           0 2020-07-22 11:51 my-plugin/
    -rw-r--r-- dev/dev          31 2020-07-21 14:07 my-plugin/main.js
    -rw-r--r-- dev/dev          19 2020-07-21 14:07 my-plugin/package.json
    $ datashare -m CLI --pluginInstall ~/src/my-plugin/dist/my-plugin.tgz --pluginsDir "~/Datashare Plugins"
    2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
    2020-07-27 10:02:32,596 [main] INFO  PluginService - installing plugin from file ~/src/my-plugin/dist/my-plugin.tgz into ~/Datashare Plugins
    $ datashare -m CLI --pluginDelete my-plugin --pluginsDir "~/Datashare Plugins"
    2020-07-27 10:02:32,381 [main] INFO  Main - Running datashare 
2020-07-27 10:02:32,596 [main] INFO  PluginService - removing plugin base directory ~/Datashare Plugins/my-plugin
    // `datashare` is a global variable
    datashare.registerHook({ target: 'app-sidebar.menu:before', definition: 'This is a message written with a plugin' })
// It's usually safer to wait for the app to be ready
    document.addEventListener('datashare:ready', ({ detail }) => {
    
      // Alert is a Vue component meaning it can have computed properties, methods, etc...
      const Alert = {
        computed: {
          weekday () {
            const today = new Date()
            return today.toLocaleDateString('en-US', { weekday: 'long' })  
          }
        },
        template: `<div class="text-center bg-info p-2 width-100">
          It's {{ weekday }}, have a lovely day!
        </div>`
      }
    
      // This is the most important part of this snippet: 
// we register the component on a given `target`
      // using the core method `registerHook`. 
      detail.core.registerHook({ target: 'landing.form:before', definition: Alert })
    
    })
    GitHub - ICIJ/datashare-playground: A zero-dependencies series of bash script to interact with Datashare's index and queue.GitHubchevron-right
    Design System - DatashareDatasharechevron-right
    GitHub - ICIJ/datashare-tarentula: Cli toolbelt for Datashare.GitHubchevron-right