1 of 100

Datashare

About Datashare

Datashare allows you to search within your files, regardless of their format. It is a free open-source software developed by the International Consortium of Investigative Journalists (ICIJ).

What is Datashare?

Who uses it?

Curious to know more about how we use Datashare?

Where can I see it in action?

Can I run it on my server?

Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.

Can I customize Datashare?

Translations

Ask for help

your Operating System (Mac, Windows or Linux)
the version of your Operating System
the version of Datashare
screenshots of your issue
a description of your issue.

Concepts

This page list all the concepts implemented by Datashare that users might want to understand before starting to search within documents.

Running modes

Datashare runs using different modes with their own specifities.

Mode

Web modes

Those two modes are the only one who create

In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search into their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

In server mode, Datashare operates as a centralized server-based system. Users can access to the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitate seamless collaboration among users, as all the documents and analysis are centralized.

Comparaison between modes

The running modes offer advantages and limitations. This matrix summarizes the differences:

local

server

Multi-users

❌

✅

Multi-projects

❌

✅

Access-control

❌

✅

Indexing UI

✅

❌

Plugins UI

✅

❌

Extension UI

✅

❌

HTTP API

✅

API Key

✅

Single JVM

✅

❌

Tasks execution

✅

❌

When running Datashare in local mode, users can choose to use embedded services (like ElasticSearch, SQLITE, in-memory key/value store) on the same JVM than Datashare. This variant of the local mode is called "embedded mode" and allows user to run Datashare without having to setup any additional software. The embedded mode is used by default.

CLI mode

In cli mode, Datashare starts without a web server and allow user to perform task over their documents. This mode can be used in conjunction both with local and server modes, while allowing users to distribute heaving task between several servers.

Daemon modes

Those modes are intended to be used for action that requires to "wait" for pendings tasks.

In batch download mode, the daemon wait for a user to request a batch download of documents. When a request is receive, the daemon start a task to download the document matching the user search, a bundle them into a zip file.

In batch search mode, the daemon wait for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form on Datashare where they can to upload a list of search terms (in CSV format). The daemon will then start a task to search all matching document and store every occurences in the database.

How to change modes

Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mpode is the embedded mode. Yet when starting Datashare in command line, you can explicitely specify the running mode. For instance on Ubuntu/Debian:

datashare \
  # Switch to SERVER mode
  --mode SERVER \
  # Dummy session filter to creates ephemeral users
  --authFilter org.icij.datashare.session.YesCookieAuthFilter \
  # Name of the default project for every user
  --defaultProject local-datashare \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # URI of Redis 
  --redisAddress redis://redis:6379 \
  # store user sessions in Redis.
  --sessionStoreType REDIS

CLI stages

When running Datashare from the command-line, you can pick which "stage" to apply to analyse your documents.

The CLI stages are primarly intented to be run for an instance of Datashare that uses non-embedded resources (ElasticSearch, database, key/value memory store). This allows you to distribute heaving tasks between servers.

1. SCAN

This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.

datashare --mode CLI \  
  # Select the SCAN stage
  --stage SCAN \
  # Where the document are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Redis 
  --redisAddress redis://redis:6379

2. INDEX

datashare --mode CLI \
  # Select the INDEX stage
  --stage INDEX \
  # Where the document are located
  --dataDir /path/to/documents \
  # Store the queued files in Redis
  --dataBusType REDIS \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200 \
  # Enable OCR \
  --ocr true
  # URI of Redis 
  --redisAddress redis://redis:6379

3. NLP

Once a document is available for search (stored in ElasticSearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entities mentions in ElasticSearch, it will mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can be paralelized as well on several servers.

datashare --mode CLI \
  # Select the NLP stage
  --stage NLP \
  # Use CORENLP to detect named entities
  --nlpp CORENLP \
  # URI of Elasticsearch
  --elasticsearchAddress http://elasticsearch:9200

On your computer

About the local mode

In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search into their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.

Install on Mac

It will help you set up and install Datashare on your computer.

The installer will setup:

MacPorts (if neither Homebrew or MacPorts are installed)
Apache Tesseract with MacPorts or Homenrew
Java JRE 17
Datashare executable

1. Download Datashare

2. Start the installer

Go to your "Downloads" directory in Finder and double-click "datashare-X.Y.Z.pkg":

3. Go through the Datashare Installer

Click 'Continue', 'Install', enter your password and 'Install Software':

The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing XCode Command Line Tools, MacPort, Tesseract OCR, Java Runtime Environment and finally Datashare.

You can see what it actually does by typing command+L, it will open a window which logs every action made.

In the end, you should see this screen:

Start Datashare

Find the application on your computer and run it locally on your browser.

Once Datashare is installed, go to "Finder", then "Applications", and double-click on "Datashare".

A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening.

Keep this Terminal window open as long as you use Datashare.

Add documents to Datashare

Datashare provides a folder to use to collect documents on your computer to index in Datashare.

Find your Datashare folder on your Mac

Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':

On the menu bar at the top of your computer, click 'Go'. Click on 'Home' (the house icon).

You will see a folder called 'Datashare':

If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:

Add documents in your Datashare folder

Copy or place the documents you want to have in Datashare in this Datashare folder.

Launch Datashare

Open your Applications. You should see Datashare. Double click on it:

Datashare opens in your default internet browser. Click 'Tasks':

Click the 3rd tab 'Analyze your documents':

Install on Windows

It will help you set up the software on your computer.

Standard (Windows 7 Service Pack 2 or newer version)

Uninstall any prior standard version

Download Datashare

The file "datashare-X.Y.Z.exe" is now downloaded. Double click on the name of the file in order to execute it.

Allow Datashare to install

As Datashare is not signed, this popup asks for your permission. Don't click 'Don't run' but click 'More info':

Click 'Run anyway':

It asks if you want to allow the app to make changes to your device. Click 'Yes':

Install Datashare

On the Installer Wizard, as you need to download and install OpenJDK11 if it is not installed on your device, click 'Install':

The following windows with progress bars will be displayed:

Choose a language and click 'OK':

Install Tesseract OCR

To install Tesseract OCR, click the following buttons on the Installer Wizard's windows:

Untick 'Show README' and click 'Finish':

Finally, click "Close" to close the installer of TesseractOCR.

Install Datashare.jar

It now downloads the back end and the front end, Datashare.jar:

When it is finished, click 'Close':

Start Datashare

Find the application on your computer and have it running locally in your browser.

Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)

A window called 'Terminal' will have opened, showing the progress of opening Datashare. Do not close this black window as long as you use Datashare.

Keep this Terminal window open as long as you use Datashare.

Add documents to Datashare

Datashare provides a folder to use to collect documents on your computer to index in Datashare.

When you open your desktop, you will see a folder called 'Datashare Data'. Move or copy and paste the documents you want to add to Datashare to this folder:

Once Datashare has opened, click on 'Analyze documents' on the top navigation bar in Datashare:

Install on Linux

Install Datashare will help you set up the software on your computer.

Currently, only a .deb package for Debian/Ubuntu is provided.

1. Download Datashare

Save the Debian package as a file

2. Install the package

$ sudo apt install /dir/to/debian/package/datashare-dist_7.2.0_all.deb

3. Run Datashare with:

$ datashare

Start Datashare

Find the application on your computer and run it locally on your browser.

Start Datashare by launching it from the command-line:

Add documents to Datashare

Datashare provides a folder to use to collect documents on your computer to index in Datashare.

You can find a folder called 'Datashare' in your home directory.
Move the documents you want to add to Datashare into this folder.
Open Datashare to extract text and eventually find people, organizations and locations in your documents.

Install with Docker

This page explain how to start Datashare within a Docker.

Prerequisites

Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.

Starting Datashare with a single container

docker run --mount src=$HOME/Datashare,target=/home/datashare/data,type=bind -p 8080:8080 icij/datashare:11.1.9 --mode EMBEDDED

Make sure the Datashare folder exists in your homedir or this command will fail. This is an example about how to use Datashare with Docker, data will not be persisted.

Starting Datashare with multiple containers

Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).

By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.

This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.

version: "3.7"
services:

  datashare:
    image: icij/datashare:18.1.3
    hostname: datashare
    ports:
      - 8080:8080
    environment:
      - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
    volumes:
      - type: bind
        source: ${HOME}/Datashare
        target: /home/datashare/data
      - type: volume
        source: datashare-models
        target: /home/datashare/dist
    command: >-
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
      --mode LOCAL
      --tcpListenPort 8080
    depends_on:
      - postgresql
      - redis
      - elasticsearch

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    restart: on-failure
    volumes:
      - type: volume
        source: elasticsearch-data
        target: /usr/share/elasticsearch/data
        read_only: false
    environment:
      - "http.host=0.0.0.0"
      - "transport.host=0.0.0.0"
      - "cluster.name=datashare"
      - "discovery.type=single-node"
      - "discovery.zen.minimum_master_nodes=1"
      - "xpack.license.self_generated.type=basic"
      - "http.cors.enabled=true"
      - "http.cors.allow-origin=*"
      - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"

  redis:
    image: redis:4.0.1-alpine
    restart: on-failure

  postgresql:
    image: postgres:12-alpine
    environment:
      - POSTGRES_USER=datashare
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=datashare
    volumes:
      - type: volume
        source: postgresql-data
        target: /var/lib/postgresql/data

volumes:
  datashare-models:
  elasticsearch-data:
  postgresql-data:

Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.

Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this opperation with:

docker-compose logs -f datashare

Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.

That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.

To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

docker-compose down

This will stop and remove the containers, freeing up system resources.

Add documents

It will help you index and have your documents in Datashare. This step is required in order to explore your documents.

Add documents

1. To add your documents in Datashare, click 'Tasks' in the left menu:

2. Click 'Analyze your documents':

3. Click 'Add documents' so Datashare can extract the texts from your files:

Options when adding documents

You can:

Select the specific folder or sub-folder containing the documents you want to add.
Extract text also from images/PDFs (OCR). Be aware the indexing can be up to 10 times longer.
Skip already indexed files.

Two extraction tasks are now running: the first is the scanning of your Datashare folder which sees if there are new documents to analyze (ScanTask). The second is the indexing of these files (IndexTask):

It is not possible to 'Find people, organizations and locations' while of these two tasks is still running.

When tasks are done, you can start exploring documents by clicking 'Search' in the left menu but you won't have the named entities (names of people, organizations and locations) yet. To have these, follow the steps below.

Extract names of people, organizations and locations

1. After the text is extracted, you can launch named entities recognition by clicking the button 'Find people, organizations and locations'.

2. In the window below, you are asked to choose between finding Named Entities or finding email addresses (you cannot do both simultaneously, you need to do one after the other, no matter the order):

You can now see running tasks and their progress. After they are done, you can click 'Clear done tasks' to stop displaying tasks that are completed.

3. You can search your indexed documents without having to wait for all tasks to be done. To access your documents, click 'Search':

Extract email addresses

To extract email addresses in your documents:

Re-click on 'Find people, organizations, locations and email addresses' (in Tasks (left menu) > Analyze your documents)
Click the second radio button 'Find email addresses':

Add more languages

This page will explain to you how to install language packages to support Optical Character Recognition (OCR) on more languages.

To be able to perform OCR, Datashare uses an open source technology called Apache Tesseract. When Tesseract extracts text from images, it uses "language packages" especially trained for each specific languages. Unfortunately, those packages can be heavy and to ensure a lightweight installation of Datashare, the installer doesn't use them all by default. In the case Datashare informs you of a missing package, this guide will explain you how to manually install it on your system.

Install packages on Linux

To add ocr languages on linux, simply use the following command:

Where `[lang]` is can be :

all if you want to install all languages

Install packages on Mac

With MacPorts (default)

First, you must check that MacPort is installed on your computer. Please run in a Terminal:

You should see an output similar to this:

If MacPort is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):

Once the installation is done, simply close and restart Datashare to be able to use the newly installed packages.

With Homebrew

If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!

If you want to check if Homebrew is installed, run the following command in a Terminal:

You should see an output similar to this:

Install languages on Windows

*Additionnal languages can be also added during Tesseract's installation.

The list of installed languages can be checked with Windows command prompt or Powershell with the commandtesseract --list-langs.

Datashare has to be restarted after the language installation.

Install plugins and extensions

It will help you locally add plugins and extensions to Datashare.

Add plugins to Datashare (front-end)

Go to "Settings":

Click "Plugins":

Choose the plugin you want to add and click "Install now":

If you want to install a plugin from an URL, click "Install plugin from URL".

Your plugin is installed.

Refresh your page to see your new plugin activated in Datashare.

Add extensions to Datashare (back-end)

Go to "Settings":

Click "Extensions":

Choose the extension you want to add and click "Install now":

If you want to install an extension from an URL, click "Install extension from URL".

Your extension is installed.

Restart Datashare to see your new extension activated in Datashare.

Update plugin or extension with latest version

When a newer version of a plugin or extension is available, you can click on the "Update" button to get the latest version.

After that, if it is a plugin, refresh your page to activate the latest version.

If it is an extension, restart Datashare to activate the latest version.

Create your own plugin or extension

People who code can create their own plugins and extensions by following these steps:

Neo4j

This page explains how to setup neo4j, install the neo4j plugin and create a graph on your computer

Prerequisites

Get neo4j up and running

We recommend using a recent release of Datashare (>= 14.0.0) to use this feature, click on the 'Other platforms and versions' button when downloading to access versions if necessary.

Add entities

If your project contains email documents, make sure to also extract email addresses.

Next step

Install Neo4j plugin

Install the neo4j plugin

Configure the neo4j plugin

1. Go to "Settings":

2. Make sure the following settings are properly set:

Neo4j Host should be localhost or the address where your neo4j instance is running
Neo4j Port should be the port where your neo4j instance is running (7687 by default)
Neo4j User should be set to your neo4j user name (neo4j by default)
Neo4j Password should only be set if your neo4j user is using password authentication

3. When running Neo4j Community Edition, set the Neo4j Single Project value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set the Neo4j Single Project with the name of the project which will use neo4j plugin. Other projects won't be able to use the neo4j plugin.

4. Restart Datashare to apply the changes

5. You should be able to see the neo4j widget in your project page, after a little while its status should be RUNNING:

Next step

Create and update Neo4j graph

This page describes how to create and maintain your neo4j graph up to date with your computer's Datashare projects

Create the graph

Open the 'Projects' page and select your project:

Create the graph by clicking on the 'Create graph' button inside the neo4j widget:

You will see a new import task running:

When the graph creation is complete, 'Graph statistics' will reflect the number of documents and entities nodes found in the graph:

Update the graph

When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.

To update the graph click on the 'Update graph' button inside the neo4j widget:

To detect whether a graph update is needed you can compare the number of documents found inside Datashare to the number found in the 'Graph statistics' and run an update in case of mismatch:

The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.

Next steps

On your server

About the server mode

Launch configuration

Datashare is launched with --mode SERVER and you have to provide:

the external elasticsearch index address elasticsearchAddress
a Redis store address redisAddress
a Redis data bus address messageBusAddress
the host of Datashare (used to generate batch search results URLs) rootHost
an authentication mechanism and its parameters

Example:

docker run -ti ICIJ/datashare:version --mode SERVER \
    --redisAddress redis://my.redis-server.org:6379 \
    --elasticsearchAddress https://my.elastic-server.org:9200 \
    --messageBusAddress my.redis-server.org \
    --dataSourceUrl jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password \
    --rootHost https://my.datashare-server.org
    # ... +auth parameters (see authentication providers section)

Install with Docker

This page explain how to start Datashare within a Docker in server mode.

Prerequisites

Starting Datashare with multiple containers

Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.

version: "3.7"
services:

  # This is the main Datashare service that serves the web interface. 
  # Here it is configured with a "dummy" authentication backend which 
  # creates epehemeral user sessions.
  datashare_web:
    image: icij/datashare:18.1.3
    hostname: datashare
    ports:
      - 8080:8080
    environment:
      - DS_DOCKER_MOUNTED_DATA_DIR=${HOME}/Datashare
    volumes:
      - type: bind
        source: ${HOME}/Datashare
        target: /home/datashare/Datashare
    depends_on:
      postgresql:
        condition: service_healthy
      redis:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy
    command: >-
      --mode SERVER
      --dataDir /home/datashare/Datashare
      --pluginsDir /home/datashare/plugins
      --extensionsDir /home/datashare/extensions
      --authFilter org.icij.datashare.session.YesCookieAuthFilter
      --busType REDIS
      --batchQueueType REDIS
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
      --defaultProject secret-project
      --elasticsearchAddress http://elasticsearch:9200      
      --messageBusAddress redis://redis:6379
      --queueType REDIS
      --redisAddress redis://redis:6379  
      --rootHost http://localhost:8080
      --sessionStoreType REDIS
      --sessionTtlSeconds 43200
      --tcpListenPort 8080

  # We use a service to create the "secret-project". In theory you only need 
  # to run it the first time you start Datashare. 
  datashare_create_project:
    image: icij/datashare:18.1.3
    restart: no
    depends_on:
      elasticsearch:
        condition: service_healthy
    command: >-
      --defaultProject secret-project 
      --mode CLI 
      --stage INDEX 
      --elasticsearchAddress http://elasticsearch:9200

  # This service starts a deamon that wait for background tasks
  # so it can run them (and save them in the database).
  datashare_task:
    image: icij/datashare:18.1.3
    depends_on:
      - datashare_web
    command: >-
      --mode TASK_RUNNER
      --batchQueueType REDIS
      --batchThrottleMilliseconds 500
      --busType REDIS
      --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password
      --defaultProject secret-project 
      --elasticsearchAddress http://elasticsearch:9200  
      --queueType REDIS
      --redisAddress redis://redis:6379
      --scrollSize 100  
      
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    restart: on-failure
    volumes:
      - type: volume
        source: elasticsearch-data
        target: /usr/share/elasticsearch/data
        read_only: false
    environment:
      - "http.host=0.0.0.0"
      - "transport.host=0.0.0.0"
      - "cluster.name=datashare"
      - "discovery.type=single-node"
      - "discovery.zen.minimum_master_nodes=1"
      - "xpack.license.self_generated.type=basic"
      - "http.cors.enabled=true"
      - "http.cors.allow-origin=*"
      - "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail elasticsearch:9200/_cluster/health || exit 1"]

  postgresql:
    image: postgres:12-alpine
    environment:
      - POSTGRES_USER=datashare
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=datashare
      # This is needed by the heathcheck command
      # @see https://stackoverflow.com/a/60194261
      - PGUSER=datashare
    volumes:
      - type: volume
        source: postgresql-data
        target: /var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-U", "datashare", "-d", "datashare"]
  
  redis:
    image: redis:4.0.1-alpine
    restart: on-failure
    volumes:
      - type: volume
        source: redis-data
        target: /data
    healthcheck:
      test: ["CMD-SHELL", "redis-cli", "--raw", "incr", "ping"]

volumes:
  datashare-batchdownload-dir:
  elasticsearch-data:
  postgresql-data:
  redis-data:

Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:

docker-compose up -d

The -d flag runs the containers in detached mode, allowing them to run in the background.

docker-compose logs -f datashare_web

To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:

docker-compose down

This will stop and remove the containers, freeing up system resources.

Add documents to Datashare

Extract named entities

Add documents from the CLI

This is likelly to be changed in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.

Here is a simple command to scan a directory and index its files:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

What's happening here:

The SCAN stage feeds a queue in memory with file to add
The INDEX stage pulls files from the queue to add them to ElasticSearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine

Alternativly, you can do this in two separated phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use the redis:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the opperation is done, we can easily check the content of queue created by Datashare in redis. In this example we only display the 20 first files in the datashare:queue:

docker compose exec redis redis-cli lrange datashare:queue 0 20

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage INDEX \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the indexing is done, Datashare will exit gracefully and your document will already be visible on Datashare.

Sometimes you will face the case where you have an existing index, and you want to index additional documents inside your working directory without processing every document again. It can be done in two steps :

Scan the existing ElasticSearch index and gather document paths to store it inside a report queue
Scan and index (with OCR) the documents in the directory, thanks to the previous report queue, it will skip the paths inside of it

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCANIDX \
  --queueType REDIS \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --ocr true \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Authentication providers

Authentication with Datashare in server mode is the most impacting choice that has to be made. It can be one of the followings:

basic authentication with credentials stored in database (PostgreSQL)
basic authentication with credentials stored in Redis
OAuth2 with credentials provided by an identity provider (KeyCloak for example)
dummy basic auth to accept any user (⚠️ if the service is exposed to internet, it will leak your documents)

Basic with a database

Basic authentication with a database.

Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:

It is secure as long as the communication to the server is encrypted (with SSL for example).

On the server side, you have to provide a database user inventory. You can launch datashare first with the full database URL, then datashare will automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.

Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):

Then you can insert the user like this in your database:

If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For exammple if you use myindex:

Then when accessing Datashare, you should see this popup:

Example

Here is an example of launching Datashare with Docker and the basic auth provider filter backed in database:

Usage

FAQ