This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.
On your computer
On your server
Usage
FAQ
👷♀️ This page is currently being written by the Datashare team.
General
👷♀️ This page is currently being written by the Datashare team.
Definitions
👷♀️ This page is currently being written by the Datashare team.
Common errors
👷♀️ This page is currently being written by the Datashare team.
Developers
How to contribute
👷♀️ This page is currently being written by the Datashare team.
Backend
Frontend
Ask for help
To report a bug, please open an issue on our GitHub repository detailing your logs with:
your Operating System (Mac, Windows or Linux)
the version of your Operating System
the version of Datashare
screenshots of your issue
a description of your issue.
If for confidentiality reasons you don't want to open an issue on GitHub, please write to datashare@icij.org and our team will do its best to answer you in a timely manner.
Start Datashare
Find the application on your computer and run it locally on your browser.
Once Datashare is installed, go to "Finder", then "Applications", and double-click on "Datashare".
A Terminal window called 'Datashare.command' opens and logs the technical operations happening during startup.
Keep this Terminal window open as long as you use Datashare.
Datashare should now automatically open in your default internet browser.
If it doesn’t, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
In local mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
Running modes
Datashare runs using different modes, each with its own specificities.
Mode         Category  Description
LOCAL        Web       To run Datashare on a single computer for a single user.
SERVER       Web       To run Datashare on a server for multiple users.
CLI          CLI       To run document-processing tasks from the command line, without a web server.
TASK_RUNNER  Daemon    To run a daemon that executes pending async tasks.
Web modes
These two modes are the only ones that start a web server and expose Datashare's user interface.
In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environment, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Comparison between modes
Each running mode has its own advantages and limitations. This matrix summarizes the differences:
                 local  server
Multi-users       ❌     ✅
Multi-projects    ❌     ✅
Access-control    ❌     ✅
Indexing UI       ✅     ❌
Plugins UI        ✅     ❌
Extensions UI     ✅     ❌
HTTP API          ✅     ✅
API Key           ✅     ✅
Single JVM        ✅     ❌
Tasks execution   ✅     ❌
When running Datashare in local mode, users can choose to use embedded services (like ElasticSearch, SQLite, an in-memory key/value store) on the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
CLI mode
In CLI mode, Datashare starts without a web server and allows users to perform tasks over their documents. This mode can be used in conjunction with both the local and server modes, and allows users to distribute heavy tasks between several servers.
If you want to learn more about which tasks you can execute in this mode, check out the stages documentation.
Daemon modes
These modes are intended for actions that require waiting for pending tasks.
In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task that downloads the documents matching the user's search and bundles them into a zip file.
In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form in Datashare where they can upload a list of search terms (in CSV format). The daemon will then start a task that searches for all matching documents and stores every occurrence in the database.
How to change modes
Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. Yet when starting Datashare from the command line, you can explicitly specify the running mode. For instance on Ubuntu/Debian:
datashare \
# Switch to SERVER mode
--mode SERVER \
# Dummy session filter to create ephemeral users
--authFilter org.icij.datashare.session.YesCookieAuthFilter \
# Name of the default project for every user
--defaultProject local-datashare \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200 \
# URI of Redis
--redisAddress redis://redis:6379 \
# Store user sessions in Redis
--sessionStoreType REDIS
Use the CLI mode to index documents and analyze them directly. Use the TASK_RUNNER mode to execute async tasks (batch searches, batch downloads, scan, index, NER extraction).
When running Datashare from the command-line, you can pick which "stage" to apply to analyze your documents.
The CLI stages are primarily intended to be run against an instance of Datashare that uses non-embedded resources (ElasticSearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.
1. SCAN
This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.
datashare --mode CLI \
# Select the SCAN stage
--stage SCAN \
# Where the documents are located
--dataDir /path/to/documents \
# Store the queued files in Redis
--dataBusType REDIS \
# URI of Redis
--redisAddress redis://redis:6379
2. INDEX
The INDEX stage is probably the most important (and heaviest!) one. It pulls the documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text, metadata and OCR images. The resulting documents are stored in ElasticSearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command over several servers.
datashare --mode CLI \
# Select the INDEX stage
--stage INDEX \
# Where the documents are located
--dataDir /path/to/documents \
# Store the queued files in Redis
--dataBusType REDIS \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200 \
# Enable OCR
--ocr true \
# URI of Redis
--redisAddress redis://redis:6379
3. NLP
Once a document is available for search (stored in ElasticSearch), you can use the NLP stage to extract named entities from the text. This process will not only create named entity mentions in ElasticSearch, it will also mark every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can be parallelized on several servers as well.
datashare --mode CLI \
# Select the NLP stage
--stage NLP \
# Use CORENLP to detect named entities
--nlpp CORENLP \
# URI of Elasticsearch
--elasticsearchAddress http://elasticsearch:9200
Install on Mac
This guide will help you set up and install Datashare on your computer.
It explains how to install Datashare on Mac. The installer takes care of checking that your system has all the dependencies needed to run Datashare. Because this software uses Tesseract (to perform Optical Character Recognition) and macOS doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.
MacPorts (if neither Homebrew nor MacPorts is installed)
Tesseract with MacPorts or Homebrew
Java JRE 17
Datashare executable
Note: previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore, but Datashare is still published on the Docker Hub and supported with Docker.
Go to your "Downloads" directory in Finder and double-click "datashare-X.Y.Z.pkg":
3. Go through the Datashare Installer
Click 'Continue', 'Install', enter your password and 'Install Software':
The installation begins. You see a progress bar. It stays a long time on "Running package scripts" because it is installing XCode Command Line Tools, MacPorts, Tesseract OCR, the Java Runtime Environment and finally Datashare.
You can see what it actually does by pressing Command+L, which opens a window that logs every action made.
Datashare allows you to search within your files, regardless of their format. It is a free open-source software developed by the International Consortium of Investigative Journalists (ICIJ).
What is Datashare?
Welcome to Datashare - a self-hosted document search software. It is free and open-source software developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, this tool is now a fully-featured search interface to dig into your documents. With the help of several open source tools (Extract, Apache Tika, Tesseract, CoreNLP, OpenNLP, Elasticsearch, etc), Datashare can be used on one single personal computer as well as on 100 interconnected servers.
Who uses it?
Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested in investigations like the Panama Papers or the Luanda Leaks. Seeing the growing interest in ICIJ's technology, we decided to open source this key component of our investigations so that a single journalist as well as big media organizations could use it on their own documents.
We set up a demo instance of Datashare with a small set of documents from the Luxleaks investigation (2014). When using this instance, you will be assigned a temporary user which can star, tag, recommend and explore documents.
Can I run it on my server?
Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to know how it works.
Can I customize Datashare?
When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly with our API.
To make this process more accessible, we implemented the possibility to create plugins. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.
Translations
This project is currently available in English, French, Spanish and Japanese. You can help us to improve and complete translations on Crowdin.
Start Datashare
Find the application on your computer and have it running locally in your browser.
Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)
A window called 'Terminal' will have opened, showing the progress of opening Datashare. Keep this Terminal window open as long as you use Datashare.
Install on Linux
This guide will help you set up and install Datashare on your computer.
Currently, only a .deb package for Debian/Ubuntu is provided.
1. Download Datashare
Save the Debian package as a file
2. Install the package
3. Run Datashare with:
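A minimal sketch of steps 2 and 3, assuming the downloaded package is named datashare-X.Y.Z.deb and installs a datashare launcher on your PATH:
# 2. Install the package (apt resolves its dependencies)
sudo apt install ./datashare-X.Y.Z.deb
# 3. Run Datashare; it should open on localhost:8080 in your browser
datashare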
Add documents to Datashare
Datashare provides a folder on your computer where you collect the documents to index in Datashare.
Find your Datashare folder on your Mac
Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':
On the menu bar at the top of your computer, click 'Go'. Click on 'Home' (the house icon).
You will see a folder called 'Datashare':
If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:
Add documents in your Datashare folder
Copy or place the documents you want to have in Datashare in this Datashare folder.
Launch Datashare
Open your Applications. You should see Datashare. Double click on it:
Datashare opens in your default internet browser. Click 'Tasks':
Click the 3rd tab 'Analyze your documents':
Datashare should now automatically open in your default internet browser.
If it doesn’t, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc), even though it works offline without an internet connection (see FAQ: Can I use Datashare with no internet connection?).
It's now time to add your documents to Datashare.
If you want to run it with another Linux distribution, you can download the latest version of the Datashare jar and adapt the provided launch script to your environment.
Datashare provides a folder on your computer where you collect the documents to index in Datashare.
When you open your desktop, you will see a folder called 'Datashare Data'. Move or copy and paste the documents you want to add to Datashare to this folder:
Once Datashare has opened, click on 'Analyze documents' on the top navigation bar in Datashare:
Now open Datashare, which you will find in your main menu (see above).
Find the application on your computer and run it locally on your browser.
Start Datashare by launching it from the command-line:
datashare
Datashare should now automatically open in your default internet browser. If it doesn’t, type "localhost:8080" in your browser. Datashare must be accessed from your internet browser (Firefox, Chrome, etc), even though it works offline without an internet connection (see: Can I use Datashare with no internet connection?).
This page explains how to start Datashare within a Docker container.
Prerequisites
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
To start Datashare within a Docker container, you can use this command:
docker run --mount src=$HOME/Datashare,target=/home/datashare/data,type=bind -p 8080:8080 icij/datashare:11.1.9 --mode EMBEDDED
Make sure the Datashare folder exists in your home directory or this command will fail. This is a minimal example of how to use Datashare with Docker; data will not be persisted.
Starting Datashare with multiple containers
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).
By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare with Docker Compose, you can use the following docker-compose.yml file:
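As a minimal sketch (not the official file), the following docker-compose.yml runs Datashare in local mode against a persistent Elasticsearch container, reusing the image, port, paths and flags shown elsewhere in this documentation; adapt versions and volumes to your needs:
version: "3.7"
services:
  datashare:
    image: icij/datashare:11.1.9
    ports:
      - 8080:8080
    volumes:
      # Bind-mount the Datashare folder from your home directory
      - type: bind
        source: ${HOME}/Datashare
        target: /home/datashare/data
    command: >-
      --mode LOCAL
      --dataDir /home/datashare/data
      --elasticsearchAddress http://elasticsearch:9200
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
    environment:
      - "discovery.type=single-node"
    volumes:
      # Named volume so the index survives container restarts
      - type: volume
        source: elasticsearch-data
        target: /usr/share/elasticsearch/data
volumes:
  elasticsearch-data: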
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
docker-compose up -d
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
docker-compose logs -f datashare
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
docker-compose down
This will stop and remove the containers, freeing up system resources.
Add documents to Datashare
Datashare provides a folder on your computer where you collect the documents to index in Datashare.
You can find a folder called 'Datashare' in your home directory.
Move the documents you want to add to Datashare into this folder.
Open Datashare to extract text and, optionally, find people, organizations and locations in your documents.
This guide will help you index your documents into Datashare. This step is required in order to explore your documents.
Add documents
1. To add your documents in Datashare, click 'Tasks' in the left menu:
2. Click 'Analyze your documents':
3. Click 'Add documents' so Datashare can extract the texts from your files:
Options when adding documents
You can:
Select the specific folder or sub-folder containing the documents you want to add.
Extract text also from images/PDFs (OCR). Be aware that indexing can then take up to 10 times longer.
Select the language of your documents if you don't want Datashare to guess it automatically.
Note: if you choose to also extract text from images (previous option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to know how to install language packages.
Skip already indexed files.
Two extraction tasks are now running: the first is the scanning of your Datashare folder which sees if there are new documents to analyze (ScanTask). The second is the indexing of these files (IndexTask):
It is not possible to 'Find people, organizations and locations' while one of these two tasks is still running.
When tasks are done, you can start exploring documents by clicking 'Search' in the left menu but you won't have the named entities (names of people, organizations and locations) yet. To have these, follow the steps below.
Extract names of people, organizations and locations
1. After the text is extracted, you can launch named entities recognition by clicking the button 'Find people, organizations and locations'.
2. In the window below, you are asked to choose between finding Named Entities or finding email addresses (you cannot do both simultaneously, you need to do one after the other, no matter the order):
You can now see running tasks and their progress. After they are done, you can click 'Clear done tasks' to stop displaying tasks that are completed.
3. You can search your indexed documents without having to wait for all tasks to be done. To access your documents, click 'Search':
Extract email addresses
To extract email addresses in your documents:
Re-click on 'Find people, organizations, locations and email addresses' (in Tasks (left menu) > Analyze your documents)
Click the second radio button 'Find email addresses':
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your computer.
Prerequisites
Get neo4j up and running
Follow the instructions of the dedicated FAQ page to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature, click on the 'Other platforms and versions' button when downloading to access versions if necessary.
Add entities
If you haven't done so yet, analyze your documents and extract both names of people, organizations and locations as well as email addresses.
If your project contains email documents, make sure to also extract email addresses.
This page explains how to install language packages to support Optical Character Recognition (OCR) in more languages.
To perform OCR, Datashare uses an open source technology called Tesseract. When Tesseract extracts text from images, it uses "language packages" specially trained for each specific language. Unfortunately, those packages can be heavy, and to ensure a lightweight installation of Datashare, the installer doesn't install them all by default. In case Datashare informs you of a missing package, this guide explains how to manually install it on your system.
Install packages on Linux
To add OCR languages on Linux, simply use the following command:
sudo apt install tesseract-ocr-[lang]
Where `[lang]` can be:
all if you want to install all languages
a language code (e.g. fra for French); the list of languages is available here
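For example, to install the French package (restart Datashare afterwards so it picks up the new language):
sudo apt install tesseract-ocr-fra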
Install packages on Mac
The Datashare Installer for Mac checks for the existence of either MacPorts or Homebrew, the package managers used for the installation of Tesseract. If neither of those two package managers is present, the Datashare Installer will install MacPorts by default.
With MacPorts (default)
First, you must check that MacPorts is installed on your computer. Please run in a Terminal:
port version
You should see an output similar to this:
If you get a command not found: port error, this either means you are using Homebrew (see next section) or you did not run the Datashare installer for Mac yet.
If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):
port install tesseract-deu
The full list of supported language packages can be found on MacPorts website.
Once the installation is done, simply close and restart Datashare to be able to use the newly installed packages.
With Homebrew
If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!
If you want to check if Homebrew is installed, run the following command in a Terminal:
brew -v
You should see an output similar to this:
If you get a command not found: brew error, this means Homebrew is not installed on your system. You might either use MacPorts (see previous section) or run the Datashare installer for Mac on your computer.
Install languages on Windows
Language packages are available on the Tesseract GitHub repository. Trained data files have to be downloaded and added to the tessdata folder inside Tesseract's installation folder.
Additional languages can also be added during Tesseract's installation.
The list of installed languages can be checked with the Windows command prompt or Powershell with the command tesseract --list-langs.
Datashare has to be restarted after the language installation.
Install Neo4j plugin
Install the neo4j plugin
Install the neo4j plugin following instructions available in the dedicated page.
Configure the neo4j plugin
1. Go to "Settings":
2. Make sure the following settings are properly set:
Neo4j Host should be localhost or the address where your neo4j instance is running
Neo4j Port should be the port where your neo4j instance is running (7687 by default)
Neo4j User should be set to your neo4j user name (neo4j by default)
Neo4j Password should only be set if your neo4j user is using password authentication
3. When running Neo4j Community Edition, set the Neo4j Single Project value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project which will use the neo4j plugin. Other projects won't be able to use the neo4j plugin.
4. Restart Datashare to apply the changes
5. You should be able to see the neo4j widget in your project page, after a little while its status should be RUNNING:
In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Launch configuration
Datashare is launched with --mode SERVER and you have to provide:
the external elasticsearch index address elasticsearchAddress
a Redis store address redisAddress
a Redis data bus address messageBusAddress
the host of Datashare (used to generate batch search results URLs) rootHost
an authentication mechanism and its parameters
a database dataSourceUrl
Example:
docker run -ti icij/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password \
--rootHost https://my.datashare-server.org
# ... +auth parameters (see authentication providers section)
This page explains how to start Datashare within Docker in server mode.
Prerequisites
Datashare platform is designed to function effectively by utilizing a combination of various services. To streamline the development and deployment workflows, Datashare relies on the use of Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file:
version: "3.7"
services:
# This is the main Datashare service that serves the web interface.
# Here it is configured with a "dummy" authentication backend which
# creates epehemeral user sessions.
datashare_web:
image: icij/datashare:18.1.3
hostname: datashare
ports:
- 8080:8080
environment:
- DS_DOCKER_MOUNTED_DATA_DIR=${HOME}/Datashare
volumes:
- type: bind
source: ${HOME}/Datashare
target: /home/datashare/Datashare
depends_on:
postgresql:
condition: service_healthy
redis:
condition: service_healthy
elasticsearch:
condition: service_healthy
command: >-
--mode SERVER
--dataDir /home/datashare/Datashare
--pluginsDir /home/datashare/plugins
--extensionsDir /home/datashare/extensions
--authFilter org.icij.datashare.session.YesCookieAuthFilter
--busType REDIS
--batchQueueType REDIS
--dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password
--defaultProject secret-project
--elasticsearchAddress http://elasticsearch:9200
--messageBusAddress redis://redis:6379
--queueType REDIS
--redisAddress redis://redis:6379
--rootHost http://localhost:8080
--sessionStoreType REDIS
--sessionTtlSeconds 43200
--tcpListenPort 8080
# We use a service to create the "secret-project". In theory you only need
# to run it the first time you start Datashare.
datashare_create_project:
image: icij/datashare:18.1.3
restart: no
depends_on:
elasticsearch:
condition: service_healthy
command: >-
--defaultProject secret-project
--mode CLI
--stage INDEX
--elasticsearchAddress http://elasticsearch:9200
# This service starts a deamon that wait for background tasks
# so it can run them (and save them in the database).
datashare_task:
image: icij/datashare:18.1.3
depends_on:
- datashare_web
command: >-
--mode TASK_RUNNER
--batchQueueType REDIS
--batchThrottleMilliseconds 500
--busType REDIS
--dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password
--defaultProject secret-project
--elasticsearchAddress http://elasticsearch:9200
--queueType REDIS
--redisAddress redis://redis:6379
--scrollSize 100
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:7.9.1
restart: on-failure
volumes:
- type: volume
source: elasticsearch-data
target: /usr/share/elasticsearch/data
read_only: false
environment:
- "http.host=0.0.0.0"
- "transport.host=0.0.0.0"
- "cluster.name=datashare"
- "discovery.type=single-node"
- "discovery.zen.minimum_master_nodes=1"
- "xpack.license.self_generated.type=basic"
- "http.cors.enabled=true"
- "http.cors.allow-origin=*"
- "http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE"
healthcheck:
test: ["CMD-SHELL", "curl --silent --fail elasticsearch:9200/_cluster/health || exit 1"]
postgresql:
image: postgres:12-alpine
environment:
- POSTGRES_USER=datashare
- POSTGRES_PASSWORD=password
- POSTGRES_DB=datashare
# This is needed by the heathcheck command
# @see https://stackoverflow.com/a/60194261
- PGUSER=datashare
volumes:
- type: volume
source: postgresql-data
target: /var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready", "-U", "datashare", "-d", "datashare"]
redis:
image: redis:4.0.1-alpine
restart: on-failure
volumes:
- type: volume
source: redis-data
target: /data
healthcheck:
test: ["CMD-SHELL", "redis-cli", "--raw", "incr", "ping"]
volumes:
datashare-batchdownload-dir:
elasticsearch-data:
postgresql-data:
redis-data:
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
docker-compose up -d
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progression of this operation with:
docker-compose logs -f datashare_web
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
docker-compose down
This will stop and remove the containers, freeing up system resources.
Add documents to Datashare
If you reach this point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.
Extract named entities
Datashare has the ability to detect email addresses, names of people, organizations and locations. You must perform the named entities extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.
Create and update Neo4j graph
This page describes how to create your neo4j graph and keep it up to date with your computer's Datashare projects.
Create the graph
Open the 'Projects' page and select your project:
Create the graph by clicking on the 'Create graph' button inside the neo4j widget:
You will see a new import task running:
When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:
Update the graph
When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.
To update the graph click on the 'Update graph' button inside the neo4j widget:
To detect whether a graph update is needed you can compare the number of documents found inside Datashare to the number found in the 'Graph statistics' and run an update in case of mismatch:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.
Here is a simple command to scan a directory and index its files:
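As a sketch (assuming the Docker Compose setup above, and that stages can be combined with a comma, as "both SCAN and INDEX at the same time" suggests):
# Scan and index in one pass; the queue stays in memory
docker-compose exec datashare_web datashare --mode CLI \
--stage SCAN,INDEX \
--dataDir /home/datashare/Datashare \
--elasticsearchAddress http://elasticsearch:9200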
What's happening here:
The SCAN stage feeds an in-memory queue with the files to add
The INDEX stage pulls files from the queue to add them to ElasticSearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/ which is a directory mounted from the host machine
Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:
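A sketch of the first, scan-only phase, mirroring the SCAN stage command from the CLI stages documentation above:
docker-compose exec datashare_web datashare --mode CLI \
--stage SCAN \
--dataDir /home/datashare/Datashare \
--dataBusType REDIS \
--redisAddress redis://redis:6379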
Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in the datashare:queue:
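For example, with the redis container from the compose file (lrange 0 19 returns the first 20 entries):
docker-compose exec redis redis-cli lrange datashare:queue 0 19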
Once the indexing is done, Datashare will exit gracefully and your documents will already be visible in Datashare.
Sometimes you will have an existing index and want to index additional documents inside your working directory without processing every document again. It can be done in two steps (a sketch follows the list):
Scan the existing ElasticSearch index and gather document paths to store them inside a report queue
Scan and index (with OCR) the documents in the directory; thanks to the previous report queue, the paths inside of it will be skipped
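A heavily hedged sketch of those two steps: the SCANIDX stage name and the --reportName option are recalled from the Datashare CLI and should be checked against datashare --help before use:
# Step 1: fill a report queue from the existing index (names are assumptions)
docker-compose exec datashare_web datashare --mode CLI \
--stage SCANIDX \
--reportName extract:report \
--elasticsearchAddress http://elasticsearch:9200 \
--redisAddress redis://redis:6379
# Step 2: scan and index with OCR, skipping paths already in the report
docker-compose exec datashare_web datashare --mode CLI \
--stage SCAN,INDEX \
--ocr true \
--dataDir /home/datashare/Datashare \
--reportName extract:report \
--elasticsearchAddress http://elasticsearch:9200 \
--redisAddress redis://redis:6379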
This document assumes you have installed Datashare.
Datashare starts in "CLI" mode
We ask it to process both the SCAN and INDEX stages at the same time
The INDEX stage can now be executed in the same container:
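A sketch of that INDEX command, pulling the files queued in Redis by the SCAN phase:
docker-compose exec datashare_web datashare --mode CLI \
--stage INDEX \
--dataBusType REDIS \
--redisAddress redis://redis:6379 \
--elasticsearchAddress http://elasticsearch:9200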
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users to offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses, names of people, organizations and locations. This process uses a Natural Language Processing pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entities extraction in the same fashion as the previous CLI stages:
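A sketch of that command, matching the list below; pairing --parallelism with the NLP stage is an assumption:
docker-compose exec datashare_web datashare --mode CLI \
--stage NLP \
--nlpp CORENLP \
--parallelism 2 \
--elasticsearchAddress http://elasticsearch:9200 \
--redisAddress redis://redis:6379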
We tell Datashare to use the elasticsearch service
Datashare will pull documents from ElasticSearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
The first time you run this command you will have to wait a little, because Datashare needs to download CORENLP's models, which can be big.
The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the ids of those documents into the extract:queue:nlp queue.
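Assuming stages can be chained with a comma as elsewhere in this documentation, such a full run could look like:
docker-compose exec datashare_web datashare --mode CLI \
--stage ENQUEUEIDX,NLP \
--nlpp CORENLP \
--elasticsearchAddress http://elasticsearch:9200 \
--redisAddress redis://redis:6379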
OAuth2
OAuth2 authentication with a third-party id service
This is the default authentication mode: if no auth filter is provided on the command line, it will be selected. With OAuth2 you will need a third-party authorization service. The diagram below describes the workflow:
We made a small demo repository to show how it could be setup.
Basic with a database
Basic authentication with a database.
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
Authorization: Basic dXNlcjpwYXNzd29yZA==
It is secure as long as the communication to the server is encrypted (with SSL for example).
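For reference, the header value above is simply the base64 encoding of user:password, which you can reproduce from a shell:
echo -n user:password | base64
# prints dXNlcjpwYXNzd29yZA==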
On the server side, you have to provide a database user inventory. You can launch datashare first with the full database URL, then datashare will automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.
Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -
Then you can insert the user like this in your database:
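For illustration only: the table and column names below are assumptions, as the actual schema is whatever Datashare migrated into your database; check it before running anything like this. The JSON payload mirrors the Redis provisioning example later in this documentation.
-- Hypothetical table and columns; verify against your migrated schema
INSERT INTO user_inventory (id, details)
VALUES ('foo', '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}');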
If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example if you use myindex:
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your server.
Prerequisites
Get neo4j up and running
Follow the instructions of the dedicated FAQ page to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature, click on the 'Other platforms and versions' button when downloading to access versions if necessary.
If your project contains email documents, make sure to run the EMAIL pipeline together with the regular NLP pipeline. To do so, set the nlpp flag to --nlpp CORENLP,EMAIL.
This page describes how to create your neo4j graph and keep it up to date with your server's Datashare projects.
Run the neo4j extension CLI
The neo4j related features are added to the Datashare CLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH must be extended with the path of the datashare-extension-neo4j jar. By default, this jar is located in /home/datashare/extensions, so the CLI will be run as follows:
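A sketch of such an invocation; the datashare jar path and main class below are assumptions to adapt to your installation:
# Extend the CLASSPATH with the neo4j extension jar (paths are assumptions)
java -cp /home/datashare/dist/datashare.jar:/home/datashare/extensions/datashare-extension-neo4j.jar \
org.icij.datashare.Main --mode CLI ...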
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
To detect whether a graph update is needed, open the 'Projects' page and select your project:
compare the number of documents and entities found inside Datashare:
to the numbers found in the 'Graph statistics' and run an update in case of mismatch:
Installing the plugin installs the datashare-plugin-neo4j-graph-widget plugin inside /home/datashare/plugins and also installs the datashare-extension-neo4j backend extension inside /home/datashare/extensions. These locations can be changed by updating the docker-compose.yml.
...
services:
  datashare_web:
    ...
    environment:
      - DS_DOCKER_NEO4J_HOST=neo4j
      - DS_DOCKER_NEO4J_PORT=7687
      - DS_DOCKER_NEO4J_SINGLE_PROJECT=secret-project # This is for community edition only
If you choose a different neo4j user or set a password for your neo4j user, make sure to also set DS_DOCKER_NEO4J_USER and DS_DOCKER_NEO4J_PASSWORD.
When running Neo4j Community Edition, set the DS_DOCKER_NEO4J_SINGLE_PROJECT value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set DS_DOCKER_NEO4J_SINGLE_PROJECT to the name of the project which will use the neo4j plugin. Other projects won't be able to use the neo4j plugin.
Restart Datashare
After installing the plugin a restart might be needed for the plugin to display:
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the header Authorization with user:password base64 encoded:
Authorization: Basic dXNlcjpwYXNzd29yZA==
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded. For example using bash:
$ echo -n bar | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 -
Then insert the user like this in Redis:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["local-datashare"]}}'
If you use other indices, you'll have to include them in groups_by_applications, but local-datashare should remain. For example if you use myindex:
$ redis-cli -h my.redis-server.org
redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "groups_by_applications":{"datashare":["myindex","local-datashare"]}}'
Then you should see this popup:
Example
Here is an example of launching Datashare with Docker and the basic auth provider filter backed by Redis:
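A sketch assembled from the server-mode options above; the BasicAuthAdaptorFilter class name is an assumption to verify against your Datashare version:
docker run -ti -p 8080:8080 icij/datashare:version --mode SERVER \
--authFilter org.icij.datashare.session.BasicAuthAdaptorFilter \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--sessionStoreType REDIS \
--rootHost https://my.datashare-server.org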
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even if we take care of getting the best performance possible with Apache Tika and Tesseract OCR, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Separate Processing Stages
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
Leverage Parallelism
Datashare offers --parallelism and --parserParallelism options to enhance processing speed.
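For example (the values are illustrative; tune them to the number of CPUs on your machine):
datashare --mode CLI --stage INDEX --parallelism 8 --parserParallelism 4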
ElasticSearch can significantly consume CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.
Adjust JAVA_OPTS
You can fine-tune the JAVA_OPTS environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge with 128 GB of memory):
JAVA_OPTS="-Xms10g -Xmx50g" datashare --mode CLI --stage INDEX
Specify Document Language
If the document language is known, explicitly setting it can save processing time.
Use --language for general language setting (e.g., FRENCH, ENGLISH).
Use --ocrLanguage for OCR tasks to specify the Tesseract model (e.g., fra, eng).
Example:
datashare --mode CLI --stage INDEX --language FRENCH --ocrLanguage fra
datashare --mode CLI --stage INDEX --language CHINESE --ocrLanguage chi_sim
datashare --mode CLI --stage INDEX --language GREEK --ocrLanguage ell
Manage OCR Tasks Wisely
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false.
Example:
datashare --mode CLI --stage INDEX --ocr false
Efficient Handling of Large Files
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
Example to split Outlook PST files in multiple .eml files with readpst:
readpst -reD <Filename>.pst
Search documents
You can search with the main search bar, with operators, and also within a document thanks to control or command + F.
Search in documents
1. To see all your documents (you need to have added documents to Datashare and have analyzed them before), click 'Search in documents':
If not collapsed yet, to collapse the left menu in order to gain room, click the 'hamburger menu':
2. Search for specific documents. Type terms in the search bar, press Enter or click 'Search':
IMPORTANT:
To make your searches more precise, you can search with operators (AND, OR, ...): read more here.
If you get a message "Your search query is wrong", it is probably because you are misusing one or several reserved characters (like ^ " ? ( [ * OR AND etc). Please refer to this page.
3. You can search in specific fields like tags, title, author, recipient, content, path or thread ID. Click 'All fields' and select your choice in the dropdown menu:
Choose between views (list, grid, table)
Select the view on the top right.
List:
Grid:
Table:
Search within a document
Once a document is opened, you can search for terms in this document:
Press Command (⌘) + F (on Mac) or Control + F (on Windows and Linux) or click on the search bar above your Extracted Text
Type what you search for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
This also counts the number of occurrences of your searched terms in this document:
If you ran email extraction and searched for one or several email addresses, and the email addresses are in the email's metadata (recipient, sender or another field), there will be an 'in metadata' label attached to the email addresses:
Search with operators / Regex
To make your searches more precise, you can use operators in the main search bar.
Double quotes for exact phrase
To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).
Example: "Alicia Martinez’s bank account in Portugal"
OR (or space)
To have all documents mentioning at least one of the queried terms, you can use a simple space between your queries or 'OR'. You need to write 'OR' with all letters uppercase.
Example: Alicia Martinez
Same search: Alicia OR Martinez
AND (or +)
To have all documents mentioning all the queried terms, you can use 'AND' between your queried words. You need to write 'AND' with all letters uppercase.
Example: Alicia AND Martinez
Same search: +Alicia +Martinez
NOT (or ! or -)
To have all documents NOT mentioning some queried terms, you can use 'NOT' before each word you don't want. You need to write 'NOT' with all letters uppercase.
Example: NOT Martinez
Same search: !Martinez
Same search: -Martinez
Please note that you can combine operators
Parentheses should be used whenever multiple operators are used together and you want to give priority to some.
Example: ((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"
You can also combine these with 'regular expressions' Regex between two slashes (see below).
Wildcards
If you search for faithf?l, the search engine will look for all words with any possible single character between the second f and the l. It also works with * to replace multiple characters.
Example: Alicia Martin?z
Example: Alicia Mar*z
Fuzziness
You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
Proximity searches
When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox"(source: Elastic).
Boosting operators
Use the boost operator ^ to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:
Example: quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:
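Example: "john smith"^2 (foo bar)^4 (this syntax follows Elasticsearch's query string documentation)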
"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." (Wikipedia).
1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between 2 slashes.
Example: /.*\..*@.*\..*/
The example above will search for any expression which is structured like an email address, with a dot between two expressions before the @ and a dot between two expressions after the @, like in 'first.lastname@email.com' for instance.
2. Regex can be combined with standard queries in Datashare:
Example: ("Ada Lovelace" OR "Ado Lavelace") AND paris AND /.*\..*@.*\..*/
3. You need to escape the following characters by typing a backslash just before them (without space): # @ & < > ~
Example: /.*\..*\@.*\..*/ (the @ was escaped by a backslash \ just before it)
4. Important: Datashare relies on Elastic's Regex syntax as explained here. Datashare uses the Standard tokenizer. A consequence of this is that spaces cannot be searched as such in Regex.
We encourage you to use the AND operator to work around this limitation and make sure you can make your search.
If you're looking for French International Bank Account Numbers (IBAN) that may or may not contain spaces and contain FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:
Example: /FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)
Here are a few examples of useful Regex:
You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.
You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with either y nor u.
You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.
Other common Regex examples:
phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/
You can find many other examples on this site. More generally, if you use a regex found on the internet, beware that the syntax is not necessarily compatible with Elasticsearch's. For example \d, \S and the like are not understood.
(Advanced) Searches using metadata fields
To find the list of existing metadata fields, go to a document's 'Tags and details' tab, click 'Show more details'.
When you hover the lines, you see a magnifying glass on each line. Click on it and Datashare will look for this field. Here is the one for content language:
Here is the one for 'indexing date' (also called extraction date here) for instance:
So for example, if you are looking for documents that:
contains term1, term2 and term3
and were created after 2010
you can use the 'Date' filter or type in the search bar:
term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01
Explanations:
'metadata.tika_metadata_creation_date:' means that we filter with the creation date
'>=' means 'since January 1st, 2010, included'
'2010-01-01' stands for January 1st, 2010, and the search will include that date
For other searches:
'>' will mean 'strictly after (with January 1st excluded)'
nothing will mean 'at this exact date'
You can search for values in a range. Ranges can be specified for date, numeric or string fields, among the ones you can find by clicking the magnifying glass when you hover the fields in a document's 'Tags and Details' tab.
Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to Elastic's page on ranges.
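Example: metadata.tika_metadata_creation_date:[2010-01-01 TO 2012-12-31] will match documents created between these two dates, both included.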
Sort documents
You can sort documents by:
relevance (by default): it is a score calculated by the search engine
indexing date: when you analyzed the document, the day and time you 'put' them in Datashare
creation date: the day and time the document was created, as it is written in the document's metadata
size of the documents
path of the documents
You can also decide the number of documents displayed by page (10, 25, 50 or 100):
Search documents in batch
Batch searching lets you get the results of every query in a list, all at once.
If you want to search a list of queries in Datashare, instead of doing each of them one by one, you can upload the list directly in Datashare.
To do so, you will:
Create a list of terms that you want to search in the first column of a spreadsheet
Export the spreadsheet as a CSV (a special format available in any spreadsheet software)
Upload this CSV in the "new Batch Search" form in Datashare
Get the results for each query in Datashare - or in a CSV.
Prepare your batch search
Write your queries in a spreadsheet
Write your queries, one per line and per cell, in the first column of a spreadsheet (Excel, Google Sheets, Numbers, Framacalc, etc.). In the example below, there are 4 queries:
Do not put line break(s) in any of your cells.
To delete line break(s) in your spreadsheet, you can use the "Find and replace all" functionality. Find all "\n" and replace them all by nothing or a space.
Write at least 2 characters in each cell. If one cell contains a single character while at least one other cell contains more, the single-character cell will be ignored. If all cells contain only one character, the batch search will lead to 'failure'.
If you have blank cells in your spreadsheet...
...the CSV (which stands for 'Comma-separated values') will keep these blank cells. It will separate them with semicolons (the 'commas'). You will thus have semicolons in your batch search results (see screenshot below). To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in "1,8 million" in our example above), the CSV will formally put the content of the cell in double quotes in your results and search for the exact phrase in double quotes.
Want to search only on some documents?
For instance, if you want to search only in some documents with certain tag(s), you can write your queries like this: "Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)".
Use operators in your CSV
Please also note that searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
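For instance, with 'Do phrase matches' turned off, the first column of your spreadsheet could contain hypothetical queries like:
"Ada Lovelace"
Paris AND NOT Berlin
climat*
Each line is one query, and operators are interpreted as in the main search bar.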
Export your CSV encoded in UTF-8
Export your spreadsheet in a CSV format like this:
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
Other spreadsheet software: please refer to the software's user guide.
Launch your batch search
Open Datashare, click 'Batch searches' in the left menu and click 'New batch search' on the top right:
Type a name for your batch search:
Upload your CSV:
Add a description (optional):
Set the advanced filters ('Do phrase matches', 'Fuzziness' or 'Proximity searches', 'File types' and 'Path') according to your preferences:
What is 'Do phrase matches'?
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query.
What is fuzziness?
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other. If you search for similar terms (to catch typos for example), use fuzziness.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
What are proximity searches?
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
Click 'Add'. Your batch search will appear in the table of batch searches.
Get your results
Open your batch search by clicking its name:
You see your results and you can sort them by clicking a column's name. 'Rank' means the order in which each query's results would appear if run in Datashare's main search bar. They are thus sorted by relevance score by default.
You can click on a document's name and it will open it in a new tab:
You can filter your results by query and read how many documents there are for each query:
You can search for specific queries:
You can also download your results in a CSV format:
Relaunch your batch search
If you keep adding files to Datashare, you might want to relaunch an existing batch search so it covers your new documents too.
Notes:
In the server collaborative mode, you can only relaunch your own batch searches, not others'.
The relaunched batch search will apply to your whole corpus, newly indexed documents and previously indexed documents (not only the newly indexed ones).
To do so, open the batch search that you'd like to relaunch and click 'Relaunch':
Edit the name and the description of your batch search if needed:
You can choose to delete the current batch search after relaunching it:
Note: if you're worried about losing your previous results because of an error, we recommend keeping your current batch search (turn this toggle button off) and deleting it only once the relaunch is a success.
Click 'Submit':
You can see your relaunched batch search running in the batch search's list:
I get a "failure". What does that mean?
Failures in batch searches can be due to several causes.
Click the 'See error' button to open the error window:
The first query containing an error makes the batch search fail and stop.
Check this first failure-generating query in the error window:
We recommend removing the slash, as well as any reserved characters, and re-running the batch search.
'elasticsearch: Name does not resolve'
If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
'Data too large'
One of your queries can lead to a 'Data too large' error.
It means that this query had too many results or in their results, some documents that were too big to process for Datashare. This makes the search engine fail.
We recommend to remove the query responsible for the error and re-start your batch search without the query which led to the 'Data too large' error.
'SearchException: query='AND ada' '
One or several of your queries contain syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators (see below).
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
They are more likely to happen when 'do phrase matches' toggle button is turned off:
When 'Do phrase matches' is on, syntax error can still happen though:
Here are the most common errors:
- A query starts with AND (all uppercase)
- A query starts with OR (all uppercase)
- A query contains only one double quote or a double quote in a word
- A query starts with or contains tilde (~) inside a term
- A query starts with or contains a caret (^)
- A query contains one slash (/)
- A query uses square brackets ([ ])
Delete your batch search
Open your batch search and click the trash icon:
Then click 'Yes':
Filter documents
You can use several filters in the left column of the main screen. Applied filters are shown at the top of the results column. You can also 'contextualize' and reset the filters.
Apply filters
On the left column, you can apply filters by ticking them, like 'Portable Document Format (PDF)' in File Types and 'English' in Languages in the example below:
A reminder of the currently applied filters, as well as your queried terms, is displayed at the top of the results column. You can easily deselect these filters from there by clicking them, or clear all of them:
The currently available filters are:
Projects: if you have more than one project, you can select several of them and run searches in multiple projects at once.
Starred: If you have starred documents, you can easily find them again.
Tags: If you wrote some tags, you will be able to select and search for them.
Recommended by: available only in server (collaborative) mode, this functionality helps you find the documents recommended by you and/or others.
File type: This is the 'Content type' of the file (Word, PDF, JPEG image, etc.) as you can read it in a document's 'Tags & Details'.
Creation dates: the calendar allows you to select a single creation date or a date range. This is when the document was created, as recorded in its properties. You can find this in a document's 'Tags & Details'.
Languages: Datashare detects the main language of each document.
People / Organizations / Locations: you can select these named entities and search for them.
Path: This is the location of your documents as it is indicated in your original files (ex: desktop/importantdocuments/mypictures). You can find this in a document's 'Tags & Details'.
Indexing date: This date corresponds to when you indexed the documents in Datashare.
Extraction level: This regards embedded documents. The file on disk is level zero. If a document (pictures, etc) is attached or contained in a file on disk, extraction level is “1st”. If a document is attached or contained in a document itself contained in a file on disk, extraction level is “2nd”, etc.
Filters can be combined together and combined with searches in order to refine results.
Filter by named entities
If you have asked Datashare to 'Find people, organizations and locations', you can see names of individuals, organizations and locations in the filters. These are the named entities automatically detected by Datashare.
Search for named entities in the filter's search bar:
Select all of them, one or several of them to filter the documents that mention them:
Use the 'Exclude' button
If you want to select all items except one or several of them, you can use the 'Exclude button'.
It allows you to search for all documents which do not correspond to the filter(s) you selected, that is to say the filters currently struck through.
Contextualize filters
In several filters, you can tick 'Contextualize': this will update the number of documents indicated in the filters so that it reflects the current results. The filter will only count what you selected.
In the example below, the 'Contextualize' checkboxes are not ticked:
After the Contextualize button in Tags filter is ticked:
After the 'Contextualize' checkbox in the Languages filter is ticked:
Clear all filters
To reset all filters at the same time, click 'Clear all':
In the advanced filters, you will be able to select some file types and some paths if you want to search only in some documents.
But you can also use search operators in your queries.
AND NOT * ? ! + - do work in batch searches (as they do in the regular search bar) but only if 'Do phrase matches' in the advanced filters is turned off.
Reserved characters, when misused, can lead to failures because of syntax errors.
Important: use the UTF-8 encoding.
Microsoft Excel: if it is not set by default, select "CSV UTF-8" as the export format.
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
In the case above, the slash (/) used between 'Heroin' and 'Opiates' is a reserved character, so Datashare interpreted this query as a syntax error, failed, and didn't go further: the batch search stopped.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
You need to correct the error(s) in your CSV and re-launch a new batch search with a CSV that does not contain errors.
You cannot start a query with AND all uppercase, neither in Datashare's main search bar nor in your CSV.
You cannot start a query with OR all uppercase, neither in Datashare's main search bar nor in your CSV.
You cannot type a query with only one double quote, neither in Datashare's main search bar nor in your CSV.
You cannot start a query with tilde (~) or make one contain a tilde, neither in Datashare's main search bar nor in your CSV. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot start a query with caret (^) or make it contain a caret, neither in Datashare's main search bar nor in your CSV.
You cannot start a query with slash (/) or make it contain a slash, neither in Datashare's main search bar nor in your CSV. Slashes are reserved for Regex. Note that you can use Regex in batch searches.
Once you opened a document, you can explore the document's data through different tabs.
Extracted text
In 'Extracted Text', you can read the text of a document as extracted by Datashare:
Please beware that Datashare shows named entities by default. This can overwrite some original text with wrong named entities. It is thus important to always verify the original text by deactivating named entity overwriting. To do so, please:
Turn off the toggle button ‘Show named entities’ and read the extracted text
Check the ‘Preview’ of original document if available
Check the original document at its original location or by clicking the pink button ‘Download’
Search for attachments
If the document has attachments (technically called 'children documents'), you will find them at the end of the document. Click their pink button to open them:
To open all the attachments in Datashare, click 'See children documents' in Tags and Details:
Search for terms within this document
Press Command(⌘) + F (on Mac) or Control + F (on Windows and Linux) or click on the search bar above your Extracted Text
Type what you search for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
This also counts the number of occurrences of your searched terms in this document:
If you ran email extraction and searched for one or several email addresses, and the email addresses are in the email's metadata (recipient, sender or another field), an 'in metadata' label is attached to these email addresses:
Tags & Details
In 'Tags & Details', you can read the document's details: all the metadata as it appears in the original file. Click 'Show more details' to get all the metadata:
You can also read the tags you previously wrote for this document, like 'test1', 'test2' and 'test3' in the example below:
Tag documents
You can tag documents, search for tagged documents and delete your tag(s).
Tag a document
Open the document by clicking on its title
Click the second tab 'Tags & Details'
Type your tag
Press 'Enter'
Tags can contain any character but cannot contain spaces.
Your new tag is now displayed on this page.
You can add several tags.
Search tags with tag filter
Open the second filter, called 'Tags'
You see the tags by frequency and the number of tagged documents
You can search using the search bar
You can select one or multiple tags
Search tags with main search bar
To find all your documents tagged with specific tag(s):
Type the tag(s) in the main search bar
Select 'Tags' in the field dropdown menu
Click 'Search' or press 'Enter'
The results are all the documents tagged with the tag(s) you typed in the search bar.
To find all your tagged documents, whatever the tags:
Type nothing in the search bar
Select 'Tags' in the field selector
Click 'Search'
The results are all the tagged documents.
Delete a tag
Click the cross at the end of the tag that you want to delete.
Recommend documents
Recommend a document
Open the document by clicking on its title
Click the button 'Mark as recommended':
Your recommendation is now displayed on this page and in the left 'Recommended by' filter.
Search recommended documents with the filter
Open the filter called 'Recommended by'
Unmark a document as recommended
Open the document and click on "Unmark as recommended".
Keyboard shortcuts
Shortcuts help you perform some actions faster.
Find these shortcuts in Datashare here:
It will open a window which recalls the shortcuts:
Go to the next / previous document
Windows / Linux: Control + → (next) / Control + ← (previous)
Mac: Command (⌘) + → (next) / Command (⌘) + ← (previous)
Find in document...
Windows / Linux: Control + F
Mac: Command (⌘) + F
... and go from one occurrence to the next / previous occurrence
Go to the next occurrence: Enter or F3
Go to the previous occurrence: Shift + Enter or Shift + F3
Navigate a document's tabs
Windows / Linux: Control (ctrl) + alt + ⇞ (page up) / Control (ctrl) + alt + ⇟ (page down)
Mac: Command (⌘) + option (⌥) + ↑ (arrow up) / Command (⌘) + option (⌥) + ↓ (arrow down)
Go back to search results
Once you opened a document, press Esc to go back to the search results.
Can I use Datashare with no internet connection?
You need an internet connection to install Datashare.
You also need the internet the first time you use any new NLP option to find people, organizations and locations in documents, because the models which find these named entities are downloaded at that moment. After that, you don't need an internet connection to find named entities.
You don't need internet connection:
to add documents to Datashare
to find named entities (except the first time you use an NLP option - see above)
to search and explore documents
to download documents
This allows you to work safely on your documents. No third-party should be able to intercept your data and files while you're working offline on your computer.
Can I use an external drive as data source?
Warning: this requires some technological knowledge.
You can make Datashare follow symbolic links ('soft links'): add --followSymlinks when Datashare is launched.
If you're on Mac or Windows, you must change the launch script.
If you're on Linux, you can add the option after the Datashare command.
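For instance, on Linux, assuming your launch command is datashare, a hypothetical invocation would be:
datashare --followSymlinks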
Create a Neo4j graph and explore it
This page explains how to leverage neo4j to explore your Datashare projects. We recommend using a recent release of Datashare (>= 14.0.0) to use this feature (click "Other platforms and versions" on the download page if needed).
The documents and entities graph
neo4j is a graph database technology which lets you represent your data as a graph. Inside Datashare, neo4j lets you connect entities to each other through the documents in which they appear.
After creating a graph from your Datashare project, you will be able to explore it and visualize these kinds of relationships between your project's entities:
In the above graph, we can see 3 email document nodes in orange, 3 email address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:
shapp@caiso.com emailed 20participants@caiso.com, the sent email has an id starting with f4db344...
one person named vincent is mentioned inside this email, as well as the california location
finally, the email also mentions the dle@caiso.com email address which is also mentioned in 2 other email documents (with id starting with 11df197... and 033b4a2...)
If you are not familiar with graph and neo4j, take a look at the following resources:
The neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.
The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...
Graph relationships
In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of email documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the email metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the email.
When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the email document's body.
When a document is embedded inside another document (as an email attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.
Create your Datashare project's graph
The creation of a neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:
After the graph is created, navigate to the 'Projects' page and select your project. You should be able to visualize a new neo4j widget displaying the number of documents and entities found inside the graph:
Access your project's graph
Depending on your access to the neo4j database behind Datashare, you might need to export the neo4j graph and import it locally to access it from visualization tools.
Exporting and importing the graph into your own DB is also useful when you want to perform write operations on your graph without any consequences on Datashare.
With read access to Datashare's neo4j database
If you have read access to the neo4j database (it should be the case if you are running Datashare on your computer), you will be able to plug visualization tools to it and start exploring.
Without read access to Datashare's neo4j database
If you can't have read access to the database, you will need to export it and import it into your own neo4j instance (running on your laptop for instance).
In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare. Be aware that limits might be applied to the size of the exported graph.
To export the graph, navigate to Datashare's 'Projects' page, select your project, select the 'Cypher shell' export format and click the 'Export graph' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using the 'File types' and 'Project directory' filters.
You will now be able to explore the graph imported in your own neo4j instance.
Explore and visualize entity links
Once your graph is created and you can access it (see this section if you can't access Datashare's neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.
Neo4j Bloom, available if you run the Neo4j Enterprise Edition, is a simple and powerful tool developed by neo4j to quickly visualize and query graphs. Bloom lets you navigate and explore the graph through a user interface similar to the one below:
Neo4j Bloom is accessible from inside Neo4j Desktop app.
Find out more about how to use Neo4j Bloom to explore your graph:
The Neo4j Browser lets you run Cypher queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs; running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:
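For illustration, here is a minimal Cypher sketch based on the node labels and relationship types described above (the property name mentionNorm is an assumption and may differ in your graph):

```cypher
// Hypothetical query: the 10 people mentioned in the most documents
MATCH (p:NamedEntity:PERSON)-[:APPEARS_IN]->(d:Document)
RETURN p.mentionNorm AS person, count(d) AS nbDocuments
ORDER BY nbDocuments DESC
LIMIT 10
```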
The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:
inside the Neo4j Desktop app when running neo4j from the Desktop app
Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare into the GraphML File Format and import them into Gephi.
To export the graph in the GraphML file format, navigate to the 'Projects', select your project, choose the 'Graph ML' export format and click the 'Export graph' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using the 'File types' and 'Project directory' filters.
You can download a document by going to it in Datashare. Click the download icon on the right of the screen or on the right of the document's title.
If you can't download a document, it means that Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's Github.
Can I remove document(s) from Datashare?
Yes, you can remove documents from Datashare. But at the moment, it will remove all your documents. You cannot remove only some documents.
Click the pink trash icon on the bottom left of Datashare:
And then click 'Yes':
You can then re-analyze a new corpus.
For advanced users only - if you'd like to do it with the Terminal, here are the instructions:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
How can we use Datashare in collaborative mode on a server?
You can use Datashare with multiple users accessing a centralized database on a server.
Warning: to put the server mode in place and to maintain it requires some technical knowledge.
Do you recommend OS or machines for large corpuses?
Datashare was created with scalability in mind, which has given ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
How can I contact ICIJ for help, bug reporting or suggestions?
You can send an email to datashare@icij.org.
When reporting a bug, please share:
your OS (Mac, Windows or Linux) and version
the problem, with screenshots if possible
the actions that led to the problem
The most complex operation is OCR (we use Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
Advanced users can post an issue with their logs on Datashare's GitHub:
Why can results from a simple search and a batch search be slightly different?
If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.
Why?
For technical reasons, Datashare processes the two queries in different ways:
a. Search bar (a simple search processed in the browser):
The search query is processed in your browser by Datashare's client, then sent to Elasticsearch through Datashare's server, which forwards the query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries
Each query is sent to Elasticsearch and the results are saved into a database
When the batch search is finished, Datashare sends you back the results stored in the database
Datashare's team tries to keep both results similar, but slight differences can happen between the two kinds of queries.
How to run Neo4j?
This page explains how to run a neo4j instance inside docker. For any additional information please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/)
Run neo4j inside docker
1. Enrich the services section of the docker-compose.yml from the 'Install with Docker' page with a neo4j service, as sketched below.
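A minimal sketch of such a service (the image version, password and volume path are assumptions; adapt them to your setup):

```yaml
neo4j:
  image: neo4j:4.4
  environment:
    - NEO4J_AUTH=neo4j/change-me   # user/password pair, change it
  ports:
    - "7474:7474"  # HTTP (Neo4j Browser)
    - "7687:7687"  # Bolt (drivers, cypher-shell)
  volumes:
    - ./neo4j-data:/data
```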
Pipelines of Natural Language Processing are tools that automatically identify named entities in your documents. You can only choose one at a time.
Select 'CoreNLP' if you want to use the one with the highest probability of working in most of your documents:
What should I do if I get more than 10,000 results?
In Datashare, for technical reasons, it is not possible to open documents beyond the 10,000th result.
Example: you search for "Paris" and get 15,634 results. You'd be able to open the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Refine your search: use filters to narrow down your results and ensure you have less than 10,000 matching documents
Change the sorting of your results: use 'creation date' or 'alphabetical order' for instance, instead of the default sorting, which corresponds to a relevance score
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
What is a named entity?
A named entity in Datashare is the name of an individual, an organization or a location.
Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically highlight named entities in your documents.
Install plugins and extensions
This page will help you locally add plugins and extensions to Datashare.
Plugins are small programs that you can add to Datashare's front-end to get new features (the front-end is the interface, "the part of the software with which the user interacts directly" - source).
Extensions are small programs that you can add to Datashare's back-end to add new features (the back-end is "the part of the software that is not directly accessed by the user, typically responsible for storing and manipulating data" - source).
Add plugins to Datashare (front-end)
Go to "Settings":
Click "Plugins":
Choose the plugin you want to add and click "Install now":
If you want to install a plugin from a URL, click "Install plugin from URL".
Your plugin is installed.
Refresh your page to see your new plugin activated in Datashare.
Add extensions to Datashare (back-end)
Go to "Settings":
Click "Extensions":
Choose the extension you want to add and click "Install now":
If you want to install an extension from a URL, click "Install extension from URL".
Your extension is installed.
Restart Datashare to see your new extension activated in Datashare.
Update plugin or extension with latest version
When a newer version of a plugin or extension is available, you can click on the "Update" button to get the latest version.
After that, if it is a plugin, refresh your page to activate the latest version.
If it is an extension, restart Datashare to activate the latest version.
Create your own plugin or extension
People who code can create their own plugins and extensions by following these steps:
In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox"(source: Elastic).
In Batch Searches
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox"
What is fuzziness?
As a search operator
In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
In Batch searches
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It is the same as explained above, it will apply to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
'Your search query is wrong.' What should I do?
This can be due to some syntax error(s) in the way you wrote your query.
Here are the most common errors that you should correct:
The query starts with or contains a forward slash (/)
You cannot start a query with a forward slash or type one that contains a forward slash. Forward slashes are reserved for regular expressions (Regex).
The query starts with or contains tilde (~)
You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.
Double quotes need to be straight in Datashare's search bar, not curly.
Straight double quotes: "example"
Curly double quotes: “example” (these are tilted)
This search works because double quotes are straight in the search bar:
This search doesn't work because double quotes are curly in the search bar:
What if Datashare says 'No documents found'?
If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, as no document may correspond to your current search. You can check your URL for active filters and, if you're comfortable with possibly losing your previously selected filters, click 'Reset filters'.
You may not have added documents to Datashare yet. To add documents, see: 'Add documents to Datashare' for Mac, Windows or Linux.
In 'Analyzed documents', if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you analyzed, it can take multiple hours.
List of common errors leading to "failure" in Batch Searches
SearchException: query='AND ada'
One or several of your queries contain syntax errors.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
Example of a syntax error message:
SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
elasticsearch: Name does not resolve
If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
Nothing works, everything crashes. What can I do?
If you are using the Docker version of Datashare (not the standard version) and Datashare crashes, please try to restart Docker Desktop.
On Mac:
Click the Docker Desktop icon on the top menu bar. The following drop-down menu appears:
Click 'Restart'.
As long as the icon's little dots move, Docker Desktop is still restarting.
Once the dots stop moving, either Datashare restarted automatically or you can restart it manually (see 'Open Datashare').
On Windows:
Right-click the Docker Desktop icon (a little whale) on the bottom menu bar.
Click 'Restart'.
Click 'Restart' again.
Wait for Docker Desktop to restart.
When it says 'Docker Desktop is running', either Datashare restarted automatically or you can restart Datashare manually (see 'Open Datashare').
On Linux, please execute: sudo service docker restart
If you see a progress of less than 100%, please wait.
If the progress is at 100% but the tasks are marked as failed, an error has occurred, which may have various causes. If you're an advanced user, you can create an issue on Datashare's GitHub with the application logs.
'You are not allowed to use Docker, you must be in the "docker-users" group'. What should I do?
This error happens on Windows.
Search and open 'Computer management':
Go to 'Local users and groups':
In 'Groups', double-click 'docker-users':
If you are not in 'docker-users', go to 'Users' in the left panel and add yourself to the 'docker-users' group by clicking your username and then 'Add...':
What do I do if Datashare opens a blank screen in my browser?
If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:
First wait 30 seconds and reload the page.
If the screen remains blank, restart Datashare following the instructions for Mac, Windows or Linux.
If you still see a blank screen, please uninstall and reinstall Datashare
To uninstall Datashare:
On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.
On Linux, please delete the 3 containers (Datashare, Redis and Elasticsearch) and the script.
To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.
I see people, organizations and locations in the filters but not in the documents
Datashare's filters keep the named entities (people, organizations and locations) previously recognized.
"Old" named entities stay in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later. It means that you removed the documents which contained the named entities after extracting them, you run new analysis, but the named entities stayed in the filters:
In the future, removing the documents from Datashare before indexing new ones will remove the named entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can click the little pink trash icon on the bottom of the left column:
What if a 'Preview' of my documents is 'not available'?
Datashare can display a 'Preview' for some document types only: images, PDF, CSV, XLSX and TIFF. Other document types are not supported yet.
What does 'Windows named pipe error' mean?
If you use Datashare with Docker (not the standard version) and a dark window called the Terminal displays a phrase beginning with "Windows named pipe error: The system cannot find the file specified", it means that Docker Desktop, one of the 3 components of Datashare, is not working. Relaunching Docker Desktop should solve the problem.
Find Docker Desktop in your Applications or the whale icon on the menu bar of your computer and click 'Restart'.
Datashare doesn't open. What should I do?
It can be due to previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them. To do so, open your Terminal and copy and paste the text below:
To allow external developers to add their own components, we added markers called "hooks" in strategic locations of the user interface where a user can define new Vue components through plugins.
Retrieves the list of batch searches for the user issuing the request, filtered with the given criteria, along with the total number of batch searches matching the criteria.
It needs a Query JSON body with the following parameters:
from : index offset of the first document to return (mandatory)
size : window size of the results (mandatory)
sort : field to sort by, one of prj_id, name, user_id, description, state, batch_date, batch_results, published (default "batch_date")
order : "asc" or "desc" (default "asc")
project : projects to include in the filter (default null / empty list)
batchDate : batch search with a creation date included in this range (default null / empty list)
state : states to include in the filter (default null / empty list)
publishState : publish state to filter (default null)
If from/size are not given, their default values are 0, meaning that all the results are returned. batchDate must be a list of 2 items (the first one for the starting date and the second one for the ending date). If defined, publishState is a string equal to "0" or "1".
Return 200 and the list of batch searches with the total batch searches for the query. See example for the JSON format.
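For illustration, a hypothetical Query body using the parameters listed above (values are placeholders):

```json
{
  "from": 0,
  "size": 25,
  "sort": "batch_date",
  "order": "desc",
  "project": ["apigen-datashare"],
  "state": ["SUCCESS"],
  "publishState": "1"
}
```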
Retrieves the batch search with the given id. The query param "withQueries" accepts a boolean value; when "withQueries" is set to false, the list of queries is empty and nbQueries contains the number of queries.
Returns 200, or 404 if there is no batch search with this id. If the user issuing the request is not the batch owner in the database, it will do nothing (thus returning 404).
Creates a new batch search. This is a multipart form with 8 fields: name, description, csvFile, published, fileTypes, paths, fuzziness, phrase_matches.
The order doesn't matter. The name and CSV file are mandatory, else it will return 400 (Bad Request). The CSV file must have fewer than 60,000 lines, else it will return 413 (Payload Too Large). Queries with less than two characters are filtered out.
To do so with bash, you can create a text file with your queries and send the form; see the sketch below.
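A sketch (the endpoint path, project name and field values are assumptions; adapt them to your instance):

```bash
# Create a CSV with one query per line
cat > queries.csv <<EOF
"Ada Lovelace"
Paris AND London
EOF

# Send the multipart form (name and csvFile are mandatory)
curl -XPOST localhost:8080/api/batch/search/apigen-datashare \
  -F name=my-batch-search \
  -F description='my description' \
  -F csvFile=@queries.csv \
  -F published=true \
  -F phrase_matches=false \
  -F fuzziness=0
```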
Delete batch searches and results for the current user.
Returns 204 (No Content): idempotent
Return 204
Example :
curl -XDELETE localhost:8080/api/batch/search
/api
Get /api/:project/documents/src/:id?routing=:routing
Returns the file from the index, given its index id and its root document id (if it is an embedded document).
The routing can be omitted if it is a top level document, or it can be the same as the id.
Returns 404 if it doesn't exist
Returns 403 if the user has no access to the requested index.
Parameter project
Parameter id
Parameter routing
Return 200 or 404 or 403 (Forbidden)
Example :
curl -i http://localhost:8080/api/apigen-datashare/documents/src/bd2ef02d39043cc5cd8c5050e81f6e73c608cafde339c9b7ed68b2919482e8dc7da92e33aea9cafec2419c97375f684f
HTTP/1.1 200 OK
Content-Disposition: attachment;filename="doc1.txt"
Access-Control-Allow-Origin: *
Content-Type: text/plain
ETag: 8f1cdb75be4a54bfc6bcfe8be5a2c6f4
Content-Length: 854
Connection: keep-alive
Set-Cookie: _ds_session_id={"login":null,"roles":null,"sessionId":null,"redirectAfterLogin":"/"}; version=1; path=/; expires=Mon, 30-Jul-2091 14:00:23 GMT; max-age=2147483647
This is an embedded doc test on behalf of our client 'Skype Ltd.'.
What is Skype Ltd. ?
Skype Ltd. is for doing things together, whenever you’re apart.
With Skype Ltd., you can share a story, celebrate a birthday, learn a language, hold a meeting, work with colleagues – just about anything you need to do together every day. You can use Skype Ltd. on whatever works best for you - on your phone or computer or a TV with Skype Ltd. on it. It is free to start using Skype Ltd. - to speak, see and instant message other people on Skype Ltd. for example. You can even try out group video, with the latest version of Skype Ltd.
Skype is heavily used by a lot of people around the world. For example Trump is using frequently skype app to discuss all political issues with his advisors.
Don't hesitate to contact us on contact@skype.com.
Get /api/:project/documents/content/:id?routing=:routing
Fetch extracted text by slice (pagination)
Parameter project Project id
Parameter id Document id
Parameter offset Starting byte (starts at 0)
Parameter limit Size of the extracted text slice in bytes
Parameter targetLanguage Target language (like "ENGLISH") to get slice from translated content
Return 200 and a JSON containing the extracted text content ("content":text), the max offset as last rank index ("maxOffset":number), start ("start":number) and size ("size":number) parameters.
If a body is provided, the body will be sent to ES as source=urlencoded(body)&source_content_type=application%2Fjson. In that case, request parameters are not taken into account.
When Datashare is launched in NER mode (without an index), it exposes a name-finding HTTP API. The text is sent in the HTTP body.
Parameter pipeline to use
Parameter text to analyse in the request body
Return list of NamedEntities annotations
Example :
curl -XPOST http://dsenv:8080/api/ner/findNames/CORENLP -d "Please find attached a PDF copy of the advance tax clearance obtained for our client John Doe."
/api
Get /api/:project/notes/:path:
Gets the list of notes for a project and a document path.
if we have on disk:
/a/b/doc1
/a/c/doc2
/d/doc3
And in database:

| projectId | path | note   | variant |
|-----------|------|--------|---------|
| p1        | a    | note A | info    |
| p1        | a/b  | note B | danger  |

then:
GET /api/p1/notes/a/b/doc1 will return notes A and B
GET /api/p1/notes/a/c/doc2 will return note A
GET /api/p1/notes/d/doc3 will return an empty list
If the user doesn't have access to the project she gets a 403 Forbidden
Parameter project the project the note belongs to
Parameter documentPath the document path
Parameter context HTTP context containing the user
Return list of Note that match the document path
Example:
curl localhost:8080/api/apigen-datashare/notes/path/to/note/for/doc.txt
[{"project":{"name":"apigen-datashare","sourcePath":"file:///vault/apigen-datashare","label":"apigen-datashare","description":null,"publisherName":null,"maintainerName":null,"logoUrl":null,"sourceUrl":null,"creationDate":"2023-07-12T10:46:18.152+00:00","updateDate":"2023-07-12T10:46:18.152+00:00"},"path":"path/to/note","note":"this is a note","variant":"info"}]
Get /api/:project/notes
Gets the list of notes for a project.
If the user doesn't have access to the project she gets a 403 Forbidden
Parameter project the project the note belongs to
Parameter context HTTP context containing the user
Return list of Note related to the project
Example:
curl localhost:8080/api/apigen-datashare/notes
[{"project":{"name":"apigen-datashare","sourcePath":"file:///vault/apigen-datashare","label":"apigen-datashare","description":null,"publisherName":null,"maintainerName":null,"logoUrl":null,"sourceUrl":null,"creationDate":"2023-07-12T10:46:18.168+00:00","updateDate":"2023-07-12T10:46:18.168+00:00"},"path":"path/to/note","note":"this is a note","variant":"info"}]
/api/openapi
Get /api/openapi
/api/plugins
Get /api/plugins
Gets the plugins set in JSON
If a request parameter "filter" is provided, the regular expression will be applied to the list.
see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for pattern syntax.
Example:
curl localhost:8080/api/plugins?filter=.*paginator
[{"deliverableFromRegistry":{"id":"datashare-plugin-text-paginator","name":"Text Paginator","version":"0.0.14","description":"A Datashare plugin to detect pages in text to display them nicely.","url":"https://github.com/ICIJ/datashare-plugin-text-paginator/releases/download/0.0.14/datashare-plugin-text-paginator-0.0.14.tgz","homepage":"https://github.com/ICIJ/datashare-plugin-text-paginator","type":"PLUGIN"},"installed":false,"version":"0.0.14","name":"Text Paginator","id":"datashare-plugin-text-paginator","type":"PLUGIN","description":"A Datashare plugin to detect pages in text to display them nicely."}]
Options /api/plugins/install
Preflight request
Return OPTIONS,PUT
Put /api/plugins/install
Download (if necessary) and install plugin specified by its id or url
request parameter id or url must be present.
Return 200 if the plugin is installed
Return 404 if the plugin is not found by the provided id or url
Gets the project information for the given project id.
Parameter id
Return 200 and the project from database if it exists
Example :
```
curl -H 'Content-Type:application/json' localhost:8080/api/project/apigen-datashare
{"error":"java.lang.NullPointerException"}
```

Get /api/project/isDownloadAllowed/:id
Returns whether the project is allowed with this network route: in the Datashare database, the project table can specify an IP mask that is allowed per project. If the client IP is not in the range, then the file download will be forbidden.
In that project table there is a field called allow_from_mask that can have a mask with IP and star wildcard.
Ex: 192.168.*.* will match all subnetwork 192.168.0.0 IPs, and only users with an IP in this range will be granted for downloading documents.
Retrieve the status of databus connection, database connection, shared queues and index. Adding the "format=openmetrics" parameter to the URL will return the status in OpenMetrics format.
Download files from a search query. Expected parameters are:
project: string
query: string or elasticsearch JSON query
If the query is a string, it is taken as an ES query string; else it is a raw JSON query (without the query part). @see org.elasticsearch.index.query.WrapperQueryBuilder, which is used to wrap the query
List all files and directories for the given path. This endpoint returns JSON using the same specification as the tree command on UNIX. It is roughly the equivalent of:
tree -L 1 -spJ --noreport /home/datashare/data
Parameter dirPath
Return 200 and the list of files and directories
Example $(curl -XGET localhost:8080/api/tree/home/datashare/data)
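The response is shaped like the JSON output of the tree command; an illustrative sketch (not the exact payload) could look like:

```json
[
  {
    "type": "directory",
    "name": "/home/datashare/data",
    "contents": [
      { "type": "directory", "name": "folder1", "size": 4096 },
      { "type": "file", "name": "doc.txt", "size": 854 }
    ]
  }
]
```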
## <a name="put__api_users_me_history"></a> Put /api/users/me/history
Add event to history. The event's type, the project ids and the uri are passed in the request body.
The project list related to the event is stored in database but is never queried (no filters on project)
It answers 200 when event is added or updated.
Parameter query
Return 200
Example :
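A hypothetical call (the JSON field names are assumptions based on the description above):
curl -XPUT localhost:8080/api/users/me/history \
  -H 'Content-Type: application/json' \
  -d '{"type": "SEARCH", "projectIds": ["apigen-datashare"], "uri": "/?q=foo"}'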
## <a name="delete__api_users_me_history?type=_type"></a> Delete /api/users/me/history?type=:type
Delete user history by type.
Returns 204 (No Content) : idempotent
Parameter type
Return 204
Example :
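A hypothetical call deleting all events of one type (the type value is an assumption):
curl -XDELETE 'localhost:8080/api/users/me/history?type=search'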
List the name ids of all projects this user has access to.
Kind: global variable
defaultProject ⇒ String
Get the name of the default project
Kind: global variable
findComponent(name) ⇒ Promise.<(object|null)>
Asynchronously find a component in the lazyComponents object by its name.
Kind: global function
Returns: Promise.<(object|null)> - - A promise that resolves with the found component object, or null if not found.
getComponent(name) ⇒ Promise.<(object|Error)>
Asynchronously get a component from the lazyComponents object based on its name.
Kind: global function
Returns: Promise.<(object|Error)> - - A promise that resolves with the found component object, or rejects with an Error if not found.
sameComponentNames(...names) ⇒ boolean
Check if multiple component names are the same when slugified.
Kind: global function
Returns: boolean - - True if all names are the same when slugified, false otherwise.
componentNameSlug(name) ⇒ string
Generate a slug from the component name using kebab case and lowercase.
Kind: global function
Returns: string - - The slugified component name.
lazyComponents() ⇒ Object
Get the lazyComponents object using require.context for lazy loading of components.
Kind: global function
Returns: Object - - The lazyComponents object generated using require.context.
defaultProjectExists() ⇒ Promise:Boolean
Return true if the default project exists
Kind: global function
findProject(name) ⇒ Object
Retrieve a project by its name
Kind: global function
Returns: Object - The project matching with this name
deleteProject(name) ⇒ Promise:Integer
Delete a project by its name identifier.
Kind: global function
Returns: Promise:Integer - Index of the project deleted or -1 if project does not exist
deleteProjectFromSearch(name)
Delete a project from the search store
Kind: global function
setProject(project) ⇒ Object
Update a project in the list or add it if it doesn't exist yet.
Kind: global function
Returns: Object - The project
Is this widget displayed as ?
Number of columns on which the widget should be displayed according to the
This documentation is intended to help you create plugins for Datashare client. All methods currently exposed in the class are available to a global variable called datashare.
Kind: global class
datashare.app
Kind: instance property of Core
datashare.core
Kind: instance property of Core
datashare.use(Plugin, options) ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useAll() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useI18n() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useBootstrapVue() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useRouter() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useVuex() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useCommons() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useWait() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.useCore() ⇒ Core
Kind: instance method of Core
Returns: the current instance of Core
datashare.dispatch(name, ...args) ⇒ Core
Kind: instance method of Core
Returns: Core - the current instance of Core
Core.init(...options) ⇒ Core
Kind: static method of Core
Retrieves the batch search with the given id. The query param "withQueries" accepts a boolean value. When "withQueries" is set to false, the list of queries is empty and nbQueries contains the number of queries.
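For illustration, a hedged sketch (the endpoint path and id are assumptions):

```javascript
// Hedged sketch: fetch a batch search without its queries; the
// `nbQueries` field still carries the number of queries.
const batch = await fetch('/api/batch/search/f74432db?withQueries=false')
  .then((response) => response.json())
```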
Search GET request to Elasticsearch. As it is a GET method, all paths are accepted.
If a body is provided, it is sent to Elasticsearch as source=urlencoded(body)&source_content_type=application%2Fjson. In that case, request parameters are not taken into account.
Path parameters
path (string, required): the Elasticsearch path
Responses: 200, 400 (no body)
cURL
curl -L \
  --url '/api/index/search/{path}'
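A hedged JavaScript sketch of the source-parameter behaviour described above (the index name local-datashare is an assumption):

```javascript
// Hedged sketch: send a JSON body through the GET passthrough by
// encoding it in the `source` query parameter.
const body = JSON.stringify({ query: { match_all: {} } })
const url = '/api/index/search/local-datashare/_search'
  + `?source=${encodeURIComponent(body)}`
  + '&source_content_type=' + encodeURIComponent('application/json')
const response = await fetch(url).then((r) => r.json())
```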
head
Head request, useful for the JavaScript API (for example, to test if an index exists).
Responses
cURL
curl -L \
  --request HEAD \
  --url '/api/index/search/{path}'
Returns 200 if the project is allowed with this network route: the Datashare database has a project table that can specify an allowed IP mask per project. If the client IP is not in the range, the file download is forbidden. The project table has a field called allow_from_mask that can hold a mask with IP and star wildcards.
For example, 192.168.*.* matches all IPs of the 192.168.0.0 subnetwork, and only users with an IP in that range are allowed.
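To illustrate the wildcard semantics (this is not Datashare's implementation, only the matching idea):

```javascript
// Illustrative only: check an IP against a mask like "192.168.*.*".
function ipMatchesMask(ip, mask) {
  const pattern = mask
    .split('.')
    .map((part) => (part === '*' ? '\\d{1,3}' : part))
    .join('\\.')
  return new RegExp(`^${pattern}$`).test(ip)
}

ipMatchesMask('192.168.1.42', '192.168.*.*') // true
ipMatchesMask('10.0.0.1', '192.168.*.*') // false
```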
Gets the public (i.e. without user information) Datashare settings parameters.
These parameters are used by the client app for the init process.
The endpoint removes all fields whose names contain Address, Secret, Url or Key.
Responses (*/*)
cURL
curl -L \
  --url '/settings'
200
{
  "ANY_ADDITIONAL_PROPERTY": "anything"
}
get
Gets the versions (front/back/docker) of Datashare.
Responses (*/*)
cURL
curl -L \
  --url '/version'
200
{
  "ANY_ADDITIONAL_PROPERTY": "text"
}
get
Retrieve the status of the databus connection, database connection and index.
Query parameters
format=openmetrics (string, optional): if provided in the URL, the status is returned in OpenMetrics format
Responses (*/*): 200, 503, 504
cURL
curl -L \
  --url '/api/status'
{
  "error": true,
  "success": true
}
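A hedged sketch of requesting the OpenMetrics format described above:

```javascript
// Hedged sketch: fetch the status as OpenMetrics text.
const metrics = await fetch('/api/status?format=openmetrics')
  .then((response) => response.text())
```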
get
Gets all the user tasks.
Filters can be added with name=value. For example, if name=foo is given in the request URL query, the tasks containing the term "foo" are returned. Filters can also contain dotted keys to match nested properties: for example, if args.dataDir=bar is provided, tasks with an argument "dataDir" containing "bar" are selected.
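A hedged sketch of the filtering described above (the endpoint path /api/task/all is an assumption):

```javascript
// Hedged sketch: select tasks whose name contains "foo" and whose
// "dataDir" argument contains "bar", using dotted-key matching.
const tasks = await fetch('/api/task/all?name=foo&args.dataDir=bar')
  .then((response) => response.json())
```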
Lists all files and directories for the given path. This endpoint returns JSON using the same specification as the UNIX tree command.
Add or update an event in the user's history. The event's type, the project ids and the uri are passed in the request body.
To update the event's name, the eventId is required to retrieve the corresponding event.
The project list related to the event is stored in the database but is never queried (no filters on project).
Query parameters
query (object, optional): the user history query to save
Responses
cURL
curl -L \
  --request PUT \
  --url '/api/users/me/history'
Retrieves the batch search list for the user issuing the request, filtered with the given criteria, together with the total number of batch searches matching the criteria.
If from/size are not given, their default values are 0, meaning that all the results are returned. batchDate must be a list of 2 items (the first one for the starting date and the second one for the ending date). If defined, publishState is a string equal to "0" or "1".
If the query is a string, it is taken as an ES query string; otherwise it is a raw JSON query (without the query part), see org.elasticsearch.index.query.WrapperQueryBuilder, which is used to wrap the query.
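A hedged sketch of the two accepted query forms (field names follow the description above):

```javascript
// Hedged sketch: the same filter expressed both ways.
// 1. As an Elasticsearch query string:
const asQueryString = { query: 'paradise AND papers' }
// 2. As a raw JSON query (without the top-level "query" wrapper),
//    which the server wraps using WrapperQueryBuilder:
const asRawJson = { query: '{"match": {"content": "paradise"}}' }
```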