This page lists all the concepts implemented by Datashare that users might want to understand before starting to search within documents.
👷♀️ This page is currently being written by Datashare team.
In local mode, Datashare is a self-contained application that users install and run on their own machines. It lets users search their documents locally, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
To report a bug, please open an issue on our GitHub and include your logs along with:
your Operating System (Mac, Windows or Linux)
the version of your Operating System
the version of Datashare
screenshots of your issue
a description of your issue.
If for confidentiality reasons you don't want to open an issue on GitHub, please write to datashare@icij.org and our team will do its best to answer you in a timely manner.
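As a convenience, the environment details requested above can be gathered with a short script. This is only a sketch: the function name `bug_report_header` is hypothetical, and the Datashare version is passed in by hand because the script does not detect it.

```python
import platform


def bug_report_header(datashare_version: str) -> str:
    """Return the environment details requested in a bug report as plain text."""
    lines = [
        # platform.system() reports "Darwin" on a Mac, otherwise "Windows" or "Linux".
        f"Operating system: {platform.system()}",
        f"Operating system version: {platform.release()}",
        # Supplied by the caller; this sketch does not query Datashare itself.
        f"Datashare version: {datashare_version}",
    ]
    return "\n".join(lines)


print(bug_report_header("X.Y.Z"))
```

Screenshots and the description of the issue still have to be added manually.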
Datashare allows you to search within your files, regardless of their format. It is free, open-source software developed by the International Consortium of Investigative Journalists (ICIJ).
Welcome to Datashare - self-hosted document search software. It is free and open-source, developed by the International Consortium of Investigative Journalists (ICIJ). Initially created to combine multiple named-entity recognition pipelines, this tool is now a fully-featured search interface to dig into your documents. With the help of several open source tools (Extract, Apache Tika, Tesseract, CoreNLP, OpenNLP, Elasticsearch, etc.), Datashare can be used on a single personal computer as well as on 100 interconnected servers.
Datashare is developed by the ICIJ, a collective of investigative journalists. Datashare is built on top of technologies and methods already tested with investigations like the Panama Papers or the Luanda Leaks. Seeing the growing interest in ICIJ's technology, we decided to open source this key component of our investigations so that a single journalist as well as big media organizations could use it on their own documents.
Curious to know more about how we use Datashare?
We set up a demo instance of Datashare with a small set of documents from the Luxleaks investigation (2014). When using this instance, you will be assigned a temporary user that can star, tag, recommend and explore documents.
Datashare was also built to run on a server. This is how we use it for our collaborative projects. Please refer to the server documentation to learn how it works.
When building Datashare, one of our first decisions was to use Elasticsearch to create an index of documents. It would be fair to describe Datashare as a nice-looking web interface for Elasticsearch. We want our search platform to be user-friendly while keeping all the powerful Elasticsearch features available for advanced users. This way we ensure that Datashare is usable by non tech-savvy reporters, but still robust enough to satisfy data analysts and developers who want to query the index directly with our API.
We implemented a plugin system to make customizing Datashare more accessible. Instead of modifying Datashare directly, you can isolate your code with a specific set of features and then configure Datashare to use it. Each Datashare user can pick the plugins they need or want, and have a fully customized installation of our search platform. Please have a look at the documentation.
This project is currently available in English, French, Spanish and Japanese. You can help us to improve and complete translations on Crowdin.
Find the application on your computer and run it locally in your browser.
Once Datashare is installed, go to "Finder", then "Applications", and double-click on "Datashare".
A Terminal window called 'Datashare.command' opens and describes the technical operations going on during the opening.
Keep this Terminal window open as long as you use Datashare.
It will help you set up and install Datashare on your computer.
This guide explains how to install Datashare on Mac. The installer checks that your system has all the dependencies needed to run Datashare. Because this software uses Tesseract (to perform Optical Character Recognition) and macOS doesn't support it out of the box, heavy dependencies must be downloaded. If your system has none of those dependencies, the first installation of Datashare can take up to 30 minutes.
The installer will set up:
Xcode Command Line Tools (if neither Xcode nor the Command Line Tools are installed)
MacPorts (if neither Homebrew nor MacPorts is installed)
Tesseract with MacPorts or Homebrew
Java JRE 17
Datashare executable
Note: previous versions of this document referred to a "Docker Installer". We do not provide this installer anymore, but Datashare is still supported with Docker.
Go to the Datashare website, scroll down and click 'Download for Mac'.
Go to your "Downloads" directory in Finder and double-click "datashare-X.Y.Z.pkg":
Click 'Continue', 'Install', enter your password and 'Install Software':
The installation begins and you see a progress bar. It stays on "Running package scripts" for a long time because it is installing Xcode Command Line Tools, MacPorts, Tesseract OCR, Java Runtime Environment and finally Datashare.
You can see what the installer is doing by pressing Command+L, which opens a window that logs every action.
In the end, you should see this screen:
Datashare provides a folder on your computer where you collect the documents you want to index.
Open your Mac's 'Finder' by clicking on the blue smiling icon in your Mac's 'Dock':
On the menu bar at the top of your computer, click 'Go'. Click on 'Home' (the house icon).
You will see a folder called 'Datashare':
If you want to quickly access it in the future, you can drag and drop it in 'Favorites' on the left of this window:
Copy or place the documents you want to have in Datashare in this Datashare folder.
Open your Applications. You should see Datashare. Double click on it:
Datashare opens in your default internet browser. Click 'Tasks':
Click the 3rd tab 'Analyze your documents':
Find the application on your computer and have it running locally in your browser.
Open the Windows main menu at the left of the bar at the bottom of your computer screen and click on 'Datashare'. (The numbers after 'Datashare' just indicate which version of Datashare you installed.)
A window called 'Terminal' opens, showing the progress of opening Datashare. Keep this black window open as long as you use Datashare.
Find the application on your computer and run it locally in your browser.
Start Datashare by launching it from the command-line:
Datashare should now automatically open in your default internet browser. If it doesn't, enter Datashare's local address in your browser manually. Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an Internet connection.
Datashare should now automatically open in your default internet browser. If it doesn't, enter Datashare's local address in your browser manually. Datashare must be accessed from your internet browser (Firefox, Chrome, etc.), even though it works offline without an Internet connection.
Datashare provides a folder on your computer where you collect the documents you want to index.
You can find a folder called 'Datashare' in your home directory.
Move the documents you want to add to Datashare into this folder.
Open Datashare to extract text and optionally find people, organizations and locations in your documents.
You can now analyze your documents.
This page explains how to start Datashare within a Docker container.
The Datashare platform relies on a combination of services. To streamline development and deployment, Datashare uses Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system.
To start Datashare within a Docker container, you can use this command:
Make sure the Datashare folder exists in your home directory, or this command will fail. This is only an example of how to use Datashare with Docker; data will not be persisted.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components such as the database (PostgreSQL), the search index (Elasticsearch), and the key-value store (Redis).
By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare with Docker Compose, you can use the following docker-compose.yml file:
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
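Because Datashare takes a few seconds to start, it can be handy to wait until the mapped port accepts connections before opening the browser. The following is a minimal sketch using only the Python standard library; the helper name `wait_for_port` is hypothetical, and it only checks that the TCP port is open, not that Datashare has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again.
            time.sleep(0.5)
    return False
```

With the default mapping described above, the call would be `wait_for_port("localhost", 8080)`.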
That's it! You should now have the Datashare service up and running, accessible through your web browser. Remember that the containers will continue to run until you explicitly stop them.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
This will stop and remove the containers, freeing up system resources.
It will help you set up the software on your computer.
Before we start, please uninstall any prior standard version of Datashare if you had already installed it. You can follow these steps: https://www.laptopmag.com/articles/uninstall-programs-windows-10
Go to datashare.icij.org, scroll down and click 'Download for free'.
The file "datashare-X.Y.Z.exe" is now downloaded. Double click on the name of the file in order to execute it.
As Datashare is not signed, this popup asks for your permission. Don't click 'Don't run' but click 'More info':
Click 'Run anyway':
It asks if you want to allow the app to make changes to your device. Click 'Yes':
On the Installer Wizard, as you need to download and install OpenJDK11 if it is not installed on your device, click 'Install':
The following windows with progress bars will be displayed:
Choose a language and click 'OK':
To install Tesseract OCR, click the following buttons on the Installer Wizard's windows:
Untick 'Show README' and click 'Finish':
Finally, click "Close" to close the installer of Tesseract OCR.
It now downloads the back end and the front end, Datashare.jar:
When it is finished, click 'Close':
You can now start Datashare!
It will help you index and have your documents in Datashare. This step is required in order to explore your documents.
1. To add your documents in Datashare, click 'Tasks' in the left menu:
2. Click 'Analyze your documents':
3. Click 'Add documents' so Datashare can extract the texts from your files:
You can:
Select the specific folder or sub-folder containing the documents you want to add.
Also extract text from images/PDFs (OCR). Be aware that indexing can take up to 10 times longer.
Select the language of your documents if you don't want Datashare to guess it automatically. Note: if you choose to also extract text from images (previous option), you might need to install the appropriate language package on your system. Datashare will tell you if the language package is missing. Refer to the documentation to learn how to install language packages.
Skip already indexed files.
Two extraction tasks are now running: the first is the scanning of your Datashare folder which sees if there are new documents to analyze (ScanTask). The second is the indexing of these files (IndexTask):
It is not possible to 'Find people, organizations and locations' while either of these two tasks is still running.
When tasks are done, you can start exploring documents by clicking 'Search' in the left menu but you won't have the named entities (names of people, organizations and locations) yet. To have these, follow the steps below.
1. After the text is extracted, you can launch named entities recognition by clicking the button 'Find people, organizations and locations'.
2. In the window below, you are asked to choose between finding Named Entities or finding email addresses (you cannot do both simultaneously, you need to do one after the other, no matter the order):
You can now see running tasks and their progress. After they are done, you can click 'Clear done tasks' to stop displaying tasks that are completed.
3. You can search your indexed documents without having to wait for all tasks to be done. To access your documents, click 'Search':
To extract email addresses in your documents:
Re-click on 'Find people, organizations, locations and email addresses' (in Tasks (left menu) > Analyze your documents)
Click the second radio button 'Find email addresses':
You can now search documents.
It will help you locally add plugins and extensions to Datashare.
Plugins are small programs that you can add to Datashare's front-end to get new features (the front-end is the interface, "the part of the software with which the user interacts directly").
Extensions are small programs that you can add to Datashare's back-end to add new features (the back-end is "the part of the software that is not directly accessed by the user, typically responsible for storing and manipulating data").
Go to "Settings":
Click "Plugins":
Choose the plugin you want to add and click "Install now":
If you want to install a plugin from a URL, click "Install plugin from URL".
Your plugin is installed.
Refresh your page to see your new plugin activated in Datashare.
Go to "Settings":
Click "Extensions":
Choose the extension you want to add and click "Install now":
If you want to install an extension from a URL, click "Install extension from URL".
Your extension is installed.
Restart Datashare to see your new extension activated in Datashare.
When a newer version of a plugin or extension is available, you can click on the "Update" button to get the latest version.
After that, if it is a plugin, refresh your page to activate the latest version.
If it is an extension, restart Datashare to activate the latest version.
People who code can create their own plugins and extensions by following these steps:
Datashare provides a folder on your computer where you collect the documents you want to index.
When you open your desktop, you will see a folder called 'Datashare Data'. Move or copy and paste the documents you want to add to Datashare to this folder:
Now open Datashare, which you will find in your main menu.
Once Datashare has opened, click on 'Analyze documents' on the top navigation bar in Datashare:
This page will help you set up the software on your computer.
Currently, only a .deb package for Debian/Ubuntu is provided.
If you want to run it on another Linux distribution, you can download the latest version of the Datashare jar and adapt the launch script to your environment.
Go to the Datashare website, scroll down and click 'Download .deb'.
Save the Debian package as a file
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your computer.
Follow the neo4j installation instructions to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0) to use this feature. If necessary, click on the 'Other platforms and versions' button when downloading to access other versions.
If it's not done yet, add documents to Datashare and extract names of people, organizations and locations as well as email addresses.
If your project contains email documents, make sure to also extract email addresses.
This page explains how to install language packages to support Optical Character Recognition (OCR) for more languages.
To perform OCR, Datashare uses an open source technology called Tesseract. When Tesseract extracts text from images, it uses "language packages" trained for each specific language. Unfortunately, those packages can be heavy, and to keep the installation of Datashare lightweight, the installer doesn't include them all by default. If Datashare informs you of a missing package, this guide explains how to manually install it on your system.
To add OCR languages on Linux, simply use the following command:
Where `[lang]` can be:
all, if you want to install all languages
a language code (e.g. fra for French)
The Datashare Installer for Mac checks for the existence of either MacPorts or Homebrew, the package managers used for the installation of Tesseract. If neither of those two package managers is present, the Datashare Installer will install MacPorts by default.
First, you must check that MacPorts is installed on your computer. Please run in a Terminal:
You should see an output similar to this:
If MacPorts is installed on your computer, you should be able to add the missing Tesseract language package with the following command (for German):
Once the installation is done, simply close and restart Datashare to be able to use the newly installed packages.
If Homebrew was already present on your system when Datashare was installed, Datashare used it to install Tesseract and its language packages. Because Homebrew doesn't package each Tesseract language individually, all languages are already supported by your system. In other words, you have nothing to do!
If you want to check if Homebrew is installed, run the following command in a Terminal:
You should see an output similar to this:
Additional languages can also be added during Tesseract's installation.
The list of installed languages can be checked from the Windows command prompt or PowerShell with the command tesseract --list-langs.
Datashare has to be restarted after the language installation.
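To check from a script whether a given language package is installed, the output of tesseract --list-langs can be parsed. The sketch below assumes the output consists of one header line followed by one language code per line; the function names are hypothetical.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header, not a language code.
    return lines[1:]


def installed_languages() -> list[str]:
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some Tesseract versions print the list on stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)


# Illustrative text, not captured from a real run.
sample = "List of available languages (2):\neng\nfra\n"
print("fra" in parse_list_langs(sample))
```

Calling `installed_languages()` requires tesseract to be available on your PATH.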
If you get a command not found: port error, this either means you are using Homebrew (see next section) or you did not install MacPorts yet.
If you get a command not found: brew error, this means Homebrew is not installed on your system. You can either use MacPorts (see previous section) or install Homebrew on your computer.
Language packages are available from the Tesseract project. Trained data files have to be downloaded and added to the tessdata folder in Tesseract's installation folder.
When running Datashare from the command-line, you can pick which "stage" to apply to analyse your documents.
The CLI stages are primarily intended to be run for an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.
This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.
The INDEX stage is probably the most important (and heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text and metadata and to OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that a given value can be pulled by only one client at a time. This allows users to distribute this command across several servers.
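The property that makes this distribution safe is that each queued entry is handed to exactly one consumer. The sketch below illustrates it with an in-process stand-in: Python's `queue.Queue` replaces the shared queue and threads replace servers. The worker names and file paths are hypothetical, and this is not Datashare's implementation.

```python
import queue
import threading


def drain_with_workers(paths: list[str], worker_count: int = 3) -> list[tuple[str, str]]:
    """Let several workers drain one shared queue; return (worker, path) pairs."""
    pending: queue.Queue = queue.Queue()
    for path in paths:
        pending.put(path)

    pulled: list[tuple[str, str]] = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            try:
                # Each queued path is handed to exactly one caller.
                path = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to index
            with lock:
                pulled.append((name, path))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pulled


pairs = drain_with_workers([f"doc-{i}.pdf" for i in range(10)])
print(len(pairs))
```

The sketch uses a non-blocking read so that workers stop once the queue is empty; a real blocking list would make idle consumers wait for new entries instead.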
Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process not only creates named entity mentions in Elasticsearch, it also marks every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized across several servers.
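To illustrate why marking documents makes the stage idempotent, here is a minimal sketch in which a second run finds nothing left to do. The field names `nlp_tags` and `entities` are hypothetical, and `extract_entities` is a trivial placeholder rather than a real NER pipeline.

```python
def extract_entities(text: str) -> list[str]:
    """Placeholder for a real NER pipeline: returns capitalized words."""
    return [word for word in text.split() if word[:1].isupper()]


def run_nlp_stage(documents: list[dict], pipeline: str = "CORENLP") -> int:
    """Analyze only documents not yet marked with `pipeline`; return the count."""
    analyzed = 0
    for doc in documents:
        tags = doc.setdefault("nlp_tags", [])  # hypothetical field name
        if pipeline in tags:
            continue  # already analyzed by this pipeline, so skip it
        doc.setdefault("entities", []).extend(extract_entities(doc["text"]))
        tags.append(pipeline)  # the mark that makes a re-run a no-op
        analyzed += 1
    return analyzed


docs = [{"text": "Alice met Bob"}, {"text": "nothing here"}]
print(run_nlp_stage(docs), run_nlp_stage(docs))
```

Because the mark is stored with each document, two servers running the same stage do not duplicate entities for a document that has already been marked.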
In server mode, Datashare operates as a centralized server-based system. Users access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
Datashare is launched with --mode SERVER and you have to provide:
the external Elasticsearch index address (elasticsearchAddress)
a Redis store address (redisAddress)
a Redis data bus address (messageBusAddress)
a database JDBC URL (dataSourceUrl)
the host of Datashare, used to generate batch search results URLs (rootHost)
an authentication mechanism and its parameters
Example:
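As a rough sketch in Python, the options listed above can be assembled into an argument list. Several things here are assumptions: the `--name value` spelling of each option is inferred by analogy with `--mode SERVER`, the executable name `datashare` and all addresses are hypothetical, and the authentication parameters are left out because they depend on the mechanism you choose.

```python
import shlex


def server_command(elasticsearch: str, redis: str, message_bus: str,
                   data_source_url: str, root_host: str) -> list[str]:
    """Build an argument list for launching Datashare in server mode."""
    return [
        "datashare",  # hypothetical executable name
        "--mode", "SERVER",
        "--elasticsearchAddress", elasticsearch,
        "--redisAddress", redis,
        "--messageBusAddress", message_bus,
        "--dataSourceUrl", data_source_url,
        "--rootHost", root_host,
    ]


# All values below are placeholders, not real endpoints.
command = server_command(
    elasticsearch="http://elasticsearch:9200",
    redis="redis://redis:6379",
    message_bus="redis://redis:6379",
    data_source_url="jdbc:postgresql://postgres/datashare",
    root_host="https://datashare.example.org",
)
print(shlex.join(command))
```

Building the command as a list keeps each value as a single argument, even when it contains characters a shell would otherwise interpret.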
This page explains how to start Datashare within a Docker container in server mode.
The Datashare platform relies on a combination of services. To streamline development and deployment, Datashare uses Docker containers. Docker provides a lightweight and efficient way to package and distribute software applications, making it easier to manage dependencies and ensure consistency across different environments.
Read more about how to install Docker on your system.
Within Datashare, Docker Compose can play a significant role in enabling the setup of separated and persistent services for essential components. By utilizing Docker Compose, you can define and manage multiple containers as part of a unified service. This allows for seamless orchestration and deployment of interconnected services, each serving a specific purpose within the Datashare ecosystem.
Specifically, Docker Compose allows you to configure and launch separate containers for PostgreSQL, Elasticsearch, and Redis. These containers can be set up in a way that ensures their data is persistently stored, meaning that any information or changes made to the database, search index, or key-value store will be retained even if the containers are restarted or redeployed.
This separation of services using Docker Compose provides several advantages. It enhances modularity, scalability, and maintainability within the Datashare platform. It allows for independent management and scaling of each service, facilitating efficient resource utilization and enabling seamless upgrades or replacements of individual components as needed.
To start Datashare in server mode with Docker Compose, you can use the following docker-compose.yml file:
Open a terminal or command prompt and navigate to the directory where you saved the docker-compose.yml file. Then run the following command to start the Datashare service:
The -d flag runs the containers in detached mode, allowing them to run in the background.
Docker Compose will pull the necessary Docker images (if not already present) and start the containers defined in the YAML file. Datashare will take a few seconds to start. You can check the progress of this operation with:
Once the containers are up and running, you can access the Datashare service by opening a web browser and entering http://localhost:8080. This assumes that the default port mapping of 8080:8080 is used for the Datashare container in the YAML file.
To stop the Datashare service and remove the containers, you can run the following command in the same directory where the docker-compose.yml file is located:
This will stop and remove the containers, freeing up system resources.
If you reach this point, Datashare is up and running, but you will quickly discover that no documents are available in the search results. Next step: Add documents from the CLI.
Datashare has the ability to detect email addresses and names of people, organizations and locations. You must perform the named entities extraction in the same fashion as the previous commands. Final step: Add named entities from the CLI.
Choosing the authentication mechanism is the most impactful decision when running Datashare in server mode. It can be one of the following:
basic authentication with credentials stored in database (PostgreSQL)
basic authentication with credentials stored in Redis
OAuth2 with credentials provided by an identity provider (KeyCloak for example)
dummy basic auth to accept any user (⚠️ if the service is exposed to internet, it will leak your documents)
Install the neo4j plugin following instructions available in the dedicated page.
1. Go to "Settings":
2. Make sure the following settings are properly set:
Neo4j Host should be localhost or the address where your neo4j instance is running
Neo4j Port should be the port where your neo4j instance is running (7687 by default)
Neo4j User should be set to your neo4j user name (neo4j by default)
Neo4j Password should only be set if your neo4j user is using password authentication
3. When running Neo4j Community Edition, set the Neo4j Single Project value. In Community Edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set Neo4j Single Project to the name of the project that will use the neo4j plugin. Other projects won't be able to use the neo4j plugin.
4. Restart Datashare to apply the changes
5. You should be able to see the neo4j widget in your project page. After a little while, its status should be RUNNING:
You can now create the graph!
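The settings from step 2 and the Community Edition rule from step 3 can be summarized in a small checker. This is an illustrative sketch only: the class and attribute names are hypothetical and do not correspond to Datashare's configuration keys.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Neo4jSettings:
    host: str = "localhost"
    port: int = 7687                      # default port mentioned above
    user: str = "neo4j"                   # default user mentioned above
    password: Optional[str] = None        # only needed with password authentication
    single_project: Optional[str] = None  # required with Community Edition

    def address(self) -> str:
        """Return the host:port pair where the neo4j instance is expected."""
        return f"{self.host}:{self.port}"

    def problems(self, community_edition: bool) -> list[str]:
        """List settings that would prevent the plugin from working."""
        found = []
        if not 0 < self.port < 65536:
            found.append("Neo4j Port must be a valid TCP port")
        if community_edition and not self.single_project:
            found.append("Neo4j Single Project must be set with Community Edition")
        return found


settings = Neo4jSettings()
print(settings.address(), settings.problems(community_edition=True))
```

Setting `single_project` to the name of one of your projects clears the reported problem.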
Basic authentication with Redis
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the Authorization header, with user:password base64 encoded:
It is secure as long as the communication to the server is encrypted (with SSL for example).
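For illustration, the header value described above can be produced with Python's standard library. The function name is hypothetical and the credentials are placeholders.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the Authorization header value for HTTP basic authentication."""
    # The user name and password are joined with a colon, then base64 encoded.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("Aladdin", "open sesame"))
# prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```

Base64 is an encoding, not encryption: anyone who sees the header can decode it, which is why the encrypted connection mentioned above matters.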
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store.
So you have to provision users. The passwords are sha256 hex encoded. For example, using bash:
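A rough Python equivalent of the encoding described above, assuming the password is hashed as UTF-8 text with no trailing newline (the function name is hypothetical):

```python
import hashlib


def sha256_hex(password: str) -> str:
    """Return the SHA-256 digest of the password as a hexadecimal string."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


print(sha256_hex("abc"))
# prints "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```

Be careful when comparing with a shell command: if the shell hashes a trailing newline along with the password (as a plain echo does), the digest will differ from the one computed here.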
Then insert the user like this in Redis:
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example, if you use myindex:
Then you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed by Redis:
This document assumes you have installed Datashare.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.
Here is a simple command to scan a directory and index its files:
What's happening here:
Datashare starts in "CLI" mode
We ask to process both the SCAN and INDEX stages at the same time
The SCAN stage feeds an in-memory queue with the files to add
The INDEX stage pulls files from the queue to add them to Elasticsearch
We tell Datashare to use the elasticsearch service
Files to add are located in /home/datashare/Datashare/, which is a directory mounted from the host machine
Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:
Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example we only display the first 20 files in the datashare:queue:
Once the indexing is done, Datashare will exit gracefully and your documents will be visible in Datashare.
Sometimes you have an existing index and want to index additional documents from your working directory without processing every document again. This can be done in two steps:
Scan the existing Elasticsearch index and gather the document paths into a report queue
Scan and index (with OCR) the documents in the directory; thanks to the report queue, paths already listed in it are skipped
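The idea behind the two steps can be sketched as a set difference: paths already reported as indexed are removed from what a new scan finds. This is a simplified illustration with a hypothetical function name, where a Python set stands in for the report queue.

```python
from pathlib import Path


def files_to_index(directory: Path, already_indexed: set[str]) -> list[Path]:
    """Return files under `directory` whose path is not in the report of indexed paths."""
    return sorted(
        path for path in directory.rglob("*")
        # Directories are ignored; only files absent from the report are kept.
        if path.is_file() and str(path) not in already_indexed
    )
```

Only the files returned by this function would then need text extraction and OCR, which is where most of the processing time goes.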
Basic authentication with a database.
Basic authentication is a simple protocol that uses the HTTP headers and the browser to authenticate users. User credentials are sent to the server in the Authorization header, with user:password base64 encoded:
It is secure as long as the communication to the server is encrypted (with SSL for example).
On the server side, you have to provide a database user inventory. You can launch datashare first with the full database URL, then datashare will automatically migrate your database schema. Datashare supports SQLite and PostgreSQL as back-end databases. SQLite is not recommended for a multi-user server because it cannot be multithreaded, so it will introduce contention on users' DB SQL requests.
Then you have to provision users. The passwords are sha256 hex encoded (for example with bash):
Then you can insert the user like this in your database:
If you use other indices, you'll have to include them in the group_by_applications, but local-datashare should remain. For example, if you use myindex:
Or you can use a COPY statement if you want to create them all at once.
Then when accessing Datashare, you should see this popup:
Here is an example of launching Datashare with Docker and the basic auth provider filter backed in database:
This document assumes you have installed Datashare and already .
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users and offer admins additional tools.
This is likely to be changed in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses, names of people, organizations and locations. This process uses a Natural Language Processing pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform the named entities extraction in the same fashion as the previous CLI stages:
What's happening here:
Datashare starts in "CLI"
We ask to process the NLP
We tell Datashare to use the elasticsearch
service
Datashare will pull documents from ElasticSearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
Datashare will use the output queue from the previous INDEX
stage (by default extract:queue:nlp
in Redis) that contains all the document ids to be analyzed.
The first time you run this command you will have to wait a little bit because Datashare needs to download CORENLP's models, which can be big.
You can also chain the 3 stages together:
The added ENQUEUEIDX
stage will read Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the ids of those documents into the extract:queue:nlp
queue.
This page explains how to set up neo4j, install the neo4j plugin and create a graph on your server.
Follow the instructions of the to get neo4j up and running.
We recommend using a recent release of Datashare (>= 14.0.0
) to use this feature. If necessary, click the 'Other platforms and versions' button when downloading to access other versions.
If it's not done yet, add entities to your project.
If your project contains email documents, make sure to run the EMAIL
pipeline together with the regular NLP pipeline. To do so, set the nlpp
flag to --nlpp CORENLP,EMAIL
.
You can now !
The INDEX can now be executed in the same container:
We made a small demo to show how it could be set up.
As for the previous you may want to restore the output queue from the INDEX
stage. You can do:
Install the neo4j plugin using the Datashare CLI so that users can access it from the frontend:
Installing the plugin installs the datashare-plugin-neo4j-graph-widget
plugin inside /home/datashare/plugings
and will also install the datashare-extension-neo4j
backend extension inside /home/datashare/extensions
. These locations can be changed by updating the docker-compose.yml
.
Update the docker-compose.yml
to reflect your neo4j docker service settings.
If you choose a different neo4j user or set a password for your neo4j user, make sure to also set DS_DOCKER_NEO4J_USER
and DS_DOCKER_NEO4J_PASSWORD
.
When running Neo4j Community Edition
, set the DS_DOCKER_NEO4J_SINGLE_PROJECT
value. In community edition, the neo4j DBMS is restricted to a single database. Since Datashare supports multiple projects, you must set the DS_DOCKER_NEO4J_SINGLE_PROJECT
with the name of the project which will use neo4j plugin. Other projects won't be able to use the neo4j plugin.
After installing the plugin, a restart might be needed for the plugin to display:
You can now create the graph!
To make your searches more precise, you can use operators in the main search bar.
To have all documents mentioning an exact phrase, you can use double quotes. Use straight double quotes ("example"), not curly double quotes (“example”).
Example: "Alicia Martinez’s bank account in Portugal"
To have all documents mentioning all or one of the queried terms, you can use a simple space between your queries or 'OR'. You need to write 'OR' with all letters uppercase.
Example: Alicia Martinez
Same search: Alicia OR Martinez
To have all documents mentioning all the queried terms, you can use 'AND' between your queried words. You need to write 'AND' with all letters uppercase.
Example: Alicia AND Martinez
Same search: +Alicia +Martinez
To have all documents NOT mentioning some queried terms, you can use 'NOT' before each word you don't want. You need to write 'NOT' with all letters uppercase.
Example: NOT Martinez
Same search: !Martinez
Same search: -Martinez
Parentheses should be used whenever multiple operators are used together and you want to give priority to some.
Example: ((Alicia AND Martinez) OR (Delaware AND Pekin) OR Grey) AND NOT "parking lot"
You can also combine these with regular expressions (Regex) written between two slashes (see below).
If you search faithf?l, the search engine will look for all words with any single character between the second f and the l in this word. It also works with * to replace multiple characters.
Example: Alicia Martin?z
Example: Alicia Mar*z
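To get a feel for what ? and * match, here is a small Python sketch using the standard `fnmatch` module, whose wildcards behave similarly (? stands for exactly one character, * for any run of characters). It only illustrates the semantics; it is not how the search engine evaluates queries, and the helper name `wildcard_match` is hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(term: str, pattern: str) -> bool:
    """Return True if term matches a pattern written with ? and * wildcards."""
    # Both sides are lowercased so the comparison ignores case.
    return fnmatchcase(term.lower(), pattern.lower())
```

With this helper, `wildcard_match("Martinez", "Martin?z")` and `wildcard_match("Marquez", "Mar*z")` are both true, while `wildcard_match("faithfl", "faithf?l")` is false because ? requires exactly one character.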
You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), you can use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
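The notion of edit distance used above can be sketched in Python. This is a minimal illustration (the "optimal string alignment" variant, in which swapping two adjacent characters counts as one operation), not the search engine's own code; the function name `edit_distance` is ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Count insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

For example, `edit_distance("kitten", "sitten")` is 1, `edit_distance("kitten", "sittin")` is 2, and `edit_distance("quikc", "quick")` is 1 because the last two letters are simply swapped.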
When you type an exact phrase (in double quotes) and use fuzziness, then the meaning of the fuzziness changes. Now, the fuzziness means the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
"the cat is blue" -> "the small cat is blue" (1 insertion = fuzziness is 1)
"the cat is blue" -> "the small is cat blue" (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox"
would be considered more relevant than "quick brown fox"
(source: Elastic).
Use the boost operator ^
to make one term more relevant than another. For instance, if we want to find all documents about foxes, but we are especially interested in quick foxes:
Example: quick^2 fox
The default boost value is 1, but can be any positive floating point number. Boosts between 0 and 1 reduce relevance. Boosts can also be applied to phrases or to groups:
Example: "john smith"^2 (foo bar)^4
(source: Elastic)
"A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern." (Wikipedia).
1. You can use Regex in Datashare. Regular expressions (Regex) in Datashare need to be written between 2 slashes.
Example: /.*\..*@.*\..*/
The example above searches for any expression structured like an email address, with a dot between two expressions before the @ and a dot between two expressions after the @, as in 'first.lastname@email.com'.
2. Regex can be combined with standard queries in Datashare:
Example: ("Ada Lovelace" OR "Ado Lavelace") AND paris AND /.*\..*@.*\..*/
3. You need to escape the following characters by typing a backslash just before them (without space): # @ & < > ~
Example: /.*\..*\@.*\..*/ (the @ was escaped by a backslash \ just before it)
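If you build Regex queries from literal text (for example a list of email addresses), the escaping rule above can be automated. The following Python sketch assumes the six characters listed above are the only ones to escape; the names `RESERVED` and `escape_reserved` are hypothetical.

```python
RESERVED = "#@&<>~"


def escape_reserved(text: str) -> str:
    """Prefix each reserved character in a literal text with a backslash."""
    return "".join("\\" + char if char in RESERVED else char for char in text)
```

Apply it only to text that is meant literally and is not escaped yet: running it on a pattern that uses < and > as a numeric interval, or on an already escaped string, would change its meaning.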
4. Important: Datashare relies on Elastic's Regex syntax as explained here. Datashare uses the Standard tokenizer. A consequence of this is that spaces cannot be searched as such in Regex.
We encourage you to use the AND operator to work around this limitation and make sure you can make your search.
If you're looking for a French International Bank Account Number (IBAN) that may or may not contain spaces and contains FR followed by numbers and/or letters (it could be FR7630001007941234567890185 or FR76 3000 4000 0312 3456 7890 H43 for example), you can then search for:
Example: /FR[0-9]{14}[0-9a-zA-Z]{11}/ OR (/FR[0-9]{2}.*/ AND /[0-9]{4}.*/ AND /[0-9a-zA-Z]{11}.*/)
Here are a few examples of useful Regex:
You can search for /Dimitr[iyu]/ instead of searching for Dimitri OR Dimitry OR Dimitru. It will find all the Dimitri, Dimitry or Dimitru.
You can search for /Dimitr[^yu]/ if you want to search all the words which begin with Dimitr and do not end with y or u.
You can search for /Dimitri<1-5>/ if you want to search Dimitri1, Dimitri2, Dimitri3, Dimitri4 or Dimitri5.
Other common Regex examples:
phone numbers with "-" and/or country code like +919367788755, 8989829304, +16308520397 or 786-307-3615 for instance: /[\+]?[(]?[0-9]{3}[)]?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}/
emails (simplified): /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+/
credit cards: /(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35[0-9]{3})[0-9]{11})/
You can find many other examples on this site. More generally, if you use a regex found on the internet, beware that the syntax is not necessarily compatible with Elasticsearch's. For example \d
, \S
and the like are not understood.
To find the list of existing metadata fields, go to a document's 'Tags and details' tab, click 'Show more details'.
When you hover over the lines, you see a magnifying glass on each line. Click on it and Datashare will look for this field. Here is the one for content language:
Here is the one for 'indexing date' (also called extraction date here) for instance:
So for example, if you are looking for documents that:
contain term1, term2 and term3
and were created after 2010
you can use the 'Date' filter or type in the search bar:
term1 AND term2 AND term3 AND metadata.tika_metadata_creation_date:>=2010-01-01
Explanations:
'metadata.tika_metadata_creation_date:' means that we filter with creation date
'>=' means 'since January 1st included'
'2010-01-01' stands for January 1st, 2010, and the search will include that day
For other searches:
'>' will mean 'strictly after (with January 1st excluded)'
nothing will mean 'at this exact date'
You can search for numbers in a range. Ranges can be specified for date, numeric or string fields among the ones you can find by clicking the magnifying glass when you hover over the fields in a document's 'Tags and Details' tab.
Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}. For more details, please refer to Elastic's page on ranges.
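As a minimal sketch, a range query string can be assembled like this in Python. The helper name `range_query` is hypothetical; the field name in the usage note is the creation date field shown earlier on this page.

```python
def range_query(field: str, low: str, high: str, inclusive: bool = True) -> str:
    """Build a 'field:[low TO high]' or 'field:{low TO high}' query string."""
    # Square brackets include both bounds, curly brackets exclude them.
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{low} TO {high}{closing}"
```

For example, `range_query("metadata.tika_metadata_creation_date", "2010-01-01", "2010-12-31")` returns `metadata.tika_metadata_creation_date:[2010-01-01 TO 2010-12-31]`, which can be pasted in the search bar and combined with other terms using AND.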
This page describes how to create your neo4j graph and keep it up to date with your server's Datashare projects.
The neo4j related features are added to the Datashare CLI through the extension mechanism. In order to run the extended CLI, the Java CLASSPATH
must be extended with the path of the datashare-extension-neo4j
jar. By default, this jar is located in /home/datashare/extensions
, so the CLI will be run as follows:
In order to create the graph, run the --fullImport
command for your project:
The CLI will display the import task progress and log import-related information.
When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.
To update the graph, you can just re-run the full export:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
To detect whether a graph update is needed, open the 'Projects' page and select your project:
compare the number of documents and entities found inside Datashare:
to the numbers found in the 'Graph statistics' and run an update in case of mismatch:
explore your graph using your favorite visualization tool
You can use several filters on the left of the main screen. Applied filters are shown at the top of the results column. You can also 'contextualize' and reset the filters.
On the left column, you can apply filters by ticking them, like 'Portable Document Format (PDF)' in File Types and 'English' in Languages in the example below:
A reminder of the currently applied filters, as well as your queried terms, is displayed at the top of the results column. You can easily unselect these filters from there by clicking them, or clear all of them:
The currently available filters are:
Projects: if you have more than one project, you can select several of them and run searches in multiple projects at once.
Starred: If you have starred documents, you can easily find them again.
Tags: If you wrote some tags, you will be able to select and search for them.
Recommended by: available only on server (collaborative) mode, this functionality helps you find the document recommended by you and/or others.
File type: This is the 'Content type' of the file (Word, PDF, JPEG image, etc.) as you can read it in a document's 'Tags & Details'.
Creation dates: the calendar allows you to select a single creation date or a date range. This is when the document was created, as recorded in its properties. You can find this in a document's 'Tags & Details'.
Languages: Datashare detects the main language of each document.
People / Organizations / Locations: you can select these named entities and search for them.
Path: This is the location of your documents as it is indicated in your original files (ex: desktop/importantdocuments/mypictures). You can find this in a document's 'Tags & Details'.
Indexing date: This date corresponds to when you indexed the documents in Datashare.
Extraction level: This regards embedded documents. The file on disk is level zero. If a document (pictures, etc) is attached or contained in a file on disk, extraction level is “1st”. If a document is attached or contained in a document itself contained in a file on disk, extraction level is “2nd”, etc.
Filters can be combined together and combined with searches in order to refine results.
If you have asked Datashare to 'Find people, organizations and locations', you can see names of individuals, organizations and locations in the filters. These are the named entities automatically detected by Datashare.
Search for named entities in the filter's search bar:
Select all of them, one or several of them to filter the documents that mention them:
If you want to select all items except one or several of them, you can use the 'Exclude button'.
It allows you to search for all documents which do not correspond to the filter(s) you selected, that is to say to the filters currently struck through.
In several filters, you can tick 'Contextualize' : this will update the number of documents indicated in the filters in order to reflect the results. The filter will only count what you selected.
In the example below, the 'Contextualize' checkboxes are not ticked:
After the Contextualize button in Tags filter is ticked:
After the Contextualize button in Languages filter is ticked:
To reset all filters at the same time, click 'Clear all':
You can sort documents by:
relevance (by default): it is a score calculated by the search engine
indexing date: when you analyzed the document, the day and time you 'put' them in Datashare
creation date: the day and time the document was created, as it is written in the document's metadata
size of the documents
path of the documents
You can also decide the number of documents displayed by page (10, 25, 50 or 100):
Improving the performance of Datashare involves several techniques and configurations to ensure efficient data processing. Extracting text from multiple file types and images is a heavy process, so be aware that even though we take care to get the best possible performance out of Apache Tika and Tesseract OCR, this process can be expensive. Below are some tips to enhance the speed and performance of your Datashare setup.
Execute the SCAN and INDEX stages independently to optimize resource allocation and efficiency.
Examples:
Distribute the INDEX stage across multiple servers to handle the workload efficiently. We often use multiple g4dn.8xlarge
instances (32 CPUs, 128 GB of memory) with a remote Redis and a remote ElasticSearch instance to alleviate processing load.
For projects like the Pandora Papers (2.94 TB), we ran the INDEX stage on up to 10 servers at the same time, which cost ICIJ several thousand dollars.
Datashare offers --parallelism
and --parserParallelism
options to enhance processing speed.
Example (for g4dn.8xlarge
with 32 CPUs):
ElasticSearch can consume significant CPU and memory, potentially becoming a bottleneck. For production instances of Datashare, we recommend deploying ElasticSearch on a remote server to improve performance.
You can fine-tune the JAVA_OPTS
environment variable based on your system's configuration to optimize Java Virtual Machine memory usage.
Example (for g4dn.8xlarge
with 120 GB Memory):
If the document language is known, explicitly setting it can save processing time.
Use --language
for general language setting (e.g., FRENCH
, ENGLISH
).
Use --ocrLanguage
for OCR tasks to specify the Tesseract model (e.g., fra
, eng
).
Example:
OCR tasks are resource-intensive. If not needed, disabling OCR can significantly improve processing speed. You can disable OCR with --ocr false
.
Example:
Large PST files or archives can hinder processing efficiency. We recommend extracting these files before processing them with Datashare. If there are too many of them, keep in mind that Datashare will be able to extract them anyway.
Example to split Outlook PST files into multiple .eml
files with readpst:
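As a hedged sketch, readpst can also be driven from Python. The function name `readpst_command` and the file names in the usage note are hypothetical; the `-e` option (one file per message, with an extension such as .eml) and the `-o` option (output directory) are taken from the readpst manual, so please check `man readpst` for the version installed on your machine.

```python
from pathlib import Path
from typing import List


def readpst_command(pst_file: Path, out_dir: Path) -> List[str]:
    """Build the readpst command line that splits a PST file into separate files."""
    # -e: write each message to its own file, with an extension (.eml, ...)
    # -o: directory where the extracted files are written (it must already exist)
    return ["readpst", "-e", "-o", str(out_dir), str(pst_file)]
```

The command can then be run with `subprocess.run(readpst_command(Path("mailbox.pst"), Path("extracted")), check=True)` once the output directory has been created.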
A batch search lets you get the results of every query in a list, all at once.
If you want to search a list of queries in Datashare, instead of doing each of them one by one, you can upload the list directly in Datashare. To do so, you will:
Create a list of terms that you want to search in the first column of a spreadsheet
Export the spreadsheet as a CSV (a special format available in any spreadsheet software)
Upload this CSV in the "new Batch Search" form in Datashare
Get the results for each query in Datashare - or in a CSV.
Write your queries, one per line and per cell, in the first column of a spreadsheet (Excel, Google Sheets, Numbers, Framacalc, etc.). In the example below, there are 4 queries:
Do not put line break(s) in any of your cells.
To delete line break(s) in your spreadsheet, you can use the "Find and replace all" functionality. Find all "\n" and replace them all by nothing or a space.
Write 2 characters minimum in the cells. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to 'failure'.
If you have blank cells in your spreadsheet...
...the CSV (which stands for 'Comma-separated values') will keep these blank cells. It will separate them with semicolons (the 'commas'). You will thus have semicolons in your batch search results (see screenshot below). To avoid that, you need to remove blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in "1,8 million" in our example above), the CSV will formally put the content of the cell in double quotes in your results and search for the exact phrase in double quotes.
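If your list of queries comes from a script rather than a spreadsheet, the rules above (no line breaks, no blank cells, at least 2 characters per cell) can be applied automatically. The Python sketch below is only an illustration; the function names `clean_queries` and `write_batch_csv` are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def clean_queries(raw_queries: Iterable[str]) -> List[str]:
    """Remove line breaks, blank cells and one-character queries."""
    cleaned = []
    for query in raw_queries:
        query = " ".join(query.split())  # line breaks and repeated spaces become one space
        if len(query) >= 2:              # blank and one-character cells are dropped
            cleaned.append(query)
    return cleaned


def write_batch_csv(queries: Iterable[str], destination: Path) -> None:
    """Write one query per line, in the first column, encoded in UTF-8."""
    with destination.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for query in clean_queries(queries):
            writer.writerow([query])
```

Note that the `csv` module puts double quotes around any cell that contains a comma, so such a query will be searched as an exact phrase, as described above.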
In the new Batch Search's form > Advanced Filters, you will be able to select some file types and some paths if you want to search only in some documents.
But you can also use fields directly in your queries in the CSV.
For instance, if you want to search only in some documents with certain tag(s), you can write your queries like this: "Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)".
The operators AND NOT * ? ! + - do work in batch searches (as they do in the regular search bar) but only if "Do phrase match" in Advanced filters is turned off.
Reserved characters, when misused, can lead to failures because of syntax errors.
Please also note that searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
Export your spreadsheet in a CSV format like this:
Important: Use the UTF-8 encoding.
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained here.
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
Other spreadsheet software: please refer to each software's user guide.
Open Datashare, click 'Batch searches' in the left menu and click 'New batch search' on the top right:
Type a name for your batch search:
Upload your CSV:
Add a description (optional):
Set the advanced filters ('Do phrase matches', 'Fuzziness' or 'Proximity searches', 'File types' and 'Path') according to your preferences:
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for as exact mentions in documents, as if Datashare added double quotes around each query.
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
Click 'Add'. Your batch search will appear in the table of batch searches.
Open your batch search by clicking its name:
You see your results and you can sort them by clicking a column's name. 'Rank' means the order in which the results of each query would be sorted if it were run in Datashare's main search bar. They are thus sorted by relevance score by default.
You can click on a document's name and it will open it in a new tab:
You can filter your results by query and read how many documents there are for each query:
You can search for specific queries:
You can also download your results in a CSV format:
If you add more and more files in Datashare, you might want to relaunch an existing batch search on your new documents too.
Notes:
In the server collaborative mode, you can only relaunch your own batch searches, not others'.
The relaunched batch search will apply to your whole corpus, newly indexed documents and previously indexed documents (not only the newly indexed ones).
To do so, open the batch search that you'd like to relaunch and click 'Relaunch':
Edit the name and the description of your batch search if needed:
You can choose to delete the current batch search after relaunching it:
Note: if you're worried about losing your previous results because of an error, we recommend keeping your current batch search (turn off this toggle button) and deleting it only after the relaunch is a success.
Click 'Submit':
You can see your relaunched batch search running in the batch search's list:
Failures in batch searches can be due to several causes.
Click the 'See error' button to open the error window:
The first query containing an error makes the batch search fail and stop.
Check this first failure-generating query in the error window:
In the case above, the slash (/) used between 'Heroin' and 'Opiates' is a reserved character that was not escaped by a backslash, so Datashare interpreted this query as a syntax error and the batch search stopped there.
We recommend removing the slash, as well as any other reserved characters, and re-running the batch search.
If you have a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
One of your queries can lead to a 'Data too large' error.
It means that this query returned too many results, or that its results include documents that were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the 'Data too large' error and re-starting your batch search without it.
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators (see below).
You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors. Click here to learn how to launch a batch search.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
They are more likely to happen when 'do phrase matches' toggle button is turned off:
When 'Do phrase matches' is on, syntax errors can still happen though:
Here are the most common errors:
You cannot start a query with AND all uppercase, neither in Datashare's main search bar nor in your CSV. AND is reserved as a search operator.
You cannot start a query with OR all uppercase, neither in Datashare's main search bar nor in your CSV. OR is reserved as a search operator.
You cannot type a query with only one double quote, neither in Datashare's main search bar nor in your CSV. Double quotes are reserved as a search operator.
You cannot start a query with tilde (~) or make one contain a tilde, neither in Datashare's main search bar nor in your CSV. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot start a query with caret (^) or make it contain a caret, neither in Datashare's main search bar nor in your CSV. Caret is reserved as a boosting operator.
You cannot start a query with slash (/) or make it contain a slash, neither in Datashare's main search bar nor in your CSV. Slash is a reserved character to open Regex ('regular expressions'). Note that you can use Regex in batch searches.
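Since Datashare only reports the first error, it can help to pre-check all queries before uploading the CSV. The Python sketch below flags the common cases listed above; it is a rough heuristic and not Datashare's own parser, and the function name `find_syntax_problems` is hypothetical. Intentional uses of ~, ^ or /regex/ are flagged too, so treat the output as a list of queries to review.

```python
import re
from typing import List


def find_syntax_problems(query: str) -> List[str]:
    """Return a list of likely syntax problems found in one query."""
    problems = []
    stripped = query.strip()
    if re.match(r"(AND|OR)\b", stripped):
        problems.append("starts with a reserved operator (AND/OR)")
    if stripped.count('"') % 2 == 1:
        problems.append("unbalanced double quote")
    for char in "~^/":
        # A character preceded by a backslash is considered escaped.
        if re.search(r"(?<!\\)" + re.escape(char), stripped):
            problems.append(f"contains unescaped reserved character {char}")
    return problems
```

For example, `find_syntax_problems("Heroin/Opiates")` returns one problem about the slash, while `find_syntax_problems("Alicia AND Martinez")` returns an empty list.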
Open your batch search and click the trash icon:
Then click 'Yes':
You can search with the main search bar, with operators, and also within a document thanks to control or command + F.
1. To see all your documents (you need to have added documents to Datashare and have analyzed them before), click 'Search in documents':
To gain room, you can collapse the left menu (if it is not collapsed yet) by clicking the 'hamburger menu':
2. Search for specific documents. Type terms in the search bar, press Enter or click 'Search':
IMPORTANT:
To make your searches more precise, you can search with operators (AND, OR, ....): read more here.
If you get a message "Your search query is wrong", it is probably because you are misusing one or some reserved characters (like ^ " ? ( [ * OR AND etc). Please refer to this page.
3. You can search in specific fields like tags, title, author, recipient, content, path or thread ID. Click 'All fields' and select your choice in the dropdown menu:
Select the view on the top right.
List:
Grid:
Table:
Once a document is opened, you can search for terms in this document:
Press Command (⌘) + F (on Mac) or Control + F (on Windows and Linux) or click on the search bar above your Extracted Text
Type what you search for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
(To know all the shortcuts in Datashare, please read 'Use keyboard shortcuts'.)
This also counts the number of occurrences of your searched terms in this document:
If you ran email extraction and searched for one or several email addresses, and these email addresses are in the email's metadata (recipient, sender or other field), there will be an 'in metadata' label attached to the email addresses:
Open the document by clicking on its title
Click the button 'Mark as recommended':
Your recommendation is now displayed on this page and in the left 'Recommended by' filter.
Open the filter entitled 'Recommended by'
Open the document and click on "Unmark as recommended".
Starring documents will help you easily find important documents again when needed.
Hover over the document's title in the main column. A button with the label 'Star document' appears.
Click it:
The starred document is now marked with a star icon:
Click again on its star icon. The document won't have a star anymore.
Open the 'Starred' filter on the left:
Tick 'Starred' (respectively 'Unstarred').
Only starred (respectively unstarred) documents will be displayed in the main column:
Once you opened a document, you can explore the document's data through different tabs.
In 'Extracted Text', you can read the text of a document as extracted by Datashare:
Please beware that Datashare shows named entities by default. This can overwrite some original text with wrong named entities. It is thus important to always verify the original text by deactivating named entity overwriting. To do so, please:
Turn off the toggle button ‘Show named entities’ and read the extracted text
Check the ‘Preview’ of original document if available
Check the original document at its original location or by clicking the pink button ‘Download’
If the document has attachments (technically called 'children documents'), find them at the end of the document. Click their pink button to open them:
To open all the attachments in Datashare, click 'See children documents' in Tags and Details:
Press Command(⌘) + F (on Mac) or Control + F (on Windows and Linux) or click on the search bar above your Extracted Text
Type what you search for
Press ENTER to go from one occurrence to the next one
Press SHIFT + ENTER to go from one occurrence to the previous one
(To know all the shortcuts in Datashare, please read 'Use keyboard shortcuts'.)
This also counts the number of occurrences of your searched terms in this document:
If you ran email extraction and searched for one or more email addresses, any address found in an email's metadata (recipient, sender or other field) is marked with an 'in metadata' label:
In 'Tags & Details', you can read the document's details. It's all the metadata as they appear in the original file. Please click 'Show more details' to get all metadata:
You can also read the tags you previously wrote for this document, like 'test1', 'test2' and 'test3' in the example below:
You can then search for the documents you tagged:
Type the tag(s) in the main search bar
Click 'All fields' and select 'Tags'
Click 'Search' or press 'Enter'
To learn more about tags, please read 'Tag a document'.
In 'Named Entities', you can read the name of people, organizations and locations as well as the number of their occurrences in the document:
Please beware that there can still be some errors due to the technology of Named Entity Extraction (NER) on which Datashare relies.
If you run email extraction, you will see a list of the extracted emails:
In 'Preview', you can read the original document.
'Preview' is available for some formats only.
This page explains how to leverage neo4j to explore your Datashare projects. We recommend using a recent release of Datashare (>= 14.0.0) to use this feature, click on the "Other platforms and version
neo4j is a graph database technology which lets you represent your data as a graph. Inside Datashare, neo4j lets you connect entities to one another through the documents in which they appear.
After creating a graph from your Datashare project, you will be able to explore it and visualize these kinds of relationships between your project entities:
In the above graph, we can see 3 email document nodes in orange, 3 email address nodes in red, 1 person node in green and 1 location node in yellow. Reading the relationship types on the arrows, we can deduce the following information from the graph:
shapp@caiso.com emailed 20participants@caiso.com, the sent email has an id starting with f4db344...
one person named vincent is mentioned inside this email, as well as the california location
finally, the email also mentions the dle@caiso.com email address which is also mentioned in 2 other email documents (with id starting with 11df197... and 033b4a2...)
If you are not familiar with graph and neo4j, take a look at the following resources:
Find out what is a graph database?
Learn neo4j fundamentals
The neo4j graph is composed of :Document nodes representing Datashare documents and :NamedEntity nodes representing entities mentioned in these documents.
The :NamedEntity nodes are additionally annotated with their entity types: :NamedEntity:PERSON, :NamedEntity:ORGANIZATION, :NamedEntity:LOCATION, :NamedEntity:EMAIL...
In most cases, an entity :APPEARS_IN a document, which means that it was detected in the document content. In the particular case of email documents and EMAIL addresses, it is most of the time possible to identify richer relationships from the email metadata, such as who sent (:SENT relationship) and who received (:RECEIVED relationship) the email.
When an :EMAIL address entity is neither :SENT nor :RECEIVED, as is the case in the above graph for dle@caiso.com, it means that the address was mentioned in the email document body.
When a document is embedded inside another document (as an email attachment for instance), the child document is connected to its parent through the :HAS_PARENT relationship.
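The relationship rules above can be sketched with a small in-memory edge list. This is only an illustration in Python, not how Datashare or neo4j store the graph: the `EDGES` list, the `doc-1` and `attachment-1` ids and the `body_only_mentions` helper are hypothetical names introduced for the example.

```python
from collections import defaultdict

# Hypothetical copy of a tiny graph as (source, relationship, target) triples.
# Relationship names follow the schema described above; ids are placeholders.
EDGES = [
    ("shapp@caiso.com", "SENT", "doc-1"),
    ("20participants@caiso.com", "RECEIVED", "doc-1"),
    ("dle@caiso.com", "APPEARS_IN", "doc-1"),
    ("vincent", "APPEARS_IN", "doc-1"),
    ("attachment-1", "HAS_PARENT", "doc-1"),
]


def body_only_mentions(edges, document):
    """Return entities that appear in `document` without having sent or received it."""
    relations = defaultdict(set)
    for source, relationship, target in edges:
        # HAS_PARENT links two documents, so it is not an entity mention.
        if target == document and relationship != "HAS_PARENT":
            relations[source].add(relationship)
    return sorted(
        entity
        for entity, kinds in relations.items()
        if "APPEARS_IN" in kinds and not kinds & {"SENT", "RECEIVED"}
    )


print(body_only_mentions(EDGES, "doc-1"))
```

With this sample data the helper keeps dle@caiso.com and vincent, and leaves out the sender and the recipient.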
The creation of a neo4j graph inside Datashare is supported through a plugin. To use the plugin to create a graph, follow these instructions:
when using Datashare on your computer
when Datashare is running on your server
After the graph is created, navigate to the 'Projects' page and select your project. You should be able to visualize a new neo4j widget displaying the number of documents and entities found inside the graph:
Depending on your access to the neo4j database behind Datashare, you might need to export the neo4j graph and import it locally to access it from visualization tools.
Exporting and importing the graph into your own DB is also useful when you want to perform write operations on your graph without any consequences on Datashare.
If you have read access to the neo4j database (it should be the case if you are running Datashare on your computer), you will be able to plug visualization tools to it and start exploring.
If you can't have read access to the database, you will need to export it and import it into your own neo4j instance (running on your laptop for instance).
If possible, ask your system administrator for a DB dump obtained using the neo4j-admin database dump command.
In case you don't have access to the DB and can't be provided with a dump, you can export the graph from inside Datashare. Be aware that limits might be applied on the size of the exported graph.
To export the graph, navigate to Datashare's 'Projects' page, select your project, select the 'Cypher shell' export format and click the 'Export graph' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using the 'File types' and 'Project directory' filters.
DB import
Depending on how you run neo4j on your laptop use one of the following ways to import your graph into your DB:
Docker
identify your neo4j instance container ID:
copy the graph dump inside your neo4j container import directory:
import the dumped file using the cypher-shell command:
Neo4j Desktop import
open 'Cypher shell':
copy the graph dump inside your neo4j instance import directory:
import the dumped file using the cypher-shell command:
You will now be able to explore the graph imported in your own neo4j instance.
Once your graph is created and you can access it (see this section if you can't access Datashare's neo4j instance), you will be able to use your favorite tool to extract meaningful information from it.
Once you can access your neo4j database, you can use different tools to visualize and explore it. You can start by connecting Neo4j Desktop to your DB.
Neo4j Bloom is a simple and powerful tool developed by neo4j to quickly visualize and query graphs. It is available if you run Neo4j Enterprise Edition. Bloom lets you navigate and explore the graph through a user interface similar to the one below:
Neo4j Bloom is accessible from inside Neo4j Desktop app.
Find out more about how to use Neo4j Bloom to explore your graph with:
Bloom's User Guide
Bloom's Quick Start
this series of videos about graph exploration with Bloom
The Neo4j Browser lets you run Cypher queries on your graph to explore it and retrieve information from it. Cypher is like SQL for graphs. Running Cypher queries inside the Neo4j Browser lets you explore the results as shown below:
The Neo4j Browser is available for both Enterprise and Community distributions. You can access it:
inside the Neo4j Desktop app when running neo4j from the Desktop app
at http://localhost:7474/browser/ when running neo4j inside Docker
Linkurious is a proprietary software which, similarly to Neo4j Bloom, lets you visualize and query your graph through a powerful UI.
Find out more information about Linkurious:
Gephi is a simple open-source visualization software. It is possible to export graphs from Datashare into the GraphML File Format and import them into Gephi.
Find out more information about:
Gephi features
how to get started with Gephi
To export the graph in the GraphML file format, navigate to the 'Projects' page, select your project, choose the 'Graph ML' export format and click the 'Export graph' button:
In case you want to restrict the size of the exported graph, you can restrict the export to a subset of documents and their entities using the 'File types' and 'Project directory' filters.
You will now be able to visualize the graph using Gephi by opening the exported GraphML file in it.
You can tag documents, search for tagged documents and delete your tag(s).
Open the document by clicking on its title
Click the second tab 'Tags & Details'
Type your tag
Press 'Enter'
Tags can contain any character except spaces.
Your new tag is now displayed on this page.
You can add several tags.
Open the second filter titled 'Tags'
You see the tags by frequency and the number of tagged documents
You can search using the search bar
You can select one or multiple tags
To find all your documents tagged with specific tag(s):
Type the tag(s) in the main search bar
Select 'Tags' in the field dropdown menu
Click 'Search' or press 'Enter'
The results are all the documents tagged with the tag(s) you typed in the search bar.
To find all your tagged documents, whatever the tags:
Type nothing in the search bar
Select 'Tags' in the field selector
Click 'Search'
The results are all the tagged documents.
Click the cross at the end of the tag that you want to delete.
Shortcuts help do some actions faster.
It will open a window which recalls the shortcuts:
Windows / Linux
Control + →
Control + ←
Mac
Command (⌘) + →
Command (⌘) + ←
Windows / Linux
Control + F
Mac
Command (⌘) + F
... and go from one occurrence to the next / previous occurrence
Go to next occurrence
Enter
or
F3
Go to previous occurrence
Shift + Enter
or
Shift + F3
Windows / Linux
Control (ctrl) + alt + ⇞ (pageup)
Control (ctrl) + alt + ⇟ (pagedown)
Mac
Command (⌘) + option (⌥) + ↑ (arrow up)
Command (⌘) + option (⌥) + ↓ (arrow down)
Once you opened a document, go back to search results:
Esc
You can download a document by opening it in Datashare. Click on the download icon on the right of the screen, next to the document's title.
If you can't download a document, it means that Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's GitHub.
This page explains how to run a neo4j instance inside docker. For any additional information please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/)
1. enrich the services section of the docker-compose.yml of the install with Docker page, with the following neo4j service:
make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').
2. enrich the volumes section of the docker-compose.yml of the install with Docker page, with the following neo4j volumes:
3. Start the neo4j service using:
install with Neo4j Desktop, follow installation instructions found here
create a new local DBMS and save your password for later
if the installer notifies you of any ports modification, check the DBMS settings and save the server.bolt.listen_address for later
make sure to install the APOC Plugin
Additional options to install neo4j are listed here.
If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.
Why?
For technical reasons, Datashare processes both queries in 2 different ways:
a. Search bar (a simple search processed in the browser):
The search query sent to Elasticsearch is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare server which forwards your query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries
Each query is sent to Elasticsearch. The results are saved into a database
When the batch search is finished, you get the results from Datashare
Datashare sends back the results stored in the database.
Datashare's team tries to keep both results similar, but slight differences can occur between the two queries.
1. Go to Applications
2. Right-click on 'Datashare' and click 'Move to Bin'
Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98
Use the following command:
sudo apt remove datashare-dist
In Datashare, for technical reasons, it is not possible to open the 10,000th document.
Example: you search for "Paris" and get 15,634 results. You will be able to see the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Use filters to narrow down your results and ensure you have fewer than 10,000 matching documents
Change the sort order: use 'creation date' or 'alphabetical order' for instance, instead of the default sorting, which corresponds to a relevance scoring
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
This can be due to some syntax error(s) in the way you wrote your query.
Here are the most common errors that you should correct:
You cannot start a query with AND all uppercase. AND is reserved as a search operator.
You cannot start a query with OR all uppercase. OR is reserved as a search operator.
You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrase.
You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.
You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).
You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.
You cannot start a query with caret (^) or write one which contains caret. Caret is reserved as a boosting operator.
You cannot use square brackets except for searching for ranges.
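Some of the rules above can be checked before a query is submitted. The following Python sketch covers only a subset of them (leading AND/OR, unpaired double quotes, parentheses and forward slashes, and a trailing !). It is not the parser used by Datashare or Elasticsearch, and the `find_syntax_errors` name is hypothetical.

```python
def find_syntax_errors(query):
    """Return a list of human-readable problems found in `query`.

    Only a subset of the rules listed above is covered; an empty list
    does not guarantee that the search engine will accept the query.
    """
    errors = []
    stripped = query.strip()
    first_word = stripped.split()[0] if stripped else ""
    if first_word in ("AND", "OR"):
        errors.append(f"query starts with the operator {first_word}")
    if stripped.count('"') % 2 == 1:
        errors.append("unpaired double quote")
    if stripped.count("(") != stripped.count(")"):
        errors.append("unpaired parenthesis")
    if stripped.count("/") % 2 == 1:
        errors.append("unpaired forward slash")
    if stripped.endswith("!"):
        errors.append("query ends with an exclamation mark")
    return errors


for candidate in ["AND ada", '"ada lovelace', "ada AND lovelace"]:
    print(candidate, "->", find_syntax_errors(candidate))
```

Running such a check over every query before a search makes it easier to spot several mistakes at once.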
In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
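The number of operations between two terms can be computed by hand or with a short function. The Python sketch below implements the optimal string alignment variant of the Damerau-Levenshtein distance, which counts the four operations named above. It is meant to illustrate what the fuzziness number measures, not to reproduce the search engine's own implementation, and `edit_distance` is a name chosen for this example.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two strings.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(edit_distance("kitten", "sitten"))  # 1
print(edit_distance("kitten", "sittin"))  # 2
print(edit_distance("quikc", "quick"))    # 1
```

A term matches a fuzzy query when its distance to the searched word is less than or equal to the chosen fuzziness.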
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It works as explained above: it applies to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
Datashare runs in different modes, each with its own specificities.
LOCAL (Web): To run Datashare on a single computer for a single user.
SERVER (Web): To run Datashare on a server for multiple users.
CLI (CLI)
TASK_RUNNER (Daemon)
Those two modes are the only one who create
In local mode and embedded mode, Datashare provides a self-contained software application that users can install and run on their own local machines. The software allows users to search their documents within their own local environments, without relying on external servers or cloud infrastructure. This mode offers enhanced data privacy and control, as the datasets and analysis remain entirely within the user's own infrastructure.
In server mode, Datashare operates as a centralized server-based system. Users can access the platform through a web interface, and the documents are stored and processed on Datashare's servers. This mode offers the advantage of easy accessibility from anywhere with an internet connection, as users can log in to the platform remotely. It also facilitates seamless collaboration among users, as all the documents and analysis are centralized.
The running modes offer advantages and limitations. This matrix summarizes the differences:

| | local | server |
|---|---|---|
| Multi-users | ❌ | ✅ |
| Multi-projects | ❌ | ✅ |
| Access-control | ❌ | ✅ |
| Indexing UI | ✅ | ❌ |
| Plugins UI | ✅ | ❌ |
| Extension UI | ✅ | ❌ |
| HTTP API | ✅ | ✅ |
| API Key | ✅ | ✅ |
| Single JVM | ✅ | ❌ |
| Tasks execution | ✅ | ❌ |
When running Datashare in local mode, users can choose to use embedded services (like Elasticsearch, SQLite, in-memory key/value store) on the same JVM as Datashare. This variant of the local mode is called "embedded mode" and allows users to run Datashare without having to set up any additional software. The embedded mode is used by default.
In CLI mode, Datashare starts without a web server and allows users to perform tasks on their documents. This mode can be used in conjunction with both local and server modes, while allowing users to distribute heavy tasks between several servers.
If you want to learn more about which tasks you can execute in this mode, check out the stages documentation.
Those modes are intended for actions that require waiting for pending tasks.
In batch download mode, the daemon waits for a user to request a batch download of documents. When a request is received, the daemon starts a task to download the documents matching the user's search and bundles them into a zip file.
In batch search mode, the daemon waits for a user to request a batch search of documents. To create a batch search, users must go through the dedicated form in Datashare, where they can upload a list of search terms (in CSV format). The daemon will then start a task to search for all matching documents and store every occurrence in the database.
Datashare is shipped as a single executable, with all modes available. As previously mentioned, the default mode is the embedded mode. Yet when starting Datashare from the command line, you can explicitly specify the running mode. For instance on Ubuntu/Debian:
To index documents and analyze them directly .
To execute async tasks (, batch downloads, scan, index, NER extraction
A named entity in Datashare is the name of an individual, an organization or a location.
Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically highlight named entities in your documents.
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators: read the list of syntax errors by clicking here.
You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors. Click here to learn how to launch a batch search.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
Example of a syntax error message:
SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
Example of a message regarding a problem with ElasticSearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
Double quotes need to be straight in Datashare's search bar, not curly.
Straight double quotes: "example"
Curly double quotes: “example” (these are tilted)
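Curly quotes often come from queries pasted out of a word processor. As a minimal Python sketch (the `straighten_quotes` helper is a hypothetical name, not part of Datashare), they can be converted to straight quotes before pasting a query into the search bar or a batch search CSV:

```python
# Map the left and right curly double quotes to the straight double quote.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def straighten_quotes(query):
    """Return `query` with curly double quotes replaced by straight ones."""
    return query.translate(CURLY_TO_STRAIGHT)


print(straighten_quotes("\u201cexample\u201d"))
```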
This search works because double quotes are straight in the search bar:
This search doesn't work because double quotes are curly in the search bar:
This page describes how to create your neo4j graph and keep it up to date with your computer's Datashare projects.
Open the 'Projects' page and select your project:
Create the graph by clicking on the 'Create graph' button inside the neo4j widget:
You will see a new import task running:
When the graph creation is complete, 'Graph statistics' will reflect the number of document and entity nodes found in the graph:
When new documents or entities are added or modified inside Datashare, you will need to update the neo4j graph to reflect these changes.
To update the graph click on the 'Update graph' button inside the neo4j widget:
To detect whether a graph update is needed you can compare the number of documents found inside Datashare to the number found in the 'Graph statistics' and run an update in case of mismatch:
The update will always add missing nodes and relationships, update existing ones if they were modified, but will never delete graph nodes or relationships.
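The update behaviour described above (add, refresh, never delete) can be pictured with plain dictionaries. This Python sketch only illustrates the semantics under that assumption. It says nothing about how the plugin is actually implemented, and `update_graph` is a hypothetical name.

```python
def update_graph(graph, source):
    """Add or refresh nodes from `source`; never delete anything from `graph`.

    Both arguments map a node id to a dict of properties.
    Returns the ids that were added and the ids that were updated.
    """
    added, updated = [], []
    for node_id, properties in source.items():
        if node_id not in graph:
            added.append(node_id)
        elif graph[node_id] != properties:
            updated.append(node_id)
        else:
            continue  # already up to date
        graph[node_id] = dict(properties)
    return added, updated


graph = {"a": {"name": "old"}, "gone": {"name": "x"}}
source = {"a": {"name": "new"}, "b": {"name": "b"}}
print(update_graph(graph, source))
print(sorted(graph))
```

One consequence of this behaviour is that a node whose document no longer exists in the source (here "gone") stays in the graph after the update.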
explore your graph using your favorite visualization tool
If you are using the Docker version of Datashare (not the standard version) and Datashare crashes, please try to restart Docker Desktop.
On Mac:
Click the Docker Desktop icon on the top menu bar. The following drop-down menu appears:
Click 'Restart'.
As long as the icon's little dots are moving, Docker Desktop is still restarting.
Once these dots have stopped moving, either Datashare restarted automatically or you can restart Datashare manually (see 'Open Datashare').
On Windows:
Right-click the Docker Desktop icon (a little whale) on the bottom menu bar.
Click 'Restart'.
Click 'Restart' again.
Wait for Docker Desktop to restart.
When it says 'Docker Desktop is running', either Datashare restarted automatically or you can restart Datashare manually (see 'Open Datashare').
On Linux, please execute: sudo service docker restart
In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elastic).
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
You need an internet connection to install Datashare.
You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models which find these named entities are downloaded on first use. After that, you don't need an internet connection to find named entities.
You don't need an internet connection:
to add documents to Datashare
to find named entities (except for the first time you use an NLP option - see above)
to search and explore documents
to download documents
This allows you to work safely on your documents. No third-party should be able to intercept your data and files while you're working offline on your computer.
Warning: this requires some technological knowledge.
You can make Datashare follow soft links: add --followSymlinks when Datashare is launched.
If you're on Mac or Windows, you must change the launch script.
If you're on Linux, you can add the option after the Datashare command.
Yes, you can remove documents from Datashare. But at the moment, it will remove all your documents. You cannot remove only some documents.
Click the pink trash icon on the bottom left of Datashare:
And then click 'Yes':
You can then re-analyze a new corpus.
For advanced users only - if you'd like to do it with the Terminal, here are the instructions:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
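To check where the index lives before removing anything, the three locations above can be gathered in one helper. This is a Python sketch that only builds the path and deletes nothing. The `index_directory` function is a hypothetical name, and the paths are copied from the commands above.

```python
import ntpath
import os
import platform
import posixpath


def index_directory(system, home, appdata=None):
    """Return the Datashare index directory for the given operating system.

    `system` uses the values returned by platform.system().
    """
    if system == "Darwin":  # Mac
        return posixpath.join(home, "Library", "Datashare", "index")
    if system == "Windows":
        if appdata is None:
            raise ValueError("the APPDATA folder is required on Windows")
        return ntpath.join(appdata, "Datashare", "index")
    if system == "Linux":
        return posixpath.join(home, ".local", "share", "datashare", "index")
    raise ValueError(f"unsupported system: {system}")


print(index_directory(platform.system(), os.path.expanduser("~"), os.environ.get("APPDATA")))
```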
You started tasks, and they are running as you can see on 'http://localhost:8080/#/indexing' but they are not completing.
There are two possible causes:
If you see a progress of less than 100%, please wait.
If the progress is 100%, an error has occurred, and the tasks failed to complete, which may be caused by various reasons. If you're an advanced user, you can create an issue on Datashare Github with the application logs.
Datashare was created with scalability in mind which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
To fix the issue:
Stop Datashare. If Datashare is running, close the Terminal window (the window that opens when you start Datashare):
Click 'Terminate':
Open your Terminal (or a new window in your Terminal) and copy and paste:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
Press Enter
Restart Datashare (here are the instructions for Mac, for Windows and for Linux)
Index documents again: go to 'Analyse your documents' and click 'Extract text':
Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:
Please find all the use cases in Datashare Tarentula's .
It can be due to previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them. To do so, you can open your Terminal, copy and paste the text below:
On Mac
On Linux
On Windows
Press Enter. Open Datashare again.
You can use Datashare with multiple users accessing a centralized database on a server.
Warning: setting up and maintaining the server mode requires some technical knowledge.
Please find the documentation here.
Pipelines of Natural Language Processing are tools that automatically identify named entities in your documents. You can only choose one at a time.
Select 'CoreNLP' if you want to use the one with the highest probability of working in most of your documents:
Generated with https://github.com/ICIJ/fluent-http-apigen
Retrieve the batch search list for the user issuing the request.
Return 200 and the list of batch searches
Example :
Retrieve the batch search list for the user issuing the request, filtered by the given criteria, and the total of batch searches matching the criteria.
It needs a Query json body with the parameters :
from : index offset of the first document to return (mandatory)
size : window size of the results (mandatory)
sort : field to sort (prj_id name user_id description state batch_date batch_results published) (default "batch_date")
order : "asc" or "desc" (default "asc")
project : projects to include in the filter (default null / empty list)
batchDate : batch search with a creation date included in this range (default null / empty list)
state : states to include in the filter (default null / empty list)
publishState : publish state to filter (default null)
If from/size are not given, their default values are 0, meaning that all the results are returned. batchDate must be a list of 2 items (the first one for the starting date and the second one for the ending date). If defined, publishState is a string equal to "0" or "1".
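The parameters above can be assembled into a JSON body as follows. This Python sketch relies only on the parameter names and constraints listed above. The `batch_search_query` helper is a hypothetical name, and the endpoint URL and authentication are deliberately left out.

```python
import json


def batch_search_query(from_=0, size=0, sort="batch_date", order="asc",
                       project=None, batch_date=None, state=None,
                       publish_state=None):
    """Build the JSON body described above and check its basic constraints."""
    if order not in ("asc", "desc"):
        raise ValueError('order must be "asc" or "desc"')
    if batch_date is not None and len(batch_date) != 2:
        raise ValueError("batchDate must contain a start and an end date")
    if publish_state not in (None, "0", "1"):
        raise ValueError('publishState must be "0" or "1" when defined')
    body = {
        "from": from_,
        "size": size,
        "sort": sort,
        "order": order,
        "project": project or [],
        "batchDate": batch_date,
        "state": state or [],
        "publishState": publish_state,
    }
    return json.dumps(body)


print(batch_search_query(size=10, order="desc"))
```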
Return 200 and the list of batch searches with the total batch searches for the query. See example for the JSON format.
Example :
Retrieve the batch search with the given id. The query param "withQueries" accepts a boolean value. When "withQueries" is set to false, the list of queries is empty and nbQueries contains the number of queries.
Parameter batchId
Return 200 and the batch search
Example :
Retrieve the batch search queries with the given batch id and returns a list of strings UTF-8 encoded
if the request parameter format is set with csv, then it will answer with content-disposition attachment (file downloading)
the optional request parameters are :
from: if not provided it starts from 0
size: if not provided all queries are returned from the "from" parameter
search: if provided it will filter the queries accordingly
orderBy: field name to order by asc, "query_number" by default (if it does not exist it will return a 500 error)
maxResult: number of maximum results for each returned query (-1 means no maxResults)
Parameter batchId
Return 200 and the batch search queries map [(query, nbResults), ...]
Example :
preflight request
Return 200 DELETE
preflight request for removal of one batch search
Parameter batchId
Return 200 DELETE
Delete batch search with the given id and its results. It won't delete running batch searches, because results are added and would be orphans.
Returns 204 (No Content) : idempotent
Return 204
Example :
Update batch search with the given id.
Returns 200, or 404 if there is no batch search with this id. If the user issuing the request is not the same as the batch owner in database, it will do nothing (thus returning 404).
Return 200 or 404
Example :
Creates a new batch search. This is a multipart form with 8 fields: name, description, csvFile, published, fileTypes, paths, fuzziness, phrase_matches.
The order of the fields does not matter. The name and the CSV file are mandatory, otherwise it will return 400 (bad request). The CSV file must have fewer than 60 000 lines, otherwise it will return 413 (payload too large). Queries with fewer than two characters are filtered out.
To do so with bash you can create a text file like :
Then convert the line endings to CRLF with a sed command like this:
sed -i 's/$/^M/g' ~/multipart.txt
Then make a curl request with this file :
Parameter comaSeparatedProjects
Parameter context : the request body
Return 200 or 400 or 413
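The same preparation can be sketched in Python instead of bash. This is a minimal sketch under stated assumptions: `usable_queries` and `encode_multipart` are hypothetical helpers, the boundary and field values are placeholders, and a real file part would normally also carry a filename and a content type.

```python
MAX_CSV_LINES = 60_000  # documented limit: the CSV must have fewer lines than this


def usable_queries(csv_text):
    """Mirror the documented checks client-side (hypothetical helper)."""
    lines = csv_text.splitlines()
    if len(lines) >= MAX_CSV_LINES:
        raise ValueError("CSV too large: the server would answer 413")
    # Queries shorter than two characters are filtered out by the server;
    # stripping whitespace before measuring is an assumption of this sketch.
    return [q for q in lines if len(q.strip()) >= 2]


def encode_multipart(fields, boundary):
    """Encode text fields as multipart/form-data using CRLF line endings."""
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    return "\r\n".join(lines).encode("utf-8")


queries = usable_queries("a\nMossack Fonseca\nICIJ\n")
body = encode_multipart(
    {"name": "my batch search", "csvFile": "\n".join(queries)},
    boundary="AaB03x",
)
```

Joining the parts with `\r\n` plays the role of the sed step above, which appends a carriage return to each line of the text file.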
preflight request
Return 200 POST
Create a new batch search based on a previous one given its id, and enqueue it for running
it returns 404 if the source BatchSearch object is not found in the repository.
Parameter sourceBatchId: the id of BatchSearch to copy
Parameter context : the context of request (containing body)
Return 200 or 404
Example:
Retrieve the results of a batch search as JSON.
It needs a Query json body with the parameters :
from : index offset of the first document to return (mandatory)
size : window size of the results (mandatory)
queries: list of queries to be downloaded (default null)
sort: field to sort ("doc_nb", "doc_id", "root_id", "doc_path", "creation_date", "content_type", "content_length", "creation_date") (default "doc_nb")
order: "asc" or "desc" (default "asc")
If from/size are not given their default values are 0, meaning that all the results are returned.
Parameter batchId
Parameter webQuery
Return 200
Example :
Retrieve the results of a batch search as a CSV file.
The search request is by default all results of the batch search.
Parameter batchId
Return 200 and the CSV file as attached file
Example :
Delete batch searches and results for the current user.
Returns 204 (No Content): idempotent
Return 204
Example :
Returns the file from the index with the index id and the root document (if embedded document).
The routing can be omitted if it is a top level document, or it can be the same as the id.
Returns 404 if it doesn't exist
Returns 403 if the user has no access to the requested index.
Parameter project
Parameter id
Parameter routing
Return 200 or 404 or 403 (Forbidden)
Example :
Fetch extracted text by slice (pagination)
Parameter project Project id
Parameter id Document id
Parameter offset Starting byte (starts at 0)
Parameter limit Size of the extracted text slice in bytes
Parameter targetLanguage Target language (like "ENGLISH") to get slice from translated content
Return 200 and a JSON containing the extracted text content ("content":text), the max offset as last rank index ("maxOffset":number), start ("start":number) and size ("size":number) parameters.
Throws IOException
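To read a whole document slice by slice, a client has to compute successive offset and limit pairs. The sketch below shows one way to do it, assuming that "maxOffset" can be treated as the total size in bytes (an exclusive upper bound); `slice_ranges` is a hypothetical helper, not part of the API.

```python
def slice_ranges(max_offset, limit):
    """Yield (offset, limit) pairs covering [0, max_offset) in order."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    offset = 0
    while offset < max_offset:
        # The last slice is shortened so it does not run past max_offset.
        yield offset, min(limit, max_offset - offset)
        offset += limit


ranges = list(slice_ranges(max_offset=10, limit=4))
print(ranges)
```

Each pair can then be passed as the offset and limit parameters of one request.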
Example :
Search query occurrences in content or translated content (pagination)
Parameter project Project id
Parameter id Document id
Parameter query Query string to search occurrences (starts at 0)
Parameter targetLanguage Target language (like "ENGLISH") to search in translated content
Return 200 and a JSON containing the occurrences offsets in the text, and the count of occurrences.
Throws IOException
Example :
Group star the documents. The id list is passed in the request body as a json list.
It answers 200 if the change has been done and the number of documents updated in the response body.
Parameter projectId
Parameter docIds as json
Return 200 and the number of documents updated
Example :
Group unstar the documents. The id list is passed in the request body as a json list.
It answers 200 if the change has been done and the number of documents updated in the response body.
Parameter projectId
Parameter docIds as json in body
Return 200 and the number of documents unstarred
Example :
Retrieves the list of starred documents for a given project.
Parameter projectId
Return 200
Example :
Retrieves the list of documents tagged with tag "tag" for the given project id.
This service doesn't need to have the document stored in the database (no join is made)
Parameter projectId
Parameter comaSeparatedTags
Return 200
Example :
preflight request
Parameter projectId
Parameter docId
Return 200 PUT
Parameter projectId
Parameter docId
Parameter routing
Parameter tags
Return 201 if created else 200
Example :
Gets all the tags from a document with the user and timestamp.
Parameter projectId
Parameter docId
Return 200 and the list of tags
Example :
Group tag the documents. The document id list and the tag list are passed in the request body.
It answers 200 if the change has been done.
Parameter projectId
Parameter query
Return 200
Example :
Group untag the documents. The document id list and the tag list are passed in the request body.
It answers 200 if the change has been done.
Parameter projectId
Parameter query
Return 200
Example :
preflight request
Parameter projectId
Parameter docId
Return 200 PUT
Untag one document
Parameter projectId
Parameter docId
Parameter routing
Parameter tags
Return 201 if untagged else 200
Retrieves the list of starred documents for all projects.
This service needs to have the document stored in the database.
Return 200 and the list of Documents
Retrieves the list of users who recommended a document with the total count of recommended documents for the given project id
Parameter projectId
Return 200
Example :
Get all users who recommended a document with the count of all recommended documents for project and documents ids.
Parameter projectId
Parameter comaSeparatedDocIds
Return 200 and the list of tags
Example :
Retrieves the set of marked read documents for the given project id and a list of users provided in the url.
This service doesn't need to have the document stored in the database (no join is made)
Parameter projectId
Parameter comaSeparatedUsers
Return 200
Example :
Group mark the documents "read". The id list is passed in the request body as a json list.
It answers 200 if the change has been done and the number of documents updated in the response body.
Parameter projectId
Parameter docIds as json
Return 200 and the number of documents marked
Example :
Group unmark the documents. The id list is passed in the request body as a json list.
It answers 200 if the change has been done and the number of documents updated in the response body.
Parameter projectId
Parameter docIds as json
Return 200 and the number of documents unmarked
Example :
Gets the extension set in JSON
If a request parameter "filter" is provided, the regular expression will be applied to the list.
see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for pattern syntax.
Example:
Return
Preflight request
Return OPTIONS,PUT
Download (if necessary) and install extension specified by its id or url
request parameter id
or url
must be present.
Return 200 if the extension is installed
Return 404 if the extension is not found by the provided id or url
Return 400 if neither id nor url is provided
Throws IOException
Example:
Preflight request
Return OPTIONS,DELETE
Uninstall extension specified by its id
Parameter extensionId
Return 204 if the extension is uninstalled (idempotent)
Throws IOException if there is a filesystem error
Example:
Create the index for the current user if it doesn't exist.
Return 201 (Created) or 200 if it already exists
Example :
Preflight for index creation.
Parameter index
Return 200 with PUT
Head request useful for JS api (for example to test if an index exists)
Parameter path
Return 200
The search endpoint is just a proxy in front of Elasticsearch, everything sent is forwarded to Elasticsearch. DELETE method is not allowed.
Path can be of the form :
_search/scroll
index_name/_search
index_name1,index_name2/_search
index_name/_count
index_name1,index_name2/_count
index_name/doc/_search
index_name1,index_name2/doc/_search
Parameter path
Return 200 or http error from Elasticsearch
Example :
Search GET request to Elasticsearch
As it is a GET method, all paths are accepted.
if a body is provided, the body will be sent to ES as source=urlencoded(body)&source_content_type=application%2Fjson in that case, request parameters are not taken into account.
Parameter path
Return 200 or http error from Elasticsearch
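The encoding of the body into request parameters can be reproduced on the client side with the Python standard library. This is only a sketch of the encoding described above: `body_to_source_params` is a hypothetical helper and the match_all query is a placeholder.

```python
import json
from urllib.parse import parse_qs, urlencode


def body_to_source_params(body):
    """Encode a JSON body as source=...&source_content_type=... parameters."""
    return urlencode({
        "source": json.dumps(body, separators=(",", ":")),
        "source_content_type": "application/json",
    })


query_string = body_to_source_params({"query": {"match_all": {}}})
print(query_string)

# Decoding the parameter gives the original body back.
decoded = json.loads(parse_qs(query_string)["source"][0])
```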
Example :
Preflight option request
Parameter path
Return 200
Returns the named entity with given id and document id.
Parameter id
Parameter documentId the root document
Return 200
Example :
preflight request for hide
Parameter mentionNorm
Return 200 PUT
hide all named entities with the given normalized mention
Parameter mentionNorm
Parameter project
Return 200
Example :
Get the list of registered pipelines.
Return pipeline set Example:
When datashare is launched in NER mode (without index) it exposes a name finding HTTP API. The text is sent with the HTTP body.
Parameter pipeline to use
Parameter text to analyse in the request body
Return list of NamedEntities annotations
Example :
Gets the list of notes for a project and a document path.
if we have on disk:
And in database
then :
GET /api/p1/notes/a/b/doc1
will return note A and B
GET /api/p1/notes/a/c/doc2
will return note A
GET /api/p1/notes/d/doc3
will return an empty list
If the user doesn't have access to the project she gets a 403 Forbidden
Parameter project the project the note belongs to
Parameter documentPath the document path
Parameter context HTTP context containing the user
Return list of Note that match the document path
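The matching rule implied by the example (a note applies to every document located under its path) can be sketched as follows. The note tuples, their paths and the helper `notes_for` are assumptions made for illustration; the actual server-side matching may differ.

```python
# Illustrative notes as (project, path, note, variant); values are assumed.
NOTES = [
    ("p1", "a", "note A", "info"),
    ("p1", "a/b", "note B", "danger"),
]


def notes_for(project, document_path, notes=NOTES):
    """Return the notes whose path is a parent directory of the document."""
    matching = []
    for note_project, note_path, text, _variant in notes:
        if note_project != project:
            continue
        # Compare whole path components so that "a" does not match "ab/doc".
        if document_path.startswith(note_path.rstrip("/") + "/"):
            matching.append(text)
    return matching


print(notes_for("p1", "a/b/doc1"))
print(notes_for("p1", "a/c/doc2"))
print(notes_for("p1", "d/doc3"))
```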
Example:
Gets the list of notes for a project.
If the user doesn't have access to the project she gets a 403 Forbidden
Parameter project the project the note belongs to
Parameter context HTTP context containing the user
Return list of Note related to the project
Example:
Gets the plugins set in JSON
If a request parameter "filter" is provided, the regular expression will be applied to the list.
see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for pattern syntax.
Example:
Preflight request
Return OPTIONS,PUT
Download (if necessary) and install plugin specified by its id or url
request parameter id
or url
must be present.
Return 200 if the plugin is installed
Return 404 if the plugin is not found by the provided id or url
Return 400 if neither id nor url is provided
Throws IOException
Throws ArchiveException
Example:
Preflight request
Return OPTIONS,DELETE
Uninstall plugin specified by its id Always returns 204 or error 500.
Parameter pluginId
Return 204
Throws IOException if there is a filesystem error
Example:
Gets the project information for the given project id.
Parameter id
Return 200 and the project from database if it exists
Example :
```
curl -H 'Content-Type:application/json' localhost:8080/api/project/apigen-datashare
{"error":"java.lang.NullPointerException"}
```
## Get /api/project/isDownloadAllowed/:id
Returns whether download is allowed for the project with this network route: in the datashare database, the project table can specify an IP mask that is allowed per project. If the client IP is not in the range, then the file download will be forbidden.
In that project table there is a field called allow_from_mask that can hold a mask made of an IP and star wildcards. For example, 192.168.*.* will match all IPs of the 192.168.0.0 subnetwork, and only users with an IP in this range will be allowed to download documents.
Parameter id
Return 200 or 403 (Forbidden)
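The star wildcard matching of allow_from_mask can be pictured with a small Python sketch. It assumes that each octet is compared separately and that a star matches any value; `ip_matches_mask` is a hypothetical helper and the server may implement the check differently.

```python
def ip_matches_mask(ip, mask):
    """Return True when every octet equals the mask octet or the mask has '*'."""
    ip_parts = ip.split(".")
    mask_parts = mask.split(".")
    if len(ip_parts) != 4 or len(mask_parts) != 4:
        return False
    return all(m == "*" or m == i for i, m in zip(ip_parts, mask_parts))


print(ip_matches_mask("192.168.1.5", "192.168.*.*"))
print(ip_matches_mask("10.0.0.1", "192.168.*.*"))
```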
Example :
Example :
Preflight option request
Parameter id
Return 200 DELETE
Delete the project from database and elasticsearch indices.
It always returns 204 (no content) or 500 if an error occurs.
If the project id is not the current user project (local-datashare in local mode), then it will return 401 (unauthorized)
Parameter id
Return 204
Example :
gets the root of the front-end app ie: ./app/index.html
if pluginsDir is set, it will add in the index the tag else it will return the index.html content as is
Return the content of index.html file
Gets the public (i.e. without user's information) datashare settings parameters. These parameters are used for the client app for the init process.
The endpoint is removing all fields that contain Address or Secret or Url or Key
Return 200
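The filtering rule can be illustrated with a short Python sketch. The setting names in the sample are placeholders, the match is assumed to be case-sensitive, and `public_settings` is a hypothetical helper rather than the server code.

```python
# Substrings that cause a field to be removed, as listed above.
HIDDEN_MARKERS = ("Address", "Secret", "Url", "Key")


def public_settings(settings):
    """Drop every entry whose key contains one of the markers."""
    return {
        key: value
        for key, value in settings.items()
        if not any(marker in key for marker in HIDDEN_MARKERS)
    }


# Placeholder settings used only for illustration.
sample = {
    "mode": "LOCAL",
    "elasticsearchAddress": "http://localhost:9200",
    "oauthClientSecret": "s3cr3t",
    "defaultProject": "local-datashare",
}
print(public_settings(sample))
```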
Example :
Gets the versions (front/back/docker) of datashare.
Return 200
Example :
Preflight for settings.
Parameter context
Return 200 with PATCH
update the datashare settings with provided body. It will save the settings on disk.
Returns 404 if settings is not found. It means that the settings file has not been set (or is not readable). Returns 403 if we are in SERVER mode.
The settings priority is basically DS_DOCKER_* variables > -s file > classpath:datashare.properties > command line. I.e. :
DS_DOCKER_* variables will be taken and override all keys (if any similar keys exist)
if a file is given (w/ -c path/to/file) to the command line it will be read and used (it can be empty or not present)
if no file is given, we are looking for datashare.properties in the classpath (for example in /dist)
if none of the two above cases is fulfilled we are taking the default CLI parameters (and those given by the user)
parameters are common between CLI and settings file, the settings file "wins"
if a settings file is not writable then 404 will be returned (and a WARN will be logged at start)
Return 200 or 404 or 403
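One way to picture this priority order is a layered lookup in which the first source that defines a key wins. The Python sketch below assumes the three sources have already been loaded into dictionaries with comparable keys; `resolve_settings` and the sample keys are hypothetical.

```python
from collections import ChainMap


def resolve_settings(docker_env, settings_file, cli_defaults):
    """Merge the three sources; earlier mappings win on duplicate keys."""
    return dict(ChainMap(docker_env, settings_file, cli_defaults))


# Placeholder values: DS_DOCKER_* variables first, then the settings file
# (given on the command line or found in the classpath), then CLI defaults.
merged = resolve_settings(
    docker_env={"mode": "SERVER"},
    settings_file={"mode": "LOCAL", "defaultProject": "my-project"},
    cli_defaults={"defaultProject": "local-datashare", "port": "8080"},
)
print(merged)
```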
Example :
List all available languages in Tesseract
Returns 503 if Tesseract is not installed
Return 200 or 503
List all available languages in the text extractor
Return 200
Retrieve the status of databus connection, database connection, shared queues and index. Adding the "format=openmetrics" parameter to the url will return the status in OpenMetrics format.
Return the status of datashare elements
Example:
Gets all the user tasks. A filter can be added with a pattern contained in the task name.
Return 200 and the list of tasks
Example :
gets one task with its id
Parameter id
Return 200
Example :
gets task result with its id
Parameter id
Return 200 and the result, 204 if there is no result, 404 if the task doesn't exist, 403 if the task does not belong to the current user
Example :
download files from a search query. Expected parameters are :
project: string
query: string or elasticsearch JSON query
if the query is a string it is taken as an ES query string, else it is a raw JSON query (without the query part) @see org.elasticsearch.index.query.WrapperQueryBuilder that is used to wrap the query
Parameter optionsWrapper wrapper for options json
Return 200 and json task
Example :
index files from the queue
Parameter optionsWrapper wrapper for options json
Return 200 and json task
Example :
Indexes files in a directory (with docker, it is the mounted directory that is scanned)
Parameter optionsWrapper
Return 200 and the list of tasks created
Example :
Indexes all files of a directory with the given path.
Parameter filePath
Parameter optionsWrapper
Return 200 and the list of created tasks
Example $(curl -XPOST localhost:8080/api/task/batchUpdate/index/home/dev/myfile.txt)
Scans recursively a directory with the given path
Parameter filePath
Parameter optionsWrapper
Return 200 and the task created
Example :
Cleans all DONE tasks.
Return 200 and the list of removed tasks
Example :
Cleans a specific task.
Parameter taskName
Return
Example :
Cancels the task with the given name. It answers 200 with the cancellation status true|false
Parameter taskId
Return
Cancels the running tasks. It returns a map with task name/stop statuses. If the status is false, it means that the thread has not been stopped.
Return 200 and the tasks stop result map
Example : curl -XPUT localhost:8080/api/task/stopAll
Find names using the given pipeline :
OPENNLP
CORENLP
IXAPIPE
GATENLP
MITIE
This endpoint is going to find all Documents that are not tagged with the given pipeline, and extract named entities for all these documents.
Parameter pipelineName
Parameter optionsWrapper
Return 200 and the list of created tasks
Example :
List all files and directories for the given path. This endpoint returns a JSON using the same specification as the tree command on UNIX. It is roughly the equivalent of:
Parameter dirPath
Return 200 and the list of files and directory
Example $(curl -XGET localhost:8080/api/tree/home/datashare/data)
Gets the user's session information
Return 200 and the user map
Example :
Preflight for history.
Return 200 with OPTIONS, GET, PUT and DELETE
Gets the user's history by type
Parameter type String included in 'document' or 'search'
Parameter from the offset of the list, starting from 0
Parameter size the number of element retrieved
Parameter sort the name of the parameter to sort on (default: modificationDate)
Parameter desc the list is sorted in descending order (default: true)
Parameter projects projectIds separated by comma to filter by projects (default: none)
Return 200, the user's list of events and the total number of events
Example :
```
curl -i localhost:8080/api/users/me/history?type=document&from=0&size=10&sort=modificationDate&desc=true&projects=project1,project2
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json;charset=UTF-8
ETag: 9a3f093e2dc5d929bb25879501d527c7
Content-Length: 22
Connection: keep-alive
Set-Cookie: _ds_session_id={"login":null,"roles":null,"sessionId":null,"redirectAfterLogin":"/"}; version=1; path=/; expires=Mon, 30-Jul-2091 14:00:32 GMT; max-age=2147483647

{"items":[],"total":0}

curl -i -XPUT -H "Content-Type: application/json" localhost:8080/api/users/me/history -d '{"type": "SEARCH", "projectIds": ["apigen-datashare","local-datashare"], "name": "foo AND bar", "uri": "?q=foo%20AND%20bar&from=0&size=100&sort=relevance&index=luxleaks&field=all&stamp=cotgpe"}'
HTTP/1.1 500 Internal Server Error
Content-Type: application/json;charset=UTF-8
ETag: b1b6023e69d8821fc4e1e8418ab85f30
Content-Length: 77
Connection: keep-alive

{"error":"org.jooq.exception.DataAccessException: Cannot commit transaction"}

curl -i -XDELETE localhost:8080/api/users/me/history?type=search
HTTP/1.1 500 Internal Server Error
Content-Type: application/json;charset=UTF-8
ETag: b1b6023e69d8821fc4e1e8418ab85f30
Content-Length: 77
Connection: keep-alive

{"error":"org.jooq.exception.DataAccessException: Cannot commit transaction"}

curl -i -XDELETE localhost:8080/api/users/me/history/event?id=1
HTTP/1.1 500 Internal Server Error
Content-Type: application/json;charset=UTF-8
ETag: b1b6023e69d8821fc4e1e8418ab85f30
Content-Length: 77
Connection: keep-alive

{"error":"org.jooq.exception.DataAccessException: Cannot commit transaction"}
```
It means that you are on Windows.
Search and open 'Computer management':
Go to 'Local users and groups':
In 'Groups', double-click 'docker-users':
If you are not in 'docker-users', go to 'Users' on the left and add yourself to the 'docker-users' group by clicking on your user and 'Add...':
p1 | a | note A | info
p1 | a/b | note B | danger
This documentation is intended to help you create plugins for Datashare client. All methods currently exposed in the Core class are available to a global variable called datashare.
Example
Class representing the core application with public methods for plugins.
Kind: global class
Mixes: FiltersMixin, HooksMixin, I18nMixin, PipelinesMixin, ProjectsMixin, WidgetsMixin
instance
.ready : Promise.<Object>
.bootstrapVue ⇒ Plugin
.i18n : I18n
.router : VueRouter
.store : Vuex.Store
.plugin ⇒ *
.auth : Auth
.config : Object
.api : Api
.vue : Vue
.wait : VueWait
.mode : String
.buildCorePlugin() ⇒ VueCore
.configure() ⇒ Promise.<Object>
.mount([selector]) ⇒ Vue
.getUser() ⇒ Promise.<Object>
.loadUser() ⇒ Promise
.loadSettings() ⇒ Promise
static
Create an application
api
Datashare api interface
mode
mode of authentication ('local' or 'server')
Get a promise that is resolved when the application is ready
Kind: instance property of Core
Fulfill: Object - The actual application core instance.
Core
Deprecated
The application core instance. Deprecated in favor of the core property.
Kind: instance property of Core
Core
The application core instance
Kind: instance property of Core
The Bootstrap Vue plugin instance.
Kind: instance property of Core
The I18n instance
Kind: instance property of Core
The VueRouter instance
Kind: instance property of Core
The Vuex instance
Kind: instance property of Core
The CorePlugin instance
Kind: instance property of Core
The Auth module instance
Kind: instance property of Core
The configuration object provided by Murmur
Kind: instance property of Core
The Datashare api interface
Kind: instance property of Core
The Vue app
Kind: instance property of Core
The VueWait
Kind: instance property of Core
Get current Datashare mode
Kind: instance property of Core
Core
Add a Vue plugin to the app
Kind: instance method of Core
Returns: Core
- the current instance of Core
Plugin
Object
The actual Vue plugin class
options
Object
Option to pass to the plugin
Core
Configure all default Vue plugins for this application
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure vue-i18n plugin
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure bootstrap-vue plugin
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure vue-router plugin
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure vuex plugin
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure most common Vue plugins (Murmur, VueShortkey, VueScrollTo and VueCalendar)
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Configure vue-wait plugin
Kind: instance method of Core
Returns: Core
- the current instance of Core
Core
Add a $core property to the instance's Vue
Kind: instance method of Core
Returns: Core
- the current instance of Core
Build a VueCore instance with the current Core instance as parameter of the global properties.
Kind: instance method of Core
Load settings from the server and instantiate most of the application configuration.
Kind: instance method of Core
Fulfill: Core - The instance of the core application
Reject: Object - The Error object
Mount the instance's vue application
Kind: instance method of Core
Returns: Vue - The instantiated Vue
[selector]
String
#app
Query selector to the mounting point
Build a promise to be resolved when the application is configured.
Kind: instance method of Core
Core
Dispatch an event from the document root, passing the core application through event message.
Kind: instance method of Core
Returns: Core
- the current instance of Core
name
String
Name of the event to fire
...args
Mixed
Additional params to pass to the event
Get the current signed user.
Kind: instance method of Core
Fulfill: Object - Current user
Get and update user definition in place
Kind: instance method of Core
Get settings (both from the server settings and the current mode)
Kind: instance method of Core
Append the given title to the page title
Kind: instance method of Core
title
String
Title to append to the page
[suffix]
String
Datashare
Suffix to the title
Register a callback to an event using the EventBus singleton.
Kind: instance method of Core
event
String
callback
*
Unregister a callback to an event using the EventBus singleton.
Kind: instance method of Core
event
String
callback
*
Emit an event using the EventBus singleton.
Kind: instance method of Core
event
String
payload
*
Core
instantiate a Core class (useful for chaining usage or mapping)
Kind: static method of Core
...options
Mixed
Options to pass to the Core constructor
Mixin class extending the core to add helpers for components.
Kind: global mixin
Mixin class extending the core to add helpers for filters.
Kind: global mixin
Register a filter
Kind: instance method of FiltersMixin
...args
Mixed
Filter's params.
args.type
String
Type of the filter.
args.options
Object
Options to pass to the filter constructor.
args.options.name
String
Name of the filter.
args.options.key
String
Key of the filter. Typically ElasticSearch field name.
[args.options.icon]
String
Icon of the filter.
[args.options.isSearchable]
Boolean
false
Set if this filter should be searchable or not.
[args.options.alternativeSearch]
function
()=>{}
Set a function about how to transform query term before searching for it.
[args.options.order]
Number
Order of the filter. Will be added as last filter by default.
Unregister a filter
Kind: instance method of FiltersMixin
name
String
Name of the filter to unregister
Register a filter only for a specific project
Kind: instance method of FiltersMixin
name
String
Name of the project
...args
Mixed
Filter's options.
args.name
String
Name of the filter
args.type
String
Type of the filter.
args.options
Object
Options to pass to the filter constructor
Unregister a filter only for a specific project
Kind: instance method of FiltersMixin
name
String
Name of the project
name
String
Name of the filter
Mixin class extending the core to add helpers for hooks.
Kind: global mixin
Register a hook
Kind: instance method of HooksMixin
...args
Mixed
Hook's options
args.name
String
Name of the hook
args.target
String
Target of the hook
args.order
Number
Priority of the hook
args.definition
Object
Options to pass to the hook constructor
Unregister a specific hook
Kind: instance method of HooksMixin
name
String
Name of the hook
Unregister all hooks from a target
Kind: instance method of HooksMixin
name
String
Name of the target
Unregister all hooks, on every target
Kind: instance method of HooksMixin
Register a hook for a specific project
Kind: instance method of HooksMixin
project
String
Project to add this hook to
options
Object
Hook's options
options.name
String
Name of the hook
options.target
String
Target of the hook
options.order
Number
Priority of the hook
options.definition
Object
Options to pass to the hook constructor
Mixin class extending the core to add helpers for i18n.
Kind: global mixin
.initializeI18n() ⇒ Promise
.setI18nLocale(locale) ⇒ String
.hasI18Locale(locale) ⇒ Boolean
.loadI18Locale(locale) ⇒ Promise
Initialize i18N using the local storage and load the necessary locale's messages
Kind: instance method of I18nMixin
Set the active locale both in local storage and VueI18n.
Kind: instance method of I18nMixin
locale
String
Key of the locale (fr, de, en, ja, ...)
Check the given locale storage was loaded.
Kind: instance method of I18nMixin
locale
String
Key of the locale (fr, de, en, ja, ...)
Load i18n messages for the given locale (if needed) and set it as the current locale.
Kind: instance method of I18nMixin
locale
String
Key of the locale (fr, de, en, ja, ...)
Mixin class extending the core to add helpers for pipelines.
Kind: global mixin
Register a pipeline
Kind: instance method of PipelinesMixin
...args
Mixed
Pipeline's options.
args.name
String
Name of the pipeline
args.type
String | function
Type of the pipeline.
category
String
The pipeline to target
Unregister a pipeline
Kind: instance method of PipelinesMixin
name
String
Name of the pipeline
Register a pipeline for a specific project
Kind: instance method of PipelinesMixin
project
String
Name of the project
...args
Mixed
Pipeline's options.
args.name
String
Name of the pipeline
args.type
String | function
Type of the pipeline.
category
String
The pipeline to target
Unregister a pipeline for a specific project
Kind: instance method of PipelinesMixin
project
String
Name of the project
name
String
Name of the pipeline
Mixin class extending the core to add helpers for projects.
Kind: global mixin
Call a function when a project is selected
Kind: instance method of ProjectsMixin
name
String
Name of the project
withFn
function
Function to call when the project is selected
withoutFn
function
Function to call when the project is unselected
mutationType
String
Mutation type that will be watched for changes.
storePath
String
Path to the project in the store
Create a default project on Datashare using the API
Kind: instance method of ProjectsMixin
Returns: Promise:Object - The HTTP response object
Mixin class extending the core to add helpers for widgets.
Kind: global mixin
Register a widget
Kind: instance method of WidgetsMixin
...args
Mixed
Widget's options passed to widget constructor
args.name
String
Name of the widget
args.card
Boolean
Whether or not this widget should be a card component from Bootstrap.
args.cols
Number
Number of columns to fill in the grid (from 1 to 12)
[args.type]
String
WidgetEmpty
Type of the widget
Unregister a widget
Kind: instance method of WidgetsMixin
name
String
Name of the widget to unregister
Unregister all widgets
Kind: instance method of WidgetsMixin
Register a widget for a specific project
Kind: instance method of WidgetsMixin
project
String
Name of the project to add this widget to
options
Object
Widget's options passed to widget constructor
options.name
String
Name of the widget
options.card
Boolean
Whether or not this widget should be a card component from Bootstrap.
options.cols
Number
Number of columns to fill in the grid (from 1 to 12)
[options.type]
String
WidgetEmpty
Type of the widget
Replace an existing widget
Kind: instance method of WidgetsMixin
name
String
Name of the widget to replace
options
Object
Widget's options passed to widget constructor.
options.card
Boolean
Whether or not this widget should be a card component from Bootstrap.
options.cols
Number
Number of columns to fill in the grid (from 1 to 12)
[options.type]
String
WidgetEmpty
Type of the widget
Example
Replace an existing widget for a specific project
Kind: instance method of WidgetsMixin
project
String
Name of the project to add this widget to
name
String
Name of the widget to replace
options
Object
Widget's options passed to widget constructor. Each widget class can define its own default values.
options.card
Boolean
Whether or not this widget should be a card component from Bootstrap.
options.cols
Number
Number of columns to fill in the grid (from 1 to 12)
[options.type]
String
WidgetEmpty
Type of the widget
List all projects this user has access to.
Kind: global variable
List all projects name ids this user has access to.
Kind: global variable
Get the name of the default project
Kind: global variable
Asynchronously find a component in the lazyComponents object by its name.
Kind: global function
Returns: Promise.<(object|null)> - A promise that resolves with the found component object, or null if not found.
name
string
The name of the component to find.
Asynchronously get a component from the lazyComponents object based on its name.
Kind: global function
Returns: Promise.<(object|Error)> - A promise that resolves with the found component object, or rejects with an Error if not found.
name
string
The name of the component to retrieve.
Check if multiple component names are the same when slugified.
Kind: global function
Returns: boolean - True if all names are the same when slugified, false otherwise.
...names
string
The component names to compare.
Generate a slug from the component name using kebab case and lowercase.
Kind: global function
Returns: string - The slugified component name.
name
string
The name of the component to slugify.
Get the lazyComponents object using require.context for lazy loading of components.
Kind: global function Returns: Object - - The lazyComponents object generated using require.context.
Return true if the default project exists
Kind: global function
Retrieve a project by its name
Kind: global function Returns: Object - The project matching with this name
name
String
Name of the project to retrieve
Delete a project by its name identifier.
Kind: global function Returns: Promise:Integer - Index of the project deleted or -1 if project does not exist
name
String
Name of the project to delete
Delete a project from the search store
Kind: global function
name
String
Name of the project to delete from the store
Update a project in the list or add it if it doesn't exist yet.
Kind: global function Returns: Object - The project
project
Object
You can send an email to datashare@icij.org.
When reporting a bug, please share:
your OS (Mac, Windows or Linux) and version
the problem, with screenshots if possible
the actions that led to the problem
Advanced users can post an issue with their logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues
Datashare can display a 'Preview' for some document types only: images, PDF, CSV, XLSX and TIFF. Other document types are not supported yet.
To allow external developers to add their own components, we added markers called "hooks" in strategic locations of the user interface, where a user can define new Vue components through plugins.
search.nav:before
search.nav:after
app-sidebar.menu:before
app-sidebar.menu:after
app-sidebar.help:before
app-sidebar.help:after
app-sidebar.guides:before
app-sidebar.guides:after
app-sidebar.locales:before
app-sidebar.locales:after
document.content:before
document.content.toolbox:before
document.content.toolbox:after
document.content.ner:before
document.content.togglers:before
document.content.togglers:after
document.content.ner:after
document.content.body:before
document.content.body:after
document.content:after
filters-panel:before
filters-panel.toolbar:before
filters-panel.toolbar:after
filters-panel.filters:before
filters-panel.filters:after
filters-panel:after
app:before
app:after
document.header:before
document.header.name:before
document.header.name:after
document.header.tags:before
document.header.tags:after
document.header.nav:before
document.header.nav.items:before
document.header.nav.items:after
document.header.nav:after
document.header:after
landing.form:before
landing.form.heading:before
landing.form.heading:after
landing.form:after
landing.form.project:before
landing.form.project:after
search:before
search.body:before
search.body:after
search:after
Widget to display the disk space occupied by indexed files on the insights page.
Kind: global class
Widget to display the number of files by creation date on the insights page.
Kind: global class
Widget to display number of files by creation date by path
Kind: global class
Widget for the insights page indicating the proportion of duplicates in the data.
Kind: global class
Class representing the Empty widget. This widget is not intended to be used directly.
Kind: global class
Create a new WidgetEmpty
Widget to display text on the insights page
Kind: global class
Widget to display a list of items or links on the insights page
Kind: global class
Create a new WidgetFacets
Widget to display the number of indexed files on the insights page
Kind: global class
Widget to display a list of items or links on the insights page
Kind: global class
Create a new WidgetListGroup
Widget to display names
Kind: global class
Widget to display nested widgets
Kind: global class
Create a new WidgetProject
Widget to display a search bar
Kind: global class
Create a new WidgetProject
Widget to display the latest recommended documents.
Kind: global class
Create a new WidgetRecommendedBy
Widget to display a search bar
Kind: global class
Create a new WidgetSearchBar
Widget to display text on the insights page
Kind: global class
Create a new WidgetText based on a WidgetEmpty
Class representing the TreeMap widget
Kind: global class
Create a new WidgetTreeMap based on a WidgetEmpty
Datashare's filters keep the named entities (people, organizations and locations) previously recognized.
"Old" named entities can stay in Datashare's filters even after the documents that contained them have been removed from your Datashare folder on your computer. This happens when you extract named entities, then remove the documents that contained them, then run a new analysis: the named entities stay in the filters.
To avoid this in the future, remove the documents from Datashare before indexing new ones. This removes the named entities of these documents too, so they no longer appear in the people, organizations or locations filters. To do so, click the small pink trash icon at the bottom of the left column.
If you use Datashare with Docker (not the standard version) and the Terminal (a dark window) displays a message beginning with "Windows named pipe error: The system cannot find the file specified", it means that Docker Desktop, one of the 3 components of Datashare, is not running. Relaunching Docker Desktop should solve the problem.
Find Docker Desktop in your Applications or the whale icon on the menu bar of your computer and click 'Restart'.
If you were able to see documents earlier in your current session, you might have active filters that prevent Datashare from displaying documents, because no document matches your current search. Check your URL for active filters and, if you are comfortable with losing your previously selected filters, click 'Reset filters'.
In 'Analyzed documents', if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you analyzed, it can take multiple hours.
If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:
First wait 30 seconds and reload the page.
If you still see a blank screen, please uninstall and reinstall Datashare.
To uninstall Datashare:
On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.
On Windows, please follow these steps.
On Linux, please delete the 3 containers (Datashare, Redis and Elasticsearch) and the script.
To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.
api_key
api_key_pkey PRIMARY KEY, btree (id)
api_key_user_id_key UNIQUE CONSTRAINT, btree (user_id)
batch_search
batch_search_pkey PRIMARY KEY, btree (uuid)
batch_search_date btree (batch_date)
batch_search_published btree (published)
batch_search_user_id btree (user_id)
Referenced by:
TABLE batch_search_project CONSTRAINT batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)
batch_search_project
batch_search_project_unique UNIQUE, btree (search_uuid, prj_id)
batch_search_project_batch_search_uuid_fk FOREIGN KEY (search_uuid) REFERENCES batch_search(uuid)
batch_search_query
idx_query_result_batch_unique UNIQUE, btree (search_uuid, query)
batch_search_query_search_id btree (search_uuid)
batch_search_result
batch_search_result_prj_id btree (prj_id)
batch_search_result_query btree (query)
batch_search_result_uuid btree (search_uuid)
document
document_pkey PRIMARY KEY, btree (id)
document_parent_id btree (parent_id)
document_status btree (status)
document_tag
idx_document_tag_unique UNIQUE, btree (doc_id, label)
document_tag_doc_id btree (doc_id)
document_tag_label btree (label)
document_tag_project_id btree (prj_id)
document_user_recommendation
idx_document_mark_read_unique UNIQUE, btree (doc_id, user_id, prj_id)
document_user_mark_read_doc_id btree (doc_id)
document_user_mark_read_project_id btree (prj_id)
document_user_mark_read_user_id btree (user_id)
document_user_star
idx_document_star_unique UNIQUE, btree (doc_id, user_id, prj_id)
document_user_star_doc_id btree (doc_id)
document_user_star_project_id btree (prj_id)
document_user_star_user_id btree (user_id)
named_entity
named_entity_pkey PRIMARY KEY, btree (id)
named_entity_doc_id btree (doc_id)
note
idx_unique_note_path_project UNIQUE, btree (project_id, path)
note_project btree (project_id)
project
project_pkey PRIMARY KEY, btree (id)
user_history
user_history_pkey PRIMARY KEY, btree (id)
idx_user_history_unique UNIQUE, btree (user_id, uri)
user_history_creation_date btree (creation_date)
user_history_type btree (type)
user_history_user_id btree (user_id)
Referenced by:
TABLE user_history_project CONSTRAINT user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)
user_history_project
user_history_project_unique UNIQUE, btree (user_history_id, prj_id)
user_history_project_user_history_id_fk FOREIGN KEY (user_history_id) REFERENCES user_history(id)
user_inventory
user_inventory_pkey PRIMARY KEY, btree (id)
name
string
Unique name of the widget
card
boolean
true
Whether this widget is displayed as a card
cols
number
12
Number of columns on which the widget should be displayed according to the Bootstrap's grid system
order
number
0
Display order among the other widgets
title
string
null
The title of the widget
field
string
"\"type\""
Field to build the facet upon
icon
mixed
routeQueryField
string
null
bucketTranslation
mixed
options
Object
See WidgetEmpty for other options
title
string
null
The title of the widget
items
Array
[
The list of items to display
pipeline
string
"'widget-list-group'"
I do not know
options
Object
See WidgetEmpty for other options
widgets
Array
A list of nested widgets
options
Object
See WidgetEmpty for other options
options
Object
See WidgetEmpty for other options
hideThumbnails
Boolean
Whether or not we should hide thumbnails
options
Object
See WidgetEmpty for other options
index
string
The Elasticsearch project of the Widget
options
Object
See WidgetEmpty for other options
title
string
null
The title of the widget
content
string
null
The content of the widget
pipeline
string
"'widget-text'"
Transformation to apply to the content
options
Object
See WidgetEmpty for other options
title
string
null
The title of the Widget
index
string
The Elasticsearch project of the Widget
options
Object
See WidgetEmpty for other options
id
character varying(96)
not null
user_id
character varying(96)
not null
creation_date
timestamp without time zone
not null
uuid
character(36)
not null
name
character varying(255)
description
character varying(4096)
user_id
character varying(96)
not null
batch_date
timestamp without time zone
not null
state
character varying(8)
not null
published
integer
not null
0
phrase_matches
integer
not null
0
fuzziness
integer
not null
0
file_types
text
paths
text
error_message
text
batch_results
integer
0
error_query
text
search_uuid
character(36)
not null
prj_id
character varying(96)
not null
search_uuid
character(36)
not null
query_number
integer
not null
query
text
not null
query_results
integer
0
search_uuid
character(36)
not null
query
text
not null
doc_nb
integer
not null
doc_id
character varying(96)
not null
root_id
character varying(96)
not null
doc_path
character varying(4096)
not null
creation_date
timestamp without time zone
content_type
character varying(255)
content_length
bigint
prj_id
character varying(96)
id
character varying(96)
not null
path
character varying(4096)
not null
project_id
character varying(96)
not null
content
text
metadata
text
status
smallint
extraction_level
smallint
language
character(2)
extraction_date
timestamp without time zone
parent_id
character varying(96)
root_id
character varying(96)
content_type
character varying(256)
content_length
bigint
charset
character varying(32)
ner_mask
smallint
doc_id
character varying(96)
not null
label
character varying(64)
not null
prj_id
character varying(96)
user_id
character varying(255)
creation_date
timestamp without time zone
not null
'1970-01-01 00:00:00'::timestamp without time zone
doc_id
character varying(96)
not null
user_id
character varying(96)
not null
prj_id
character varying(96)
doc_id
character varying(96)
not null
user_id
character varying(96)
not null
prj_id
character varying(96)
id
character varying(96)
not null
mention
text
not null
offsets
text
not null
extractor
smallint
not null
category
character varying(8)
doc_id
character varying(96)
not null
root_id
character varying(96)
extractor_language
character(2)
hidden
boolean
project_id
character varying(96)
not null
path
character varying(4096)
note
text
variant
character varying(16)
id
character varying(255)
not null
path
character varying(4096)
allow_from_mask
character varying(64)
label
character varying(255)
publisher_name
character varying(255)
''::character varying
maintainer_name
character varying(255)
''::character varying
source_url
character varying(2048)
''::character varying
logo_url
character varying(2048)
''::character varying
creation_date
timestamp without time zone
now()
update_date
timestamp without time zone
now()
description
character varying(4096)
''::character varying
id
integer
not null
generated by default as identity
creation_date
timestamp without time zone
not null
modification_date
timestamp without time zone
not null
user_id
character varying(96)
not null
type
smallint
not null
name
text
uri
text
not null
user_history_id
integer
not null
prj_id
character varying(96)
not null
id
character varying(96)
not null
email
text
name
character varying(255)
provider
character varying(255)
details
text
'{}'::text
Deletes batch searches and results for the current user.
no content: idempotent
Preflight request
returns 200 with DELETE
Deletes a batch search and its results with the given id. It won't delete running batch searches, because results would be orphans.
Returns 204 (No Content) : idempotent
Preflight request
returns 200 with DELETE
Retrieves the results of a batch search as an attached CSV file.
returns the results of the batch search as CSV attached file.
Preflight options request
returns 200 with OPTIONS and GET
Preflight request
returns OPTIONS and PUT
Preflight request
returns OPTIONS and DELETE
Preflight request for hide endpoint
returns PUT
Preflight request
returns OPTIONS and PUT
Preflight request
returns 200 with OPTIONS and DELETE
Uninstall plugin specified by its id.
returns 204 if the plugin is uninstalled (idempotent)
Deletes all user's projects from database and elasticsearch index.
if projects are deleted
Preflight option request
returns 200 with OPTIONS, POST, GET and DELETE
Preflight option request
returns 200 with OPTIONS and DELETE
Preflight request for batch download.
returns 200 with OPTIONS and POST
Preflight request for task cleaning.
returns OPTIONS and DELETE
Preflight request to stop tasks.
returns 200 with OPTIONS and PUT
Preflight request to stop all tasks.
returns 200 with OPTIONS and PUT
Preflight request for history
returns 200 with OPTIONS, GET, PUT and DELETE
Preflight request for history
returns OPTIONS and DELETE
Preflight for index creation.
returns 200 with PUT
Head request useful for JavaScript API (for example to test if an index exists)
returns 200
Deletes the project from database and elasticsearch index.
if project is deleted
Returns 200 if the project is allowed with this network route. In the Datashare database, the project table can specify, per project, a mask of allowed IP addresses. If the client IP is not in the range, the file download is forbidden. The mask is stored in the allow_from_mask field of the project table and can combine IP numbers with star wildcards. For example, 192.168.*.* matches every IP of the 192.168.0.0 subnetwork, so only users with an IP in that range are allowed.
if project download is allowed for this project and IP
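To make the wildcard mask concrete, here is a minimal Python sketch of how a client IP could be compared with a value such as 192.168.*.*. It assumes that each star stands for exactly one group of digits; the function name `ip_matches_mask` is hypothetical and Datashare's own matching logic may differ.

```python
import re


def ip_matches_mask(client_ip, allow_from_mask):
    """Check a dotted IPv4 address against a mask such as '192.168.*.*'."""
    # Escape the mask so dots are literal, then let each star stand for
    # one group of 1 to 3 digits (an assumption made for this sketch).
    pattern = re.escape(allow_from_mask).replace(r"\*", r"\d{1,3}")
    return re.fullmatch(pattern, client_ip) is not None
```

Under these assumptions, `ip_matches_mask("192.168.1.5", "192.168.*.*")` is true while an address such as 10.0.0.1 is rejected.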
Gets the public (i.e. without user's information) datashare settings parameters.
These parameters are used for the client app for the init process.
The endpoint is removing all fields that contain Address or Secret or Url or Key
returns the list of public settings
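The filtering rule described above can be sketched in a few lines of Python. This is only an illustration: the setting names in the usage note are hypothetical, and the sketch assumes a case-sensitive substring match on the field name, which the source does not specify.

```python
# Substrings that mark a setting as non-public, as listed in the description above.
HIDDEN_MARKERS = ("Address", "Secret", "Url", "Key")


def public_settings(settings):
    """Return a copy of the settings without fields whose name contains a hidden marker."""
    return {
        name: value
        for name, value in settings.items()
        if not any(marker in name for marker in HIDDEN_MARKERS)
    }
```

For example, a hypothetical map with the fields `mode`, `redisAddress` and `apiKey` would be reduced to the `mode` field only.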
Gets the user's session information.
returns the user map
Gets the versions (front/back/docker) of datashare.
returns the list of versions of datashare
Cancels the running tasks. It returns a map with task name/stop statuses.
If the status is false, it means that the thread has not been stopped.
returns 200 and the tasks stop result map
Gets the project information for the given id
if the project is not found in database
Uninstall extension specified by its id.
id of the extension to uninstall
returns 204 if the extension is uninstalled (idempotent)
Gets the list of registered pipelines.
returns the pipeline set
Get the JSON or YAML OpenAPI v3 contract specification
format of openapi description. Possible values are "json" or "yaml". Default="json".
returns the JSON or YAML file
Delete user event by id.
user history event id to delete
Returns 204 (No Content) : idempotent
Deletes an API key for the current user. Only available in SERVER mode.
user identifier
when key has been deleted
Preflight for key management
user identifier
returns OPTIONS, GET, PUT and DELETE
Retrieves the batch search queries with the given batch id and returns a list of strings UTF-8 encoded
identifier of the batch search
if not provided it starts from 0
if not provided all queries are returned from the "from" parameter
if set to csv, it answers with content-disposition attachment (file downloading)
if provided it will filter the queries accordingly
field name to order by asc, "query_number" by default (if it does not exist it will return a 500 error)
number of maximum results for each returned query (-1 means no maxResults)
the batch search queries map [(query, nbResults), ...]
Preflight request for document tagging
the project id
document id
returns 200 with PUT
Preflight request for document untagging
the project id
document id
returns 200 with PUT
Hide all named entities with the given normalized mention
current project
normalized mention
returns 200 OK
Gets the list of notes for a project.
the project id
if the user is not granted for the project
Preflight request
returns POST
Get the private key for an existing user. Only available in SERVER mode.
user identifier
returns the hashed key JSON
Creates a new private key and saves its SHA384 hash into database for current user. Only available in SERVER mode.
user identifier
returns the api key JSON
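As an illustration of storing only a hash of the key, the Python sketch below generates a random key and computes its SHA-384 digest with the standard library. The key format, the hexadecimal encoding of the hash and the function name `new_api_key` are assumptions made for this example, not a description of Datashare's implementation.

```python
import hashlib
import secrets


def new_api_key():
    """Return a (clear key, SHA-384 hex digest) pair."""
    # Hypothetical key format: a random URL-safe token.
    key = secrets.token_urlsafe(32)
    # Only this digest would be persisted; the clear key is shown to the user once.
    hashed = hashlib.sha384(key.encode("utf-8")).hexdigest()
    return key, hashed
```

A SHA-384 digest written in hexadecimal is always 96 characters long.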
Create the index for the current user if it doesn't exist.
index to create
returns 200 if the index already exists
Search GET request to Elasticsearch. As it is a GET method, all paths are accepted.
If a body is provided, the body will be sent to ES as source=urlencoded(body)&source_content_type=application%2Fjson. In that case, request parameters are not taken into account.
elasticsearch path
returns 200
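The query string mentioned above, in which a request body is carried as a URL-encoded `source` parameter, can be built as in this Python sketch. The helper name `es_source_query_string` is hypothetical, and the sketch only shows the encoding step, not how Datashare forwards the request.

```python
import json
from urllib.parse import urlencode


def es_source_query_string(body):
    """Encode a JSON body as source=...&source_content_type=application%2Fjson."""
    return urlencode({
        "source": json.dumps(body),
        "source_content_type": "application/json",
    })
```

Calling it with a body such as `{"query": {"match_all": {}}}` yields a string that starts with `source=` and ends with `source_content_type=application%2Fjson`.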
Creates a project
if project and index have been created
Cleans a specific task.
name of the task to delete
returns 200 if the task is removed
Cancels the task with the given name.
name of the task to cancel
returns 200 with the cancellation status (true/false)
Retrieves the set of recommended documents for the given project id and a list of users
comma separated users
default response
Retrieves the list of starred documents for a given project.
the project id
default response
Retrieves the list of tagged documents for a given project id, filtered by a given comma-separated list of tags.
the project id
comma separated tags
default response
Download (if necessary) and install the extension specified by its id or url. The request parameter id or url must be present.
id of the extension
url of the extension
returns 200 if the extension is installed
Download (if necessary) and install the plugin specified by its id or url. The request parameter id or url must be present.
id of the plugin
url of the plugin
returns 200 if the plugin is installed
Delete user history by type.
type of user history event
Returns 204 (No Content) : idempotent
Gets all user's document recommendations.
if not provided it starts from 0
if not provided, the first 50 records from the "from" parameter are returned
if not provided, returns all recommendations for every project
returns the user's document recommendations
Gets task result with its id
task id
returns 200 and the result
Gets all users who recommended a document with the count of all recommended documents for project and documents ids.
comma separated document ids
default response
Indexes files in a directory (with docker, it is the mounted directory that is scanned).
wrapper for options json
returns 200 and the list of tasks created
Download files from a search query.
Expected parameters are:
If the query is a string, it is taken as an ES query string; otherwise it is a raw JSON query (without the query part). See org.elasticsearch.index.query.WrapperQueryBuilder, which is used to wrap the query
the json used to wrap the query
returns 200 and the json task id
Creates a new batch search based on a previous one given its id, and enqueue it for running
source batch id
batch parameters
returns the id of the created batch search
Retrieves the list of users who recommended a document with the total count of recommended documents for the given project id
default response
Add or update an event to user's history. The event's type, the project ids and the uri are passed in the request body.
To update the event's name, the eventId is required to retrieve the corresponding event. The project list related to the event is stored in database but is never queried (no filters on project).
user history query to save
returns 200 when event is added or updated.
Gets tags by document id
the project id
document id
default response
Creates a new batch search. This is a multipart form with 9 fields:
name, description, csvFile, published, fileTypes, paths, fuzziness, phrase_matches, query_template.
Queries with less than two characters are filtered.
To make a request manually, you can create a file like:
--BOUNDARY
Content-Disposition: form-data; name="name"
my batch search
--BOUNDARY
Content-Disposition: form-data; name="description"
search description
--BOUNDARY
Content-Disposition: form-data; name="csvFile"; filename="search.csv"
Content-Type: text/csv
Obama
skype
test
query three
--BOUNDARY
Content-Disposition: form-data; name="published"
true
--BOUNDARY--
Then curl with
curl -i -XPOST localhost:8080/api/batch/search/prj1,prj2 -H 'Content-Type: multipart/form-data; boundary=BOUNDARY' --data-binary @/home/dev/multipart.txt
You may have to replace \n with \r\n, for example with sed -i 's/$/^M/g' ~/multipart.txt (here ^M stands for a literal carriage return character).
Comma-separated list of projects
multipart form
if either name or CSV file is missing
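The multipart file described above can also be generated programmatically, which avoids the manual line-ending conversion. The Python sketch below is only one possible way to do it: the helper name `build_multipart` is hypothetical, and it uses the field names listed above (name, published, csvFile). Each part separates its headers from its value with an empty line, and only the last delimiter ends with two dashes.

```python
def build_multipart(fields, csv_name, csv_lines, boundary="BOUNDARY"):
    """Build a multipart/form-data body with CRLF line endings, as bytes."""
    crlf = "\r\n"
    parts = []
    # Simple text fields such as name, description or published.
    for name, value in fields.items():
        parts.append(
            f"--{boundary}{crlf}"
            f'Content-Disposition: form-data; name="{name}"{crlf}{crlf}'
            f"{value}{crlf}"
        )
    # The CSV file part, one query per line.
    parts.append(
        f"--{boundary}{crlf}"
        f'Content-Disposition: form-data; name="csvFile"; filename="{csv_name}"{crlf}'
        f"Content-Type: text/csv{crlf}{crlf}"
        + crlf.join(csv_lines) + crlf
    )
    # Closing delimiter: the only one that ends with "--".
    parts.append(f"--{boundary}--{crlf}")
    return "".join(parts).encode("utf-8")
```

The resulting bytes could be written to a file and sent with the curl command shown above, using the same boundary value in the Content-Type header.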
Gets the extension set in JSON. If a request parameter "filter" is provided, the regular expression will be applied to the list.
See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for pattern syntax.
regular expression to apply
returns the extensions set
Gets the plugins set in JSON.
If a request parameter "filter" is provided, the regular expression will be applied to the list.
See Pattern for pattern syntax.
regular expression to apply
returns the plugins set
Retrieve the status of databus connection, database connection and index.
if provided in the URL it will return the status in openmetrics format
returns the status of datashare elements
Gets all the user tasks.
Filters can be added with name=value. For example, if name=foo is given in the request URL query, the tasks containing the term "foo" are returned. Filters can also contain dotted keys. For example, if args.dataDir=bar is provided, tasks with "dataDir" containing "bar" are selected.
pattern contained in the task name
returns the list of tasks
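The filter semantics described above, including dotted keys, can be pictured with the following Python sketch. It is an illustration only: the task dictionaries, the substring comparison and the function names `lookup` and `filter_tasks` are assumptions, not Datashare's server-side code.

```python
def lookup(task, dotted_key):
    """Follow a dotted key such as 'args.dataDir' through nested dictionaries."""
    value = task
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def filter_tasks(tasks, filters):
    """Keep the tasks whose looked-up values contain every filter pattern."""
    def matches(task):
        for key, pattern in filters.items():
            value = lookup(task, key)
            if value is None or pattern not in str(value):
                return False
        return True
    return [task for task in tasks if matches(task)]
```

With a hypothetical task whose `args.dataDir` is `/data/bar`, the filter `{"args.dataDir": "bar"}` keeps that task and drops tasks whose data directory does not contain "bar".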
Indexes files from the queue.
wrapper for options json
returns 200 and the json task id
Lists all files and directories for the given path. This endpoint returns JSON using the same specification as the tree command on UNIX. It is roughly the equivalent of:
tree -L 1 -spJ --noreport /home/datashare/data
directory path in the tree
returns the list of files and directories
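To give an idea of the shape of such a one-level listing, the Python sketch below builds a comparable structure with the standard library. The field names (type, name, prot, size, contents) are assumptions modelled on the JSON output of the tree command, and the function name `list_directory` is hypothetical; the actual response of the endpoint may contain other fields.

```python
import os
import stat


def list_directory(path):
    """Return a one-level listing of a directory, loosely shaped like tree's JSON output."""
    entries = []
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        info = entry.stat(follow_symlinks=False)
        entries.append({
            "type": "directory" if entry.is_dir(follow_symlinks=False) else "file",
            "name": entry.path,
            "prot": stat.filemode(info.st_mode),  # for example "-rw-r--r--"
            "size": info.st_size,                 # size in bytes
        })
    # A single root node holding the direct children only (depth 1).
    return [{"type": "directory", "name": path, "contents": entries}]
```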
Gets one task with its id.
task id
returns the task from its id
When datashare is launched in NER mode (without index) it exposes a name finding HTTP API.
The text is sent with the HTTP body.
pipeline to use
text to analyze in the request body
returns the list of NamedEntities annotations
Retrieves the list of batch searches for the user issuing the request, filtered by the given criteria, along with the total number of batch searches matching the criteria.
If from/size are not given, their default values are 0, meaning that all the results are returned. BatchDate must be a list of 2 items (the first one for the starting date and the second one for the ending date). If defined, publishState is a string equal to "0" or "1".
the json webQuery request body
the list of batch searches with the total batch searches for the query
Cleans all DONE tasks.
returns 200 and the list of removed tasks
Retrieves the list of batch searches
'freetext' search filter. Empty string or '' to select all. Default is ''
specifies field on query filter ('all','author'...). Default is 'all'
list of selected queries in the batch search (to invert selection put 'queriesExcluded' parameter to true)
Associated with 'queries', if true it excludes the listed queries from the results
filters by contentTypes
filters by projects. Empty array corresponds to no projects
filters by date range timestamps with [dateStart, dateEnd]
filters by task status (RUNNING, QUEUED, DONE, FAILED)
filters by published state (0: private to the user, 1: public on the platform)
boolean, if true it includes list of queries
if not provided default is 100
if not provided it starts from 0
default response
Retrieves the batch search with the given id. The query param "withQueries" accepts a boolean value. When "withQueries" is set to false, the list of queries is empty and nbQueries contains the number of queries.
Create a task with JSON body
task id
the task creation body
the task was already existing
Get all user's projects
default response
Get the FtM document from its project and id (content hash)
project identifier
document identifier
returns the JSON document
Retrieves the list of starred documents for all projects for the current user.
default response