👷♀️ This page is currently being written by Datashare team.
You need an internet connection to install Datashare.
You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models that find these named entities are downloaded the first time you ask for them. After that, you don't need an internet connection to find named entities.
You don't need an internet connection:
to add documents to Datashare
to find named entities (except for the first time you use an NLP option - see above)
to search and explore documents
to download documents
This allows you to work safely on your documents. No third party should be able to intercept your data and files while you're working offline on your computer.
Warning: this requires some technological knowledge.
You can make Datashare follow soft links (symlinks): add the --followSymlinks option when Datashare is launched.
If you're on Mac or Windows, you must change the launch script.
If you're on Linux, you can add the option after the Datashare command.
You can send an email to datashare@icij.org.
When reporting a bug, please share:
your OS (Mac, Windows or Linux) and version
the problem, with screenshots if possible
the actions that led to the problem
Advanced users can post an issue with their logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues
You can use Datashare with multiple users accessing a centralized database on a server.
Warning: setting up and maintaining the server mode requires some technical knowledge.
Please find the documentation here.
Yes, you can remove documents from Datashare. But at the moment, it will remove all your documents. You cannot remove only some documents.
Click the pink trash icon on the bottom left of Datashare:
And then click 'Yes':
You can then re-analyze a new corpus.
For advanced users only - if you'd like to do it with the Terminal, here are the instructions:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
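If you prefer to script this step, the three locations above can be resolved in one place. The Python sketch below only computes the index folder path for each operating system and does not delete anything. The function name `datashare_index_dir` and its parameters are hypothetical; the paths are copied from the commands above.

```python
from pathlib import PurePosixPath, PureWindowsPath


def datashare_index_dir(system, home, appdata=None):
    """Return the index folder used in the removal commands above.

    system  -- "Darwin" (Mac), "Windows" or "Linux", as platform.system() reports
    home    -- the user's home directory (Mac and Linux)
    appdata -- the value of %APPDATA% (Windows only)
    """
    if system == "Darwin":
        return PurePosixPath(home) / "Library" / "Datashare" / "index"
    if system == "Windows":
        if not appdata:
            raise ValueError("appdata is required on Windows")
        return PureWindowsPath(appdata) / "Datashare" / "index"
    if system == "Linux":
        return PurePosixPath(home) / ".local" / "share" / "datashare" / "index"
    raise ValueError(f"unsupported system: {system}")


if __name__ == "__main__":
    # Print the folder for a hypothetical Linux user named "ada".
    print(datashare_index_dir("Linux", "/home/ada"))
```

Printing the path before running any removal command is a cheap way to check that you are about to delete the folder you expect.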
If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.
Why?
For technical reasons, Datashare processes both queries in 2 different ways:
a. Search bar (a simple search processed in the browser):
The search query sent to Elasticsearch is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare server which forwards your query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries
Each query is sent to Elasticsearch. The results are saved into a database
When the batch search is finished, you get the results from Datashare
Datashare sends back the results stored in the database.
Datashare's team tries to keep both results similar, but slight differences can happen between the two queries.
Datashare was created with scalability in mind which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use ) so if your documents don't contain many images, it might be worth deactivating it (--ocr false).
Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:
Please find all the use cases in Datashare Tarentula's .
A named entity in Datashare is the name of an individual, an organization or a location.
Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically highlight named entities in your documents.
1. Go to Applications
2. Right-click 'Datashare' and click 'Move to Bin'
Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98
Use the following command:
sudo apt remove datashare-dist
This can be due to some syntax error(s) in the way you wrote your query.
Here are the most common errors that you should correct:
You cannot start a query with AND all uppercase. AND is reserved as a search operator.
You cannot start a query with OR all uppercase. OR is reserved as a search operator.
You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrase.
You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.
You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).
You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.
You cannot start a query with caret (^) or write one which contains caret. Caret is reserved as a boosting operator.
You cannot use square brackets except for searching for ranges.
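The rules listed above can be checked before you run a query or a batch search. The Python sketch below is only an illustration: the function name `find_query_problems` is hypothetical, it is not part of Datashare, and it covers a simplified reading of the rules (for example, it only flags a tilde or caret at the very start of a query).

```python
def find_query_problems(query: str) -> list[str]:
    """Return a list of likely syntax problems in a search query."""
    q = query.strip()
    problems = []

    # AND and OR in uppercase are operators and cannot open a query.
    first_word = q.split(None, 1)[0] if q else ""
    if first_word in ("AND", "OR"):
        problems.append(f"starts with the operator {first_word}")

    # Quotes, slashes, parentheses and brackets must come in pairs.
    if q.count('"') % 2 == 1:
        problems.append("unpaired double quote")
    if q.count("(") != q.count(")"):
        problems.append("unpaired parenthesis")
    if q.count("/") % 2 == 1:
        problems.append("unpaired forward slash")
    if q.count("[") != q.count("]"):
        problems.append("unpaired square bracket")

    # Operators that cannot open or close a query.
    if q.startswith("~"):
        problems.append("starts with ~")
    if q.startswith("^"):
        problems.append("starts with ^")
    if q.endswith("!"):
        problems.append("ends with !")

    return problems


if __name__ == "__main__":
    for candidate in ["AND ada", '"ada lovelace', "ada lovelace"]:
        print(repr(candidate), find_query_problems(candidate))
```

A check like this reports every problem it finds in every query, so you can review a whole list of queries in one pass.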
This page explains how to run a neo4j instance inside docker. For any additional information please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/).
1. Enrich the services section of the docker-compose.yml of the install with Docker page, with the following neo4j service:
Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').
2. Enrich the volumes section of the docker-compose.yml of the install with Docker page, with the following neo4j volumes:
3. Start the neo4j service using:
install with Neo4j Desktop, follow installation instructions found here
create a new local DBMS and save your password for later
if the installer notifies you of any ports modification, check the DBMS settings and save the server.bolt.listen_address for later
make sure to install the APOC Plugin
Additional options to install neo4j are listed here.
Double quotes need to be straight in Datashare's search bar, not curly.
Straight double quotes: "example"
Curly double quotes: “example” (these are tilted)
This search works because double quotes are straight in the search bar:
This search doesn't work because double quotes are curly in the search bar:
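If you paste queries from a word processor, curly quotes can slip in unnoticed. The Python sketch below shows one way to convert curly double quotes to straight ones before pasting a query into the search bar. The function name `straighten_double_quotes` is hypothetical and this helper is not part of Datashare.

```python
# Map the left and right curly double quotes to the straight double quote.
CURLY_DOUBLE_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def straighten_double_quotes(query: str) -> str:
    """Replace curly double quotes with straight double quotes."""
    return query.translate(CURLY_DOUBLE_QUOTES)


if __name__ == "__main__":
    print(straighten_double_quotes("\u201cexample\u201d"))
```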
One or several of your queries contains syntax errors.
It means that you wrote one or more of your queries the wrong way, with some characters that are reserved as operators.
You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
Example of a syntax error message:
SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
If you see a message that contains 'elasticsearch: Name does not resolve', it means that Datashare cannot reach Elasticsearch, its search engine.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
Example of a message regarding a problem with Elasticsearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
If you are using the Docker version of Datashare (not the standard version) and Datashare crashes, please try restarting Docker Desktop.
On Mac:
Click the Docker Desktop icon on the top menu bar. The following drop-down menu appears:
Click 'Restart'.
As long as the icon's little points move, it means that Docker Desktop is still restarting.
On Windows:
Right-click the Docker Desktop icon (a little whale) on the bottom menu bar.
Click 'Restart'.
Click 'Restart' again.
Wait for Docker Desktop to restart.
On Linux, please execute: sudo service docker restart
In the main search bar, you can write a query with the search operator tilde (~) with a number, at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticsearch documentation).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
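To make the counting of operations concrete, the Python sketch below computes an edit distance that counts insertions, deletions, substitutions and transpositions of adjacent characters (the "optimal string alignment" variant of the Damerau-Levenshtein distance). It is an illustration of the idea only, not Datashare's or Elasticsearch's actual implementation, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between the first i chars of a and the first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for left, right in [("kitten", "sitten"), ("kitten", "sittin"), ("quikc", "quick")]:
        print(left, right, edit_distance(left, right))
```

Under this measure, a term is within fuzziness 1 of your query word when the distance is at most 1, and within fuzziness 2 when it is at most 2.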
When you run a batch search, you can set the fuzziness to 0, 1 or 2. As explained above, it applies to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde (~) at the end of the word to set the fuzziness to 1 or 2.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elasticsearch documentation).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
You can download a document by going to it on Datashare. Click the download icon on the right of the screen, next to the document's title.
If you can't download a document, it means that Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's GitHub.
In Datashare, for technical reasons, it is not possible to open the 10,000th document.
Example: you search for "Paris" and get 15,634 results. You can see the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Use filters to narrow down your results and ensure you have fewer than 10,000 matching documents
Change the sorting: use 'creation date' or 'alphabetical order' for instance, instead of the default sorting, which corresponds to a relevance scoring
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
Examples:
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elasticsearch documentation).
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elasticsearch documentation).
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
Pipelines of Natural Language Processing are tools that automatically identify named entities in your documents. You can only choose one at a time.
Select 'CoreNLP' if you want to use the one with the highest probability of working in most of your documents:
If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, because no document matches your current search. Check your URL for active filters and, if you're comfortable with losing your previously selected filters, click 'Reset filters'.
You may not have added documents to Datashare yet. To add documents, see: 'Add documents to Datashare' for Mac, Windows or Linux.
In 'Analyzed documents', if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you analyzed, it can take multiple hours.
Once these points have stopped moving, either Datashare restarted automatically or you can restart Datashare manually (see '').
When it says 'Docker Desktop is running', either Datashare restarted automatically or you can restart Datashare manually (see '').
Datashare can display 'Preview' for some document types only: images, pdf, csv, xlsx and tiff. Other document types are not supported yet.
You started tasks, and they are running as you can see on 'http://localhost:8080/#/indexing' but they are not completing.
There are two possible causes:
If you see a progress of less than 100%, please wait.
If the progress is 100%, an error has occurred and the tasks failed to complete, which can have various causes. If you're an advanced user, you can create an issue on Datashare's GitHub with the application logs.
Datashare's filters keep the named entities (people, organizations and locations) previously recognized.
"Old" named entities stay in Datashare's filters even after the documents that contained them were removed from your Datashare folder on your computer. In other words, you removed the documents after their named entities were extracted and ran a new analysis, but the named entities stayed in the filters:
In the future, removing the documents from Datashare before indexing new ones will remove the named entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can click the little pink trash icon on the bottom of the left column:
It means that you are on Windows.
Search and open 'Computer management':
Go to 'Local users and groups':
In 'Groups', double-click 'docker-users':
If you are not in 'docker-users', go to 'Users' in the left panel and add yourself to the 'docker-users' group by clicking your user name and then 'Add...':
To fix the issue:
Stop Datashare. If Datashare is running, close the Terminal window (the window that opens when you start Datashare):
Click 'Terminate':
Open your Terminal (or a new window in your Terminal) and copy and paste:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
Press Enter
Index documents again: go to 'Analyse your documents' and click 'Extract text':
It can be due to previously installed extensions. The tech team is . In the meantime, you need to remove them. To do so, you can open your Terminal, copy and paste the text below:
On Mac
On Linux
On Windows
Press Enter. Open Datashare again.
If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:
First wait 30 seconds and reload the page.
If the screen remains blank, restart Datashare following the instructions for Mac, Windows or Linux.
If you still see a blank screen, please uninstall and reinstall Datashare.
To uninstall Datashare:
On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.
On Linux, please delete the 3 containers: Datashare, Redis and Elasticsearch, and the script.
Restart Datashare (here are the instructions for Mac, Windows and Linux)
On Windows, please follow .
To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.
If you use Datashare with Docker (not the standard version) and a dark window called the Terminal displays a phrase beginning with "Windows named pipe error: The system cannot find the file specified", it means that Docker Desktop, one of the 3 components of Datashare, is not working. Relaunching Docker Desktop should solve the problem.
Find Docker Desktop in your Applications or the whale icon on the menu bar of your computer and click 'Restart'.