Can I use Datashare with no internet connection?

You need an internet connection to install Datashare.

You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models that find these named entities are downloaded the first time you ask for them. After that, you no longer need an internet connection to find named entities.

You don't need an internet connection to:

  • Add documents to Datashare

  • Find named entities (except for the first time you use an NLP option - see above)

  • Search and explore documents

  • Download documents

This allows you to work safely on your documents. No third party should be able to intercept your data and files while you're working offline on your computer.

Can I use an external drive as data source?

Warning: this requires some technical knowledge.

You can make Datashare follow soft links (symbolic links): add --followSymlinks when Datashare is launched.

If you're on Mac or Windows, you must change the launch script.

If you're on Linux, you can add the option after the Datashare command.
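For readers unsure what a soft link is, here is a minimal Python sketch that creates one from a data folder to a folder on an external drive. The function name and all paths are hypothetical, chosen only for illustration; the folder Datashare actually reads depends on your installation. On Windows, creating symbolic links may require elevated privileges.

```python
import os


def link_external_folder(external_dir: str, data_dir: str, link_name: str) -> str:
    """Create a soft link named link_name inside data_dir pointing to external_dir.

    Both directories are assumptions: data_dir stands for the folder that
    Datashare scans, external_dir for a folder on the external drive.
    """
    link_path = os.path.join(data_dir, link_name)
    # target_is_directory only matters on Windows; it is ignored elsewhere.
    os.symlink(external_dir, link_path, target_is_directory=True)
    return link_path
```

With --followSymlinks set, the documents stored behind such a link should then be reachable from the data folder.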

Advanced: how can I do bulk actions with Tarentula?

Tarentula is a tool made for advanced users to run bulk actions in Datashare, like:

  • Clean Tags by Query

  • Download

  • Export by Query

  • Tagging (CSV formats)

  • Tagging by Query

Please find all the use cases in Datashare Tarentula's GitHub documentation.

Definitions

👷‍♀️ This page is currently being written by Datashare team.

What is an entity?

An entity in Datashare is the name of a person, an organization or a location, or an email address.

Datashare’s Named Entity Recognition (NER) uses pipelines of Natural Language Processing (NLP), a branch of artificial intelligence, to automatically detect entities in your documents.

You can filter documents by their entities and see all the entities mentioned in a document.

What are NLP pipelines?

Pipelines of Natural Language Processing are tools that automatically identify entities in your documents. You can only choose one model at a time for one entity detection task.

Open the menu > 'Tasks' > 'Entities' and follow these instructions. Select 'CoreNLP' if you want to use the model with the highest probability of working on most documents.

What is fuzziness?

As a search operator

In the main search bar, you can write a query with the tilde search operator (~) followed by a number at the end of each word of your query. You can set fuzziness to 1 or 2. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).

Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
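To make the notion of edit distance concrete, here is a minimal Python sketch that counts insertions, deletions, substitutions and transpositions of adjacent characters between two terms. It is only an illustration of the definition above, not the code that the search engine runs; the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character operations needed to turn a into b.

    Allowed operations: insertion, deletion, substitution and
    transposition of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For instance, edit_distance("kitten", "sitten") gives 1 and edit_distance("kitten", "sittin") gives 2, matching the examples above.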

In batch searches

When you run a batch search, you can set the fuzziness to 0, 1 or 2. As explained above, it applies to each word in a query and corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), use fuzziness. Use the tilde symbol at the end of the word to set the fuzziness to 1 or 2.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).

Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
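Since the batch search fuzziness applies to each word of a query, its effect can be pictured as appending the tilde operator to every word. The Python sketch below builds that equivalent query string. It is an illustration only: the function name is hypothetical, how Datashare applies the setting internally is not described here, and the sketch ignores operators and quoted phrases.

```python
def with_fuzziness(query: str, fuzziness: int) -> str:
    """Append ~fuzziness to each whitespace-separated word of query."""
    if fuzziness not in (0, 1, 2):
        raise ValueError("fuzziness must be 0, 1 or 2")
    if fuzziness == 0:
        return query  # 0 means exact terms, so the query is left unchanged
    return " ".join(f"{word}~{fuzziness}" for word in query.split())
```

For example, with_fuzziness("quikc brwn foks", 1) returns "quikc~1 brwn~1 foks~1".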


What are proximity searches?

As a search operator

In the main search bar, you can write an exact query in double quotes with the search operator tilde (~) with a number, at the end of your query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

Examples:

the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)

the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)

"While a phrase query (eg "john smith") expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order. A proximity search allows us to specify a maximum edit distance of words in a phrase." (source: Elastic).

Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")

The closer the text in a field is to the original order specified in the query string, the more relevant that document is considered to be. When compared to the above example query, the phrase "quick fox" would be considered more relevant than "quick brown fox" (source: Elastic).

In batch searches

When you run a batch search, if you turn 'Do phrase matches' on, you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

the cat is blue -> the small cat is blue (1 insertion = fuzziness is 1)

the cat is blue -> the small is cat blue (1 insertion + 2 transpositions = fuzziness is 3)

Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")


List of common errors leading to "failure" in Batch Searches

SearchException: query='AND ada'

One or several of your queries contains syntax errors.

It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators: read the list of syntax errors by clicking here.

You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors. Check how to create a batch search.

Datashare stops at the first syntax error and reports only that error. You might need to check all your queries, as some errors can remain after you correct the first one.

Example of a syntax error message:

SearchException: query='AND ada' message='org.icij.datashare.batch.SearchException: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/local-datashare/doc/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&scroll=60000ms&search_type=query_then_fetch&batched_reduce_size=512], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"local-datashare","node":"_jPzt7JtSm6IgUqrtxNsjw","reason":{"type":"query_shard_exception","reason":"Failed to parse query [AND ada]","index_uuid":"pDkhK33BQGOEL59-4cw0KA","index":"local-datashare","caused_by":{"type":"parse_exception","reason":"Cannot parse 'AND ada': Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n ","caused_by":{"type":"parse_exception","reason":"Encountered " <AND> "AND "" at line 1, column 0.\nWas expecting one of:\n <NOT> ...\n "+" ...\n "-" ...\n <BAREOPER> ...\n "(" ...\n "*" ...\n <QUOTED> ...\n <TERM> ...\n <PREFIXTERM> ...\n <WILDTERM> ...\n <REGEXPTERM> ...\n "[" ...\n "{" ...\n <NUMBER> ...\n <TERM> ...\n "}}}}]},"status":400}'
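Messages like the one above are long, but they begin with the query that failed. As a small convenience, the following Python sketch pulls that query out of a message, assuming the message starts with the "SearchException: query='...' message=" prefix shown in the examples on this page. The function name is hypothetical.

```python
import re
from typing import Optional


def failed_query(message: str) -> Optional[str]:
    """Return the query quoted at the start of a SearchException message.

    Returns None when the message does not contain the expected prefix.
    """
    match = re.search(r"SearchException: query='(.*?)' message=", message)
    return match.group(1) if match else None
```

Applied to the message above, it returns "AND ada", which is the line to look for and correct in your CSV.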

elasticsearch: Name does not resolve

If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.

In that case, you need to restart Datashare: check how for Mac, Windows or Linux.

Example of a message regarding a problem with Elasticsearch:

SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
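'Name does not resolve' means that a host name could not be translated into an address. For advanced users, the Python sketch below performs the same kind of lookup. The function name is hypothetical, and the check is only meaningful when run from the environment in which Datashare itself runs (an assumption; for example, inside the same Docker network), since the 'elasticsearch' host name seen in the error messages on this page may not exist elsewhere.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if hostname can be translated into a network address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name lookup fails
        return False
```

Calling can_resolve("elasticsearch") from that environment tells you whether the name lookup itself is the problem. Restarting Datashare remains the recommended fix.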


What if tasks are 'running' but not completing?

You started tasks, and they are running as you can see on 'http://localhost:8080/#/indexing' but they are not completing.

There are two possible causes:

  • If you see a progress of less than 100%, please wait.

  • If the progress is 100%, an error has occurred and the tasks failed to complete, which can have various causes. If you're an advanced user, you can create an issue on Datashare's GitHub with the application logs.

Do you recommend OS or machines for large corpuses?

Datashare was created with scalability in mind, which gave ICIJ the ability to index terabytes of documents.

To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).

The most complex operation is OCR (we use Apache Tesseract), so if your documents don't contain many images, it might be worth deactivating it (--ocr false).

General

👷‍♀️ This page is currently being written by Datashare team.

Can I download a document from Datashare?

Yes, you can download a document from Datashare.

Download a document

Open the menu > 'Search' > 'Documents' and click on the download icon on the right of documents' cards:

...or on the top right of an opened document:

Batch download documents

You can also batch download all the documents that match a search. A batch download is limited to 100 MB.

Open the menu > 'Search' > 'Documents', make queries and apply filters. Once all the results of a specific search are relevant to you, click on the download icon on the right of results:

Find your batch downloads as zip files in the menu > 'Tasks' > 'Batch downloads':

Click on a batch download's name to download it:

Can't download?

If you can't download a document, it means that:

  • either Datashare has been badly initialized. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's GitHub.

  • or you are using the server collaborative mode and the admins prevented users from downloading documents


How can we use Datashare in collaborative mode on a server?

You can use Datashare with multiple users accessing a centralized database on a server.

Warning: setting up the server mode and maintaining it requires some technical knowledge.

Please find the documentation here.

Can I remove document(s) from Datashare?

In local mode, you cannot remove a single document or a selection of documents from Datashare. But you can remove all your projects and documents from Datashare.

Open the menu and on the bottom of the menu, click the trash icon:

A confirmation window opens. The action cannot be undone. It removes all the projects and their documents from Datashare. Click 'Yes' if you are sure:

For advanced users - if you'd like to do it with the Terminal, here are the instructions:

  • If you're using Mac: rm -Rf ~/Library/Datashare/index

  • If you're using Windows: rd /s /q "%APPDATA%\Datashare\index"

  • If you're using Linux: rm -Rf ~/.local/share/datashare/index
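The three commands above differ only in where the index folder lives. The Python sketch below returns that folder for a given operating system without deleting anything, which can be useful for checking the path before running a removal command. The function name and its parameters are hypothetical; the folder locations are the ones listed above.

```python
import ntpath
import posixpath
from typing import Optional


def datashare_index_dir(system: str, home: str, appdata: Optional[str] = None) -> str:
    """Return the index folder used in the removal commands above.

    system is the value returned by platform.system():
    "Darwin" (Mac), "Windows" or "Linux".
    """
    if system == "Darwin":
        return posixpath.join(home, "Library", "Datashare", "index")
    if system == "Linux":
        return posixpath.join(home, ".local", "share", "datashare", "index")
    if system == "Windows":
        if appdata is None:
            raise ValueError("appdata (the value of %APPDATA%) is required on Windows")
        return ntpath.join(appdata, "Datashare", "index")
    raise ValueError(f"unsupported system: {system}")
```

On your own machine you could call it with platform.system(), your home folder and, on Windows, the value of the APPDATA environment variable.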

How can I contact ICIJ for help, bug reporting or suggestions?

You can send an email to datashare@icij.org.

When reporting a bug, please share:

  • Your OS (Mac, Windows or Linux) and version

  • The problem, with screenshots

  • The actions that led to the problem

Or you can post an issue with your logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues

How can I uninstall Datashare?

Mac

1. Go to Applications

2. Right-click 'Datashare' and click 'Move to Bin'

Windows

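Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98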

Linux

Use the following command:

sudo apt remove datashare-dist

Why can results from a simple search and a batch search be slightly different?

If you search "Shakespeare" in the search bar and if you run a query containing "Shakespeare" in a batch search, you can get slightly different documents between the two results.

Why?

For technical reasons, Datashare processes the two queries in two different ways:

a. Search bar (a simple search processed in the browser):

The search query sent to Elasticsearch is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare server which forwards your query.

b. Batch search (several searches processed by the server):

  1. Datashare's server processes each of the batch search's queries

  2. Each query is sent to Elasticsearch. The results are saved into a database

  3. When the batch search is finished, you get the results from Datashare

Datashare's team tries to keep both results similar, but slight differences can occur between the two queries.

What should I do if I get more than 10,000 results?

In Datashare, for technical reasons, it is not possible to open the 10,000th result of a search or any result after it.

Example: you search for "Paris" and get 15,634 results. You'd be able to see the first 9,999 results but no more. This also applies when you browse your documents without running any search.

As it is not possible to fix this, here are some tips:

  • Refine your search: use filters to narrow down your results and ensure you have less than 10,000 matching documents

  • Change the sorting of your results: use 'creation date' or 'alphabetical order' for instance, instead of the sorting by default which corresponds to a relevance scoring

  • Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents


What if the 'View' of my documents is 'not available'?

Datashare can display 'View' for some file types only: images, PDF, CSV, xlsx and tiff. Other document types are not supported yet.

FAQ

👷‍♀️ This page is currently being written by Datashare team.

Common errors

👷‍♀️ This page is currently being written by Datashare team.

I see entities in the filters but not in the documents

Datashare's filters keep the entities (people, organizations, locations, e-mail addresses) previously found.

"Old" named entities can remain in the filter of Datashare, even though the documents that contained them were removed from your Datashare folder on your computer later.

In the future, removing the documents from Datashare before indexing new ones will remove the entities of these documents too. They won't appear in the people, organizations or locations' filters anymore. To do so, you can follow these instructions.

'We were unable to perform your search.' What should I do?

This can be due to some syntax errors in the way you wrote your query.

Here are the most common errors that you should correct:

The query starts with AND

You cannot start a query with AND all uppercase. AND is reserved as a search operator.

The query starts with OR

You cannot start a query with OR all uppercase. OR is reserved as a search operator.

The query contains only one double-quote: "

You cannot start or type a query with only one double quote. Double quotes are reserved as a search operator for exact phrase.

The query contains only one parenthesis: ( or )

You cannot start or type a query with only one parenthesis. Parentheses are reserved for combining operators.

The query contains only one forward slash: /

You cannot start or type a query with only one forward slash. Forward slashes are reserved for regular expressions (Regex).

The query starts with or contains tilde: ~

You cannot start a query with tilde (~) or write one which contains tilde. Tilde is reserved as a search operator for fuzziness or proximity searches.

The query ends with exclamation mark: !

You cannot end a query with an exclamation mark (!). The exclamation mark is reserved as a search operator for excluding a term.

The query starts with or contains caret: ^

You cannot start a query with caret (^) or write one which contains caret. Caret is reserved as a boosting operator.

The query contains square brackets: [ or ]

You cannot use square brackets except for searching for ranges.
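The errors listed above can be spotted before you run a search, for example when preparing queries for a batch search. Here is a minimal Python sketch that checks one query against these rules. It is a rough approximation written for this page, not Datashare's or Elasticsearch's actual parser: the function name and messages are hypothetical, and it assumes that a tilde or caret is valid only when used as an operator (for example quikc~1, "fox quick"~5 or ikea^2).

```python
import re
from typing import List


def find_syntax_problems(query: str) -> List[str]:
    """Return the common syntax errors found in query (empty list if none)."""
    problems = []
    q = query.strip()
    if re.match(r"(AND|OR)\b", q):
        problems.append("starts with AND or OR")
    if q.count('"') % 2 == 1:
        problems.append("unpaired double quote")
    if q.count("(") != q.count(")"):
        problems.append("unpaired parenthesis")
    if q.count("/") % 2 == 1:
        problems.append("unpaired forward slash")
    # A tilde is accepted only at the end of a word or phrase, optionally
    # followed by a number (fuzziness or proximity operator).
    if q.startswith("~") or re.search(r"~(?!\d*(\s|$))", q):
        problems.append("misplaced tilde")
    if q.endswith("!"):
        problems.append("ends with exclamation mark")
    # A caret is accepted only when followed by a number (boosting operator).
    if q.startswith("^") or re.search(r"\^(?!\d)", q):
        problems.append("misplaced caret")
    # Square brackets are accepted only around a range such as [a TO b].
    without_ranges = re.sub(r"\[[^\[\]]* TO [^\[\]]*\]", "", q)
    if "[" in without_ranges or "]" in without_ranges:
        problems.append("square brackets outside a range")
    return problems
```

For example, find_syntax_problems("AND ikea") returns ["starts with AND or OR"], while find_syntax_problems("quikc~1 brwn~") returns an empty list.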

How to run Neo4j?

This page explains how to run a neo4j instance inside Docker. For any additional information, please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/).

Run Neo4j inside Docker

1. Enrich the services section of the docker-compose.yml of the install with Docker page with the neo4j service listed at the end of this document. Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').

2. Enrich the volumes section of the docker-compose.yml of the install with Docker page with the neo4j volumes listed at the end of this document.

3. Start the neo4j service using: docker compose up -d neo4j

Run Neo4j Desktop

  1. Install Neo4j Desktop, following the installation instructions found here.

  2. Create a new local DBMS and save your password for later.

  3. If the installer notifies you of any port modification, check the DBMS settings and save the server.bolt.listen_address for later.

  4. Make sure to install the APOC plugin.

Additional options

Additional options to install neo4j are listed here.
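Whichever installation option you choose, you may want to confirm that neo4j is reachable before going further. The Python sketch below simply tests whether a TCP port accepts connections. The function name is hypothetical, and ports 7474 and 7687 are the ones published in the Docker service definition on this page; a Neo4j Desktop installation may use different ports (see server.bolt.listen_address).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out or host unreachable
        return False
```

With the Docker setup, port_is_open("localhost", 7474) and port_is_open("localhost", 7687) should both return True once the service has started. This only shows that something is listening, not that neo4j is fully configured.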


What do I do if Datashare opens a blank screen in my browser?

If Datashare opens a blank screen in your browser, it may be for various reasons. If it does:

  1. First wait 30 seconds and reload the page.

  2. If the screen remains blank, restart Datashare following instructions for Mac, Windows or Linux.

  3. If you still see a blank screen, please uninstall and reinstall Datashare

To uninstall Datashare:

On Mac, go to 'Applications' and drag the Datashare icon to your dock's 'Trash' or right-click on the Datashare icon and click on 'Move to Trash'.

On Windows, please follow these steps.

On Linux, please delete the 3 containers (Datashare, Redis and Elasticsearch) and the script.

To reinstall Datashare, see 'Install Datashare' for Mac, Windows or Linux.

Datashare doesn't open. What should I do?

It can be due to previously installed extensions. The tech team is fixing the issue. In the meantime, you need to remove them. To do so, open your Terminal, then copy and paste the text below:

  • On Mac:

rm -rf ~/Library/datashare/plugins ~/Library/datashare/extensions

  • On Linux:

rm -rf ~/.local/share/datashare/plugins ~/.local/share/datashare/extensions

  • On Windows:

del /S %APPDATA%\Datashare\Extensions %APPDATA%\Datashare\Plugins

Press Enter. Open Datashare again.

What if Datashare says 'No documents found'?

  • If you were able to see documents during your current session, you might have active filters that prevent Datashare from displaying documents, as no document might correspond to your current search. You can check in your URL if you see active filters and if you're comfortable with the possibility of losing your previously selected filters, open the menu > 'Search' > 'Documents', open the search breadcrumb on the left of the search bar, click 'Clear filters'.

  • You may not have added documents to Datashare yet. Check how to add documents for Mac, Windows or Linux.

  • In 'Tasks' > 'Documents', in the Progress column, if some tasks are not marked as 'Done', please wait for all tasks to be done. Depending on the number of documents you added, it can take multiple hours.

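Neo4j service to add to the services section of docker-compose.yml (see 'How to run Neo4j?', step 1):

...
services:
  neo4j:
    image: neo4j:5-community
    environment:
      NEO4J_AUTH: none
      NEO4J_PLUGINS: '["apoc"]'
    ports:
      - 7474:7474
      - 7687:7687
    volumes:
      - neo4j_conf:/var/lib/neo4j/conf
      - neo4j_data:/var/lib/neo4j/data

Neo4j volumes to add to the volumes section of docker-compose.yml (step 2):

volumes:
  ...
  neo4j_data:
    driver: local
  neo4j_conf:
    driver: local

Command to start the neo4j service (step 3):

docker compose up -d neo4j

Windows command to remove previously installed plugins and extensions (see 'Datashare doesn't open. What should I do?'):

del /S %APPDATA%\Datashare\Extensions %APPDATA%\Datashare\Plugins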