👷♀️ This page is currently being written by the Datashare team.
Warning: this requires some technical knowledge.
You can make Datashare follow soft links (symlinks): add the --followSymlinks option when you launch Datashare.
If you're on Mac or Windows, you must change the launch script.
If you're on Linux, you can add the option after the Datashare command.
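As a minimal sketch of the Linux case: the launcher name `datashare` and the `DATASHARE_BIN` variable are assumptions made for illustration, so adjust them to match your installation. Only the `--followSymlinks` option comes from this page.

```shell
#!/bin/sh
# Assumption: the Linux launcher is on PATH as `datashare`.
# DATASHARE_BIN is a hypothetical override for installs that use another name.
DATASHARE_BIN="${DATASHARE_BIN:-datashare}"

launch_following_symlinks() {
  # --followSymlinks is the option described above;
  # any extra arguments are passed through unchanged.
  "$DATASHARE_BIN" --followSymlinks "$@"
}

# launch_following_symlinks    # uncomment to start Datashare
```

Wrapping the call in a function keeps the option in one place if you later need to add other launch arguments.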
Yes, you can remove documents from Datashare, but at the moment this removes all of your documents. You cannot remove only some of them.
Click the pink trash icon on the bottom left of Datashare:
And then click 'Yes':
You can then re-analyze a new corpus.
For advanced users only - if you'd like to do it with the Terminal, here are the instructions:
If you're using Mac: rm -Rf ~/Library/Datashare/index
If you're using Windows: rd /s /q "%APPDATA%"\Datashare\index
If you're using Linux: rm -Rf ~/.local/share/datashare/index
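For Mac and Linux, the two commands above can be combined into one small script. This is only a sketch: the function name `datashare_index_dir` is made up for illustration, and the deletion line is left commented out so that running the script shows the target directory without removing anything.

```shell
#!/bin/sh
# Map a `uname -s` value to the Datashare index directory listed above.
# Windows is left out: it uses `rd` in cmd.exe rather than a POSIX shell.
datashare_index_dir() {
  case "$1" in
    Darwin) printf '%s\n' "$HOME/Library/Datashare/index" ;;
    Linux)  printf '%s\n' "$HOME/.local/share/datashare/index" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Dry run by default: only show which directory would be deleted.
if index_dir="$(datashare_index_dir "$(uname -s)")"; then
  echo "Index directory: $index_dir"
  # rm -Rf "$index_dir"   # uncomment to delete the whole index (all documents)
fi
```

Remember that deleting the index removes all your documents from Datashare, not a subset.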
You need an internet connection to install Datashare.
You also need an internet connection the first time you use any new NLP option to find people, organizations and locations in documents, because the models that find these named entities are downloaded the first time you ask for them. After that, you don't need an internet connection to find named entities.
You don't need an internet connection:
to add documents to Datashare
to find named entities (except for the first time you use an NLP option - see above)
to search and explore documents
to download documents
This allows you to work safely on your documents. No third party should be able to intercept your data and files while you're working offline on your computer.
You can download a document by going to it on Datashare. Click on the download icon on the right of the screen, to the right of the document's title.
If you can't download a document, it means that Datashare did not initialize correctly. Please restart Datashare. If you're an advanced user, you can capture the logs and create an issue on Datashare's GitHub.
Datashare was created with scalability in mind, which gave ICIJ the ability to index terabytes of documents.
To do so, we used a cluster of dozens of EC2 instances on AWS, running on Ubuntu 16.04 and 18.04. We used c4.8xlarge instances (36 CPUs / 60 GB RAM).
The most complex operation is OCR (we use Tesseract), so if your documents don't contain many images, it might be worth deactivating it ("--ocr false").
You can use Datashare with multiple users accessing a centralized database on a server.
Warning: setting up and maintaining the server mode requires some technical knowledge.
Please find the documentation here.
Tarentula is a tool made for advanced users to run bulk actions in Datashare.
Please find all the use cases in Datashare Tarentula's GitHub documentation.
In Datashare, for technical reasons, it is not possible to open the 10,000th document or any document after it.
Example: you search for "Paris" and get 15,634 results. You will be able to see the first 9,999 results but no more. This also happens if you didn't run any search.
As it is not possible to fix this, here are some tips:
Refine your search: use filters to narrow down your results and ensure you have fewer than 10,000 matching documents
Change the sorting of your results: use 'creation date' or 'alphabetical order', for instance, instead of the default sorting, which is by relevance score
Search your query in a batch search: you will get all your results either on the batch search results' page or, by downloading your results, in a spreadsheet. From there, you will be able to open and read all your documents
If you search for "Shakespeare" in the search bar and run a query containing "Shakespeare" in a batch search, the two sets of results can contain slightly different documents.
Why?
For technical reasons, Datashare processes the two queries in different ways:
a. Search bar (a simple search processed in the browser):
The search query is processed in your browser by Datashare's client. It is then sent to Elasticsearch through Datashare's server, which forwards your query.
b. Batch search (several searches processed by the server):
Datashare's server processes each of the batch search's queries.
Each query is sent to Elasticsearch, and the results are saved in a database.
When the batch search is finished, you get the results from Datashare.
Datashare sends back the results stored in the database.
Datashare's team tries to keep both sets of results similar, but slight differences can occur between the two queries.
You can send an email to datashare@icij.org.
When reporting a bug, please share:
your OS (Mac, Windows or Linux) and version
the problem, with screenshots if possible
the actions that led to the problem
Advanced users can post an issue with their logs on Datashare's GitHub: https://github.com/ICIJ/datashare/issues
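To gather the OS details for a bug report on Mac or Linux, something like the following sketch may help. The function name `os_summary` is made up for illustration. Note that `uname -r` prints the kernel release rather than the marketing version of the OS; on Mac, `sw_vers -productVersion` gives the macOS version itself.

```shell
#!/bin/sh
# Print the OS name and kernel release to paste into a bug report.
os_summary() {
  name="$(uname -s)"      # e.g. Linux or Darwin
  release="$(uname -r)"   # kernel release, not the marketing version
  printf 'OS: %s\nKernel release: %s\n' "$name" "$release"
}

os_summary
```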
1. Go to Applications
2. Right-click 'Datashare' and click 'Move to Bin'
Follow the steps here: https://support.microsoft.com/en-us/windows/uninstall-or-remove-apps-and-programs-in-windows-10-4b55f974-2cc6-2d2b-d092-5905080eaf98
Use the following command:
sudo apt remove datashare-dist
This page explains how to run a neo4j instance inside Docker. For any additional information, please refer to the [neo4j documentation](https://neo4j.com/docs/getting-started/).
1. Enrich the services section of the docker-compose.yml of the install with Docker page, with the following neo4j service:
Make sure not to forget the APOC plugin (NEO4J_PLUGINS: '["apoc"]').
2. Enrich the volumes section of the docker-compose.yml of the install with Docker page, with the following neo4j volumes:
3. Start the neo4j service using:
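As a minimal sketch of step 3, assuming the service added in step 1 is named `neo4j` and that Docker Compose v2 is installed (with the older standalone tool, the command would be `docker-compose` instead of `docker compose`), the service could be started like this. The wrapper function `start_neo4j` is made up for illustration.

```shell
#!/bin/sh
# Assumption: the service is called `neo4j` in docker-compose.yml.
# Pass another name as the first argument if yours differs.
start_neo4j() {
  # -d starts the container in the background (detached mode).
  docker compose up -d "${1:-neo4j}"
}

# start_neo4j    # run from the directory that contains docker-compose.yml
```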
install with Neo4j Desktop, follow installation instructions found here
create a new local DBMS and save your password for later
if the installer notifies you of any ports modification, check the DBMS settings and save the server.bolt.listen_address for later
make sure to install the APOC Plugin
Additional options to install neo4j are listed here.