Add entities from the CLI

This document assumes that you have installed Datashare in server mode within Docker and have already added documents to Datashare.

In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users and offer admins additional tools.

This is likely to change in the near future, but in the meantime, you can extract named entities using the command-line interface.

Datashare can detect email addresses and the names of people, organizations and locations. This process uses a Natural Language Processing (NLP) pipeline called CORENLP. Once your documents have been indexed in Datashare, you can run named entity extraction in the same fashion as the previous CLI stages:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP

What's happening here:

Datashare starts in "CLI" mode
We ask to process the NLP stage
We tell Datashare to use the elasticsearch service
Datashare will pull documents from Elasticsearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
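
If you run this stage often, the options above can be gathered into a small wrapper. The following is only a sketch: the function name run_nlp_stage and the variables PROJECT, ES_URL, NLP_PARALLELISM and DRY_RUN are hypothetical names chosen for illustration and are not part of Datashare. It passes exactly the flags shown in the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the NLP stage command shown above.
# PROJECT, ES_URL, NLP_PARALLELISM and DRY_RUN are illustrative names,
# not Datashare settings; the defaults mirror the example command.
PROJECT="${PROJECT:-secret-project}"
ES_URL="${ES_URL:-http://elasticsearch:9200}"
NLP_PARALLELISM="${NLP_PARALLELISM:-2}"

run_nlp_stage() {
  # First argument: comma-separated list of stages (defaults to NLP).
  stages="${1:-NLP}"

  # Rebuild the positional parameters as the full command line.
  set -- docker compose exec datashare_web /entrypoint.sh \
    --mode CLI \
    --stage "$stages" \
    --defaultProject "$PROJECT" \
    --elasticsearchAddress "$ES_URL" \
    --nlpParallelism "$NLP_PARALLELISM" \
    --nlpp CORENLP

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

After sourcing the file, calling run_nlp_stage with DRY_RUN=1 set prints the command so you can check it before running it for real.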

Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains the IDs of all the documents to be analyzed.
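
Before starting the NLP stage, it can be useful to check that the queue actually contains document IDs. The sketch below rests on two assumptions that you should verify against your own setup: that the Redis service in your Docker Compose file is named redis, and that the queue is stored as a Redis list (the Redis TYPE command will tell you). The function name nlp_queue_length and the variables REDIS_CLI and QUEUE are hypothetical.

```shell
#!/bin/sh
# Assumption: the Compose service running Redis is called "redis".
# Override REDIS_CLI if your service has a different name.
REDIS_CLI="${REDIS_CLI:-docker compose exec -T redis redis-cli}"
# Default queue name taken from the paragraph above.
QUEUE="${QUEUE:-extract:queue:nlp}"

nlp_queue_length() {
  # Assumption: the queue is a Redis list, so LLEN returns its length.
  # $REDIS_CLI is left unquoted on purpose so it splits into words.
  $REDIS_CLI LLEN "$QUEUE"
}
```

A length of 0 would suggest that the INDEX stage has not filled the queue, or that the queue has already been consumed.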

The first time you run this command, you will have to wait a little while because Datashare needs to download CORENLP's models, which can be large.

You can also chain the 3 stages together:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP \
  --dataDir /home/datashare/Datashare/

As with the previous stages, you may want to restore the output queue from the INDEX stage. You can do so with:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage ENQUEUEIDX,NLP \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --nlpParallelism 2 \
  --nlpp CORENLP

The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the IDs of those documents into the extract:queue:nlp queue.

