This document assumes you have installed Datashare in server mode within Docker and have already added documents to Datashare.
In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users in order to offer admins additional tools.
This is likely to change in the near future, but in the meantime, you can extract named entities using the command-line interface.
Datashare has the ability to detect email addresses and the names of people, organizations and locations. This process uses a Natural Language Processing pipeline called CORENLP. Once your documents have been indexed in Datashare, you can perform named entity extraction in the same fashion as the previous CLI stages:
What's happening here:
Datashare starts in "CLI" mode
We ask to process the NLP stage
We tell Datashare to use the elasticsearch service
Datashare will pull documents from Elasticsearch directly
Up to 2 documents will be analyzed in parallel
Datashare will use the CORENLP pipeline
Datashare will use the output queue from the previous INDEX stage (by default extract:queue:nlp in Redis) that contains all the document ids to be analyzed.
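To make the queue and parallelism settings more concrete, here is a minimal Python sketch of the idea behind the NLP stage. It is not Datashare's implementation: the in-memory `Queue` stands in for the Redis queue, the `documents` dictionary stands in for the Elasticsearch index, and `extract_entities` is a hypothetical toy that only finds email addresses with a regular expression instead of running CORENLP.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

# Toy pattern for email addresses; a stand-in for a real NER pipeline.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_entities(doc_id, documents):
    """Hypothetical stand-in for the pipeline: return the emails found in a document."""
    return EMAIL_PATTERN.findall(documents[doc_id])


def run_nlp_stage(queue, documents, parallelism=2):
    """Drain the queue of document ids and analyze up to `parallelism` documents at once."""
    ids = []
    while True:
        try:
            ids.append(queue.get_nowait())
        except Empty:
            break
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() keeps the results in the same order as the ids.
        results = pool.map(lambda doc_id: (doc_id, extract_entities(doc_id, documents)), ids)
        return dict(results)


# Assumed sample data: two documents and a queue holding their ids.
documents = {
    "doc-1": "Contact alice@example.org today",
    "doc-2": "No address here",
}
ids = Queue()
for doc_id in documents:
    ids.put(doc_id)

results = run_nlp_stage(ids, documents, parallelism=2)
print(results)
```

The worker count plays the role of the "up to 2 documents in parallel" setting: raising it analyzes more documents at once, at the cost of more memory and CPU.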
The first time you run this command, you will have to wait a little because Datashare needs to download CORENLP's models, which can be large.
You can also chain the 3 stages together:
As with the previous stages, you may want to restore the output queue from the INDEX stage. You can do:
The added ENQUEUEIDX stage will read the Elasticsearch index, find all documents that have not already been analyzed by the CORENLP NER pipeline, and put the ids of those documents into the extract:queue:nlp queue.
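The selection logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Datashare's code: the index is modeled as a plain dictionary, the queue as a `deque`, and the `nlpTags` field used to record which pipelines have already run is a hypothetical name that may not match the actual index mapping.

```python
from collections import deque


def enqueue_unprocessed(index, queue, pipeline="CORENLP"):
    """Append to `queue` the id of every document not yet tagged with `pipeline`."""
    added = 0
    for doc_id, doc in index.items():
        # "nlpTags" is an assumed field listing the pipelines that already ran.
        if pipeline not in doc.get("nlpTags", []):
            queue.append(doc_id)
            added += 1
    return added


# Assumed sample index: only doc-1 has already been analyzed by CORENLP.
index = {
    "doc-1": {"nlpTags": ["CORENLP"]},
    "doc-2": {"nlpTags": []},
    "doc-3": {},
}
nlp_queue = deque()
count = enqueue_unprocessed(index, nlp_queue)
print(count, list(nlp_queue))  # prints: 2 ['doc-2', 'doc-3']
```

Because already-analyzed documents are skipped, running the stage again after a partial NLP run only re-queues what is still missing.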