# CLI stages

The CLI stages are primarily intended for instances of Datashare that use non-embedded resources (Elasticsearch, database, key/value memory store). Running the stages against shared resources lets you distribute heavy tasks between servers. See [Embedded mode](/datashare/local-mode/about-the-local-mode/embedded-mode.md) to learn more about the default configuration.

## 1. SCAN

This is the first step to add documents to Datashare from the command line. The SCAN stage queues all the files that need to be indexed by the next stage (INDEX). Once the scan has finished, you can move on to indexing. This stage cannot be distributed.

```bash
# Run the SCAN stage:
#   --dataDir       where the documents are located
#   --busType       store the queued files in Redis
#   --redisAddress  URI of Redis
datashare stage run \
  --stages SCAN \
  --dataDir /path/to/documents \
  --busType REDIS \
  --redisAddress redis://redis:6379
```

## 2. INDEX

The INDEX stage is probably the most important (and the heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of [Apache Tika](https://tika.apache.org) and [Tesseract](https://tesseract-ocr.github.io/) to extract text and metadata and to OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that a given queued value can be pulled by only one client. This allows users to distribute this command across several servers.

```bash
# Run the INDEX stage:
#   --dataDir               where the documents are located
#   --busType               the queue of files is held in Redis
#   --redisAddress          URI of Redis
#   --elasticsearchAddress  URI of Elasticsearch
#   --ocr                   enable OCR
datashare stage run \
  --stages INDEX \
  --dataDir /path/to/documents \
  --busType REDIS \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200 \
  --ocr true
```
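Because each queued document goes to a single client, the same INDEX command can be started more than once against the same Redis queue. The following is a minimal sketch of that idea, not an official launcher: it starts several workers from one shell and waits for them. `DRY_RUN`, `WORKERS` and `start_worker` are hypothetical helpers introduced for illustration, and the sketch assumes your setup allows several Datashare processes on one host. On separate servers you would run the single command above on each machine instead.

```bash
#!/usr/bin/env bash
# Sketch only: DRY_RUN and WORKERS are hypothetical helper variables,
# not Datashare options. With DRY_RUN=1 (the default) nothing is executed.
DRY_RUN="${DRY_RUN:-1}"
WORKERS="${WORKERS:-2}"

# Same flags as the INDEX command above, kept in an array.
index_cmd=(
  datashare stage run
  --stages INDEX
  --dataDir /path/to/documents
  --busType REDIS
  --redisAddress redis://redis:6379
  --elasticsearchAddress http://elasticsearch:9200
  --ocr true
)

start_worker() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    echo "worker $1: ${index_cmd[*]}"
  else
    "${index_cmd[@]}"
  fi
}

# Start every worker in the background, then wait for all of them.
for ((n = 1; n <= WORKERS; n++)); do
  start_worker "$n" &
done
wait
```

Run it as is to preview the commands, then set `DRY_RUN=0` to actually start the workers.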

## 3. NLP

Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from its text. This process not only creates named entity mentions in Elasticsearch, it also marks every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized across several servers.

```bash
# Run the NLP stage:
#   --nlpPipeline           use CORENLP to detect named entities
#   --elasticsearchAddress  URI of Elasticsearch
datashare stage run \
  --stages NLP \
  --nlpPipeline CORENLP \
  --elasticsearchAddress http://elasticsearch:9200
```
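The three stages are meant to run in order: SCAN fills the queue, INDEX consumes it, and NLP works on documents already stored in Elasticsearch. As a rough sketch of that sequencing, the script below chains the commands from this page and stops at the first failure. `run_stage` and `DRY_RUN` are hypothetical helpers introduced for illustration, not part of Datashare; by default the script only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch only: run_stage and DRY_RUN are hypothetical helpers, not Datashare features.
DRY_RUN="${DRY_RUN:-1}"

run_stage() {
  local stage="$1"; shift
  local cmd=(datashare stage run --stages "$stage" "$@")
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Flags shared between stages, taken from the examples on this page.
redis=(--busType REDIS --redisAddress redis://redis:6379)
es=(--elasticsearchAddress http://elasticsearch:9200)

# Each stage starts only if the previous one succeeded.
run_stage SCAN  --dataDir /path/to/documents "${redis[@]}" &&
run_stage INDEX --dataDir /path/to/documents "${redis[@]}" "${es[@]}" --ocr true &&
run_stage NLP   --nlpPipeline CORENLP "${es[@]}"
```

Set `DRY_RUN=0` to execute the stages for real once the printed commands match your environment.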

