Datashare is an open source project by the International Consortium of Investigative Journalists

CLI stages

When running Datashare from the command-line, you can pick which "stage" to apply to analyse your documents.

Last updated 1 year ago

The CLI stages are primarily intended to be run for an instance of Datashare that uses non-embedded resources (Elasticsearch, database, key/value memory store). This allows you to distribute heavy tasks between servers.

1. SCAN

This is the first step to add documents to Datashare from the command-line. The SCAN stage allows you to queue all the files that need to be indexed (next step). Once this task is done, you can move to the next step. This stage cannot be distributed.

# --stage SCAN: select the SCAN stage
# --dataDir: where the documents are located
# --dataBusType REDIS: store the queued files in Redis
# --redisAddress: URI of Redis
datashare --mode CLI \
  --stage SCAN \
  --dataDir /path/to/documents \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379

2. INDEX

The INDEX stage is probably the most important (and heaviest!) one. It pulls documents to index from the queue created in the previous step, then uses a combination of Apache Tika and Tesseract to extract text and metadata and to OCR images. The resulting documents are stored in Elasticsearch. The queue used to store documents to index is a "blocking list", meaning that only one client can pull a given value at a time. This allows users to distribute this command across several servers.

# --stage INDEX: select the INDEX stage
# --dataDir: where the documents are located
# --dataBusType REDIS: the queued files are kept in Redis
# --redisAddress: URI of Redis
# --elasticsearchAddress: URI of Elasticsearch
# --ocr true: enable OCR
datashare --mode CLI \
  --stage INDEX \
  --dataDir /path/to/documents \
  --dataBusType REDIS \
  --redisAddress redis://redis:6379 \
  --elasticsearchAddress http://elasticsearch:9200 \
  --ocr true
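
Because the queue is a blocking list, several INDEX processes can consume it at the same time without receiving the same document twice. The following is a minimal sketch of that idea, not an official deployment recipe. It starts a few INDEX workers in the background and waits for them. It assumes that running several workers on one host is an acceptable stand-in for running one per server, which this page does not confirm. `WORKERS`, `DRY_RUN` and the `index_worker` helper are hypothetical names introduced for this sketch; the Datashare flags and addresses are copied from the example above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: start several INDEX workers that consume the same Redis queue.
set -eu

# Values copied from the example above; override them through the environment.
DATA_DIR="${DATA_DIR:-/path/to/documents}"
REDIS_URI="${REDIS_URI:-redis://redis:6379}"
ES_URI="${ES_URI:-http://elasticsearch:9200}"

# WORKERS and DRY_RUN are conventions of this sketch, not Datashare options.
WORKERS="${WORKERS:-2}"
DRY_RUN="${DRY_RUN:-1}"

# Run one INDEX command, or print it when DRY_RUN=1.
index_worker() {
  prefix=""
  if [ "$DRY_RUN" = "1" ]; then
    prefix="echo"
  fi
  $prefix datashare --mode CLI \
    --stage INDEX \
    --dataDir "$DATA_DIR" \
    --dataBusType REDIS \
    --redisAddress "$REDIS_URI" \
    --elasticsearchAddress "$ES_URI" \
    --ocr true
}

# Launch the workers in the background; each one pulls from the shared queue.
i=1
while [ "$i" -le "$WORKERS" ]; do
  index_worker &
  i=$((i + 1))
done

# Wait for every worker. A plain `wait` does not report individual failures.
wait
```

To run it for real, set `DRY_RUN=0` once the SCAN stage has finished. On a multi-server setup, each server would run the same INDEX command against the shared Redis and Elasticsearch addresses.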

3. NLP

Once a document is available for search (stored in Elasticsearch), you can use the NLP stage to extract named entities from the text. This process not only creates named entity mentions in Elasticsearch, it also marks every analyzed document with the corresponding NLP pipeline (CORENLP by default). In other words, the process is idempotent and can also be parallelized across several servers.

# --stage NLP: select the NLP stage
# --nlpp CORENLP: use CORENLP to detect named entities
# --elasticsearchAddress: URI of Elasticsearch
datashare --mode CLI \
  --stage NLP \
  --nlpp CORENLP \
  --elasticsearchAddress http://elasticsearch:9200
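
Since the NLP stage is described as idempotent, one practical consequence is that an interrupted run can simply be started again. The sketch below wraps a command in a small retry loop to illustrate this. It assumes that `datashare` exits with a non-zero status on failure, which this page does not state. The `retry` helper and the `RETRY_DELAY` variable are hypothetical names introduced here, not part of Datashare.

```shell
#!/bin/sh
# Sketch: retry a command a fixed number of times.
set -eu

# retry ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or until ATTEMPTS runs have failed.
retry() {
  attempts="$1"
  shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    # RETRY_DELAY (seconds) is a knob of this sketch, not a Datashare option.
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example, commented out so that the sketch has no side effects:
# retry 3 datashare --mode CLI --stage NLP --nlpp CORENLP \
#   --elasticsearchAddress http://elasticsearch:9200
```

The same wrapper should not be applied blindly to stages that are not described as idempotent on this page.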