
Datashare is an open source project by the International Consortium of Investigative Journalists

Add documents from the CLI

Last updated 1 year ago

This document assumes you have installed Datashare in server mode within Docker.

In server mode, it's important to understand that Datashare does not provide an interface to add documents. As there are no built-in roles and permissions in Datashare's data model, we have no way to differentiate users and offer admins additional tools.

This is likely to change in the near future, but in the meantime, you can still add documents to Datashare using the command-line interface.

Here is a simple command to scan a directory and index its files:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

What's happening here:

  • Datashare starts in "CLI" mode

  • We ask to process both the SCAN and INDEX stages at the same time

  • The SCAN stage feeds an in-memory queue with the files to add

  • The INDEX stage pulls files from the queue to add them to Elasticsearch

  • We tell Datashare to use the elasticsearch service

  • Files to add are located in /home/datashare/Datashare/, which is a directory mounted from the host machine
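
If this command is run regularly, it can help to wrap it in a small function. The following is only a sketch: the DATASHARE_PROJECT, DATASHARE_DATA_DIR and DRY_RUN variables are names invented for this example (they are not Datashare settings), and the defaults simply repeat the values used above.

```shell
#!/bin/sh
# Sketch: wrap the one-shot SCAN,INDEX command in a reusable function.
# DATASHARE_PROJECT, DATASHARE_DATA_DIR and DRY_RUN are hypothetical
# variable names chosen for this example, not Datashare options.

scan_and_index() {
  project="${DATASHARE_PROJECT:-secret-project}"
  data_dir="${DATASHARE_DATA_DIR:-/home/datashare/Datashare/}"

  # Same flags as the command above, with project and directory parameterized.
  set -- docker compose exec datashare_web /entrypoint.sh \
    --mode CLI \
    --stage SCAN,INDEX \
    --defaultProject "$project" \
    --elasticsearchAddress http://elasticsearch:9200 \
    --dataDir "$data_dir"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}

# Example: preview the command for another project without running it.
# DRY_RUN=1 DATASHARE_PROJECT=other-project scan_and_index
```

Setting DRY_RUN=1 prints the assembled command, which is a cheap way to review it before touching the index.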

Alternatively, you can do this in two separate phases, as long as you tell Datashare to store the queue in a shared resource. Here, we use Redis:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the operation is done, we can easily check the content of the queue created by Datashare in Redis. In this example, we only display the first 20 files in the datashare:queue:

docker compose exec redis redis-cli lrange datashare:queue 0 19

The INDEX stage can now be executed in the same container:

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage INDEX \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

Once the indexing is done, Datashare will exit gracefully and your documents will be visible in Datashare.
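
To follow a two-phase run like the one above, the Redis queue can be inspected between the SCAN and INDEX stages. The helpers below are a minimal sketch, assuming the queue is a plain Redis list named datashare:queue (which the use of lrange suggests). The function names and the QUEUE_NAME and DRY_RUN variables are invented for this example. Note that LRANGE bounds are zero-based and inclusive, so the first N entries are indexes 0 to N-1.

```shell
#!/bin/sh
# Sketch: inspect the Redis queue filled by the SCAN stage.
# QUEUE_NAME and DRY_RUN are hypothetical variables for this example.
QUEUE_NAME="${QUEUE_NAME:-datashare:queue}"

redis_cli() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "docker compose exec redis redis-cli $*"
  else
    docker compose exec redis redis-cli "$@"
  fi
}

# Number of entries currently in the queue (Redis LLEN).
queue_length() {
  redis_cli llen "$QUEUE_NAME"
}

# First N entries of the queue (default: 20).
# LRANGE is inclusive on both ends, hence the "count - 1" upper bound.
queue_head() {
  count="${1:-20}"
  redis_cli lrange "$QUEUE_NAME" 0 "$((count - 1))"
}

# Example:
# queue_length
# queue_head 5
```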

Sometimes you will have an existing index and want to index additional documents in your working directory without processing every document again. This can be done in two steps:

  • Scan the existing Elasticsearch index and gather the document paths to store them in a report queue

  • Scan and index (with OCR) the documents in the directory; thanks to the report queue from the previous step, the paths it contains are skipped

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCANIDX \
  --queueType REDIS \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/

docker compose exec datashare_web /entrypoint.sh \
  --mode CLI \
  --stage SCAN,INDEX \
  --ocr true \
  --queueType REDIS \
  --queueName "datashare:queue" \
  --reportName "report:queue" \
  --redisAddress redis://redis:6379 \
  --defaultProject secret-project \
  --elasticsearchAddress http://elasticsearch:9200 \
  --dataDir /home/datashare/Datashare/
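
Because the second step relies on the report queue built by the first, the two commands are best run in order, stopping if the first one fails. The script below is one possible way to chain them, not an official Datashare tool: run_stage, index_new_documents and DRY_RUN are names invented for this example, and it assumes that the order of the flags does not matter, since the step-specific flags are appended at the end.

```shell
#!/bin/sh
# Sketch: run the two incremental-indexing steps in sequence.
# run_stage, index_new_documents and DRY_RUN are hypothetical names.

run_stage() {
  stage="$1"
  shift
  # Flags shared by both steps, followed by any step-specific flags.
  # Assumption: flag order is not significant.
  set -- docker compose exec datashare_web /entrypoint.sh \
    --mode CLI \
    --stage "$stage" \
    --queueType REDIS \
    --reportName "report:queue" \
    --redisAddress redis://redis:6379 \
    --defaultProject secret-project \
    --elasticsearchAddress http://elasticsearch:9200 \
    --dataDir /home/datashare/Datashare/ \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    "$@"
  fi
}

index_new_documents() {
  # Step 1: store the paths already present in the index in the report queue.
  run_stage SCANIDX || return 1
  # Step 2: scan and index with OCR, using the report queue to skip known paths.
  run_stage SCAN,INDEX --ocr true --queueName "datashare:queue"
}

# Example: preview both commands without running them.
# DRY_RUN=1 index_new_documents
```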

