Batch search documents
Batch searches let you get the results for a whole list of queries at once: instead of running each query one by one, upload a list, set options and filters, and see the matching documents.
Prepare a CSV list
Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)
Write your queries in the first column of the spreadsheet, typing one query per line:

Do not put line break(s) in any of your cells.


To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them with nothing or a space.

Write 2 characters minimum in each query. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to a 'failure'.
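If you prefer to prepare the list with a script, the two rules above (no line breaks, at least 2 characters per query) can be sketched in Python. This is only an illustration: the function name clean_queries is hypothetical, and it also collapses repeated spaces, which the rules above do not require.

```python
def clean_queries(raw_queries):
    """Return queries that follow the rules above (illustrative helper)."""
    cleaned = []
    for query in raw_queries:
        # Replace line breaks inside a cell with a space and collapse whitespace.
        single_line = " ".join(query.split())
        # Drop blank cells and queries shorter than 2 characters.
        if len(single_line) >= 2:
            cleaned.append(single_line)
    return cleaned


queries = clean_queries(["Ada\nLovelace", "x", "", "  Paris  "])
print(queries)  # ['Ada Lovelace', 'Paris']
```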
If you have blank cells in your spreadsheet...

...the CSV format (CSV stands for 'Comma-separated values') will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:

To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.
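To see where these semicolons come from, here is a small Python sketch that exports one spreadsheet row with and without blank cells. It assumes your spreadsheet software uses a semicolon as the separator, as in the example above; export_row is a hypothetical helper, not part of Datashare.

```python
import csv
import io


def export_row(cells, delimiter=";"):
    """Serialize one spreadsheet row the way a CSV export would (sketch)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(cells)
    return buffer.getvalue()


# A row with two blank cells keeps one separator per blank cell.
print(repr(export_row(["Paris", "", ""])))  # 'Paris;;\n'
# The same row without the blank cells has no stray separators.
print(repr(export_row(["Paris"])))  # 'Paris\n'
```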
If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:


Remove all commas in your spreadsheet if you want to avoid exact phrase search.
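The quoting behaviour can be reproduced with Python's csv module, which, like most spreadsheet exports, wraps a cell in double quotes when it contains the separator. This is a sketch of the CSV side only; to_csv_line is a hypothetical helper.

```python
import csv
import io


def to_csv_line(cell):
    """Return the CSV text produced for a single cell (sketch)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([cell])
    return buffer.getvalue().rstrip("\n")


print(to_csv_line("Jane, Austen"))  # "Jane, Austen"
print(to_csv_line("Jane Austen"))   # Jane Austen
```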
Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:
Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)
Use operators in your CSV: AND, NOT, *, ?, !, +, - and other operators work in batch searches as they do in the regular search bar, but only if 'Do phrase matches' at step 3 is turned off. You can thus turn it off and write your queries like this, for instance:
Paris NOT Barcelona AND Taipei
Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.
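Before uploading, you may want to scan your queries for the most obvious misuses. The sketch below is a rough heuristic written for this guide, not Datashare's own validation: it only checks that double quotes, parentheses and square brackets are balanced, and find_syntax_risks is a hypothetical name.

```python
def find_syntax_risks(query):
    """Return warnings about unbalanced reserved characters (heuristic only)."""
    warnings = []
    if query.count('"') % 2 != 0:
        warnings.append("unbalanced double quote")
    if query.count("(") != query.count(")"):
        warnings.append("unbalanced parenthesis")
    if query.count("[") != query.count("]"):
        warnings.append("unbalanced square bracket")
    return warnings


print(find_syntax_risks("Paris AND (tags:London OR tags:Madrid"))
# ['unbalanced parenthesis']
print(find_syntax_risks("Paris NOT Barcelona AND Taipei"))
# []
```

A query that passes this check can still fail for other reasons, so treat an empty list as 'nothing obvious found', not as a guarantee.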
Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
Export the list as a CSV
Export your spreadsheet of queries in CSV format:

Important: Use the UTF-8 encoding in your spreadsheet software's settings.
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':

Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained here.
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
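If you generate the list with a script instead of a spreadsheet, the same UTF-8 requirement applies. Here is a minimal Python sketch that writes one query per line in UTF-8; the function name write_queries_csv and the file name queries.csv are only examples.

```python
import csv
import tempfile
from pathlib import Path


def write_queries_csv(queries, path):
    """Write one query per line to a UTF-8 encoded CSV file (sketch)."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for query in queries:
            writer.writerow([query])


# Example: write to a temporary folder and read the file back.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "queries.csv"
    write_queries_csv(["Zoë Müller", "São Paulo"], target)
    print(target.read_bytes().decode("utf-8").splitlines())
    # ['Zoë Müller', 'São Paulo']
```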
Create the batch search
Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:

Alternatively, in the menu next to 'Batch searches', click the 'Plus' button:

The form to create a batch search opens:

'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, each query will be searched for as an exact phrase, as if Datashare added double quotes around it. In that case, it won't apply any operators (AND, OR, etc.) that the queries contain. If 'Do phrase matches' is off, queries are searched without double quotes and with any operators they contain.
What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
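The two examples above can be checked with a short Python sketch that counts insertions, deletions, substitutions and transpositions of adjacent characters. It uses the 'optimal string alignment' variant of the edit distance, chosen here for illustration; the search engine's exact algorithm may differ in edge cases.

```python
def edit_distance(source, target):
    """Count the edits needed to turn source into target (sketch)."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete every character of source
    for j in range(cols):
        table[0][j] = j  # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped count as a single operation.
            if (
                i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                table[i][j] = min(table[i][j], table[i - 2][j - 2] + 1)
    return table[-1][-1]


print(edit_distance("kitten", "sitten"))  # 1
print(edit_distance("kitten", "sittin"))  # 2
print(edit_distance("quikc", "quick"))    # 1
```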
If you search for similar terms (to catch typos for example), use fuzziness.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = proximity is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = proximity is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
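To make the relationship between these options concrete, here is a Python sketch that writes a query the way you would type it yourself with the ~ syntax shown in the examples above. It illustrates the syntax only and is not Datashare's internal code: build_query is a hypothetical helper, and it assumes queries made of plain words and the operators AND, OR and NOT.

```python
OPERATORS = {"AND", "OR", "NOT"}


def build_query(query, phrase_match=False, fuzziness=0, proximity=0):
    """Return the query text with phrase, fuzziness or proximity syntax (sketch)."""
    if phrase_match:
        # Phrase match: wrap the whole query in double quotes,
        # then add the proximity value if there is one.
        phrase = '"' + query.strip('"') + '"'
        return f"{phrase}~{proximity}" if proximity > 0 else phrase
    if fuzziness == 0:
        return query
    # Fuzziness applies to each term, but operators are left untouched.
    terms = [
        term if term in OPERATORS else f"{term}~{fuzziness}"
        for term in query.split()
    ]
    return " ".join(terms)


print(build_query("quikc brwn foks", fuzziness=1))
# quikc~1 brwn~1 foks~1
print(build_query("fox quick", phrase_match=True, proximity=5))
# "fox quick"~5
print(build_query("Paris NOT Barcelona AND Taipei", fuzziness=2))
# Paris~2 NOT Barcelona~2 AND Taipei~2
```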
Once you have filled in all steps, click 'Create' and wait for the batch search to complete.
Explore your results
In the menu, click 'Batch searches' and click the name of the batch search to open it:

See the number of matching documents per query:

Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). The query position option puts the queries in the original order in which they appear in the CSV.
To explore a query's matching documents, click its name and see the list of matching documents:

Click a document's name to open it. Use the page settings or the column's names to sort documents.
Relaunch a batch search (optional)
If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.
The relaunched batch search will apply to newly indexed documents and previously indexed documents (not only the newly indexed ones).
In 'Batch searches', go to the end of the table and click the 'Relaunch' icon:

Or click 'Relaunch' in the batch search page below its name on the right panel:

Change its name and description, and decide whether to delete the current batch search after the relaunch:

See your relaunched batch search in the list of batch searches:

Failures
Failures in batch searches can be due to several causes.
Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:

Check the first failure-generating query in the error window:

Here it says:
Unexpected char 106 at (line no=1, column no=81, offset=80)
The first line contained a comma where it shouldn't have. Datashare interpreted this query as a syntax error, so the query failed and the batch search stopped.
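If you have many failures to go through, the position reported in such a message can be extracted with a short Python sketch. It assumes the message has exactly the shape shown above; parse_failure is a hypothetical helper. Note that the reported character is the one at which parsing stopped, which may not be the character that caused the problem.

```python
import re

ERROR_PATTERN = re.compile(
    r"Unexpected char (?P<code>\d+) at "
    r"\(line no=(?P<line>\d+), column no=(?P<column>\d+), offset=(?P<offset>\d+)\)"
)


def parse_failure(message):
    """Extract the character code and position from the message (sketch)."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    details = {name: int(value) for name, value in match.groupdict().items()}
    # Convert the numeric character code into the character itself.
    details["character"] = chr(details["code"])
    return details


details = parse_failure("Unexpected char 106 at (line no=1, column no=81, offset=80)")
print(details["line"], details["column"], details["character"])  # 1 81 j
```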
Check the most common syntax errors.
We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software, then re-creating the batch search.
'elasticsearch: Name does not resolve'
If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't get Elasticsearch, its search engine, to work.
In that case, you need to re-open Datashare: check how for Mac, Windows or Linux.
Example of a message regarding a problem with Elasticsearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
'Data too large'
One of your queries can lead to a 'Data too large' error.
It means that this query had too many results or that some documents in its results were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the error and restarting your batch search without it.
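Removing the responsible query from a long list can be scripted. The Python sketch below reads the query name from a failure message and filters it out of your list. It assumes the message contains a query='...' part, as in the Elasticsearch example earlier in this section; whether every failure message has this shape is an assumption, and both function names are hypothetical.

```python
import re


def failing_query(message):
    """Return the query named in a failure message, or None (sketch)."""
    match = re.search(r"query='([^']*)'", message)
    return match.group(1) if match else None


def without_query(queries, failed):
    """Return the list of queries without the failing one."""
    return [query for query in queries if query != failed]


message = (
    "SearchException: query='lovelace' message='org.icij.datashare.batch."
    "SearchException: java.io.IOException: elasticsearch: Name does not resolve'"
)
failed = failing_query(message)
print(failed)  # lovelace
print(without_query(["lovelace", "turing", "hopper"], failed))
# ['turing', 'hopper']
```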