Batch search documents

Batch searches return the results for a whole list of queries at once: instead of searching each query one by one, upload a list, set options and filters, and see the matching documents.

1

Prepare a CSV list

Open a spreadsheet (LibreOffice, Framacalc, Excel, Google Sheets, Numbers, ...)

Write your queries in the first column of the spreadsheet, typing one query per line:

Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell
One query per line in a spreadsheet

  • Do not put line break(s) in any of your cells.

Screenshot of a spreadsheet cell filled with a text containing a line break and a red cross indicates it is wrong
This will lead to a "failure"
Screenshot of a spreadsheet cell filled with a text not containing a line break and a green check indicates it is right
This will lead to a "success"

To delete all line breaks in your spreadsheet, use 'Find and replace all': find all '\n' and replace them with nothing or a space.

Screenshot of a spreadsheet software's 'Find and replace' window with the 'Replace all' button highlighted
Use this functionality to delete all line break(s)

  • Write 2 characters minimum in each query. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to a 'failure'.

  • If you have blank cells in your spreadsheet...

Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell and other columns from B to H empty and highlighted
Blank columns in a spreadsheet

...the CSV format, which stands for 'Comma-separated values', will translate these blank cells into semicolons (the 'commas'). You will thus see semicolons in your batch search results:

Screenshot of Datashare's batch search page where each query with the female personality's surname is followed by several semicolons which are highlighted
Remove blank cells in your spreadsheet in order to avoid this.

To avoid that, remove blank cells in your spreadsheet before exporting it as a CSV.

  • If there is a comma in one of your cells (like in 'Jane, Austen' below), the CSV will put the content of the cell in double quotes so it will search for the exact phrase in the documents:

Screenshot of a spreadsheet with the first column filled with one name and surname of a female personality per cell and the second cell contains 'Jane, Austen' and is highlighted
Screenshot of Datashare's batch search page where two queries are highlighted: one is 'Jane, Austen' and has 0 documents as results and the second one is 'Jane Austen' and has 2 documents as results

Remove all commas in your spreadsheet if you want to avoid exact phrase search.

  • Want to search only in some documents? Use the 'Filters' step in the batch search's form (see below). Or describe fields directly in your queries in the CSV. For instance, if you want to search only in some documents with certain tags, write your queries like this:

    Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)

  • Use operators in your CSV: AND NOT * ? ! + - and other operators work in batch searches as they do in the regular search bar, but only if "Do phrase match" at step 3 is turned off. You can turn it off and write your queries like this, for instance:

    Paris NOT Barcelona AND Taipei

  • Reserved characters (^ " ? ( [ *), when misused, can lead to failures because of syntax errors.

  • Searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
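The preparation rules above can also be applied with a short script before exporting the list. The following is a minimal sketch, assuming the queries are already available as a list of cell values; `clean_queries` is a hypothetical helper written for this page and is not part of Datashare.

```python
import re


def clean_queries(cells):
    """Apply the preparation rules above to first-column cell values.

    Hypothetical helper for illustration only, not a Datashare function.
    """
    cleaned = []
    for cell in cells:
        # Replace line breaks with a space.
        query = re.sub(r"[\r\n]+", " ", cell)
        # Drop commas so the CSV export does not wrap the cell in double quotes.
        query = query.replace(",", " ")
        # Collapse repeated whitespace left over from the two steps above.
        query = re.sub(r"\s+", " ", query).strip()
        # Skip blank cells and queries shorter than 2 characters.
        if len(query) < 2:
            continue
        cleaned.append(query)
    return cleaned


queries = clean_queries(["Ada\nLovelace", "Jane, Austen", "", "X", "Simone Veil"])
print(queries)  # ['Ada Lovelace', 'Jane Austen', 'Simone Veil']
```

The sketch does not touch reserved characters or operators, since whether to keep them depends on the query you intend to run.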

2

Export the list as a CSV

Export your spreadsheet of queries in a CSV format:

Screenshot of a window of 'Numbers' software where the menu's path File > Export to > CSV is selected

Important: Use the UTF-8 encoding in your spreadsheet software's settings.

  • LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':

Screenshot of a window of LibreOffice software where the Export options with 'Character set: Unicode (UTF-8)' is highlighted

  • Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained here.

  • Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
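If the list of queries is produced by a script rather than a spreadsheet, the same UTF-8 requirement applies. Here is a minimal sketch using Python's standard `csv` module; the function name `write_queries_csv`, the file name and the sample queries are illustrative assumptions.

```python
import csv
import os
import tempfile


def write_queries_csv(path, queries):
    """Write one query per line, in a single column, encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for query in queries:
            writer.writerow([query])


# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "queries.csv")
write_queries_csv(path, ["Ada Lovelace", "Frida Kahlo", "Zoë Wicomb"])
```

Writing a single column avoids the blank trailing cells described in step 1, and the explicit `encoding="utf-8"` keeps accented characters intact.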

3

Open the menu, go to 'Tasks', open 'Batch searches' and click the 'Plus' button at the top right:

Screenshot of Datashare's batch searches page where the 'Plus' button on the top right is highlighted

Alternatively, in the menu next to 'Batch searches', click the 'Plus' button:

Screenshot of Datashare's batch searches page where the 'Plus' button in the menu next to the entry 'Tasks > Batch searches' is highlighted

The form to create a batch search opens:

Screenshot of Datashare's page with a form to create a new batch search

  • 'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, all queries will be searched for their exact mention in documents, as if Datashare added double quotes around each query. In that case, it won't apply any operators (AND, OR, etc.) that would be in the queries. If 'Do phrase match' is off, queries are searched without double quotes and with potential operators.

  • What is fuzziness? When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.

kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)

kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)

If you search for similar terms (to catch typos for example), use fuzziness.

"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).

Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)

Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
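To check which fuzziness value a pair of terms would need, the number of operations can be computed directly. The sketch below implements the optimal string alignment distance, a common variant that counts insertions, deletions, substitutions and adjacent transpositions as one operation each. That it matches the search engine's own metric in every case is an assumption, and `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two terms.

    Insertions, deletions, substitutions and adjacent transpositions
    each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a
    for j in range(cols):
        dist[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "kc" <-> "ck".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("kitten", "sitten"))        # 1
print(edit_distance("kitten", "sittin"))        # 2
print(edit_distance("Datashare", "Datasahre"))  # 1
```

A result of 1 or 2 means the misspelling would be within reach of the corresponding fuzziness setting; anything above 2 would not.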

  • What are proximity searches? When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.

“the cat is blue” -> “the small cat is blue” (1 insertion = proximity is 1)

“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = proximity is 3)

Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox").
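The `~` notation used in the examples above can be generated for a whole list of queries. This is a minimal sketch that only reproduces the notation shown on this page (`quikc~1`, `"fox quick"~5`). The helper names are hypothetical, and whether Datashare builds its queries this way internally is an assumption.

```python
def with_fuzziness(query, fuzziness):
    """Append ~N to every whitespace-separated term, as in quikc~1."""
    return " ".join(f"{term}~{fuzziness}" for term in query.split())


def with_proximity(phrase, proximity):
    """Wrap the phrase in double quotes and append ~N, as in "fox quick"~5."""
    return f'"{phrase}"~{proximity}'


print(with_fuzziness("quikc brwn foks", 1))  # quikc~1 brwn~1 foks~1
print(with_proximity("fox quick", 5))        # "fox quick"~5
```

`with_fuzziness` is meant for plain terms only: it would also append `~N` to operators such as AND, so queries containing operators need separate handling.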

Once you have filled in all the steps, click 'Create' and wait for the batch search to complete.

4

Explore your results

In the menu, click 'Batch searches' and click the name of the batch search to open it:

Screenshot of Datashare's batch searches page where the first batch search's name is highlighted

See the number of matching documents per query:

Screenshot of Datashare's page for one batch search where the list of queries and their matching documents are highlighted

Sort the queries by number of matching documents or by query position using the page settings (icon at the top right of the screen). Sorting by query position puts the queries back in the order in which they appear in the CSV.

To explore a query's matching documents, click its name and see the list of matching documents:

Screenshot of Datashare's page for one batch search's matching documents

Click a document's name to open it. Use the page settings or the column names to sort documents.

5

Relaunch a batch search (optional)

If you've added new files in Datashare after you launched a batch search, you might want to relaunch the batch search to search in the new documents too.

The relaunched batch search will apply to newly indexed documents and previously indexed documents (not only the newly indexed ones).

In 'Batch searches', go to the end of the table and click the 'Relaunch' icon:

Screenshot of Datashare's batch searches page where the last button with the 'Relaunch' icon is highlighted

Or click 'Relaunch' in the batch search page below its name on the right panel:

Screenshot of Datashare's page for one batch search where the 'Relaunch' button in the right panel describing the batch search is highlighted

Change its name and description, and decide whether to delete the current batch search after the relaunch:

Screenshot of Datashare's page for one batch search where the 'Relaunch batch search' pop-in window is open

See your relaunched batch search in the list of batch searches:

Screenshot of Datashare's batch searches page where the two first batch searches (one normal, one relaunched) are highlighted

6

Failures

Failures in batch searches can be due to several causes.

The first query containing an error makes the batch search fail and stop.

Go to 'Tasks' > 'Batch searches' > open the batch search with a failure status and click the 'Red cross icon' button on the right panel:

Screenshot of Datashare's batch search page where the 'Failure' button in the right panel describing the batch search is highlighted

Check the first failure-generating query in the error window:

Screenshot of Datashare's batch search page where a modal window shows 'The error is' with a description of the error 'Unexpected char 106 at (line no=1, column no=81, offset=80)'

Here it says:

Unexpected char 106 at (line no=1, column no=81, offset=80)

The first line contained a comma where it shouldn't. Datashare interpreted this query as a syntax error, so the query failed and the batch search stopped.
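The position details in such a message can be pulled out with a short script to locate the problem in the query. This is a minimal sketch, assuming the message always follows the pattern shown above and that the number after 'char' is a character code; neither is confirmed by this page. `describe_error` is a hypothetical helper.

```python
import re

# Pattern modelled on the error message shown above.
ERROR_PATTERN = re.compile(
    r"Unexpected char (\d+) at \(line no=(\d+), column no=(\d+), offset=(\d+)\)"
)


def describe_error(message):
    """Return the character and position named in the message, or None."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    code, line, column, offset = (int(group) for group in match.groups())
    # Assumption: the number is a character code, so chr() turns it into text.
    return {"char": chr(code), "line": line, "column": column, "offset": offset}


print(describe_error("Unexpected char 106 at (line no=1, column no=81, offset=80)"))
# {'char': 'j', 'line': 1, 'column': 81, 'offset': 80}
```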

Check the most common syntax errors.

We recommend removing the commas, as well as any reserved characters, from your CSV using the 'Find and replace all' feature of your spreadsheet software, then re-creating the batch search.

'elasticsearch: Name does not resolve'

If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.

In that case, you need to re-open Datashare: check how for Mac, Windows or Linux.

Example of a message regarding a problem with Elasticsearch:

SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'

'Data too large'

One of your queries can lead to a 'Data too large' error.

It means that this query had too many results, or that its results included documents too big for Datashare to process. This makes the search engine fail.

We recommend removing the query responsible for the error and restarting your batch search without it.
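Removing the failing query from a long list can be scripted. The sketch below assumes the error message names the query in the same `query='...'` form as the SearchException example earlier on this page; whether a 'Data too large' message uses that form is an assumption. Both helper names are hypothetical.

```python
import re


def failing_query(error_message):
    """Extract the query named in a message of the form query='...'."""
    match = re.search(r"query='([^']*)'", error_message)
    return match.group(1) if match else None


def without_query(queries, bad_query):
    """Return the list of queries with the failing one removed."""
    return [query for query in queries if query != bad_query]


# Sample message copied from the SearchException example on this page.
message = (
    "SearchException: query='lovelace' message='org.icij.datashare.batch."
    "SearchException: java.io.IOException: elasticsearch: Name does not resolve'"
)
bad = failing_query(message)
print(bad)                                                   # lovelace
print(without_query(["lovelace", "hopper", "curie"], bad))   # ['hopper', 'curie']
```

The remaining queries can then be exported again as a CSV and used to create a new batch search.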
