Search documents in batch
Batch search lets you run a list of queries all at once and get the results for each of them.
Datashare is an open source project by the International Consortium of Investigative Journalists
If you want to search a list of queries in Datashare, instead of doing each of them one by one, you can upload the list directly in Datashare. To do so, you will:
Create a list of terms that you want to search in the first column of a spreadsheet
Export the spreadsheet as a CSV (a special format available in any spreadsheet software)
Upload this CSV in the "new Batch Search" form in Datashare
Get the results for each query in Datashare - or in a CSV.
Write your queries, one per line and per cell, in the first column of a spreadsheet (Excel, Google Sheets, Numbers, Framacalc, etc.). In the example below, there are 4 queries:
Do not put line break(s) in any of your cells.
To delete line breaks in your spreadsheet, you can use the "Find and replace all" functionality. Find all "\n" and replace them all with nothing or with a space.
Write 2 characters minimum in the cells. If one cell contains one character but at least one other cell contains more than one, the cell containing one character will be ignored. If all cells contain only one character, the batch search will lead to 'failure'.
If you have blank cells in your spreadsheet...
...the CSV (which stands for 'Comma-separated values') will keep these blank cells. It will separate them with semicolons (the 'commas'). You will thus have semicolons in your batch search results (see screenshot below). To avoid that, remove the blank cells in your spreadsheet before exporting it as a CSV.
If there is a comma in one of your cells (like in "1,8 million" in our example above), the CSV will formally put the content of the cell in double quotes in your results and search for the exact phrase in double quotes.
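If your list of queries comes from a script rather than a spreadsheet, the rules above can be applied before the CSV is written. The following is a minimal sketch, not part of Datashare: the function name `clean_queries` and the sample queries are assumptions made for illustration.

```python
import re


def clean_queries(raw_queries):
    """Apply the cell rules described above to a list of query strings."""
    cleaned = []
    for raw in raw_queries:
        # Replace any line break inside a cell with a single space.
        query = re.sub(r"[\r\n]+", " ", raw).strip()
        # Drop blank cells and cells with fewer than 2 characters.
        if len(query) < 2:
            continue
        cleaned.append(query)
    return cleaned


# Hypothetical input: one blank cell, one line break, one single character.
print(clean_queries(["Paris", "", "London\nMadrid", "a", "1,8 million"]))
# ['Paris', 'London Madrid', '1,8 million']
```

The sketch drops single-character cells up front instead of relying on Datashare to ignore them, so a list made only of such cells ends up empty before any upload.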
In the new Batch Search's form > Advanced Filters, you will be able to select some file types and some paths if you want to search only in some documents.
But you can also use fields directly in your queries in the CSV.
For instance, if you want to search only in some documents with certain tag(s), you can write your queries like this: "Paris AND (tags:London OR tags:Madrid NOT tags:Cotonou)".
The operators AND NOT * ? ! + - do work in batch searches (as they do in the regular search bar) but only if "Do phrase match" in Advanced filters is turned off.
Reserved characters, when misused, can lead to failures because of syntax errors.
Please also note that searches are not case sensitive: if you search 'HeLlo', it will look for all occurrences of 'Hello', 'hello', 'hEllo', 'heLLo', etc. in the documents.
Export your spreadsheet in a CSV format like this:
Important: Use the UTF-8 encoding.
LibreOffice Calc: it uses UTF-8 by default. If not, go to LibreOffice menu > Preferences > Load/Save > HTML Compatibility and make sure the character set is 'Unicode (UTF-8)':
Microsoft Excel: if it is not set by default, select "CSV UTF-8" as one of the formats, as explained here.
Google Sheets: it uses UTF-8 by default. Just click "Export to" and "CSV".
Other spreadsheet software: please refer to each program's user guide.
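If you build your list of queries with a script instead of a spreadsheet, you can write the CSV yourself. The sketch below uses Python's standard `csv` module and forces the UTF-8 encoding. The function name `write_queries_csv` is an assumption made for illustration; it is not part of Datashare.

```python
import csv


def write_queries_csv(queries, path):
    """Write one query per row, in the first column, encoded as UTF-8."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        for query in queries:
            # A query containing a comma is wrapped in double quotes
            # automatically by the csv module.
            writer.writerow([query])
```

Calling `write_queries_csv(["Zürich", "1,8 million"], "queries.csv")` would produce a two-line file in which the second line is `"1,8 million"` in double quotes, which matches the comma behavior described above.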
Open Datashare, click 'Batch searches' in the left menu and click 'New batch search' on the top right:
Type a name for your batch search:
Upload your CSV:
Add a description (optional):
Set the advanced filters ('Do phrase matches', 'Fuzziness' or 'Proximity searches', 'File types' and 'Path') according to your preferences:
'Do phrase matches' is the equivalent of double quotes: it looks for documents containing an exact sentence or phrase. If you turn it on, every query will be searched for as an exact phrase, as if Datashare added double quotes around each query.
When you run a batch search, you can set the fuzziness to 0, 1 or 2. It will apply to each term in a query. It corresponds to the maximum number of operations (insertions, deletions, substitutions and transpositions) on characters needed to make one term match the other.
kitten -> sitten (1 substitution (k turned into s) = fuzziness is 1)
kitten -> sittin (2 substitutions (k turned into s and e turned into i) = fuzziness is 2)
If you search for similar terms (to catch typos for example), use fuzziness.
"The default edit distance is 2, but an edit distance of 1 should be sufficient to catch 80% of all human misspellings. It can be specified as: quikc~1" (source: Elastic).
Example: quikc~ brwn~ foks~ (as the default edit distance is 2, this query will catch all quick, quack, quock, uqikc, etc. as well as brown, folks, etc.)
Example: Datashare~1 (this query will catch Datasahre, Dqtashare, etc.)
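To check which fuzziness value a given typo needs, you can compute the edit distance yourself. The function below is a minimal sketch of the "optimal string alignment" distance, which counts insertions, deletions, substitutions and transpositions of adjacent characters. It illustrates the idea only and is not the code used by Datashare or its search engine; the name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two terms."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of operations needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


print(edit_distance("kitten", "sitten"))        # 1
print(edit_distance("kitten", "sittin"))        # 2
print(edit_distance("Datashare", "Datasahre"))  # 1 (one transposition)
```

A term pair with a distance of 1 is caught with fuzziness 1, while a pair with a distance of 2 needs fuzziness 2.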
When you turn on 'Do phrase matches', you can set, in 'Proximity searches', the maximum number of operations (insertions, deletions, substitutions and transpositions) on terms needed to make one phrase match the other.
“the cat is blue” -> “the small cat is blue” (1 insertion = fuzziness is 1)
“the cat is blue” -> “the small is cat blue” (1 insertion + 2 transpositions = fuzziness is 3)
Example: "fox quick"~5 (this query will catch "quick brown fox", "quick brown car thin fox" or even "quick brown car thin blue tree fox")
Click 'Add'. Your batch search will appear in the table of batch searches.
Open your batch search by clicking its name:
You can see your results and sort them by clicking a column's name. 'Rank' is the order in which the results of each query would be sorted if the query were run in Datashare's main search bar. Results are thus sorted by relevance score by default.
You can click on a document's name and it will open it in a new tab:
You can filter your results by query and read how many documents there are for each query:
You can search for specific queries:
You can also download your results in a CSV format:
If you add more and more files in Datashare, you might want to relaunch an existing batch search on your new documents too.
Notes:
In the server collaborative mode, you can only relaunch your own batch searches, not others'.
The relaunched batch search will apply to your whole corpus, newly indexed documents and previously indexed documents (not only the newly indexed ones).
To do so, open the batch search that you'd like to relaunch and click 'Relaunch':
Edit the name and the description of your batch search if needed:
You can choose to delete the current batch search after relaunching it:
Note: if you're worried about losing your previous results because of an error, we recommend keeping your current batch search (turn off this toggle button) and deleting it only after the relaunch has succeeded.
Click 'Submit':
You can see your relaunched batch search running in the batch search's list:
Failures in batch searches can be due to several causes.
Click the 'See error' button to open the error window:
The first query containing an error makes the batch search fail and stop.
Check this first failure-generating query in the error window:
In the case above, the slash (/) between 'Heroin' and 'Opiates' is a reserved character that was not escaped with a backslash. Datashare interpreted the query as a syntax error and failed, so the batch search stopped there.
We recommend removing the slash, as well as any other reserved characters, and re-running the batch search.
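For long lists, removing or escaping reserved characters by hand is tedious. The sketch below shows one way to do it in bulk. It assumes the reserved character list documented for the Elasticsearch query string syntax (`+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /`), in which `<` and `>` cannot be escaped and are therefore always removed. The function name `sanitize_query` is hypothetical.

```python
import re

# Assumed reserved characters (Elasticsearch query string syntax).
# "&" and "|" are treated one character at a time.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def sanitize_query(query, escape=True):
    """Escape reserved characters with a backslash, or replace them with spaces."""
    # "<" and ">" cannot be escaped, so they are always removed.
    query = query.replace("<", "").replace(">", "")
    if escape:
        return _RESERVED.sub(r"\\\1", query)
    # Collapse the spaces left behind by removed characters.
    return " ".join(_RESERVED.sub(" ", query).split())


print(sanitize_query("Heroin/Opiates"))                # Heroin\/Opiates
print(sanitize_query("Heroin/Opiates", escape=False))  # Heroin Opiates
```

Only apply this to queries meant as literal text: it also neutralizes operators you may have written on purpose, such as parentheses, wildcards or the colon in `tags:London`.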
If you get a message which contains 'elasticsearch: Name does not resolve', it means that Datashare can't make Elasticsearch, its search engine, work.
In that case, you need to re-open Datashare: here are the instructions for Mac, Windows or Linux.
Example of a message regarding a problem with Elasticsearch:
SearchException: query='lovelace' message='org.icij.datashare.batch.SearchException: java.io.IOException: elasticsearch: Name does not resolve'
One of your queries can lead to a 'Data too large' error.
It means that this query returned too many results, or that some documents in its results were too big for Datashare to process. This makes the search engine fail.
We recommend removing the query responsible for the 'Data too large' error and restarting your batch search without it.
One or several of your queries contain syntax errors.
It means that you wrote one or more of your queries the wrong way with some characters that are reserved as operators (see below).
You need to correct the error(s) in your CSV and re-launch your new batch search with a CSV that does not contain errors. Click here to learn how to launch a batch search.
Datashare stops at the first syntax error. It reports only the first error. You might need to check all your queries as some errors can remain after correcting the first one.
They are more likely to happen when the 'Do phrase matches' toggle button is turned off:
When 'Do phrase matches' is on, syntax errors can still happen though:
Here are the most common errors:
You cannot start a query with AND all uppercase, neither in Datashare's main search bar nor in your CSV. AND is reserved as a search operator.
You cannot start a query with OR all uppercase, neither in Datashare's main search bar nor in your CSV. OR is reserved as a search operator.
You cannot type a query with only one double quote, neither in Datashare's main search bar nor in your CSV. Double quotes are reserved as a search operator.
You cannot start a query with tilde (~) or make one contain a tilde, neither in Datashare's main search bar nor in your CSV. Tilde is reserved as a search operator for fuzziness or proximity searches.
You cannot start a query with caret (^) or make it contain a caret, neither in Datashare's main search bar nor in your CSV. Caret is reserved as a boosting operator.
You cannot start a query with slash (/) or make it contain a slash, neither in Datashare's main search bar nor in your CSV. Slash is a reserved character to open Regex ('regular expressions'). Note that you can use Regex in batch searches.
You cannot use square brackets except for searching for ranges.
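Because Datashare reports only the first syntax error, it can help to scan every query for the patterns listed above before uploading the CSV. The following is a rough heuristic sketch, not a real query parser: the function name `find_query_problems` is an assumption, the quote check generalizes "only one double quote" to any odd number of quotes, and the range check only recognizes the `[x TO y]` form. Treat its output as warnings to review, since a tilde or a slash can also be used on purpose (fuzziness, proximity, Regex).

```python
import re


def find_query_problems(query):
    """Return a list of warnings for one query, based on the common errors above."""
    problems = []
    stripped = query.strip()
    # AND / OR in uppercase at the very start of the query.
    if re.match(r"(AND|OR)\b", stripped):
        problems.append("starts with AND or OR")
    # An odd number of double quotes means one of them is not closed.
    if stripped.count('"') % 2 == 1:
        problems.append("unbalanced double quote")
    for char, name in (("~", "tilde"), ("^", "caret"), ("/", "slash")):
        if char in stripped:
            problems.append(f"contains {name} ({char})")
    # Square brackets are only accepted in ranges such as [2012 TO 2014].
    without_ranges = re.sub(r"\[[^\[\]]* TO [^\[\]]*\]", "", stripped)
    if "[" in without_ranges or "]" in without_ranges:
        problems.append("square bracket outside a range")
    return problems


for query in ["AND Paris", '"Paris', "Heroin/Opiates", "date:[2012 TO 2014]"]:
    print(repr(query), find_query_problems(query))
```

Running a check like this over the whole list surfaces every suspicious query at once, instead of discovering them one failed batch search at a time.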
Open your batch search and click the trash icon:
Then click 'Yes':