Thanks for sharing this. Let me expand your answer a little.
Basic search involves sorting and holds all results in memory
Using any of these methods
search(Query, ApplicationUser, PagerFilter)
search(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)
searchOverrideSecurity(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)
has the following two implications: results are sorted (according to Query, or according to Lucene’s scoring) and results are returned in one go (as SearchResults). Sorting means you spend extra time ordering the results, even if you don’t care about the order. Returning SearchResults, which in Jira 7.x contain List<Issue>, means loading lots of data into memory, which is costly both in the time spent loading and in the memory needed to hold the results. None of this is a problem if you deal with a couple of hundred results, but if your query matches many thousands of results, it can lead to an OutOfMemoryError (OOME).
The hidden truth about getting a 100th page of results
There is also a “slight” problem with getting the n-th page using PagerFilter. One might think that asking for 100 results at a time is safe. However, to know what is on the 100th page of results, we need to sort the first 100 pages, skip 99 of them and return the last one. Fortunately we don’t load all the documents into memory, but it’s dog slow due to the sorting.
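To make the cost concrete, here is a minimal sketch in plain Java of why a deep page is expensive. It is not Jira or Lucene code: it assumes hits are ranked by an integer sort key and that the searcher keeps the best page × pageSize candidates in a bounded heap, which is roughly how a top-N collection behaves. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model, not Jira or Lucene API: fetch one page of hits ordered by ascending sort key.
class DeepPagingSketch {

    // Returns the keys on the requested page (1-based), in ascending order.
    static List<Integer> page(int[] sortKeys, int pageNumber, int pageSize) {
        // Every earlier page has to be ranked too, not just the requested one.
        int needed = pageNumber * pageSize;
        // Max-heap holding the `needed` smallest keys seen so far.
        PriorityQueue<Integer> best = new PriorityQueue<>(Collections.reverseOrder());
        for (int key : sortKeys) {
            best.add(key);
            if (best.size() > needed) {
                best.poll(); // drop the current worst candidate
            }
        }
        List<Integer> ranked = new ArrayList<>(best);
        Collections.sort(ranked);
        // Skip the earlier pages and keep only the requested one.
        int from = Math.min((pageNumber - 1) * pageSize, ranked.size());
        return new ArrayList<>(ranked.subList(from, ranked.size()));
    }
}
```

In this model, asking for page 100 with 100 results per page makes `needed` equal to 10,000, so the ranking work grows with the page number even though only 100 hits are returned.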
The API that is very efficient for iterating over large data sets is the Collector-based one:
search(Query, ApplicationUser, Collector)
search(Query, ApplicationUser, Collector, org.apache.lucene.search.Query)
searchOverrideSecurity(Query, ApplicationUser, Collector)
A Collector consumes one Lucene document at a time, in the order the documents are laid out on disk. It ignores any sorting requested in Query. This means the penalty for sorting is avoided. Memory is also saved by not loading more than one document at a time. (Of course one can still blow up the instance by loading every single document into a huge list from inside the collector.)
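The streaming idea can be sketched without any Lucene dependency. The snippet below is a simplified model, not the real Lucene Collector interface: the `search` method and `CountingCollector` class are hypothetical stand-ins that only show hits being pushed to a callback one at a time, in storage order, while the callback keeps a running aggregate instead of a list.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Hypothetical model, not the Lucene Collector interface: hits are pushed to a callback one at a time.
class StreamingCollectorSketch {

    // Stand-in for the index: visits matching doc ids in storage order, with no sorting.
    static void search(int[] matchingDocIds, IntConsumer collector) {
        for (int docId : matchingDocIds) {
            collector.accept(docId);
        }
    }

    // A collector that keeps only a running aggregate, so its memory use stays constant.
    static final class CountingCollector implements IntConsumer {
        long count;
        long sumOfIds;

        @Override
        public void accept(int docId) {
            count++;
            sumOfIds += docId;
        }
    }

    public static void main(String[] args) {
        // The array only simulates the index contents; a real index would not be held in memory.
        int[] hits = IntStream.range(0, 100_000).toArray();
        CountingCollector collector = new CountingCollector();
        search(hits, collector);
        System.out.println(collector.count + " hits visited");
    }
}
```

A collector that appended every hit to a list instead would bring back exactly the memory problem described above.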
The Collector API ties plugins to the Lucene API, which is something we’d like to avoid. However, if your plugin needs to operate on tens of thousands of issues at once, it’s the only choice for now.
One abuse of the Collector concept is these two methods:
searchAndSort(Query, ApplicationUser, Collector, PagerFilter)
searchAndSortOverrideSecurity(Query, ApplicationUser, Collector, PagerFilter)
The problem with these two methods is that they provide sorting by passing documents to the collector according to the order from Query. As I mentioned before, sorting has a cost: it requires holding a data structure in memory to compare documents and order them relative to each other. These methods are much slower for large result sets.
In Jira 8 we refreshed the search API:
- searchAndSort with collector is gone.
- The result,
SearchResults, no longer contains issues. Instead, it contains bare Lucene documents and it’s up to the caller to decide what to do with them.
- Instead of loading full documents, you can ask for specific fields by passing a fieldsToLoad set. The returned documents will only contain these fields.
- Most parameters were encapsulated in
“In Jira 8, the API is slightly changed and there is a slightly different SearchProvider.search(SearchQuery, ApplicationUser, Set<String> fieldToLoad) overload method to achieve the same result.”
This is not true. This method is more efficient than loading full documents, but it is not a replacement for searching with a collector. Using it still involves the penalty for sorting results and holding the whole result set in memory. Use this method if you need sorted results, or if you get only a limited number of documents (a few thousand is probably the upper bound).
As a side note, Lucene 7 provides doc values, which Jira indexes for most fields. Reading doc values is faster than calling IndexReader#document, so if you only need one or two fields from all documents, consider doing that. See com.atlassian.jira.jql.query.IssueIdCollector for an example.
If you don’t want to go that deep into Lucene, you can extend
com.atlassian.jira.issue.statistics.util.FieldDocumentHitCollector, which will be slower, but easier to use.
If you still need to get all the results in order
You can pass an empty set as fieldsToLoad. This way you’ll only get document ids (these are not the same as issue ids) and then you can load documents one at a time. You will pay the sorting penalty, though.
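The shape of that loop might look like the sketch below. It is plain Java with hypothetical names, not Jira API: `sortedDocIds` stands for the ids you got back from the search, and `loadDocument` stands for whatever call you use to fetch a single document by its id. The point is only that one document is held at a time while the order is preserved.

```java
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Hypothetical model, not Jira API: sorted doc ids first, then one document in memory at a time.
class OrderedLazyLoadSketch {

    // `sortedDocIds` stands for the ids returned by a search with an empty fieldsToLoad set.
    // `loadDocument` stands for whatever call fetches a single document by its id.
    static <D> void forEachInOrder(int[] sortedDocIds,
                                   IntFunction<D> loadDocument,
                                   Consumer<D> handler) {
        for (int docId : sortedDocIds) {
            D document = loadDocument.apply(docId); // only this document is held
            handler.accept(document);               // eligible for garbage collection afterwards
        }
    }
}
```

The array of ids still has to fit in memory, but an int per hit is far cheaper than a loaded document per hit.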
To see all the changes in Jira 8 related to Lucene, see Lucene upgrade.