How to avoid Out Of Memory error using SearchProvider

@azhdanov,

Thanks for sharing this. Let me expand your answer a little.

Jira 7

Basic search involves sorting and holds all results in memory

Using any of these methods

  • search(Query, ApplicationUser, PagerFilter)
  • search(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)
  • searchOverrideSecurity(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)

has the following two implications: results are sorted (according to the Query, or according to Lucene’s scoring) and results are returned in one go (as SearchResults). Sorting means you spend extra time ordering the results, even if you don’t care about the order. Returning SearchResults, which in Jira 7.x contains a List<Issue>, means loading lots of data into memory, which is costly both in the act of loading and in the memory you need to hold the results. None of this is a problem if you deal with a couple of hundred results, but if your query matches many thousands of results, it can lead to an OOME.
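To make it concrete, here is a minimal sketch of the pattern this describes (the project key "FOO" and the component wiring are placeholders):

    import java.util.List;

    import com.atlassian.jira.issue.Issue;
    import com.atlassian.jira.issue.search.SearchException;
    import com.atlassian.jira.issue.search.SearchProvider;
    import com.atlassian.jira.issue.search.SearchResults;
    import com.atlassian.jira.jql.builder.JqlQueryBuilder;
    import com.atlassian.jira.user.ApplicationUser;
    import com.atlassian.jira.web.bean.PagerFilter;
    import com.atlassian.query.Query;

    public class NaiveSearchExample {
        private final SearchProvider searchProvider; // injected by the plugin system

        public NaiveSearchExample(SearchProvider searchProvider) {
            this.searchProvider = searchProvider;
        }

        public List<Issue> allIssuesInProject(ApplicationUser user) throws SearchException {
            Query query = JqlQueryBuilder.newBuilder().where().project("FOO").buildQuery();

            // Sorts the whole result set and materialises every matching issue in memory.
            // Fine for a few hundred issues, a likely OOME for tens of thousands.
            SearchResults results = searchProvider.search(query, user, PagerFilter.getUnlimitedFilter());
            return results.getIssues();
        }
    }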

The hidden truth about getting a 100th page of results

There is also a “slight” problem with getting the n-th page using PagerFilter. One might think that asking for 100 results at a time keeps them safe. However, to know what is on the 100th page of results, we need to sort the first 100 pages, skip 99 of them and return the last one. Fortunately we don’t load all those documents into memory, but it’s dog slow due to the sorting.
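Reusing the searchProvider field and imports from the sketch above, and assuming the two-argument PagerFilter(start, max) constructor, fetching the 100th page looks roughly like this:

    // Page size 100, zero-based page index 99. Lucene still has to sort and walk
    // the 9 900 hits that precede this page on every call, so looping over all
    // pages this way gets slower with every page you fetch.
    public List<Issue> hundredthPage(Query query, ApplicationUser user) throws SearchException {
        PagerFilter pager = new PagerFilter(99 * 100, 100); // start offset, page size
        return searchProvider.search(query, user, pager).getIssues();
    }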

Using collectors

The API that is actually efficient for iterating over large data sets is the Collector:

  • search(Query, ApplicationUser, Collector)
  • search(Query, ApplicationUser, Collector, org.apache.lucene.search.Query)
  • searchOverrideSecurity(Query, ApplicationUser, Collector)

A Collector consumes one Lucene document at a time, in the order they are laid out on disk. It ignores any sorting requested in the Query, so the penalty for sorting is avoided. Memory is also saved, because no more than one document is loaded at a time. (Of course one can still blow up the instance by loading every single document into a huge list from inside the collector.)
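A minimal sketch of such a collector, written against the Lucene 3.x Collector class that Jira 7 bundles; it only counts matching documents, so it never touches stored fields at all:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Counts matching documents without ever loading an Issue or a full Document.
    public class CountingCollector extends Collector {
        private int count;

        @Override
        public void setScorer(Scorer scorer) {
            // scores are not needed, so the scorer is ignored
        }

        @Override
        public void collect(int doc) throws IOException {
            count++; // called once per matching document, in index order
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) {
            // called when Lucene moves to the next index segment; nothing to track here
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // we do not depend on document order at all
        }

        public int getCount() {
            return count;
        }
    }

You would pass an instance to search(query, user, collector) and read the count once the search returns.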

The Collector API ties plugins to the Lucene API, which is something we’d like to avoid. However, if your plugin needs to operate on tens of thousands of issues at once, it’s the only choice for now.

Abusing collectors

There is also an abuse of the Collector concept, namely these two methods:

  • searchAndSort(Query, ApplicationUser, Collector, PagerFilter)
  • searchAndSortOverrideSecurity(Query, ApplicationUser, Collector, PagerFilter)

The problem with these two methods is that they provide sorting: documents are passed to the collector in the order requested by the Query. As I mentioned before, sorting requires holding in memory a data structure that compares documents and puts them in order with respect to each other. These methods are therefore much slower for large result sets.

Jira 8

In Jira 8 we refreshed the LuceneSearchProvider API:

  1. searchAndSort with collector is gone.
  2. The result, SearchResults, no longer contains issues. Instead, it contains bare Lucene documents and it’s up to the caller to decide what to do with them.
  3. Instead of loading full documents, you can ask for specific fields by passing a fieldsToLoad set. The returned documents will only contain these fields.
  4. Most parameters were encapsulated in com.atlassian.jira.issue.search.SearchQuery.

search with fieldsToLoad vs. search with Collector

@azhdanov wrote:

in Jira 8, the API is slightly changed and there is slightly different SearchProvider.search(SearchQuery, ApplicationUser, Set<String> fieldToLoad) overload method, to achieve the same result.

This is not true. This method is more efficient than loading full documents, but it is not a replacement for searching with a collector. Using it still involves the penalty for sorting the results and holding the whole result set in memory. Use this method if you need sorted results, or if you only get a limited number of documents (a few thousand is probably the upper bound).
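For contrast, here is roughly what such a call looks like (a fragment only; SearchQuery.create(...), the generic shape of SearchResults and the stored field name "summary" are my assumptions, so check the Jira 8 SearchProvider and SearchQuery javadoc before copying):

    // Assumed factory method for the Jira 8 SearchQuery wrapper.
    SearchQuery searchQuery = SearchQuery.create(query, user);
    // "summary" is a hypothetical stored field name, used purely for illustration.
    Set<String> fieldsToLoad = Collections.singleton("summary");
    SearchResults<Document> results = searchProvider.search(searchQuery, user, fieldsToLoad);
    // Still sorted and still fully held in memory; each returned Document is just
    // smaller, because only the requested stored fields were loaded.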

Doc values

As a side note, Lucene 7 provides doc values, which Jira indexes for most of the fields. Reading doc values is faster than calling IndexReader#document, so if you only need one or two fields from all the documents, consider doing that. See com.atlassian.jira.jql.query.IssueIdCollector for an example.
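A minimal sketch in that spirit, written against the stock Lucene 7 API; the field name "issue_id" and the assumption that it is indexed as a sorted doc value are mine, so verify them against DocumentConstants and the real IssueIdCollector in your Jira version:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.search.SimpleCollector;

    // Reads a single value per matching document from doc values instead of
    // loading stored fields. Ids are tiny, so keeping them all in a set is cheap.
    public class IssueIdDocValuesCollector extends SimpleCollector {
        private final Set<String> issueIds = new HashSet<>();
        private SortedDocValues issueIdValues;

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            // Doc values are per segment, so re-acquire them on every segment switch.
            issueIdValues = DocValues.getSorted(context.reader(), "issue_id");
        }

        @Override
        public void collect(int doc) throws IOException {
            // advanceExact positions the iterator on this document if it has a value.
            if (issueIdValues.advanceExact(doc)) {
                issueIds.add(issueIdValues.binaryValue().utf8ToString());
            }
        }

        @Override
        public boolean needsScores() {
            return false; // scoring is irrelevant, which lets Lucene skip some work
        }

        public Set<String> getIssueIds() {
            return issueIds;
        }
    }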
If you don’t want to go that deep into Lucene, you can extend com.atlassian.jira.issue.statistics.util.FieldDocumentHitCollector instead, which is slower but easier to use.

If you still need to get all the results in order

You can pass an empty set as fieldsToLoad. This way you’ll only get document ids (these are not the same as issue ids), and then you can load the documents one at a time. You will pay the sorting penalty, though.

Documentation

To see all the changes in Jira 8 related to Lucene, see Lucene upgrade.
