How to avoid Out Of Memory error using SearchProvider

azhdanov · November 14, 2018, 3:41pm

I’d like to share something I learned during App Week in Amsterdam: how to use the SearchProvider.search() method without causing high memory allocation in the underlying Jira API.

There are a few examples showing the SearchProvider.search(Query, ApplicationUser, PagerFilter) method. Some even load all issues at once, using the unlimited PagerFilter.getUnlimitedFilter().

Unfortunately, this method is not very efficient and may cause an OutOfMemory error sooner or later, especially with an unlimited PagerFilter.

Often there is no need for search to load all issues, or complete issue objects. So, to avoid OutOfMemory errors, it’s better to use new PagerFilter(MAX_RESULTS), or the other overload, SearchProvider.search(Query, ApplicationUser, IssueCollector), to get a limited number of issues or only the minimum data required from the search.

Here is an example that gets issue ids:

// Declare the concrete type so the id accessor is visible on the variable.
IssueIdCollector collector = new IssueIdCollector();
searchProvider.search(query, user, collector);
Set<String> issueIds = collector.getAllIssueIds();

Note, in Jira 8, the API is slightly changed and there is slightly different SearchProvider.search(SearchQuery, ApplicationUser, Set<String> fieldToLoad) overload method, to achieve the same result. I could not find the JavaDoc for Jira 8 yet, so this is still an open question.

pvandevoorde · November 29, 2018, 8:52pm

Thank you for sharing!

kcichy · December 4, 2018, 3:16pm

@azhdanov,

Thanks for sharing this. Let me expand your answer a little.

Jira 7

Basic search involves sorting and holds all results in memory

Using any of these methods

search(Query, ApplicationUser, PagerFilter)
search(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)
searchOverrideSecurity(Query, ApplicationUser, PagerFilter, org.apache.lucene.search.Query)

has two implications: results are sorted (according to Query, or according to Lucene’s scoring), and results are returned in one go (as SearchResults). Sorting means you spend extra time ordering the results, even if you don’t care about the order. Returning SearchResults, which in Jira 7.x contain List<Issue>, means loading lots of data into memory, which is costly both in the act of loading and in the memory needed to hold the results. None of this is a problem if you deal with a couple of hundred results, but if your query matches many thousands of results, it can lead to an OOME.

The hidden truth about getting a 100th page of results

There is also a “slight” problem with getting n-th page using PagerFilter. One might think that if they’re asking for 100 results at a time, they will be safe. However, to know what is in the 100th page of results, we need to sort the first 100 pages, skip 99 and return the last one. Fortunately we don’t load all the documents into memory, but it’s dog slow due to the sorting.
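To make the cost of deep paging concrete, here is a minimal, self-contained sketch in plain Java. It is a toy model, not Jira or Lucene code: the class DeepPagingSketch, the page method and the peakHeapSize field are all made up for illustration. It assumes the usual bounded-heap approach to sorted top-N retrieval and shows that serving page n means keeping track of (n + 1) * pageSize entries, even though only pageSize of them are returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of sorted paging; hypothetical names, not Jira or Lucene code.
final class DeepPagingSketch {

    /** Number of entries the heap retained during the last call to page(). */
    static int peakHeapSize;

    /**
     * Returns the requested page (0-based) of the smallest values, in ascending order.
     * To know what belongs on page n, the heap must retain the best (n + 1) * pageSize values.
     */
    static List<Integer> page(int[] values, int pageIndex, int pageSize) {
        int needed = (pageIndex + 1) * pageSize;
        // Max-heap: the worst retained value sits on top and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
        peakHeapSize = 0;
        for (int v : values) {
            heap.add(v);
            if (heap.size() > needed) {
                heap.poll(); // drop the current worst candidate
            }
            peakHeapSize = Math.max(peakHeapSize, heap.size());
        }
        List<Integer> sorted = new ArrayList<>(heap);
        Collections.sort(sorted);
        // Skip all earlier pages and return only the last one.
        int from = Math.min(pageIndex * pageSize, sorted.size());
        return new ArrayList<>(sorted.subList(from, sorted.size()));
    }

    public static void main(String[] args) {
        int[] values = new int[50_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = values.length - i; // descending input, so ordering takes real work
        }
        List<Integer> page100 = page(values, 99, 100);
        System.out.println("page size: " + page100.size()
                + ", entries retained to produce it: " + peakHeapSize);
    }
}
```

In this model the work grows with the page index rather than with the page size, which matches the point above: a small page size alone does not make deep pages cheap.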

Using collectors

The API, which is very efficient for iterating over large data sets, is the Collector:

search(Query, ApplicationUser, Collector)
search(Query, ApplicationUser, Collector, org.apache.lucene.search.Query)
searchOverrideSecurity(Query, ApplicationUser, Collector)

Collector consumes one Lucene document at a time, in the order they are laid out on disk. It ignores any sorting requested in Query, so the penalty for sorting is avoided. Memory is also saved because no more than one document is loaded at a time. (Of course one can still blow up the instance by loading every single document into a huge list from inside the collector.)
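The difference between a well-behaved collector and one that hoards results can be sketched in plain Java. This is a toy model under stated assumptions: DocCollector, CountingCollector, HoardingCollector and the search method below are hypothetical stand-ins, not Lucene's Collector or Jira's SearchProvider. It only illustrates the shape of the idea: documents are visited once, in storage order, and what the collector keeps decides the memory footprint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model of collector-style iteration; hypothetical names, not the Lucene API.
final class CollectorSketch {

    /** Receives one matching document id at a time, in storage order. */
    interface DocCollector {
        void collect(int docId);
    }

    /** Walks documents in storage order and hands each match to the collector; nothing is sorted or buffered. */
    static void search(int docCount, IntPredicate matches, DocCollector collector) {
        for (int docId = 0; docId < docCount; docId++) {
            if (matches.test(docId)) {
                collector.collect(docId);
            }
        }
    }

    /** Constant-memory collector: keeps a running count only. */
    static final class CountingCollector implements DocCollector {
        long count;

        @Override
        public void collect(int docId) {
            count++;
        }
    }

    /** The anti-pattern: every hit is retained, so memory grows with the result set. */
    static final class HoardingCollector implements DocCollector {
        final List<Integer> hits = new ArrayList<>();

        @Override
        public void collect(int docId) {
            hits.add(docId);
        }
    }

    public static void main(String[] args) {
        CountingCollector counter = new CountingCollector();
        search(1_000_000, docId -> docId % 3 == 0, counter);
        System.out.println("matches: " + counter.count);
    }
}
```

If a collector does need to keep something per hit, keeping only a small value (an id or a key) rather than a whole document keeps the growth modest.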

The Collector API ties plugins to the Lucene API, which is something we’d like to avoid. However, if your plugin needs to operate on tens of thousands of issues at once, it’s the only choice for now.

Abusing collectors

There is an abuse of the Collector concept, which is these two methods:

searchAndSort(Query, ApplicationUser, Collector, PagerFilter)
searchAndSortOverrideSecurity(Query, ApplicationUser, Collector, PagerFilter)

The problem with these two methods is that they provide sorting by passing documents to the collector in the order requested by Query. As I mentioned before, sorting requires holding a data structure in memory to compare documents and put them in order w.r.t. each other. These methods are much slower for large result sets.

Jira 8

In Jira 8 we refreshed the LuceneSearchProvider API:

searchAndSort with collector is gone.
The result, SearchResults, no longer contains issues. Instead, it contains bare Lucene documents and it’s up to the caller to decide what to do with them.
Instead of loading full documents, you can ask for specific fields by passing a fieldsToLoad set. The returned documents will only contain these fields.
Most parameters were encapsulated in com.atlassian.jira.issue.search.SearchQuery.

`search` with `fieldsToLoad` vs. `search` with `Collector`

@azhdanov wrote:

in Jira 8, the API is slightly changed and there is slightly different SearchProvider.search(SearchQuery, ApplicationUser, Set<String> fieldToLoad) overload method, to achieve the same result.

This is not true. This method is more efficient than loading full documents, but it is not a replacement for searching with a collector. Using it still involves the penalty for sorting results and holding the whole result set in memory. Use this method if you need sorted results, or if you fetch only a limited number of documents (a few thousand is probably the upper bound).

Doc values

As a side note, Lucene 7 provides doc values, which Jira indexes for most of the fields. Reading doc values is faster than calling IndexReader#document, so if you only need one or two fields from all documents, consider doing that. See com.atlassian.jira.jql.query.IssueIdCollector for an example.
If you don’t want to go that deep into Lucene, you can extend com.atlassian.jira.issue.statistics.util.FieldDocumentHitCollector, which will be slower, but easier to use.

If you still need to get all the results in order

You can pass an empty set as fieldsToLoad. This way you’ll only get document ids (these are not the same as issue ids), and then you can load documents one at a time. You will pay the sorting penalty, though.
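The pattern of ordering lightweight ids first and loading full documents lazily can be sketched in plain Java. This is a toy model only: OrderedLazyLoadSketch, Doc, forEachInOrder and the loader function are hypothetical and do not correspond to Jira or Lucene classes. In the real API the search would hand back the ordered document ids; here the sketch sorts them itself to make the sorting cost visible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Toy model: order lightweight ids first, then load full documents one at a time.
final class OrderedLazyLoadSketch {

    /** Stand-in for a full document; the fields are hypothetical. */
    static final class Doc {
        final int docId;
        final String summary;

        Doc(int docId, String summary) {
            this.docId = docId;
            this.summary = summary;
        }
    }

    /**
     * Sorts only the ids (the sorting penalty is still paid), then loads and
     * processes each document in order without retaining it afterwards.
     */
    static void forEachInOrder(List<Integer> docIds, Comparator<Integer> order,
                               IntFunction<Doc> loader, Consumer<Doc> action) {
        List<Integer> sortedIds = new ArrayList<>(docIds);
        sortedIds.sort(order);
        for (int docId : sortedIds) {
            // Only the document being processed is loaded at this point.
            action.accept(loader.apply(docId));
        }
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            ids.add(i);
        }
        forEachInOrder(ids, Comparator.reverseOrder(),
                id -> new Doc(id, "summary of document " + id),
                doc -> System.out.println(doc.docId + ": " + doc.summary));
    }
}
```

The list of ids still scales with the result set, but an id is far smaller than a loaded document, so peak memory stays much lower than when materializing every document up front.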

Documentation

To see all the changes in Jira 8 related to Lucene, see Lucene upgrade.

rosy.salame · March 27, 2024, 8:22am

Hi @kcichy,

I have a big subset without a need to sort or render all fields as I only care about issue keys.
I tried to use com.atlassian.jira.jql.query.IssueIdCollector in my plugin but it is not public.
Can you please help? How can we add it as a dependency and use it for searching?

Thanks,
Rosy

m.herrmann · March 27, 2024, 9:01am

The IssueIdCollector is quite a simple class, less than 100 lines long, so you can use it as a base to create your own IssueKeyCollector. There is a constant you can use for the getSortedDocValues method parameter:
com.atlassian.jira.issue.index.DocumentConstants.ISSUE_KEY

kcichy · April 2, 2024, 7:43am

Hi @rosy.salame ,

I agree with @m.herrmann that creating your own version of IssueIdCollector is a reasonable thing to do.
Alternatively, you could add a dependency on jira-core and use it directly. This, however, is not recommended, because, unlike jira-api, jira-core does not follow SemVer and might have breaking changes in minor releases, so best to avoid depending on that module.