[Announcement] Changes to Lucene indexing

We have recently created a fix for JSWSERVER-20133. The bug causes indexing to fail when plugins with custom indexing code attempt to create very large Lucene terms or DocValues fields. It stems from Lucene’s limit of 32766 bytes for single terms or DocValues. In practice, the cases where we’ve seen it manifest were mostly attempts at indexing very large JSON documents without tokenizing them.

Starting with Jira 8.4.1, fields that exceed this limit will be removed before they are committed to Jira’s Lucene index in order to prevent entire indexing operations from failing. Each such event will emit an ERROR level message to the logs, allowing plugin developers to pinpoint the offending fields. The log entry looks like this:

2019-09-04 20:06:24,091 IssueIndexer:thread-6 ERROR admin 1206x3196x1 ujgapq 0:0:0:0:0:0:0:1 /secure/admin/IndexReIndex!reindex.jspa [c.a.j.issue.index.DocumentScrubber] A document contained a potential immense term in field customfield_10220_timeline. The field has been removed from the document.

Ultimately, the problem can only be solved by fixing the plugin’s indexing code.

  • In the case of immense terms, using a tokenized field should be enough (i.e. indexing as a TextField as opposed to a StringField).
  • In the case of DocValues, the recommended approach is to truncate the field to fit below Lucene’s 32766-byte limit.
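For the DocValues case, a minimal sketch of what such truncation could look like in plugin code. Note the cut must happen on the encoded bytes, not on character count, and must not split a multi-byte UTF-8 character; the class and method names here are illustrative, not part of any Jira or Lucene API.

```java
import java.nio.charset.StandardCharsets;

public class DocValuesTruncator {

    // Lucene rejects single terms/DocValues fields larger than this many bytes.
    static final int MAX_BYTES = 32766;

    /**
     * Returns the longest prefix of {@code value} whose UTF-8 encoding is at
     * most {@code maxBytes} bytes, without splitting a multi-byte character.
     */
    static String truncateUtf8(String value, int maxBytes) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        if (encoded.length <= maxBytes) {
            return value;
        }
        int end = maxBytes;
        // Back up past UTF-8 continuation bytes (10xxxxxx) so the cut lands
        // on a character boundary.
        while (end > 0 && (encoded[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(encoded, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String big = "x".repeat(40000);
        String safe = truncateUtf8(big, MAX_BYTES);
        System.out.println(safe.getBytes(StandardCharsets.UTF_8).length); // prints 32766
    }
}
```

The truncated string can then be passed to whatever DocValues field the plugin builds (e.g. a BinaryDocValuesField) without tripping the scrubber.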

@klopacinski, I have a somewhat related question about indexing. I opened DEVHELP-3134 with dev support a while back about bringing back array indexing in Jira 8 (community post related to this). This worked in Jira 7 and still works in Jira Cloud, but no longer works in Jira 8.

Do you have any idea whether there is an intention to fix this? I would really appreciate some clarity so we can decide how to move forward; we are currently maintaining two different implementations.