New Jira Cloud Webhook Retry Policy

kkercz · July 22, 2019, 11:05am

Hi everyone,

I’m happy to announce that Jira Cloud webhooks have recently become much more resilient. We’ve rolled out a new retry mechanism that repeats failed requests.

Retries are attempted when any of the following are true:

the callback server returns any of the following status codes: 408, 409, 425, 429, 5xx.
the connection fails or times out.
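As a rough illustration of the two conditions above, the check can be sketched as a small Python predicate. The function name and the use of `None` to represent a failed or timed-out connection are assumptions made for this example; it is not Atlassian's implementation.

```python
from typing import Optional

# Individual status codes listed above as triggering a retry (5xx is handled as a range).
RETRYABLE_STATUS_CODES = {408, 409, 425, 429}


def would_be_retried(status_code: Optional[int]) -> bool:
    """Return True if a delivery outcome matches the retry conditions listed above.

    `status_code` is None when the connection failed or timed out,
    i.e. no HTTP response was received (a convention assumed for this sketch).
    """
    if status_code is None:
        return True
    return status_code in RETRYABLE_STATUS_CODES or 500 <= status_code <= 599
```

Under these assumptions, a 200 or a 404 from the callback server would not lead to a retry, while a 503 or a timeout would.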

This means that:

some webhooks might be delivered more than once (if the delivery acknowledgment fails).
webhooks might be delivered later than usual (up to 30 minutes, subject to change).
you might need to modify your integrations to take this into account (e.g. check the webhook timestamp or the special retry header, more on which in the next paragraph).

The X-Atlassian-Webhook-Retry header with the current retry count is included with webhooks that have been retried. We recommend monitoring this header and cross-referencing it with the callback server logs to stay on top of any unexpected reliability problems.
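A minimal sketch of reading this header on the receiving side, in Python. It assumes the header value is a plain integer and that the headers are available as a simple mapping; the helper name `retry_count` is hypothetical.

```python
from typing import Mapping

RETRY_HEADER = "X-Atlassian-Webhook-Retry"


def retry_count(headers: Mapping[str, str]) -> int:
    """Return the retry count, or 0 when the header is absent or not a number."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == RETRY_HEADER.lower():
            try:
                return int(value)
            except ValueError:
                # Assumption: treat an unparsable value like a first delivery.
                return 0
    return 0
```

A callback handler could log any non-zero result next to the request timestamp, which makes it easier to cross-reference retries with the server logs.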

We hope this will allow you to fully rely on webhooks without needing to resort to periodic polling.

See the official documentation here: https://developer.atlassian.com/cloud/jira/platform/webhooks/#retry-policy

maciej.dudziak · July 22, 2019, 2:18pm

Hi, does this mean it is live?

Is my understanding correct: if an app’s server returns any of the mentioned status codes in response to the webhook, Jira will send the webhook again?

david2 · July 22, 2019, 6:20pm

Hi @kkercz,
does this apply to post-function/triggered calls too (since they rely on the same mechanism as webhooks)?

remie · July 22, 2019, 9:16pm

Just to be clear: this can have a huge impact both for Atlassian and vendors, because previously vendors may have treated webhooks as a one-off, hit-and-run type of request.

Implementing a retry mechanism with acknowledgement (correct response codes and no timeout) might result in a cascading number of requests for vendors who have not previously handled requests properly.

You might want to monitor the number of requests that are being sent to vendors and see if this increases gradually over time. Also, you might have wanted to notify vendors prior to rolling it out.

kkercz · July 23, 2019, 6:56am

Yes, it’s live.

That’s correct.

Yes.

True. We realise the rollout of this was not perfect. I guess we got carried away by the prospect of potential value and the desire to solve the long-standing pain point as soon as possible, without properly assessing the implications.

However, we do have monitoring and so far the negative impact was minimal. Fortunately, even if someone was previously not acknowledging webhooks correctly, it is easy to notice and fix.
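For illustration only, one common way to acknowledge webhooks reliably is to store the payload and respond right away, then do the slow work later. The Python sketch below assumes an in-process queue and a hypothetical `receive_webhook` function that returns the status code to send back; a real app would plug this into its own web framework and job system.

```python
import queue

# In-process queue standing in for whatever job system the app really uses.
pending = queue.Queue(maxsize=1000)


def receive_webhook(body: bytes) -> int:
    """Store the payload for later processing and return the HTTP status to send back."""
    try:
        pending.put_nowait(body)
    except queue.Full:
        # The payload could not be accepted. 503 is a 5xx code, which is on the
        # retry list from the announcement, so the webhook should be sent again.
        return 503
    # A 200 is not on the retry list, so it acknowledges the delivery.
    return 200
```

Responding before the heavy processing starts also reduces the chance of a timeout, which is the other condition that triggers a retry.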

tobias.viehweger · July 23, 2019, 10:03am

@kkercz Is this also impacting the initial delivery of webhooks? We are seeing quite some delay between, for example, the UI action (e.g. updating an issue) and receiving a webhook (more than 5 min)… This is unfortunate because we are using this for syncing stuff, and the delay might confuse customers…

mszerszynski · July 23, 2019, 11:13am

@tobias.viehweger the initial delivery of webhooks should not be affected - the delay applies only to the retries.

acalantog · July 23, 2019, 2:48pm

Hi,

We’ve received reports about the delay in Devhelp and have created a public-facing issue for it. Please see [ACJIRA-1908] - Ecosystem Jira

Cheers,
Anne

jtrzebiatowski · July 25, 2019, 8:48am

Hi everyone,

Due to the high load that the retries put on our infrastructure, we’ve temporarily reduced the number of retries to 1.

stevemac · July 29, 2019, 2:59am

Sounds like a worthwhile enhancement. Does this apply to webhooks initiated from bulk updates to issues?

kkercz · July 29, 2019, 6:59am

This applies to all Jira Cloud webhooks; they are all sent in the same way.

konrad.garus1 · July 29, 2019, 7:00am

Awesome, now we need full JQL support for dynamic webhooks.

remie · July 29, 2019, 8:26am

Could it be that this is because vendors were unaware of the change and have yet to implement the correct request handling with proper response?

remie · July 29, 2019, 9:30am

My apologies for sounding a bit sarcastic, but this strikes me as very basic stakeholder management for which product owners are responsible, especially for changes that require adjustments from third parties. I expect better from a company that wishes to be an advocate for software development best practices.

george1 · August 7, 2019, 5:32am

Great update! Besides all the technicalities, it is a move in the right direction!

ola.melin · August 19, 2019, 7:26am

Hi,
This is a nice addition to the webhooks!

Are these changes applied to the lifecycle events as well?

Cheers

kkercz · August 19, 2019, 7:40am

Yes, this applies to all webhooks in Jira Cloud.

BenRomberg · September 19, 2019, 3:36pm

Hi Krzysztof,

thanks a lot for introducing retries! How would you propose to deduplicate requests? Is the request body guaranteed to be the same on the retry? Or is there another header with a correlation ID that we could use (staying the same across retries for one request)?

kkercz · September 19, 2019, 5:25pm

Hi,

The body is always the same, so if you are afraid of duplicates, just keep an eye on the “X-Atlassian-Webhook-Retry” header and discard retried webhooks that you believe you’ve already processed.
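One possible reading of this advice, sketched in Python: remember a hash of every processed body and skip a webhook only when it carries the retry header and its body has already been seen. The names `seen_bodies` and `should_process` are hypothetical, and the in-memory set is an assumption for brevity; a real app would likely use persistent storage with expiry.

```python
import hashlib
from typing import Optional

# Hashes of payloads that were already processed (unbounded here for brevity).
seen_bodies = set()


def should_process(body: bytes, retry_header: Optional[str]) -> bool:
    """Return False for a retried webhook whose identical body was already processed."""
    digest = hashlib.sha256(body).hexdigest()
    if retry_header is not None and digest in seen_bodies:
        # Retried delivery of a payload we have already handled: discard it.
        return False
    seen_bodies.add(digest)
    return True
```

Note that this approach relies on the body alone, so it cannot tell a retry apart from a separate webhook that happens to carry an identical body.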

BenRomberg · September 20, 2019, 8:53am

Thanks for the confirmation.

Ideally we’d like to have another header that allows us to relate retries to their original requests. In theory, we might have multiple (separate) requests having exactly the same body within a short time period. If some of those are retried it’s impossible for us to know how many unique, non-retried requests there were to begin with.

Example:
00:00 request 1, body “ABC”, retried header = null
00:01 request 2, body “ABC”, retried header = 1

We cannot tell if request 2 is now a retry of request 1 or another, separate request altogether, whose initial request got lost due to connection issues. This could be fixed by introducing a correlation header, e.g.:

00:00 request 1, body “ABC”, retried header = null, correlation header = “def”
00:01 request 2, body “ABC”, retried header = 1, correlation header = “def”

Now we know for sure that request 2 was a retry of request 1.

00:00 request 1, body “ABC”, retried header = null, correlation header = “def”
00:01 request 2, body “ABC”, retried header = 1, correlation header = “ghi”

Now we know for sure that request 2 was a separate request and not a retry of request 1.
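If such a header existed, deduplication on the receiving side could be as simple as the Python sketch below. The correlation ID is purely the proposal from this post, not an existing Jira header, and the names `processed_ids` and `is_duplicate` are made up for the example.

```python
# Correlation IDs of requests that were already handled (hypothetical header value).
processed_ids = set()


def is_duplicate(correlation_id: str) -> bool:
    """Return True if a request with this correlation ID was already handled."""
    if correlation_id in processed_ids:
        return True
    processed_ids.add(correlation_id)
    return False
```

With the examples above, the second request with correlation header “def” would be flagged as a duplicate, while one with “ghi” would be processed as a new request regardless of its body.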

It’s very important for our app to properly detect and ignore any duplicate requests, so doing deduplication e.g. with a hash of the body might not work for all of our customers.