@dmorrow apologies if this should be a separate topic but this felt like the most appropriate place to begin…
A few days ago we received roughly 4x our usual hourly webhook volume, triggered, as far as we can tell, by a single tenant. We usually see peaks of 500,000 webhooks per hour from Atlassian across all tenants, but we received just under 2 million requests between 14:30 and 16:00 UTC, of which roughly three quarters came from a single tenant.
Our services were not provisioned to handle that spike in load or to autoscale, and as a consequence they fell over badly while we manually scaled them up and out. The outage triggered a notification from the Fortified Apps monitoring that Atlassian has in place, because we were no longer successfully processing 99% of requests in a 15min period.
I’d like to work with Atlassian to publish some webhook rate limits that apps must be able to support, because currently there’s an implicit expectation that apps can scale their services to process an unbounded number of inbound webhook requests per second.
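For anyone else hitting this, one common mitigation (not what we had in place at the time) is to acknowledge webhooks immediately and buffer them for asynchronous processing, so a spike fills a queue rather than overwhelming the workers. A minimal TypeScript sketch; the `enqueue` helper is hypothetical, standing in for SQS/Kafka/whatever broker you use:

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical queue client; swap in your broker of choice,
// e.g. sqs.sendMessage(...) or kafka.producer.send(...).
async function enqueue(payload: unknown): Promise<void> {
  // push payload onto a durable queue for background workers
}

app.post("/webhooks/jira", async (req, res) => {
  try {
    await enqueue(req.body); // cheap, constant-time work on the hot path
    res.sendStatus(202);     // accepted for asynchronous processing
  } catch {
    res.sendStatus(503);     // buffer unavailable; sender may retry
  }
});

app.listen(8080);
```

This doesn’t remove the need for published limits, but it shifts the scaling problem from request handling to queue consumption, which is much easier to autoscale.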
Thanks for raising this. I agree we need to put a solution in place that will avoid this type of incident occurring in the future. I’ve created ACJIRA-2497 and will reach out to the Jira Ecosystem team to see if they can prioritise it.
Any traction on this internally? We can’t just return 429s when we get spikes in inbound webhooks, because if I understand things correctly we’d violate our Fortified Apps SLA…
Recent examples of individual tenants making waves in our infra:
Hi @jbevan and @david2 , thanks for raising this. I’ve drafted an internal proposal outlining a suggested way forward for the Cloud Fortified and Ecosystem PMs to review.
Can you confirm that rate limiting logic needs to be implemented on a per-user basis, not a per-request basis? I came to that conclusion after some discussion with @gtaylor - if so, can the docs get updated to make that more clear?
I initially implemented adherence to the rate limiting headers on a per-request basis and saw the number of 429s/5xxs quadruple because I wasn’t keeping track of which users had been limited, and wasn’t delaying all subsequent requests for that user for the relevant time period.
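For anyone else who makes the same mistake, here’s a minimal sketch of what per-user tracking looks like, assuming the documented `Retry-After` response header; `userKey` and `callJiraAsUser` are illustrative names:

```typescript
// Track the limit per user (the cost budget), not per request, so every
// subsequent call on behalf of that user waits out the same window.
const retryAt = new Map<string, number>(); // userKey -> epoch ms when calls may resume

async function callJiraAsUser(userKey: string, url: string): Promise<Response> {
  // Pre-emptively delay if this user's budget was recently limited.
  const wait = (retryAt.get(userKey) ?? 0) - Date.now();
  if (wait > 0) await new Promise((r) => setTimeout(r, wait));

  const res = await fetch(url /* plus auth for userKey */);
  if (res.status === 429 || res.status >= 500) {
    const header = res.headers.get("Retry-After");
    const seconds = header !== null && Number(header) > 0 ? Number(header) : 5;
    retryAt.set(userKey, Date.now() + seconds * 1000);
  }
  return res;
}
```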
I’m not sure what you mean by “per request”; I guess per resource? In any case, no. In the rate limiting docs, the section on implementation overview explains the dimensions:
| Call Type | Cost Budget |
| --- | --- |
| User (Jira front end initiated action) | User |
| JWT (app server to Jira API) | App |
| JWT user impersonation | App + user |
| Connect app iframe (AP.request) | App + user |
| OAuth 2.0 (3LO) | App + user |
| Forge asUser() | App + user |
| Forge asApp() | App |
| Anonymous | Anonymous |
For your context as a Connect app, I think you are implicitly already limiting “per app” so adding “per user” is correct.
I meant literally on a per-request basis: we’d check the response code and headers and back off retrying that specific request (and only that request) if we received a 429 or 5xx, using the values in the documented headers.
It really wasn’t clear to me that the cost budgets meant we needed centralised knowledge of all the requests being made from different microservices in our infrastructure in order to keep track of whether a given cost budget was being rate limited.
We have at least two background processes doing work in Jira on behalf of users, plus a service that serves up the user interface and makes requests to Jira as part of handling some user interface requests. I think I’m right in saying that we now need a centralised service that tracks all the requests those three services make to Jira, keeps track of whether any given user is being rate limited, and throttles/delays all subsequent requests made from any of the three services. On top of that we’d need to update our frontend codebase to handle rate limiting when we use AP.request… I’m not sure how we’ll centralise the knowledge that AP.request should pre-emptively delay requests because background service #2 has breached the limit, or vice versa, but I think I at least understand what we should be doing.
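To make the centralised option concrete, here’s a rough sketch of what I have in mind, assuming a shared Redis instance that all three services (and a small backend endpoint on behalf of AP.request) can reach; the key names are illustrative:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Cost budget key: "app" for app-level calls, "app:<userKey>" for
// impersonated calls, matching the dimensions in the table above.
function budgetKey(userKey?: string): string {
  return userKey ? `ratelimit:app:${userKey}` : "ratelimit:app";
}

// Any service that receives a 429 records it under the budget key;
// the TTL makes the block expire when the rate limit window ends.
export async function recordLimit(seconds: number, userKey?: string): Promise<void> {
  await redis.set(budgetKey(userKey), "1", "EX", Math.ceil(seconds));
}

// Every service checks the shared key before calling Jira, so a limit
// breached by background service #2 also delays the other services.
export async function isLimited(userKey?: string): Promise<boolean> {
  return (await redis.exists(budgetKey(userKey))) === 1;
}
```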
Please could you get some clarification added to the documentation to really spell out what app vendors need to do in this area?
I believe you are right, since Jira doesn’t know anything about the threads in an app’s backend. This means that apps with concurrent processing making API calls against the same cost budgets have a couple of choices when it comes to rate limit response handling:

(a) some sort of coordination between threads to share rate limit response data, so that invocations respect rate limit responses received by other threads; or
(b) more aggressive backoff and retry handling.

Option (b) would be simpler to implement, but may not be as performant and may need to be tuned from time to time, whereas option (a) is harder to implement, especially considering it may need to deal with different cost budgets. As you point out, option (a) would also require distributing rate limit response info between your app’s front end and backend if your app makes impersonated API calls from the app’s backend. However, I assume app + user cost budget rate limiting is relatively rare in comparison to rate limiting against the app cost budget.
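To illustrate option (b), a minimal sketch of independent per-thread backoff that honours `Retry-After` when present and otherwise falls back to exponential delay with jitter; the attempt cap and delay ceiling are illustrative values you’d tune:

```typescript
async function withBackoff(
  doRequest: () => Promise<Response>,
  maxAttempts = 6
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt + 1 >= maxAttempts) return res; // give up, surface the error

    const header = res.headers.get("Retry-After");
    const base =
      header !== null && Number(header) > 0
        ? Number(header) * 1000                    // server told us how long
        : Math.min(2 ** attempt * 1000, 60_000);   // exponential, capped at 60s
    const jitter = Math.random() * base * 0.2;     // de-synchronise threads
    await new Promise((r) => setTimeout(r, base + jitter));
  }
}
```

Since there’s no shared state, concurrent threads may each burn a few retries against the same exhausted budget; the jitter spreads those retries out so they don’t all land at once.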