Rate limiting guide for Jira and Confluence

@dmorrow Another problem with this new approach is in the way Jira identifies the “App + user” call types. In the case of workflow post-functions (or event webhooks for that matter), the action is clearly initiated by the user, completely outside of app control. Therefore, any REST call the app makes as part of the post-function implementation should be considered “App + user”. However, for a number of technical reasons, most REST calls the app needs to make are not made using JWT user impersonation but regular JWT authentication (“as the add-on user”), which means these calls fall into the “app” call type instead.

There should be a way to indicate that the calls are made in response to user interaction in Jira, possibly using the Webhooks Trace Header or something equivalent.

1 Like

Hi Guys,

Thanks for the feedback. There’s some great points that I’d like to address, but I’ll need a few days to work through them. I’m on leave today so I’ll aim for a response mid next week.

Regards,
Dugald

3 Likes

Hey @dmorrow, would love an update on the points mentioned above. Did you have the time to work through them?

Also I’d like to mention that in its current incarnation, the Retry-After header is not very useful. For us, it has only returned values of 60 and 300 seconds for status 429 responses so far (different values for other statuses). In my experience with Jira, those requests would have succeeded after a much shorter wait. It’s difficult for us to tell our users that they have to wait for 5 minutes if it’s not entirely necessary. It would be great if the header could be populated with more realistic values so we can actually use them and don’t have to do our own guesswork. Thanks!
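As an illustration of working around this, a client could honour the Retry-After header but cap it rather than trusting the coarse values outright. A minimal sketch in Python (the 60-second cap is an assumption chosen for illustration, not anything Atlassian documents):

```python
import random


def backoff_delay(attempt, retry_after=None, cap=60):
    """Choose how long to wait before retrying a 429'd request.

    Honours the Retry-After header when present but caps it, since the
    server-reported values (only ever 60 or 300 seconds in our case) seem
    far more conservative than necessary. Without the header, falls back
    to exponential backoff with full jitter.
    """
    if retry_after is not None:
        return min(retry_after, cap)
    return random.uniform(0, min(cap, 2 ** attempt))
```

The trade-off: capping a 300-second Retry-After risks an immediate second 429, so the retry loop still needs to tolerate repeated limit responses.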

2 Likes

I’ve gone through the comments and actioned them by creating issues. Here’s a summary of all the issues, including the few I raised previously:

  • ACJIRA-2555 & CONFCLOUD-71116: As an app developer, I need to know the rate limits that my app is subject to
  • ACJIRA-2554 & CONFCLOUD-71203: As an app developer, I need to be able to validate my app’s rate limit response handling
  • AC-2542 & AC-2543: Incorporate best practice rate limit response handling into Atlassian Connect Express and Spring Boot
  • ACJIRA-2356: Provide a means of attributing API calls to users in response to user initiated webhooks.
  • ACJIRA-2358: Add the ability to retrieve cost budgets

Regarding ACJIRA-2554, the comment from @lzachulski on April 8 indicates testing of Jira’s rate limiting may be possible, but I’m requesting the issue status be updated to confirm this.

Regarding AC-2542 and AC-2543, my guess is that the team owning these frameworks is unlikely to be able to prioritise this work. The projects are open source, so maybe the community could propose improvements? Note that there is still the risk that PRs may not get approved.

I have an internal page summarising these issues and have shared this with the Jira Cloud Ecosystem, Confluence Cloud Ecosystem and Ecosystem Platform teams.

Please vote and comment on the issues to help advocate for them.

Regards,
Dugald

5 Likes

Hi @dmorrow ,
[AC-2542] - Ecosystem Jira is not visible to non-Atlassians.
Thanks,
David

2 Likes

Thanks @david2 , I’ve fixed this now.

3 Likes

In addition to my previous response, there are a number of issues relating to inconsistent and invalid error codes returned when limits are reached under various scenarios:

Regards,
Dugald

8 Likes

@dmorrow apologies if this should be a separate topic but this felt like the most appropriate place to begin…

A few days ago we received four times our usual number of webhooks per hour, triggered, as far as we can tell, by a single tenant. We usually see peaks of 500,000 webhooks per hour from Atlassian across all tenants, but we received just under 2 million requests between 14:30 and 16:00 UTC, of which 3/4 were from a single tenant.

Our services were not provisioned to handle that spike in load or autoscale and as a consequence fell over badly while we manually scaled them up and out. The outage triggered a notification from the Fortified Apps monitoring that Atlassian has in place, because we no longer successfully processed 99% of requests in a 15-minute period.

I’d like to work with Atlassian to publish some webhook rate limits that apps must be able to support, because currently there’s an implicit expectation for apps to be able to scale our services to process an infinite number of inbound webhook requests per second.

6 Likes

Hi @jbevan ,

Thanks for raising this. I agree we need to put a solution in place that will avoid this type of incident occurring in the future. I’ve created ACJIRA-2497 and will reach out to the Jira Ecosystem team to see if they can prioritise it.

Regards,
Dugald

2 Likes

Any traction on this internally? We can’t just return 429s when we get spikes in inbound webhooks because we’ll violate our Fortified Apps SLA if I understand things correctly…

Recent examples of individual tenants making waves in our infra:

5 Likes

@dmorrow @ibuchanan @Miro.Capka @nmansilla is anyone at Atlassian able to engage with us on this topic?

+1
David

Hi @jbevan and @david2 , thanks for raising this. I’ve drafted an internal proposal outlining a suggested way forward for the Cloud Fortified and Ecosystem PMs to review.

@dmorrow tagging you because you’ve been active on this post and on the threads “Are there rate limits for JIRA Cloud APIs?” (#32, by DannyGrenzowski) and “Rate limiting Response handling pseudo code” (#4, by OndejMedek).

Can you confirm that rate limiting logic needs to be implemented on a per-user basis, not a per-request basis? I came to that conclusion after some discussion with @gtaylor - if so, can the docs get updated to make that more clear?

I initially implemented adherence to the rate limiting headers on a per-request basis and saw the number of 429s/5xxs quadruple because I wasn’t keeping track of which users had been limited, and wasn’t delaying all subsequent requests for that user for the relevant time period.

@jbevan,

I’m not sure what you mean by “per request”; I guess per resource? In any case, no. In the rate limiting docs, the section on implementation overview explains the dimensions:

Call Type                                  Cost Budget
User (Jira front end initiated action)     User
JWT (app server to Jira API)               App
JWT user impersonation                     App + user
Connect app iframe (AP.request)            App + user
OAuth 2.0 (3LO)                            App + user
Forge asUser()                             App + user
Forge asApp()                              App
Anonymous                                  Anonymous

For your context as a Connect app, I think you are implicitly already limiting “per app” so adding “per user” is correct.
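For reference, the table above could be encoded as a simple lookup when classifying an app's outbound calls. A sketch in Python (the key names are illustrative shorthand, not official identifiers from the docs):

```python
# Which cost budget a request counts against, per the rate limiting docs.
# Keys are shorthand for the auth mechanism used to make the call.
COST_BUDGET = {
    "user": "user",                    # Jira front end initiated action
    "jwt": "app",                      # app server to Jira API
    "jwt-impersonation": "app+user",   # JWT user impersonation
    "ap.request": "app+user",          # Connect app iframe
    "oauth2-3lo": "app+user",          # OAuth 2.0 (3LO)
    "forge-asUser": "app+user",
    "forge-asApp": "app",
    "anonymous": "anonymous",
}
```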

2 Likes

I meant literally on a per-request basis: we’d check the response code and headers and back off retrying that specific request (and only that request) if we received a 429 or 5xx, using the values in the documented headers.

It really wasn’t clear to me that the cost budgets meant that we needed centralised knowledge of all requests being made from different microservices in our infrastructure in order to keep track of whether a “cost budget” was being rate limited.

We have at least two background processes doing work in Jira on behalf of users, while we also have a service that serves up the user interface and makes requests to Jira as part of handling some user interface requests. I think I’m right in saying that we now need a centralised service to handle all the requests those three services make to Jira, keep track of whether any given user is being rate limited, and throttle/delay all subsequent requests made from any of the three services. Plus we’d have to update our frontend codebase to handle rate limiting when we use AP.request… I’m not sure how we’ll centralise the knowledge that AP.request should pre-emptively delay requests because background service #2 has breached the limit, or vice versa, but I think at least I understand what we should be doing.
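The centralised tracking described above might look roughly like this. A hypothetical sketch in Python (names are illustrative; in a multi-service setup the state would live in a shared store such as Redis rather than in-process):

```python
import threading
import time


class RateLimitGate:
    """Shared record of active rate limits, keyed by cost budget.

    All services route their "may I call Jira now?" checks through one
    instance, so a 429 seen by one background worker delays calls against
    the same budget from every other service too.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked_until = {}  # budget key -> unix timestamp

    @staticmethod
    def budget_key(app_key, account_id=None):
        # "app" budget for plain JWT calls; include the account id for
        # impersonated ("app + user") calls so budgets are tracked apart.
        return (app_key, account_id)

    def record_429(self, key, retry_after_seconds):
        """Record a rate limit response against this budget."""
        with self._lock:
            until = time.time() + retry_after_seconds
            self._blocked_until[key] = max(self._blocked_until.get(key, 0), until)

    def wait_time(self, key):
        """Seconds to delay before the next call against this budget (0 = go)."""
        with self._lock:
            return max(0.0, self._blocked_until.get(key, 0) - time.time())
```

A caller would check `wait_time()` before each request and feed any 429’s Retry-After back in via `record_429()`.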

Please could you get some clarification added to the documentation to really spell out what app vendors need to do in this area?

4 Likes

Hi @jbevan,

I believe you are right, since Jira doesn’t know anything about the threads in an app’s backend. This means that apps with concurrent processing making API calls against the same cost budgets have a couple of choices for rate limit response handling: (a) some sort of coordination between threads to share rate limit response data, so that invocations respect rate limit responses received by other threads, or (b) more aggressive backoff and retry handling.

Option (b) would be simpler to implement, but may not be as performant and may need to be tuned from time to time. Option (a) is harder to implement, especially since it may need to deal with different cost budgets. As you point out, option (a) would also require distributing rate limit response info between your app’s front end and backend if your app makes impersonated API calls from the backend; however, I assume rate limiting against the app + user cost budget is relatively rare in comparison to rate limiting against the app cost budget.
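Option (b) might look something like the following sketch (hedged, not official guidance; `make_request` is a stand-in for whatever HTTP call the app actually makes):

```python
import random
import time


def call_with_backoff(make_request, max_attempts=6, base=2.0, cap=120.0):
    """Per-thread aggressive backoff with no shared state (option b).

    `make_request` is a hypothetical zero-argument callable returning
    (status_code, body). Retries on 429 and 5xx using exponential backoff
    with full jitter; simpler than coordinating threads, at the cost of
    wasted attempts under sustained rate limiting.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = make_request()
        if status != 429 and status < 500:
            return status, body
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    return status, body  # caller decides what to do after exhausting retries
```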

Regards,
Dugald

1 Like

@dmorrow @ibuchanan are either of you able to get the docs updated with this information?

1 Like

@jbevan , thanks for the prompt. I’ll take on that action. 🙂

A post was split to a new topic: How to implement rate-limiting for Confluence Cloud?