Rate limiting guide for Jira and Confluence

dmorrow · November 27, 2020, 2:38am

Recently we added a guide explaining how rate limiting works in Jira and Confluence. The same guide applies to both Jira and Confluence, but is available in both documentation sets:

Jira: https://developer.atlassian.com/cloud/jira/platform/rate-limiting/
Confluence: Rate limiting

We realise the current rate limiting solution has pitfalls. There is a section at the end of the guide identifying the major known deficiencies.

david2 · November 27, 2020, 3:15am

Hi @dmorrow,

thanks for sharing this documentation, which is very useful.

I have a question on the Rate Limiting documentation for Jira.

The documentation states the following:

You can retry a failed request if:

It is safe to do so from the perspective of the API (e.g. the API being called is idempotent).

A small retry threshold has not been reached.

The response indicates a retry is appropriate ( Retry-After header and/or 429 status).

I assume that means if all conditions are met.

Does it mean that an app should retry any REST call if and only if it receives either a 429 error or a Retry-After header? And thus that any other error (403, 500, 503, 509, which are also returned by certain REST endpoints in overloading situations) should not be retried unless they are accompanied by a Retry-After header? A common error code when Jira is overloaded is a 403 error with no error message in the body (in fact, an HTML response body instead of a JSON response body), usually indicating a DB connection pool error. Are these errors marked with a Retry-After header?

In general, it would be nice to include in the documentation a full, validated pseudo-code algorithm showing how exactly apps should handle errors related to rate limits, including how to identify them (which is not shown in the current algorithm). For that, you’d need to test all REST endpoints one by one by throwing a million requests at them until they fail, and check the error code they return then and add that case to the algorithm.

Thanks,
David

dmorrow · November 27, 2020, 4:03am

Hi @david2,

Thanks for your comments and keeping me on my toes.

Firstly, I should make it clear that the guide provides recommendations rather than rules. I wrote the guide from a somewhat theoretical standpoint because I don’t have a real world app stressing APIs and consequently don’t have first hand experience in dealing with rate limit responses from these APIs. If the guide doesn’t cater to specific use cases, then I’m happy to factor in improvements.

I intended an AND relationship between the bullet list items. I’ll fix the guide to make this clear.
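
To illustrate the AND relationship, here is a minimal sketch of the retry decision. It is not taken from the guide: the `ApiResponse` shape, the `shouldRetry` name and the `MAX_RETRIES` value are all assumptions made for the example, and header names are assumed to be lower-cased already.

```typescript
// Hypothetical response shape; only the fields the decision needs.
interface ApiResponse {
  status: number;
  headers: Record<string, string>; // header names assumed to be lower-cased
}

// Assumed value for the "small retry threshold"; the thread does not fix a number.
const MAX_RETRIES = 4;

// Returns true only when all three conditions hold (AND, not OR).
function shouldRetry(
  response: ApiResponse,
  isIdempotent: boolean,
  retriesSoFar: number
): boolean {
  // 1. Safe from the perspective of the API (e.g. the call is idempotent).
  const safeToRetry = isIdempotent;
  // 2. The small retry threshold has not been reached.
  const underThreshold = retriesSoFar < MAX_RETRIES;
  // 3. The response indicates a retry is appropriate.
  const retryIndicated =
    response.status === 429 || "retry-after" in response.headers;
  return safeToRetry && underThreshold && retryIndicated;
}
```

Under this sketch a 403 or 500 without a Retry-After header is never retried, which matches the conservative reading of the three conditions; whether that is the right policy for those responses is a separate question.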

With regard to the various error responses you referred to (403, 500, etc), there may be circumstances where a retry makes sense, but it’s somewhat out of scope of the rate limiting guide. If there are specific use cases and error responses that should be added to the guide, then maybe we can discuss them in more detail. For instance, I wasn’t aware of the 509 error code.

Regards,
Dugald

david2 · November 27, 2020, 5:19am

Hi @dmorrow,
I didn’t realize you were the author of that documentation. I thought it came directly from the Jira dev team. My comments must have felt personal and I apologize for that.

And that indeed explains why you can be neither prescriptive nor exhaustive. It would require an in-depth knowledge of how exactly each REST API endpoint is coded and implements rate-limiting (or fails to do so and thus reacts to errors when overloaded).

And of course that’s probably the root of the problem - the various tickets on EAN about implementing a real rate-limiting solution are still pending.

Anyway, based on your knowledge of the API, would you recommend trusting only the 429 error code and the Retry-After header, and treat all other errors as permanent failures?

Also, is there any way to know which REST endpoints can return a 429 error or a Retry-After header? Is this something that could be added to the official REST API documentation, alongside the other errors that are already documented?

Thanks again,
David

dmorrow · November 27, 2020, 6:03am

Hi @david2,

No problems, I didn’t take offence. I think there are responses other than 429 which may be transient, but there may not be a means of determining if or when a retry may be successful. I can see how this leads to your suggestion of treating them as a permanent error, at least as far as retries are concerned. Having said that, I can imagine some use cases where we might be able to provide more specific guidance regarding retries.

In terms of documenting which APIs can return 429 errors, my understanding of the implementation is that any API can return a rate limit response, however, some may be more likely than others since it depends on the resources consumed by their implementation.

Regards,
Dugald

dmorrow · March 25, 2021, 1:53am

Update as of March 24, 2021

The Jira rate limiting implementation has been updated to introduce cost budgets. This provides a level of isolation between apps and user interactions and will consequently result in more consistent behaviour. The Jira rate limiting guide has been updated to reflect the new implementation. The new implementation is semantically compatible with the old implementation and will be phased in by tuning configuration thresholds in both implementations such that the new implementation becomes active and the old implementation becomes inactive.

david2 · March 25, 2021, 2:07am

Thanks @dmorrow for sharing.
How can we provide feedback on this new approach?

dmorrow · March 25, 2021, 3:18am

Hi @david2 ,

That’s a good question. This change in Jira means that Jira and Confluence no longer share the same implementation so I think it would be best to create topics in the Jira Cloud and Confluence Cloud categories as you would for any other topic you’d like to discuss.

Regards,
Dugald

marc · March 25, 2021, 10:50am

As all developers have to take care of rate limiting, I think it makes sense to implement this in e.g. ACE in the httpClient function as an option when instantiating the httpClient.

jbevan · March 25, 2021, 11:17am

So… what changes should we expect to see here?

If Atlassian built Connect Apps they’d experience the pain of having absolutely no idea how rate limiting actually works in practice, despite your helpful write up.

While I understand there are 4 cost budgets now, I have no idea how those cost budgets compare to each other, or to REST API behaviour from 6 or 12 months ago.

Additionally I’m not allowed to run performance tests to figure any of this out according to the docs you link to, and anecdotally I know that rate limiting behaves differently on different Jira instances, based on experience from back before vendors were told not to run performance tests. Unless that has changed? I can’t run performance tests to find out…

While I really appreciate that you’re trying to provide useful information for vendors, it’s really hard to build a performant service that does not violate rate limits based on the information that has been provided.

If customers complain about performance being slow we have no idea whether we can increase concurrency safely. If customers complain about feature failures we have no way to establish how much we need to reduce the concurrency or frequency of requests, unless we “do it live” with production traffic.

Apologies for the rant. I appreciate this is a non-trivial problem to solve and that Atlassian have other priorities. I’d just really like to hear that it is a problem that is actually being worked on, because it’s really painful for vendors.

jbevan · March 25, 2021, 11:23am

Anecdotally we’re now seeing this error message in our automated browser-based tests when trying to view workflow postfunctions:

BobBergman · March 25, 2021, 2:46pm

A reference implementation in each of the Atlassian maintained Connect frameworks would be priceless, if not at least a gist with real working code somewhere.

RyanRules · March 25, 2021, 2:58pm

Hi @dmorrow , thank you for the update. The bottom of that page lists known feature requests / Jira issues on Atlassian’s backlog relating to rate limiting, thresholds etc. I might recommend one more, which is [ACJIRA-2337] - Ecosystem Jira, concerning how Atlassian sends webhook events based on bulk actions from users, which can in some cases cause the opposite effect of sending a large number of requests to a vendor’s infrastructure.

david2 · March 25, 2021, 5:33pm

@dmorrow Another problem with this new approach is in the way Jira identifies the “App + user” call types. In the case of workflow post-functions (or event webhooks for that matter), the action is clearly initiated by the user, completely outside of app control. Therefore, any REST call the app makes as part of the post-function implementation should be considered as “App + user”. However, because of multiple technical reasons, most REST calls the app needs to make are not done using JWT user impersonation but regular JWT authentication (“as the add-on user”), which means that these calls fall into the “app” call type instead.

There should be a way to indicate that the calls are made in response to user interaction in Jira, possibly using the Webhooks Trace Header or something equivalent.

dmorrow · March 25, 2021, 9:27pm

Hi Guys,

Thanks for the feedback. There are some great points that I’d like to address, but I’ll need a few days to work through them. I’m on leave today so I’ll aim for a response mid next week.

Regards,
Dugald

BenRomberg · April 7, 2021, 10:31am

Hey @dmorrow, would love an update on the points mentioned above. Did you have the time to work through them?

Also I’d like to mention that in its current incarnation, the Retry-After header is not very useful. For us, it has only returned values of 60 and 300 seconds for status 429 responses so far (different values for other statuses). In my experience with Jira, those requests would’ve succeeded with much less waiting time as well. It’s difficult for us to tell our users that they have to wait for 5 minutes if it’s not entirely necessary. It would be great if the header responses could be adjusted to more realistic values so we can actually use them and don’t have to do our own guesswork. Thanks!
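
For illustration, here is a rough sketch of how an app might turn a Retry-After value into a wait time and give up early when that wait is longer than a user can reasonably tolerate. None of this comes from the guide: the function names, the 1 second base and the 30 second backoff cap are assumptions made for the example. The header is parsed as either delta-seconds or an HTTP-date, and exponential backoff with jitter is used only when no usable header is present.

```typescript
// Parses a Retry-After value: either delta-seconds ("60") or an HTTP-date.
// Returns the wait in milliseconds, or null if the value cannot be parsed.
function parseRetryAfterMs(value: string, nowMs: number): number | null {
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) {
    return null;
  }
  return Math.max(0, dateMs - nowMs);
}

// Exponential backoff with full jitter: a random wait below a growing ceiling.
function backoffMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Returns how long to wait before the next attempt, or null when the wait
// exceeds maxWaitMs and the caller should fail fast instead of retrying.
function nextDelayMs(
  retryAfter: string | undefined,
  attempt: number,
  maxWaitMs: number,
  nowMs: number = Date.now(),
  random: () => number = Math.random
): number | null {
  const fromHeader =
    retryAfter === undefined ? null : parseRetryAfterMs(retryAfter, nowMs);
  // Assumed defaults: 1 s base, 30 s cap, used when the header is missing or unparsable.
  const delay =
    fromHeader !== null ? fromHeader : backoffMs(attempt, 1000, 30000, random);
  return delay <= maxWaitMs ? delay : null;
}
```

With a `maxWaitMs` of two minutes, a header value of 60 would be honoured while a value of 300 would come back as null, so the app can report the failure rather than keep a user waiting. This sketch never waits less than the server asked for; it only decides whether waiting is worthwhile.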

dmorrow · April 8, 2021, 2:07am

I’ve gone through the comments and actioned them by creating issues. Here’s a summary of all the issues, including the few I raised previously:

ACJIRA-2555 & CONFCLOUD-71116: As an app developer, I need to know the rate limits that my app is subject to
ACJIRA-2554 & CONFCLOUD-71203: As an app developer, I need to be able to validate my app’s rate limit response handling
AC-2542 & AC-2543: Incorporate best practice rate limit response handling into Atlassian Connect Express and Spring Boot
ACJIRA-2356: Provide a means of attributing API calls to users in response to user initiated webhooks.
ACJIRA-2358: Add the ability to retrieve cost budgets

Regarding ACJIRA-2554, the comment from @lzachulski on April 8 indicates testing of Jira’s rate limiting may be possible, but I’m requesting the issue status be updated to confirm this.

Regarding AC-2542 and AC-2543, my guess is that the team owning these frameworks is unlikely to be able to prioritise this work. The projects are open source, so maybe the community could propose improvements? Note that there is still the risk of PRs not getting approved.

I have an internal page summarising these issues and have shared this with the Jira Cloud Ecosystem, Confluence Cloud Ecosystem and Ecosystem Platform teams.

Please vote and comment on the issues to help advocate for them.

Regards,
Dugald

david2 · April 8, 2021, 2:24am

Hi @dmorrow ,
[AC-2542] - Ecosystem Jira is not visible to non-Atlassians.
Thanks,
David

dmorrow · April 8, 2021, 2:37am

Thanks @david2 , I’ve fixed this now.

dmorrow · April 9, 2021, 9:36am

In addition to my previous response, there are a number of issues relating to inconsistent and invalid error codes returned when limits are reached under various scenarios:

ACJIRA-1892: Bulk create issue results to status 503 for some instances
ACJIRA-1913: Some instances receives response errors when calling rest api’s after successful app installation
ACJIRA-1929: Random error codes for “Too many connection problem”
ACJIRA-1868: Intermittent status 403 response for valid REST API calls
JRACLOUD-70909: HTTP Status 403 returned for DB connection error when calling REST APIs
JRACLOUD-71874: Report 503 instead of 500 in case of “FATAL: too many connections for role”

Regards,
Dugald