OAuth rotating tokens: Unknown or invalid refresh token

szn · December 18, 2021, 4:37pm

Yet another post on that topic. But most of the existing posts describe early development phases. In my case, I successfully implemented both OAuth2 and the newly introduced rotating refresh tokens. It was a painful process because of misleading documentation. I documented the process in this post: Confluence Addon talking to Jira (cloud).

In most cases, my solution works as expected.

Unfortunately, from time to time my error log is flooded with “Unknown or invalid refresh token”.

I know what the documentation says:

This error is returned for the following reasons:

The user’s Atlassian account password has been changed. […]

Your app is using rotating refresh tokens and the exchange of refresh token failed because:

Your refresh token has expired. […]

Your app is not replacing the previous refresh token with the new refresh token returned during access token request.

I checked that multiple times. Two days ago I had exactly this situation on my own account: frequently used, no password change, refresh token saved each time I receive it.

Two observations I have are:

I can see “waves” of these errors across my customers' accounts. It looks a bit like a bi-weekly OAuth service restart (or something similar) invalidates all the refresh tokens.
Usually, when I send my request to https://auth.atlassian.com/oauth/token, I receive the expected object with refresh_token, scope, token_type, expires_in, access_token. But sometimes this object is missing the key refresh_token property.

My algorithm is as follows:

1. Get session user info (including user’s accountId), then:
2. Check user access token by tokenValidDate > new Date(). If access token expired:
   1. Refresh user access token by sending a POST request to https://auth.atlassian.com/oauth/token with { grant_type: 'refresh_token', client_id, client_secret, refresh_token }. If the JSON response does not contain error_description, do:
      1. Update user’s access_token
      2. Calculate tokenValidDate (new Date() + expires_in * 0.9)
      3. If response contains refresh_token, update user’s refresh_token
      4. Save user info
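
For readers following along, the steps above can be sketched roughly as below. This is only an illustration, not the poster's actual code: `ensureAccessToken`, `fetchImpl`, `saveUser`, `now`, and the shape of the `user` object are hypothetical names, and throwing on `error_description` is an assumption (the post does not say what happens in that branch). Note that `expires_in` is expressed in seconds, so it is converted to milliseconds before being added to a timestamp.

```javascript
// Sketch of the refresh flow described in the post.
// ASSUMPTIONS: fetchImpl, saveUser, now and the user shape are hypothetical.
const TOKEN_URL = "https://auth.atlassian.com/oauth/token";

async function ensureAccessToken(user, deps) {
  const { clientId, clientSecret, fetchImpl, saveUser, now = () => Date.now() } = deps;

  // Access token still valid: nothing to refresh.
  if (user.tokenValidDate > now()) return user;

  const res = await fetchImpl(TOKEN_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      grant_type: "refresh_token",
      client_id: clientId,
      client_secret: clientSecret,
      refresh_token: user.refreshToken,
    }),
  });
  const json = await res.json();

  // Assumption: surface the error to the caller instead of saving anything.
  if (json.error_description) throw new Error(json.error_description);

  user.accessToken = json.access_token;
  // expires_in is in seconds; keep a 10% safety margin.
  user.tokenValidDate = now() + json.expires_in * 1000 * 0.9;
  // Only replace the refresh token when the response actually carries one.
  if (json.refresh_token) user.refreshToken = json.refresh_token;

  await saveUser(user);
  return user;
}
```

In a real app `fetchImpl` could be any fetch-compatible function; injecting it keeps the sketch testable without network access.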

Any ideas what is wrong here?

Best,
https://dirtyagile.net/

szn · December 18, 2021, 4:52pm

Versions:

node@14.18.1
atlassian-connect-express@^7.4.8
atlassian-jwt@^2.0.2
express@^4.17.1
passport@^0.4.1
passport-oauth2@^1.6.1

tbinna · December 20, 2021, 3:04am

Hey @szn,

Here are a few thoughts/ideas on what could have gone wrong:

But sometimes this object is missing the key refresh_token property.

At this stage, the OAuth connection may be broken. Either because on the next token refresh, your old refresh token is not accepted anymore and you get the “Unknown or invalid refresh token” error, or because you overwrite the existing refresh token with the token response that is lacking the refresh token. In this case, on the next token refresh, you would not have a refresh token. I cannot comment on why you sometimes do not get a refresh token back, but maybe someone from Atlassian can help you out with this.

Another potential issue could be that you run your program in a clustered environment, in which case you would need to synchronize the token refresh between your running instances. Otherwise, you may get lost updates when multiple instances are trying to refresh the same access token at the same time. Unfortunately, this is a massive complexity introduced by the rotating refresh token approach and seemingly ignored by Atlassian thus far.

The general idea to mitigate issues with concurrent token refreshes in the Auth0 rotating refresh tokens implementation (on which Atlassian’s is based) is to have a reuse interval in which older (previously rotated) refresh tokens can still be used (currently configured in the Atlassian implementation to 10 mins). This mitigates the problem of concurrent attempts to refresh a token, but it does not protect you from lost updates in a clustered environment. If you are using Redis, then Redis distributed locks may help with building a cluster lock to synchronize a token refresh.

I hope this helps.

szn · December 20, 2021, 10:28am

Thank you @tbinna for your detailed answer.

My code handles the missing refresh_token scenario. When this happens I am not updating it (which would overwrite the token with null):

if (response.refresh_token) user.refreshToken = response.refresh_token

But it is tempting to dig in that area. I will place more logs around that.

I run a single node env, using Google Datastore as storage and memory for caching. I am also logging refresh token hashes to make sure I am actually sending the latest one (I am).

My users can place multiple macros on a single Confluence page. This obviously leads to the “reuse” problem you mentioned. But given the 10-minute allowance, I don’t see any problems here.

Lastly, can you confirm that your implementation is stable and that you are not experiencing semi-random “Invalid refresh token” errors?

tbinna · December 21, 2021, 3:54am

My code handles the missing refresh_token scenario. When this happens I am not updating it (which would overwrite the token with null)

I think this is generally ok because there is not much you can do to fix this. If the Atlassian server is not returning a refresh token, the whole token family/chain is broken. I think you should try to reach out to Atlassian to try to figure out why you sometimes do not get a refresh token back. I would be interested in the result of that.

The only thing you could try is to check if response.refresh_token exists and is not an empty string. If it does, assign it; if it does not, do not overwrite the existing refresh token. Maybe this gives your app a chance to retry the token rotation with the old refresh token if it is within the allowed reuse interval (10 mins). I am not 100% sure that this works, but at least you still have a refresh token to try. If you overwrite the existing refresh token with null, it is clear that you will have to send the user back into the authorization flow.

If you run in a single node env you should not have issues with concurrent token refreshes and you would not need any cluster lock solution to synchronize token refreshes. However, note that this also prevents you from scaling horizontally.
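
A minimal sketch of that check, for illustration only. `pickRefreshToken` is a hypothetical helper name, and the strict string test is an assumption about what counts as a usable token:

```javascript
// Returns the refresh token that should be stored: the new one if the response
// carries a non-empty string, otherwise the token we already have.
// ASSUMPTION: pickRefreshToken is a hypothetical helper, not an existing API.
function pickRefreshToken(currentToken, response) {
  const candidate = response && response.refresh_token;
  if (typeof candidate === "string" && candidate.trim() !== "") {
    return candidate;
  }
  return currentToken;
}

// Usage: user.refreshToken = pickRefreshToken(user.refreshToken, response);
```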

Regarding our own implementation, we have not migrated yet because the first attempt failed and we keep hearing of other vendors posting new issues (like yours where you sometimes do not get back a refresh token). Unfortunately, the impact of a broken connection is significant for our customers’ daily work so we will have to do a lot more testing upfront before attempting another migration.

szn · December 21, 2021, 1:57pm

Thanks again for your quick response.

I am actually doing:

if (response.refresh_token)
  user.refreshToken = response.refresh_token

This handles missing, null and empty refresh_token. But it is not helping.

Despite running a single-node app, I do have concurrent token refreshes. A user can place a number of my macros on a Confluence page. When she/he opens the page, I receive a number of concurrent requests that I have to handle in parallel. But, as I believe(d), with the allowed reuse interval set to 10 minutes and the if statement above, this should not be a problem. All these concurrent requests are handled in less than 2 seconds.

What I can do is to save not only the newly received refresh_token but also the response timestamp. Then, I should only overwrite the refresh token if it was delivered later than the one I have in DB.
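
A rough sketch of that idea, assuming each stored record looks like `{ refreshToken, receivedAt }` (a hypothetical shape, with `receivedAt` as a millisecond timestamp taken when the token response arrived). `mergeRefreshToken` is likewise a made-up name:

```javascript
// Decide which token record to keep. The stored record is only replaced when
// the incoming response carries a refresh token AND arrived later.
// ASSUMPTION: record shape { refreshToken, receivedAt } is hypothetical.
function mergeRefreshToken(stored, incoming) {
  // Response without a refresh token: never overwrite what we have.
  if (!incoming || !incoming.refreshToken) return stored;
  // Incoming response is older than (or as old as) the stored one: keep stored.
  if (stored && stored.receivedAt >= incoming.receivedAt) return stored;
  return { refreshToken: incoming.refreshToken, receivedAt: incoming.receivedAt };
}
```

For this to help with parallel requests, the read, compare, and write would probably need to happen atomically (for example inside a datastore transaction); otherwise two writers could still interleave.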

I also tried to go back to the permanent refresh token in the OAuth app settings. It was successful in my dev app. The production app throws a very useful “Something went wrong” error message while trying to save this setting.

Anyhow, it seems that the root cause is here:

And until they fix that, we will continue to have problems at least once per month.

gabriel1 · December 30, 2021, 2:59pm

Hi there, found this thread after an incident last night where we saw every one of our customers who had Jira integration get this error (after working fine for 30 days after switching to rotating refresh tokens). Any idea why this could be?

szn · January 2, 2022, 5:51pm

Hi there @gabriel1,

I believe you were hit by:

Theoretically, the current implementation sets unused-token invalidation to 90 days, not 30. But my users were experiencing more frequent problems anyhow.

In my case, the problem is more complex due to the nature of the plugin. My users are allowed to place as many macros on a single Confluence page as they wish. This creates a scenario in which a single Confluence page refresh can create 50 requests to my backend.

If a user has a valid access_token it is all fine. I can just use it. But access tokens are valid only for one hour. So it is normal that my backend receives 50 parallel requests with an expired access token. In this scenario, I am trying to refresh all the tokens in parallel. Due to the nature of iframe loads, this can create a race condition:

1. A user opens a page with three macros after one hour of inactivity
2. Backend receives 3 parallel requests and attempts to use refresh token "A" to refresh both access token and refresh token. All refresh attempts are using the same refresh token ("A")
3. Request number 1 is the first and replaces refresh token "A" with "B"
4. Request number 2 is the second and replaces refresh token "A" with "C" (this is allowed within the 10-minute window)
5. Request number 3 is delayed and attempts to use token "A" while token "C" was already issued
6. I receive my favorite: “Unknown or invalid refresh token”
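
One way to avoid sending token "A" several times in the first place is an in-process "single-flight" guard, where parallel requests for the same account share a single in-flight refresh. This is a different technique from the fix described below, it only helps on a single node, and every name in it (`refreshOnce`, `doRefresh`, `inFlight`) is hypothetical:

```javascript
// In-process single-flight: concurrent callers for the same accountId share
// one pending refresh promise instead of each calling the token endpoint.
// ASSUMPTION: doRefresh(accountId) performs the actual refresh and save.
const inFlight = new Map();

function refreshOnce(accountId, doRefresh) {
  const pending = inFlight.get(accountId);
  if (pending) return pending;

  const promise = Promise.resolve()
    .then(() => doRefresh(accountId))
    // Clear the entry whether the refresh succeeded or failed,
    // so the next expiry triggers a fresh attempt.
    .finally(() => inFlight.delete(accountId));

  inFlight.set(accountId, promise);
  return promise;
}
```

With three macros loading at once, all three handlers would await the same promise, so only one request with token "A" reaches the token endpoint.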

I implemented an update to my token handling code that handles “Unknown or invalid refresh token”:

If my call to requestAccessToken returns invalid_grant, I pull fresh user info (with both access and refresh tokens) from the database cache. It is possible that I already have a newer token ("C" from the example above), in which case I should use the access token I obtained in a different thread and ignore this error.
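
A hedged sketch of that fallback, not the actual production code. `requestAccessToken` is the name used above, but its signature here is assumed; `loadUser`, the `err.code` property, and the user shape are hypothetical:

```javascript
// On invalid_grant, re-read the user from storage. If another request already
// rotated the token and stored a still-valid access token, use that instead.
// ASSUMPTIONS: requestAccessToken(user) rejects with an error whose `code` is
// "invalid_grant"; loadUser(accountId) returns the latest stored user.
async function refreshWithFallback(user, deps) {
  const { requestAccessToken, loadUser, now = () => Date.now() } = deps;
  try {
    return await requestAccessToken(user);
  } catch (err) {
    if (err.code !== "invalid_grant") throw err;

    const fresh = await loadUser(user.accountId);
    const rotatedElsewhere =
      Boolean(fresh) &&
      fresh.refreshToken !== user.refreshToken &&
      fresh.tokenValidDate > now();

    // A parallel request won the race: ignore the error and use its tokens.
    if (rotatedElsewhere) return fresh;
    // Nothing newer in storage: the error is genuine.
    throw err;
  }
}
```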