RFC-34: Forge remote - retrieving new access tokens at any time

AdamMoore · February 5, 2024, 10:42am

RFCs are a way for Atlassian to share what we’re working on with our valued developer community.

It’s a document for building shared understanding of a topic. It expresses a technical solution, but can also communicate how it should be built or even document standards. The most important aspect of an RFC is that a written specification facilitates feedback and drives consensus. It is not a tool for approving or committing to ideas, but more so a collaborative practice to shape an idea and to find serious flaws early.

Please respect our community guidelines: keep it welcoming and safe by commenting on the idea not the people (especially the author); keep it tidy by keeping on topic; empower the community by keeping comments constructive. Thanks!

Summary of Project

Forge remote is a set of capabilities that make it easier to securely integrate remote back ends with your Forge apps. So far, we’ve made it possible to call Atlassian APIs from your remote server when:

a user interacts with the UI of your Forge app, or
a product or lifecycle event you’ve subscribed to is triggered, or
a scheduled trigger is invoked

This RFC goes into detail on Milestone 3 from RFC-8 and proposes a new mechanism for calling Atlassian APIs from remote back ends without any invocation coming from Forge.

Publish: 5 Feb 2024
Discuss: 23 Feb 2024
Resolve: 1 Mar 2024

Problem

There are various use cases that require communication with Atlassian APIs from a remote back end without relying on a user interaction, product event or schedule. For example, you might need to trigger actions based on webhooks from other systems or utilise your own external UI.

Typically this might be solved for by providing long lived credentials that could be used to generate access tokens as needed (client_secret, refresh_token etc) or something like a per-installation shared secret like we have in Connect.

The downside of long lived credentials is the risk that comes with managing them (retrieving them, storing them, rotating them etc). These credentials are extremely sensitive and require robust security controls to manage and protect them.

Proposed Solution

With Forge, our goal is to abstract away as much security risk as possible. We want to take care of that as part of the platform.

Our proposal is to introduce a new module called authTrigger that operates in a similar way to a web trigger. This module will offer a URL that can be invoked from your remote server, with the sole purpose of triggering the generation and delivery of a new Atlassian API token back to your remote server.

An example manifest would look like this:

modules:
  authtrigger:
    - key: auth-trigger
      endpoint: auth-trigger-endpoint
  endpoint:
    - key: auth-trigger-endpoint
      remote: my-remote-server
      route:
        path: /auth
      auth:
        appSystemToken:
          enabled: true

Each installation would have a unique auth trigger URL, which you can retrieve via GraphQL similarly to how getURL works for web triggers today. This request would be authenticated with a short-lived access token that you will receive through one of the existing Forge remote flows (e.g. an Install/Upgrade lifecycle event).

So the installation of an app might go something like:

Customer installs your app
The avi:forge:installed:app lifecycle event is delivered to your remote server with an appSystemToken.
You use the appSystemToken to call the auth trigger getURL GraphQL mutation and retrieve the auth trigger URL.
You persist the installationId and authTriggerURL in your remote data store to use when you need it.
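The install steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the class name, the in-memory map standing in for a remote data store, and the `fetchAuthTriggerUrl` function are all hypothetical. In particular, the exact shape of the getURL GraphQL call is not specified in this RFC, so it is injected as a plain function here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/** Sketch of a remote-side handler for the install lifecycle event. */
final class InstallationRegistry {
    // installationId -> authTriggerURL; stands in for a real remote data store.
    private final Map<String, String> authTriggerUrls = new ConcurrentHashMap<>();

    // Hypothetical lookup: (installationId, appSystemToken) -> authTriggerURL.
    // In practice this would wrap the getURL GraphQL call, whose request
    // shape is an assumption and not defined by the RFC text.
    private final BiFunction<String, String, String> fetchAuthTriggerUrl;

    InstallationRegistry(BiFunction<String, String, String> fetchAuthTriggerUrl) {
        this.fetchAuthTriggerUrl = fetchAuthTriggerUrl;
    }

    /** Call when avi:forge:installed:app arrives with an appSystemToken. */
    void onInstalled(String installationId, String appSystemToken) {
        String url = fetchAuthTriggerUrl.apply(installationId, appSystemToken);
        authTriggerUrls.put(installationId, url);
    }

    /** Look up the stored URL when an offline request is needed later. */
    Optional<String> authTriggerUrlFor(String installationId) {
        return Optional.ofNullable(authTriggerUrls.get(installationId));
    }
}
```

The same `onInstalled` path could be reused for any later invocation that carries a short-lived token, since all it needs is an installation ID and a token.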

Then later, when you need to make an offline API request

Make a request to the authTriggerURL for that installation (no auth required). A successful request will return a 200 response code and an empty response body.
A new appSystemToken is delivered to the remote endpoint defined in your manifest

Success! You now have the appSystemToken which you can use as the bearer token to make API requests for the next 55 minutes.
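One way to keep track of that 55-minute window on the remote side is to store the time each token was received. The sketch below assumes the window starts when your endpoint receives the token; the class and method names are illustrative only and are not part of any Forge SDK.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: an appSystemToken paired with the time it was received. */
final class ReceivedToken {
    // Usable lifetime quoted in the RFC text.
    static final Duration LIFETIME = Duration.ofMinutes(55);

    private final String value;
    private final Instant receivedAt;

    ReceivedToken(String value, Instant receivedAt) {
        this.value = value;
        this.receivedAt = receivedAt;
    }

    /** True while the token is still inside its 55-minute window. */
    boolean isUsableAt(Instant now) {
        return now.isBefore(receivedAt.plus(LIFETIME));
    }

    /** Value for the Authorization header of an Atlassian API request. */
    String authorizationHeader() {
        return "Bearer " + value;
    }
}
```

A caller would check `isUsableAt(Instant.now())` before each API request and call the auth trigger URL again only when it returns false.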

If things go out of sync

In cases where installations go out of sync (despite our retry logic for events etc.) your app will be able to recover automatically. The next time it receives a short lived access token for that installation (e.g. the next time a user interacts with the app or a product event you’ve subscribed to is triggered) you will be able to make a getURL() request to recover the auth trigger URL and restore offline access.

You will also be able to add an additional layer of redundancy with remote support for scheduled triggers. Potentially you could set up a schedule to ensure all your installations are synced daily.

This is an important improvement over Connect where it can be difficult to recover if the shared secret for an installation is lost because you need to wait for a new install or upgrade event.

What about asUser() tokens?

Forge doesn’t currently support asUser access for back-end processes and that will also be the case for the initial release of this functionality. It is something we’ll be exploring soon for Forge across the board, but we’ll save that for a future RFC.

Remember, you can make asUser() requests with the short lived tokens available when the user is invoking your UI modules.

Our thinking behind this design

We recognize that this differs from standard OAuth and is a trade-off against one of our other goals, which is to adhere to standards as much as possible. However, we believe that this strikes the best balance between security and ease of adoption.

We considered options such as providing a client_id and client_secret in the developer console, or including refresh_tokens with installation events. However, these options don’t meet our goals around trust and would also require some level of customization to work in the Forge/Atlassian context anyway.

While this proposal may require more effort to implement initially, it will ultimately reduce ongoing security risks for both you and our shared customers.

Action

We are still early in the design/proof-of-concept phase and keen for your feedback.

We’re seeking input on:

Whether this would work for your use case.
Whether you see any major flaws in the proposal.
Any specific implementation details you might like to see included in the final design.
Any major difficulties you might see if you’re migrating from Connect and shared secrets.
What you’d like to see from a developer tooling perspective.

ernst.stefan · February 5, 2024, 11:21am

I don’t see any way in which this is more secure or more trustworthy than standard OAuth. It seems to simply rely on the principle “security through obscurity” in which a potential attacker would have to first understand how this solution works. In my opinion this is just a straight re-implementation of the refresh token mechanism.

remie · February 5, 2024, 11:33am

Hi @AdamMoore,

Thank you for this RFC, and thanks to the Forge team for listening to our concerns with regard to enabling offline processing.

Having said that, I second the concern raised by @ernst.stefan that this seems to be a rather convoluted way to solve the stated problem.

The result of this workflow will just be that vendors will create scheduled tasks that will run every 50 minutes to generate a new token and store that in a database, basically recreating the Connect shared secret at the added cost for both vendor and Atlassian of an hourly synchronisation process.

Looking at all these convoluted solutions to the challenges faced by Atlassian Marketplace Partners to create complex apps on Forge, it almost seems like it would have been better to make some slight changes to Connect instead… as Forge is becoming more and more Connect-like and is losing any benefit it once had (and I would argue that it is actually becoming technically more complex).

danielwester · February 5, 2024, 11:51am

Hi Adam,
Thank you for the very detailed RFC.
I will have a hard time convincing my team to switch to something like this because of the cost/risk that comes with anything not industry standard. It is a higher cost for us to have to write our own implementation for this and then have to look for our own security holes.

I’m also missing the “this is why you use this instead of Connect” argument. How is this different from Atlassian bumping up the rotation of the Connect shared secret to be once an hour?

remie · February 5, 2024, 11:55am

I would absolutely welcome Atlassian to do this BTW. Please do this. Please call my enabled lifecycle event every hour for every installed instance with a new shared secret. This would instantly make Connect more secure at little to no cost for vendors.

remie · February 5, 2024, 12:02pm

Please note that I’m being serious about this. A scenario that this RFC should take into consideration is that there are Forge apps with >1000 installs. If they implement this for each install, that will be >28800 requests to the auth trigger URL, resulting in >28800 calls to the auth-trigger-endpoint. And that is for one single Forge app.

You are potentially looking at millions of requests. If they are put in a single queue with a finite number of message workers, this whole thing may explode.
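The 28,800 figure is consistent with a per-day count for 1,000 installations that each refresh every 50 minutes (the interval mentioned earlier in this thread). That reading is an assumption, since the post does not state the time period. The arithmetic can be checked with a short sketch:

```java
/** Back-of-the-envelope estimate of daily token refresh traffic. */
final class RefreshVolume {
    static final int MINUTES_PER_DAY = 24 * 60;

    /** Requests per day if every installation refreshes on a fixed interval. */
    static long requestsPerDay(long installations, int refreshIntervalMinutes) {
        // Multiply first so integer division does not truncate 28.8 to 28.
        return installations * MINUTES_PER_DAY / refreshIntervalMinutes;
    }
}
```

`RefreshVolume.requestsPerDay(1000, 50)` gives 28800, and each of those requests also produces one callback to the auth-trigger-endpoint.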

lkimmel · February 5, 2024, 12:16pm

I share all points from the other commentators.

On top of that, I do not really understand why GraphQL is used here. At the moment I cannot really use the GraphQL API: it is highly experimental, the endpoints I would like to use do not work yet, it is not versioned and there isn’t even a place to get support for it yet(?). Where do I report errors when the GraphQL API isn’t working for my app cases? I assume I can use that token for REST too, that’s not really my point here. Please only make the GraphQL API a standard once it really works. I see no point in mixing GraphQL and REST APIs.

tobias.viehweger · February 5, 2024, 12:17pm

Hi Adam,

re-iterating @remie - thanks for tackling the offline access problem, since we’ll potentially also make a lot of use of this in our app down the road (Microsoft webhooks → Atlassian update).
That said, I see a couple of issues with this approach, with an unclear benefit.

Security problems

So coming from a security/threat mgmt. perspective I have a bit of an issue with this solution, because as @remie mentions, this is not just security through obscurity, but introduces additional security problems.
We have been thinking about security a lot in the last few years, because we also store access tokens to Microsoft services, which some would consider even more impactful than Jira access tokens.

I see the following problem with sidestepping a common pattern like “client_id / client_secret”: You are assuming some kind of breach, where the client_id and client_secret are in the hands of the attacker and can be used to access Jira data. This is actually the least common thing that would happen, because these two pieces of information can easily be stored in an encrypted & tightly access controlled storage like AWS Secrets Manager. A breach that is far easier to assume is either “breach of database” (should be uncommon enough if you properly put them in private subnets), or “breach of application server” (which is probably the most likely, due to being more exposed to customer input / dependencies like npm packages and the like).

Your solution unfortunately does not guard against the breach of the app servers, as the secret is delivered to these servers on a golden (callback) platter.
I’d argue that you are introducing even more security issues, as the secret cannot be contained in the inner (private) subnet of the application, but there needs to be a way through all involved infrastructure (e.g. CloudFront, reverse proxies, load balancers), where any of these components could be breached to create a secret leak. One misconfigured access log would spill those secrets into CloudWatch logs or the like, the exact thing you are trying to avoid. By sidestepping this common id/secret pattern that can be “easily” secured, you are opening this up to a whole lot of new issues in complexity. I’d strongly suggest reconsidering this approach.

Operation complexity & single point of failure

Using the proposed solution results in a lot of operational complexity, for both Atlassian and the vendor. Instead of having a simple mechanism to get a short lived access token (e.g. in exchange for client secret or certificate), there is a lot of queuing happening in between to make this proper async. E.g. an example we have

Chat message in MS Teams comes in via webhook
Detects: Change on Jira side is needed (e.g. add a comment)
Call Atlassian authUrl to request token
Queue in change on our side, to run after (when exactly?) token is provided
authUrl call is queued in on Atlassian side (single point of failure right here)
Atlassian’s callback to our backend happens
We can process the task after we got a token

This is a lot of complexity for no security benefit (see above). You are introducing a queue on your side, which if stuck, would break every background processing on our side.
Also, we now need to think about how to queue this change to run after the token refresh, which is also not a simple thing to do, and now every single vendor who needs this needs to implement themselves.
I agree with the approach by @remie - we would also probably just implement a scheduled job on our side to refresh the token every 50 min for every customer instance, to avoid this scheduling complexity, negating any security benefit you potentially see here.
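To make the "queue the change until the token arrives" step concrete, here is a minimal in-memory sketch. It assumes a single process and that the token callback identifies the installation; the class and method names are hypothetical, and a real implementation would need durable storage plus a timeout for tokens that never arrive.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

/** Sketch: park work per installation until a token callback arrives. */
final class PendingWork {
    private final Map<String, Queue<Consumer<String>>> byInstallation = new HashMap<>();

    /**
     * Queue a task that needs a token. Returns true when it is the first
     * task waiting, i.e. the caller should now call the auth trigger URL.
     */
    synchronized boolean enqueue(String installationId, Consumer<String> task) {
        Queue<Consumer<String>> queue =
                byInstallation.computeIfAbsent(installationId, k -> new ArrayDeque<>());
        queue.add(task);
        return queue.size() == 1;
    }

    /** Called from the token endpoint; runs queued tasks and returns how many ran. */
    int onTokenDelivered(String installationId, String appSystemToken) {
        Queue<Consumer<String>> queue;
        synchronized (this) {
            queue = byInstallation.remove(installationId);
        }
        if (queue == null) {
            return 0;
        }
        int ran = 0;
        for (Consumer<String> task : queue) {
            task.accept(appSystemToken); // run outside the lock
            ran++;
        }
        return ran;
    }
}
```

Even in this reduced form, the vendor side ends up owning a queue, a trigger-once rule and a callback handler, which is the extra moving machinery described above.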

Better solutions

I understand you want to be a front-runner in security with Forge, and this is laudable, but there are industry standard solutions that are more helpful in solving this. Instead of or in addition to using a client_secret, you could consider the following options, in case you are interested in security while accepting a bigger complexity for vendors:

Use a client_certificate instead of a client_secret (see Microsoft OAuth implementation)
Require vendors to supply a mTLS client certificate when calling a token endpoint
Use IP whitelisting to call the token endpoint
Consider a system where you’d have to sign a call to the token endpoint with a certificate with strong security attached (stored in a HSM, signing happens in HSM).
Consider only allowing “offline” access for cloud fortified vendors or vendors that have a proven track record in storing/maintaining secrets
Use a short lived sharedSecret that is refreshed periodically, but delivered in an encrypted way to the app servers

I would not really recommend most of these, but they are at least somewhat heard of, and do not expose secrets across the entire infrastructure. There is no simple technical solution to this problem, I’m afraid. I hope you reconsider this implementation, as this potentially will lead to more issues & downtime, with no security benefit.

Thanks
Tobi

jbevan · February 5, 2024, 2:36pm

Hi Adam,

I’m coming from the context of migrating an existing Connect app onto Forge and trying to keep a lot of our existing backend infrastructure (for now) using Forge Remote.

Can you explain a bit more about why the proposal suggests this implementation? It feels like it would add significant complexity when implementing this integration mechanism because we need to make all requests to Jira/Confluence even more async than they currently are.

Currently we have the Connect credentials which we use to generate a new auth token for the app user or we make a request to https://oauth-2-authorization-server.services.atlassian.com/oauth2/token in order to get the necessary authentication token. (I know that this RFC is not focused on user impersonation, but the flow outlined below will still apply)

We’d need to replace the:

synchronous generation of an app user auth token / single oauth 2 auth server HTTP request

with a new mechanism that:

makes the authTriggerURL HTTP request
stores a record somewhere that we’re waiting for a new request to be received by our new auth endpoint
handles the inbound auth endpoint request and matches it to the record we just stored and stores the new appSystemToken
polls our storage system for the record to now indicate that we got the appSystemToken so we can continue the ingress into Jira/Confluence

and wraps that all up nicely in something that looks like a single operation, otherwise we need to fundamentally rebuild a lot of our existing logic to take into consideration that we have to now wait for Forge to send us a request before we can make requests back to Atlassian.
I just tried to draw out an event-based architecture for how we’d need to build this and I can kind of see how it would work, assuming that no metadata is associated with any given auth token except for the accountId and the siteId/cloudId, but it’s quite a departure from what we have now…
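As a rough illustration of wrapping that flow so it looks like a single operation, the sketch below swaps the store-and-poll steps for an in-process `CompletableFuture`. That substitution only works if the inbound auth endpoint request lands on the same process that is waiting, which is an assumption. `TokenBroker` and the `callAuthTrigger` hook (which would perform the HTTP request to the stored authTriggerURL) are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Sketch: hide "trigger, then wait for the callback" behind one blocking call. */
final class TokenBroker {
    private final Map<String, CompletableFuture<String>> waiting = new ConcurrentHashMap<>();

    // Hypothetical hook: calls the authTriggerURL stored for the installationId.
    private final Consumer<String> callAuthTrigger;

    TokenBroker(Consumer<String> callAuthTrigger) {
        this.callAuthTrigger = callAuthTrigger;
    }

    /** Blocks until the auth endpoint receives a token or the timeout expires. */
    String fetchToken(String installationId, long timeout, TimeUnit unit) {
        CompletableFuture<String> fresh = new CompletableFuture<>();
        CompletableFuture<String> existing = waiting.putIfAbsent(installationId, fresh);
        CompletableFuture<String> future = existing != null ? existing : fresh;
        if (existing == null) {
            callAuthTrigger.accept(installationId); // only the first waiter triggers
        }
        try {
            return future.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for token", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("No token delivered for " + installationId, e);
        } finally {
            waiting.remove(installationId, future);
        }
    }

    /** Called by the handler for the inbound auth endpoint request. */
    void onTokenDelivered(String installationId, String appSystemToken) {
        CompletableFuture<String> future = waiting.get(installationId);
        if (future != null) {
            future.complete(appSystemToken);
        }
    }
}
```

With several application nodes behind a load balancer the callback may arrive elsewhere, in which case the shared-storage-and-poll approach listed above (or some pub/sub mechanism) would still be needed.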

Why is the authTriggerURL not rotated?

Without an actual proposal to review, I think I quite like @remie’s suggestion of Atlassian regularly rotating/sending us new secrets that we can use to ask for ingress tokens as an alternative to this. Everything is short lived then and frequently rotated.

Thanks,
Jon

AdamMoore · February 6, 2024, 8:02am

Hey Ernst, security by obscurity is certainly not our aim and if we move forward with this approach we will be doing our best to document it and provide reference implementations, SDKs/frameworks etc. to negate this.

A key difference from a standard refresh token based approach is that this design relies on the ability to request access tokens be delivered to a pre-registered URL for your Auth server. So an attacker would also need to have control of your server (or at least your domain) before they can get a token they can use to access customer data.

AdamMoore · February 6, 2024, 8:13am

Thanks @remie and @danielwester

Our thinking is that potentially developers could choose whichever method suited them better. If they have integrations that are very active (in an offline sense) then it would be worth handling the volume of requests to receive a new access token for each installation every hour. If you have an hourly or more regular sync (for example) this might make sense.

But, if your integration is typically not that active in an offline sense then it might make sense to use auth triggers which provide more ad-hoc access. It may not make sense to be receiving all these access tokens if they’re not being used 90% of the time.

@remie you mentioned you’re keen for the 60min Connect secret idea but wouldn’t that create the same problems around volume of hooks you need to handle?

remie · February 6, 2024, 8:15am

Can you please tell me how this differs from the OAuth best practice of having to provide a list of callback URLs?

I understand that this is not specifically related to the refresh token, because for offline access there is no user authentication flow, thus no callback.

What I meant is: you can actually use the same concept as a list of allowed origins from which a refresh token can be used to generate a new access token. This would limit the attack vector to the machine, or at least the domain, from which the attacker would be able to request the token and it does not need to involve an additional asynchronous delivery method.

AdamMoore · February 6, 2024, 8:16am

Just in case this wasn’t clear the GraphQL API is just to retrieve the auth trigger url. The eventual token you receive can be used to access REST or GraphQL APIs.

But to respond to your main point, we are also looking at providing REST alternatives for all the Forge GraphQL APIs as well. It’s pretty clear that most people prefer REST.

remie · February 6, 2024, 8:27am

This is true, and I would love for ad-hoc access. But the problem with the current implementation is that it creates an asynchronous flow in which my offline process will need to wait (indefinitely?) for a queue event to be processed by Atlassian and a new token to be returned, which will be delivered to a different endpoint / process. This adds engineering complexity because I will need to stop processing and wait for the token to be refreshed, polling the database.

In addition, remote resources might also have processing timeouts, meaning that I need to add retry mechanisms if the asynchronous access token delivery is not fast enough.

Ad-hoc token access is great, but not if it comes at the cost of added complexity. In that case, I will go for the dumb solution of just continuous refresh, even if that comes at the cost of needlessly processing requests & doing DB writes.

AdamMoore · February 6, 2024, 8:27am

Cheers @tobias.viehweger, lots of good points here and many options we’ve been discussing/debating internally as well. Some more points on our thinking:

Security problems

Yes, it’s definitely easy these days to securely store a client secret in something like AWS Secrets Manager. I have total confidence that you and everyone responding to this thread would manage that no problem.
The problem is that it’s also easy not to do things correctly, especially for less sophisticated developers. Secrets make their way into public GitHub repos etc. all the time (for example). Attackers don’t necessarily need to access a developer’s database or app server to get hold of them, they just need developers to make a silly mistake. We have a long history of bug bounties, security scans etc. which show this to be more common than you’d hope.
Your point on the callback platter is well taken but I disagree with the characterisation that this solution is creating more risk by spreading secrets around your entire infrastructure. I assume you’re talking about the authTriggerURLs. These should definitely be kept secret, but they are of no use to an attacker unless they also control your auth server to receive the access token. It’s very different to a client_secret.
If we implement a refresh token mechanism as others in this thread have suggested then that would presumably also have to travel around your infrastructure in the same way (because we would need to distribute them by some kind of hook).
Perhaps you’re suggesting that there is no per-installation refresh token and the client_id and client_secret would provide access to all installations?

Operating complexity

Yes, this is more complexity, but I’d argue it comes with a security benefit.
Even if vendors do go for a scheduled trigger implementation there is still security benefit in that there aren’t long lived credentials that can be leaked. An attacker needs to take over your server to get access to tokens.

Other solutions

We’ve discussed options around certificates etc. but I think it would bring its own world of headaches both for Atlassian and for vendors. Something we’re trying to avoid. It might work well for established vendors who are used to working in enterprise environments but would be a pain at scale.
Having a two-tiered solution for Cloud Fortified vendors vs the rest is probably something we’d also like to avoid.

remie · February 6, 2024, 8:36am

I guess this is the gist of the problem. You are trying to solve something that you just cannot solve. The shared responsibility model exists for a reason. You cannot prevent us from doing stupid things.

Overengineering a security solution is also a well established vulnerability. The more hoops you make us jump through, the more shortcuts we’re going to make and the less secure it will become.

remie · February 6, 2024, 8:45am

Please do allow me to argue here the following alternative solution:

An OAuth based solution in which apps will get a client_id and client_secret per app (not per-instance). This is a well established pattern that developers understand. They know that they need to keep these a secret and can put them in environment variables.

For added security, you can force us to rotate client_secret every X period.
Apps can retrieve ad-hoc short-lived access tokens using a longer-lived refresh token.

For added security, requests should come from an approved domain/IP list and the response payload can be encrypted using the client_secret
Require access tokens / refresh tokens to be signed/encrypted using the client_secret when making request, which would make the access tokens useless without also having access to the client_secret (and tell people not to store client_secret together with refresh token).
The refresh token itself could also be short-lived (24h for instance). Apps will do a single call once a day to get a new refresh token. If they fail to do that, you can have the whole async token flow described here to a known endpoint to get a new refresh token.

For added security, you can also deliver a new refresh token with each request for a new access token
Resume work on the granular scopes in order to limit access of the access token (app developers also want this, we do not need full access!)
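The idea of delivering a new refresh token with each access token request is commonly called refresh token rotation. A minimal sketch of the issuing side might look like the following. It is purely illustrative: the class is hypothetical, the in-memory map stands in for real storage, and the token strings are placeholders rather than any actual Atlassian token format.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of single-use refresh tokens: each exchange retires the old one. */
final class RotatingRefreshTokens {
    // Active refresh token -> installationId it was issued for.
    private final Map<String, String> active = new ConcurrentHashMap<>();

    /** Issue a fresh refresh token for an installation. */
    String issue(String installationId) {
        String token = UUID.randomUUID().toString();
        active.put(token, installationId);
        return token;
    }

    /** Returns {accessToken, nextRefreshToken}; rejects unknown or reused tokens. */
    String[] exchange(String refreshToken) {
        String installationId = active.remove(refreshToken); // single use
        if (installationId == null) {
            throw new IllegalArgumentException("Refresh token is unknown or already used");
        }
        String accessToken = "access-" + UUID.randomUUID(); // placeholder value
        return new String[] { accessToken, issue(installationId) };
    }
}
```

Because every exchange invalidates the token that was presented, a leaked refresh token stops working as soon as the legitimate app (or the attacker) uses it once, which makes reuse detectable.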

The clear benefit of using this flow is that it is well known and understood by developers, limiting the possibility of people making strange choices on how to deal with the tokens. They understand OAuth, and if they don’t there is a plethora of online resources to help them.

There are also existing libraries that can help implementing the OAuth flow and dealing with refresh tokens, limiting the amount of boilerplate code written by developers to deal with this custom flow.

Please stay with industry best practices. There is a reason they exist. The data Atlassian stores is not more/less important than what other companies store and they also rely on access tokens / refresh tokens. Providing 3rd party access is inherently insecure, there will always be a level of trust involved (and lawyers creating unreadable agreements that limit liability).

Pat · February 7, 2024, 5:03am

Want to start by saying thanks for taking the time to provide all the great feedback and acknowledge the importance of getting the design right for Forge Remote offline access whilst balancing a number of complex competing concerns.

Hopefully, I can provide some more detail on some of the thinking behind this RFC and address some of the concerns raised so far.

Complexity

It has been raised a few times that the authTrigger URL invocations would be made async and therefore add complexity to the remote server implementation.

This was not the intention. Calls to the remote endpoint URL associated to the authTrigger would be made synchronously from the Forge Platform. The benefit of this is that you should be able to develop your remote auth server in a way that appears like a standard token fetch, only invoking the authTriggerUrl refresh flow in the case where you no longer have a valid token or it has since expired.

Example sequence diagram showing how this could look

Example Java Code showing how this might be implemented at a high level:

    import static java.util.Objects.requireNonNull;

    import java.util.Optional;
    import org.springframework.http.ResponseEntity;

    // Method of a service class that holds tokenStore, installationStore,
    // webClient (Spring WebClient) and log (SLF4J-style) fields.
    public String getToken(final String installationId) {
        // 1. Reuse a stored token when there is one.
        final Optional<Token> storedToken = tokenStore.findById(installationId);
        if (storedToken.isPresent()) {
            log.info("Cache Hit: Found token in store for installationId={}", installationId);
            return decryptToken(storedToken.get().getEncryptedToken());
        }
        // 2. Otherwise look up the auth trigger URL saved at install time.
        log.info("Cache Miss: Looking up authTrigger url for installationId={}", installationId);
        final Optional<Installation> installation = installationStore.findById(installationId);
        if (installation.isPresent()) {
            final String authTriggerUrl = installation.get().getAuthTriggerUrl();
            // 3. Call the auth trigger URL and block until Forge responds.
            final ResponseEntity<Void> responseEntity = webClient.get()
                    .uri(authTriggerUrl)
                    .retrieve()
                    .toEntity(Void.class)
                    .block();
            if (requireNonNull(responseEntity).getStatusCode().is2xxSuccessful()) {
                // 4. The remote endpoint was invoked synchronously, so the new
                //    token is expected to be in the store by now.
                return getTokenFromStore(installationId).orElseThrow();
            } else {
                throw new RuntimeException("Unable to fetch token: authTriggerUrl returned status code " + responseEntity.getStatusCode());
            }
        }
        log.error("No authTrigger url stored for installationId={}", installationId);
        throw new RuntimeException("Unable to fetch token: no authTriggerUrl stored for installationId " + installationId);
    }

Standards Compliance

As mentioned a key goal for Forge Remote is to adhere to existing standards as much as possible.

That being said we also need to achieve the following Security Controls:

Time limited access credentials
Tenant Isolated access credentials
Policy enforcement checks at point of credential refresh

In order to achieve tenant isolation and allow Remote Access to hosted storage for example we need to ensure that any access tokens provided are isolated to a given app installation. This negates the ability to use a standard OAuth client credentials grant flow with credentials per app.

marc · February 7, 2024, 9:54am

Hi @Pat ,
You wrote:

Why not require separate OAuth client credentials for each combination of app and clientKey? This would allow the use of standard OAuth.

remie · February 7, 2024, 8:13pm

@Pat ,

Can you please confirm that this is a change of behaviour compared to the OP of this RFC?

If I look at the initial flow shared by @AdamMoore, it seems to indicate that the auth-trigger-endpoint, which is to receive the new token, is actually located in the Remote Compute Server, and not on the Forge platform.

This is also further emphasised by the manifest example:

AdamMoore:

An example manifest would look like this:

modules:
  authtrigger:
    - key: auth-trigger
      endpoint: auth-trigger-endpoint
  endpoint:
    - key: auth-trigger-endpoint
      remote: my-remote-server
      route:
        path: /auth
      auth:
        appSystemToken:
          enabled: true

In addition, from your diagram it looks like the Atlassian Auth endpoint has a direct connection to the token cache, which I assume is actually located on the Remote Compute Server. How does that work?

To be honest, I’m actually more confused now