RFC-117: Forge LLMs

Project Summary

We propose a new API/SDK that lets Forge apps access Atlassian-hosted LLMs from Forge functions and containers without data egress.

Publish: 29 October 2025

Discuss: 12 November 2025

Resolve: 19 November 2025

Problem

Currently, the only way to integrate LLMs with Forge apps is by egressing data to remote servers or third-party providers (e.g. OpenAI). Egressing data outside of Atlassian can be a blocker to enterprise adoption and can cause the app to lose eligibility for Runs on Atlassian.

Proposed Solution

Provide access to Atlassian-hosted LLMs from Forge functions (and, in future, containers) via an API. Apps using these LLMs would be eligible for Runs on Atlassian.

Developer experience

We will provide an easy-to-use SDK, aligned with industry standards and other Forge APIs. The proposed approach is a simple function-call pattern, similar to the Claude SDK, for example:

const response = await chat({
  model: 'claude-sonnet',
  maxCompletionTokens: 50,
  temperature: 0.7,
  topP: 0,
  messages: [
    // System guidance first, then the user turns
    { role: 'system', content: 'You are a text editing agent' },
    { role: 'user', content: 'Find any typo in the following text' },
    { role: 'user', content: 'Tis is an intereating artivle!' }
  ]
});

If you require a structured response to use programmatically (e.g. function calling), you will use “tools” (see Tools) and get a defined JSON response back. The request would look something like:

...
    "messages": [
      {
        "content": "Your task is to assist",
        "role": "system"
      },
      {
        "role": "user",
        "content": "What is the weather like in Boston today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": [
                  "celsius",
                  "fahrenheit"
                ]
              }
            },
            "required": [
              "location"
            ]
          }
        }
      }
    ]

To get a response like:

"message": {
          "role": "assistant",
          "content": null,
          "tool_calls": [
            {
              "id": "call_bkhLcH2zVrkSgMGqE5NPAKX4",
              "type": "function",
              "function": {
                "name": "get_current_weather",
                "arguments": "{\"location\":\"Boston, MA\"}"
              }
            }
          ]
        }
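
As an illustration of how an app might consume such a tool call, here is a minimal sketch. It assumes the chat() call from the earlier example, plus a hypothetical getCurrentWeather() helper and previousMessages array defined in your own app; the shape of the follow-up “tool result” message is not confirmed and simply follows common industry convention.

// Sketch only: getCurrentWeather() and previousMessages are hypothetical app-side values,
// and the follow-up message shape is illustrative, not a confirmed Forge API.
const toolCall = response.message.tool_calls?.[0];

if (toolCall && toolCall.function.name === 'get_current_weather') {
  // The model returns arguments as a JSON string, so parse before use
  const args = JSON.parse(toolCall.function.arguments);
  const weather = await getCurrentWeather(args.location, args.unit);

  // Send the tool result back so the model can produce a final natural-language answer
  const followUp = await chat({
    model: 'claude-sonnet',
    messages: [
      ...previousMessages,
      response.message,
      { role: 'tool', tool_call_id: toolCall.id, content: JSON.stringify(weather) }
    ]
  });
}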

Manifest

You will need to add a new module in the manifest to enable LLM access. This will be at the model-family level, not a specific model/version (e.g. adding claude would give your app access to Opus, Sonnet and Haiku).

For example:

modules:
  llm:
    - key: my-ai-module
      model:
        - claude

Note: introducing an LLM will trigger a new major version.

Model selection

We’re planning to launch with support for the three Claude 4 models: Sonnet, Opus and Haiku. You will specify the model you want to use on each request, which gives you the flexibility to choose the right model for each use case (see the example below the model descriptions).

Opus

  • Most capable
  • Slowest (deep reasoning)
  • Highest cost

Sonnet

  • Balanced capability
  • Moderate speed and cost

Haiku

  • Fast & efficient
  • Lowest cost

Note: AI moves quickly so specific models and versions may change before launch.
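
For example, a single app might pick a different model for each task within the same invocation. A rough sketch (the model identifiers follow the naming used in the SDK example above and are assumptions, not final):

// Illustrative only: a fast, low-cost model for a simple classification task...
const triage = await chat({
  model: 'claude-haiku',
  maxCompletionTokens: 10,
  messages: [{ role: 'user', content: 'Classify this ticket as "bug" or "feature": ...' }]
});

// ...and the most capable model where deeper reasoning is worth the cost and latency
const summary = await chat({
  model: 'claude-opus',
  maxCompletionTokens: 1000,
  messages: [{ role: 'user', content: 'Summarise the likely root cause across these incident reports: ...' }]
});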

Initially, only text will be supported, but multimodal options will be considered in the future.

Admin Experience

We plan to be transparent about which apps have adopted Forge LLMs. Administrators will be informed via the Marketplace listing and at installation time if an app uses Forge LLMs.

Adding Forge LLMs (or a new model family) to your app will be a major version upgrade that requires admin approval.

Pricing

LLMs will become a paid feature of Forge as part of the pricing changes starting 1 January 2026. Usage will be visible in the developer console on the usage and costs page.

Usage will be measured by the volume of tokens sent and received by your app. Specific pricing/rates will be available before the capability goes to preview.

Responsible AI

Requests sent to LLMs on Forge will be subject to the same moderation checks as Atlassian’s first party AI/Rovo features. Messages that are flagged as high risk (according to Atlassian’s Acceptable Use Policy) will be blocked.

Asks

While we would appreciate any reactions you have to this RFC (even if it’s simply giving it a supportive “Agree, no serious flaws”), we’re especially interested in learning more about:

  • What are your thoughts on the proposed SDK interface? Are there patterns or features you would expect that are missing?

  • Are there use cases or requirements for LLMs in Forge apps that won’t be well supported in the proposed design?

  • How important is model choice, and what additional models or capabilities would you like to see prioritised?

  • Are there concerns about pricing, usage limits, or transparency that we should address?

12 Likes

Seems like something we have been waiting for, for some time. Very cool.

re: Pricing - We would have concerns about some customers causing us far more costs than revenue; we would probably require some or all of the following before being really comfortable adding something like this to our apps.

  • ability to charge customers based on model usage
  • ability to see at runtime (i.e. in code, before the model is invoked) the model usage for the customer (so we can set limits, and perhaps higher limits in an Advanced edition, etc.)
  • ability to disable usage for a given site from the developer console

12 Likes

Anthropic is not listed at List of Data Subprocessors | Atlassian. Does that mean the models are being used via Bedrock on AWS?

Generally, development of LLM-based tools requires excellent observability and iteration using evals, metrics, and more. The Forge Developer Console seems to be going in this direction, but not at a pace that aligns with this RFC. Can you comment on how you see this working? Do you expect us to use OTEL with third-party tools?

5 Likes

First of all, I really like this option that we can use other LLMs via Forge, this will make for some neat apps!

However, I agree with @richard.white about his concerns with pricing. As app developers, we cannot really predict how much people will use the app, so charging them seat-based while we get charged usage-based makes developing apps a really risky business for us.

It is also bad for the customers, as we would have to charge a higher price if some other customer has high usage. The experience with Forge usage shows that a few customers may be responsible for most of the usage.

My concerns are the same as those about the proposed Forge pricing (see the concerns I voiced here: Updates to Forge Pricing: Effective January 2026 - #18 by tbinna, still waiting for a reply there):

  1. If a few customers/licenses are responsible for large usage, we as app vendors do not have any means to influence that. Instead, all we can do is increase the pricing of our apps for the next month, so that ALL our customers will have to pay more. Is this an acceptable model for customers, that they pay for the app usage of other customers with higher consumption? Just as a reminder, one of Atlassian’s core values is: Don’t #@!% the customer.

  2. On the other hand, looking at potential abuse, there could be customers/licenses who want to financially harm us as app vendors: they could just create a site with a free user and an automated job with lots of usage. We as vendors are incapable of stopping them. This is a big financial threat that is fairly easy to carry out.

I really like the solutions proposed by @richard.white: we need the ability to charge customers based on usage (after a free tier, maybe?). Additionally, we need at least the ability to disable usage for a given site from the developer console to prevent potential abuse.

10 Likes

More dead apps. Admins do not manually update apps. I can show you the data.

Atlassian continues to offer no solution to this death spiral.

5 Likes

Hi,

It’s a good solution.

BTW, may I know whether I can use a customized Rovo agent directly in my Forge app? For example, I’ve developed a Forge app with UI Kit components and a custom Rovo agent. I want to use this Rovo agent directly in the app to show some AI analysis in the UI.

Thanks,

YY1

3 Likes

So if I got that right, it’s clear that vendors will need to actively manage consumption to stay in control of margins.

From our perspective, this means implementing tenant-level usage tracking and aligning it with our app’s edition model (e.g., quotas per tier). That way, we can define what’s included (e.g., 10k tokens/month for Advanced) and throttle or upsell beyond that. This gives us pricing control, even though Forge bills us directly.
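
To make that concrete, this is roughly the kind of thing we would have to build ourselves today (a sketch using Forge’s storage API; the quota numbers, edition handling and per-request token counts are all assumptions on our side):

import { storage } from '@forge/api';

// Illustrative quotas per app edition; real numbers would come from our pricing model
const MONTHLY_TOKEN_QUOTA = { standard: 100_000, advanced: 1_000_000 };

async function consumeTokens(edition, tokensUsed) {
  // Forge storage is scoped to the installation, so this key is effectively per tenant
  const monthKey = `llm-tokens-${new Date().toISOString().slice(0, 7)}`; // e.g. "2025-11"
  const used = (await storage.get(monthKey)) ?? 0;

  if (used + tokensUsed > MONTHLY_TOKEN_QUOTA[edition]) {
    throw new Error('Monthly LLM token quota exceeded for this site');
  }
  await storage.set(monthKey, used + tokensUsed);
}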

A few open questions:

  • Will Forge provide APIs or events to help track token usage per tenant?

  • Is there support planned for enforcing quotas or surfacing usage to tenants?

  • Maybe we could define the quota per edition in the manifest, to keep control? That would be cool.

This is a powerful capability, but pricing transparency and quota enforcement tooling will be key to enabling sustainable adoption and staying within budget.

Cheers

Oli

12 Likes

Dear @AdamMoore ,

very happy to hear that Atlassian is addressing this LLM/RoA problem, and happy to hear that Anthropic as a company is picked here over OpenAI. Here are my initial comments:

What are your thoughts on the proposed SDK interface? Are there patterns or features you would expect that are missing?

→ I think if you follow Anthropic’s standard you are doing nothing wrong here. It wouldn’t be the worst idea to run some internal POCs to check performance.

→ Streaming support would be essential. Responses streamed via Server-Sent Events (SSE) instead of normal REST would significantly improve UX, especially when generating code, expressions, or longer content where users benefit from seeing real-time progress.

→ Prompt caching is critical for production use. Apps that repeatedly send similar system prompts or static context (like Jira expression syntax rules, schema information) would see massive token cost reductions.

Are there use cases or requirements for LLMs in Forge apps that won’t be well supported in the proposed design?

→ Efficient context injection is crucial. It would be valuable to have performant ways to inject instance-specific data (custom fields, issue schemas, workflows, project configurations) without consuming excessive tokens. Ideally, this could be passed as structured metadata rather than being part of the main prompt maybe?

→ Validation integration for generated code/expressions. SDK access to Jira’s expression validator or similar validation endpoints would enable feedback loops: generate → validate → refine. This is essential for building reliable AI features that generate domain-specific languages like JQL or Jira expressions. As you already have this built for JQL, it would be great if you enabled us to use it in the same performant way, or even better.

How important is model choice, and what additional models or capabilities would you like to see prioritised?

→ Pre-trained or fine-tuned models on Atlassian-specific knowledge would be game-changing. Models that understand JQL, Jira expressions, automation rule syntax, and Atlassian terminology (project = space, work item = issue, etc.) out of the box would dramatically improve accuracy and reduce prompt engineering effort.

→ Fine-tuning capabilities for app vendors would enable us to build highly specialized features. Even limited fine-tuning options (e.g., on specific syntax or app-specific knowledge) would open up advanced use cases.

Are there concerns about pricing, usage limits, or transparency that we should address?

→ Per-customer token limits would be essential for monetization. Based on cloud ID, app vendors should be able to set token quotas per instance to offer tiered pricing packages (e.g., “Basic: 500K tokens/month”, “Premium: 5M tokens/month”). This prevents cost overruns and enables predictable pricing for customers.

→ Pricing transparency: Could you provide concrete information on how Atlassian’s token pricing compares to Anthropic’s API pricing?

→ Usage monitoring: Detailed token usage analytics per customer instance in the developer console would be essential for cost control, optimization, and debugging unexpectedly high consumption.

These are my initial thoughts; happy to see what Atlassian will come up with, and all the best for implementing it :slight_smile:

Cheers!

5 Likes

This should be decoupled from versioning as proposed here. Otherwise, existing apps will have fragmented versions that we’ll have to support forever.

3 Likes

Thanks @AdamMoore, exciting to see this RFC given some of the work we’ve got in flight.

I’ll echo @BorisBerenberg’s comments about observability, prompt versioning and trace evaluation, and add LLM-as-a-Judge. That is something I’ve heard nothing about from Atlassian thus far (for Rovo agents generally, not this RFC specifically of course) and it concerns me.

Happy to jump on a call and explain in more detail plus show how we approach this.

Thanks Adam,
Nick Muldoon, Easy Agile

2 Likes

Having native LLM support in Forge is definitely a step in the right direction and opens up exciting possibilities for smarter, more interactive apps within the Atlassian ecosystem.

That said, a few important details would be helpful to clarify:

  1. Streaming Support
    For a good UX, especially in chat-like interfaces, HTTP streaming is essential. If the LLM integration is still routed through Forge functions (which currently don’t support streaming), this could be a limiting factor. Are there any plans to support streaming responses in the future?

  2. SDK Capabilities
    Will the Forge SDK expose the full range of capabilities offered by the underlying model APIs? For example, file uploads are a crucial use case in many apps, especially those dealing with document summarization or code analysis. It would be great to know if such features will be supported out of the box.

  3. Cost Management & Usage Control
    LLM usage can quickly become expensive. Are there mechanisms planned to monitor usage per user or app, enforce quotas, or restrict access? Having fine-grained control over who can use the LLM and how much they can use it would be essential for managing costs effectively.

Looking forward to seeing how this evolves, thanks for pushing this forward!

3 Likes

Hi @AdamMoore ,

Thank you for this RFC, this is something we’ve been interested in for a long time.

Some notes from the top of my head:

  • As was already said, the Claude SDK seems a good choice.
  • A major update to add LLMs makes sense, but it will make us and many others think twice about adding that feature. Doubly so if changing the model will also trigger a major update.
  • On model choice: for some use cases (classification, NER), it makes sense to use much smaller models than even Haiku. Will it be possible to BYO models in the future?
  • Forge functions have runtime limits; it would be very convenient if we could give the LLM a maximum allowable response time.
  • Pricing is a concern, as we’d have to build secure per-customer usage tracking. Not sure yet how we’d do that. That concern is not unique to LLMs, but it will likely be more important here.

Related, but anecdotal: we’ve had a few calls lately where the specific requirement from the customer was “No AI”, with no room for discussion about egress or even what constitutes “AI”. This was much more important than trust signals like RoA or Cloud Fortified. If you want adoption of apps with Atlassian-hosted LLMs, we will need to overcome a general “No AI” sentiment among some enterprise buyers.

All in all though: Sounds great, we’d love to see this become available.

2 Likes

JSON Schema enforcement would be important for us. We often need responses with multiple fields that carry distinct semantics.

Similarly, usage tracking and per-customer limits would be extremely useful. Alternatively, Atlassian is well placed to bill customers directly for usage. Or give customers an opportunity to “buy credits” for further use beyond the limit. Or give them an opportunity to provide their own API keys.

2 Likes

This is really welcome.

For the SDK, we’d appreciate being able to use the emerging APIs rather than yet another AI SDK. Something like https://ai-sdk.dev/ would work well, and could possibly be supported on top of the realtime API? I didn’t see any streaming support in the example given; without streaming, good user experiences will be very challenging.

On the cost side, we’d like to see context caching. For some of the use cases we have, this makes a big difference. Cost or usage tracking is also important; some kind of tracking (and potential limits) against accountIDs (tagging) would also be useful.

Have you considered the ability to pin a model to a specific version? We have seen versions of models behave differently after updates, which has required work to fix. Updates to models could break or change functionality in unexpected ways.

2 Likes

This might be a different topic, but I’d love to see an AI SDK at the Rovo agent level of abstraction, rather than (or besides) a bare-bones LLM. I hope that with Forge apps calling Rovo agents, the cost of AI is rolled into the customer’s Rovo usage instead of being charged to the Forge apps. As a developer, I would benefit from developing upon and utilizing the fine-tuning and context-awareness of Rovo (with optional organizational knowledge) instead of starting from scratch with AI agents. (I do recognize that other developers might prefer a bare-bones LLM.)

3 Likes

+1 on @Shu’s suggestion:

the cost of AI is rolled into the customer’s Rovo usage instead of being charged by the Forge apps

LLM usage should not be billed towards the app.
Forge apps using LLMs should somehow proxy their LLM traffic/tokens through the customer that installed the app.
This approach would solve a ton of headaches and save vendors from bending over backwards to control token consumption.

3 Likes

Hey everyone,

Thanks for all the great comments and responses. I’ll address some of the high-level themes; hopefully this catches most of them:

Consumption and limits

Yes, early adopters of Forge LLMs will need to build user/tenant/edition-based limits within their apps. Feature flags will be available by the time LLMs are in preview, which could be useful for more dynamic controls, throttling, etc.

With Forge pricing generally, you can expect to see a lot more alerts/controls/APIs etc. coming in 2026 that will make it easier to observe and manage use of all Forge paid capabilities. We’re getting lots of good feedback on this, including in the developer space session yesterday.

Expect to see more RFCs etc. on improving the tooling around paid Forge services.

Consumption-based pricing on Marketplace

This is still on the Marketplace roadmap, although it has been delayed. I appreciate this will be limiting for the types of use cases you can build in the short term, but in the medium/long term there will be more flexibility around monetisation.

Observability, evals, testing etc.

Obviously we understand how important this is, and it’s something we are doing a lot of internally with Rovo development etc. We won’t have any specific new features at launch, but we’re keen to work on better supporting this on Forge.

Some loosely held thoughts:

  • I imagine we would lean heavily into offline evaluations (using pre-collected, static datasets before production). So I would recommend, if anything, that folks start investing in their training data sets.
  • Considering that we’re just exposing the vanilla Claude models, for now you could use third-party tools to run evals against those models outside of Forge, to test prompt tweaks, different model versions, etc. (see the rough sketch after this list).
  • As for online evals, expect to have very limited (if any) access to raw prompts/responses in production. We hold ourselves to pretty high standards internally and you can expect that to be reflected in any tooling we build.
  • For online evals you could use things like feature flags (coming soon) and custom metrics to capture click-through rates, thumbs up/down, engagement, etc. in production. You could also use the RoA-supported OTEL/analytics services.
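
As a loose illustration of the offline-eval idea (nothing here is a committed Forge API; callModel is a placeholder for whatever client or third-party tool you use to reach the Claude models outside Forge):

// Generic offline eval harness: run a fixed dataset through the model and score the outputs
const dataset = [
  { prompt: 'Classify: "App crashes on login"', expected: 'bug' },
  { prompt: 'Classify: "Please add dark mode"', expected: 'feature' }
];

async function runEvals(callModel) {
  let correct = 0;
  for (const { prompt, expected } of dataset) {
    const output = await callModel(prompt);
    if (output.trim().toLowerCase().includes(expected)) correct++;
  }
  console.log(`Accuracy: ${correct}/${dataset.length}`);
}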

Major versions

Yes @BenRomberg, I should have pointed out that this will be a new permission, which will be able to be decoupled from versioning as per RFC-106. In that case it wouldn’t be a major version change, but the admin would need to accept the new permission before that installation could access the LLM.

SDK suggestions

Lots of good suggestions around things like streaming, prompt caching, etc. Keep them coming :slight_smile: We will have some constraints based on Atlassian’s internal AI gateway and the underlying host (AWS Bedrock in the case of these models), but we’ll look into these.

1 Like

This is the same problem. 80-90% of admins will not take that action. Nothing to do with the permission scope, everything to do with inaction on manual updates or approvals.

Go and ask internally for this figure: what % of Forge app installs are running the latest version?

I’ll bet the global average is around 15-25%. Run it again for paid vs free apps. Also for under-10-user free tiers. Then work out the figures for 3/6/12 months out from the latest major version release.

“Oh just show a message to the user to update the app”: sure, but the user isn’t the one who can accept those permissions. They’ll need to contact their admin, who won’t take the action.

“Oh we’ll eventually push notifications to admins”: ok, why isn’t that already happening, and realistically what do we expect that to change? An extra 10% of admins take action?

Compounding death spiral for the entire Marketplace. RFC-106 doesn’t solve this. Yes, it will eventually result in all installs executing the latest code, but only 15-25% of those installs (across the entire Marketplace) will be functional for end-users at runtime.

Why? Because the overwhelming majority of developers will check if ALL required permissions are enabled, and if not, replace the entire app render with “ask your admin to approve missing permissions”. But again, the end-user will get that message whilst only the admin can approve. Admins will not take this action. Dead app.

How does this relate to this RFC? If you have an existing app and add this LLM module, or if you ever decide to switch from Claude to any other model, or if Atlassian forces a migration, or if the permission requirement or naming ever changes, then expect that only 15-25% of your installs (forever) will be functional for end-users at runtime. Apply this to every other RFC and every app: zombie marketplace.

I’ll keep complaining until someone at Atlassian groks the existential nature of this problem.

1 Like