RFC-57: Long-running Forge functions

RFCs are a way for Atlassian to share what we’re working on with our valued developer community.

It’s a document for building shared understanding of a topic. It expresses a technical solution, but can also communicate how it should be built or even document standards. The most important aspect of an RFC is that a written specification facilitates feedback and drives consensus. It is not a tool for approving or committing to ideas, but more so a collaborative practice to shape an idea and to find serious flaws early.

Please respect our community guidelines: keep it welcoming and safe by commenting on the idea not the people (especially the author); keep it tidy by keeping on topic; empower the community by keeping comments constructive. Thanks!

Summary of Project:

This project aims to better support long-running compute processes on Forge by extending the timeout of Forge functions used as async event consumers.

  • Publish: 9 August 2024
  • Discuss: 23 August 2024
  • Resolve: 30 August 2024

Problem

Forge functions provide the hosted compute for Forge apps. Currently, Forge functions have a timeout of 25 seconds for most modules and 55 seconds for async events. This limits how much processing can be done in a single invocation, which adds complexity for apps with long-running processes.

These timeouts can be difficult to work with if your app is:

  • processing large amounts of data (e.g. when setting up a new installation),
  • making a high volume of API calls (e.g. paginating through an endpoint), or
  • calling APIs that can take a long time to respond (e.g. LLMs).

While it is often possible to work around these limitations by batching up large processes with async events, breaking big workloads into such small batches adds complexity to your integration and can create overhead that runs into other limits.
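
To illustrate, here’s a rough sketch of what that workaround often looks like today. The queue key and page size are illustrative, not prescriptive:

  // Sketch of the batching workaround: split a big workload into many
  // small async events, each processed within its own 55-second invocation.
  import { Queue } from '@forge/events';

  const queue = new Queue({ key: 'sync-queue' }); // illustrative queue key

  export async function enqueueBigWorkload(totalItems: number): Promise<void> {
    const pageSize = 50; // sized so one page fits comfortably in 55 seconds
    for (let page = 0; page * pageSize < totalItems; page++) {
      await queue.push({ page, pageSize }); // one event per page of work
    }
  }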

This has meant that developers building apps with long-running processes have had to invest in complex workarounds or have decided not to run their compute natively on Forge.

We see this as an important problem to solve so that more developers can build their apps natively on Forge. This is increasingly important to our enterprise customers as they migrate to the cloud.

Proposed Solution

To solve this problem we intend to increase the timeout of Forge functions on async event consumers to 15 minutes, which is the maximum supported by AWS Lambda. To achieve this in a performant and scalable way, we will introduce the ability to run these functions asynchronously.

For example:

  consumer:
    - key: big-job-queue-consumer
      queue: big-job-queue
      resolver:
        function: processBigJobQueue
        method: submit-big-job-queue-listener
  function:
    - key: processBigJobQueue
      handler: index.processQueue
      async: true    # opt in to long-running functions
      timeout: 900   # specify the timeout, up to a max of 900 seconds
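
For context, the consumer function referenced above might look like the sketch below. The method name must match resolver.method in the manifest, and the export must match the handler:

  import Resolver from '@forge/resolver';

  const resolver = new Resolver();

  // Matches resolver.method in the manifest above. With async: true this
  // handler could run for up to the configured 900-second timeout.
  resolver.define('submit-big-job-queue-listener', async ({ payload }) => {
    // ...long-running work driven by the queued payload...
  });

  // Matches handler: index.processQueue in the manifest above.
  export const processQueue = resolver.getDefinitions();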

Adopting async functions on async event consumers won’t change behaviour, and the developer experience remains much the same (aside from updating your manifest). Logs and metrics in the developer console will let you monitor invocation times, errors, etc. in much the same way as you do currently.

If you need to start a long-running process from a different module (e.g., UI resolvers), you can use async events from the short-lived function.
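
For instance, a minimal sketch of a short-lived UI resolver that queues the long-running job defined above (the resolver key is illustrative):

  import Resolver from '@forge/resolver';
  import { Queue } from '@forge/events';

  const resolver = new Resolver();
  const bigJobQueue = new Queue({ key: 'big-job-queue' });

  // Short-lived resolver: enqueue the job and return immediately;
  // the long-running consumer picks it up asynchronously.
  resolver.define('startBigJob', async ({ payload }) => {
    const jobId = await bigJobQueue.push({ input: payload });
    return { jobId };
  });

  export const handler = resolver.getDefinitions();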

Note: We acknowledge that the ability to send notifications from the backend to the front end is still a gap in Forge’s capabilities. For now, you will need to poll from your front end to detect when a long-running job is completed. Options for this will be discussed in a future RFC but feel free to share any ideas or concerns.
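
As a rough sketch of that polling approach, assuming a hypothetical getJobStatus resolver that reads a completion flag your consumer writes when the job finishes:

  import { invoke } from '@forge/bridge';

  // Poll the backend every 5 seconds until the job reports completion.
  // 'getJobStatus' is a hypothetical app resolver, not a built-in API.
  async function waitForJob(jobId: string): Promise<void> {
    for (;;) {
      const { done } = await invoke<{ done: boolean }>('getJobStatus', { jobId });
      if (done) return;
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }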

Limits for async functions

As part of the project we are assessing our limits around function invocations. In particular:

  • Invocation rate limit: 500 per 25-second sliding window
  • Log lines per invocation: 100
  • Log size per invocation: 200 KB
  • Network requests per invocation: 100

We would be really interested in your feedback on how these limits would affect your app once 15-minute functions are available. Which would be a blocker for your use case and what increase in limits might you need?

Asks

While we’re happy to get any feedback on this RFC, we’re particularly hoping to get insights on:

  • What use cases do you see this enabling?
  • Do you think this solution is sufficient to support your long-running tasks?
  • If you’d prefer to have long-running async functions on other modules - what would they be?
  • Would you like to see any changes to the existing invocation limits to ensure this will work for your use case?
30 Likes

It’s great to have the option to increase that limit.
Additionally, it would be helpful to raise the limit on network requests as well. In my case, I need to call multiple APIs—some are slow, while others are fast but return limited data, requiring additional requests. Increasing these limits would be a significant improvement.

3 Likes

Thanks, @AdamMoore, for this announcement. I’m convinced that extending the invocation timeout will increase the number of use cases that can be achieved on Forge natively. I can’t wait to get rid of some of the hacks I’ve built over the last few years to bypass the current limitations.

I’d like to accept the offer to provide some feedback on the described limitations. My feeling is that the limitation on network calls, in particular, could pose a risk in terms of the usability of the long-running functions.

This is because I think the feature will most commonly be used for migration tasks in which data is shifted from an external database to the internal Storage API, as well as from the older key-value store to the entity store. Also, we are betting on Atlassian providing a SQL-like database in the future, which will introduce a third type of mass migration.

Given the fact that reads on the Storage are still limited to 20 entities per request, this could quickly become a bottleneck. This brings me to the first question: Do Storage API reads count as a network request?

Use Case

A specific use case: Templating.app was among the first Forge apps, and it is still Forge-native. It uses the old Storage KV store. Among other features, it has an entity called ‘Issue Templates.’ One Issue Template is persisted across 3-n values in storage, because chunking is needed to work around the 240 KiB limit. We would like to use long-running functions to move to the entity store to enable more complex queries. We have customers with 300 issue templates, maybe more; we can’t be sure because we don’t see the data.

Let’s assume we have 300 entities that are persisted across 1400 key/value pairs. Reading all of them leads to at least 70 requests using the maximum page size (1400 / 20). Writing all of them to the entity store will cause another 300 writes (300 requests) because there are no bulk writes for the new entity Storage, as far as I know.

Ignoring all further requests (e.g. to verify permissions), this rather simple migration would make 370 requests over probably 5-10 minutes and thus breach the limit of 100 requests. This would again lead to a situation where we had to build shaky workarounds to split the load.
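
To make that concrete, the migration loop would look roughly like the sketch below (based on my reading of the storage docs; reassemble() is a placeholder for our chunk-stitching logic and the key prefix is illustrative):

  import { storage, startsWith } from '@forge/api';

  // Placeholder for our own logic that stitches 3-n KV chunks
  // back together into whole issue templates.
  declare function reassemble(
    chunks: Record<string, unknown>
  ): Record<string, object>;

  export async function migrateTemplates(): Promise<void> {
    const chunks: Record<string, unknown> = {};
    let cursor: string | undefined;

    // ~70 paginated reads at the maximum page size of 20.
    do {
      let query = storage.query()
        .where('key', startsWith('issue-template:'))
        .limit(20);
      if (cursor) query = query.cursor(cursor);
      const { results, nextCursor } = await query.getMany();
      for (const { key, value } of results) chunks[key] = value;
      cursor = nextCursor;
    } while (cursor);

    // ~300 single writes: no bulk write for the entity store, afaik.
    for (const [id, template] of Object.entries(reassemble(chunks))) {
      await storage.entity('issue-template').set(id, template);
    }
  }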

Summary

  • Do I understand correctly that Storage API access counts as a request?
  • I’d suggest increasing the limit for those functions to at least 1k network requests.
  • I’d also like to double-check that those functions have at least 512 MB of memory.
  • The log line limit sounds annoying. I don’t know of any other cloud provider where I have to count the number of log lines my code produces. I’m currently not sure whether we’re close to those boundaries.

Thanks again for the update. I’m really looking forward to using this improvement @AdamMoore!

6 Likes

Thanks @AdamMoore this sounds great.

I think the 100 network requests limit will bite us here. As @JulianWolf mentions, the likelihood is that long-running functions will be used for migration logic or workloads that need to page through REST APIs.

In 15 mins, 100 requests means making a request only once every 9 seconds. It’s painful to have to build our logic to be quite so aware of the various platform limitations and chunk up the work accordingly.

Equally, the 100 log message limit is just plain annoying in development mode, when you’re trying to debug when and why things break or behave unexpectedly.

8 Likes

This is great news! Our long-running imports got slower around the time of the data residency updates. The async job would time out and launch another instance, but the import it called would still be running, leading to duplicate records. It’d also be nice to see timeout errors in the log. I ended up setting a timer at 50 seconds and retrying after a 5-minute wait, so the import would finish and the next job execution would see the finished import.
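
In case it helps anyone else, the guard looks roughly like this (the helper functions are placeholders for our import logic, and I’m assuming the queue’s delayInSeconds push option):

  import Resolver from '@forge/resolver';
  import { Queue } from '@forge/events';

  const resolver = new Resolver();
  const queue = new Queue({ key: 'import-queue' }); // illustrative key
  const BUDGET_MS = 50_000; // stop well before the 55-second timeout

  // Placeholders for our actual import logic.
  declare function hasMoreWork(offset: number): Promise<boolean>;
  declare function importNextBatch(offset: number): Promise<number>;

  resolver.define('import-listener', async ({ payload }) => {
    const start = Date.now();
    let { offset } = payload as { offset: number };
    while (await hasMoreWork(offset)) {
      if (Date.now() - start > BUDGET_MS) {
        // Out of budget: re-queue after 5 minutes so the in-flight
        // import can finish before the next invocation resumes.
        await queue.push({ offset }, { delayInSeconds: 300 });
        return;
      }
      offset = await importNextBatch(offset);
    }
  });

  export const handler = resolver.getDefinitions();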

An ability to run queue jobs one at a time would also be very helpful for us. Forge Cache will work if that can’t be added, though. We’re launching the import using a postfunction, so bulk moves can launch several jobs at once.

Those limits look ok for our use case. I’d increase the number of log entries if anything.

We’d be interested in this (assuming the control of data processing location could be resolved, as well as the data storage processing instructions) - different issues, I know.

Since we’d use this to process an undetermined number of Jira search requests in a single job, that would be concerning for us. Any way to make Atlassian requests be “preferred”, i.e. not against the limit, but rather we would have to keep track of where we are in the time period?

1 Like

Hey everyone,

Thanks for all the quick feedback and constructive responses.

Network requests and log line limits are certainly top of mind, so thanks for confirming. They were designed with 25-second invocations in mind, so it makes sense that 15-minute invocations will need significantly higher limits.

@JulianWolf I think anything that leaves the lambda (other than requestConfluence/requestJira) is counted as a network request - so storage is included. I’ll double check if/why that’s the case though - maybe it’s something we can change.

Thanks for providing all those details from your use case - it always helps to have some actual numbers to ground our thinking on.

Yes, the functions will have 512 MB of memory.

@JaredJonckheere sequential queues are something we know we need to support on Forge as well. Ideally that’s something you wouldn’t need to use the cache for. Your suggestion for timeout logs is a good one too.

@danielwester yes, I’m sure we’ll have much more to talk about with data processing locations :slight_smile:. I didn’t quite understand your second point. Is the question about how you would keep track of progress for a job when you don’t know how long it will take and it might go over multiple invocations? I’m not sure what you meant by making Atlassian requests “preferred”.

2 Likes

For the second point: it would be awesome if Atlassian REST endpoints for declared scopes were not counted against the request limit. Surely Atlassian has control over the performance of their own endpoints, and they’re the best-performing endpoints ever. (tongue in cheek there btw).

But seriously - I totally understand that HTTP connections are expensive - but Jira/Confluence aren’t exactly cheap to talk to, especially over 15 minutes. Maybe we could get Jira/Confluence GET requests for free? (i.e. they wouldn’t count against the 100-request limit).

/Daniel

Ok, well, after some investigation Today I learned… :sweat_smile:

The network request limit only applies to egress requests and does not include storage, requestJira/Confluence/Bitbucket etc. The team have double-checked the logic and also manually verified that you can make more than 100 storage requests in an invocation.

I’ll update the docs to clarify that. Apologies for the confusion.

Hopefully that addresses some of the concerns for the 15-minute functions. I think we’ll still probably increase the limit for apps that are interacting with remote resources from within their Forge functions.

9 Likes

Thank you for checking with the team and providing further clarification, @AdamMoore. This makes the internal migration task challenges I’ve described above seem achievable.

As remote resources probably don’t have limits as strict as Forge Storage, vendors should be able to get further with 100 requests. However, increasing the limits might still be a good idea to reduce the likelihood that vendors have to build shaky workarounds too early.

That is awesome! One of our challenges outside of Forge is rate limits and handling them in time. Yet another reason to adopt Forge, I guess. :wink:

2 Likes

I don’t want to spoil the Forge love… but just to clarify, the Jira REST API rate limits will still apply. It’s just that there won’t be additional limits enforced by Forge.

2 Likes

Our use cases are around API integrations, i.e., pulling data from many sources into Jira. This is going to be very useful for us. We’ve had to do gymnastics to get around the timeout issues by complicating our design with unnecessary async events. This will simplify our design and make our apps more robust.
I am new to the RFC process. What do the dates mean? When will this feature be available as EAP or preview?

Hey Girish, the dates are just to put timelines around the conversation in the RFC.

As far as delivery goes we’ve already started work and are aiming for an EAP late this quarter or early next quarter.

Super excited about this! Like others, we have some hacks in place as a workaround for the 55-second limit. Having a 15-minute window means one less thing for us to worry about.

2 Likes

Hello there!

First of all, thanks a lot for this RFC, Adam. It’s awesome that you can increase function times like this; it will surely unblock a lot of use cases, or at least make some of them easier to manage.

We have some use cases where we work with batches of issues/projects or different app entities. We usually create async events to manage them, and this increase in function time may make our process management easier in some cases.

But there are other long-running tasks that may not benefit from this feature. For example, our apps have several processes that can take many minutes, even hours. Some of them can be split across parallel functions, but others cannot: a task that upgrades the data stored in Forge storage, or a task that builds a file from a large amount of data that can take hours to collect…

For those use cases, a backend outside of Forge will be our approach, but it would be great to consider these heavy tasks in Forge’s future evolution. That being said, this is a really great improvement. Good news!

Have a nice day!

Thanks @alvaro.aranda. After this project we’ll be exploring different compute models (e.g. serverless containers) that would make it possible to run workloads for longer than Lambda’s 15 minutes - but that’s obviously a much bigger project with a longer time horizon.

Hopefully the 15 minute functions will solve lots of pain points and use cases in the near term.

Thanks everyone for the positive feedback on this RFC. Work is now well underway building this feature.

A couple of closing points:

  • We’ll focus on async events for now - we can add support for different types of functions if there is sufficient demand in future.
  • We will shift the limits to be per minute rather than per invocation, e.g. you’ll get 100 egress requests per minute for as long as the function runs.
  • We will simplify the manifest slightly, so you opt into a longer-running function simply by setting the timeout:

  function:
    - key: processBigJobQueue
      handler: index.processQueue
      timeout: 900   # specify the timeout, up to a max of 900 seconds

We’re currently on track to have this in EAP by early October. Keep an eye on the Forge change log for any announcements.

2 Likes

Hi everyone!

The Early Access Program for long running Functions is now available for sign-ups. Please see the change log for details on the feature and instructions on how to register for the EAP: https://developer.atlassian.com/platform/forge/changelog/#CHANGE-2115

1 Like