Improving support for long-running tasks in Forge

HeyJoe · May 23, 2023, 2:52am

Hi everyone,

I’m a Product Manager working on Forge, and my team is responsible for the runtime layer and function invocation flow for the platform.

As part of planning our future roadmap, I would love to understand in detail where Forge’s existing architecture, quotas and limits prevent you from supporting long-running tasks within your app.

If you’ve experienced this problem, I want to hear from you! I am especially keen to understand:

What use case or customer problem were you trying to solve?
When in your app creation journey did you get blocked by this limitation (early on in prototyping, much later in development, or during customer testing)?
What workarounds or alternatives did you try?
What would you like to see changed in Forge to improve support for long-running tasks?

Please feel free to reply directly in this community topic, or else you can set up some time with me to talk about it in more detail: Calendly - Joe Clark

Thank you in advance for your time and feedback!

Joe Clark
Atlassian

remie · May 23, 2023, 6:52am

@HeyJoe, I was wondering if this would qualify for an RFC?

CC: @ibuchanan

HeyJoe · May 23, 2023, 7:20am

Hey @remie - yes, I think any proposed solution would be a great candidate for an RFC!

I’m still in the information gathering phase right now, so I think an RFC would be premature. I’m not recommending any specific solution yet and, in fact, I want to avoid having a solution in mind before researching the problem.

If you have strong opinions about how you’d like to see Forge evolve with regard to compute capacity and supporting long-running/batch tasks, I’m eager to hear your input in whatever format is convenient for you.

DavidPezet1 · May 23, 2023, 3:48pm

Two issues we sometimes struggle with are the 25s load time and the GraphQL 5242880 byte size limit.

We typically run into the 25s load time limit when dealing with paginated API calls, both to Atlassian APIs and to third parties. For example, when making an API call that performs a JQL search to get the issues for a report.

We run into the size limit when generating Confluence macros with large tables or multiple tabs of information, and we end up trimming the content down or splitting it into separate macros.

Any recommendations for tackling either of these would be appreciated.

ibuchanan · May 24, 2023, 6:13am

Cross-posting a use case:

LukasKotyza1 · May 24, 2023, 12:31pm

Hello, when I was testing Forge I ran into the issue of really slow loading times. When I investigated, I figured out that making calls to ‘bridge’ took at least 2 seconds, which doesn’t seem like such a long time, but when there are more calls it adds up.

I found those links

According to this ticket it was supposed to be fixed:
https://ecosystem.atlassian.net/browse/FRGE-232
But the issue was still there (I had the most recent versions of all packages and Forge).
Since it was slow even when I requested static resources via ‘bridge’, it looked like it was being slowed down on purpose.

So the biggest trouble I had with Forge was performance, and because of that I never left the research and testing phase.

ryan · May 25, 2023, 2:21pm

I’ve had a couple of product ideas that would require iterating over all of the content in a Confluence instance and grabbing certain fields or metadata from each page. Think specialized search and organization tools. I haven’t implemented it because the 25s limit could cause issues with doing this on larger instances. There might be workarounds using queues but they have their own limits and I didn’t feel like going down a rabbit hole to find a dead end.

The thing is I would only need to do this once when the app was installed. After that, I could just listen for page_created, page_updated events to keep the index up to date.

klaussner · May 25, 2023, 4:20pm

Thank you for gathering feedback on this topic, @HeyJoe!

We have a Forge app that scans Confluence spaces to find outdated and archivable content. The app is processing a lot of pages in the background to generate reports for our users. We started working on it shortly after the async events API was released and had to find a few workarounds for its limitations:

When making lots of requests to the Confluence API, it can be difficult to predict how many requests can be made within 25 seconds and how to split up a large amount of work into small 25 second batches. After every batch, we are pushing a new event to the queue until all batches are processed.
Since the 200 KB async event payload limit is too small for our use case, we had to find a way to store intermediate results in Forge storage after each invocation and to aggregate them into a report at the end.
It’s hard to control parallelism with async events because we don’t know in advance how many batches we have to process. We can’t run everything in parallel because of REST API and Forge storage rate limits.

I think we could avoid most of this complexity if we could run background jobs without time limits.
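A minimal sketch of the batching pattern described above: process pages until a time budget is nearly used up, then hand back a cursor so that the next queued event can continue. None of the names here come from the Forge SDK; `runBatch`, `fetchPage`, `handle` and the cursor shape are hypothetical, and the clock is injectable only to make the logic testable.

```typescript
// Hypothetical result shape: nextCursor === null means all work is finished.
interface BatchResult<C> {
  nextCursor: C | null;
  processed: number;
}

/**
 * Processes pages until the work is done or the time budget has run out.
 * `fetchPage` stands in for any paginated REST call (an assumption).
 * The deadline is only checked between pages, so `budgetMs` must leave
 * enough headroom for one full page under the real invocation limit.
 */
async function runBatch<C, T>(
  cursor: C | null,
  fetchPage: (cursor: C | null) => Promise<{ items: T[]; next: C | null }>,
  handle: (item: T) => Promise<void>,
  budgetMs: number,
  now: () => number = Date.now
): Promise<BatchResult<C>> {
  const deadline = now() + budgetMs;
  let processed = 0;
  let current = cursor;
  do {
    const page = await fetchPage(current);
    for (const item of page.items) {
      await handle(item);
      processed++;
    }
    current = page.next;
  } while (current !== null && now() < deadline);
  return { nextCursor: current, processed };
}
```

The caller would push a new event carrying `nextCursor` whenever it is not `null`, which keeps each invocation short without having to predict up front how many requests fit into one batch.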

HeyJoe · May 26, 2023, 3:49am

@DavidPezet1 - how big of an increase to the 25s invocation timeout limit would you need to no longer worry about this limit?

In terms of recommendations for right now, the best approach is to use the Async Events API to process the paginated results in batches if you are not already doing that. Even then, there are still limits that put an upper ceiling on how much you can process in this way.

Another option is to utilise external resources to offload some of the compute from your app, but this isn’t a feasible option in all cases, and erodes some of the value of trying to build your app as wholly contained within Atlassian’s cloud in the first place.

I had personally not come across the GraphQL response size limit before! Thanks for flagging it with me - I will pass this feedback on to the Confluence team and see if they have any recommendations.

HeyJoe · May 26, 2023, 3:58am

Hi @LukasKotyza1

When I investigated, I figured out that making calls to ‘bridge’ took at least 2 seconds, which doesn’t seem like such a long time, but when there are more calls it adds up.

Thanks for flagging this. Where are you based in the world? Currently all Forge functions run out of US-West, so there can be a pretty substantial delay caused by geographic latency - we are planning to address this and make Forge invocations multi-region later in the year.

HeyJoe · May 26, 2023, 3:59am

Thanks @ryan for calling this out. In my conversations so far this is appearing as a very common use case!

HeyJoe · May 26, 2023, 4:10am

Thanks for this detailed feedback, @klaussner !

Some follow-up questions:

By storing intermediate results in Forge storage, have you run into any read/write API limits on the storage API?
We may not be able to support background jobs with no time limits easily (due to our current reliance on Lambda for invocations - which can go up to 15mins). What kind of increase in the timeout limit would substantially simplify things for you? Does 1min make a difference? 5min?

LukasKotyza1 · May 29, 2023, 8:31am

Hey @HeyJoe, that could explain it. I am based in Europe and my instance of Jira was also in Europe.

SilvreLestang · May 30, 2023, 10:18am

Hey @HeyJoe ,
Like @DavidPezet1, we stumble over the maximum response size limit (error: apollo GraphQL error: Response must not exceed 5242880 bytes in size) when requesting a list of issues for a JQL query from a resolver.
We work around this issue by requesting only a subset of the fields for each issue. But depending on the client’s issues and queries, we might still hit the limit unpredictably.
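One possible way to make this workaround more predictable is to cap each resolver response by its serialized size and return the rest in follow-up calls. The sketch below is only an illustration: `takeWithinBudget`, `utf8Length` and the 90% headroom factor are assumptions, and only the 5242880 byte figure comes from the error message above.

```typescript
const MAX_RESPONSE_BYTES = 5242880; // limit quoted in the error message

// Counts the UTF-8 bytes of a string without relying on platform APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte character
      i++;
    } else bytes += 3;
  }
  return bytes;
}

interface Chunk<T> {
  items: T[];
  nextIndex: number | null; // null means nothing is left to send
}

/**
 * Takes items starting at `startIndex` until the estimated JSON size would
 * exceed `budgetBytes`. At least one item is always taken so that the caller
 * makes progress, even if that single item is larger than the budget.
 */
function takeWithinBudget<T>(
  all: T[],
  startIndex: number,
  budgetBytes: number = Math.floor(MAX_RESPONSE_BYTES * 0.9) // assumed headroom
): Chunk<T> {
  const items: T[] = [];
  let used = 2; // the enclosing "[]"
  let i = startIndex;
  for (; i < all.length; i++) {
    const size = utf8Length(JSON.stringify(all[i])) + 1; // +1 for a comma
    if (items.length > 0 && used + size > budgetBytes) break;
    items.push(all[i]);
    used += size;
  }
  return { items, nextIndex: i < all.length ? i : null };
}
```

The frontend would call the resolver again with `nextIndex` until it comes back as `null`. The estimate slightly over-counts separators, which errs on the safe side.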

SushilBhattachan · June 1, 2023, 7:30pm

@HeyJoe We are trying to call the ChatGPT API and it’s timing out, as ChatGPT takes longer than 25 seconds in many cases. So we are stuck now, and we have already invested a lot in developing our Forge app. However, without a reliable way of getting an answer it will not work. So what suggestion do you have to resolve this issue? You guys want us to use Forge and now I am totally stuck.

HeyJoe · June 2, 2023, 6:43am

Hi @SushilBhattachan

If there is no way to call the ChatGPT API in an asynchronous way that lets you poll/wait for a response, I would suggest either:

Assess if it’s possible to call the API from within a client browser context (eg. using Custom UI) so that you can wait longer for the response than 25 seconds.
Assess if it’s possible to build an external service or host that can proxy the query to the API and the response for your Forge app. This way, the external service can wait longer than 25 seconds and then call back into the Forge app (eg. via a web trigger) when the API response is ready.

I appreciate that there are no ideal solutions for this problem at the moment, and while I hope we can raise the 25s limit in the future, there’s nothing I can do to raise it right now.

I hope this is helpful.

klaussner · June 2, 2023, 9:25am

@HeyJoe

By storing intermediate results in Forge storage, have you run into any read/write API limits on the storage API?

We didn’t run into rate limiting issues with this approach (we are using the delayInSeconds option of the event queue’s push function to reduce the risk of being rate limited), but the maximum limit of 20 items per query is too low to retrieve lots of items for aggregation. To work around this limitation, we are storing as many intermediate results as possible in every storage item. For each function invocation, we read an item, add new results, and write it back to storage until the limit of 128 KB is reached.
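A minimal sketch of the packing approach described above, assuming a simple key-value interface. `KeyValueStore`, `appendResults` and the `prefix:index` key scheme are hypothetical stand-ins, not the Forge storage API (which is asynchronous); only the 128 KB item limit comes from the description.

```typescript
const ITEM_LIMIT_BYTES = 128 * 1024; // per-item limit mentioned above

// Hypothetical synchronous key-value interface standing in for app storage.
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// Counts the UTF-8 bytes of a string without relying on platform APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte character
      i++;
    } else bytes += 3;
  }
  return bytes;
}

/**
 * Appends results to the item at `prefix:chunkIndex`, rolling over to a new
 * item whenever the serialized array would exceed `limitBytes`. Returns the
 * index of the last item written so the next invocation can continue there.
 * A single result larger than the limit is still stored alone in its own item.
 */
function appendResults(
  store: KeyValueStore,
  prefix: string,
  chunkIndex: number,
  results: unknown[],
  limitBytes: number = ITEM_LIMIT_BYTES
): number {
  let index = chunkIndex;
  const raw = store.get(prefix + ":" + index);
  let chunk: unknown[] = raw === undefined ? [] : JSON.parse(raw);
  for (const result of results) {
    // Re-serializing per result is quadratic, but keeps the sketch simple.
    const candidate = JSON.stringify(chunk.concat([result]));
    if (utf8Length(candidate) > limitBytes && chunk.length > 0) {
      store.set(prefix + ":" + index, JSON.stringify(chunk)); // item is full
      index++;
      chunk = [result];
    } else {
      chunk.push(result);
    }
  }
  store.set(prefix + ":" + index, JSON.stringify(chunk));
  return index;
}
```

Because the keys are numbered consecutively, the aggregation step can read items `prefix:0` up to the last returned index directly instead of paging through query results.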

Another challenge is to delete the intermediate results when they are no longer needed because there’s currently no function to delete multiple items at once by key prefix.

I think we could simplify our implementation if we could share data between function invocations more easily, for example with a cache that is deleted automatically when the event queue is empty or with a higher payload limit (we would need a few megabytes to make sure that the app works for Confluence sites with a lot of content).

We may not be able to support background jobs with no time limits easily (due to our current reliance on Lambda for invocations - which can go up to 15mins). What kind of increase in the timeout limit would substantially simplify things for you? Does 1min make a difference? 5min?

We would probably still need to store intermediate results in storage because some of our customers have so much content that 15 minutes wouldn’t be enough to analyze all of it in one invocation. But a runtime limit of 5 minutes or more would still help because it would reduce the invocation overhead and the number of storage reads and writes. It would also make it much less likely to hit the cyclic invocation limit.

SushilBhattachan · June 2, 2023, 1:53pm

@HeyJoe Thanks for your response. I understand.

So you mean we have to create a UI outside of the Forge framework?
I was trying this, but how would the external API connect back to the Jira API, and what would the authentication method be? Is there any document that explains that? Primarily I need to get the JSON content back from ChatGPT (or another external API) and update a Jira issue.

I saw a note saying that Atlassian Connect can co-exist within a Forge app, so that I can utilize Connect to call the API. Is there any documentation that explains how I can have a Connect component within a Forge app and make a call with it?

Thanks
Sushil

SushilBhattachan · June 5, 2023, 3:23am

@HeyJoe I could not find good documentation about web triggers. Is there any good documentation about web triggers that you know of?

Thanks

remie · June 5, 2023, 8:12am

Hi @HeyJoe! My apologies for the delay.

Personally I think the RFC format is very much suitable for gathering thoughts. I would even say that one of the downsides of the current RFCs posted is that they do not offer alternative solutions, or allow us to catch a glimpse of the chain of thought that led to the proposed solution. One of my personal mantras is that you should always ask for feedback on the problem, not the solution.

Even so, I’m happy to provide you the context about what we would need with regard to long running tasks if we were to migrate to Forge.

I think our Version & Component Sync (VCS) app is the best example for this.

The app heavily depends on GCP PubSub and Cloud Scheduler. For the synchronisation process, it listens to both product events as well as a scheduled task (which runs once every hour per configured source project). We currently have 380+ scheduled tasks configured in Cloud Scheduler.

Upon receiving either a product event (version) or a scheduled task trigger (which also goes through PubSub) the app retrieves all target projects linked to the source and emits a new PubSub event for each linked project.

Each PubSub event handler for the synchronisation of target projects retrieves a list of versions/components from the Jira API from both Source and Target projects and compares them. If there are any changes detected from the source project they are propagated to the target project.
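The comparison step can be sketched roughly as follows. This is an illustration only, not the app's actual code: the `Version` shape, matching by name, and the `planSync` helper are all assumptions, and deletions are deliberately left out.

```typescript
// Hypothetical, simplified version shape (real Jira versions have more fields).
interface Version {
  name: string;
  released: boolean;
}

interface SyncPlan {
  create: Version[]; // present in source, missing in target
  update: Version[]; // present in both, but the target is out of date
}

/**
 * One-way comparison: only differences found in the source are propagated.
 * Versions that exist only in the target are left untouched.
 */
function planSync(source: Version[], target: Version[]): SyncPlan {
  const targetByName = new Map<string, Version>();
  for (const version of target) targetByName.set(version.name, version);

  const plan: SyncPlan = { create: [], update: [] };
  for (const version of source) {
    const existing = targetByName.get(version.name);
    if (existing === undefined) plan.create.push(version);
    else if (existing.released !== version.released) plan.update.push(version);
  }
  return plan;
}
```

Each handler would then apply `plan.create` and `plan.update` to its target project, so an unchanged target results in no write requests at all.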

We are currently at ~2.5M invocations per month. Each function has 512MB memory. Some of these functions have peaks of ~500 active instances (simultaneous). Most of these operations remain within the boundary of the 9 minute runtime limit of GCP Cloud Functions, but we are increasingly getting reports of timeouts and are currently looking into optimising that. It relies heavily on caching (Redis) to avoid Jira API rate limiting, but nonetheless we’ve had ~350M read requests to our Firestore database in the past month.

Anyway, long story short: being able to migrate VCS to Forge would require:

Support for PubSub & Scheduler
Support for caching
Support for massive read operations on Forge Storage API
Long running tasks for up to 9 minutes (or preferably longer)
Enough memory for each function
The ability to scale the number of active instances

I can imagine that maybe Forge is not a suitable platform for this, and we should wait for Forge Remote in order to be able to migrate. Looking forward to hearing your thoughts on what Forge will be able to do for us in terms of quotas & limits.

Cheers,

Remie