Forge storage outage - impact

Hi Atlassian,
Can we talk about the outage that was just disclosed by Atlassian (Atlassian Developer Status - Forge Storage delete API was not removing data)? Twelve hours with no notification to vendors or customers, and then just a "Yep, there was a problem, but it was fixed".

How did this issue happen? How was it recovered? What data did Atlassian delete themselves? Were affected partners consulted?

What happens if a user deleted data using the API (which would then fail), causing the vendor's app to update the data instead of deleting it? Was this data then deleted by Atlassian?

If you want us to trust Forge, we need all data-related outages to be documented as if it were Atlassian's own data. (That's what our customers expect of us.)

11 Likes

I would not have realized it, thanks Daniel.

I have a Forge Jira app with paying customers and did not get any notification from the developer console. I guess that's because it failed silently …

I can just add my 5 cents to Daniel’s statement:

  • Please communicate clearly when something is broken.
  • And just make it stable enough that we can all rely on Forge, since you communicate it as the future framework for cloud apps.

I will need to adjust my automated tests to detect such stuff ^^

At the same time (midnight Feb 5), we also noticed a difference in how the Storage API's query method works. Specifically, the return value of nextCursor when all the data has been returned seems to have changed for a while, then went back to normal. We tested in several instances and saw this behavior.

The nextCursor change caused our Marketplace app to send 50+ emails every hour for every reminder. It may have caused our app to be uninstalled or disabled on our customers' Jira instances to stop the mail bombardment.
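The failure mode described above can be made concrete with a small sketch. The mock `queryPage` function below stands in for a Forge storage query call (it is not the real API); the point is the loop's termination check, which treats undefined, null, and an empty string all as "no more pages", so a change in the end-of-results sentinel cannot cause an infinite loop:

```javascript
// Mock of paged results: the last page's nextCursor is "" (the behaviour
// reported above) rather than undefined. Page names are illustrative.
const pages = {
  start: { results: ["reminder-1", "reminder-2"], nextCursor: "p2" },
  p2: { results: ["reminder-3"], nextCursor: "" },
};

// Stand-in for a storage query call; not the real Forge API.
async function queryPage(cursor) {
  return pages[cursor ?? "start"];
}

async function fetchAll() {
  const all = [];
  let cursor;
  do {
    const page = await queryPage(cursor);
    all.push(...page.results);
    cursor = page.nextCursor;
  } while (cursor); // falsy check: undefined, null and "" all end the loop
  return all;
}
```

A loop written as `while (cursor !== undefined)` would have spun forever once the sentinel changed to `""`; the falsy check is robust to both behaviours.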

Hi @danielwester, thanks for raising this.

We understand this issue is concerning and very important to you. We wanted to provide an update today on the key aspects; however, we are committed to addressing all the questions on this thread as soon as possible.

As part of Forge’s preparation to support Data Residency, Forge hosted storage has been undergoing a platform and data migration. During this migration, dual-writes are enabled between the old and new data stores. However, an unexpected condition meant that deletes were not successfully applied in the new datastore, resulting in data not being deleted as expected.

We want to reassure you that as a result of this incident there was no data loss; however, there may be data inconsistency. We did not delete any data other than what apps directly requested through the storage delete API. No data deletion has been performed to remediate the issue: we did not retroactively apply the deletes we failed to process.

To resolve this issue, Forge storage has been moved back to using the old data store as the source of truth. We did not consult affected partners, as we did not take any data modification actions.

Here are a few scenarios to help you understand the potential impact on your app during the incident:

  1. Delete, followed by a get, followed by no data updates (by app or end-user): this may have intermittently shown inconsistent data.
  2. Delete, followed by a get, followed by an update (by app or end-user) of the in-place data: if the in-place data was updated, the last update wins.
  3. Delete, followed by a get, followed by an update to another value (using the retrieved data): derived data may be inaccurate.
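A minimal simulation may help illustrate the failure mode described above: dual writes to an old and a new store, where deletes silently fail in the new store that reads are served from. The class and store names here are illustrative, not Forge internals:

```javascript
// Sketch of the dual-write bug: writes go to both stores, but deletes are
// only applied to the old store, so reads (served from the new store)
// return stale, "deleted" data.
class DualWriteStore {
  constructor() {
    this.oldStore = new Map(); // source of truth after the rollback
    this.newStore = new Map(); // reads were served from here during migration
  }
  set(key, value) {
    this.oldStore.set(key, value);
    this.newStore.set(key, value);
  }
  delete(key) {
    this.oldStore.delete(key);
    // Bug: the delete is never applied to the new store.
  }
  get(key) {
    return this.newStore.get(key); // stale read after a delete
  }
}

const store = new DualWriteStore();
store.set("task-1", { status: "open" });
store.delete("task-1");
// Scenario 1 above: a get after a delete still returns the old value
// instead of undefined.
const stale = store.get("task-1");
```

Scenarios 2 and 3 follow from the same stale read: any update made on top of `stale` re-persists or derives from data the app believed it had deleted.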

We will be providing a public post-incident review with more details on our analysis of the conditions leading to the incident and the actions we will take to prevent it from happening again.

Any further steps of this migration are on hold. We will only resume the migration once we are confident no further issues will arise, and we will communicate the resumption via the Forge changelog.

4 Likes

Thank you for this. My apologies for not responding sooner (I’ve been bed-ridden the past week with a cold/flu thing), and I’m only now comfortable letting myself onto CDAC again since I don’t have any cold meds in my system…

First off - I’m not trying to point at any individual (I think there is a larger failure at Atlassian going on here, as Atlassian tries to grasp the significant effort involved in operating a platform for external parties versus for itself). I do appreciate your very clear and explanatory post.

Here’s a few scenarios to understand the potential impact on your app during the incident:

  1. Delete, followed by a get, followed by no data updates (by app or end-user): this may have intermittently shown inconsistent data.
  2. Delete, followed by a get, followed by an update (by app or end-user) of the in-place data: if the in-place data was updated, the last update wins.
  3. Delete, followed by a get, followed by an update to another value (using the retrieved data): derived data may be inaccurate.

Based on that - if an admin of our app (the controller of the data, in our view) had deleted somebody’s accountId or other PII references from a large JSON blob (containing others’ accountIds), that is #3. This means we are now not removing the accountId and have violated the controller/processor relationship. So it’s imperative for us to know whether we had activity falling into those three areas.

In addition to this, a common use case for Forge storage (at least for us) is an index in key-value storage (i.e. with pointers to other documents). This could cause corruption across the board (i.e. missing docs, etc.), which nobody was aware could happen.
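For the index pattern described above, one way an affected app could assess damage is an integrity scan: look for dangling pointers (index entries whose document is gone) and orphaned documents (documents missing from the index). This is only a sketch with hypothetical names, assuming the index and documents have already been loaded into memory:

```javascript
// Sketch of an integrity check for a key-value index pattern.
// `index` is an array of document keys; `docs` maps keys to documents.
function checkIndexIntegrity(index, docs) {
  // Index entries pointing at documents that no longer exist.
  const dangling = index.filter((key) => !(key in docs));
  // Documents that exist but are no longer referenced by the index.
  const indexed = new Set(index);
  const orphaned = Object.keys(docs).filter((key) => !indexed.has(key));
  return { dangling, orphaned };
}

// Example: "doc-2" was deleted but its index entry remains; "doc-3" exists
// but was dropped from the index.
const index = ["doc-1", "doc-2"];
const docs = { "doc-1": {}, "doc-3": {} };
const report = checkIndexIntegrity(index, docs);
// report.dangling → ["doc-2"], report.orphaned → ["doc-3"]
```

In a real app, both `index` and `docs` would come from paged storage queries rather than in-memory objects, but the reconciliation logic is the same.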

It’s quite disappointing that there was no prior notification that this was happening (proper operational change management would alert the users of the system to changes happening and the potential risks). On top of that, there were 12 hours with no outreach to app vendors (making Forge more expensive to build on than Connect due to customer support costs).

I am starting to feel that we as a whole (Atlassian and ecosystem vendors) are really stepping backwards with these outages (it was going well :frowning: ).

So I’ll end with this question: does the operations team behind Forge have any idea of the number of apps impacted by this, and if so, have those app vendors been contacted?

5 Likes

Hi everyone, thanks for your patience.

Firstly, I hope you are feeling better and recovering well @danielwester.

We want to provide an update to everyone on the situation, and address the urgent questions.

Upon further analysis, we have confirmed the root cause of the incident to be a bug in the migration logic and not an issue with the underlying storage platform. We have fixed this bug, but are performing additional testing and preparation to ensure the next attempt at migration proceeds smoothly.

We are currently planning to resume the migration toward the end of the month. Once the date is confirmed, we will schedule a planned maintenance on the Developer Statuspage (an updated link will be shared on this thread). We will also post an announcement on the Developer Changelog the day before the migration proceeds.

Unfortunately, our analysis shows that a small number of delete requests failed for 4 apps across 6 customer installations. We have reached out to the affected apps, with details of the incident and scope of the impacted data.

@GirishReddy and @denizoguz, thank you for your reports of the changed behaviour with cursors. We have identified an unintended change in logic where the cursor value is returned as an empty string for the last page of results (previously it was undefined). We are working on this and plan to have it fixed early next week. We will provide a changelog announcement once the bug is fixed, and you can also follow the public Forge ticket.

As the migration resumes, there will be a change in the cursor string format. Cursors are derived from underlying storage identifiers and will change as part of the migration. As outlined in the documentation, cursors are not stable and should not be persisted; we will be updating the documentation to make this warning more prominent. While we have implemented logic that temporarily allows both old and new cursors to be used, to ensure consistency for Forge apps executing during the migration period, you should not persist cursors for any significant period of time.
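Until the fix lands, apps may see either end-of-results signal. A tiny normalisation helper, shown here as a sketch (illustrative only, not Forge API code), treats both the documented sentinel (undefined) and the unintended one (an empty string) as terminal:

```javascript
// Normalise the end-of-results signal across both behaviours:
// undefined (documented) and "" (the unintended change reported above).
// Only a non-empty string indicates that another page exists.
function hasMorePages(nextCursor) {
  return typeof nextCursor === "string" && nextCursor.length > 0;
}
```

Using a guard like this (rather than comparing against `undefined` directly), and re-querying from the start of each invocation instead of persisting cursors, keeps pagination safe across the cursor-format change.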

We take note of the concerns highlighted and acknowledge the impact this incident has had on our partners and customers. We are considering all of the points raised and evaluating the actions and improvements we will take to mitigate such incidents, and we will share these with you in a detailed post-incident review in the coming weeks. I hope this update has addressed some of the urgent questions and provided clarity on the immediate next steps.

6 Likes

Thank you Sushant for the details. Appreciate the detailed follow up.

2 Likes

The nextCursor change has also broken many of our apps. If Atlassian could stop making unannounced breaking changes, that would be great.

1 Like