Addressing the frequency of incidents

As announced at Partner Connect in April, we’re committed to improving incident management and ensuring Partners and developers have a level of platform stability and reliability they can trust. As a first step, we’re providing more transparency on how Atlassian manages incidents, which in turn helps your teams understand what you can expect from Atlassian when an incident occurs, for you and your customers.

We’re sharing our approach to how we define incidents at Atlassian. The definitions were recently revised with partner- and developer-specific use cases in mind, ensuring better management of incidents as they arise. We’ll continue to improve how developers and partners report incidents, how Atlassian communicates with partners during an incident, and how we reduce the frequency of incidents. Learn more: App incident severity levels

We understand that new status definitions, incident management transparency, and communication expectations address symptoms rather than the root cause. We want to assure you that, in parallel with these incident management improvements, we are also taking a closer look at the root causes of the recurring incidents across our product and Marketplace platforms. We will share more information over the coming weeks and months, and we are also exploring new ways to keep our developer and partner community informed about these efforts on an ongoing basis.

Continuous improvements

We’re improving how Atlassian developers and Partners find the information and tools they need to succeed in our ecosystem, as part of our efforts to improve Marketplace Platform Stability and Reliability. Partners can learn more about these plans on the Partner Portal.*

*Marketplace Partners with at least 1 paid-via-Atlassian app qualify for Partner Portal Resources. If you experience any issues getting access, and meet the eligibility criteria, please open a support ticket and our team will work to get things resolved as quickly as possible.


Can you share where I should be looking for the associated SLAs?

Hi @boris, sorry for the delayed response here.

We currently have the following set of SLAs (Service Level Agreements) and SLOs (Service Level Objectives):

  • SLAs:

    • When a major (Sev 2 or above) incident is raised, the on-call team must start investigating the incident within 15 mins of being paged.
    • During a major (Sev 2 or above) incident, external comms must be sent within 1h of the incident start time.
  • SLO: All major incidents should be resolved in less than 4h (TTR < 4h)

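To make these thresholds concrete, here’s a minimal sketch of how a single incident could be checked against them. This is not Atlassian tooling; the `Incident` fields and timestamps are hypothetical, chosen only to illustrate the 15-minute response SLA, the 1-hour comms SLA, and the 4-hour TTR SLO described above.

```python
# Minimal sketch (hypothetical fields/timestamps, not Atlassian's tooling)
# of checking one incident against the SLAs/SLOs listed above.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started_at: datetime         # when impact began
    paged_at: datetime           # when the on-call team was paged
    responded_at: datetime       # when investigation started
    external_comms_at: datetime  # when external comms were sent
    resolved_at: datetime        # when impact ended

def check_slas(inc: Incident) -> dict:
    return {
        # SLA: start investigating within 15 minutes of being paged
        "response_sla_met": inc.responded_at - inc.paged_at <= timedelta(minutes=15),
        # SLA: external comms within 1 hour of the incident start time
        "comms_sla_met": inc.external_comms_at - inc.started_at <= timedelta(hours=1),
        # SLO: resolved in under 4 hours (TTR < 4h)
        "ttr_slo_met": inc.resolved_at - inc.started_at < timedelta(hours=4),
    }

if __name__ == "__main__":
    inc = Incident(
        started_at=datetime(2023, 6, 1, 10, 0),
        paged_at=datetime(2023, 6, 1, 10, 20),
        responded_at=datetime(2023, 6, 1, 10, 30),
        external_comms_at=datetime(2023, 6, 1, 10, 45),
        resolved_at=datetime(2023, 6, 1, 13, 30),
    )
    print(check_slas(inc))
    # {'response_sla_met': True, 'comms_sla_met': True, 'ttr_slo_met': True}
```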
We will get those added to the main page shortly, thanks for calling it out!


Thanks for sharing. Given that the current incident monitoring process seems to be driven by vendors manually reporting issues, how does the timing from an issue starting, to manual detection, to acknowledgement of detection factor into these SLAs and SLOs?

Also, are there any retroactive numbers you can share if these SLAs and SLOs were applied to the last 6 months of incidents for example?

Hey Boris,

For TTR calculations, the “start time” of an incident is when the impact actually started (not when it was first reported or detected). Typically, it’s the timestamp of when a “bad” commit was deployed to prod.

Without going too deep on the details, I’d say that we are doing pretty well on the SLAs (acting on incidents within 15 mins of being paged + comms within 1h), but not so well on the TTR (we failed to meet the TTR SLO for 65% of incidents last quarter).
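As a rough illustration of that definition (the incidents and field names below are made up, not Atlassian data), TTR is measured from impact start to resolution, so any time lost to manual detection and reporting counts against the 4-hour budget:

```python
# Sketch of computing TTR and the SLO miss rate from the "impact started"
# timestamp (e.g. when a bad commit hit prod), not the detection/report time.
# Data and field names are hypothetical.
from datetime import datetime, timedelta

TTR_SLO = timedelta(hours=4)

incidents = [
    # (impact_started_at, detected_at, resolved_at) -- detection time is not
    # used for TTR, which is why late manual detection eats into the 4h budget.
    (datetime(2023, 5, 2, 9, 0),  datetime(2023, 5, 2, 11, 30), datetime(2023, 5, 2, 14, 0)),
    (datetime(2023, 5, 9, 16, 0), datetime(2023, 5, 9, 16, 20), datetime(2023, 5, 9, 18, 0)),
]

ttrs = [resolved - started for started, _detected, resolved in incidents]
missed = sum(ttr >= TTR_SLO for ttr in ttrs)
print(f"TTR SLO missed for {missed / len(ttrs):.0%} of incidents")
```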

All teams review these SLAs/SLOs quarterly, and when they are not met they must commit to remedial actions (this is a well-oiled internal process that has been in place for multiple years). We currently have a number of initiatives in flight aimed at reducing TTR across the board.

