ECOHELP incident management experience

BenRomberg · April 16, 2024, 11:21am

We’re a bit frustrated by ECOHELP incident management and wanted to see if:

Atlassian can evolve the ECOHELP incident management process,
Other partners have similar or different experiences.

In our experience, incidents raised via ECOHELP are now often prematurely downgraded to bugs, and we have to escalate an incident via other channels and collect vast amounts of evidence to finally convince Atlassian to upgrade the request to an incident again.

Examples:

Of the last 6 incidents that we raised via ECOHELP:

4 were downgraded to bug, and then upgraded to incident again (see above)
1 was permanently downgraded to a high priority bug and (thankfully) fixed within 6 weeks
…and only 1 was actually being treated as an incident, without a premature bug-downgrade

How it’s currently going:

We raise an incident.
Atlassian staff reviews the request and, most of the time, downgrades the incident to a bug almost immediately.
We escalate through other channels, describe severity and impact to customers, and motivate other partners to raise the same incident if they’re also affected.
Request is again upgraded to incident status.
Severity is unclear, status page entry sometimes takes more than a day to be created.

How we’d expect it to go:

We raise an incident.
Atlassian staff investigates and asks for more details, if necessary.
If in doubt whether it’s really an incident, Atlassian staff asks more clarifying questions and proactively investigates the impact.
After having obtained a complete picture about the situation, Atlassian may downgrade the request to a bug if no doubt is left that it’s not an incident. Atlassian explains why it doesn’t qualify as an incident, so we can learn before raising future incidents and hopefully agree with Atlassian’s assessment.
If it’s an incident, the severity is confirmed and a status page entry is created right away.

We don’t like to raise incidents with Atlassian, but if our customers are experiencing an outage or a severe impact on the usability of our apps, there’s often nothing else we can do. We would like these incidents to be investigated more seriously and accurately. Prematurely downgrading incidents leads to delays in reproducing and treating incidents with the necessary priority, and a frustrating experience on our side.

Please consider re-evaluating and evolving the ECOHELP incident process and reducing the number of falsely downgraded incidents. We’re 100% dependent on Atlassian as a platform provider and need it to be more reliable in case of an incident. Thanks!

marc · April 16, 2024, 12:53pm

Completely agree with @BenRomberg.

For us it is critical that Atlassian publishes incidents on the statuspage ASAP, even if an incident is resolved fast.
Our customers expect transparency from us, and also from Atlassian.

marc · April 16, 2024, 12:56pm

As a followup, I’d expect this to have been an incident: Table extensibility not working (we were not impacted, but impacted customers would have no idea why their apps are not working).

tobias.viehweger · April 16, 2024, 1:56pm

Can confirm that we have also seen this more often: support is very quick to rule out an incident (with dubious tests), and only through private channel escalation is it possible to get to the bottom of it.
Recent example (which Refined was also affected by, I think) was an authentication issue of the Forms API.
See ECOHELP-35481

Andrew_Golokha · April 17, 2024, 1:01am

Hi @BenRomberg and everyone.

Thank you for sharing your concerns and suggestions.

Ecosystem Support has evolved a lot since our transition to ECOHELP. This includes lots of internal process changes, up-skilling and growing the team so that we stay on top of your requests.

Incident response is certainly one of the areas where we’ve invested while improving the response times. Our experience with incidents in general is that the vast majority of them end up being “false positives” - i.e. not incidents at all. That isn’t to say these aren’t critical — they are! As such, we prioritize their investigation but they should have been submitted as a higher priority support ticket instead. I mention this because a large volume of “false positives” impacts the team capacity and delays “regular” non-incident ticket resolution since incidents are stop-the-world events for the support team.

At the same time, I understand the negative experience where an incident gets “downgraded” to a bug due to initial lack of details and/or our inability to reproduce the issue at scale, only to be moved back to an incident at a later stage after we gather additional evidence/details and/or make progress on the reproduction.

It looks like it’s a good time for us to revise the incident report submission form to make it a bit more prescriptive so that we have as much crucial data on hand to help us more quickly assess the scope of the issue.

Thanks again for your honest feedback.

remie · April 17, 2024, 5:39am

Although I understand this sentiment, I think Atlassian should make a distinction with regard to the incident reporter. Atlassian Marketplace Partners are your power users. It would be good to operate from the basic assumption that if we report a ticket, shit has hit the fan. Even if this proves to be a high priority ticket, we wouldn’t be contacting you unless we are seeing severe impact with our customers.

Atlassian Marketplace Partners do not create incidents lightly, and that should be reflected in the way Atlassian evaluates them.

BenRomberg · April 17, 2024, 8:38am

That might help in general, but probably not with the issue described here. Like I was trying to communicate earlier, sorry if I’m being a bit blunt here, but I believe the problem is the attitude towards incidents. It seems to me that the current policy is to “downgrade incidents at all costs if any reason not to treat it as an incident can be found”, where it should be more like “investigate and take the issue seriously until we’re certain beyond a doubt that it’s not an incident”. That’s probably not the intention, but it’s definitely how it comes across in our case.

There’s also a big lack of transparency regarding severity levels, time delays for status page updates and reasons why something gets downgraded from incident to bug.

I don’t think the proposed changes would address any of those.

If Atlassian continues to have an “Incident” support category, those should be treated as such. Sure, there might be false positives, but if valid incidents in general are not treated with the priority they deserve, you might as well remove this category and we’ll try to escalate through other channels from the very beginning of an incident.

If false positives are what’s holding you back from treating valid incidents with the proper priority, maybe the process should be enhanced to reduce the number of false positives. Improve the definition of an incident, show a warning banner on the request form, or demand more evidence before allowing us to create an incident request.

Andrew_Golokha · April 20, 2024, 12:29am

Absolutely! For every reported incident we use our internal tooling to assess impact which is directly tied to the number of apps/end-users that could possibly be affected. This means, however, that some reported “Incidents” are scoped to actually be critical “Bugs”. No less important for the reporter but simply not classified as an incident. We acknowledge the gravity of the situation every time we receive an incident report and it was partner feedback that exclusively led to the creation of this specific request type as it didn’t exist before.

This is definitely not the intention and we’ll scrutinize the examples you provided to see if and how we can improve. We file incidents to engineering teams every time we have sufficient evidence to inform and support incident severity levels. At times, some of the issues may be important/urgent in nature, but cannot be treated as incidents.

Moving incident reports to a “bug” issue type doesn’t always assume significant delays in resolution. We use priority-1 tickets (P1’s) for the “near-but-not-quite-the-incident” cases. They have tighter SLAs and often require multiple support engineers to collaborate on. Some of these cases may end up being escalated to engineering - others not. As for confirmed incidents - 100% of them are escalated to engineering but we cannot do this until we have all the details/justification including clear steps to reproduce.

100%! We’re already working on major tooling improvements for ticket submission, including incident reports. I’ll meet with the team to see if and how we can accelerate improvements for incident requests specifically.

In brief, I understand the current process isn’t ideal and I recognize the frustration you and others have experienced. It goes without saying that it’s a complex space with many variables from the tooling (submission form, internal processes & workflows) to engineer training and cross-team collaboration. We’ve invested a lot across all these dimensions over the past few years and are continuing to do so as we strive to deliver the level of service you expect and deserve.

I’ll comment again to announce any near-term improvements. In the meantime, if you feel we could have handled an individual case better, you can help us by completing the survey when your request is resolved. We read all of them!

As a closing reminder to the rest of our partners, please create “Bugs” and set the priority accordingly to match the urgency – the team is alerted separately when they are created. For issues with symptoms that indicate an incident, the more details we have on the submission, the faster we’ll be able to reproduce and get started on a resolution.

remie · April 20, 2024, 6:40am

This entire thread is about partners telling you that you do not properly acknowledge the gravity of the situation

Yeah, that’s not entirely how this went. You do know that most of us have been here for some time now, right? Incidents were first reported through us shouting in the void of CDAC, after which a specific CDAC category was made for incidents, after which we asked for Atlassian to not use CDAC as this was not the appropriate tool, after which Atlassian created a private incident report mechanism, after which we asked for a public means to track incidents, after which Atlassian created some weird public mirror (to which incidents are only published after partners explicitly ask for it or once Atlassian finally acknowledges this is an incident), after which this process completely failed and partners are still using back channels and public outcry on CDAC to get Atlassian to actually take them seriously.

tobias.viehweger · April 22, 2024, 8:16am

Thanks for taking the time to write all this up, Andrew! I’m a bit concerned, though, that this process seems a bit protective, esp. for something like ecosystem incidents, which by themselves are far better pre-scoped than normal customer requests. As the others mentioned, we have far better insight into what is supposed to happen, so we can make a much better decision on whether something is an incident or not. This is especially true given cases in the past (like ECOHELP-35481) where two partners (Refined and us) reported an incident at the same time, and the conclusion was very uninformed and “not an incident”, whereas a simple Slack message to engineering would’ve confirmed this actually being an incident.

I’m not convinced protecting your engineering silo is the smartest way to go about handling ecosystem incidents. Having a central internal Slack channel for all teams involved in maintaining APIs (there are not that many public APIs?) to be notified of reported incidents (at least passively) would at least create some transparency and a quick way to escalate to engineering? Having a semi-to-non-technical first-level incident support is rather unfortunate, as it creates long, long delays during which our mutual customers have to wait for a resolution, while we need to spend an insane amount of time proving to someone semi/non-technical that this is actually an incident (and by semi/non-technical I mean someone not understanding API token vs App auth token).

Thanks!
Tobias

ggachev · April 22, 2024, 8:17am

We had a bad experience with the support as well two weeks ago (ECOHELP-38731). Probably our fault we reported it as a bug and not as an incident but still…

On the Friday we received multiple customer reports that our app was not working. After a quick check we realized that the AUI link that we have been using for ages was no longer working (https://unpkg.com/@atlassian/aui@9.10/dist/aui/aui-prototyping.js). We asked whether they would fix the link or whether we were supposed to handle this ourselves.

Several hours later they provided an inadequate (to say the least) answer: please export the log files and send them to us. Afterwards, since we found out that the link works if we change aui@9.10 to aui@9.10.0, we decided to release a new version of our addon using the updated link. Later that day they made changes so that the old 9.10 link redirects to the new one, so we kind of wasted the day and we had complaints from our customers.
Not sure whose fault the change to the link was, but their response in the ticket was definitely not helpful.
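
For anyone who hits the same thing: a minimal sketch of the aui@9.10 to aui@9.10.0 change described above, assuming the breakage came from relying on the CDN to resolve a partial version. The `unpkg_url` helper and the `EXACT_VERSION` pattern are hypothetical names made up for illustration, not part of AUI or unpkg; the package name and file path are the ones from the link above.

```python
import re

# Hypothetical helper (not an AUI or unpkg API): builds a CDN URL and
# refuses partial versions such as "9.10" that depend on redirect resolution.
EXACT_VERSION = re.compile(r"\d+\.\d+\.\d+")


def unpkg_url(package: str, version: str, path: str) -> str:
    """Return an unpkg URL pinned to an exact major.minor.patch version."""
    if not EXACT_VERSION.fullmatch(version):
        raise ValueError(f"pin an exact version, got {version!r}")
    return f"https://unpkg.com/{package}@{version}/{path.lstrip('/')}"


url = unpkg_url("@atlassian/aui", "9.10.0", "dist/aui/aui-prototyping.js")
print(url)
```

Pinning the full version removes one moving part on the CDN side, at the cost of having to bump the version by hand for patch releases.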

BenRomberg · April 22, 2024, 12:06pm

In our experience, incidents are resolved in a matter of days (sometimes hours, which is good!), whereas bugs are resolved in a matter of weeks or months, if at all. “Doesn’t always” may be true, but I would call it a rare exception if there is truly no “significant delay” when comparing these two categories. So if we desperately need something fixed within a few days (worst case), because customers are blocked from their critical workflows, there is currently no other path than to raise an incident from our point of view.

To be honest, I never fill out the “How was our service for this request?” (if that’s what you mean) survey, since I cannot decide what I should rate. I’m mostly happy if the incident was resolved in a matter of 1-2 days, however the process of getting there is often nerve-wracking, like I already described. But will do so more often if it helps surface these issues for you.

IMHO you shouldn’t rely on reproduction in order to verify an incident. Very often, we notice an incident quite early, when only 3-10 customers are affected, because of a feature flag slowly rolling out or a weird corner case that slowly accumulates more errors. Reproduction has almost always failed for these incidents (and incidents were prematurely downgraded to bugs as a consequence), however we’re certain that something is broken for these customers that will likely spread to more customers as time goes on. Developer support should be able to verify/reproduce these issues on customers’ instances or forward them to someone who can. I know that developer support might not have permissions to do so, but incidents should not be downgraded solely because they cannot reproduce it on their own test instances.

Thanks, looking forward to the announcement. Also, it may be good to collect feedback through an RFC before implementing any changes.

Andrew_Golokha · April 23, 2024, 12:44am

Thanks again for all of your feedback. As I posted earlier, some of the improvements will be longer term, but where possible we’ll try to make incremental updates more quickly. In the meantime, yes, please fill out survey requests after your issue has been resolved. You are also more than welcome to DM me if you’d like to catch up via Zoom, as I’m always happy to hear your suggestions directly as well.

BenRomberg · May 16, 2024, 11:58am

Any updates yet?

chhantyal · May 16, 2024, 12:40pm

Adding our experience because it is relevant.

We raised an incident yesterday (3rd time, same incident coming back to life), and Atlassian didn’t even bother to update the status page. The issue was resolved after 11 hours → Jira Service Management

However, like Ben is saying, a lot of time was spent on deciding whether to accept it as an incident or not, even though it was the exact same incident as last week, for which we provided all the details.

scott.dudley · June 5, 2024, 7:54pm

Hi @Andrew_Golokha and all,

Our E2E tests started failing last night, with Confluence page creation erroring out with “We couldn’t find what you are looking for” on the last step. I diagnosed the problem as being “not my code” and I assumed that Atlassian would surely flag and fix the issue by the time I woke up.

That didn’t happen. So, I reported an incident today (ECOHELPPUB-134 / ECOHELP-41809).

This was immediately acknowledged as something being broken (“We are informed that we have reproduced the error and are checking it urgently”), only to be told later that “As this is not an incident, we are closing this ticket”. There was no question of scope or missing details or not having the correct instructions. I infer from the ticket history that it also spawned a HOT.

This is core Confluence functionality (page creation doesn’t work!) and it has been broken for more than 16 hours on multiple sites (and tests on our E2E instance are still failing as of this writing).

I infer that there aren’t thousands of other impacted users (perhaps because it might be limited to developer-cohort instances?). That notwithstanding, if this is not considered an incident (not even the least-severe type), then I don’t really know what is.

It was already slightly discouraging that my monitoring caught this and Atlassian’s did not, even hours later. As @BenRomberg noted, to help encourage power users such as Marketplace Vendors to submit incidents, I think more can be done.

Downgrading the incident to a bug makes it almost feel like incident statistics are being massaged. (If I didn’t bother to file a report, would anyone else have noticed before the change was distributed more widely in production and created a wider incident? I don’t think that benefits anyone.)

How are vendors supposed to know how many apps/end-users are impacted? Vendors do not have any visibility into this and this part of the definition is likely creating a misalignment of expectations.

The only alternative is the other ECOHELP “Report a bug” ticket queue, but that is viewed by many as where issues go to die. It often takes days or weeks to get a response to anything there. Maybe there needs to be a sev4 incident type? Or perhaps a “Report a critical bug” queue is needed?

Andrew_Golokha · June 14, 2024, 5:36pm

Hi @scott.dudley, thank you for your comment and apologies for the delayed reply.
In the past few weeks we’ve organized a few workshops to discuss the best ways to improve incident submission and handling experience based on the initial post in this thread. We’ve narrowed down the scope to several internal changes that are being worked on in addition to short and mid-term external improvements we will be introducing in the coming weeks and months.

Regarding the specific incident you reported, we’ll perform an internal review and provide feedback into how it was handled within the scope of current processes to learn if we could have handled it better with what we have in place.

To speak to your concern about incident statistics, I assure you that the transition to a “bug” is simply a consequence of our own internal support workflow (which we’re working on improving) and doesn’t reflect incident count/reporting from the development side.

Again, we value your continued feedback and partnership as we work towards better solutions together.