[Data Residency] What happens to the app realm migration when the app is down?

Considering the use case below…

  1. User triggers app realm migration.
  2. JIRA was able to send the scheduled migration hook to the app successfully.
  3. JIRA was also able to send the start migration hook to the app successfully, at which point the app starts the migration process on its side.
  4. During the migration operation, JIRA will constantly send the status migration hook to check the app’s migration status.
  5. BUT the app went down unexpectedly (server not running).

So my question is…

  1. Will JIRA trigger a rollback by sending the rollback migration hook, since it was not able to reach the app via the status hook?
  2. Suppose instead that the app goes down right after step 2, before JIRA sends the start hook. Will JIRA retry sending the start hook, or will it treat the app migration as a failure and trigger a rollback?

Thanks.

Reference: https://developer.atlassian.com/cloud/jira/platform/data-residency/#data-residency-and-atlassian-marketplace-apps

I also have some related questions:

Is there a retry policy for /rollback webhooks? In broader terms, what happens if rollback returns non-2xx response?

Also, what is expected behaviour if /status returns non-2xx response and/or response in non-compliant format? Will it also cause a rollback?

Is there maybe a possibility to consider retry for intermittent 502/503 responses?

Hi @StevenPila and @lexek-92, apologies for the delay in responding to this. I’ve answered each of your questions in-line below. Hopefully this helps.

This is correct: any non-2xx response to a status hook request during the migration will result in a rollback. Do you anticipate a high (or sufficient) frequency of such intermittent failures?

We will retry sending the start hook three times with a backoff delay. If we do not receive a 2xx response within three attempts, the migration will be marked for rollback and the app will receive a rollback hook.

This is similar to start: we will attempt to communicate the rollback to your app two times; however, the product will be brought back online upon the first request.

Any 2xx response is OK regardless of its format; however, any non-2xx response will result in a rollback.

Good question. This is something we’re open to exploring, given that a 502/503 response could reasonably warrant a retry. For now, however, these are treated like any other non-2xx response.

Hi @SeanBourke, thanks for your reply. Just to clarify…

Since you mentioned that /start and /rollback have a retry mechanism, does this mean:

  • The /schedule, /start, /commit and /rollback hooks will each be retried, up to a maximum of 3 attempts, if no 2xx response is received?
  • And for the /status hook, there’s no retry mechanism at all?

Thank you.

Hey @StevenPila,

Thanks for the reply. We’ve reassessed our endpoints and identified that they do not retry today. That said, we also believe it’s unreasonable for one failed response to fail the entire migration, particularly given its potential to increase the complexity or cost of migrations for you (if things can fail more easily or frequently) and to increase the likelihood of customers seeing failed migrations.

Given this, we’re assessing implementing retries for hooks where a single failure would otherwise fail the migration immediately. For example, this means that a /status hook would only move to rollback when it:

  1. Returns an explicit status of failed in a successful (2xx) response
  2. Fails to provide a 2xx response over at least three retries

The above would also apply in these circumstances.
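
As a rough illustration of the rule described above, here is a minimal Python sketch of how a sequence of /status attempts might be evaluated. The function name, the shape of an attempt, the retry limit of three, and the choice to reset the failure count after any 2xx response are all assumptions made for illustration; this is not Atlassian's implementation.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical model: each /status attempt is (http_status, parsed_json_body_or_None).
StatusAttempt = Tuple[int, Optional[dict]]

MAX_RETRIES = 3  # assumption, based on "at least three retries" in the post above


def should_rollback(attempts: Iterable[StatusAttempt],
                    max_retries: int = MAX_RETRIES) -> bool:
    """Return True if this sequence of attempts would move the migration to rollback."""
    consecutive_failures = 0
    for http_status, body in attempts:
        if 200 <= http_status < 300:
            # Rule 1: a successful request that explicitly reports "failed".
            if body is not None and body.get("status") == "failed":
                return True
            # Assumption: any 2xx response resets the failure count.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # Rule 2: the initial attempt and all retries returned non-2xx.
            if consecutive_failures > max_retries:
                return True
    return False
```

Under these assumptions, a single 503 followed by a 2xx response would not cause a rollback, while four consecutive non-2xx responses (one attempt plus three retries) would.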

Hi @SeanBourke,

Thank you for the confirmation. I agree, that would be really helpful to us. :slight_smile:

Also, I would like to clarify the error codes mentioned here: https://developer.atlassian.com/cloud/jira/platform/data-residency/#error-codes.
As stated there for the /status hook only, to send a predefined error code we just need to follow the format below.

{
  "status": "failed",
  "errorResponseCode": "E0004"
}

But the error codes section also says:

To help diagnose problems with migrations, we’re adding a set of standardised error codes that your app can report back to us with when you’re reporting back with a non-2xx to the hooks or the status retrieval.

Does this mean that if we want to send these predefined error codes for the other hooks (e.g., /schedule, /start, etc.), we can use the same format as above, a JSON body with errorResponseCode, along with a 2xx status code? Or should we use a non-2xx status code?

Also, by the way, is there a ticket or page covering your implementation of the hook retries, so we can watch for updates?

Thank you.

Hey @StevenPila,

My apologies for the delay in responding to this. A few quick updates:

  1. We’ve updated our documentation to be more informative regarding the use of error codes. These can be included in any non-2xx response to provide more details about what happened. Hopefully this helps - please let us know if it isn’t clear and we can explore further updates.

  2. Retries are coming soon; you can follow progress on AC-2572. Where we receive a non-2xx response that does not include an errorResponseCode, we’ll make a few more attempts before failing the migration. This means a decreased likelihood of transient errors resulting in a full migration failure for customers.
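
To make the two points above more concrete, here is a small Python sketch under stated assumptions. `error_response` builds a non-2xx reply that carries an errorResponseCode, and `is_retry_candidate` models the described rule that only non-2xx responses without an errorResponseCode get further attempts. Both function names, the default HTTP status of 500, and the body shape (a JSON object containing only errorResponseCode) are hypothetical; E0004 is simply the code used in the example earlier in this thread, and the documentation remains the authority on the exact format.

```python
import json
from typing import Optional, Tuple


def error_response(error_code: str, http_status: int = 500) -> Tuple[int, str]:
    """Build a (status, body) pair for a hook that wants to report an error code."""
    if 200 <= http_status < 300:
        # Per the post above, error codes accompany non-2xx responses.
        raise ValueError("error codes are meant for non-2xx responses")
    # Assumed body shape: a JSON object with a single errorResponseCode field.
    return http_status, json.dumps({"errorResponseCode": error_code})


def is_retry_candidate(http_status: int, body: Optional[str]) -> bool:
    """Model of the described rule: retry non-2xx responses lacking an error code."""
    if 200 <= http_status < 300:
        return False  # success, nothing to retry
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}  # e.g. an HTML error page from a proxy
    if not isinstance(payload, dict):
        payload = {}
    return "errorResponseCode" not in payload
```

In this model, a bare 502 or 503 from a proxy would be retried, while a deliberate failure that includes an errorResponseCode would not, which matches the intent of keeping transient errors from failing a migration.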

Hi @SeanBourke, no worries, thank you for the clarifications and for addressing our concern. I really appreciate it. :slight_smile:
Also, I just wanted to ask: is there a planned timeline for when the retry mechanism will be implemented?

Hey @StevenPila,

We’ve released some improvements to the retry mechanisms for the data residency migration hooks. The related documentation has been updated to reflect this improved behaviour.