What is the problem?
We are receiving multiple reports from Marketplace Partners that their apps are experiencing errors when loaded in the host Atlassian product. The issue appears intermittently, can affect all app iframes on the current page, and has been reported for multiple apps across multiple Atlassian products. We suspect that the browser or network environment may be involved (e.g. corporate firewalls, browser type/version, caching), as there are reports of only some browsers being affected. Additionally, some users have reported that disabling browser extensions or clearing the browser cache can sometimes correct the problem.
When the app fails to load, a generic error message is displayed. Sometimes, the loading problem disappears after a short period of time (e.g. 45 mins) and the app starts working again.
What have we investigated so far?
Our extensive and ongoing investigation has yet to uncover the root cause(s) of this issue. In the interest of transparency, I’ve compiled a summary of what we have investigated so far.
Hypothesis: The CDN hosting all.js is unreliable and causing load failures
Steps taken:
- Requested and reviewed all CloudFront and S3 metrics from AWS searching for related 4xx or 5xx errors (an unrelated improvement to CORS security was identified and implemented).
- Deployed a fallback, cross-region CDN to production, and implemented fallback code in our internal app with highest usage (Automation for Jira). There was no traffic to this CDN that could be traced to a failure of our primary CDN.
- Added Sentry logging to Automation for Jira to identify any CDN issues (none found across millions of views per day).
- Made small improvements to all.js, making it more robust and tolerant of environmental differences. These did not have any effect.
- Tested 10 popular ad-blocker extensions to determine if all.js was being blocked.
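For readers unfamiliar with the fallback approach mentioned above, it can be sketched roughly as follows. This is only an illustration, not the code deployed in Automation for Jira: the names `loadWithFallback` and `ScriptLoader` are hypothetical, and the loader is passed in as a parameter so the logic can be shown without any browser APIs.

```typescript
// Loads one script URL; resolves on success, rejects on failure.
// In a browser this would typically wrap a <script> element's
// load and error events (assumption, not shown here).
type ScriptLoader = (url: string) => Promise<void>;

interface LoadFailure {
  url: string;
  reason: string;
}

interface LoadResult {
  loadedFrom: string;      // the URL that finally loaded
  failures: LoadFailure[]; // earlier sources that failed, in the order tried
}

// Tries each source in order and stops at the first one that loads.
// Failures are collected so they can be sent to a logging service.
async function loadWithFallback(
  urls: string[],
  load: ScriptLoader
): Promise<LoadResult> {
  const failures: LoadFailure[] = [];
  for (const url of urls) {
    try {
      await load(url);
      return { loadedFrom: url, failures };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push({ url, reason });
    }
  }
  throw new Error(
    "All script sources failed: " + failures.map((f) => f.url).join(", ")
  );
}
```

A non-empty `failures` list on a successful load is the signal of interest here: it would mean the primary source failed and the fallback was actually needed.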
Hypothesis: Browser and/or iframe behaviour is causing app load failures
Steps taken:
- Reviewed all HAR files, console logs, and reports received for clues.
- Made loading of our front-end code slightly more robust in slow network environments.
- Attempted to reproduce the problem using various combinations of browsers, browser extensions, network conditions and network configurations (e.g. VPN).
- Implemented long-running automated tests to try to capture diagnostics of the problem. The problem could not be reproduced.
- Reviewed changes to the Cloud platform by Atlassian, looking for indications of a recent regression.
- Reviewed all Jira platform changes for the month of August (when the issue was first reported), looking for indications of a recent regression.
- Reviewed open Chromium bugs for indications of a related bug or behaviour change.
Hypothesis: postMessage calls between the iframe and the parent window are being blocked
Steps taken:
- Added additional logging to verify that iframe messages are being sent. Very few instances of message failure were identified. Slow network or compute may cause a timeout, but not at a magnitude relevant to this investigation.
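One way such logging could work is to tag each message with an id, record when it was sent, and flag any message that is not acknowledged within a timeout. The sketch below illustrates that idea only; it is not the instrumentation we added, and `MessageTracker` and its method names are hypothetical. Timestamps are passed in explicitly so the logic does not depend on a browser.

```typescript
// Tracks messages sent to an app iframe until the other side acknowledges them.
class MessageTracker {
  private readonly timeoutMs: number;
  private pending: { id: string; sentAt: number }[] = [];

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call right after postMessage(...) for a message tagged with `id`.
  recordSent(id: string, now: number): void {
    this.pending.push({ id, sentAt: now });
  }

  // Call from the "message" event handler when a reply for `id` arrives.
  // Returns false if the id is unknown or was already acknowledged.
  recordAck(id: string): boolean {
    const before = this.pending.length;
    this.pending = this.pending.filter((m) => m.id !== id);
    return this.pending.length < before;
  }

  // Ids of messages still unacknowledged after the timeout.
  // These are the candidates worth logging as possible message failures.
  timedOut(now: number): string[] {
    return this.pending
      .filter((m) => now - m.sentAt > this.timeoutMs)
      .map((m) => m.id);
  }
}
```

A slow network or a busy CPU can delay an acknowledgement past the timeout, so a flagged id is a hint to investigate rather than proof that a message was blocked.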
Hypothesis: Browser behaviour is caching app iframes incorrectly
Steps taken:
- Left sessions with Connect apps loaded open in Chrome, Firefox, and Safari, covering Bitbucket (Pipelines), a Jira issue view with a sidebar glance open, and a general page app view. Waited until the JWT had expired, then tried interacting with the app. Left the session open overnight, closed and restored the browser session, then tried interacting with the app. The problem could not be reproduced.
- While we have not yet reproduced the problem, our hypothesis that browser caching could be the culprit remains open.
Next steps and a call for help
As we have exhausted several avenues of investigation so far, the biggest blocker to solving this issue is the lack of a consistently reproducible environment. Therefore, we are asking for your help in gathering information that will help with the next phase of investigation. You can help us in the following ways:
- If you are able to reproduce the problem, please submit diagnostic information (such as a HAR file and console logs) via this Google Form: http://go.atlassian.com/connect-app-load-failure-survey.
- If you have performed your own internal investigations of the issue, please share any details with us via a reply on this thread, or by emailing me at jclark at atlassian dot com.
- If you have access to a consistently reproducible environment, please reply on this thread so that we can arrange a time to troubleshoot the issue together in real time.
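If you capture a HAR file and want to look through it yourself before submitting, a small script along these lines may help. It is a sketch only: it assumes the standard HAR layout (`log.entries`, each with `request.url` and `response.status`), treats a status of 0 or anything from 400 upwards as a failure, and the function name `findFailedRequests` is hypothetical. Browsers commonly record a status of 0 when a request was blocked or never received a response.

```typescript
// The subset of the HAR format used by this sketch.
interface HarEntry {
  startedDateTime: string; // ISO 8601 timestamp of the request start
  time: number;            // total elapsed time of the request in ms
  request: { method: string; url: string };
  response: { status: number; statusText: string };
}

interface Har {
  log: { entries: HarEntry[] };
}

interface FailedRequest {
  url: string;
  status: number;
  startedDateTime: string;
  durationMs: number;
}

// Lists requests that failed (status 0 or >= 400).
// `urlFragment` optionally narrows the search, for example to "all.js".
function findFailedRequests(har: Har, urlFragment: string = ""): FailedRequest[] {
  return har.log.entries
    .filter((e) => e.request.url.indexOf(urlFragment) !== -1)
    .filter((e) => e.response.status === 0 || e.response.status >= 400)
    .map((e) => ({
      url: e.request.url,
      status: e.response.status,
      startedDateTime: e.startedDateTime,
      durationMs: e.time,
    }));
}
```

The input would be the parsed JSON content of the HAR file. Please still attach the full HAR file to the form, since successful requests and their timings can be just as informative as the failures.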
Our commitment to solving this issue
We know that this problem is causing pain for Marketplace Partners and for customers. I wish I had good news to share about what the problem is, but so far the root cause of the problem has eluded our most experienced Atlassian Connect and app iframe experts.
We’re committed to working on this issue until it is solved, and we will pursue all avenues of investigation as they are uncovered. Thank you for your patience and understanding.