An update on our investigation to solve intermittent iframe loading problems in Connect apps

What is the problem?

We are receiving multiple reports from Marketplace Partners that their apps are experiencing errors when loaded in the host Atlassian product. This issue appears intermittently and can affect all app iframes on the current page. We know that this issue affects multiple apps across multiple Atlassian products. We suspect the browser environment may be involved in the problem (e.g. corporate firewalls, browser type/version, caching, etc.) as there are reports of only some browsers being affected. Additionally, some users have reported that disabling browser extensions or clearing browser caches can sometimes correct the problem.

When the app fails to load, a generic error message is displayed. Sometimes, the loading problem disappears after a short period of time (e.g. 45 mins) and the app starts working again.

What have we investigated so far?

Our extensive and ongoing investigation has yet to uncover the root cause(s) of this issue. I’ve compiled a summary of our investigation so far to improve our transparency.

Hypothesis: The CDN hosting all.js is unreliable and causing load failures.

Steps taken:

  • Requested and reviewed all CloudFront and S3 metrics from AWS searching for related 4xx or 5xx errors (an unrelated improvement to CORS security was identified and implemented).
  • Deployed a fallback, cross-region CDN to production, and implemented fallback code in our internal app with highest usage (Automation for Jira). There was no traffic to this CDN that could be traced to a failure of our primary CDN.
  • Added Sentry logging to Automation for Jira to identify any CDN issues (none found across millions of views per day).
  • Made small improvements to all.js, making it more robust and tolerant of environmental differences, which did not have any effect.
  • Tested 10 popular ad-blocker extensions to determine if all.js was being blocked.
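
As a rough illustration of the fallback approach described above, the sketch below tries a list of script URLs in order and reports which one loaded. It is not the code deployed in Automation for Jira; `loadFirstAvailable`, `injectScript`, and both example URLs are hypothetical names chosen for this sketch.

```typescript
// Hypothetical loader type: resolves once the script at `url` has loaded,
// rejects if it could not be loaded.
type ScriptLoader = (url: string) => Promise<void>;

// Try each URL in order and resolve with the first one that loads.
// If every URL fails, reject with a summary of all failures.
async function loadFirstAvailable(urls: string[], load: ScriptLoader): Promise<string> {
  const failures: string[] = [];
  for (const url of urls) {
    try {
      await load(url);
      return url;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${url}: ${reason}`);
    }
  }
  throw new Error(`No script source could be loaded (${failures.join("; ")})`);
}

// Browser-only loader (assumption: runs in a page with a DOM).
// It injects a <script> tag and settles on its load or error event.
function injectScript(url: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const doc = (globalThis as any).document;
    const script = doc.createElement("script");
    script.src = url;
    script.onload = () => resolve();
    script.onerror = () => reject(new Error("script failed to load"));
    doc.head.appendChild(script);
  });
}

// Example call in a browser (both URLs are placeholders, not real CDN hosts):
// loadFirstAvailable(
//   ["https://primary.example/all.js", "https://fallback.example/all.js"],
//   injectScript,
// ).then((used) => console.log("all.js loaded from", used));
```

Passing the loader in as a parameter keeps the ordering logic separate from the DOM, so it can be exercised outside a browser. Logging the URL that finally loaded is what would show whether the fallback was ever needed.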

Hypothesis: Browser and/or iframe behaviour is causing app load failures.

Steps taken:

  • Reviewed all HAR files, console logs, and reports received for clues.
  • Made loading of our front-end code slightly more robust in slow network environments.
  • Attempted to reproduce the problem using various combinations of browsers, network conditions, browser extensions and network configurations (e.g. VPN).
  • Implemented long-running automated tests to try to capture diagnostics of the problem. The problem was not reproducible.
  • Reviewed changes to the Cloud platform by Atlassian, looking for indications of a recent regression.
  • Reviewed all Jira platform changes for the month of August (when the issue was first reported), looking for indications of a recent regression.
  • Reviewed open Chromium bugs for indications of a related bug or behaviour change.

Hypothesis: postMessage calls between the iframe and the parent window are being blocked.

Steps taken:

  • Added additional logging to verify that iframe messages are being sent. Very few instances of message failure were identified. Slow network or compute may cause a timeout, but not at a magnitude relevant to this investigation.
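
One way to picture this kind of logging: record when each message is sent, clear the record when a reply arrives, and report whatever is still outstanding after a timeout. The following is only a sketch; `MessageTracker` and its method names are hypothetical, and it does not reflect how all.js tracks messages internally.

```typescript
// Tracks messages that have been sent but not yet answered.
class MessageTracker {
  // Maps a message id to the time (in ms) at which it was sent.
  private pending: Record<string, number> = {};

  // Call when a message is posted to the other window.
  recordSent(id: string, sentAtMs: number): void {
    this.pending[id] = sentAtMs;
  }

  // Call when the matching reply is received.
  recordReply(id: string): void {
    delete this.pending[id];
  }

  // Ids of messages that have waited longer than `timeoutMs` without a reply.
  overdue(nowMs: number, timeoutMs: number): string[] {
    return Object.keys(this.pending).filter(
      (id) => nowMs - this.pending[id] > timeoutMs,
    );
  }
}

// Example: one message answered, one still outstanding after 5 seconds.
const tracker = new MessageTracker();
tracker.recordSent("init", 0);
tracker.recordSent("resize", 1000);
tracker.recordReply("init");
const stillWaiting = tracker.overdue(6000, 4000); // only "resize" is overdue
```

In a browser, `recordSent` would sit next to the `window.parent.postMessage(...)` call and `recordReply` inside a `message` event listener. The overdue list could then be sent to whatever error reporting the app already uses.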

Hypothesis: Browser behaviour is caching app iframes incorrectly.

Steps taken:

  • Left sessions with Connect apps loaded in Chrome, Firefox, and Safari for Bitbucket (Pipelines), a Jira issue view with a sidebar glance open, and a general page app view. We waited until the JWT had expired and tried interacting with the app. We also left the session open overnight, closed and restored the browser session, and tried interacting with the app again. The problem could not be reproduced.
  • While we have not yet reproduced the problem, our hypothesis that browser caching could be the culprit is still open. References:

Next steps and a call for help

As we have exhausted several avenues of investigation so far, the biggest blocker to solving this issue is the lack of a consistently reproducible environment. Therefore, we are asking for your help in gathering information that will help with the next phase of investigation. You can help us in the following ways:

  1. If you are able to reproduce the problem, please submit diagnostic information (such as a HAR file and console logs) via this Google Form: http://go.atlassian.com/connect-app-load-failure-survey.
  2. If you have performed your own internal investigations of the issue, please share any details with us via a reply on this thread, or by emailing me at jclark at atlassian dot com.
  3. If you have access to a consistently reproducible environment, please reply on this thread, so that we can arrange a time to troubleshoot the issue together in real-time.

Our commitment to solving this issue

We know that this problem is causing pain for Marketplace Partners and for customers. I wish I had good news to share about what the problem is, but so far the root cause of the problem has eluded our most experienced Atlassian Connect and app iframe experts.

We’re committed to working on this issue until it is solved, and we will pursue all avenues of investigation as they are uncovered. Thank you for your patience and understanding.

Thanks @HeyJoe, appreciate the writeup!

Do you already have access to the support requests raised by Easy Agile team members or do we need to share them with you?

Thanks,
Nick Muldoon, Easy Agile

Yes, just let me know the issue keys or the reporter email address(es) and I can look them up from there.

I don’t know if this is related or not, but I have been getting some AP errors for some time now, where AP is defined (so all.js is loaded as I am checking for AP to load before accessing it), but is missing some functions/properties. Unfortunately, I am unable to replicate this and this is not consistent. They are reported as first seen on Nov 10.

Some errors reported by Sentry:
AP.request is not a function
AP.events is undefined
AP.cookie is undefined

I have just modified our Sentry reporting to include all AP keys in that case. In these cases, AP has only the following keys:

["_xdm","parentTargets","_data","_hostOrigin","_top","_host","_topHost","_initTimeout","_initReceived","_initCheck","_isKeyDownBound","_eventHandlers","_pendingCallbacks","_keyListeners","_version","_apiTampered","_isSubIframe","_onConfirmedFns","_promise","_messageHandlers","resize","container","size","registerAny","register","_hostModules","defineGlobal","defineModule","subCreate","Dialog","define","require","Meta","meta","localUrl","_util"]

while normally it should return

["_xdm","parentTargets","_data","_hostOrigin","_top","_host","_topHost","_initTimeout","_initReceived","_initCheck","_isKeyDownBound","_eventHandlers","_pendingCallbacks","_keyListeners","_version","_apiTampered","_isSubIframe","_onConfirmedFns","_promise","request","messages","flag","dialog","inlineDialog","env","events","_analytics","scrollPosition","dropdown","host","cookie","history","navigator","user","context","jira","dropdownList","_messageHandlers","resize","container","size","registerAny","register","_hostModules","defineGlobal","defineModule","subCreate","Dialog","define","require","Meta","meta","localUrl","_util","getUser","getCurrentUser","getTimeZone","getLocale","getLocation","sizeToParent"]
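
A check like the one described above could be sketched as follows. It takes a list of members the app relies on and returns those that are absent from the given object, which could then be attached to the error report. `missingMembers` and `EXPECTED_AP_MEMBERS` are hypothetical names, and the member list is only a small subset of the healthy key list above.

```typescript
// Members this (hypothetical) app relies on; all of them appear in the
// healthy key list above and are absent from the incomplete one.
const EXPECTED_AP_MEMBERS: string[] = ["request", "events", "cookie", "context"];

// Return the expected member names that are not present on `ap`.
// A missing `ap` (null or undefined) counts as every member missing.
function missingMembers(ap: object | null | undefined, expected: string[]): string[] {
  if (ap === null || ap === undefined) {
    return expected.slice();
  }
  return expected.filter((name) => !(name in ap));
}

// Example with a stand-in object shaped like the incomplete AP above.
const incompleteAp = { _xdm: {}, resize: () => undefined, size: () => undefined };
const missing = missingMembers(incompleteAp, EXPECTED_AP_MEMBERS);
// `missing` holds all four expected names, because none exist on incompleteAp.
```

Running the check once all.js reports as loaded, and again shortly afterwards, might show whether the members never arrive or only arrive late.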

Is the URL to this fallback URL public?

@BobBergman - it would have been public at the time of the test, but I don’t think we kept it live long-term.

I don’t know if this is related or not, but I have been getting some AP errors for some time now, where AP is defined (so all.js is loaded as I am checking for AP to load before accessing it), but is missing some functions/properties

Hi @RaimisJ - I haven’t heard anyone else reporting these symptoms, so I also don’t know if this is related. It would be interesting to hear if anyone else is observing the same behaviour.

@RaimisJ Thanks for the info. If an app or page includes all.js, and is loaded in the browser outside of a Jira / Confluence iframe, then AP will be missing some functions / properties as you describe.

We’ve seen this in the past for scenarios such as:

  • Automated testing of an app (without Jira / Confluence)
  • Pages served by the app that are not intended to be iframes (e.g. external links)

If you, or anyone else, could confirm you’re seeing this error on a customer instance, inside an Atlassian iframe, that would be a significant clue.

@dboyd, I can confirm that these errors occur on pages loaded in an iframe, since the query string contains all the parameters that a usual customer instance sends.

This is not related to the main report and the issues we have been experiencing.

The main problem is that loading add-on iframe times out and add-on code is not executed at all.
When it happens, all apps/iframes are affected.

More information: Apps fail to load due to timeouts

We’re seeing the same behavior as @RaimisJ in some cases (though not all).

Late to the party I know (holidays etc).

We’ve had an influx of these in the past couple of months. So far for us the issue has been:

  1. Corporate firewalls
  2. Content filtering extensions
  3. Objects/methods on AP missing (i.e. AP.context).
  4. Other (we suspect #1 but they’re able to access the page directly which then causes other issues).

What’s bothersome about #1 and #2 is that they started recently. It’s not until we start looking at HAR files that we’re able to see things (which, by the way, are really difficult for us to get from customers who just uninstall us; it sure would be nice if the error messaging were updated).

#1 is really difficult as well, since for large companies any changes to the firewall have to go through approvals.

We had one case where the end user had to allowlist the atlassian.net domain (not ours) to get things working for #2. (I suspect that #1 is related to this.)

User here. We have iframe errors filling our logs. The one thing we changed recently is that we added JSD, so our server now runs both Jira Software and Jira Service Desk. The errors happen when we enable the qTest add-on.

base-url/browse/JSD-1 throws an error
base-url/browse/JSW-1 mostly does not

As best we can tell, in our case all it does is fill up the logs, which is not welcome.

Not sure whether it’s another red herring or whether putting JSD in the mix makes it more reproducible for somebody.

Hello @MartinaRiedel,

Thanks for sharing this information. From the level of detail you have provided, it’s hard to tell whether or not the symptoms you are experiencing are part of this problem or something different.

If you are able to capture additional information about the error, such as detailed log messages or a HAR file capture of the problem, please submit them via the survey link at the top of this post - http://go.atlassian.com/connect-app-load-failure-survey

Thanks,
Joe.