An update on our investigation to solve intermittent iframe loading problems in Connect apps

HeyJoe · December 21, 2020, 3:14am

What is the problem?

We are receiving multiple reports from Marketplace Partners that their apps are experiencing errors when loaded in the host Atlassian product. This issue appears intermittently and can affect all app iframes on the current page. We know that this issue affects multiple apps across multiple Atlassian products. We suspect that the browser or network environment may be involved in the problem (e.g. corporate firewalls, browser type/version, caching, etc.), as there are reports of only some browsers being affected. Additionally, some users have reported that disabling browser extensions or clearing browser caches can sometimes correct the problem.

When the app fails to load, a generic error message is displayed. Sometimes, the loading problem disappears after a short period of time (e.g. 45 mins) and the app starts working again.

What have we investigated so far?

Our extensive and on-going investigation has yet to uncover the root cause(s) of this issue. I’ve compiled a summary of our investigation so far to improve our transparency.

Hypothesis: The CDN hosting all.js is unreliable and causing load failures

Steps taken:

Requested and reviewed all CloudFront and S3 metrics from AWS searching for related 4xx or 5xx errors (an unrelated improvement to CORS security was identified and implemented).
Deployed a fallback, cross-region CDN to production, and implemented fallback code in our internal app with highest usage (Automation for Jira). There was no traffic to this CDN that could be traced to a failure of our primary CDN.
Added Sentry logging to Automation for Jira to identify any CDN issues (none found across millions of views per day).
Made small improvements to all.js, making it more robust and tolerant of environmental differences, which did not have any effect.
Tested 10 popular ad-blocker extensions to determine if all.js was being blocked.
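
For readers who want to try a similar fallback in their own app, here is a minimal sketch of the idea. It is not the code deployed in Automation for Jira: `loadWithFallback`, `LoadFn` and the callback shape are hypothetical names chosen for illustration, and the actual script URLs are left to the caller. The loader is passed in as a function so the retry logic can be exercised without a browser.

```typescript
// A loader takes a URL and reports success or failure through callbacks.
type LoadFn = (url: string, onLoad: () => void, onError: () => void) => void;

// Tries each URL in order. Calls `done` with the URL that loaded
// (or null if every URL failed) and the list of URLs that failed.
function loadWithFallback(
  urls: string[],
  load: LoadFn,
  done: (loadedUrl: string | null, failedUrls: string[]) => void
): void {
  const failed: string[] = [];
  const attempt = (index: number): void => {
    if (index >= urls.length) {
      done(null, failed);
      return;
    }
    const url = urls[index];
    load(
      url,
      () => done(url, failed),
      () => {
        failed.push(url); // keep a record so the failure can be logged
        attempt(index + 1);
      }
    );
  };
  attempt(0);
}

// In a browser, `load` could be backed by a <script> element, for example:
//   const loadScript: LoadFn = (url, onLoad, onError) => {
//     const s = document.createElement("script");
//     s.src = url;
//     s.onload = onLoad;
//     s.onerror = onError;
//     document.head.appendChild(s);
//   };
```

Reporting the `failedUrls` list to an error tracker is what would show whether the primary CDN is ever the cause of a load failure.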

Hypothesis: Browser and/or iframe behaviour is causing app load failures.

Steps taken:

Reviewed all HAR files, console logs, and reports received for clues.
Made loading of our front-end code slightly more robust in slow network environments.
Attempted to reproduce the problem using various combinations of browsers, network conditions, browser extensions and network configurations (e.g. VPN).
Implemented long-running automated tests to try to capture diagnostics of the problem. The problem was not reproducible.
Reviewed changes to the Cloud platform by Atlassian, looking for indications of a recent regression.
Reviewed all Jira platform changes for the month of August (when issue was first reported), looking for indications of a recent regression.
Reviewed open Chromium bugs for indications of a related bug or behaviour change.

Hypothesis: postMessage calls between the iframe and the parent window are being blocked.

Steps taken:

Added additional logging to verify that iframe messages are being sent. Very few instances of message failure were identified. Slow network or compute may cause a timeout, but not at a magnitude relevant to this investigation.
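
App developers who want similar visibility could track messages on their side. The sketch below is only an illustration of the idea and is not the logging Atlassian added: `MessageTracker`, the message ids and the explicit acknowledgement step are all assumptions. Time is passed in as a parameter so the logic is easy to test.

```typescript
// Records when each message was sent and reports the ones that were
// never acknowledged within the timeout.
class MessageTracker {
  private pending: Record<string, number> = {}; // message id -> time sent (ms)
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call when a message is posted to the parent window.
  sent(id: string, nowMs: number): void {
    this.pending[id] = nowMs;
  }

  // Call when a reply for that message arrives.
  acknowledged(id: string): void {
    delete this.pending[id];
  }

  // Ids still unacknowledged after the timeout, suitable for logging.
  timedOut(nowMs: number): string[] {
    return Object.keys(this.pending).filter(
      (id) => nowMs - this.pending[id] >= this.timeoutMs
    );
  }
}
```

In a page, `sent` would be called next to each `postMessage` call with `Date.now()`, and `timedOut` polled periodically so that late or lost messages show up in logs.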

Hypothesis: Browser behaviour is caching app iframes incorrectly

Steps taken:

Left sessions with Connect apps loaded in Chrome, Firefox, and Safari for Bitbucket (Pipelines), a Jira issue view with a sidebar glance open, and a general page app view. Waited until the JWT had expired and tried interacting with the app. Left the session open overnight, closed and restored the browser session, and tried interacting with the app. The problem could not be reproduced.
While we have not yet reproduced the problem, our hypothesis that browser caching could be the culprit is still open. References:
- 324102 - chromium - An open-source project to help move the web forward. - Monorail
- 356558 - Firefox displays cached iFrame instead of the new iFrame SRC that is defined.

Next steps and a call for help

As we have exhausted several avenues of investigation so far, the biggest blocker to solving this issue is the lack of a consistently reproducible environment. Therefore, we are asking for your help in gathering information that will help with the next phase of investigation. You can help us in the following ways:

If you are able to reproduce the problem, please submit diagnostic information (such as a HAR file and console logs) via this Google Form: http://go.atlassian.com/connect-app-load-failure-survey.
If you have performed your own internal investigations of the issue, please share any details with us via a reply on this thread, or by emailing me at jclark at atlassian dot com.
If you have access to a consistently reproducible environment, please reply on this thread, so that we can arrange a time to troubleshoot the issue together in real-time.

Our commitment to solving this issue

We know that this problem is causing pain for Marketplace Partners and for customers. I wish I had good news to share about what the problem is, but so far the root cause of the problem has eluded our most experienced Atlassian Connect and app iframe experts.

We’re committed to working on this issue until it is solved, and we will pursue all avenues of investigation as they are uncovered. Thank you for your patience and understanding.

nick · December 21, 2020, 3:40am

Thanks @HeyJoe, appreciate the writeup!

Do you already have access to the support requests raised by Easy Agile team members or do we need to share them with you?

Thanks,
Nick Muldoon, Easy Agile

HeyJoe · December 21, 2020, 3:52am

Yes, just let me know the issue keys or the reporter email address(es) and I can look them up from there.

RaimisJ · December 21, 2020, 12:42pm

I don’t know if this is related or not, but I have been getting some AP errors for some time now, where AP is defined (so all.js is loaded as I am checking for AP to load before accessing it), but is missing some functions/properties. Unfortunately, I am unable to replicate this and this is not consistent. They are reported as first seen on Nov 10.

Some errors reported by Sentry:
AP.request is not a function
AP.events is undefined
AP.cookie is undefined

RaimisJ · December 21, 2020, 1:19pm

I have just modified Sentry to include all AP keys in that case. And I get that in these cases AP is returned with

["_xdm","parentTargets","_data","_hostOrigin","_top","_host","_topHost","_initTimeout","_initReceived","_initCheck","_isKeyDownBound","_eventHandlers","_pendingCallbacks","_keyListeners","_version","_apiTampered","_isSubIframe","_onConfirmedFns","_promise","_messageHandlers","resize","container","size","registerAny","register","_hostModules","defineGlobal","defineModule","subCreate","Dialog","define","require","Meta","meta","localUrl","_util"]

while normally it should return

["_xdm","parentTargets","_data","_hostOrigin","_top","_host","_topHost","_initTimeout","_initReceived","_initCheck","_isKeyDownBound","_eventHandlers","_pendingCallbacks","_keyListeners","_version","_apiTampered","_isSubIframe","_onConfirmedFns","_promise","request","messages","flag","dialog","inlineDialog","env","events","_analytics","scrollPosition","dropdown","host","cookie","history","navigator","user","context","jira","dropdownList","_messageHandlers","resize","container","size","registerAny","register","_hostModules","defineGlobal","defineModule","subCreate","Dialog","define","require","Meta","meta","localUrl","_util","getUser","getCurrentUser","getTimeZone","getLocale","getLocation","sizeToParent"]
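
Comparing the two lists, the failing case lacks the modules from request through dropdownList (including events and cookie) as well as the trailing helper functions such as getUser and sizeToParent. A comparison like this can be automated in the error report. The sketch below is just an illustration: `missingKeys` is a hypothetical helper, and the two arrays are abbreviated stand-ins for the full lists above.

```typescript
// Returns the keys present in `expected` but absent from `actual`,
// in the order they appear in `expected`.
function missingKeys(expected: string[], actual: string[]): string[] {
  return expected.filter((key) => actual.indexOf(key) === -1);
}

// Abbreviated versions of the two key lists, for illustration only.
const healthyKeys = ["_xdm", "request", "events", "cookie", "resize", "getUser"];
const brokenKeys = ["_xdm", "resize"];

const missing = missingKeys(healthyKeys, brokenKeys);
// missing is ["request", "events", "cookie", "getUser"]
```

In a real report, `actual` would come from `Object.keys(AP)` at the time of the error, and `expected` from a snapshot taken in a session that is known to work.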

BobBergman · December 21, 2020, 3:42pm

Is the URL to this fallback URL public?

HeyJoe · December 22, 2020, 2:13am

Is the URL to this fallback URL public?

@BobBergman - it would have been public at the time of the test, but I don’t think we kept it live long-term.

I don’t know if this is related or not, but I have been getting some AP errors for some time now, where AP is defined (so all.js is loaded as I am checking for AP to load before accessing it), but is missing some functions/properties

Hi @RaimisJ - I haven’t heard anyone else reporting these symptoms, so I also don’t know if this is related. It would be interesting to hear if anyone else is observing the same behaviour.

dboyd · December 22, 2020, 3:31am

@RaimisJ Thanks for the info. If an app or page includes all.js, and is loaded in the browser outside of a Jira / Confluence iframe, then AP will be missing some functions / properties as you describe.

We’ve seen this in the past for scenarios such as:

Automated testing of an app (without Jira / Confluence)
Pages served by the app that are not intended to be iframes (eg. external links)

If you, or anyone else, could confirm that you’re seeing this error on a customer instance, inside an Atlassian iframe, that would be a significant clue.
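
One way to tell these cases apart in error reports is to record whether the page was framed at the moment AP was found to be incomplete. The sketch below is an illustration only: `classifyApState`, `WindowLike` and the state labels are hypothetical names, not part of all.js, and the list of required modules is up to each app.

```typescript
// The two window properties needed to detect framing.
interface WindowLike {
  self: unknown;
  top: unknown;
}

type ApState = "ok" | "incomplete-outside-iframe" | "incomplete-inside-iframe";

// True when the page is running inside a frame rather than as a top-level page.
function isFramed(win: WindowLike): boolean {
  return win.self !== win.top;
}

// Classifies the AP object so that reports can separate the known
// "loaded outside an iframe" case from the one that would be a real clue.
function classifyApState(
  win: WindowLike,
  ap: Record<string, unknown>,
  required: string[]
): ApState {
  const complete = required.every((name) => typeof ap[name] !== "undefined");
  if (complete) {
    return "ok";
  }
  return isFramed(win) ? "incomplete-inside-iframe" : "incomplete-outside-iframe";
}

// In an app page this might be called as (AP is the global defined by all.js):
//   classifyApState(window, AP, ["request", "events", "cookie"]);
```

A report tagged "incomplete-inside-iframe" only shows that the page was framed by something, so it is still worth recording the parent origin or the query string alongside it.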

RaimisJ · December 22, 2020, 6:42am

@dboyd, I can confirm that these errors are for pages loaded in an iframe, since the query string contains all the parameters that a usual customer instance sends.

jack · December 22, 2020, 10:02am

This is not related to the main report and the issues we have been experiencing.

The main problem is that loading the add-on iframe times out and the add-on code is not executed at all.
When it happens, all apps/iframes are affected.

More information: Apps fail to load due to timeouts

danielwester · December 28, 2020, 9:57am

We’re seeing this same behavior in some cases (not all though) as @RaimisJ.

Late to the party I know (holidays etc).

We’ve had an influx of these in the past couple of months. So far for us the issue has been:

1. Corporate firewalls
2. Content filtering extensions
3. Objects/methods on AP missing (i.e. AP.context).
4. Other (we suspect #1, but they’re able to access the page directly, which then causes other issues).

What’s bothersome about #1 and #2 is that they started recently. It’s not until we start looking at HAR files that we’re able to see things (which, btw, is really difficult for us to get from customers who just uninstall us; it sure would be nice if the error messaging was updated, perhaps?).

#1 is really difficult as well, since for large companies any change to the firewall has to go through approvals.

We had one case where the end user had to allow list the atlassian.net domain (not ours) to get things working for #2. (I suspect that #1 is related to this).

MartinaRiedel · January 14, 2021, 1:56am

User here. We have iframe errors filling our logs. The one thing we changed recently is that we added JSD, so our server now runs Jira Software and Jira Service Desk. The errors happen when we enable the qTest add-on.

base-url/browse/JSD-1 throws an error
base-url/browse/JSW-1 mostly does not

Best we can tell, in our case all it does is fill up the logs, which is not welcome.

Not sure whether it’s another red herring or whether putting JSD in the mix makes it more reproducible for somebody.

HeyJoe · January 15, 2021, 2:33am

Hello @MartinaRiedel,

Thanks for sharing this information. From the level of detail you have provided, it’s hard to tell whether or not the symptoms you are experiencing are part of this problem or something different.

If you are able to capture additional information about the error, such as detailed log messages or a HAR file capture of the problem, please submit them via the survey link at the top of this post - http://go.atlassian.com/connect-app-load-failure-survey

Thanks,
Joe.

MartinaRiedel · January 28, 2021, 8:03pm

Hi Joe, it turned out to be something different.
Thanks a bunch for your reply.
Martina

nick · February 10, 2021, 10:56pm

Morning @HeyJoe,

Just wondering if you’ve been able to reliably reproduce? Anecdotally we have not seen this come through in support requests since Christmas - are you aware of any changes that may have improved the situation?

Thanks Joe, have a great day,
Nick Muldoon, Easy Agile

jack · February 11, 2021, 9:09am

I faced it on my own a few times in the recent weeks.

However, the number of support cases dropped significantly. I wonder if something was fixed or if customers just got used to living with it.

Cheers,
Jack

HeyJoe · February 17, 2021, 1:02pm

Hi everyone,

Thank you to those who uploaded further information to help us troubleshoot this problem.

@nick @jack beyond additional diagnostics/monitoring, we have not deployed any changes that would solve the problem. Given that we haven’t ruled out a browser issue at this stage, it’s possible that a recent change to Chrome/Firefox/etc. has improved the situation.

Unfortunately, we have not yet identified the root cause of the problem. Despite this, I want to provide an update on our on-going investigation.

As we analysed data from affected apps, we were able to identify cases where the problem was due to an issue in the app itself. This is a pertinent reminder that the generic nature of the error makes it hard to identify the source of the error. Cases where we identified bugs in the app were solved in collaboration with the app developer.

We re-investigated potential reliability issues with the CDN. We identified ~0.05% of requests failing after an HTTP 200 response with a ClientConnectionError. We do not believe this is the source of the problem, but are still investigating.

We also investigated the HAR files uploaded to us via the survey form. One file showed an interesting scenario of duplicate GET requests to the iframe. To dig into this further, we’re building further analytics to see if this is a regular occurrence. Further investigation of this clue is proceeding.
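
If you want to check your own HAR captures for the same pattern before uploading them, a small script along the following lines may help. This is a sketch, not the analytics we are building: `duplicateGets` is a hypothetical helper, the URLs are placeholders, and only the `log.entries[].request` fields from the HAR format are used.

```typescript
// Minimal subset of the HAR structure needed for this check.
interface HarEntry {
  request: { method: string; url: string };
}
interface Har {
  log: { entries: HarEntry[] };
}

// Returns every URL that was requested with GET more than once,
// mapped to the number of times it was requested.
function duplicateGets(har: Har): Record<string, number> {
  const counts: Record<string, number> = {};
  har.log.entries.forEach((entry) => {
    if (entry.request.method === "GET") {
      counts[entry.request.url] = (counts[entry.request.url] || 0) + 1;
    }
  });
  const duplicates: Record<string, number> = {};
  Object.keys(counts).forEach((url) => {
    if (counts[url] > 1) {
      duplicates[url] = counts[url];
    }
  });
  return duplicates;
}

// Placeholder data; a real capture would be read from a .har file with JSON.parse.
const sampleHar: Har = {
  log: {
    entries: [
      { request: { method: "GET", url: "https://app.example.com/panel" } },
      { request: { method: "GET", url: "https://app.example.com/panel" } },
      { request: { method: "GET", url: "https://cdn.example.com/all.js" } },
    ],
  },
};
const repeated = duplicateGets(sampleHar);
// repeated is { "https://app.example.com/panel": 2 }
```

Some repeated GETs are expected (for example, polling endpoints), so it is the iframe URL itself that is worth looking for in the result.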

If your app is still experiencing this problem, we encourage you to upload diagnostics for us to analyse at http://go.atlassian.com/connect-app-load-failure-survey. More information will increase the likelihood that we can identify a pattern of behaviour that leads to the root cause.

Thanks for your continued help in working to solve this problem for our shared customers.

Regards,
Joe Clark [Atlassian]

nick · February 18, 2021, 3:03am

Thanks for the update Joe, greatly appreciated.

james.dellow · February 18, 2021, 8:00pm

Late last year I reported an issue where links created by an app in Confluence break when the page hasn’t fully loaded (the user falls through to a JIRA page instead).
Could there be any relationship between that and this problem of intermittent iframe loading?
Support ticket reference is DEVHELP-5553

HeyJoe · February 21, 2021, 12:27pm

Hi @james.dellow,

Thanks for the extra info. I’ll get the team to take a look at DEVHELP-5553 and see if it’s related.

Thanks,
Joe.