How to parse Confluence storage format in JavaScript?

In my TypeScript app, I want to retrieve a content object from Confluence using the REST API, modify its body, and then save a new version of the object. What library do I use to parse the storage format? I need to do this in the browser for the frontend, in a jsdom environment for the frontend tests, and in a Node.js environment for the backend.

Usually I would use jQuery in the frontend and cheerio in the backend. The problem is that when I use these in HTML mode, CDATA sections (<![CDATA[...]]>) are not parsed properly, even though they are required inside <ac:plain-text-body> elements. On the other hand, when I use these in XML mode, parsing errors are thrown because HTML entities such as &ndash; are not defined in plain XML.

I found this comment explaining how to wrap storage format in an XML declaration so that XML parsers support it. After manually copying all the XML entities that are referenced in that declaration into the DOCTYPE, I had some success with jQuery in the browser. However, both jQuery under jsdom and cheerio still have trouble parsing the entities.

This can’t be such an uncommon thing to do. Does anyone have any advice?

It turns out that in the browser, storage format can be parsed in XML mode (for the CDATA sections and namespaces) with an XHTML DOCTYPE (for the entities). An example would look like this:

const storageFormatBody = '<span>&ouml;<![CDATA[ä]]></span>';
const el = new DOMParser().parseFromString(`<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xmlns:atlassian-content="http://atlassian.com/content" xmlns:ac="http://atlassian.com/content" xmlns:ri="http://atlassian.com/resource/identifier" xmlns:atlassian-template="http://atlassian.com/template" xmlns:at="http://atlassian.com/template"><body>${storageFormatBody}</body></html>`, 'application/xml').querySelector('body');

el is then a DOM element that contains the body. Instead of DOMParser, jQuery.parseXML can also be used. After making our modifications, we can get the storage format back using el.innerHTML. This approach has two problems though:

  • Dealing with XML namespaces is really painful and inconsistent among browsers. For example, in Chrome I would select a macro using const macro = el.querySelector('ac\\:structured-macro');, but then to select its parameters, I would have to use macro.querySelector('parameter') instead of macro.querySelector('ac\\:parameter'). Whether the prefix has to be present in a selector depends on the browser and on the context. jQuery also officially doesn’t support XML namespaces because of these inconsistencies.
  • This approach does not work in a plain Node.js environment because DOMParser is not available there. Usually when running frontend tests in Node.js, a browser environment is simulated using jsdom. The problem is that the above approach does not work in jsdom either: it throws an exception about an unknown XML entity (which I have reported here).
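As an illustration of the first point, one way to avoid prefix-dependent selectors is to skip querySelector and compare qualified tag names while walking the tree. This is only a sketch, not something I have tested across browsers: findByQualifiedName and ElementLike are hypothetical names, and it assumes that tagName returns the prefixed name (such as ac:parameter) for elements in a document parsed as XML.

```typescript
// Minimal structural type: only the parts of a DOM Element this sketch relies on.
// A real Element from an XML document should satisfy it (assumption, see above).
interface ElementLike {
  tagName: string;
  children: ArrayLike<ElementLike>;
}

// Hypothetical helper (not part of any library): collects all descendants of
// `root` whose qualified name, prefix included, equals `qualifiedName`.
function findByQualifiedName<T extends ElementLike>(root: T, qualifiedName: string): T[] {
  const matches: T[] = [];
  const visit = (node: ElementLike): void => {
    for (let i = 0; i < node.children.length; i++) {
      const child = node.children[i];
      if (child.tagName === qualifiedName) {
        matches.push(child as T);
      }
      // Keep descending so that nested macros are found as well.
      visit(child);
    }
  };
  visit(root);
  return matches;
}
```

With the el from above, this would be called as findByQualifiedName(el, 'ac:structured-macro'), and then again on each macro with 'ac:parameter'. The root element itself is never included in the result.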

In order to have consistent behaviour between different browsers and also between the frontend and the backend, I have decided to use a JavaScript-based XML parser even when running in the browser. For now the only one that I have found that supports this use case is cheerio. It might not be the strictest or most standards-compliant XML parser, but since we can rely on Confluence providing us with valid XML, I think it should be sufficient for this use case.

To support both HTML entities and CDATA sections, we have to use cheerio in a somewhat unusual way, as has been pointed out to me here:

import * as cheerio from 'cheerio';

const storageFormatBody = '<span>&ouml;<![CDATA[ä]]></span>';
const el = cheerio.load(`<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xmlns:atlassian-content="http://atlassian.com/content" xmlns:ac="http://atlassian.com/content" xmlns:ri="http://atlassian.com/resource/identifier" xmlns:atlassian-template="http://atlassian.com/template" xmlns:at="http://atlassian.com/template"><body>${storageFormatBody}</body></html>`, {
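    // xmlMode: false keeps HTML entities such as &ouml; working; the other two flags turn CDATA and self-closing tag support back on
    xml: { xmlMode: false, recognizeCDATA: true, recognizeSelfClosing: true }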
})('body');

el is now a cheerio object that can be modified using an API very similar to jQuery's. To get the updated storage format at the end, use el.html().
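Since the DOMParser and cheerio examples both use the same long wrapper string, it can be factored into a small helper. The following is a minimal sketch: wrapStorageFormat, XHTML_DOCTYPE and STORAGE_NAMESPACES are names I made up for illustration, while the DOCTYPE and the namespace URIs are copied from the examples above.

```typescript
// DOCTYPE copied from the examples above; it is what makes HTML entities resolvable.
const XHTML_DOCTYPE =
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';

// Namespace prefixes and URIs copied from the examples above, in the same order.
const STORAGE_NAMESPACES: Array<[string, string]> = [
  ['atlassian-content', 'http://atlassian.com/content'],
  ['ac', 'http://atlassian.com/content'],
  ['ri', 'http://atlassian.com/resource/identifier'],
  ['atlassian-template', 'http://atlassian.com/template'],
  ['at', 'http://atlassian.com/template'],
];

// Hypothetical helper: wraps a storage format fragment in an XHTML document
// that declares the Confluence namespaces, ready to be handed to a parser.
function wrapStorageFormat(storageFormatBody: string): string {
  const namespaceAttrs = STORAGE_NAMESPACES
    .map(([prefix, uri]) => `xmlns:${prefix}="${uri}"`)
    .join(' ');
  return `${XHTML_DOCTYPE}<html xmlns="http://www.w3.org/1999/xhtml" ${namespaceAttrs}><body>${storageFormatBody}</body></html>`;
}
```

The result can then be passed as the first argument to either parseFromString or cheerio.load, with the same options as shown above.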
