How to parse Confluence storage format in JavaScript?

candid · October 25, 2021, 9:58pm

In my TypeScript app, I want to retrieve a content object from Confluence using the REST API, modify its body, and then save a new version of the object. What library do I use to parse the storage format? I need to do this in the browser for the frontend, in a jsdom environment for the frontend tests, and in a Node.js environment for the backend.

Usually I would use jQuery in the frontend and cheerio in the backend. The problem is that in HTML mode, neither parses <![CDATA[ ... ]]> sections properly, and these are required inside <ac:plain-text-body> elements. In XML mode, on the other hand, parsing errors are thrown because entities such as &ndash; are not defined.

I found this comment explaining how to wrap storage format in an XML declaration so that XML parsers support it. After manually copying all the XML entities that are referenced in that declaration into the DOCTYPE, I had some success with jQuery in the browser. However, jQuery under jsdom and cheerio both still have trouble parsing the entities.
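
For reference, the wrapping approach could look roughly like the sketch below. This is only an illustration under assumptions: the helper name `wrapStorageFormat`, the `root` wrapper element, the namespace URIs, and the three-entity list are all hypothetical placeholders and are not taken from the linked comment or from Confluence itself. A real version would need the full set of HTML entities.

```typescript
// Hypothetical entity table: a real one would list every named HTML entity
// that can occur in storage format, not just these three.
const ENTITIES: { [name: string]: string } = {
  nbsp: "&#160;",
  ndash: "&#8211;",
  mdash: "&#8212;",
};

// Wraps a storage-format fragment in an XML declaration, an internal DOCTYPE
// that declares the named entities, and a single root element that binds the
// ac: and ri: prefixes. The namespace URIs are placeholders (assumption).
function wrapStorageFormat(fragment: string): string {
  const entityDecls = Object.keys(ENTITIES)
    .map((name) => `<!ENTITY ${name} "${ENTITIES[name]}">`)
    .join("\n  ");

  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<!DOCTYPE root [`,
    `  ${entityDecls}`,
    `]>`,
    `<root xmlns:ac="http://example.com/ac" xmlns:ri="http://example.com/ri">${fragment}</root>`,
  ].join("\n");
}
```

In the browser, the resulting string could then be handed to `new DOMParser().parseFromString(wrapped, "application/xml")`. Whether jsdom or cheerio expand the entities declared in the internal DOCTYPE is exactly the open question here.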

This can’t be such an uncommon thing to do. Does anyone have any advice?