Use REST API to get Confluence page with images

SteveJorgensen · May 17, 2022, 6:25pm

I have been having trouble trying to use the REST API to get a page from Confluence in a form that I can use to render a preview.

First, I tried to get the page in Word format, which actually returns MHTML with a .doc file extension. If that were a real Word document, I could use LibreOffice to convert it to PDF, but it’s not, so I can’t. I should be able to use wkhtmltopdf to convert it, but the only available builds of wkhtmltopdf fail to work for me for one reason or another. The build that recognizes the MHTML fails to work headlessly, so I’m stuck there.

My next thought was to get the page in export view format and use wkhtmltopdf on that, but then how do I get the images? I see that I can get a mediaToken in the API response from Confluence, but I can find no documentation on how to use that token once I have it.

I don’t want to pass the main API token as a command line argument to the wkhtmltopdf executable because that’s a security risk.

james.dellow · May 17, 2022, 10:46pm

This sounds like a very complicated way of doing what you want to do. Why can’t you just process the body.view?

SteveJorgensen · May 17, 2022, 10:53pm

I tried to explain that in my OP. That would work fine except, how do I get the embedded images while processing that?

I think I should be using the mediaToken to gain access to those, but I can’t find any documentation on how we’re supposed to use one after getting it. Presumably, there is an HTTP header that I should be putting it into?

james.dellow · May 18, 2022, 2:02am

I have no idea how to get the mediaToken - I believe that’s an internal Atlassian thing. I’ve worked around that by processing /child/attachment and matching the attachments to the file name in each URL that requires the mediaToken.
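
A minimal sketch of that matching step, in Python. It assumes each entry from /child/attachment carries a "title" (the file name) and a "_links.download" path, and that the last path segment of each image URL is the attachment's file name. Neither assumption is confirmed in this thread, and the function names are made up for illustration.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import unquote, urlsplit


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page body."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def filename_from_url(url):
    """Return the decoded last path segment, ignoring any query string."""
    return unquote(basename(urlsplit(url).path))


def match_images_to_attachments(html, attachments):
    """Map each image URL to the download path of the same-named attachment.

    `attachments` is assumed to be the "results" list of the
    /child/attachment response (hypothetical shape, see the note above).
    """
    collector = ImgSrcCollector()
    collector.feed(html)
    by_title = {a["title"]: a["_links"]["download"] for a in attachments}
    matches = {}
    for src in collector.sources:
        name = filename_from_url(src)
        if name in by_title:
            matches[src] = by_title[name]
    return matches
```

Images with no same-named attachment are simply left out of the result, so the caller can decide how to handle them.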

SteveJorgensen · May 18, 2022, 2:16am

Getting the token is easy. I’m just not sure how to use it. Being able to fetch the attachments is one thing. Rendering the page with those linked is another. If I use something like wkhtmltopdf, it’s going to try to follow the links. That’s why I was hoping I could just tell wkhtmltopdf to add a header with the mediaToken value in it, or something like that.

SteveJorgensen · May 28, 2022, 7:39am

I finally figured out a way to do what I want. I can use the API to get the content, including “body.export_view”, save the value out of the response data to an .html file, then use wkhtmltopdf to render that file, but there is a trick to making wkhtmltopdf render inline images correctly.
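
The fetch-and-save step might look roughly like this in Python. It is only a sketch: the /rest/api/content/{id} path, the expand query parameter, and the body.export_view.value layout of the response are assumptions about the Confluence REST API, and base_url, page_id, and the function names are placeholders.

```python
import json
import urllib.request


def fetch_content(base_url, page_id, auth_header):
    """Request a page with its export view expanded (assumed endpoint)."""
    url = f"{base_url}/rest/api/content/{page_id}?expand=body.export_view"
    request = urllib.request.Request(
        url,
        headers={"Authorization": auth_header, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def export_view_html(content):
    """Pull the rendered HTML out of the response data (assumed layout)."""
    return content["body"]["export_view"]["value"]


def save_export_view(content, path):
    """Write the export view HTML to a file for wkhtmltopdf to render."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(export_view_html(content))
```

Here auth_header is the complete value of the Authorization header, for example the basic auth string built from the username and API token.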

Instead of just telling wkhtmltopdf the username and password (which won’t propagate to the requests to get the image sources), compute the basic auth header from the username and API token (combine with a colon in between and then encode as base-64) and pass that using --custom-header Authorization 'Basic <encoded-auth>'. To make that propagate to the image source requests, also supply --custom-header-propagation.
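
For illustration, the header value can be computed in Python like this. The function name is made up for the example.

```python
import base64


def basic_auth_header(username, api_token):
    """Build the value of an HTTP Basic Authorization header."""
    # Join the username and token with a colon, then base-64 encode the result.
    raw = f"{username}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

The returned string is what goes after --custom-header Authorization on the wkhtmltopdf side.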

I’m not sure if it is necessary, but to be safe, I am also supplying --enable-external-links.

Finally, to keep from exposing the credentials in the command-line arguments of the running wkhtmltopdf process, use the --read-args-from-stdin option and pass all of the other arguments as a line of input to wkhtmltopdf through stdin instead of directly as command arguments.
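
Put together, the invocation could be sketched in Python as follows. The flags are the ones mentioned above. The function names are illustrative, and it is an assumption that wkhtmltopdf parses the shell-style quoting produced by shlex.join on the stdin line, so check that against your build.

```python
import shlex
import subprocess


def build_arg_line(html_path, pdf_path, auth_header):
    """Build the single line of arguments that wkhtmltopdf reads from stdin."""
    args = [
        "--custom-header", "Authorization", auth_header,
        "--custom-header-propagation",  # send the header with image requests too
        "--enable-external-links",
        html_path,
        pdf_path,
    ]
    # Assumption: shell-style quoting is understood on the stdin line.
    return shlex.join(args)


def render_pdf(html_path, pdf_path, auth_header):
    """Run wkhtmltopdf so the credentials never appear in its argument list."""
    line = build_arg_line(html_path, pdf_path, auth_header)
    subprocess.run(
        ["wkhtmltopdf", "--read-args-from-stdin"],
        input=line + "\n",
        text=True,
        check=True,
    )
```

Only --read-args-from-stdin shows up in the process list. The Authorization value travels through the pipe.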