504 Issue When Fetching Bulk Confluence Content

hassandewitt · January 13, 2023, 11:22pm

My team is using the wiki/rest/api/content?expand=body.view,history,version endpoint to fetch content from Confluence. We need all page content, including the creator, the editor, and the body. Our goal is to store this content in Elasticsearch alongside data from the other apps we use, such as Betterworks, Salesforce, and Slack. A search bar on the front end would then let users look up terms or people and get back a curated list of content from each app that contains the search term, sorted by relevance, which is why we’re using Elasticsearch.

The content endpoint is paginated and only allows us to request 100 records at a time, so our server uses a while loop to make subsequent requests with the next URL provided in the previous response. The issue is that each subsequent request takes longer to respond than the one before it, until a request takes so long that we receive a 504 gateway timeout.
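
For illustration, the loop described above might look roughly like this sketch. The `Page` shape (a `results` array plus `_links.base` and `_links.next`) and the injected `fetchPage` function are assumptions made for the example; in a real server `fetchPage` would wrap the HTTP client call.

```typescript
// Assumed response shape: a batch of records plus links to the next batch.
interface Page<T> {
  results: T[];
  _links: { base?: string; next?: string };
}

// Hypothetical fetcher; in practice this would wrap axios or fetch.
type FetchPage<T> = (url: string) => Promise<Page<T>>;

// Follows "next" links until none is returned or maxPages is reached.
async function fetchAllPages<T>(
  firstUrl: string,
  fetchPage: FetchPage<T>,
  maxPages: number = 1000
): Promise<T[]> {
  const all: T[] = [];
  let url: string | undefined = firstUrl;
  for (let i = 0; url !== undefined && i < maxPages; i++) {
    const page: Page<T> = await fetchPage(url);
    all.push(...page.results);
    const next: string | undefined = page._links.next;
    // The next link is assumed to be relative to _links.base.
    url = next ? (page._links.base ?? "") + next : undefined;
  }
  return all;
}
```

The `maxPages` guard is only there so that a misbehaving next link cannot keep the loop running forever.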

We decided to handle the error gracefully by adding an axios interceptor that, on a 504 failure, takes the URL and reduces the limit to 25. We have several further conditions in place that keep reducing the limit on repeated failures, eventually down to 1 record at a time. Once the requests have failed enough times that we are requesting only 1 record at a time, even those eventually fail.
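
A minimal sketch of the limit-reduction step is shown below. The quartering schedule (100, 25, 6, 1) and the function names are assumptions for illustration; the actual conditions in our interceptor are not reproduced here.

```typescript
// Returns a smaller page size to retry with, or null once the limit is already 1.
// Quartering (100 -> 25 -> 6 -> 1) is an assumed schedule, not the exact rules from the post.
function nextLimit(current: number): number | null {
  if (current <= 1) return null;
  return Math.max(1, Math.floor(current / 4));
}

// Rewrites the "limit" query parameter of a failed request URL.
// Returns null when there is nothing smaller left to try.
function withReducedLimit(failedUrl: string, defaultLimit: number = 100): string | null {
  const url = new URL(failedUrl);
  const raw = url.searchParams.get("limit");
  const parsed = raw === null ? NaN : parseInt(raw, 10);
  // Fall back to the assumed default when the URL carries no usable limit.
  const current = isNaN(parsed) ? defaultLimit : parsed;
  const reduced = nextLimit(current);
  if (reduced === null) return null;
  url.searchParams.set("limit", String(reduced));
  return url.toString();
}
```

In an axios response interceptor, a helper like this could be called with the failed request’s URL when `error.response?.status` is 504, retrying with the returned URL and giving up when it returns null. Note that re-serializing the query string may percent-encode the commas in `expand`, which should not change its meaning.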

At first I thought the problem might be our codebase and the number of requests we’re making, but then I made a single request for a single record in Postman, and it failed too. Here’s the URL I used: https://dialexa.atlassian.net/wiki/rest/api/content?next=true&expand=body.view,history,version&limit=1&start=4036. The limit is set to 1 and the start is 4036. It takes about 3 minutes, but eventually this fails with a 504. This seems like an issue with Confluence’s content endpoint and how the data is indexed. The timeout appears to depend on how high the start is set, along with the limit and what content is expanded.

Is this a known issue, or are there any workarounds?

Also, some of the content coming back is set as archived. Is there a way to ignore archived records?

Sorry for the long explanation and thanks in advance for your replies and support on the matter.

SilvreLestang · February 14, 2023, 9:11am

Hello @hassandewitt,

Your question is a few months old, but if you are still struggling with this, or if others find this question after encountering the same issue, the answer is to use the new Confluence v2 API, which is much more performant when paginating over a lot of content.

See Release of v2 Confluence REST API for Pages and Blogposts (Experimental) and Release of new endpoints for Confluence REST API v2

kashev · October 9, 2024, 8:24pm

I am getting similar results in 2024 using the V2 API for content versions:

https://${BASE_URL}.atlassian.net/wiki/api/v2/pages/${CONTENT_ID}/versions?body-format=atlas_doc_format&limit=10

We have a similar geometric backoff scheme and eventually get user-facing errors anyway. I’d be curious what we’re supposed to do when requesting even a single page version fails with a 504 gateway timeout.
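
For context, a geometric backoff schedule of the kind mentioned above might be sketched as follows. The base delay, growth factor, cap, and function names are all assumptions for illustration, not the values we actually use.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * factor^attempt, capped at maxMs.
// All default values here are illustrative assumptions.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  factor: number = 2,
  maxMs: number = 30000
): number {
  return Math.min(maxMs, baseMs * Math.pow(factor, attempt));
}

// Full schedule for a bounded number of retries; once it is exhausted the error surfaces.
function backoffSchedule(maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(backoffDelayMs(attempt));
  }
  return delays;
}
```

A schedule like this only spaces the retries out. It does not help when the same request times out on every attempt, which is the situation described here.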