My team is using the wiki/rest/api/content?expand=body.view,history,version endpoint to fetch content from Confluence. We need all page content, including the creator, the last editor, and the body. Our goal is to store this content in Elasticsearch alongside data from the other apps we use, such as Betterworks, Salesforce, and Slack. A search bar on the front end would then let users search for terms or people and get back a curated list of content from each app that contains the search term, sorted by relevance, which is why we’re using Elasticsearch.
The content endpoint is paginated by default for efficiency and only allows us to request 100 records at a time, so we use a while loop on our server to make subsequent requests using the next URL provided in the previous response. The issue is that each subsequent request seems to take longer to respond. Responses get slower and slower until a request takes too long and we receive a 504 gateway timeout.
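For anyone trying to reproduce this, here is a minimal Python sketch of the kind of pagination loop described above. It is not our actual server code. It assumes each response body carries a results list and a _links object whose next value is a path relative to _links.base, and that the last page has no next link. The names iter_content and fetch_page are made up for illustration.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_content(first_url: str, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every record by following the next link of each response.

    fetch_page is a hypothetical callable that takes a URL and returns
    the decoded JSON body of the response as a dict.
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("results", [])

        # Assumption: "next" is a relative path that must be joined to "base".
        links = page.get("_links", {})
        next_path = links.get("next")
        url = links.get("base", "") + next_path if next_path else None
```

With an HTTP client, fetch_page could be something like a function that performs an authenticated GET and returns the parsed JSON. The loop stops as soon as a response has no next link.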
We decided to handle the error gracefully by adding an interceptor which, upon a 504 failure, takes the URL and reduces the limit to 25. We have several conditions in place to handle repeated failures, which eventually set the limit to 1 record at a time. Once enough requests have failed that we are requesting only 1 record at a time, even that eventually fails.
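To illustrate the limit reduction, here is a rough Python sketch of what such a retry step could look like. It only rewrites the limit query parameter of the failed URL. The values 100, 25, and 1 come from the description above. The intermediate value 5 and the names LIMIT_STEPS and with_smaller_limit are assumptions made for the example, not our real conditions.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Descending page sizes to try after each 504.
# 100, 25 and 1 are from the post; 5 is a hypothetical intermediate step.
LIMIT_STEPS = [100, 25, 5, 1]


def with_smaller_limit(url: str) -> Optional[str]:
    """Return the same URL with the next smaller limit, or None if already at 1."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))

    current = int(query.get("limit", LIMIT_STEPS[0]))
    smaller = [step for step in LIMIT_STEPS if step < current]
    if not smaller:
        return None  # nothing left to reduce; the caller has to give up or skip

    query["limit"] = str(smaller[0])
    # Keep commas unescaped so expand=body.view,history,version stays readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))
```

An interceptor would call with_smaller_limit on the failed request’s URL after a 504 and retry with the result. A None return means the request already asked for a single record, which is the case that still fails for us.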
At first I thought the problem might be our codebase and the number of requests we’re making, but then I made a single request for a single record in Postman, and it failed too. Here’s the URL I used: https://dialexa.atlassian.net/wiki/rest/api/content?next=true&expand=body.view,history,version&limit=1&start=4036. The limit is set to 1 and the start point is 4036. It takes about 3 minutes, but eventually this fails with a 504. This seems like an issue with Confluence’s content endpoint and how the data is indexed. The problem appears to get worse the higher the start is set, in combination with the limit and which content is expanded.
Is this a known issue, and are there any workarounds?
Also, some of the content coming back is set as archived. Is there a way to ignore archived records?
Sorry for the long explanation and thanks in advance for your replies and support on the matter.