Finding a page by title

If you ever want to retrieve one or multiple pages by title using the Confluence REST API and thought that would be a matter of one simple call against the CQL or Content REST API, you have been fooled by a collection of bugs in the REST endpoints of Confluence Cloud and Server. In this post I am investigating how different types of page titles behave when used against the different endpoints and am suggesting a solution for app developers how to fetch pages by titles that also works for eccentric titles, as well as making some suggestions to Atlassians how to make our lives easy again.

Our app provides users the option to create pages. To avoid title conflicts, we are validating the title that they have typed in by querying Confluence for existing pages with the same title. As reported here, there is a bug in the Confluence REST API where page titles are handled as case-sensitive, even though they are not. As a workaround, we have to use the fuzzy search (using the ~ operator).

It seems that there have been some changes to the search API on Cloud, causing our app to crash due to a new type of error: com.atlassian.confluence.api.service.exceptions.BadRequestException: CQL was parsed but the search manager was unable to execute the search. CQL: 'title~\"\\\"\"'. Error message: com.atlassian.confluence.search.searchnsmarts.rest.SSSearchClientException: There was an illegal request passed to S&S Content Search API : HTTP/1.1 400.

Investigation

I have done some detailed investigation to find out which special characters in page titles behave in which way. I have created a bunch of test pages with a title like Test!page, where the ! is one of these characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and tried to find them using different ways. Note that there might be other special characters (non-printable or non-ASCII) that cause problems and that I haven’t tested yet. If you come across such a case, please let me know.

  • Using /rest/api/content?spaceKey=TEST&title=Test!page, I could find all of them correctly. On Cloud this is case-sensitive, while on Server it isn’t.
  • Using /rest/api/search?cql=title="Test!page", I had the following results:
    • For " (using ?cql=title="Test%5C"page"), Test^page and Test`page were returned on Cloud, but not Test"page. On Server, nothing was returned.
    • For \, it worked when using ?cql=title="Test%5C%5Cpage". When using ?cql=title="Test%5Cpage", it returned Test^page and Test`page on Cloud, but not Test\page, and nothing on Server.
    • For ^ and `, both Test^page and Test`page were returned. Escaping it with a backslash didn’t change anything.
    • All other characters found the right page
    • If a page title ends with \, I have to escape that as \\\", otherwise it fails with Escaped end quote in phrase. It does not make sense but it seems to work.
    • Note that on Cloud this is case-insensitive, while on Server it is case-sensitive.
  • Using /rest/api/search?cql=title~"Test%20page" found all pages except Test'page, Test.page, Test_page and Test:page
  • Using /rest/api/search?cql=title~"Test!page", I had the following results:
    • For !:<>=\ it returned no results
    • For " I received a BadRequestException (which makes sense, because it should be escaped)
    • For ()/[]^{}, I received an SSSearchClientException on Cloud. On Server, these returned all pages except Test'page, Test.page and Test:page.
    • For *?, it returned Test'page, Test.page, Test_page and Test:page
    • For +,-;@`|~, it returned all pages except Test'page, Test.page, Test_page and Test:page
    • For '._, it returned the right page
    • The special case title~"" or title~" " throws an Internal Server Error
  • Using /rest/api/search?cql=title~"Test%5C!page" (escaping the character with a \), I had the following results:
    • For ", it returned no results
    • For '.:_, it returned the right page
    • For !#$%&()*+,-/;<=>?@[\]^`{|}~, it returned all pages except Test'page, Test.page, Test_page and Test:page
    • When searching for pages that only contain such special characters, for example a page called :, Cloud returns all pages from the space, while Server returns no results.

So I have found out the following things about the internals of CQL:

  • Searching for a page by its title works, except when the title contains a ". But it is case-sensitive on Server.
  • In a fuzzy search, the characters * and ? have a special meaning as documented in the Confluence docs. They can be escaped using a \.
  • The characters '._: are seen as a regular character, while the characters !"#$%&()*+,-/;<=>?@[\]^`{|}~ are seen as word separators. When used in a fuzzy search query, the latter ones are equal to each other and to a space.
  • The characters !:<>=\"()/[]^{} have a special meaning and have to be escaped using a \. All other characters can be escaped but don’t have to.

Solution

I have come up with these methods to fetch pages by title that should work for any kind of title. Since I’m getting lost in all these rules myself, it could be that I’ve made a mistake somewhere.

Update: Check the comments below for a more convenient method to fetch pages by title.

Finding a single page by title

  • For Server: Use the /rest/api/content?spaceKey=SPACE&title=Page+title endpoint.
  • For Cloud:
    • If the page title doesn’t contain a ", use the /rest/api/search?cql=space="SPACE"+and+title="Page+title" endpoint. Escape \ as \\ in the beginning or middle of the title and as \\\" at the end of the title. Manually filter the results and only accept that with the right title (case-insensitive comparison). Make sure to quote the space key, as some space keys don’t work without quotes.
    • If the page title contains a " but no characters that are relevant for case sensitivity, use the /rest/api/content?spaceKey=SPACE&title=Page+title REST endpoint.
    • If the page title contains both " and characters that are relevant for case sensitivity, use /rest/api/search?cql=space="SPACE"+and+title~"Page+title" and look if any page has the title that you are looking for (case-insensitive comparison). In the page titles, replace the characters !<>=\"()/[]^{}*? with spaces and escape the character : as \:. Make sure to fetch all result pages, as some fuzzy searches might return a lot of results.

Finding multiple pages by title

  • For Cloud:
    • For all pages not containing a ", like this: title in ("Page title 1","Page title 2")
    • For all pages containing a ", like this: title~"Page title"), where you replace the characters !<>=\"()/[]^{} with spaces and escape the character : as \:.
    • For all pages containing a " and only consisting of the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, you have to fetch them using the /rest/api/content?spaceKey=SPACE&title=Page+title endpoint or by fetching all pages from the space (slow for large spaces) and finding your desired page in the list.
    • Always filter the results and only keep those whose title you are looking for (case-insensitive comparison)
  • For Server:
    • For titles that don’t contain any letters (so case-insensitivity doesn’t matter) or " quotes use CQL as in title in ("Page title 1","Page title 2")
    • For titles only consisting of the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and containing at least one ", you have to fetch them manually using the /rest/api/content?spaceKey=SPACE&title=Page+title API.
    • For all other pages, use the CQL like this: title~"Page+title". In the page titles, replace the characters !<>=\"()/[]^{}*? with spaces and escape the character : as \:. Filter the results and only use those that have a matching title (case-insensitive comparison).

To find multiple pages using CQL, some using exact matching and some using fuzzy matching, do your query like this: space="SPACE" and (title in ("Page title 1","Page title 2") or title~"Page title 3" or title~"Page title 4"). Make sure to put the space in quotes, as some space keys don’t work without quotes. Make sure to fetch all result pages, as some fuzzy searches might return a lot of results.

Example code

/**
 * Decide whether to fetch a page with a particular title using the Content API, CQL (exact match) or CQL (fuzzy match).
 * @param title The title of the page to fetch
 * @param forMultiple Whether we would like to fetch the page as part of a batch request trying to fetch multiple pages at once
 */
const getTitleMethod = (title, forMultiple) => {
    const isCased = title.toUpperCase() !== title.toLowerCase();

    if (!forMultiple && (isServer || !isCased)) {
        // When fetching a single item, the Content API can be used for all titles on Server and for titles without case-sensitive characters on Cloud
        return TITLE_METHOD.CONTENT;
    } else if ((isCloud || !isCased) && !title.includes('"')) {
        // CQL exact matching can be used for all titles on Cloud and and for uncased titles on Server, except for titles with a quotation mark in them
        return TITLE_METHOD.CQL;
    } else if (!title.match(/^[!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~]+$/)) {
        // Fuzzy matching can be used for all titles, except the ones that only consist of special characters
        return TITLE_METHOD.CQL_FUZZY;
    } else {
        // Pages that cannot be fetched through CQL have to be fetched using the content API, even when multiple pages should be fetched at once
        return TITLE_METHOD.CONTENT;
    }
};

/**
 * Put a string in quotes and escape some characters so that it can be used in CQL.
 */
export const quoteCql = (value) => (
    `"${value
        // Quotes should be escaped as \", but since there is no way to match quotes in CQL, let's remove them to avoid crashes
        .replace(/"/g, ' ')
        // Backslashes should be escaped as \\. At the end of the string, they need to be escaped as \\\"
        .replace(/\\/g, (m, o) => o === value.length - 1 ? '\\\\\\"' : '\\\\')
    }"`
);

/**
 * Put a string in quotes and escape some characters so that it can be used on the right side of a ~ fuzzy match operator in CQL.
 * @param value
 */
export const quoteFuzzyCql = (value) => (
    `"${value
        .replace(/[!<>=\\"()/[\]^{}*?]/g, ' ')
        .replace(/:/g, '\\:')
    }"`
);

/**
 * Create a CQL query to fetch the specified pages. Only works for titles that can be fetched through CQL, titles that have to be fetched through the Content API have to be fetched separately.
*/
export const makeCqlQuery = (spaceKey, titles) => (
    const parts = titles
        .filter((title) => getTitleMethod(title, true) === TITLE_METHOD.CQL_FUZZY)
        .map((title) => `title~${quoteFuzzyCql(title)}`);

    const byCql = titles.filter((title) => getTitleMethod(title, true) === TITLE_METHOD.CQL);
    if (byCql.length > 0) {
        parts.push(`title IN (${byCql.map((title) => quoteCql(title)).join(',')})`);
    }

    return `type="page" and space=${quoteCql(spaceKey) and (${parts.join(' OR ')})`;
);

What Atlassian can do

  • Announce publicly that the search API has changed on Cloud and that ()/[]^{} need to be escaped now in fuzzy search terms, otherwise the search crashes. It has not been like this before.
  • Make the /rest/api/content?spaceKey=SPACE&title=Page+title API case-insensitive on Cloud. It just does not make any sense that it is case-sensitive, when page titles are case insensitive.
  • Make the title CQL field case-insensitive on Server. (Note that I have tested on 7.0, it might have changed in more recent versions.)
  • Fix the handling of " in page titles in CQL on both Cloud and Server. Right now pages with a " in the title cannot be found using a title="Page title" query.
  • Fix the handling of \ at the end of a string in CQL on both Cloud and Server. Right now the string "\\" throws an error, while the string "\\\"" represents \, which does not make any sense.
  • Update the CQL documentation:
    • Mention that characters can be escaped using \.
    • Mention that !:<>=\"()/[]^{} have a special meaning in fuzzy search terms and need to be escaped.
    • Mention the bugs that will not be fixed.
21 Likes

Thanks for putting this together @candid!

There are similar issues with group names and the REST API: Confluence REST API and group name with "/" - #4 by marc and [CONFCLOUD-68869] group API method cannot fetch group names with a slash - Create and track feature requests for Atlassian products.

1 Like

Good work! I have already faced the same problem, but didn’t have enough time for this kind of research. So, thank you for this!

Hi all, there is a public ticket for this issue. The Confluence Cloud team is aware and working on it.
https://jira.atlassian.com/browse/CONFCLOUD-70810

2 Likes

Since I keep running into this issue and the workarounds that I have posted above are rather complicated to implement, I want to share with you a new workaround, which is equally hacky but much easier to implement.

The trick is to use the Convert content body REST API to convert a list of Confluence page references to HTML links. This will yield us the page IDs and URLs of the pages. As opposed to the various other ways to look up pages by title using the REST API, this approach is case insensitive, allows us to look up multiple pages at once and works with any kind of special characters in the page titles. In all my use cases, when I look up a page by title it is to find out whether a page with that title already exists, or to resolve a link to a page reference. If you need to find out more details about the page, you can still use this approach and make a second call where you look up the pages by ID using CQL.

In Confluence storage format, page references look like this:

<ac:link><ri:page ri:space-key="TEST" ri:content-title="Existing page"/></ac:link>
<ac:link><ri:page ri:space-key="TEST" ri:content-title="Non-existing page"/></ac:link>

Converting this to export_view format will yield a result like this:

<a href="https://cdauth.atlassian.net/wiki/spaces/TEST/pages/2870837249" data-linked-resource-id="2870837249" data-linked-resource-version="1" data-linked-resource-type="page">Existing page</a>
<a class="createlink" href="/wiki/pages/createpage.action?spaceKey=TEST&amp;title=Non-existing+page\">Non-existing page</a>

As you can see, existing pages are linked by absolute URL, and their page ID is listed in their data-linked-resource-id attribute. Non-existing pages link to the create page action and don’t have a data-linked-resource-id attribute. (Note: The data attributes seem to be only present when calling the convert endpoint without a content.id context.)

Here is some example code how to look up multiple pages by title. I copied this code together from various helper functions, so I hope it works and I didn’t make any mistakes. Also, I have only tested this on Confluence Cloud. I suspect that some small adjustments might be needed for Confluence Server.

import cheerio from 'cheerio';

interface PageReference {
    spaceKey: string;
    contentTitle: string;
}

interface ResolvedPageReference extends PageReference {
    id: string;
    url: string;
}

function encodeHtml(str: any): string {
    return `${str}`.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;").replace(/'/g, "&#039;");
}

export async function resolvePageReferences(pageRefs: PageReference[]): Promise<ResolvedPageReference[]> {
    const res = await new Promise<string>((resolve, reject) => {
        AP.request(`/rest/api/contentbody/convert/export_view}?expand=`, {
            type: 'POST',
            data: JSON.stringify({
                representation: 'storage',
                value: pageRefs.map((ref) => (
                    `<ac:link><ri:page ri:space-key="${encodeHtml(ref.spaceKey)}" ri:content-title="${encodeHtml(ref.contentTitle)}"/></ac:link>`
                )).join('\n')
            }),
            contentType: 'application/json',
            success: (result: string) => {
                resolve(result);
            },
            error: (xhr: any) => {
                let message: string | undefined = undefined;
                try {
                    message = JSON.parse(xhr.responseText)?.message
                } catch (e: any) { }
                reject(new Error(message || `Request failed: ${xhr.url}. Status: ${xhr.status}.`));
            }
        });
    }).then((json) => JSON.parse(json) as { value: string });

    const links = cheerio.load(`<html><body>${res.value}</body></html>`)('body > a');
    return pageRefs.flatMap((pageRef, i) => {
        const id = links.eq(i).attr('data-linked-resource-id');
        return id ? [{ ...pageRef, id, url: links.eq(i).attr('href')! }] : [];
    });
}
6 Likes

You may also generate storage format that references pages via content ID rather than space key and title:

<ac:link><ri:content-entity ri:content-id="2045575169"/>...</ac:link>
1 Like

and for attachments…

<ac:link><ri:attachment ri:filename="IMG_0169.jpeg" ><ri:content-entity ri:content-id="4128777"/></ri:attachment></ac:link>

1 Like