Practical SQL Observability for Forge Apps with forge-sql-orm

Over the past month, I introduced a new observability layer inside my library forge-sql-orm.
The goal was simple: make Forge SQL measurable and transparent — without breaking Runs on Atlassian compliance and without accessing any customer data.

As part of my Codegeist project, I integrated this layer into a real Forge app and connected it to PostHog.
This instantly gave me end-to-end visibility into resolver performance across environments, tenants, and app versions - all while staying fully within the platform boundary.

Why Observability Is Especially Hard in Forge SQL

Forge SQL is multitenant — the physical database is shared across many customers, and each tenant gets its own logical slice of data.

In practice, this creates two major challenges:


1. Tenants can have radically different dataset sizes

A resolver that runs in 50–100 ms for one customer may take hundreds of milliseconds, or even seconds, for another.

And you have no way to know this ahead of time:

  • you cannot see the tenant’s actual table sizes
  • you cannot log into the underlying database
  • you cannot estimate index selectivity per tenant
  • you can receive slow-query entries from TiDB, but only after a tenant has already experienced degraded performance

2. Platform-level analytics are available - but not enough

Forge SQL exposes some low-level database metrics:

  • slow-query logs
  • cluster statistics
  • execution summaries

However, these analytics are:

  • not tied to specific resolvers
  • not correlated with app versions or environments
  • not connected to payload size or resolver logic
  • not continuous (visible only when TiDB marks a query as “slow”)
  • not designed to show trends, regressions, or per-tenant behavior

They help diagnose extreme cases, but they’re not sufficient for understanding how your application performs in real-world multi-tenant conditions.

You only see what your resolver sees — nothing more.


Why That Becomes a Real Problem

As schemas grow and joins become more complex, behavior becomes unpredictable:

  • A query with a perfect execution plan can still be slow for a large tenant.
  • Pagination with large OFFSET becomes inconsistent between customers.
  • A new join may be harmless in dev but catastrophic for a tenant with millions of rows.
  • Regression detection is impossible without telemetry — you cannot see if performance worsened after a release.

Because all real data lives inside Atlassian infrastructure, the app developer has almost no visibility into how SQL behaves “out in the wild.”

This is exactly the gap that the new observability layer is designed to fill.


What the New Observability Layer Provides

The layer automatically captures performance characteristics for every resolver invocation.


Aggregated total DB execution time

All SQL statements executed by the resolver contribute to a single aggregated DB time metric.


Tiered logging thresholds

  • Debug when total DB time > 1000 ms
  • Warn when total DB time > 2000 ms

Thresholds can be tuned per resolver.
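As a sketch, the tiering can be expressed as a tiny pure function. The names and defaults below are illustrative, not part of the forge-sql-orm API:

```typescript
// Illustrative tier classifier for the thresholds described above.
// Defaults mirror the documented values; both can be tuned per resolver.
type LogTier = "warn" | "debug" | null;

function classifyDbTime(
  totalDbTimeMs: number,
  debugThresholdMs = 1000,
  warnThresholdMs = 2000,
): LogTier {
  if (totalDbTimeMs > warnThresholdMs) return "warn"; // slow: also dump plans
  if (totalDbTimeMs > debugThresholdMs) return "debug"; // elevated
  return null; // within normal range
}
```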


Automatic SQL plan dump

When total DB time exceeds 2000 ms, the ORM prints full execution plans for all queries executed inside the resolver directly into Forge logs.

This helps diagnose:

  • unexpected TableFullScan
  • heavy IndexJoin
  • missing indexes
  • window functions with large memory needs
  • skewed statistics for a large tenant
  • inefficient pagination

Performance telemetry sent to allowed analytics tools (e.g., PostHog)

Only anonymized metadata is sent:

  • cloudId
  • environment
  • resolverName
  • appVersion
  • totalDbExecutionTime
  • totalResponseSize

This enables weekly insights, multi-tenant comparisons, and regression detection — fully compliant and PII-free.


Automatic timeout and out-of-memory diagnostics

The observability layer also detects and analyzes failures such as:

  • “Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query.”
  • “The provided query took more than 5000 milliseconds to execute.”

When these errors occur, the ORM retrieves and logs execution plans for the failed queries — making it possible to understand the issue even without access to tenant data.
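A minimal sketch of how such failures might be recognized from their messages — a hypothetical classifier; the library's actual detection logic may differ:

```typescript
// Hypothetical failure classifier based on the two messages quoted above.
// Illustration only; forge-sql-orm's internal detection may differ.
type SqlFailureKind = "oom" | "timeout" | "other";

function classifySqlFailure(message: string): SqlFailureKind {
  if (message.includes("exceeding the allowed memory limit")) return "oom";
  if (/took more than \d+ milliseconds/.test(message)) return "timeout";
  return "other";
}
```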


Configurable by Design

One important aspect of this observability layer is that it is not global.
It is resolver-level, meaning every resolver can define its own behavior independently.

Each resolver can configure:

  • custom thresholds
  • warning levels
  • plan-dump behavior
  • analytics logic
  • additional metadata
  • sampling rules
  • environment-specific overrides
  • or disable observability entirely

This flexibility is powered by a simple callback structure:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic here
  }
);

Because the logic lives at the resolver level, the system is:

  • easy to tune
  • lightweight
  • platform-safe
  • precise and predictable

Each resolver gets exactly the level of observability it needs — no more, no less.


Even If Analytics Are Disabled — Observability Still Works

Some customers disable analytics or block outbound requests entirely.

In this case, no telemetry is sent to PostHog (or any analytics tool), but the observability layer still provides full visibility.

All essential diagnostic information remains available directly in the logs:

  • totalDbExecutionTime
  • totalResponseSize
  • per-query execution details
  • full SQL plans
  • timeout and OOM diagnostics
  • resolver-level warnings and thresholds

This means that even if analytics events never leave the customer’s infrastructure, the customer can still provide logs that fully explain:

  • which resolver was slow
  • which queries were executed
  • what each plan looked like
  • why the slowdown happened (bad plan, full scan, index join, statistics skew, etc.)

Observability does not depend on outbound analytics — telemetry is optional, but diagnostics are always available.

100% inside the Forge boundary

  • No external storage
  • No outbound data beyond anonymized telemetry
  • No PII
  • No customer content

Runs on Atlassian — by design.


Integration Example

1. manifest.yml (permissions)


permissions:
  external:
    fetch:
      backend:
        - address: "*.posthog.com"
          category: analytics
          inScopeEUD: false

2. Wrapping a resolver


resolver.define('Test Resolver', async (req: Request) => {
    const resolverName = 'Test Resolver';
    return FORGE_SQL_ORM.executeWithMetadata(
        async () => {
           return ... // resolver logic
        },
        async (totalDbExecutionTime, totalResponseSize, printQueriesWithPlan) => {
            await ANALYTIC_SERVICE.sendAnalytics(
                "sql_resolver_performance",
                resolverName,
                req.context.cloudId,
                { totalDbExecutionTime, totalResponseSize },
            );

            if (totalDbExecutionTime > 2000) {
                console.warn(
                    `Resolver ${resolverName} has high database execution time: ${totalDbExecutionTime}ms`,
                );
                await printQueriesWithPlan();
            } else if (totalDbExecutionTime > 1000) {
                console.debug(
                    `Resolver ${resolverName} has elevated database execution time: ${totalDbExecutionTime}ms`,
                );
            }
        },
    );
});

3. Sending analytics to PostHog


const appContext = getAppContext();
const properties = {
    resolverName,
    cloudId,
    envName: appContext.environmentType,
    envId: appContext.environmentAri.environmentId,
    version: appContext.appVersion,
    parsedVersion: this.parseVersion(appContext.appVersion),
    totalDbExecutionTime: data.totalDbExecutionTime,
    totalResponseSize: data.totalResponseSize,
    eventVersion: 1,
};

await fetch("https://eu.i.posthog.com/capture/", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        api_key: process.env.ANALYTICS_API_KEY,
        event: eventName,
        distinct_id: cloudId,
        timestamp: new Date().toISOString(),
        properties: properties,
    }),
});

This is enough to build dashboards that reflect real-world performance across tenants.


4. PostHog query for weekly resolver performance

SELECT
  count(uuid) AS event_count,
  properties.envId,
  properties.cloudId,
  AVG(properties.totalDbExecutionTime) AS avgTime,
  concat(properties.cloudId, ':', properties.envName, ':', properties.resolverName) AS resolverName,
  max(properties.parsedVersion),
  max(timestamp) AS last_seen_at
FROM events
WHERE event = 'sql_resolver_performance'
  AND properties.eventVersion = 1
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY
  properties.envId,
  properties.cloudId,
  properties.envName,
  properties.resolverName
HAVING avgTime > 500;

Putting It to the Test

To verify the full pipeline, I intentionally left a performance bottleneck in place:

SELECT SLEEP(2)

This pushed the total DB time above the 2000 ms threshold.

Here’s what happened:

  1. In PostHog, I immediately saw a spike.

  2. I opened the Forge logs and found the detailed plans.

  3. The execution plan pointed directly to the problematic place in code.

After removing the artificial delay, latency dropped from ~2200 ms → ~150 ms.


From Black Box to Glass Box

Before, Forge SQL was essentially a black box.
Now, with observability integrated directly into forge-sql-orm, it becomes:

  • diagnosable
  • measurable
  • predictable
  • transparent

This observability layer is lightweight, compliant, and extremely helpful when building applications with complex schemas, heavy joins, and tenant-specific performance patterns.


Try It Yourself

Codegeist project:
https://github.com/vzakharchenko/Forge-Secure-Notes-for-Jira

forge-sql-orm repository:
https://github.com/vzakharchenko/forge-sql-orm

If you want to integrate it or explore how it works — happy to help.


Thanks for sharing your work on Forge SQL ORM @vzakharchenko!

await printQueriesWithPlan();

Would this print all queries that are cached in the information schema table or does it print the specific query that is slow?

We are considering ways the backend service could help solve this generically. One idea is exposing an additional endpoint (or params on the same endpoint) that runs EXPLAIN ANALYZE queries, captures the results, and shows them in the developer console/CLI. These are initial thoughts on how we want to help developers understand which queries are slow; developers would sample which queries they need explain plans generated for. The information schema data (summary tables) is not persistent, so a deterministic way to get slow-query data might be a better experience, perhaps?

Additionally, we have a custom metrics solution (in EAP) that can be used to avoid (at least partially) using an analytics provider - https://developer.atlassian.com/platform/forge/monitor-custom-metrics/#monitor-custom-metrics-(eap). This currently works for counter metrics, but it can cover some use cases without declaring an external analytics provider.

Thanks @varun for the insightful feedback!
You are absolutely right regarding persistence. In the context of long-running functions (complex resolvers or Async Events up to 900s), queries executed early in the logic will likely be evicted from summary tables before the function completes - especially if there are heavy non-DB operations in between.

To address this, I plan to change the default completion behavior. Instead of querying summary tables, the callback will strictly log the SQL digest text (normalized queries with ? placeholders) generated by the ORM itself. This ensures developers always see exactly which queries participated in the invocation, without risking a lookup to information_schema that might return empty or stale results due to eviction. This makes the observability log deterministic regardless of function duration.
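To illustrate what digest text looks like, here is a toy normalization. TiDB computes real digests differently; this only shows the ? placeholder idea:

```typescript
// Toy SQL "digest text" normalization: replace literals with ? placeholders.
// Not the actual TiDB digest algorithm - illustration only.
function toDigestText(sql: string): string {
  return sql
    .replace(/'[^']*'/g, "?") // string literals
    .replace(/\b\d+(\.\d+)?\b/g, "?"); // numeric literals
}
```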

This creates a clearer workflow:

  1. Detection: Analytics highlights a degrading resolver.

  2. Investigation: The developer identifies the specific bottleneck and manually injects EXPLAIN ANALYZE exactly where the heavy SQL operations occur.

However, for OOM and Timeout errors, this approach still works great. Since we catch the error immediately, the metadata is still fresh in the buffer. So it works reliably as a “crash reporter,” even if full profiling for long functions requires careful scoping.

On the native EXPLAIN idea: A native endpoint would be interesting, but the implementation matters. If it’s just standard EXPLAIN, it offers limited value since plans are often identical to what we see in dev environments. The real value comes from EXPLAIN ANALYZE (actual row counts/timing/etc).
However, unlocking this natively faces two major hurdles:

  1. The Re-execution Risk: ANALYZE requires re-running the query. If the original statement failed due to OOM or Timeout, running it again will likely reproduce the same failure and crash — unless executed in a privileged, non-limited context.
  2. The Privacy Paradox: To get accurate execution stats, the engine needs the exact bind parameters (since data distribution depends on values). However, these often contain PII, so exposing them to developers would break compliance. Without them (using placeholders), ANALYZE becomes impossible or inaccurate.

That’s why the strategy of querying summary tables to diagnose OOM and Timeouts works so well as a post-mortem: it captures the plan generated by the actual execution (which had access to real data) without requiring the developer to handle sensitive input parameters or re-trigger a crash.

On Custom Metrics: at the moment, it seems focused on Counters, which are great for tracking invocation counts or error rates. However, for performance observability, it would be ideal to see support for aggregations in future updates - specifically, the ability to filter by resolver and sort tenants by consumption. Once value-based metrics become available, moving to a native solution would be the natural next step.

Thanks again!


Thanks again for the valuable feedback, @varun - I’ve shipped several improvements to forge-sql-orm based directly on our discussion. Here’s a quick overview of what changed.

1. New deterministic default mode (TopSlowest)

The default behavior no longer depends on CLUSTER_STATEMENTS_SUMMARY.

forge-sql-orm now logs the exact SQL digests executed inside the resolver, giving deterministic diagnostics even for long-running logic.

By default it prints:

  • the single slowest query, and
  • optionally that query’s execution plan (showSlowestPlans: true)

Configurable like this:

{
  topQueries: 2,          // how many slowest queries to analyze
  showSlowestPlans: true, // re-executes them with EXPLAIN ANALYZE
}

If showSlowestPlans is enabled — the library re-executes these queries with EXPLAIN ANALYZE.
If disabled — it prints only SQL + timing.



2. SummaryTable mode (optional)

SummaryTable mode still exists, but now works as an advanced diagnostic option.

It uses a short memory window:

summaryTableWindowTime: 15000 // 15s default

If resolver execution exceeds this window, forge-sql-orm automatically falls back to TopSlowest, avoiding stale diagnostics.

This keeps SummaryTable useful for fresh metadata, but avoids relying on it for long workflows.
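The fallback rule can be sketched as follows — a hypothetical helper, not the library's internal code:

```typescript
// Sketch of the SummaryTable -> TopSlowest fallback described above.
type QueryPlanMode = "TopSlowest" | "SummaryTable";

function effectiveMode(
  configured: QueryPlanMode,
  resolverElapsedMs: number,
  summaryTableWindowTime = 15000, // default window, per the docs above
): QueryPlanMode {
  if (
    configured === "SummaryTable" &&
    resolverElapsedMs > summaryTableWindowTime
  ) {
    return "TopSlowest"; // summary rows may already be evicted; fall back
  }
  return configured;
}
```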

3. Updated API

Here is the updated API with all configuration options:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic:
    // analytics, thresholds, alerts, logging, etc.
    // e.g.: if (totalDbTime > 1000) await printPlan();
  },
  {
    mode?: QueryPlanMode;            // "TopSlowest" | "SummaryTable" (default: TopSlowest)
    summaryTableWindowTime?: number; // ms window for SummaryTable (default: 15000)
    topQueries?: number;             // number of slowest queries to print (default: 1)
    showSlowestPlans?: boolean;      // print EXPLAIN ANALYZE in TopSlowest mode (default: true)
  }
);

Everything is opt-in and Forge-safe by design.

4. Timeout & OOM post-mortem diagnostics

For catastrophic SQL failures, the library performs an immediate post-mortem lookup.

Right after a Timeout or OOM, TiDB’s metadata is still in memory — so forge-sql-orm extracts the actual plan of the failing query before eviction can occur.

Case A: Timeout

“The provided query took more than 5000 milliseconds to execute…”

forge-sql-orm automatically logs the execution plan of the failing query.


Case B: Out of Memory (OOM)

“Your query has been cancelled due to exceeding the allowed memory limit…”

The library captures the memory-heavy plan that triggered the crash.


Why this matters

  • No re-execution required - avoids triggering the same timeout or OOM again.
  • Plans come from the real execution - captured with actual data distribution and bind parameters.
  • No tenant data is exposed - metadata only, fully compliant.
  • Runs entirely inside the Forge boundary - no privileged access or special APIs.
  • Works reliably even for complex, deeply nested SQL workloads - joins, pagination chains, window functions, etc.

This gives developers a safe way to understand severe failures without privileged access.

5. Why developer-side observability matters

This configurability — implemented on the application side — lets developers enable observability exactly where it’s needed:

  • resolver/long function/scheduler-level instrumentation
  • custom thresholds
  • selective plan printing
  • sampling
  • environment rules
  • optional analytics

And importantly:

You don’t need my library to do any of this.
Developers can implement the same pattern manually — forge-sql-orm simply makes it easier and more consistent.

Developer-side observability naturally complements platform-level observability.

Together, they enable building Forge apps with deep SQL execution paths: complex joins, multi-stage pagination, window functions, large OFFSET workflows — while still maintaining transparency and safety.

Curious to hear your thoughts, @varun:

Does this direction seem reasonable from the platform perspective?
Or would you recommend approaching any part of it differently?
