RFCs are a way for Atlassian to share what we’re working on with our valued developer community.
It’s a document for building shared understanding of a topic. It expresses a technical solution, but can also communicate how it should be built or even document standards. The most important aspect of an RFC is that a written specification facilitates feedback and drives consensus. It is not a tool for approving or committing to ideas, but rather a collaborative practice to shape an idea and to find serious flaws early.
*Please respect our community guidelines: keep it welcoming and safe by commenting on the idea, not the people (especially the author); keep it tidy by staying on topic; empower the community by keeping comments constructive. Thanks!
For the avoidance of doubt, the Atlassian Developer Terms govern any feedback you provide, and any sample code we provide is deemed to be part of the “Atlassian Platform” under that agreement.*
Summary
This project aims to enable the consumption of app invocation metrics by third-party tools.
- Publish: 12 July 2023
- Discuss: 19 July 2023
- Resolve: 2 Aug 2023
Problem
Currently, app invocation metrics can be consumed only in the developer console. This project aims to build an API that gives users the ability to use third-party tools to:
- group and filter metrics by different attributes, such as appVersion, contextAri, functionKey, moduleKey, errorType, and more
- set highly configurable alerts on metrics (defining SLIs and SLOs as necessary)
- integrate with incident response tools, like Opsgenie, PagerDuty, and more
We intend to add any new metrics that we make available via the developer console to this API. We’re looking for feedback to make sure we’re building the best possible solution.
Proposed solution
As part of this project, we’re planning to provide an API that returns invocation metrics in OTLP protobuf JSON format. A few terms used extensively throughout this RFC:
- OpenTelemetry is an Observability framework and toolkit designed to create and manage telemetry data such as traces, metrics, and logs
- The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends.
- The OpenTelemetry Collector (OTEL) offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors, scales well, and supports open source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit) sending to one or more open source or commercial back-ends.
Authentication with Atlassian GraphQL API
Note: The Atlassian account making the request has to be the same account that owns the Forge app.
Follow the steps to authenticate with the Atlassian GraphQL (AGG) API.
To get started using Basic authentication:
- Copy your API token from your Atlassian account.
- Include the token and your email in the header of your GraphQL request.
- Pass the X-ExperimentalApi header. This is required because the Forge Metrics API is still in an experimental state and is subject to change.
- Provide a custom User-Agent header. This helps differentiate traffic coming from the developer console from traffic coming from your own export service. We recommend using this value: ForgeMetricsExportServer/1.0.0
Sample AGG Query
query Ecosystem($appId: ID!, $query: ForgeMetricsOtlpQueryInput!) {
  ecosystem {
    forgeMetrics(appId: $appId) {
      exportMetrics(query: $query) {
        ... on ForgeMetricsOtlpData {
          resourceMetrics
        }
        ... on QueryError {
          message
          identifier
          extensions {
            statusCode
            errorType
          }
        }
      }
    }
  }
}
Sample AGG Query Variables
{
  "appId": "ari:cloud:ecosystem::app/8ce114f4-d82c-45e2-b4fb-c6a0751d7d57",
  "query": {
    "filters": {
      "environments": ["8cb293d5-be08-47ae-a75c-95b89da5ad1d"],
      "interval": {
        "start": "2023-06-18T02:55:00.000Z",
        "end": "2023-06-18T02:57:00.000Z"
      },
      "metrics": ["FORGE_BACKEND_INVOCATION_LATENCY", "FORGE_BACKEND_INVOCATION_COUNT", "FORGE_BACKEND_INVOCATION_ERRORS"]
    }
  }
}
Sample AGG Query Headers
{
"Authorization": "Basic base64<email:token>",
"User-Agent": "ForgeMetricsExportServer/1.0.0",
"X-ExperimentalApi": "ForgeMetricsQuery"
}
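The headers above can be assembled programmatically before issuing the query. A minimal Python sketch, assuming the standard library only; `build_agg_headers` is an illustrative helper (not part of any Atlassian SDK), and the email and token values are placeholders:

```python
import base64

def build_agg_headers(email: str, api_token: str) -> dict:
    """Build the request headers described above: Basic auth plus the
    experimental-API and custom User-Agent headers."""
    credentials = base64.b64encode(f"{email}:{api_token}".encode()).decode()
    return {
        "Authorization": f"Basic {credentials}",
        "User-Agent": "ForgeMetricsExportServer/1.0.0",
        "X-ExperimentalApi": "ForgeMetricsQuery",
        "Content-Type": "application/json",
    }

headers = build_agg_headers("me@example.com", "my-api-token")
# The resulting dict can be passed to any HTTP client alongside the
# sample query and variables shown above.
```

The same headers work with any HTTP client; only the Authorization value changes per user.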
Sample AGG Query Response
{
"data": {
"ecosystem": {
"forgeMetrics": {
"exportMetrics": {
"resourceMetrics": [
{
"resource": {},
"schemaUrl": "https://opentelemetry.io/schemas/1.9.0",
"scopeMetrics": [
{
"metrics": [
{
"name": "forge_backend_invocation_count",
"description": "",
"sum": {
"aggregationTemporality": 1,
"dataPoints": [
{
"asInt": 70,
"attributes": [
{
"key": "appId",
"value": {
"stringValue": "8ce114f4-d82c-45e2-b4fb-c6a0751d7d57"
}
},
{
"key": "appVersion",
"value": {
"stringValue": "4.64.0"
}
},
{
"key": "contextAri",
"value": {
"stringValue": "ari:cloud:confluence::site/13095d29-407d-47ec-aa57-76764a470f36"
}
},
{
"key": "environmentId",
"value": {
"stringValue": "8cb293d5-be08-47ae-a75c-95b89da5ad1d"
}
},
{
"key": "functionKey",
"value": {
"stringValue": "updateStatusTitle"
}
}
],
"startTimeUnixNano": "1687497375656000000",
"timeUnixNano": "1687497375662000000"
}
]
},
"unit": "s"
},
{
"name": "forge_backend_invocation_errors",
"description": "",
"sum": {
"aggregationTemporality": 1,
"dataPoints": [
{
"asInt": 0,
"attributes": [
{
"key": "appId",
"value": {
"stringValue": "8ce114f4-d82c-45e2-b4fb-c6a0751d7d57"
}
},
{
"key": "appVersion",
"value": {
"stringValue": "5.1.0"
}
},
{
"key": "contextAri",
"value": {
"stringValue": "ari:cloud:compass::site/6a9ea14f-759d-4f4a-b3ac-11395d8bf519"
}
},
{
"key": "environmentId",
"value": {
"stringValue": "8cb293d5-be08-47ae-a75c-95b89da5ad1d"
}
},
{
"key": "errorType",
"value": {
"stringValue": "UNHANDLED_EXCEPTION"
}
},
{
"key": "functionKey",
"value": {
"stringValue": "process-app-event"
}
},
{
"key": "moduleKey",
"value": {
"stringValue": "app-event-webtrigger"
}
}
],
"startTimeUnixNano": "1687488960000000000",
"timeUnixNano": "1687489020000000000"
}
]
},
"unit": "s"
}
]
}
]
}
]
}
}
}
}
}
Notes
- Try the AGG API here: GraphQL Gateway
- Each API call retrieves at most 15 minutes of metrics. This limit is enforced to keep the number of data points in each API response manageable.
- The preferred approach is to fetch data periodically, for example, every 3 or 5 minutes.
- A rate limit of 5 calls per minute per user token applies.
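Because each call covers at most 15 minutes, backfilling a longer time range means splitting it into windows before querying. A small illustrative sketch (the function and constant names are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

# Hypothetical cap mirroring the per-call limit described above.
MAX_WINDOW = timedelta(minutes=15)

def iter_windows(start: datetime, end: datetime) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (start, end) intervals no longer than 15 minutes,
    covering [start, end) without gaps or overlaps."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + MAX_WINDOW, end)
        yield cursor, nxt
        cursor = nxt

# A 40-minute range splits into three windows:
# 02:00-02:15, 02:15-02:30, 02:30-02:40
windows = list(iter_windows(
    datetime(2023, 6, 18, 2, 0), datetime(2023, 6, 18, 2, 40)))
```

Each window would then become one `interval` in the query variables, with calls spaced out to stay under the 5-calls-per-minute rate limit.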
Expected partner flow when consuming metrics
Partner Server
To consume the Atlassian GraphQL API programmatically and ingest metrics into a monitoring tool in near real time, we envisage partner infrastructure having the following two components:
CronJob Service
The CronJob service periodically polls the exposed GraphQL endpoint for the required metrics. The AGG endpoint returns a response in the OTLP protobuf JSON standard format. That response is then pushed as-is to the OTEL sidecar running alongside the cron service. A few possible approaches:
- Serverless framework: If using AWS infrastructure, you can configure a Lambda function to be executed every “x” minutes. A similar configuration should be possible with GCP Cloud Functions as well. A sample Lambda configuration can look like the following:
Sample lambda configuration
MyLambdaFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: MyLambdaFunction
Runtime: nodejs14.x
Handler: index.handler
Code:
S3Bucket: my-function-bucket
S3Key: my-function-package.zip
Layers:
- !Ref OTelLambdaLayer
Environment:
Variables:
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: /var/task/config.yml
MyScheduledRule:
Type: AWS::Events::Rule
Properties:
Description: My scheduled rule
ScheduleExpression: rate(3 minutes)
State: ENABLED
Targets:
- Arn: !GetAtt MyLambdaFunction.Arn
Id: MyLambdaTarget
- Server framework: If using AWS infrastructure, you can set up a dedicated EC2 instance running a server that polls the AGG API every “x” minutes. This can be a VM if running in an on-premises data center.
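Whichever scheduling approach is used, the handler body is the same: query AGG, unwrap `data.ecosystem.forgeMetrics.exportMetrics`, and forward that object unchanged to the local collector. A Python sketch under those assumptions (the helper names are illustrative, the sidecar URL matches the setup described later, and error handling is minimal):

```python
import json
from urllib import request as urlrequest

# OTLP/HTTP receiver of the sidecar collector (illustrative default).
OTEL_SIDECAR_URL = "http://localhost:4318/v1/metrics"

def extract_otlp_payload(agg_response: dict) -> dict:
    """Unwrap the OTLP body from the AGG response envelope.
    Raises if the union resolved to QueryError instead of OTLP data."""
    body = agg_response["data"]["ecosystem"]["forgeMetrics"]["exportMetrics"]
    if "resourceMetrics" not in body:
        raise RuntimeError(f"AGG returned an error: {body}")
    return body

def forward_to_sidecar(agg_response: dict) -> None:
    """POST the OTLP JSON as-is to the collector's HTTP receiver."""
    payload = extract_otlp_payload(agg_response)
    req = urlrequest.Request(
        OTEL_SIDECAR_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urlrequest.urlopen(req)
```

The same unwrap-and-forward logic works inside a Lambda handler or a long-running server loop; only the scheduling wrapper differs.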
OTEL Collector/Sidecar
Running an OTEL Collector involves configuring the following three components:
- Receiver: A receiver, which can be push- or pull-based, is how data gets into the OTEL Collector. We’ll use the OTLP receiver, which can accept export calls via HTTP/JSON. The AGG response is compatible with the format this receiver accepts.
- Processors: Processors run on data between reception and export. Processors are optional, but some are recommended.
- Exporters: An exporter, which can be push- or pull-based, is how you send data to one or more backends or destinations. All supported exporters can be found here.
A few approaches to run the OTEL collector, with a serverless or server framework as suitable:
- Serverless framework: If using AWS infrastructure, we can leverage the OTEL Lambda layer. For GCP or Azure, an equivalent concept can be used as applicable.
Sample lambda with lambda layer configuration
Resources:
OTelLambdaLayer:
Type: AWS::Lambda::LayerVersion
Properties:
LayerName: OTelLambdaLayer
Description: My OTEL Lambda layer
Content:
S3Bucket: my-layer-bucket
S3Key: my-layer-package.zip
CompatibleRuntimes:
- nodejs14.x
MyLambdaFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: MyLambdaFunction
Runtime: nodejs14.x
Handler: index.handler
Code:
S3Bucket: my-function-bucket
S3Key: my-function-package.zip
Layers:
- !Ref OTelLambdaLayer
Environment:
Variables:
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: /var/task/config.yml
MyScheduledRule:
Type: AWS::Events::Rule
Properties:
Description: My scheduled rule
ScheduleExpression: rate(3 minutes)
State: ENABLED
Targets:
- Arn: !GetAtt MyLambdaFunction.Arn
Id: MyLambdaTarget
- Server framework: Run the OTEL collector as a sidecar Docker container on the same VM/EC2 server responsible for cron scheduling.
a. Create a sample otel-collector-config.yaml file in the repository as needed. Assuming signalfx is the external monitoring tool (AKA exporter), the config file should look similar to:
Sample otel-collector-config.yaml file
receivers:
otlp:
protocols:
http:
exporters:
signalfx:
# Access token to send data to SignalFx.
access_token: <access_token>
# SignalFx realm where the data will be received.
realm: us1
# Timeout for the send operations.
timeout: 30s
processors:
batch:
service:
pipelines:
metrics:
receivers: [otlp]
processors: [batch]
exporters: [signalfx]
b. Build a Docker image from the open source OTEL collector image (GitHub: open-telemetry/opentelemetry-collector-contrib, the contrib repository for the OpenTelemetry Collector) using the command: docker build . -t otel-sidecar:v1
Sample dockerfile
FROM otel/opentelemetry-collector-contrib:latest
# Copy the collector configuration file into the container
COPY otel-collector-config.yaml /etc/otel-collector-config.yaml
# Start the collector with the specified configuration file
CMD ["--config=/etc/otel-collector-config.yaml"]
c. Run the above Docker image: docker run -p 4318:4318 otel-sidecar:v1. This will spin up the OTEL sidecar at http://localhost:4318
d. Make an HTTP POST request with the response of the above AGG API endpoint (i.e. response.data.ecosystem.forgeMetrics.exportMetrics) to the sidecar running at http://localhost:4318/v1/metrics on the same server
Sample HTTP POST curl request to OTEL sidecar
curl --location --request POST 'localhost:4318/v1/metrics' \
--header 'Content-Type: application/json' \
--data-raw '<response.data.ecosystem.forgeMetrics.exportMetrics>'
e. Metrics should now be visible in the configured monitoring tool (SignalFx in the above case).
Feedback
While we would appreciate any feedback, we’re especially interested in learning more about:
- Will the proposed feature allow you to easily consume the metrics that we make available? If not, what would be your preferred method and why?
- What functionality (alerting, integration, advanced filters) will you configure once you have the metrics in your third-party tool?
- Once the initial version with the invocation metrics is released, which metrics should we add next?