From 2021-04-26 06:55 UTC to 2021-04-29 07:55 UTC, we experienced an incident that affected our Forge logging service. A database partition came under unexpected load and became a hot partition. Write requests to it were throttled, which slowed down the processing of all logs. As a result, logs were delayed by about 2-5 hours during this period.
The cause was a 1st party (global) app that was logging a large volume of messages, which made the partition handling that app's logs hot. Our partitioning strategy didn't work for this app because 1st party apps are installed in a global context, which ran counter to the assumptions behind the strategy. The issue wasn't detected early because no alerting had been set up to detect an active hot partition.
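To illustrate how a single global installation can concentrate writes, here is a minimal sketch. The actual key schema of the logging service is not described in this report, so the `partition_key` function, the app ids, and the site names below are all hypothetical; the sketch only assumes that keys are derived from the app and its installation context.

```python
from collections import Counter


def partition_key(app_id: str, install_context: str) -> str:
    """Hypothetical key scheme: one partition key per app installation."""
    return f"{app_id}#{install_context}"


def writes_per_key(events):
    """Count log writes per partition key for (app_id, install_context) pairs."""
    return Counter(partition_key(app, ctx) for app, ctx in events)


# A regular app installed on 100 sites, writing 10 log lines per site.
regular = [("regular-app", f"site-{i}") for i in range(100) for _ in range(10)]

# A global app writing the same total volume from a single global context.
global_app = [("first-party-app", "global")] * 1000

counts = writes_per_key(regular + global_app)
hottest_key, hottest_count = counts.most_common(1)[0]
```

Under these assumptions the regular app's 1000 writes are spread over 100 keys, while the global app's 1000 writes all land on one key, which is the pattern that produces a hot partition.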
The fix wasn't simple: the partitioning strategy could not be adapted easily, and the limits that DynamoDB places on reads and writes to a single partition cannot be increased. We ended up adding logic to filter out logs from the 1st party apps that were causing the hot partition. This solution has been built so that logs from other apps can easily be filtered out in the future, to help avoid a repeat of this problem.
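The filtering approach can be sketched roughly as follows. This is not the deployed implementation; the `LogEntry` type, the `filter_logs` function, and the app ids are hypothetical names chosen for illustration. The point is that the exclusion list is plain data, so adding another app later requires no change to the filtering logic.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class LogEntry:
    app_id: str
    message: str


def filter_logs(entries: Iterable[LogEntry], excluded_app_ids: Set[str]) -> Iterator[LogEntry]:
    """Yield only the entries whose app is not on the exclusion list."""
    for entry in entries:
        if entry.app_id not in excluded_app_ids:
            yield entry


# Hypothetical exclusion list; in practice this might come from configuration.
EXCLUDED_APP_IDS = {"noisy-first-party-app"}

incoming = [
    LogEntry("noisy-first-party-app", "tick"),
    LogEntry("customer-app", "user signed in"),
    LogEntry("noisy-first-party-app", "tick"),
    LogEntry("another-app", "job finished"),
]

# Entries from excluded apps are dropped before they reach the database.
kept = list(filter_logs(incoming, EXCLUDED_APP_IDS))
```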
The fix described above has been deployed. Logs are back to normal, and log messages should again be returned in (close to) real time. We are going through the post-incident process to help prevent similar incidents in the future and to detect them earlier. The strategies we are looking into include early detection of hot partitions and detection of apps that log very frequently. We have also added a status indicator for Forge App Logs on the Developer Status Page (status.developer.atlassian.com) under Developer>Forge App Logs. It will indicate whether we are currently experiencing an incident affecting Forge Logs access and retrieval.
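One possible shape for detecting apps that log very frequently is a sliding-window counter per app. The sketch below is an assumption about how such a check could work, not a description of what we have built; `LogRateMonitor`, the window size, and the threshold are hypothetical.

```python
from collections import defaultdict, deque


class LogRateMonitor:
    """Flags an app when it exceeds max_writes within the last window_seconds."""

    def __init__(self, window_seconds: float, max_writes: int) -> None:
        self.window_seconds = window_seconds
        self.max_writes = max_writes
        self._timestamps = defaultdict(deque)

    def record(self, app_id: str, timestamp: float) -> bool:
        """Record one log write; return True if the app is now over the limit."""
        window = self._timestamps[app_id]
        window.append(timestamp)
        # Drop writes that have fallen out of the sliding window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) > self.max_writes


# Hypothetical threshold: more than 3 writes in 60 seconds triggers an alert.
monitor = LogRateMonitor(window_seconds=60, max_writes=3)
alerts = [monitor.record("chatty-app", t) for t in [0, 1, 2, 3, 70]]
```

Here the fourth write within the window trips the alert, and the write at t=70 does not because the earlier writes have expired. The same counter keyed by partition rather than by app would give a rough early signal for a hot partition.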