What monitoring tool are you using to watch the availability of your Connect apps, and why?

ferenc.nagy · July 12, 2018, 9:40am

I’m curious which monitoring tool you are using to watch the availability of your Connect apps.
I would also appreciate a short explanation of why.

tobitheo · July 12, 2018, 3:13pm

We’re hosting our cloud apps (1 Jira, 1 Conf, 4 Stride apps) on Heroku and are working with Papertrail and Librato for logs and monitoring respectively. Both offer a reasonable free tier and are well-integrated into the Heroku ecosystem.
Both also have their drawbacks; mostly, I’m not thrilled about Librato’s UI. Some of these drawbacks could perhaps be remedied if we put some time into it, but I’m not sure.
Still, we mostly use them because they integrate well with Heroku and were easy to get started with thanks to reasonable defaults. (Heroku also offers other monitoring integrations and databases; I generally really like it.)

Generally, I feel that getting monitoring (not only for cloud apps) completely right is hard, and it’s something you have to tune over time: which errors should send you notifications, which errors from Atlassian APIs shouldn’t, what counts as high load, how (if at all) you configure autoscaling, how much you log, what you shouldn’t log, etc.
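To make the "which errors should notify" question concrete, here is a minimal sketch of a notification rule in Python. Everything in it is an assumption for illustration: the `ErrorEvent` fields, the `"app"` / `"atlassian_api"` source labels, the five-minute window, and the threshold of 5 are hypothetical, and the policy itself is just one possible starting point that you would tune over time.

```python
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    source: str          # hypothetical label: "app" or "atlassian_api"
    status: int          # HTTP status code of the failed request
    count_last_5m: int   # how often this error occurred in the last 5 minutes


def should_notify(event: ErrorEvent, threshold: int = 5) -> bool:
    """Decide whether an error should page someone (example policy only)."""
    if event.source == "atlassian_api":
        # Assumed policy: 4xx responses from the upstream API are logged
        # but never trigger a notification.
        if 400 <= event.status < 500:
            return False
        # Other upstream failures only notify once they are sustained.
        return event.count_last_5m >= threshold
    # Errors in our own app: notify on any server-side (5xx) failure.
    return event.status >= 500


if __name__ == "__main__":
    print(should_notify(ErrorEvent("atlassian_api", 403, 40)))
    print(should_notify(ErrorEvent("atlassian_api", 503, 7)))
    print(should_notify(ErrorEvent("app", 500, 1)))
```

Keeping the rule in one small function makes it easy to adjust the threshold or add cases as you learn which notifications are actually useful.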

Some of these things also depend on how much time you want to put into it and will likely vary with the app’s price point. People’s expectations of a paid app are different from their expectations of a free one. Hence the amount of testing and monitoring adjustment before going live will likely vary as well.

saurabh.gupta · July 12, 2018, 3:44pm

I am not sure, but I think we are using a couple of them:

PagerDuty
Dynatrace

scottohara · July 13, 2018, 6:58am

Similar to @tobitheo, we host our cloud apps on Heroku (1 Jira, 2 Conf apps).

We use a combination of:

Papertrail (log aggregation) - we mainly use this to monitor for any issues immediately following a deployment (such as the dynos failing to start), and for alerting on Heroku platform errors (H* error codes).
New Relic (APM) - we use this for performance monitoring, such as response times and how long requests spend in different layers of our stack (e.g. queued vs. Node.js vs. Postgres).
Rollbar (error monitoring) - for both client-side and server-side errors. We really like the way Rollbar can merge errors that may have slightly different messages (e.g. across different browsers) but are fundamentally the same underlying issue; it links errors to a suspect deploy; and it shows how many unique IPs / users / browsers etc. are affected (which helps determine whether the error is affecting just one particular section of your user base or not).
Heroku metrics - for paid dynos, Heroku offers its own metrics (CPU load, memory pressure, response times).

All of these add-ons are easily integrated into Heroku, and we only have a need for the free tier of each so far.
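
For anyone wondering what alerting on H* error codes amounts to, here is a rough Python sketch that counts Heroku platform errors in router log lines. It is not Papertrail's API or our actual setup, just an illustration of the idea. It assumes the usual Heroku router log format, where error lines carry a `code=H12`-style field; the function name `count_platform_errors` and the sample lines (including the host name) are made up for the example.

```python
import re
from collections import Counter

# Heroku router error lines contain a field such as code=H12.
H_CODE = re.compile(r"\bcode=(H\d{2})\b")


def count_platform_errors(lines):
    """Count H* error codes found in Heroku router log lines."""
    counts = Counter()
    for line in lines:
        # Only router lines are considered; app output is skipped.
        if "heroku[router]" not in line:
            continue
        match = H_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines for illustration.
SAMPLE_LINES = [
    'heroku[router]: at=error code=H12 desc="Request timeout" method=GET '
    'path="/" host=example.herokuapp.com dyno=web.1 status=503',
    'heroku[router]: at=error code=H10 desc="App crashed" method=GET '
    'path="/" host=example.herokuapp.com dyno= status=503',
    'heroku[router]: at=info method=GET path="/" '
    'host=example.herokuapp.com dyno=web.1 status=200',
    'heroku[router]: at=error code=H12 desc="Request timeout" method=POST '
    'path="/hook" host=example.herokuapp.com dyno=web.1 status=503',
    'app[web.1]: user supplied text mentioning code=H12',
]

if __name__ == "__main__":
    for code, count in count_platform_errors(SAMPLE_LINES).most_common():
        print(code, count)
```

A check like this is mostly useful right after a deployment, when a spike in H10 or H12 usually means the new release is not starting or responding properly.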