We are currently blocked by an issue with scaling the Jira Data Center cluster using the steps in the Data Center App Performance Toolkit User Guide. The re-indexing of cluster nodes is broken, resulting in an infinite “Maintenance” mode of the cluster. We’ve been working with the DC approval team on the DC approval Slack workspace to get it working, but the manual fixes / workarounds are tedious. I’ve already spent more than 3 days on this and I still don’t have a working cluster.
If I understand it correctly, this is a known issue; however, it is not being resolved because the DC approval team is in the process of replacing the CloudFormation templates with a new Kubernetes / Terraform / Helm deployment.
As there is currently no communicated timeframe, this means that the Atlassian Marketplace Partner community is left with the burden of working around a broken framework for a mandatory step of an Atlassian program. Fuck this. You can keep your Data Center approval BS. I will tell people to move back to Server. I’m done with this crap.
@tpettersen you can explain it to our “mutual” customers.
CC: @jmort and @sopel as you were also in the call with the DC team telling us that the migration to Kubernetes would not interfere with the existing programs.
Could you please provide steps to reproduce the issue?
- “Maintenance” mode is not a “known issue”. We’ve seen this issue several times from app partners, but we do not have steps to reproduce it.
- I do not believe the framework is broken. CI for Jira is green. We’ll try to reproduce this issue manually. Exact steps from you would be useful.
- The framework is not abandoned, and you can get support in a timely manner in the community Slack.
- The migration to k8s is not related to this issue at all. We want to build a better deployment solution that ships with a dataset and index, to make environment setup easier and faster for app partners.
Not sure how else to interpret this conversation? I explicitly asked if you can reproduce it, and you’re saying it’s a known issue?
(You = DC team)
A few times I had a similar problem. The solution was to disable checking the index state of nodes ([JRASERVER-66970] /status should indicate when indexes are broken on a node) and then patiently clicking and waiting for the indexes to propagate to all nodes.
This is not a solution, but it may let you continue with the process.
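If you want to script the waiting instead of clicking, each node’s `/status` endpoint (the one JRASERVER-66970 refers to) returns a small JSON body reporting the application state. A minimal sketch of evaluating that payload, assuming a response shape like `{"state": "RUNNING"}` with `"MAINTENANCE"` reported while the node is re-indexing (verify the exact states against your instance):

```python
import json
from urllib.request import urlopen

# States in which we treat a node as usable; anything else (including
# "MAINTENANCE" during re-indexing) counts as not ready. The state names
# are assumptions based on typical Jira DC /status responses.
HEALTHY_STATES = {"RUNNING"}

def node_is_healthy(status_body: str) -> bool:
    """Parse a /status response body and report whether the node is up."""
    try:
        state = json.loads(status_body).get("state")
    except json.JSONDecodeError:
        return False
    return state in HEALTHY_STATES

def check_node(base_url: str) -> bool:
    """Fetch <base_url>/status and evaluate it (makes a network call)."""
    with urlopen(f"{base_url}/status") as resp:
        return node_is_healthy(resp.read().decode())
```

Looping `check_node` over all node URLs until every one reports healthy gives you a poor man’s replacement for clicking and refreshing the admin UI.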
I believe that by “known issue” Oleksandr Popov means that he has a workaround.
Root cause 3 from this article:
But we are not sure why this happened, as the index snapshot from the first node should be propagated to all new nodes.
Hi @remie. We were able to reproduce the issue you faced.
The workaround Oleksandr Popov mentioned in Slack works:
- Go to System settings
- Indexing page
- Set the recovery index schedule to 5 minutes ahead of the current time
- Wait 10 minutes until the index snapshot is created (snapshot location
- After scaling, new nodes will get an index recovered from the index snapshot
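The “wait until the snapshot is created” step above can also be scripted by watching the snapshot directory. A minimal sketch, assuming snapshots land as zip files in a directory you can read (the directory path is a placeholder; use the snapshot location shown on the Indexing page):

```python
import time
from pathlib import Path
from typing import Optional

def newest_snapshot(snapshot_dir: str, created_after: float) -> Optional[Path]:
    """Return the most recent snapshot zip modified after `created_after`
    (a Unix timestamp), or None if no such snapshot exists yet."""
    candidates = [
        p for p in Path(snapshot_dir).glob("*.zip")
        if p.stat().st_mtime > created_after
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

def wait_for_snapshot(snapshot_dir: str, timeout_s: int = 900) -> Path:
    """Poll every 30s until a snapshot newer than the call time appears."""
    start = time.time()
    while time.time() - start < timeout_s:
        snap = newest_snapshot(snapshot_dir, created_after=start)
        if snap is not None:
            return snap
        time.sleep(30)
    raise TimeoutError(f"no index snapshot appeared in {snapshot_dir}")
```

Run `wait_for_snapshot` right after setting the recovery schedule, and only start scaling new nodes once it returns.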
Sorry for the inconvenience. We are working on improving scripts and documentation and will include fixes in the next release.
It turns out we missed this bug because CI was configured to run on weekends, and the scaling test ran the day after the default scheduled event had already happened.