DC App Performance Toolkit appreciation post!

:wave:

I wanted to take some time to write about my recent interactions with the DC App Performance Toolkit engineering team.

As you all know, I can be very critical of Atlassian. I was very critical of the move to Terraform/Kubernetes for the DC approval process, as it would increase the complexity of the deployment process for Marketplace Partners who had no experience with Terraform or Kubernetes.

I became even more vocal when the new solution was introduced without a proper replacement for the old CloudFormation one-click deployment, even though the team had promised one during a meeting on the subject. As a result, I postponed our participation in the program, which left our annual review submission 233 days overdue.

In those 233 days, the DC team reached out to me and we had very open conversations. The team listened to my concerns and worked on a solution, which was made available to me from a DEV branch and eventually ended up in the 7.5.0 release of the DC App Performance Toolkit.

I used the solution the team provided to run the performance & scaling tests of 23 DC apps (!) last week, and I was genuinely impressed by the process.

The one-click Docker container solution created by the team was well documented, easy to follow, and allowed me to provision the environment without any understanding of Terraform or Kubernetes. Although not officially supported, I even managed to run performance tests of 3 products (Jira, Confluence & Bitbucket) simultaneously on the same cluster :exploding_head:

The main benefits of the current Terraform solution:

  • Three simple CLI commands to install, uninstall and terminate the cluster (using Docker); see the sketch after this list
  • No more manual steps: the Terraform solution provisions the cluster, the instance, the database, the shared storage, everything. All you need to do is change a few variables in a text file
  • It scales both vertically and horizontally, allowing you to run multiple products on the same cluster and saving time when you test more than one product
  • The solution is scriptable: for the next iteration I will be automating the entire process to run from CircleCI
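
For anyone who has not tried it yet, here is a minimal sketch of what those commands look like. The install command is the documented one (it is also quoted verbatim further down this thread); the teardown variant is my assumption from memory, so check the DCAPT docs for the exact script name and flags:

docker run --pull=always --env-file aws_envs \
  -v "$PWD/dcapt.tfvars:/data-center-terraform/config.tfvars" \
  -v "$PWD/.terraform:/data-center-terraform/.terraform" \
  -v "$PWD/logs:/data-center-terraform/logs" \
  -it atlassianlabs/terraform ./install.sh -c config.tfvars

# Tearing the environment down follows the same pattern (assumption: uninstall.sh)
docker run --pull=always --env-file aws_envs \
  -v "$PWD/dcapt.tfvars:/data-center-terraform/config.tfvars" \
  -v "$PWD/.terraform:/data-center-terraform/.terraform" \
  -v "$PWD/logs:/data-center-terraform/logs" \
  -it atlassianlabs/terraform ./uninstall.sh -c config.tfvars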

The team was also very helpful in dealing with any trouble I ran into, and I was able to provide feedback on the scripts.

So well done @OleksandrMetelytsia and team :clap::clap::clap:

28 Likes

I also want to chime in here.

Similar to Remie, I've used the Docker-based cluster setup before it was officially released and can only confirm what he said.
The setup process was extremely simple (one docker command) and the cluster was available and ready to go in less than an hour. That's a huge step forward compared to previous approaches, where we had to load the dataset ourselves. I'm sure it will save the ecosystem countless hours of manual cluster setup work.

Therefore, a big 'thank you' to the team maintaining the testing toolkit and the whole DC team in general for these improvements and for always being very responsive in both Slack and the ECOHELP tickets.

Cheers,
Jens

6 Likes

2 kudos for the new Terraform/Docker framework here!

  • It's quite close to a 1-click solution,
  • Maybe it would be nice if it auto-deployed the JMeter instance too, since launching and installing it requires some understanding of AWS,
  • Oleksandr seems to be working 12 hours a day, maybe even 24, since I haven't found a moment when he wasn't available.

The framework didn't work in the default AWS zone; it took me two days to set things up and understand that.

But now that it is working, wow, I can execute the 5 tests in less than one day! It used to take two weeks! The commands are really easy: there are no parameters to change, and it doesn't require importing SQL data manually or reindexing a million Confluence pages. Runs 1 & 2 and runs 3-4-5 could probably even be scripted separately, but it's great work that has been performed there!

Concerning the scripting, it is a pity the framework doesn't just have us set up an EC2 machine that then runs the Terraform provisioning, the BZT tests, scales the cluster, runs BZT again, and so on. But the current setup is already excellent and an incredibly good improvement over the past!
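
To illustrate the idea, here is a very rough sketch of such a wrapper script. The provisioning command is the documented one quoted elsewhere in this thread (minus -it so it can run unattended); the sed edits and the bzt calls are only placeholders for however you drive the toolkit runs, not something the toolkit ships:

#!/usr/bin/env bash
set -euo pipefail

provision() {
  # Re-running install.sh applies whatever is currently in dcapt.tfvars
  docker run --pull=always --env-file aws_envs \
    -v "$PWD/dcapt.tfvars:/data-center-terraform/config.tfvars" \
    -v "$PWD/.terraform:/data-center-terraform/.terraform" \
    -v "$PWD/logs:/data-center-terraform/logs" \
    atlassianlabs/terraform ./install.sh -c config.tfvars
}

set_nodes() {
  # Change the replica count in the tfvars file before re-provisioning
  sed -i "s/^confluence_replica_count.*/confluence_replica_count = $1/" dcapt.tfvars
}

set_nodes 1 && provision
bzt confluence.yml   # placeholder: runs 1 & 2 on a single node

set_nodes 2 && provision
bzt confluence.yml   # placeholder: the scaled runs 3-4-5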

Kudos to that team!

4 Likes

Let me say thank you to the DCAPT team. At first, I was also confused by the change from CloudFormation to the Terraform/k8s platform, because our team was used to the CloudFormation approach. However, the new procedure is understandable and shorter than before. It is great that we can skip loading the huge initial test data. It would be nice if the execution environment were also created automatically. :smile:

The DCAPT team's support is always excellent and prompt, and their Slack channel always helps our team.

Thank you very much :clap:

4 Likes

Thank you for your feedback, community!

12 Likes

We are currently in the process of conducting our annual DC performance testing, and have tried the Terraform/k8s method this time around, instead of the older Quick Start CloudFormation templates.

Overall the experience has been great. We don't use DCAPT to test our app, as we don't have much experience with Taurus/JMeter and have never really been able to get it to work properly for us (instead, we have a suite of Cypress browser tests and measure end-user perceived times); but in terms of standing up the enterprise-scale DC cluster and dataset, the TF/k8s method has been very smooth.

The initial run takes roughly the same time as the older Quick Start method, maybe a little less (presumably because the bottleneck is AWS provisioning the resources), but having the data load automated via an RDS snapshot instead of the old pg_restore script involves fewer manual steps.

One thing we have noticed, and weā€™re not sure if this is expected or something specific to us:

When we run the command to bring up the environment:

docker run --pull=always --env-file aws_envs \
-v "$PWD/dcapt.tfvars:/data-center-terraform/config.tfvars" \
-v "$PWD/.terraform:/data-center-terraform/.terraform" \
-v "$PWD/logs:/data-center-terraform/logs" \
-it atlassianlabs/terraform ./install.sh -c config.tfvars

…it seems to get stuck at "Acquiring state lock. This may take a few moments…" for roughly 20-30 minutes. The first time we assumed it had hung, so we killed it (and then had to figure out how to manually reset the Terraform state lock). The next time we just left it running and found that it eventually completed.

We assumed this was because on first run it has to create everything from scratch.
However, when we went to scale the cluster from 1 to 2, and then again to 4 nodes (by editing the confluence_replica_count in the dcapt.tfvars file and re-running the above command), it spent the same ~20-30 minutes stuck at the same spot.

I'm not sure what it's doing (or whether there's any way to get a better indication of progress during this time), but it would be great if there were a quicker way to scale the cluster.

Would it be faster to log into the AWS console and scale it manually? If so, how would we do this? (are we scaling the number of pods in the EKS cluster?)
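
(If it is just a matter of scaling pods, we are guessing it would be something like the command below, but the StatefulSet name and the atlassian namespace are assumptions on our part rather than anything we have confirmed:)

kubectl scale statefulset confluence --namespace atlassian --replicas=2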

Or is it not supposed to take this long to apply the Terraform after editing the vars?

This is really our only issue with the new process.

Putting this here as I'm not quite sure if there is a better place to report this sort of issue.

As per my earlier post, we have been successfully using the Terraform/Docker/k8s method from DCAPT for our annual DC testing since last year, and across all of our apps we have used the process numerous times without issue.

This time around, however, we're having trouble scaling our Confluence cluster from 3 → 4 nodes due to what seems to be a vCPU limit.

This is not something we encountered previously when using this method, so we can only assume that something recently changed in the TF templates that now requires additional resources, causing the number of EKS nodes to scale beyond the allowed default limit of 32 vCPUs.

DCAPT uses a default instance size of m5.2xlarge for EKS nodes, each of which has 8 vCPUs. Therefore, 4 EKS nodes consume all 32 vCPUs.

However in the default dcapt.tfvars file we note that:

min_cluster_capacity = 1
max_cluster_capacity = 6

This would seem to suggest that the EKS cluster is allowed to grow beyond 4 nodes, which would then exceed the limit of 32 vCPUs.
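
In other words, by our arithmetic: a max_cluster_capacity of 6 m5.2xlarge nodes at 8 vCPUs each could request up to 6 x 8 = 48 vCPUs, well beyond the default quota of 32.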

Workflow

At a high-level, our workflow looks like this:

  1. git pull the latest version of DCAPT from the GitHub repo
  2. Put our AWS access/secret keys into aws_envs
  3. Modify the default dcapt.tfvars file with the following changes:
     products = ["confluence"]
     confluence_license = "...our DC license goes here..."
     confluence_replica_count = 1
  4. Run the docker run .... ./install.sh command to provision the necessary infrastructure for a single-node cluster
  5. Run our performance tests without/with our app installed
  6. Bump confluence_replica_count to 2 and rerun the docker run ... ./install.sh command to scale the cluster to two nodes
  7. Run our 2-node scale testing
  8. Bump confluence_replica_count to 4 and rerun the docker run ... ./install.sh command to scale the cluster to four nodes
  9. Run our 4-node scale testing

Problem

When we get to step 8 (scaling from 2 → 4 nodes), we are finding that creating the 4th node fails and we're left with only 3 nodes running in the cluster.

Upon further investigation in EKS, we note that the Pod for the 4th Confluence node (confluence-3) is in a Pending state with the following two events showing:

Warning FailedScheduling a few seconds ago default-scheduler
0/4 nodes are available: 3 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.

Normal TriggeredScaleUp a few seconds ago cluster-autoscaler
pod triggered scale-up: [{eks-appNode-m5_2xlarge-20240828010837237600000013-bac8cb43-83dc-8654-a42c-3a586af3c029 4->5 (max: 6)}]

The TriggeredScaleUp event shows that EKS attempted to scale from 4 → 5 nodes as a result of adding this 4th Confluence node…

The EKS node group shows the following health warning:

Could not launch On-Demand Instances.
VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to.
Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.

We tried setting max_cluster_capacity = 4 to prevent the EKS cluster from scaling beyond 4 nodes, but this results in the following event for the 4th Confluence pod (confluence-3):

Warning FailedScheduling 12 minutes ago default-scheduler
0/4 nodes are available: 3 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
Normal NotTriggerScaleUp 12 minutes ago cluster-autoscaler
pod didn't trigger scale-up: 1 max node group size reached

…which suggests that it needs more than 4 EKS nodes to provision a complete 4-node Confluence cluster.

Solution?

What does Atlassian recommend vendors do here?

  • Should we be petitioning AWS for an increase to the 32 vCPU limit? (This wasn't necessary previously when using DCAPT.)
  • Should we be using smaller EC2 instance sizes for our testing?
  • Should we be reducing the confluence_cpu = 6 setting so that each Confluence pod consumes fewer resources within the EKS cluster? (A rough illustration is below.)
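
For illustration, the third option is the kind of dcapt.tfvars tweak we have in mind; the values are purely guesses on our part and not something we have confirmed is supported:

products = ["confluence"]
confluence_replica_count = 4
confluence_cpu = 4   # guessed value, down from the default 6, so that four pods might fit within 32 vCPUs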

Any help would be greatly appreciated, as we are currently blocked from completing our annual DC testing.

Thanks in advance.

I can also confirm that this year's DC approvals for our apps went very smoothly, thanks to Atlassian's DC App Performance Toolkit and their fast and helpful support via Slack. It's a big improvement compared to previous years and saves tons of manual work.
Much appreciated :slight_smile:

3 Likes

Indeed, such a smooth process. I just completed a DC approval in 2 days (it was actually possible to do it within 1 day).
Thanks @OleksandrMetelytsia and the team!

1 Like