Compatibility Scheduler not working on Confluence 5.10+

francesco.stefanini · May 18, 2017, 9:01am

Hi there,
I’m developing a server addon using the atlassian-scheduler-compat library (version 1.2) because I need to schedule a recurring task and my plugin must run on Confluence 5.9 and above (so I cannot use the plain atlassian-scheduler APIs).

On Confluence 5.9 everything works fine: the library falls back on the legacy SAL scheduler and my job runs on time, as expected.

When I switch to Confluence 5.10 or 6.0, nothing happens. The compat library detects that the atlassian-scheduler-api is available (I can see that from the logs), but when it comes to executing the task, it logs “Job has unexpected run mode RUN_LOCALLY” and fails.

I’ve tried to debug the library code: in the ClusteredCompatibilityPluginScheduler class I can see that the job is scheduled with RunMode set to RUN_ONCE_PER_CLUSTER but in the ClusteredJobRunner class, when the job gets executed, RunMode is instead equal to RUN_LOCALLY and it aborts.

What am I doing wrong? Any tips on how to handle scheduled jobs in both older and current versions of Confluence? Is using the atlassian-scheduler-compat library still a good idea, since it hasn’t been updated in years?

Thank you,
Francesco

cfuller · May 18, 2017, 10:19pm

Although it’s targeting JIRA rather than Confluence, the compat example plugin that I wrote when we were introducing the scheduler API for JIRA 6.3 should show most of what needs to happen.

Things are a little bit more complicated in Confluence because it had a separate clustering solution before the Data Center offerings were created. That solution involved scheduling jobs locally with a flag that said they should abort if they couldn’t obtain a cluster-wide lock, which leaves a race condition (if the job is fast enough, then more than one node could still run it in those older versions). To make matters worse, there was a short period in which Confluence shipped the atlassian-scheduler API to satisfy requirements of the Embedded Crowd directory synchronization code, but it was a crippled version that was still backed by Quartz and could only run local jobs. I think this must be the problem that you’re running into: it is detecting the presence of the API and trying to use it, but the API is not completely functional in those versions.
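
A minimal sketch of that race, written as a single-threaded simulation rather than real Confluence code. The ClusterLock class, the fire times and the job durations are all made up for illustration; the only point is that a lock which is released as soon as the job finishes cannot stop a slightly later node from running the same job again.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulation of "schedule locally on every node, abort if the cluster lock is taken". */
public class LockRaceSketch {

    /** Hypothetical cluster-wide lock: held until a given time (ms), free afterwards. */
    static final class ClusterLock {
        private long heldUntil = Long.MIN_VALUE;

        /** Takes the lock for durationMs if it is free at time now; returns false otherwise. */
        boolean tryAcquire(long now, long durationMs) {
            if (now < heldUntil) {
                return false; // another node is still running the job, so this node aborts
            }
            heldUntil = now + durationMs;
            return true;
        }
    }

    /**
     * Every node fires the same job at its own local time (ascending order, in ms)
     * and runs it only if it can take the lock. Returns the indexes of the nodes that ran.
     */
    public static List<Integer> nodesThatRan(long[] fireTimesMs, long jobDurationMs) {
        ClusterLock lock = new ClusterLock();
        List<Integer> ran = new ArrayList<>();
        for (int node = 0; node < fireTimesMs.length; node++) {
            if (lock.tryAcquire(fireTimesMs[node], jobDurationMs)) {
                ran.add(node);
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        long[] fireTimes = {0, 10}; // node 1 fires 10 ms after node 0
        System.out.println("slow job (50 ms): " + nodesThatRan(fireTimes, 50)); // prints [0]
        System.out.println("fast job (5 ms):  " + nodesThatRan(fireTimes, 5));  // prints [0, 1]
    }
}
```

With the 50 ms job only node 0 runs; with the 5 ms job the lock is already free again when node 1 fires, so both nodes run it.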

So we don’t have a lot of good options here. I’m not going to be able to spend time updating the compatibility library to properly support those versions of Confluence that have atlassian-scheduler in a broken state. On the other hand, the scheduler code is open source and the compatibility library is buried in its history. If you check out the sources from the tag atlassian-scheduler-1.2, you can just take the code from there and make your own version of it to use directly in your plugin. Probably you’d just need to alter AutoDetectingCompatibilityPluginScheduler to return false for isAtlassianSchedulerPresent() when on the Confluence versions that cannot create clustered jobs via the atlassian-scheduler API (the easiest check is just whether you’re at least on Confluence 6.0 or not).
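
A rough sketch of what that version check could look like. ConfluenceVersionGate and its method names are hypothetical, not part of the compat library, and the sketch assumes the host version is available as a dotted string such as "5.10.8" (for example from SAL's ApplicationProperties.getVersion(), if that is what your plugin already uses). The comparison is numeric on purpose, because a plain string comparison would rank "5.10" below "5.9".

```java
/** Hypothetical helper: decides whether atlassian-scheduler can be trusted for clustered jobs. */
public class ConfluenceVersionGate {

    /**
     * Parses the leading "major.minor" of a version string such as "5.10.8" or "6.0.1-m12".
     * Throws NumberFormatException if either part is not a plain number.
     */
    static int[] majorMinor(String version) {
        String[] parts = version.split("[.\\-]");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return new int[] {major, minor};
    }

    /** Numeric comparison of major.minor against a minimum version. */
    public static boolean isAtLeast(String version, int major, int minor) {
        int[] v = majorMinor(version);
        return v[0] > major || (v[0] == major && v[1] >= minor);
    }

    /** Assumption taken from the post above: only Confluence 6.0 and later is treated as safe. */
    public static boolean canUseAtlassianScheduler(String confluenceVersion) {
        return isAtLeast(confluenceVersion, 6, 0);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"5.9.4", "5.10.8", "6.0.1", "6.2.0"}) {
            System.out.println(v + " -> use atlassian-scheduler: " + canUseAtlassianScheduler(v));
        }
    }
}
```

In a private copy of the library, a check like canUseAtlassianScheduler(...) could be combined with the existing class detection inside isAtlassianSchedulerPresent(), so that older versions fall back to the SAL scheduler.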

francesco.stefanini · May 28, 2017, 4:36pm

Hi Chris,
thank you for your great reply!

Unfortunately on Confluence 6.0+ (even on the latest 6.2) the job gets scheduled in local mode, not in clustered mode.

Following your suggestion, I’ve made my own version of the compatibility scheduler, but instead of altering the isAtlassianSchedulerPresent method I have removed all the lines of the code when it checks if the job has been scheduled in clustered mode. So now every job runs properly no matter what.

I’m aware it’s not a great solution at all, but now I can schedule the same task on every version of Confluence and the compatibility scheduler gets it done using either the SAL scheduler or the Atlassian one, and in the latter case it no longer checks whether RunMode is equal to RUN_ONCE_PER_CLUSTER.

@cfuller do you think there could be problems by doing so? Am I missing something critical that could break everything once the plugin gets installed in production?

Thanks,
Francesco

cfuller · May 28, 2017, 9:49pm

Bundling the compat library into your plugin is the way it is supposed to work anyway, and I have my suspicions that the Confluence server team wouldn’t make this a very high priority, so I agree that this is likely to be the best way forward for you. However, I’m not really sure I follow this bit:

I have removed all the lines of the code when it checks if the job has been scheduled in clustered mode. So now every job runs properly no matter what.

Would it be possible to get you to push up your changes somewhere so that I can see better what you mean? Since I don’t really understand what you mean, it’s hard to put my QA hat on and try to poke holes in it.

francesco.stefanini · May 29, 2017, 6:13am

Sure, here’s a snippet of what I meant: atlassian-scheduler-compat 1.2 — Bitbucket

Basically in those files I have commented out the bits (line 124 in the first one, lines 43 to 46 in the second) where it verifies that the job has been scheduled in clustered mode. That’s the only modification I have made to the scheduler code.

Again, thank you for your time,
Francesco

cfuller · May 29, 2017, 10:31am

Ahhh ok… now I understand. Here’s the problem… And forgive me, because it’s going to take a bit of a history lesson to explain where we’re at and why, which is in turn needed to really understand the problem with what you’ve done.

We (well… I…) first created the compat library to provide for that transition period when we had just started working on JIRA 6.3 and knew we would be targeting that release for introducing the JIRA Data Center offering. This was to be the first version of JIRA that had any kind of clustering support, and the historic direct use of the Quartz scheduler was an obvious problem. This is because Quartz can be configured to run local jobs using an in-memory store or configured to use a persistent store and do proper cluster-safe locking, but it doesn’t know how to do both at once, and we knew we needed to make both job types available.

There was also one alternative: SAL’s PluginScheduler API. There were some things that were nicer about this interface, at least on the surface, but its implementation was horribly broken in JIRA. It had historically created jobs that kind of persisted, except that the job’s payload did not. What a mess!

This is the main reason why we created the atlassian-scheduler API. It called out that some jobs were local, in-memory jobs and others were clustered, persisted jobs. It also abstracted away all of Quartz, which was a bonus both because 1) we actually needed to run 2 copies of it now to handle both job types, and 2) that allowed me to write the Caesium library that replaces it in JIRA 7 and Confluence 6 without breaking things. But we had 2 big problems:

Confluence didn’t have the atlassian-scheduler API… or rather, it did, but it was a very old copy of it from before we added clustering support in it, and their own Confluence Data Center offering wasn’t in desperate need of these changes. There was already a clustered version of Confluence, and it made clustered jobs “work” by scheduling them as local but acquiring a lock. It wasn’t perfect (sometimes more than one node ran the jobs at the given time because it was fast enough that the lock had already been released when another node came to acquire it), but due to how they ran jobs, this just wasn’t as big of a deal for them.
We knew that it was a serious problem for the JIRA ecosystem, in particular for many of the major plugins (both our own and those of third parties), that really needed jobs to be cluster-safe but also needed a way to continue to support older versions of JIRA.

So I built the compat library, and together with people from some of our own plugins and Confluence’s developers, we worked out exactly what magic potions were needed to make the compat library work in every version of JIRA or Confluence that was currently released.

But things didn’t stay that way. The atlassian-scheduler API was widely adopted by everybody except Confluence for a long time. Eventually, somebody decided to switch the Embedded Crowd library, which both JIRA and Confluence use for managing user directories and syncs, over to the atlassian-scheduler API. This meant that when Confluence needed to update the Embedded Crowd library to a version that used it, they had to make the atlassian-scheduler API available to it. And this is where things go wrong, because the problem wasn’t thought through completely.

To make it work, these versions of Confluence that have the modern atlassian-scheduler API but predate 6.0 have a completely crippled version of it that, no matter what you ask, will always schedule the job to run locally. It won’t even do the old locking hack that was done in the earlier versions. This is because it was only ever intended to be used by Embedded Crowd. They didn’t really think about the fact that anybody else would ever see it and try to use it. It even has a disclaimer to that effect in the code:

    <!-- Just enough atlassian-scheduler to make Crowd work. NOT CLUSTER SAFE! (Unless you do your own locking like Crowd) -->
    <bean id="schedulerServiceForCrowdOnly" class="com.atlassian.scheduler.quartz1.Quartz1SchedulerService" autowire-candidate="false">

They tried very hard to keep it away from you and tell you not to use it. But all of this happened after the compat library was written, so it doesn’t know to watch out for it and avoid using this. It’s looking only for whether or not it is able to hunt down the atlassian-scheduler API classes (which it can) so it tries to use it. And it succeeds!

But the problem is that this is not really cluster-safe, just like it says. The run mode checks that you commented out were put there specifically to detect this kind of problem and warn you that something is wrong. Removing the checks doesn’t fix the problem; it just hides it.
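
To make the difference concrete, here is a small self-contained sketch. The RunMode enum below is only a stand-in with the same two constant names mentioned in this thread, and runIfClustered / runUnchecked are hypothetical methods, not the compat library's code. Both "nodes" see the job as RUN_LOCALLY, which is what the crippled scheduler produces.

```java
/** Sketch: what a run mode check buys you when every node schedules the job locally. */
public class RunModeGuardSketch {

    /** Stand-in for the scheduler's run modes; not the real atlassian-scheduler type. */
    public enum RunMode { RUN_LOCALLY, RUN_ONCE_PER_CLUSTER }

    /** Hypothetical outcome of one run attempt. */
    public enum Outcome { RAN, ABORTED }

    /** With the check: refuse loudly when the job is not actually clustered. */
    public static Outcome runIfClustered(RunMode actual, Runnable job) {
        if (actual != RunMode.RUN_ONCE_PER_CLUSTER) {
            System.err.println("Job has unexpected run mode " + actual);
            return Outcome.ABORTED;
        }
        job.run();
        return Outcome.RAN;
    }

    /** With the check removed: the job runs wherever it happens to fire. */
    public static Outcome runUnchecked(RunMode actual, Runnable job) {
        job.run();
        return Outcome.RAN;
    }

    public static void main(String[] args) {
        int nodes = 2;
        int[] guardedRuns = {0};
        int[] uncheckedRuns = {0};
        for (int node = 0; node < nodes; node++) {
            runIfClustered(RunMode.RUN_LOCALLY, () -> guardedRuns[0]++);
            runUnchecked(RunMode.RUN_LOCALLY, () -> uncheckedRuns[0]++);
        }
        System.out.println("runs with check:    " + guardedRuns[0]);   // 0, and the failure is visible
        System.out.println("runs without check: " + uncheckedRuns[0]); // 2, one per node, silently
    }
}
```

Neither result is what a clustered job should do (exactly one run), but only the checked version tells you that something is wrong.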

If you really don’t care about cluster safety at all or you are happy to provide your own locking anyway, then it may make sense to just stick with SAL’s PluginScheduler until you are solidly on Confluence 6+ only. Alternatively, it really makes more sense to leave the run mode checks in place and instead reject the use of the atlassian-scheduler API in pre-6 versions where it is known to do the wrong thing.

Main areas that you definitely want to QA:

Does your job end up behaving correctly in Confluence Data Center in all major versions from 5.7 to 6.0?
Have you properly accounted for the upgrade pathway, so that when the compat library switches from the SAL impl to the atlassian-scheduler impl, the job correctly gets changed to a clustered one?
Is this a problem you could maybe solve by using Confluence’s job-related module descriptors instead? It looks like they went through a transition in 5.10 in preparation for the switch to atlassian-scheduler in 6.0, and I admit I don’t know much about them, but it might be another path forward.

Hope that explains things a bit better, and hope it helps!
crf