Ahhh ok… now I understand. Here’s the problem… And forgive me, because it’s going to take a bit of a history lesson to explain where we’re at and why, which is in turn needed to really understand the problem with what you’ve done.
We (well… I…) first created the compat library to provide for that transition period when we had just started working on JIRA 6.3 and knew we would be targeting that release for introducing the JIRA Data Center offering. This was to be the first version of JIRA that had any kind of clustering support, and the historic direct use of the Quartz scheduler was an obvious problem. This is because Quartz can be configured to run local jobs using an in-memory store or configured to use a persistent store and do proper cluster-safe locking, but it doesn’t know how to do both at once, and we knew we needed to make both job types available.
There was also one alternative: SAL’s PluginScheduler API. There were some things that were nicer about this interface, at least on the surface, but its implementation was horribly broken in JIRA. It had historically created jobs that kind of persisted, except that the job’s payload did not. What a mess!
This is the main reason why we created the atlassian-scheduler API. It called out that some jobs were local, in-memory jobs and others were clustered, persisted jobs. It also abstracted away all of Quartz, which was a bonus both because 1) we actually needed to run 2 copies of it now to handle both job types, and 2) that allowed me to write the Caesium library that replaces it in JIRA 7 and Confluence 6 without breaking things. But we had 2 big problems:
- Confluence didn’t have the atlassian-scheduler API… or rather, it did, but it was a very old copy of it from before we added clustering support in it, and their own Confluence Data Center offering wasn’t in desperate need of these changes. There was already a clustered version of Confluence, and it made clustered jobs “work” by scheduling them as local but acquiring a lock. It wasn’t perfect (sometimes more than one node ran the jobs at the given time because it was fast enough that the lock had already been released when another node came to acquire it), but due to how they ran jobs, this just wasn’t as big of a deal for them.
- We knew that it was a serious problem for the JIRA ecosystem, in particular for many of the major plugins (both our own and those of third parties), that really needed jobs to be cluster-safe but also needed a way to continue to support older versions of JIRA.
So I built the compat library, and together with people from some of our own plugins and Confluence’s developers, we worked out exactly what magic potions were needed to make the compat library work in every version of JIRA or Confluence that was currently released.
But things didn’t stay that way. The atlassian-scheduler API was widely adopted by everybody except Confluence for a long time. Eventually, somebody decided to switch the Embedded Crowd library, which both JIRA and Confluence use for managing user directories and syncs, over to it. But this meant that when Confluence needed to update the Embedded Crowd library to a version that used it, they had to make the atlassian-scheduler API available to it. And this is where things go wrong, because the problem wasn’t thought through completely.
To make it work, these versions of Confluence that have the modern atlassian-scheduler API but predate 6.0 have a completely crippled version of it that, no matter what you ask, will always schedule the job to run locally. It won’t even do the old locking hack that was done in the earlier versions. This is because it was only ever intended to be used by Embedded Crowd. They didn’t really think about the fact that anybody else would ever see it and try to use it. It even has a disclaimer to that effect in the code:
<!-- Just enough atlassian-scheduler to make Crowd work. NOT CLUSTER SAFE! (Unless you do your own locking like Crowd) -->
<bean id="schedulerServiceForCrowdOnly" class="com.atlassian.scheduler.quartz1.Quartz1SchedulerService" autowire-candidate="false">
They tried very hard to keep it away from you and tell you not to use it. But all of this happened after the compat library was written, so it doesn’t know to watch out for it and avoid using this. It’s looking only for whether or not it is able to hunt down the atlassian-scheduler API classes (which it can) so it tries to use it. And it succeeds!
But the problem is that this is not really cluster-safe, just like it says. Those checks for the run mode that you are detecting were put there specifically to detect this kind of problem and warn you that something is wrong. Removing the checks doesn’t fix the problem; it just hides it.
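To make that concrete, here is a rough sketch of what a run mode check like that amounts to: ask for a clustered job, look at what the scheduler actually did, and complain loudly if the two differ. Everything in it (the `RunMode` enum, the `Scheduler` interface, the two fake schedulers) is a stand-in I made up for illustration and is not the compat library’s real code.

```java
public class RunModeCheck {
    /** Stand-in for the scheduler's run modes (hypothetical, for illustration only). */
    public enum RunMode { RUN_LOCALLY, RUN_ONCE_PER_CLUSTER }

    /** Hypothetical scheduler: returns the run mode it actually applied to the job. */
    public interface Scheduler {
        RunMode schedule(String jobId, RunMode requested);
    }

    /** Mimics the crippled pre-6.0 behaviour described above: everything runs locally. */
    public static final Scheduler CROWD_ONLY = (jobId, requested) -> RunMode.RUN_LOCALLY;

    /** Mimics a scheduler that honours whatever run mode was requested. */
    public static final Scheduler HONEST = (jobId, requested) -> requested;

    /** Schedules the job and fails if the scheduler silently changed the run mode. */
    public static void scheduleChecked(Scheduler scheduler, String jobId, RunMode requested) {
        RunMode actual = scheduler.schedule(jobId, requested);
        if (actual != requested) {
            throw new IllegalStateException(
                    "Job '" + jobId + "' requested " + requested + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        scheduleChecked(HONEST, "my-job", RunMode.RUN_ONCE_PER_CLUSTER);
        System.out.println("honest scheduler: ok");
        try {
            scheduleChecked(CROWD_ONLY, "my-job", RunMode.RUN_ONCE_PER_CLUSTER);
        } catch (IllegalStateException e) {
            System.out.println("crowd-only scheduler: " + e.getMessage());
        }
    }
}
```

If you delete the comparison, the second call “succeeds” too, but the job is still only a local one. That is the hiding I mean.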
If you really don’t care about cluster safety at all or you are happy to provide your own locking anyway, then it may make sense to just stick with SAL’s PluginScheduler until you are solidly on Confluence 6+ only. Alternatively, it really makes more sense to leave the run mode checks in place and instead reject the use of the atlassian-scheduler API in pre-6 versions where it is known to do the wrong thing.
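A minimal sketch of that second option, assuming you already have the Confluence version as a string from somewhere in your plugin: only allow the atlassian-scheduler path when the major version is 6 or later, and fall back to something else otherwise. The class and method names here are hypothetical, and treating an unparseable version as “not safe” is my own conservative choice.

```java
public class SchedulerChoice {
    /** First Confluence major version with a real atlassian-scheduler, per the discussion above. */
    static final int FIRST_SAFE_MAJOR = 6;

    /**
     * Returns true only when the given Confluence version is 6.x or later.
     * Anything that cannot be parsed is treated as unsafe (an assumption of this sketch).
     */
    public static boolean canUseAtlassianScheduler(String confluenceVersion) {
        if (confluenceVersion == null) {
            return false;
        }
        // Take everything before the first dot, e.g. "5" from "5.9.4".
        String majorPart = confluenceVersion.trim().split("\\.")[0];
        try {
            return Integer.parseInt(majorPart) >= FIRST_SAFE_MAJOR;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String version : new String[] {"5.7", "5.10.3", "6.0.1"}) {
            System.out.println(version + " -> " + canUseAtlassianScheduler(version));
        }
    }
}
```

The point of gating on the version rather than on whether the classes can be found is that, as described above, the classes are there in those pre-6 releases; they just don’t do what you asked.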
Main areas that you definitely want to QA:
- Does your job end up behaving correctly in Confluence Data Center in all major versions from 5.7 to 6.0?
- Have you properly accounted for the upgrade pathway, such that when the compat library switches from the SAL impl to the atlassian-scheduler impl, the job correctly gets changed to a clustered one?
- Is this a problem you could maybe solve by using Confluence’s job-related module descriptors instead? It looks like they went through a transition in 5.10 in preparation for the switch to atlassian-scheduler in 6.0, and I admit I don’t know much about them, but it might be another path forward.
Hope that explains things a bit better, and hope it helps!