Redact PII on Confluence

I'm looking to develop a DLP solution that redacts PII in page content and quarantines attachments that contain PII.
I'm trying the servlet filter route, intercepting viewpage.action for content and upload.action for attachments. The filter would parse the body of the content and use regex to mask any PII.
Is this a good solution? Does anyone have better ideas?
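
A minimal sketch of the regex-masking step described above, in plain Java with no Confluence dependencies. The class name PiiMasker and both patterns (a US-style SSN and a simple email match) are assumptions for illustration only; they are not a complete PII detector, and real free-text detection would need considerably more than this.

```java
import java.util.regex.Pattern;

public class PiiMasker {
    // Hypothetical patterns, for illustration only.
    // US-style SSN such as 123-45-6789.
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    // Deliberately simple email match; not RFC-complete.
    private static final Pattern EMAIL =
            Pattern.compile("\\b[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+\\b");

    /** Returns the body with every match replaced by a fixed placeholder. */
    public static String mask(String body) {
        String masked = SSN.matcher(body).replaceAll("[REDACTED-SSN]");
        return EMAIL.matcher(masked).replaceAll("[REDACTED-EMAIL]");
    }

    /** True if any of the patterns matches somewhere in the body. */
    public static boolean containsPii(String body) {
        return SSN.matcher(body).find() || EMAIL.matcher(body).find();
    }
}
```

With these patterns, PiiMasker.mask("SSN 123-45-6789") returns "SSN [REDACTED-SSN]". A check like containsPii could also be what decides whether an attachment's extracted text gets it quarantined.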

For content you’ll also have to catch it in the /rest area. Don’t forget about comments as well. And if an add-on does any content ingestion, you’ll have to handle that too. There are a lot of entry points. :slight_smile: You might want to listen for an event instead and then go back into the content afterwards and edit it (the only way to be 100% certain of catching it).

As for upload.action: again, there are several ways of getting things in. On top of that, you don’t want to do the scanning on the request thread, since somebody could upload a 30M zip file, which would be a lot of work. Again, events come in here.

My suggestion would be to set up an active objects table which can behave as a queue. Then when new data comes in, add the content id to that table. Then in a scheduled job, process the table (bearing in mind that you might get backed up so you might only want to do X items each time).
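
A rough sketch of the queue-plus-scheduled-job idea above, in plain Java. An in-memory set stands in for the Active Objects table, and the names RedactionQueue, enqueue and drainBatch are hypothetical; a real plugin would persist the IDs in the table and call something like drainBatch from its scheduled job.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RedactionQueue {
    // Stand-in for the Active Objects table (assumption: in-memory only).
    // LinkedHashSet keeps arrival order and ignores duplicate content IDs.
    private final Set<Long> pending = new LinkedHashSet<>();

    /** Called from the event listener: remember that this content needs a scan. */
    public synchronized void enqueue(long contentId) {
        pending.add(contentId);
    }

    /** Called from the scheduled job: take at most maxItems IDs, oldest first. */
    public synchronized List<Long> drainBatch(int maxItems) {
        List<Long> batch = new ArrayList<>();
        Iterator<Long> it = pending.iterator();
        while (it.hasNext() && batch.size() < maxItems) {
            batch.add(it.next());
            it.remove();
        }
        return batch;
    }

    /** Number of content IDs still waiting to be processed. */
    public synchronized int size() {
        return pending.size();
    }
}
```

Capping each run at maxItems is what keeps a backlog from turning one job execution into a very long one; whatever is left simply waits for the next run.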

Even events are not 100% safe: you might decide to disable them in the future. And what about search? How are you going to handle content indexing, the popular index, etc., given that you are relying on the viewpage action?

I would follow a different approach, though it is a bit more complicated…
a. Override the save action and scramble the data before saving
b. Put a macro in main.vmd where you de-scramble the data if your condition matches
For the upload, do what @daniel wrote

Euh… just out of curiosity: how are you going to identify PII data in free text?

Thanks Daniel!
I was also checking out the events approach, but was thinking that instead of updating content in the backend, we could just mask the PII when rendering the response.
But after looking into what you said about the several entry points, yes, there could be too many to handle, with several misses.
Will ponder over your ideas. Appreciate the response! :slight_smile:

I don’t know yet. Still figuring it out. :slight_smile:

Thanks Panos!

You are right. As Daniel also said, there can be several entry points such as this which can be missed.

Can you please help with an example on this one?

Appreciate your help!

Ok. I thought about this a bit more. Do you mean encrypting the data in the backend and decrypting it when displaying it back to the user? I think this is not what we want. The need is to hide/redact any PII.
Please let me know if I understood you right.

So maybe I have to listen for the event, capture the content/attachment ID, and then run an asynchronous job which redacts the content in the backend (database) and quarantines or deletes the attachments.

@sameer.v I thought that redaction should also have an opposite action. Sorry, I misunderstood. If you run an async job to redact, you need a way to force indexing, like

confluenceIndexManager.getTaskQueue().enqueue(new IndexTaskFactory().createUpdateDocumentTask(page));

I still believe that redacting beforehand is most appropriate, so that you won’t have to face all those situations. Overriding is generally easy; if you decide to go that way, I could provide help if needed.

If the forced indexing is not done, what does that translate to from the application user’s perspective?

First you need to ask yourself: will Confluence update the index for the page if you do something like
fetch page -> get body -> update body -> save page through the API? Notice here that events are running in another process. (That might save you some “why am I getting exceptions” confusion.)

I am not 100% sure of the answer myself, which is why I suggest you verify what actually happens with indexing.

Let’s assume that content indexing works as expected and fires upon the “save page” action. Then the new, redacted body will be saved and indexed, and we are good from this perspective.

Now let’s assume it does not behave like this and the page is not indexed after the save. That would immediately mean that the content you redacted is still searchable, and the Lucene search highlighter will mark the content as if it were still there. Your redaction would be in vain, and the content would stay visible in search until the next full content re-index or a page edit through the web form.

Understood. Thanks for the explanation @Panos! Appreciate all your inputs! :slight_smile:
