Getting rid of sensitive data from a GitLab repo
Sometimes you find something in your Git repository’s history that should not be there, for example when you started development with a small closed group but now want to invite a wider group to contribute. A simple option is to create a new repository and issue tracker for the ‘new’ development effort, but why shouldn’t developers continue to have access to the whole project history, just with the historical sensitive data redacted?
Rewriting history changes the hash of every rewritten commit and of every commit after it. Provided that you control all instances of the repository (e.g. an organisation using Git internally), it is possible to recover using techniques described in a number of tutorials on the Internet. But what if you are using GitLab, so that your commit hashes are bound up in non-repository data like the issue tracker? This article describes how to fix this kind of problem.
TL;DR:
As a more expansive overview, what we're going to do is filter-branch the Git repo that lives inside GitLab, and then we're going to use the Rails console (GitLab being a 'Ruby on Rails' app) to modify the project data held outside the Git repo (e.g. notes on issues). The sensitive data in this case study is a list of transaction numbers used as sample data in a program. Obviously best practice is to keep real data separate from code so it never goes into version control, but we are talking here about how to atone for our sins and move forward.
By way of preliminaries:
- GitLab runs as the user 'git'. I don't like running a shell as someone else (it deprives me of my .bash_history and feels risky) so you will see a lot of 'sudo sudo -u git' below: the outer sudo makes me root, and root then uses the inner 'sudo -u git' to run the command as the git user. No, that is not a typo. Also, the working directory for the commands below should be somewhere writable by both you and the git user, in this case /tmp/pat/.
- Every call to git has --git-dir=/path/to/repo.git because I didn't want to change working directory to that path, in case I made a boo-boo and spat files out all over the production repo. (It's not required for this job, but some interesting trivia is that there is a corresponding argument --work-tree=/path/to/repo-work-tree/.)
- This document uses the terms 'I', 'you' and 'we' interchangeably, but only someone with admin access to the GitLab server itself can run this fix.
- Hopefully any terminology used in this article is standard Git or Rails jargon, or otherwise discoverable via quick Google search. If not, I’m sorry and please let me know so I can make it clearer for everybody.
Git component
First we identify all the transaction numbers in the repo, including historic commits (e.g. you reverted the offending commit in order to mitigate the problem, but the offending commit is still there in the commit history).
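A minimal sketch of such a search, assuming GNU grep and a reasonably recent Git (this is an illustration, not necessarily the exact command used; the function name find_candidates is hypothetical). It greps every commit reachable from any ref and ends with a plain grep, which highlights the digit runs when run in a terminal:

```shell
# Sketch only: list lines anywhere in history that contain a 6-9 digit run.
# find_candidates is a hypothetical helper name; $1 is the path to the
# repository's git directory (e.g. /path/to/repo.git).
find_candidates() {
    repo=$1
    # rev-list --all walks every commit reachable from any ref, so commits
    # that were later reverted are searched too. xargs splits a long list
    # of commits across several git grep invocations if necessary.
    git --git-dir="$repo" rev-list --all |
        xargs git --git-dir="$repo" grep -h -I -E '[0-9]{6,9}' |
        sort -u |
        # Final grep: on a terminal, --color highlights each candidate.
        grep --color=auto -E '[0-9]{6,9}'
}
```

Called as find_candidates /path/to/repo.git, this prints each distinct matching line once, which keeps the manual review short.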
In my terminal the final grep call helpfully highlights all the maybe-transaction numbers, i.e. the 6-9-digit numbers. I manually copy-paste out the lines which look like they are transaction numbers as opposed to, say, YYYYMMDD dates or very large numbers. Then I paste the resulting file as input into the following command:
‘txn-regexes.sed’ converts each transaction number into a redaction of the form ‘XXTXNnnnX’ where ‘nnn’ is a distinct number, which allows us to selectively ‘undo’ the redactions later if necessary.
In my case txn-regexes.sed had 20 lines; that is, 20 distinct transaction numbers were in the repo.
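As a hedged sketch of how a file like txn-regexes.sed could be generated (the input file name txns.txt and the function name make_txn_regexes are assumptions, not taken from the original), awk can number each transaction as it goes:

```shell
# Sketch only: turn a list of transaction numbers (one per line) into a sed
# script of substitutions. make_txn_regexes is a hypothetical name; $1 is
# the list file (assumed here to be called something like txns.txt).
make_txn_regexes() {
    # NR is the input line number, giving each transaction a distinct
    # zero-padded 'nnn' so a redaction can be traced back later.
    awk '/^[0-9]+$/ { printf "s/%s/XXTXN%03dX/g\n", $0, NR }' "$1"
}
```

For example, make_txn_regexes txns.txt > txn-regexes.sed writes one substitution per transaction number. Because sed applies the substitutions in script order, a number that is a prefix of another should be listed after the longer one.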
In order to do before-and-after checking, we need to print a list of commits prior to changing anything. The commits seem to come out in a deterministic order but I don't know why, so I check later on.
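A minimal sketch of a 'no-op' filter-branch run that records every commit ID, assuming a bare repository as GitLab uses (the function name list_commits and the use of tee are my assumptions, not necessarily the original invocation):

```shell
# Sketch only: a 'no-op' filter-branch run that records each commit ID.
# list_commits is a hypothetical name; $1 is the git directory and $2 the
# output file (an absolute path, e.g. /tmp/pat/commits-old.txt), because
# the filter runs inside filter-branch's temporary directory.
list_commits() {
    repo=$1 out=$2
    : > "$out"    # start from an empty list
    # commit-tree re-creates each commit unchanged and prints its ID; tee
    # passes that ID back to filter-branch while appending it to the list.
    FILTER_BRANCH_SQUELCH_WARNING=1 git --git-dir="$repo" filter-branch -f \
        --commit-filter "git commit-tree \"\$@\" | tee -a '$out'"
}
```

Run it from a scratch directory such as /tmp/pat/, since filter-branch creates its temporary .git-rewrite directory in the current working directory.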
That took a long time to write, but serves as a good introduction to filter-branch. The name 'filter-branch' refers to, conceptually, taking an entire branch (not just a single commit) and filtering it: branch in, some processing is done on it, modified branch out. (Note that Git itself ships no 'filter-repo' command, although a separate third-party tool of that name exists; conceptually there is no such thing as a repo, only a set of refs which lead to branches.) By default, filter-branch rewrites whatever HEAD points to, which here is the 'master' branch.
What filter-branch does is check out each commit in the branch, one after another, into a temporary working tree. It then runs the command specified with the --tree-filter argument (not present above, but see below) and then commits the resulting tree with the command specified with the --commit-filter argument. The above command is a 'no-op' but it prints out all the commit IDs in order – or at least I think they're in order; at least in practice they seem to be in the same order as other invocations of filter-branch, which is all that matters for present purposes.
Run wc -l commits-old.txt and verify that the right number of commits are accounted for.
Next is the fix itself:
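A hedged sketch of what such an invocation can look like, assuming GNU sed (for -i), a bare repository and absolute paths for the sed script and the output list (the function name redact_history is hypothetical and the details may differ from the original command):

```shell
# Sketch only: rewrite every commit, applying the sed script to each file.
# redact_history is a hypothetical name; $1 is the git directory, $2 the sed
# script and $3 the output list; $2 and $3 must be absolute paths.
# Assumes GNU sed (for -i).
redact_history() {
    repo=$1 script=$2 out=$3
    : > "$out"    # start from an empty list
    # --tree-filter runs in the temporary checkout of each commit.
    # --commit-filter commits the edited tree and records the new commit ID.
    FILTER_BRANCH_SQUELCH_WARNING=1 git --git-dir="$repo" filter-branch -f \
        --tree-filter "find . -type f -exec sed -i -f '$script' {} +" \
        --commit-filter "git commit-tree \"\$@\" | tee -a '$out'"
}
```

For instance, redact_history /path/to/repo.git /tmp/pat/txn-regexes.sed /tmp/pat/commits-new.txt, run from /tmp/pat/.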
That took a very long time to write, but when compared to the previous invocation of filter-branch it is easy to see what it does. The --tree-filter flag specifies a command to run in the temporary working tree for each commit. That command is find, which calls sed for every file in the tree, seeking and destroying transaction numbers. The --commit-filter command is the same as above, but prints the list of commits to a different file. It may be possible to incorporate both the 'before' list and the 'after' list into a single filter-branch invocation, but I didn't.
At this stage we have a new 'master' branch, free of transaction numbers, but the old commits are still in the repo, and still accessible from the GitLab web UI if you know the commit hash.
I mentioned the order of commits earlier. In order to confirm that commits-old.txt and commits-new.txt are in the same order, I use the following commands:
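One way to do this, sketched under the assumption that GNU coreutils (paste, shuf) are available and with sample_pairs as a hypothetical helper name, is to glue the two lists together line by line and draw a random sample:

```shell
# Sketch only: pair the n-th old hash with the n-th new hash and print a
# random sample. sample_pairs is a hypothetical name; shuf is GNU coreutils.
sample_pairs() {
    old=$1 new=$2 n=${3:-5}
    # The two lists must be the same length for the pairing to mean anything.
    [ $(wc -l < "$old") -eq $(wc -l < "$new") ] || return 1
    paste "$old" "$new" | shuf -n "$n"
}
```

Calling sample_pairs commits-old.txt commits-new.txt prints five tab-separated pairs, old hash first.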
This prints out a list of five random before-and-afters. I then call something like git log -n 1 <commit> (on the before and the after respectively) to see the log entry and verify that the before and the after refer to the same thing (albeit one is in the redacted tree).
So it's gone, right? Not yet.
Even though you have a 'fixed' branch in your repo, the old branch is still there. Even if you delete any refs pointing to the old branch, the old commits are still there in the .git directory; hard to find, but there.
Garbage collection requires two steps:
(1) Delete refs (tags included) that point to old commits.
(1A) Update any branches in the repo to the corresponding 'new' commits.
(2) Run garbage collection, which will delete any data not reachable from a ref.
As for step 1, GitLab creates refs of the form refs/keep-around/* and refs/merge-requests/*. There is also the refs/original/refs/heads/master ref, created by filter-branch as a backup of the old branch.
Something like--
--should do it, but for some reason it never got rid of all of them. What did work was writing the list of refs to a file and then feeding that file to git update-ref.
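A minimal sketch of the file-based approach, assuming a Git new enough to support update-ref --stdin (the function name delete_old_refs and the scratch file are hypothetical):

```shell
# Sketch only: delete GitLab's internal refs and filter-branch's backup refs.
# delete_old_refs is a hypothetical name; $1 is the git directory and $2 a
# scratch file that will hold the list of refs.
delete_old_refs() {
    repo=$1 list=$2
    # Write one 'delete <ref>' instruction per matching ref.
    git --git-dir="$repo" for-each-ref --format='delete %(refname)' \
        refs/keep-around refs/merge-requests refs/original > "$list"
    # The list can be reviewed before it is applied in one go.
    git --git-dir="$repo" update-ref --stdin < "$list"
}
```

Keeping the list in a file also leaves a record of exactly which refs were removed.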
I also managed to create a file called '-d' in the repo's .git directory which could not be referred to by normal means for purposes of deletion (rm -d means something else), so I did the following:
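The usual trick for a file name that looks like an option, sketched here with a hypothetical helper name (this may not be exactly what I ran at the time), is to end option parsing with '--':

```shell
# Sketch only: remove a file whose name looks like an option.
# remove_dash_file is a hypothetical name; $1 is the directory holding it.
remove_dash_file() {
    # '--' ends option parsing, so '-d' is treated as a file name.
    # The subshell keeps the cd from affecting the caller.
    ( cd "$1" && rm -- -d )
}
```

Giving rm an explicit path such as ./-d works for the same reason: the argument no longer starts with a dash.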
Now run garbage collection:
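A hedged sketch of the garbage collection step (the function name collect_garbage is hypothetical, and the exact flags used originally may have differed): expire the reflogs so they no longer hold on to old commits, then prune unreachable objects immediately instead of after the default grace period.

```shell
# Sketch only: drop reflog entries, then delete unreachable objects now
# rather than after the default grace period.
# collect_garbage is a hypothetical name; $1 is the git directory.
collect_garbage() {
    repo=$1
    git --git-dir="$repo" reflog expire --expire=now --all &&
        git --git-dir="$repo" gc --quiet --prune=now
}
```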
This is where you have either succeeded or failed. Often this is where you loop back to earlier because you've stuffed up an earlier step. What you do is git grep the repo to see if any transaction numbers remain:
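A minimal sketch of that check, assuming the known transaction numbers are kept one per line in a file (the function name txns_remain and the file are hypothetical):

```shell
# Sketch only: search every reachable commit for the known transaction
# numbers. txns_remain is a hypothetical name; $1 is the git directory and
# $2 a file with one transaction number per line (use an absolute path).
# Exit status is 0 if any number is still present, non-zero otherwise.
txns_remain() {
    repo=$1 txns=$2
    # -F treats each line of the file as a fixed string; -l lists hits only.
    hits=$(git --git-dir="$repo" rev-list --all |
        xargs git --git-dir="$repo" grep -l -I -F -f "$txns") || true
    # Each hit is printed as <commit>:<path>.
    [ -n "$hits" ] && printf '%s\n' "$hits"
}
```

A clean result prints nothing, so it is worth also running the check once against a number you know is still present, to confirm the check itself works.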
At this point, unless we forgot to delete a ref earlier (or similar), the old commits are no longer visible from the GitLab web UI.
You will also notice that if someone has commented on a commit in GitLab, and you look up the 'new' commit in the GitLab web UI, the comment is... gone. Whoops. Time to fix that.
Rails component
If you want to learn Ruby on Rails then I recommend chapters 1 and 2 of Michael Hartl's Ruby on Rails tutorial. I didn't need to dig around in the database and write SQL because Rails exposes a very handy object-relational mapping interface called 'ActiveRecord'. You can use this to do search-and-replace in the models.
First I create a file called txn-fix-parta.rb which looks like this:
This subroutine fixes references to commits in issues and comments on commits. I then create a script (txn-fix-partb.rb) which calls that subroutine once for each commit:
txn-fix-partb.rb looks something like:
but with lots of lines like that. (See, if you learn sed then you never need to learn how to write loops in any language other than Bash.)
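As a hedged sketch of how those lines can be generated with sed from the two commit lists (the function name make_partb is hypothetical, and the old-then-new argument order is an assumption about how fix_commit() is defined):

```shell
# Sketch only: emit one fix_commit(...) call per old/new pair.
# make_partb is a hypothetical name; $1 is the list of old hashes and $2 the
# list of new hashes. The argument order (old, new) is an assumption.
make_partb() {
    paste -d' ' "$1" "$2" |
        # -n with the p flag prints only lines that are two full 40-char hashes.
        sed -n -E "s/^([0-9a-f]{40}) ([0-9a-f]{40})\$/fix_commit('\\1', '\\2')/p"
}
```

For example, make_partb commits-old.txt commits-new.txt prints the calls, which can then be appended to the script that loads the subroutine.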
Now that looked way too easy so I need to give you a bit of a sense of how I worked out what needed to be in fix_commit().
The whole point of Rails is to let web developers specify their data structure ('model') and its presentation ('view'), with everything filled in with defaults but with the ability to customise the application's behaviour ('controller'). The things in brackets in the preceding sentence spell out 'MVC' which has its own Wikipedia page. For present purposes we will look at the web UI to see things but really only need to touch the model. The word 'model', in Rails-speak, refers to a table in a database or what I would normally call an 'entity'. It has attributes and relationships with other entities.
Suppose that, in the web UI, you see a comment on an issue that says 'I thought the widget was implemented in commit abcdef but actually it was a stub.' You somehow work out that it's the 'Issue' model that you want to interrogate, maybe by looking at the GitLab source code or by exploring via the Rails console (e.g. ActiveRecord::Base.descendants.map {|f| puts f} prints a list of models). You can go into the Rails console and find the record you are interested in like so:
You can test modifying the model like so:
and observe that it appears in the web UI even without hitting the ‘refresh’ button in your web browser.
Anyway back to the fix: When you are ready, you run the Rails component of the fix like so:
Now, joy be had, everything is back where it should be:
- references to commits (e.g. hyperlinks to specific versions of files) in issue bodies or comments are updated to refer to 'new' commit hashes;
- comments on commits have re-appeared on the 'new' commits;
- the transaction numbers are nowhere to be found.
In case anyone has hyperlinks to specific commits elsewhere, or you have to fix the data in any other system which ‘hangs off’ the repository, you can produce a table of 'before' and 'after' URLs:
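A minimal sketch of producing such a table from the two commit lists (the function name url_table is hypothetical, and the URL prefix is an assumption that should be adjusted to how your GitLab instance lays out commit URLs):

```shell
# Sketch only: print tab-separated 'before' and 'after' commit URLs.
# url_table is a hypothetical name; $1 and $2 are the old and new hash lists
# and $3 is the project's commit URL prefix, e.g.
# https://gitlab.example.com/group/project/commit (an assumed layout).
url_table() {
    paste "$1" "$2" |
        awk -v base="$3" -v OFS='\t' '{ print base "/" $1, base "/" $2 }'
}
```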
Things that are not fixed (but could be fixed):
- references to commits in blobs i.e. in the files themselves (would require a running record of before-and-after commit hashes, which is probably doable because the commits-new.txt file is built incrementally in order as the fix runs);
- references to commits in commit messages (as above, but goes in the --commit-filter command instead of the --tree-filter command);
- references to commit hash prefixes as opposed to full 40-character hashes (modify the various parts to search for either a string or a 6-or-more-character prefix of the string);
- if sensitive data was committed on a branch which does not now form part of 'master', I don't think the above filter-branch invocations will fix it (note that the git grep calls above would detect this; if necessary the filter-branch calls can be repeated per branch, or given '-- --all' to rewrite every ref).
Finally, filter-branch can be slow. If the source tree does or ever did contain large files, or if there is a long commit history, filter-branch will take ages. It can be sped up by putting the temporary work tree on a ramdisk/tmpfs (filter-branch's -d option sets its location), but if it is a big repo (in any dimension), to the point where filter-branch cannot complete within an acceptable downtime window, then you may need to consider more specialised tools, such as git filter-repo or the BFG Repo-Cleaner, which operate directly on Git's object data without checking out each commit in turn.
Conclusion
If you have something in your GitLab repo that shouldn't be there, you need not nuke the site from orbit. A sed scalpel will let you keep your development history, with your sensitive data neatly redacted.