How to handle Legacy SCM?
Intro
SCM systems (software configuration management or version control systems) are among the most important building blocks of software development. Managing the change history, keeping created versions immutable, and grouping related versions of individual objects into baselines or releases so that they can be reproduced: these capabilities are what make many other processes (e.g. build and release management) possible for development teams in the first place.
These systems are not replaced lightly, precisely because they are the basis for higher-level processes.
However, technological progress does not stand still, and newer tools are often necessary to support modern development methods at all.
Another factor is M&A events, which often leave a company with more than one SCM system in use. My personal record holder was a company where four different SCM systems were in active use on a single floor of the development team.
Whatever the trigger for a replacement or a planned phasing out of a tool from active use - the question of what to do with the legacy data immediately arises.
Migrate or archive? And what are the consequences of such a decision?
Besides the purely technical question of how to handle the data, psychological factors play an important role. Especially with old datasets (it is not uncommon for some data to have its origin 10-15 years in the past), it is very likely that the current team members are completely different from those at the beginning of a project. Even if they have official responsibility for the data, they lack the detailed knowledge of what is still needed from the dataset and what is not.
As a result of this uncertainty, projects and those responsible answer the question of what should be migrated or archived almost reflexively with "Everything!"
This results in two serious problems for the IT staff responsible for the SCM systems:
- Who will bear the high costs for a complete migration and archiving?
- What exactly is the definition of "everything"?!?
While a solution can usually be found for the first question, albeit after a lengthy discussion and/or waiting period, the devil is in the small and large details behind the second question.
Although the task is the same for each SCM system, there is a certain philosophy behind each tool (commercial or open source) that is reflected in the implementation. This results in the presence or absence of objects and commands; but it is also about fundamentally different approaches.
The problems of switching from Subversion to Git
Let's stay in the open source world. Subversion and Git have been the dominant systems in the past two decades. Both are easy to operate and use, but follow two completely different approaches.
Subversion, the older of the two tools, works with a central repository: every change a developer commits is transferred from the workstation to that repository and stored there. The price is that developers must be online to commit, and also to view the history.
Git, as a younger system, takes the opposite approach - each developer has a clone of the repository in their working environment. This brings enormous performance advantages and allows offline work, since the complete history is available locally once the repository has been cloned. However, if all the changes of a development team are to be merged in one repository, the developers must, in addition to the commit (which only records the changes in the local clone), transfer them to a (logical) central repository via a push command, or fetch the changes from the central repository back into their local clone. One commit, two meanings.
While in Subversion an elaborate dump-filter-load process is necessary to remove an accidentally committed file from a repository, in an (unprotected) Git repository I can remove whole parts of the history without leaving any digital traces.
So back to the initial question - what does "everything" mean in the example of replacing Subversion with Git?
Everything is not possible
First of all - a 100% migration is not possible. Subversion, unlike many other systems, has no classic labels; tags are simply part of the path to a directory tree in the repository and not metadata, so they do not map one-to-one onto Git tags.
The more important question, however, is how many revisions should be taken over. Since Git by default clones the entire repository, this results in a significantly higher demand for disk space.
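To illustrate that a Subversion tag is nothing more than a path, here is a minimal sketch that classifies repository paths. It assumes the conventional trunk/branches/tags layout, which is only a convention and may not match a given repository; the function name `classify_svn_path` is made up for this example.

```python
from typing import Optional, Tuple


def classify_svn_path(path: str) -> Tuple[str, Optional[str]]:
    """Classify a repository path under the (assumed) standard layout.

    Returns (kind, name), where kind is "trunk", "branch", "tag" or "other".
    """
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        return ("other", None)
    if parts[0] == "trunk":
        return ("trunk", None)
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        kind = "branch" if parts[0] == "branches" else "tag"
        # The tag or branch "name" is just the second path segment.
        return (kind, parts[1])
    return ("other", None)


print(classify_svn_path("/tags/release-1.0/src/main.c"))
```

Nothing in the path distinguishes a tag from an ordinary directory except its location, which is why a migration tool has to be told (or has to guess) where tags live.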
The simple rule of thumb for determining the disk space when taking over all revisions is:
Git disk space = (Disk space SVN repositories) * 0.95 * (Number of developers)
The first factor is an empirical value: Git repositories are usually about 5% smaller than the corresponding Subversion repositories. Since the second factor, the number of developers, cannot be reduced, the size of the Git repositories after a migration can only be brought down by not transferring the entire history.
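The rule of thumb can be written down as a small helper. This is only a sketch of the formula above; the function name, the unit (GB) and the example figures are assumptions chosen for illustration.

```python
def estimate_git_disk_space(svn_size_gb: float, developers: int,
                            git_ratio: float = 0.95) -> float:
    """Estimate the total disk space (GB) when every developer clones everything.

    git_ratio is the empirical factor from the text (Git is about 5% smaller).
    """
    if svn_size_gb < 0 or developers < 0:
        raise ValueError("size and number of developers must be non-negative")
    return svn_size_gb * git_ratio * developers


# Hypothetical example: 20 GB of Subversion repositories, 50 developers.
print(f"{estimate_git_disk_space(20, 50):.0f} GB")
```

With these made-up numbers, 20 GB on the Subversion server turns into roughly 950 GB spread across the developers' machines, which shows why the number of transferred revisions matters so much.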
There are three proven approaches to this:
- Only the last X revisions are taken over from Subversion.
- Only selected branches (or tags) are taken over.
- The two approaches above are combined.
How large X should be in the first approach depends on the type of development. While the revisions of the last 12 months are sufficient for a classic in-house development, an embedded development may need the last 5 years or more.
In the second approach, the question is which branches (or tags) still matter. In the future, do I need a bugfix branch (and thus every single change traceable) for a release that has long since been unsupported, or are the releases that actually went into production more interesting?
The third approach is the most efficient. Only a part of the history is taken over, and only branches or tags that are likely to be relevant in the future.
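As a rough sketch of how the approaches above could be expressed with `git svn clone`, the helper below only assembles the command line and does not run it. The options used (`-r`, `--trunk`, `--branches`, `--tags`) are documented `git svn` options; the URL, the revision number and the helper name are hypothetical, and this sketch only decides whether the branches or tags directories are taken over at all. A finer selection of individual branches is not covered here.

```python
from typing import List, Optional


def build_git_svn_clone(svn_url: str, target_dir: str,
                        first_revision: Optional[int] = None,
                        trunk: str = "trunk",
                        branches: Optional[str] = None,
                        tags: Optional[str] = None) -> List[str]:
    """Assemble a `git svn clone` command line (not executed here)."""
    cmd = ["git", "svn", "clone"]
    if first_revision is not None:
        # Approach 1: take over only revisions from first_revision onwards.
        cmd += ["-r", f"{first_revision}:HEAD"]
    cmd.append(f"--trunk={trunk}")
    # Approach 2: include the branches/tags directories only if wanted.
    if branches:
        cmd.append(f"--branches={branches}")
    if tags:
        cmd.append(f"--tags={tags}")
    cmd += [svn_url, target_dir]
    return cmd


# Approach 3 (combined): recent revisions only, trunk and tags, no branches.
print(" ".join(build_git_svn_clone(
    "https://svn.example.com/repo/project",  # hypothetical URL
    "project-git",
    first_revision=41000,                    # hypothetical revision
    tags="tags",
)))
```

How a limited revision range behaves for a particular repository layout should be checked on a test copy before the real migration.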
In some of our migrations, the initial size of the Git repositories shrank to 5-15% of the original size, even though the projects insisted that all revisions be transferred. Tests with repositories have shown that by concentrating on the last 100 revisions from Subversion, the size can be reduced to well below 1% of the original size.
Another factor, albeit usually to a lesser extent, is the type of data stored in Subversion. Particularly with long-term users and projects, binary data is often stored in Subversion, e.g. binaries, documents, images etc.
These can still be stored in Git. Our clear recommendation, however, is to ensure that these objects find a new home during a conversion from Subversion to Git - binaries belong in a binary management system, documents and images in a document management system. Why is that? Because these objects usually have a different lifecycle than source code. If I know how a binary has to be built, I can reproduce it exactly at any time. Subversion was never designed as binary storage and is at a disadvantage compared to a clean CI/CD environment in the context of a modern DevOps infrastructure. The same is true for documents.
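A first inventory of which files should move where can be sketched by file extension. The extension lists and the target names below are assumptions for illustration only; a real project would need its own lists.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Illustrative extension lists, not a complete or authoritative set.
BINARY_EXTENSIONS = {".exe", ".dll", ".so", ".jar", ".zip"}
DOCUMENT_EXTENSIONS = {".doc", ".docx", ".pdf", ".xls", ".png", ".jpg"}


def plan_new_homes(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group repository paths by the system they would move to."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in BINARY_EXTENSIONS:
            plan["binary management"].append(path)
        elif suffix in DOCUMENT_EXTENSIONS:
            plan["document management"].append(path)
        else:
            plan["git"].append(path)
    return dict(plan)


print(plan_new_homes(["src/main.c", "lib/tool.JAR", "doc/spec.pdf"]))
```

Such a list is only a starting point for the discussion with the project; the decision about what is really a build result or a document remains with the people responsible for the data.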
The next part of the series is about the migration process, again using the example of Subversion to Git.