What are orphaned repos and why you should care
TLDR: If no one knows the code, how will you fix or redeploy it? Knowledge of the code is important for maintaining, operating, and making bug or security fixes. On the flip side, removing unused repos can improve focus by cutting clutter.
The software development classic The Mythical Man-Month by Fred Brooks discusses the disproportionate amount of time developers spend thinking compared to coding. Some say that source code is only 30% of the development process, with thinking, discussions, and discovery being the other 70%. So even if we've got the code, we're missing the thinking, which is still in the developer's head. Even documentation might not do justice to the days or weeks of thinking and discussions.
Some of the "why we did this?" and "that failed so we did it this way" will be missing when the next person comes along.
I am pretty sure that if you ask most developers, they will tell you that reading someone else's code, and changing it, with little context is hard.
It seems like companies are experiencing 15-25% churn in their developers, either through voluntary "change of scenery" or "resizing," which can cause working knowledge gaps and orphaned repos.
What is an orphaned repo?
An orphaned repo (Git code repository) is one where the developers who wrote the code don't work here anymore. They haven't moved to another team internally; they are all gone.
If the repo is old and unused, it may not be an issue, and we can delete or archive it.
However, there can still be challenges if:
Why is this important? My other developers know language X used in this repo
Context is hard to transfer. Just like picking up a recipe you've never cooked before, it's going to take more time even if you can read the instructions.
The knowledge domain could be different. Just because we can read and write in a language does not mean we can write a legal contract in that language (unless we have that expertise or experience).
A company I know that takes on maintenance of applications for other companies said that, from a cold start, it often takes a developer around six weeks to become comfortable enough with the code to maintain and improve it.
When something is broken and you need to patch a security vulnerability or deploy again, it's a lot easier if you've done it before.
How to identify orphaned repos
There are two methods of identifying orphaned repos, with varying degrees of accuracy. Both methods require looking at the commits or changes to a repository, but the source of "who works here" is different. The approach is simple: find all the developers (typically email addresses) in a code repository, then see how many appear in our "active" developer or employee list.
#1 Use a staff directory or HR Employee extract and cross reference developer commit history in the repo (Most accurate method)
Comparing the developers seen in a repo to an extract from HR or a staff directory will give an accurate picture of whether they still work for your company.
The crux of this method is having the extract, or the ability to query a staff directory, then looking at the developers in each repository and checking if they are in the extract.
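As a rough illustration of method #1, here is a minimal Python sketch. It assumes the HR extract is a CSV file with an `email` column and that commit author emails match the directory emails, which won't always hold in practice (personal emails and aliases are common). The function names are mine, made up for this example.

```python
import csv
import subprocess


def load_active_emails(csv_path, column="email"):
    """Read the HR or staff directory extract (assumed: a CSV with an email column)."""
    with open(csv_path, newline="") as handle:
        return {row[column].strip().lower() for row in csv.DictReader(handle)}


def repo_author_emails(repo_path):
    """Return the set of lower-cased author emails seen in a repo's commit history."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip().lower() for line in out.splitlines() if line.strip()}


def classify_repo(author_emails, active_emails):
    """Count how many of a repo's authors are still in the active list."""
    active = {email.lower() for email in active_emails}
    authors = {email.lower() for email in author_emails}
    still_here = authors & active
    return {
        "authors": len(authors),
        "still_here": len(still_here),
        # Orphaned: the repo has authors, and none of them are still here.
        "orphaned": bool(authors) and not still_here,
    }
```

To use it, call `classify_repo(repo_author_emails(path), load_active_emails("hr_extract.csv"))` for each repository you have cloned locally; the file name here is just a placeholder.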
#2 Use version control history and see if developers in a repo are still committing in other repositories (less accurate)
Some places define an active developer as someone with activity in the last, say, 30 or 90 days. Using this method, we can stay within Git to check if users in a repo have been seen in other repos and consider them "gone" if they haven't been active in X days.
The key challenge with this method is that we're "guessing" someone has left if they haven't committed, and it takes a while to consider them gone. So, the feedback loop is much better if you can query an HR extract or look up a directory.
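Method #2 can be sketched in a few lines of Python. This version assumes you have already gathered commits from all your repos as `(repo, email, timestamp)` tuples (for example from `git log`); the function name and the 90-day default are illustrative choices, not a standard.

```python
from datetime import datetime, timedelta


def orphaned_by_activity(commits, now, active_days=90):
    """Guess orphaned repos from commit activity alone.

    commits: iterable of (repo_name, author_email, commit_datetime).
    A developer counts as "active" if they committed to ANY repo
    within the last `active_days` days.
    """
    last_seen = {}      # email -> most recent commit time across all repos
    repo_authors = {}   # repo -> set of author emails
    for repo, email, when in commits:
        email = email.lower()
        if email not in last_seen or when > last_seen[email]:
            last_seen[email] = when
        repo_authors.setdefault(repo, set()).add(email)

    cutoff = now - timedelta(days=active_days)
    active = {email for email, seen in last_seen.items() if seen >= cutoff}

    # A repo is flagged when none of its authors are still active anywhere.
    return sorted(
        repo for repo, authors in repo_authors.items() if not (authors & active)
    )
```

Note that this only ever sees people who commit, so someone on long leave or in a non-coding role will look "gone" until they commit again.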
Can we do Predictive Maintenance? (AKA, can we avoid a single point of failure?)
Sure can. Basically, if only one person (who is still here) has worked on the code, then you have a single point of failure risk. Consider setting a threshold of two or three people who have worked on a repo if it's important enough to keep using.
I've seen people purposefully making a new person make changes and deploy (under guidance) to build experience in the process and reduce reliance on other people.
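The threshold idea can be expressed as a small check. This is a sketch under the same assumptions as before: `repo_authors` maps each repo to the emails seen in its history, and `active_emails` is whatever "still here" list you trust. The status labels are my own.

```python
def key_person_risk(repo_authors, active_emails, threshold=2):
    """Label each repo by how many of its authors are still around.

    Returns {repo: (status, remaining_authors)} where status is
    "orphaned" (nobody left), "at risk" (fewer than `threshold` left),
    or "ok".
    """
    active = {email.lower() for email in active_emails}
    report = {}
    for repo, authors in repo_authors.items():
        remaining = {email.lower() for email in authors} & active
        if not remaining:
            status = "orphaned"
        elif len(remaining) < threshold:
            status = "at risk"
        else:
            status = "ok"
        report[repo] = (status, sorted(remaining))
    return report
```

Repos that come back as "at risk" are good candidates for the guided "new person makes a change and deploys" exercise.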
Are all developers the same? What about cross functional teams and different components?
So, while there is talk of "full stack" developers, from what I've seen there's still a lot of "Person X is the front-end developer, and Person Y is the database person." Depending on the complexity, and similar to key person risk, you may want to break down the types of code into skill sets or capabilities.
For example, you may have people who know the front-end code but have zero working knowledge of the database.
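One rough way to break this down is by file type. The sketch below assumes you can list `(author_email, file_path)` pairs from the commit history, and it uses a made-up extension-to-skill mapping that you would need to adjust for your own stack.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a skill area; adjust to your stack.
SKILL_BY_EXTENSION = {
    ".js": "front end",
    ".ts": "front end",
    ".css": "front end",
    ".sql": "database",
    ".py": "back end",
}


def skill_coverage(changes, active_emails):
    """Map each skill area touched in a repo to the active people who changed it.

    changes: iterable of (author_email, file_path) pairs from commit history.
    A skill area with an empty set has nobody left who has worked on it.
    """
    active = {email.lower() for email in active_emails}
    coverage = {}
    for email, path in changes:
        suffix = PurePosixPath(path).suffix.lower()
        skill = SKILL_BY_EXTENSION.get(suffix, "other")
        people = coverage.setdefault(skill, set())
        if email.lower() in active:
            people.add(email.lower())
    return coverage
```

A repo can look healthy overall while one skill area, such as the database code, has an empty set against it.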
Argh, it looks like there are secrets in this orphaned repo?!?
Secrets seem to be a challenge for most organisations. Everyone needs them, and where do you put them (which should NOT be in code, but it happens)? Secrets are a whole other topic, but we have a few interesting cases in orphaned repos.
Just say you find secrets; there are a few things to confirm:
Do we archive or delete?
The answer to this question is quite subjective and depends on the company's archiving and destruction processes.
One company I've worked with I thought had a good approach:
They were a big company with thousands of developers, so this was an ongoing process that needed to be run regularly. Even in smaller, longer-running teams, these rules probably still apply.
Improve focus when there are fewer things to think about
From the security side, when you start working with an organisation, their developers and their code, there can be a lot to take in. Often with hundreds to thousands of repos, you spend time asking what's being used in production and what's old, unused code.
There are at least three parts to this focus:
Anecdotally, from discussions with development managers and application security and assurance people, you want to know what's being used and remove what's not needed. Less clutter means less wasted time.
Keep in mind that developers change all the time; the current vibe is 20-30% yearly churn. A new developer coming into an environment has to understand what's important, so I think removing old, unused orphaned repos helps. From an application security or assurance point of view, remove the noise to focus on the important stuff.
Future investigation and next steps
So far, we haven't discussed open source libraries and code in the software supply chain. There are simple ways to check library versions to see if they look unmaintained or orphaned.
For open source, I think it's safe to call something unmaintained and probably orphaned if a library or repo you use hasn't been updated in 12 months. I'd be worried without seeing activity after three months.
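Those two rules of thumb are easy to encode. The sketch below treats three months as 90 days and 12 months as 365 days, which is my approximation, and it assumes you already know the date of the library's last release or commit.

```python
from datetime import date


def maintenance_status(last_updated, today, worry_days=90, unmaintained_days=365):
    """Classify a library or repo by the age of its last update.

    Thresholds follow the rules of thumb above: about three months
    without activity is worrying, about 12 months looks unmaintained.
    """
    age_days = (today - last_updated).days
    if age_days >= unmaintained_days:
        return "unmaintained"
    if age_days >= worry_days:
        return "worrying"
    return "active"
```

For example, `maintenance_status(date(2023, 1, 1), date(2024, 6, 1))` falls in the "unmaintained" bucket because the last update is well over a year old.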
We've talked about orphans and hinted at maintenance status. Orphans refer to the lack of people who know the code, but aging or maintenance status is a different indicator worth considering.
I'd recommend looking at what hasn't been maintained recently but is still being used as one way of identifying hotspots.
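A minimal sketch of that hotspot check follows. The hard part is knowing what is "still being used"; here I simply assume you can supply a set of repo names that are deployed or depended on, however you source it. The 365-day default is an illustrative threshold.

```python
from datetime import date


def stale_but_in_use(last_commit_by_repo, repos_in_use, today, stale_days=365):
    """List repos that are still in use but have not been changed recently.

    last_commit_by_repo: {repo_name: date of last commit}
    repos_in_use: set of repo names known to be deployed or depended on
    Returns (repo, age_in_days) pairs, oldest first.
    """
    hotspots = []
    for repo, last_commit in last_commit_by_repo.items():
        age_days = (today - last_commit).days
        if repo in repos_in_use and age_days >= stale_days:
            hotspots.append((repo, age_days))
    return sorted(hotspots, key=lambda item: item[1], reverse=True)
```

Combining this list with the orphan check gives a short list of repos that are old, in use, and have nobody left who knows them.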
So, asking for a friend, what are you doing about orphaned repos?
For those interested, there's a more living version of this article:
Kospex is an MVP side project of mine that uses some analytic functions around Git to unearth orphans, maintenance risks, and hotspots.
#knowyourcode #knowyourdeveloper #gitanalytics #predictivemaintenance
Senior Architect - CSM
7 months ago: Great article Peter Freiberg
A terrific article. Thanks for the share Peter Freiberg
CISO | Startup Advisor | Investor | Career Mentor
7 months ago: This is definitely a concern. I have talked to a few people about this recently. Unowned, unmaintained code in critical business flows. Usually identified as vulns that are past remediation SLAs rather than proactively.