Top 20 Software Dev Skills Over Time
Also posted on my personal site.
This is V2 of my last blog post. Last time, job postings were collected every few weeks in real time from remoteok.io. This time, the Wayback Machine was used to collect the postings instead. The result:
Legend
- # Posts: Total number of posts on remoteok.io saved at archive.org for that month
- Total # NEs: The sum of counts of the top 20 Named Entities (skills)
Pipeline
- Fetch the posts from the Wayback Machine (see the first sketch after this list)
- Process the posts, generating the top 50 named entities for each month (second sketch below)
- Manually reduce the top 50 named entities to the top 20 skills for each month
- Run a script in Blender (2.83 LTS) on the top 20 skills to create the animation (third sketch below)
- Render the animation in Blender to a video file
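As a rough illustration of the fetch step, here is a minimal sketch using the Wayback Machine's CDX API. The remoteok.io URL is from this project; the helper names, year range, and `requests` dependency are illustrative, and the actual script may differ.

```python
# Minimal sketch: list archived captures of remoteok.io via the Wayback
# Machine's CDX API, then fetch each snapshot's HTML. Helper names and the
# `requests` dependency are illustrative; the actual script may differ.
import requests

CDX_URL = "http://web.archive.org/cdx/search/cdx"

def list_snapshots(url="remoteok.io", year=2017):
    params = {
        "url": url,
        "output": "json",
        "from": f"{year}0101",
        "to": f"{year}1231",
        "filter": "statuscode:200",
        "collapse": "digest",  # skip byte-identical captures
    }
    rows = requests.get(CDX_URL, params=params).json()
    if not rows:
        return []
    header, entries = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in entries]

def fetch_snapshot(entry):
    # Archived pages live at /web/<timestamp>/<original-url>.
    url = f"http://web.archive.org/web/{entry['timestamp']}/{entry['original']}"
    return requests.get(url).text

for entry in list_snapshots()[:3]:
    print(entry["timestamp"], len(fetch_snapshot(entry)))
```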
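The NER step could look something like the following sketch with spaCy, assuming the posts have already been reduced to plain text; the original script may use a different NER library or model.

```python
# Minimal sketch of the NER step: count named entities across a month's
# posts and keep the 50 most common. Requires the small English model
# (python -m spacy download en_core_web_sm); the original may differ.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def top_entities(texts, n=50):
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.text.strip() for ent in doc.ents)
    return counts.most_common(n)

# e.g. top_entities(posts_by_month["2017-05"])  ->  [("JavaScript", 41), ...]
# (posts_by_month is a hypothetical month -> list-of-texts mapping)
```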
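Finally, the Blender step boils down to keyframing object transforms from the skill counts and rendering the frame range. A minimal sketch; the object names, scaling, and frame spacing here are illustrative, not the original script's.

```python
# Minimal sketch of the Blender (2.83 LTS) step: set a bar's height per
# month and keyframe it, then render the scene's configured frame range.
# Object names, scaling, and frame spacing are illustrative.
import bpy

FRAMES_PER_MONTH = 24

def animate_skill(obj_name, monthly_counts):
    obj = bpy.data.objects[obj_name]
    for i, count in enumerate(monthly_counts):
        obj.scale.z = count / 10.0  # bar height derived from the count
        obj.keyframe_insert(data_path="scale", frame=1 + i * FRAMES_PER_MONTH)

animate_skill("Bar_JavaScript", [12, 15, 9, 21])

# Render the animation using the scene's configured output format and path.
bpy.ops.render.render(animation=True)
```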
Lessons Learned
Find Existing Sources of Data
The most obvious lesson: instead of sampling over a long period of time, seek out existing sources of data. Here the objective is to discover the most in-demand software development skills, and many preexisting sources are available for that kind of information.
Elbow Grease Makes a Difference
A different approach was taken compared to last time. This time the top 50 named entities were generated, then reduced by hand to the top 20 skills for each month.
- Instead of ignoring a named entity if a substring matches an ignore word, ignore it only if the whole string matches an ignore string. This time, named entities reflecting actual skills were not discarded due to a substring match on an ignore word. It was not obvious that the old script was discarding valid skills; the issue only became visible when working with the data by hand (see the sketch after this list).
- The existing synonym map was kept to help the merging process, but manual merging was also done due to the copious variety of synonyms. Here again skills were miscounted by the old script (synonyms appearing beyond the 20th rank went uncounted).
- Discovered "Happiness Engineer" is a role / skill (seen in 2017-05). This alone was worth all the effort.
- A much better sense of the data was acquired. For a one-time effort, the manual approach is the right way. For a repeated process, working with the data manually first and then programming that know-how into a machine makes more sense.
- All data science courses teach that you need to get your hands dirty to really understand the data; it is true.
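Both cleanup rules above are small in code. A minimal sketch, with illustrative ignore and synonym lists (the real ones are larger):

```python
# Minimal sketch of the cleanup rules above: drop an entity only on an
# exact, whole-string ignore match, and fold synonym variants into one
# canonical name before ranking. The word lists here are illustrative.
from collections import Counter

IGNORE = {"engineer", "team", "company"}         # exact matches only
SYNONYMS = {"js": "JavaScript", "golang": "Go"}  # variant -> canonical

def clean_counts(raw_counts):
    counts = Counter()
    for name, n in raw_counts.items():
        key = name.strip().lower()
        # The old substring test, any(w in key for w in IGNORE), would also
        # have discarded valid skills such as "Happiness Engineer".
        if key in IGNORE:
            continue
        counts[SYNONYMS.get(key, name)] += n
    return counts
```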
Have Clear Rules for How To Interpret the Data
What counts as a skill? Manually working with the data exposed the nuance involved. Here, languages (English, German, ...) are included as skills, but very broad terms (Engineer) are not. Perceived applicability of a named entity as a skill worth learning was the rule of thumb for this toy exercise. The detail and rigor of the rules should scale with the seriousness of the subject and audience. Discovering and refining rules is an iterative process, much like coding qualitative data.
Seek Large Amounts of Evenly Distributed Data
Using the Wayback Machine seemed like a great idea, and plenty of data was available. Unfortunately, the distribution was not ideal: many months are missing posts, and the majority of the posts are clustered across a few months. This project went ahead anyway, partially out of curiosity to see how months with sparse data compared to those flush with data.
Fortunately the Wayback Machine website shows you the distribution of the data up front:
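The same check can also be scripted. A minimal sketch, reusing the hypothetical `list_snapshots` helper from the fetch sketch above:

```python
# Bucket CDX snapshot timestamps (YYYYMMDDhhmmss) by month to spot gaps
# and clusters before committing to the data.
from collections import Counter

def snapshots_per_month(entries):
    return Counter(e["timestamp"][:6] for e in entries)

# e.g. Counter({"201705": 14, "201706": 2, ...}) -- sparse months stand out.
```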
Use LTS Tool Versions
The Blender script does not work in the most recent versions of Blender (>= 2.9x). Fortunately, an LTS version of Blender (2.83) was used originally, so it was easy to download the latest LTS release and get up and running again quickly.
Write Clear Code
The original version of the processing code is not well written in terms of clarity; in particular, returning bare tuples from functions makes the data flow hard to reason about. The new version of the script is a bit less complex in terms of data flow within the program (see the sketch below). Even if software is meant as a personal toy, it makes sense to invest some time in quality, at least for your future self.
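For example (illustrative, not the original code), replacing an anonymous tuple return with a `NamedTuple` makes each field self-documenting:

```python
# Illustrative, not the original code: a named return type instead of a
# bare tuple documents what each field means.
from typing import NamedTuple

class MonthStats(NamedTuple):
    month: str
    posts: int
    top_entity_total: int

def summarize(month, per_post_entity_counts):
    # Before: return (month, len(per_post_entity_counts), sum(...)),
    # leaving callers to remember what each tuple index meant.
    return MonthStats(month, len(per_post_entity_counts),
                      sum(per_post_entity_counts))

print(summarize("2017-05", [4, 2, 7]))
# MonthStats(month='2017-05', posts=3, top_entity_total=13)
```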