Little data science project on Summer 2024 Tech Internships by Pitt CSC & Simplify
Visit the project's repository: link
I don't know what you did this weekend. Were you having fun, hanging out with friends? For me, since the intern hiring season is wrapping up, and inspired by the data science course I'm taking, Data 8 at Berkeley, I decided to start a project analyzing the data from a GitHub repo: Summer 2024 Tech Internships by Pitt CSC & Simplify. Here I want to share the steps I took to finish it, what I learned, and what to expect next.
My Motivation:
If you don't know, I was once one of their top 10 contributors, although all I did was add jobs to their markdown file. For those who don't know, this repo is a very popular place (I believe the top intern job board) for STEM students to look for internships. It was first run by Pitt CSC and contributors like me, and later managed by Simplify, and I have to admit the table became more structured after that.
Since I figured their data might offer some insight for my future internship hunt, and help others as well, I decided to start this project.
Another reason is that I wanted to strengthen my skills in data collection and data cleaning: I hadn't used requests and bs4 since my first internship, where I wrote some Python scripts to pull pricing data from the company's competitors' websites. And the data I worked with in Data 8 was clean and expected to be error-free. Those motivations combined inspired me to work on this project.
Steps that I took to finish the project
First I went to the repo and copied just one row of data with its original HTML code. The reason for not requesting the whole website right away was to simplify the process and extend from it later. There I created a function that scans all the td tags in the row with the BeautifulSoup parser. It returns only an array of Company, Role, Location, and Date; since the URL of the position isn't useful for my application, I excluded it.
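Here's roughly what that row parser could look like (a minimal sketch; the name parse_row and the sample row are mine, just for illustration):

from bs4 import BeautifulSoup

def parse_row(row_html):
    # Collect the text of every <td> cell in one table row.
    soup = BeautifulSoup(row_html, "html.parser")
    cells = [td.get_text(strip=True) for td in soup.find_all("td")]
    # Keep Company, Role, Location, Date; drop the application-link cell.
    return cells[:4]

sample = "<tr><td>NVIDIA</td><td>SWE Intern</td><td>Santa Clara, CA</td><td>Mar 15</td><td><a href='#'>Apply</a></td></tr>"
print(parse_row(sample))  # ['NVIDIA', 'SWE Intern', 'Santa Clara, CA', 'Mar 15']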
However, when I scrolled to the bottom of the Date Posted column, I found a different format that didn't specify the day: those rows just said "May 2023", unlike the more recent rows like "Mar 15". So I added a step to unify all the dates by turning entries like "Mar 15" into two columns, Month and Year, which keeps all the data consistent at the cost of the specific day of the month.
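The normalization itself is a small function; a sketch like this captures the idea (treating day-only rows as the current posting year, 2023, is my assumption):

def split_date(raw, default_year=2023):
    # "May 2023" -> ("May", 2023); "Mar 15" -> ("Mar", default_year).
    month, rest = raw.split()
    if len(rest) == 4 and rest.isdigit():
        return month, int(rest)
    return month, default_year  # assumed year for day-only rows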
After the hard work of processing a single row, things got easier, since the code for extracting one entry should work for all rows. I included an HTML file containing only tr elements to test it out. In the turnRowsToColumns function I turn all the rows of data into five columns, Company, Title, Location, Month, Year, which later fit into my pandas DataFrame (it's my first time working with pandas; I had been using my school's datascience library the whole time in class :) ).
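Roughly, turnRowsToColumns does something like this, reusing the parse_row and split_date sketches above (the exact signature is my guess):

import pandas as pd
from bs4 import BeautifulSoup

def turnRowsToColumns(rows_html):
    # Parse every <tr> and assemble the five-column table.
    soup = BeautifulSoup(rows_html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        company, role, location, raw_date = parse_row(str(tr))
        month, year = split_date(raw_date)
        records.append((company, role, location, month, year))
    return pd.DataFrame(records, columns=["Company", "Title", "Location", "Month", "Year"])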
Here I discovered another issue in the original data: some rows have missing company names. They aren't actually empty; they contain a single character, '?'. That meant my project wouldn't truly reflect real life under this sample, because three NVIDIA jobs went missing.
My solution was to first turn the rows with '?' into NaN, then use the pandas built-in bfill method, which fills each NaN with the next company name that appears below it. That saved the three NVIDIA jobs in my dataset.
I also confirmed it with a ctrl-f search back in the HTML table, to make sure the data I got was right, because I first messed up the direction of the filling.
# At module level: import numpy as np
def __fixTheIssueForEmptyCompanyData(self, df):
    # Turn the '?' placeholders into NaN, then backward-fill each gap
    # with the next valid company name below it.
    df['Company'] = df['Company'].replace('?', np.nan)
    df['Company'] = df['Company'].bfill()
    return df
Now I had basically finished constructing the DataFrame for the intern data. Then I created a fetching method to get a fresh table from the website. It wasn't that hard at this point: the idea is to return the same rows of jobs as the ones I copied at first. I used the bs4 parser to walk down the document, locate the rows I needed, and return them as a string to my generateDataFrame function. The program is set for now. Next I want to talk about some of my discoveries.
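For the curious, the fetching step can be sketched like this (the URL and the assumption that the job table is the first table in the rendered README are mine, not the exact code):

import requests
from bs4 import BeautifulSoup

REPO_URL = "https://github.com/SimplifyJobs/Summer2024-Internships"  # assumed source page

def fetchFreshRows():
    # Download the page and return the job table's <tr> markup as one string.
    page = requests.get(REPO_URL, timeout=30)
    page.raise_for_status()
    table = BeautifulSoup(page.text, "html.parser").find("table")
    body = table.find("tbody") or table  # prefer tbody so the header row is skipped
    return "".join(str(tr) for tr in body.find_all("tr"))

df = turnRowsToColumns(fetchFreshRows())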
What I have learned
You might notice from the banner one of my more interesting findings. From what I observed, although the start of the fall semester is very strong (September, October), February isn't bad either, so don't give up before February when you're looking for an internship. And as expected, since November has Thanksgiving and December has Christmas, most companies don't post as many internships then as they do at the beginning of the fall semester.
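That monthly distribution falls straight out of the DataFrame; here's one way to get it, assuming df is the table built above and the Month column holds abbreviations like "Mar":

# Postings per month, in calendar order.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
posts_per_month = df["Month"].value_counts().reindex(month_order)
print(posts_per_month)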
Another interesting finding is the count of internship postings per company. My assumption was that for software internship hiring, the top employers would also be software companies like Microsoft, Google, and Salesforce, but the result disagrees with my expectation. The list of tech internships is topped by chip manufacturers and defense companies: Intel, Leidos, and NVIDIA.
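The company ranking is just as direct to compute from the same df:

# Top employers by number of postings.
top_companies = df.groupby("Company").size().sort_values(ascending=False)
print(top_companies.head(10))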
Another thing I learned is not on the data side but the software development side. I realized that each time I called the function, it took a long time to build the DataFrame, around 7 to 8 seconds. So I decided to look for a way to shorten the running time. What I did was add a check every time I instantiate the internTable class: it looks for a CSV already generated on the same date and uses that instead of creating a new one, saving time and computing power. After I made that change, it got 82% faster.
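The caching idea looks roughly like this (the file-name scheme and the helper name are my stand-ins, not the exact code from the class):

import os
from datetime import date
import pandas as pd

def loadOrBuildTable(build_fresh):
    # Reuse today's cached CSV if it exists; otherwise build and cache it.
    cache = f"interns_{date.today().isoformat()}.csv"
    if os.path.exists(cache):
        return pd.read_csv(cache)
    df = build_fresh()
    df.to_csv(cache, index=False)
    return df

df = loadOrBuildTable(lambda: turnRowsToColumns(fetchFreshRows()))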
What to expect next?
Those were some small discoveries from the data, and I want to explore more insights. I'm also interested in joining this DataFrame with, say, background information on those companies, to see the hiring trends of different industries for college students. Joining tables from previous years is necessary as well, I believe.
On the software development side, I want to add software testing to this project. Or maybe I should have started with TDD, like CS 61B taught me, and built tests for the project before touching the classes and functions.
End Note
Thanks for reading this far; I'm very happy that you've read my first article. Another reason I started this project, which I didn't disclose at the beginning, is that I want to develop a strategy for a more successful internship search next year, since I came up empty the last two years for many reasons. Maybe I should aim for one sector instead of spreading across all of them, and there are some skills I have to acquire. That's something to consider as I discover more with this now-cleaned dataset.
And because I still have to pay rent for my off-campus apartment during the summer, I've applied to Uber to cover the cost (not their software engineering program lol, their driver program). I might have to work at a restaurant as well, but we'll see on my LinkedIn job history lol. I hope that by taking advantage of the table, I'll secure an internship next year and have sufficient funds next summer.
Again, if you want to visit the project repo and read the code I wrote: link