Highlights of DataFest 2018
For the second consecutive year, I've enjoyed being part of the judge team at the annual DataFest event at Chapman University. Led by Dr. Michael Fahy and his team, this year's event attracted 24 teams from universities in the greater SoCal area.
What is DataFest?
DataFest is a weekend-long data hackathon, where teams of 2-5 university students are given a large, commercial dataset and a set of questions. The teams then work together using data science methods to produce one or more findings. They share their results during the presentation portion of the event.
We judged their work in three categories: Overall Insight, Visualization and Data.
What do Students do with data?
This is such an interesting question for me as an industry person. I was fascinated to see the different tools, approaches and results that the teams produced. There were a couple of observations that particularly interested me:
- Most teams augmented the data that was provided with one or more sets of public data.
Students used open data, particularly government-supplied data
- The more types of data the teams used, the more data munging (cleaning) they had to do. Some teams got stuck in this area. I actually thought that was very useful and reflective of real-world work with data.
- Several teams used data visualizations early in their process, to get a 'view' of the quality or information in the data. One team found a bug in the vendor's website that populated a default which skewed their data!
- The teams generally focused on using the data to gain insight into questions in one of three areas: a) making more money for the company that provided the dataset, b) providing more useful information for students or c) investigating relationships between socioeconomic factors (poverty, levels of education, rural markets...) and the provided dataset.
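The augment-then-clean pattern from the observations above can be sketched as follows. This is a minimal illustration, not any team's actual code: the column names (`listing_id`, `state`, `clicks`, `median_income`) and the sample values are hypothetical, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical sample of the provided dataset (column names are assumptions).
PROVIDED_CSV = """listing_id,state,clicks
1, ca ,120
2,NY,85
3,tx,
4,ZZ,40
"""

# Hypothetical public dataset, e.g. a government-published table keyed by state.
PUBLIC_CSV = """state,median_income
CA,71000
NY,65000
TX,59000
"""


def clean_state(raw):
    """Normalize a state code: trim whitespace and upper-case it."""
    return raw.strip().upper()


def load_public(text):
    """Index the public table by its cleaned state code."""
    reader = csv.DictReader(io.StringIO(text))
    return {clean_state(row["state"]): int(row["median_income"]) for row in reader}


def augment(provided_text, public_by_state):
    """Left-join provided rows to the public table, cleaning as we go."""
    joined, unmatched = [], []
    for row in csv.DictReader(io.StringIO(provided_text)):
        state = clean_state(row["state"])
        clicks = row["clicks"].strip()
        record = {
            "listing_id": int(row["listing_id"]),
            "state": state,
            # Keep missing values as None rather than guessing a default.
            "clicks": int(clicks) if clicks else None,
            "median_income": public_by_state.get(state),
        }
        (joined if record["median_income"] is not None else unmatched).append(record)
    return joined, unmatched


joined, unmatched = augment(PROVIDED_CSV, load_public(PUBLIC_CSV))
print(len(joined), "rows matched;", len(unmatched), "rows need review")
```

Keeping unmatched rows in a separate list, and leaving missing values as `None` instead of a default, makes it easier to spot the kind of silently populated default that skewed one team's data.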
Which Tools and Languages do Students use?
In this area, it interested me to observe that there seemed to be less use of the R language (than in the previous year's entries) and more use of Python. The most common algorithm used was logistic regression. A couple of teams built full custom machine learning models.
I heard from several teams that they were resource constrained given the size of the dataset -- due to the lack of storage and processing power on their laptops. Being a Cloud Architect, it pains me to hear that students are not using the public cloud in this work. Here's a quote from one team:
It took 2 hours to render these heat maps on our laptop.
An obvious growth area is to include mentorship with one or more of the public cloud vendors for next year's event.
Hoodies and Blankets were on display during the wee hours of the hackathon.
What's Next?
Congratulations to the hardworking hosts at Chapman University and participating students on a great event. As I did last year, I invited members of the winning team to join me in real-world work. To date, I've hired one person from the winning team of 2017, and he's doing great. The energy, creativity and skills of the students inspire me.
Let's help them grow. To contribute, contact Dr. Fahy via [email protected]
Thanks for the recap, Lynn. I couldn’t agree more regarding the topic of providing more real-world (Cloud) tools for the teams. Several asked me how to execute James Peach’s recommendation to load the dataset into MySQL for quick high-level analysis, but their laptops couldn’t handle the larger import. The one team that I know succeeded in the import didn’t get viable results until just before presentation time. A standard set of on-demand Cloud resources, available to all the teams, would have allowed them to get to analysis & discovery more quickly.
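One way to keep a large import within a laptop's memory is to load the file in fixed-size batches. The sketch below is only an illustration of that idea: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name `events`, the columns and the generated sample data are all hypothetical.

```python
import csv
import io
import itertools
import sqlite3

# Hypothetical data; in practice this would be a large file opened with open().
SAMPLE_CSV = "user_id,clicks\n" + "".join(f"{i},{i % 7}\n" for i in range(10_000))


def import_in_batches(conn, lines, batch_size=1_000):
    """Insert CSV rows in fixed-size batches so memory use stays bounded."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, clicks INTEGER)")
    total = 0
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        conn.commit()  # commit per batch so a failure loses at most one batch
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
rows = import_in_batches(conn, io.StringIO(SAMPLE_CSV))
avg_clicks = conn.execute("SELECT AVG(clicks) FROM events").fetchone()[0]
print(rows, "rows imported; average clicks:", round(avg_clicks, 2))
```

Because the reader is consumed lazily, only one batch of rows is held in memory at a time; the same loop shape should carry over to a MySQL client library, though the connection and placeholder syntax would differ.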
6 years ago: Congratulations to Hernan Padilla and his son!
6 years ago: Love the narrative, Lynn Langit!
6 years ago: had a good time, just a little under the weather from the event. stayed up all night writing sql. :D
6 years ago: Way to go Ryan! Congrats to all on the UCI team! Go Anteaters!!