Creative Data Engineering Can Drive Data Science Insights: A Datapalooza Dispatch

Data science lives in the details of your projects. But those details are dry as dust if your projects don’t intersect with something about which you care passionately.

Passions are as varied as the data scientists themselves, and immersion in project details is a sure sign of a true data scientist. The first day of Datapalooza in San Francisco brought many working data scientists together to share details of projects that excite them. It’s a three-day event in which most of the working sessions are repeated at least once on subsequent days, so if you missed some details on the first go-around (Tuesday, November 10), you always have the option of attending the same presentation on the second or third day.

Considering that data-science project overviews can be difficult to follow in every detail, I chose to make my life a bit easier. After each Datapalooza session I attended on the first day, I asked the presenters to send me their slide decks or, if what they presented was simply a live hands-on demo, to email me links to any supporting materials that I might study at leisure. Those are still trickling in.

On the whole, I was quite impressed with what I saw. In general, data-scientist passions range from the grandiose world-changing variety to the seemingly mundane. I’d put the Datapalooza day-one presentation by Edd Dumbill (@edd), Stephen O’Sullivan (@steveos), and Harrison Mebane (@harrisonmebane) of Silicon Valley Data Science (SVDS) into the latter category, but that’s not necessarily a bad thing.

What SVDS discussed was a real-time mobile app they’re building to predict train arrivals throughout the Caltrain system in the Bay Area. At first glance that might sound thoroughly unnecessary. After all, doesn’t this and pretty much every other train system in the world publish schedules for all to see?

But here’s the crux of the issue as SVDS presented it. As Caltrain riders themselves, the SVDS principals had become annoyed at the unreliable nature of the transit system’s data API. Though Caltrain on-time performance is over 90 percent, there’s no reliable real-time status information, and, on top of that, the system’s data API tends to crash and was out completely between April and June this year.

Inability to reliably predict when Caltrain cars arrive at stations has an operational impact not just on SVDS staff as commuters but on the work they do when they arrive at the office. SVDS’ facility is right next to a Caltrain station that, when trains are passing through, emits enough noise to disrupt conference calls in progress. Consequently, SVDS personnel have trouble identifying specific times when they can schedule conference calls to minimize train-noise disruption.

As Dumbill stated: “Every 7-8 minutes on a conference call a train comes by. The problem is that as trains come, you’re not sure when they will. Signs on the stations have a large margin of error. And there is no reliable real-time status information, because of outages on the system’s API.”

Essentially, SVDS took it upon themselves to build an iPhone/Android mobile app that makes their own working lives a bit easier. It’s the seemingly mundane “scratch an itch” motivation behind many of the most innovative data products being developed anywhere. And in order to accomplish that, they needed to engage in a bit of creative data engineering to source the right types of data needed to feed the algorithms that drive their app. Of the sessions I attended at Datapalooza day one, this was the one that best exemplified the theme of harmonizing your data engineering, as I discussed in this recent blog.

So what specifically did they build? As laid out in this recent article on their website, it came down to clever multi-source data acquisition, pipelining, and analysis of audio signals (the acoustic properties of overheard train whistle blasts), video feeds (the streaming visual patterns of glimpsed trains in transit), and Internet feeds (the real-time schedule data coming from the API, passenger real-time sentiment data from Twitter, and relevant data from other Web-based sources).

At a high level, SVDS’ project involves:

  • Leveraging sensor data they capture themselves from the comfort of their own office, based on audio and video capture devices that they set up on their office porch, overlooking the Caltrain tracks;
  • Gathering real-time data on the direction and speed of each passing train;
  • Feeding this data into digital signal processors (DSPs) on a Raspberry Pi board;
  • Using PyAudio to serialize the audio into an array of 2-byte integers and applying Fast Fourier Transforms to that array;
  • Streaming DSP event stream data outputs and associated metadata through Kafka and Storm into HBase, Impala, HDFS, and PostgreSQL;
  • Maintaining machine-learning algorithms in Python that handle parsing, classification, detection, counting, integration, and alerting within the app’s back-end data pipeline;
  • Leveraging Hidden Markov Models, decision trees, and logistic regression to handle the trickier train-noise classification tasks;
  • Analyzing streamed data through a blend of Spark and MapReduce models;
  • Exposing analytic outputs via a REST API; and
  • Providing users with a mobile app that delivers real-time Caltrain car-arrival predictions.
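To make the audio step in that list a bit more concrete, here is a minimal sketch of turning a frame of 2-byte audio samples into a frequency reading. It is my own illustration, not SVDS’ code: the sample rate, the frame size, the function names, and the synthetic sine tone standing in for a microphone frame are all assumptions, and I use a small pure-Python radix-2 FFT so the example runs without a microphone or any third-party library.

```python
import array
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed for illustration
FRAME_SIZE = 1024   # samples per frame; must be a power of two for this FFT


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(raw_bytes, sample_rate=SAMPLE_RATE):
    """Decode 16-bit PCM bytes and return the strongest frequency in Hz."""
    pcm = array.array("h")  # "h" = signed 2-byte integers
    pcm.frombytes(raw_bytes)
    spectrum = fft(list(pcm))
    half = len(pcm) // 2  # for real input, only the first half is unique
    magnitudes = [abs(c) for c in spectrum[1:half]]  # skip the DC bin
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate / len(pcm)


def synth_tone(freq_hz, n=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Synthetic stand-in for a microphone frame: a pure sine tone."""
    pcm = array.array(
        "h",
        (int(12000 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
         for i in range(n)),
    )
    return pcm.tobytes()


frame = synth_tone(1000.0)           # hypothetical "whistle" at 1 kHz
peak_hz = dominant_frequency(frame)  # strongest frequency found in the frame
print(peak_hz)
```

In a real deployment the bytes would come from an audio capture library rather than `synth_tone`, and the spectrum would feed a classifier rather than a single peak lookup.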

In the Datapalooza session, the SVDS team discussed the pros and cons of the different algorithmic models designed to classify trains by means of audio and video signals. SVDS is using these models to identify, from multisource input signals, the specific scheduled train runs, whether they are local or express runs, and whether they are ahead of, behind, or on schedule. They also contextualize these analyses within a model of other significant events of Caltrain system-wide impact, such as, in Dumbill’s word, “calamities.”
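The last part of that task, tying a detected train to a scheduled run and deciding whether it is ahead of or behind schedule, can be sketched roughly as follows. This is only an illustration under my own assumptions: the timetable entries, run IDs, the arbitrary date, the one-minute tolerance, and the `match_observation` helper are all hypothetical and do not come from SVDS or Caltrain.

```python
from datetime import datetime, timedelta

# Hypothetical timetable for one station: (run id, service type, scheduled time).
# The date is arbitrary; only the times of day matter for the example.
SCHEDULE = [
    ("run-A", "local", datetime(2000, 1, 3, 8, 0)),
    ("run-B", "express", datetime(2000, 1, 3, 8, 12)),
    ("run-C", "local", datetime(2000, 1, 3, 8, 30)),
]

ON_TIME_TOLERANCE = timedelta(minutes=1)  # assumed threshold


def match_observation(observed_at, schedule=SCHEDULE):
    """Match a detected train pass to the nearest scheduled run."""
    run_id, service, scheduled = min(
        schedule, key=lambda run: abs(run[2] - observed_at)
    )
    delta = observed_at - scheduled
    if abs(delta) <= ON_TIME_TOLERANCE:
        status = "on schedule"
    elif delta > timedelta(0):
        status = "behind"
    else:
        status = "ahead"
    return {
        "run": run_id,
        "service": service,
        "status": status,
        "delay_seconds": delta.total_seconds(),
    }


result = match_observation(datetime(2000, 1, 3, 8, 15, 30))
print(result)
```

Nearest-time matching is the simplest possible rule. It would misassign runs during heavy delays, which is presumably one reason to bring in richer models and additional signals.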

Here’s a screenshot from the SVDS IPython notebook used in the Datapalooza presentation. It shows the thresholding analysis that is applied to image captures in order to identify the type of train run being observed in real time.
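To give a flavor of what a thresholding step can look like, here is a minimal sketch in plain Python. It is not the code from the SVDS notebook: the tiny grayscale "frames," the pixel cutoff of 40, the 25 percent coverage rule, and the function names are all assumptions I chose for illustration.

```python
def threshold_mask(frame, background, cutoff=40):
    """Mark pixels whose brightness differs from the background by more than cutoff."""
    return [
        [1 if abs(pixel - bg_pixel) > cutoff else 0
         for pixel, bg_pixel in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def foreground_fraction(mask):
    """Fraction of pixels flagged as foreground in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total


# Hypothetical 4x8 grayscale frames (0-255): empty tracks, then a bright object.
background = [[100] * 8 for _ in range(4)]
frame = [row[:] for row in background]
for row in frame[1:3]:
    for x in range(2, 7):
        row[x] = 220

mask = threshold_mask(frame, background)
coverage = foreground_fraction(mask)
train_present = coverage > 0.25  # assumed decision rule
print(coverage, train_present)
```

A real pipeline would work on camera frames and would need to cope with lighting changes, but the idea is the same: reduce each image to a binary mask and derive simple features from it.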

SVDS stresses the importance of tapping an ensemble of signals to drive iterative refinement of models, thereby enabling progressive improvement of their overall predictive performance. Leveraging Apache Cordova and doing team-based work in sprints, they have engineered hooks into their mobile app so that a user’s device GPS data can deliver location-relevant train-schedule predictions.
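One simple way to picture an ensemble of signals is a weighted average of per-signal confidence scores that still works when one source drops out. The sketch below is my own simplification, not SVDS’ method: the signal names, weights, scores, and the `combine_signals` helper are hypothetical.

```python
def combine_signals(scores, weights):
    """Weighted average of per-signal probabilities that a train is passing.

    Only signals present in `scores` contribute, so a missing source
    (for example, an API outage) does not break the estimate.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("no usable signals")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Assumed relative trust in each source.
weights = {"audio": 0.5, "video": 0.3, "api": 0.2}

# Hypothetical readings while the schedule API is unavailable.
scores = {"audio": 0.9, "video": 0.8}

probability = combine_signals(scores, weights)
print(round(probability, 4))
```

Iterative refinement would then amount to adjusting the weights, or replacing the average with a trained model, as more labeled observations accumulate.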

Clearly, this is the sort of data collection that could conceivably be developed further through the Internet of Things and crowdsourcing, though that’s not in the current scope of SVDS’ project.

Also, SVDS was candid about the fact that, for this app to deliver value as a data product, they need to work on making it more “consumer friendly.”

Please reach out to SVDS directly for further details. I’ve linked to the presenters’ Twitter addresses near the top of this post.

Here are other sessions I attended on Datapalooza day one:

  • Data Engineering: Spark Search by Taka Shinagawa, a software engineer and data scientist based in the Bay Area
  • Data Science: Better Search at Scale: Leveraging Spark for Contextual NLP by Jonathan Dinu, VP of Academic Excellence at Galvanize
  • Data Science: Real Time Vehicle Telematics, by Cahlen Humphreys, Big Data Engineer at zData Inc
  • Data Science: Search by Selfie: a Spark Facial Recognition Algorithm, by Brandon Schatz, founder of SportsPhotos

If you want further details on any of the projects presented at Datapalooza, I recommend that you reach out directly to the presenters via their respective linked social-media addresses.

Datapalooza may soon be coming to a city near you. Stay tuned here for updates. We hope to engage the world’s brightest data scientists wherever and whenever makes sense for you.
