Creative Data Engineering Can Drive Data Science Insights: A Datapalooza Dispatch

James Kobielus

Research Director and Principal Analyst

Published: November 12, 2015

Data science lives in the details of your projects. But those details are dry as dust if your projects don’t intersect with something about which you care passionately.

Passions are as varied as the data scientists themselves, and immersion in project details is a sure sign of a true data scientist. The first day of Datapalooza in San Francisco brought many working data scientists together to share details of projects that excite them. It’s a three-day event in which most of the working sessions are repeated at least once on subsequent days, so if you missed some details on the first go-around (Tuesday, November 10), you always have the option of attending the same presentation on the second or third day.

Considering that data-science project overviews can be difficult to follow in every detail, I chose to make my life a bit easier. After each Datapalooza session I attended on the first day, I asked the presenters to send me their slide decks or, if what they presented was simply a live hands-on demo, to email me links to any supporting materials that I might study at leisure. Those are still trickling in.

On the whole, I was quite impressed with what I saw. In general, data-scientist passions range from the grandiose world-changing variety to the seemingly mundane. I’d put the Datapalooza day-one presentation by Edd Dumbill (@edd), Stephen O’Sullivan (@steveos), and Harrison Mebane (@harrisonmebane) of Silicon Valley Data Science (SVDS) into the latter category, but that’s not necessarily a bad thing.

What SVDS discussed was a real-time mobile app they’re building to predict train arrivals throughout the Caltrain system in the Bay Area. At first glance that might sound thoroughly unnecessary. After all, doesn’t this and pretty much every other train system in the world publish schedules for all to see?

But here’s the crux of the issue as SVDS presented it. As Caltrain riders themselves, the SVDS principals had become annoyed at the unreliable nature of the transit system’s data API. Though Caltrain on-time performance is over 90 percent, there’s no reliable real-time status information, and, on top of that, the system’s data API tends to crash and was out completely between April and June this year.

Inability to reliably predict when Caltrain cars arrive at stations has an operational impact not just on SVDS staff as commuters but also on the work they do when they arrive at the office. SVDS’ facility is right next to a Caltrain station that, when trains are passing through, emits enough noise to disrupt conference calls in progress. Consequently, SVDS personnel have trouble identifying specific times when they can schedule conference calls to minimize train-noise disruption.

As Dumbill stated: “Every 7-8 minutes on a conference call a train comes by. The problem is that as trains come, you’re not sure when they will. Signs on the stations have a large margin of error. And there is no reliable real-time status information, because of outages on the system’s API.”

Essentially, SVDS took it upon themselves to build an iPhone/Android mobile app that makes their own working lives a bit easier. It’s the seemingly mundane “scratch an itch” motivation behind many of the most innovative data products being developed anywhere. And in order to accomplish that, they needed to engage in a bit of creative data engineering to source the right types of data needed to feed the algorithms that drive their app. Of the sessions I attended at Datapalooza day one, this was the one that best exemplified the theme of harmonizing your data engineering, as I discussed in this recent blog.

So what specifically did they build? As laid out in this recent article on their website, it came down to clever multi-source data acquisition, pipelining, and analysis of audio signals (the acoustic properties of overheard train whistle blasts), video feeds (the streaming visual patterns of glimpsed trains in transit), and Internet feeds (the real-time schedule data coming from the API, passenger real-time sentiment data from Twitter, and relevant data from other Web-based sources).

At a high level, SVDS’ project involves:

Leveraging sensor data they capture themselves from the comfort of their own office, based on audio and video capture devices that they set up on their office porch, overlooking the Caltrain tracks;
Gathering real-time data on the direction and speed of each passing train;
Feeding this data into digital signal processors (DSPs) on a Raspberry Pi board;
Using PyAudio to serialize the audio into an array of 2-byte integers, and applying Fast Fourier Transforms to that array;
Streaming DSP event stream data outputs and associated metadata through Kafka and Storm into HBase, Impala, HDFS, and PostgreSQL;
Maintaining machine-learning algorithms in Python that handle parsing, classification, detection, counting, integration, and alerting within the app’s back-end data pipeline;
Leveraging Hidden Markov Models, decision trees, and logistic regression to handle the trickier train-noise classification tasks;
Analyzing streamed data through a blend of Spark and MapReduce models;
Exposing analytic outputs via a REST API; and
Providing users with a mobile app that delivers real-time Caltrain car-arrival predictions.
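
To make the audio step in the list above a little more concrete, here is a minimal sketch of what "an array of 2-byte integers plus a Fast Fourier Transform" can look like. It is not SVDS' code. It uses only the Python standard library, substitutes a synthetic 500 Hz tone for a real capture from a microphone, and the function names (pcm16_to_samples, fft, dominant_frequency) are hypothetical ones chosen for illustration.

```python
import array
import cmath
import math


def pcm16_to_samples(raw):
    """Interpret raw bytes as signed 2-byte integers (native byte order)."""
    samples = array.array("h")  # "h" = signed short, 2 bytes on common platforms
    samples.frombytes(raw)
    return samples.tolist()


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(raw, sample_rate):
    """Return the frequency in Hz of the strongest non-DC bin in the buffer."""
    samples = pcm16_to_samples(raw)
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]
    peak_bin = magnitudes.index(max(magnitudes)) + 1  # +1 because bin 0 was skipped
    return peak_bin * sample_rate / len(samples)


# Assumption: a synthetic tone stands in for a captured audio buffer.
SAMPLE_RATE = 8000
N = 1024
tone = array.array(
    "h",
    (int(10000 * math.sin(2 * math.pi * 500 * i / SAMPLE_RATE)) for i in range(N)),
)
print(dominant_frequency(tone.tobytes(), SAMPLE_RATE))  # 500.0
```

In a real pipeline the buffer would come from a capture library and the transform from an optimized numerical package; the hand-written FFT here only keeps the sketch self-contained.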

In the Datapalooza session, the SVDS team discussed the pros and cons of the different algorithmic models that are designed to classify trains by means of audio and video signals. SVDS is using these models to identify, based on inputted multisource signals, the specific scheduled train runs, whether they are local or express runs, and whether they were ahead, behind, or on schedule. They also contextualize these analyses within a model of other significant events of Caltrain system-wide impact, such as, in Dumbill’s word, “calamities.”
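
The matching step described above, tying a detected train to a scheduled run and deciding whether it is ahead, behind, or on schedule, can be sketched roughly as follows. This is my own simplification, not the SVDS models: it assumes the detection has already been reduced to a single passing time, it matches by nearest scheduled time only, and the ScheduledRun type, the train identifiers, and the one-minute tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledRun:
    train_id: str     # hypothetical run identifier
    service: str      # "local" or "express"
    due_minute: int   # scheduled passing time, in minutes after midnight


def match_observation(observed_minute, schedule, tolerance=1):
    """Pair an observed passing time with the nearest scheduled run.

    Returns (run, status), where status is "ahead", "behind", or "on schedule".
    """
    run = min(schedule, key=lambda r: abs(r.due_minute - observed_minute))
    delay = observed_minute - run.due_minute
    if delay > tolerance:
        status = "behind"
    elif delay < -tolerance:
        status = "ahead"
    else:
        status = "on schedule"
    return run, status


# Assumption: a made-up three-run schedule, not real Caltrain data.
schedule = [
    ScheduledRun("A", "local", 8 * 60 + 5),
    ScheduledRun("B", "express", 8 * 60 + 20),
    ScheduledRun("C", "local", 8 * 60 + 35),
]
run, status = match_observation(8 * 60 + 24, schedule)
print(run.train_id, run.service, status)  # B express behind
```

Nearest-time matching breaks down when a train is late enough to be closer to the next run's slot, which is one reason additional signals such as direction, speed, and local-versus-express classification matter.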

Here’s a screenshot from the SVDS IPython notebook used in the Datapalooza presentation. It shows the thresholding analysis that is applied to image captures in order to identify the type of train run being observed in real time.
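
For readers unfamiliar with image thresholding, the following is a generic illustration only, not a reconstruction of the SVDS notebook. It assumes grayscale frames represented as nested lists of 0-255 values, compares a frame against a reference background, and reports the fraction of pixels that changed by more than a cutoff. The function names and the cutoff values are hypothetical.

```python
def changed_fraction(frame, background, cutoff=40):
    """Fraction of pixels whose absolute difference from the background exceeds cutoff."""
    total = 0
    changed = 0
    for row, bg_row in zip(frame, background):
        for pixel, bg_pixel in zip(row, bg_row):
            total += 1
            if abs(pixel - bg_pixel) > cutoff:
                changed += 1
    return changed / total if total else 0.0


def something_in_view(frame, background, cutoff=40, min_fraction=0.25):
    """Crude detector: True when enough of the frame differs from the background."""
    return changed_fraction(frame, background, cutoff) >= min_fraction


# Assumption: a tiny 4x4 synthetic frame stands in for a camera capture.
background = [[100] * 4 for _ in range(4)]
frame = [row[:] for row in background]
frame[1] = [200] * 4   # a bright band across the frame
frame[2] = [20] * 4    # a dark band across the frame
print(changed_fraction(frame, background))   # 0.5
print(something_in_view(frame, background))  # True
```

Telling a local run from an express run would need more than this, for example how long the changed region persists across successive frames.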

SVDS stresses the importance of tapping an ensemble of signals to drive iterative refinement of models, thereby enabling progressive improvement of their overall predictive performance. Leveraging Apache Cordova and doing team-based work in sprints, they have engineered hooks into their mobile app so that a user’s device GPS data can deliver location-relevant train-schedule predictions.

Clearly, this is the sort of data collection that could conceivably be developed further through the Internet of Things and crowdsourcing, though that’s not in the current scope of SVDS’ project.

Also, SVDS was candid about the fact that, for this app to deliver value as a data product, they need to work on making it more “consumer friendly.”

Please reach out to SVDS directly for further details. I’ve linked to the presenters’ Twitter addresses near the top of this post.

Here are other sessions I attended on Datapalooza day one:

Data Engineering: Spark Search by Taka Shinagawa, a software engineer and data scientist based in the Bay Area
Data Science: Better Search at Scale: Leveraging Spark for Contextual NLP by Jonathan Dinu, VP of Academic Excellence at Galvanize
Data Science: Real Time Vehicle Telematics, by Cahlen Humphreys, Big Data Engineer at zData Inc
Data Science: Search by Selfie: a Spark Facial Recognition Algorithm, by Brandon Schatz, founder of SportsPhotos

If you want further details on any of the projects presented at Datapalooza, I recommend that you reach out directly to the presenters via their respective linked social-media addresses.

Datapalooza may soon be coming to a city near you. Stay tuned here for updates. We hope to engage the world’s brightest data scientists wherever and whenever makes sense for you.
