Week 13 - Building data pipelines
Centre for AI & Climate
Connecting capabilities across technology, policy, & business to accelerate the application of AI to climate challenges.
Jon (Product lead):
Last week we launched our MVP!
The reception so far has been fantastic. Some high-level numbers:
The launch had three broad goals:
So the plan for this week from my perspective is to find people who need the data and understand their use cases. We have some great calls booked in this week off the back of inbound interest following the launch, so it's looking promising.
Jon
Steve (Engineering lead):
I skipped weeknotes last week because we were focussed on the launch of Weave, but that was actually mostly Jon’s work by then. With the data together for our first prototype, I’ve started focussing on what comes next - a proper production data pipeline to handle the full set of DNO (distribution network operator) data available.
This pipeline needs to be everything our prototype wasn’t: automated, well-tested, and most importantly scalable, because it’s going to have to deal with a lot more data. No more running scripts on my laptop overnight and at weekends!
Not having been much of a data engineer before, I did a lot of research before settling on the tools that I hope will let us manage the process better, though I’m still working through them. Sadly there’s nothing quite as nice as https://pangeo-forge.org/ for GeoParquet data (at least that I could find - do let us know if you know better!). Instead, we’re planning to keep things in Python for the time being and use a combination of Dagster and Dask to orchestrate and parallelise our process. If you’re an expert in either of these and want to share your hard-earned knowledge, I’d love to hear it.
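To give a flavour of how the two fit together, here’s a minimal sketch (not our actual pipeline): Dagster defines the assets and the daily schedule, and Dask fans the per-file work out across cores. All the names here - the assets, paths, and the clean_one_file helper - are made-up placeholders.

```python
from typing import List

import dask.bag as db
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


def clean_one_file(path: str) -> str:
    """Hypothetical helper: parse one raw DNO file and write cleaned output."""
    ...
    return path


@asset
def dno_raw_files() -> List[str]:
    # In a real pipeline this would list whatever was archived that day;
    # here it is just a placeholder.
    return ["raw/dno_a/2024-01-01.csv", "raw/dno_b/2024-01-01.csv"]


@asset
def cleaned_dno_data(dno_raw_files: List[str]) -> List[str]:
    # Parallelise the per-file cleaning across local cores with a Dask bag.
    return db.from_sequence(dno_raw_files).map(clean_one_file).compute()


daily_job = define_asset_job("daily_refresh", selection="*")

defs = Definitions(
    assets=[dno_raw_files, cleaned_dno_data],
    schedules=[ScheduleDefinition(job=daily_job, cron_schedule="0 3 * * *")],
)
```

The appeal of this split is that Dagster gives us the scheduling, retries and visibility the prototype lacked, while Dask handles the "lot more data" part without us having to rewrite anything outside Python.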
The first and most important step in our pipeline is to get hold of, and archive, all the raw data. To build our prototype I just clicked around the various data portals, but we want to automate that and download new data every day. Working this out has again shown me how much friction there is in getting hold of this kind of data. Every DNO has different data portal software, each with its own API model you have to work through. Even when they use the same one, they can use it in different ways, so the API requests you need are different.
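To make that friction concrete, here’s a rough, illustrative sketch of what a daily archiver ends up looking like: per-DNO configuration, with a branch for each portal style. The portal names, URLs and dataset identifiers below are invented for the example, and real portals add their own pagination, auth and format quirks on top.

```python
from datetime import date
from pathlib import Path

import requests

# Hypothetical per-DNO settings - every entry ends up needing its own quirks.
PORTALS = {
    "dno_a": {
        "style": "ckan",
        "base": "https://data.example-dno-a.co.uk",
        "dataset": "lv-feeder-monitoring",
    },
    "dno_b": {
        "style": "opendatasoft",
        "base": "https://example-dno-b.opendatasoft.com",
        "dataset": "smart-meter-aggregates",
    },
}


def fetch_listing(cfg: dict) -> list[str]:
    """Return the download URLs for one portal, branching on its API style."""
    if cfg["style"] == "ckan":
        # CKAN portals expose dataset metadata at /api/3/action/package_show.
        resp = requests.get(
            f"{cfg['base']}/api/3/action/package_show",
            params={"id": cfg["dataset"]},
            timeout=30,
        )
        resp.raise_for_status()
        return [r["url"] for r in resp.json()["result"]["resources"]]
    if cfg["style"] == "opendatasoft":
        # Opendatasoft portals can export a whole dataset in one request.
        return [
            f"{cfg['base']}/api/explore/v2.1/catalog/datasets/{cfg['dataset']}/exports/csv"
        ]
    raise ValueError(f"Unknown portal style: {cfg['style']}")


def archive_all(root: Path = Path("raw")) -> None:
    """Download today's files from every portal into a dated archive folder."""
    for dno, cfg in PORTALS.items():
        target = root / dno / date.today().isoformat()
        target.mkdir(parents=True, exist_ok=True)
        for i, url in enumerate(fetch_listing(cfg)):
            data = requests.get(url, timeout=120)
            data.raise_for_status()
            (target / f"file_{i}").write_bytes(data.content)
```

Multiply that by every DNO and every dataset, and you can see why this step alone soaks up so much time before any of the more interesting work starts.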
All in all, there’s likely to be several weeks of work in this kind of “plumbing” before we can expand our data’s range and quality. Imagine if every one of the 40 people who’ve downloaded it so far had to make that kind of investment!
Steve