Week 13 - Building data pipelines

Jon (Product lead):

Last week we launched our MVP!

So far the reception has been fantastic. Some high-level numbers:

  • 14,000 people saw the launch post on LinkedIn
  • 950 people have visited the website
  • 40 people downloaded and played with the data


The launch had four broad goals:

  1. Make people aware of Weave and generate interest (done)
  2. Get people onto the Weave website (done)
  3. Convert visitors to users and get them to actually download the data (kind of)
  4. Identify clear use cases and applications for the data (not yet)

So the plan for this week, from my perspective, is to find people who need the data and understand their use cases. We have some great calls booked in this week off the back of inbound interest following the launch, so it's looking promising.

Jon


Steve (Engineering lead):

I skipped weeknotes last week because we were focussed on the launch of Weave, but that was actually mostly Jon’s work by then. With the data for our first prototype in place, I’ve started focussing on what comes next - a proper production data pipeline to handle the full set of DNO data available.

This pipeline needs to be everything our prototype wasn’t: automated, well-tested, and, most importantly, scalable, because it’s going to have to deal with a lot more data. No more running scripts on my laptop overnight and at weekends!

Not having been much of a data engineer before, I did a lot of research before settling on the tools that I hope will let us manage the process better, though I’m still working through them. Sadly there’s nothing quite as nice as https://pangeo-forge.org/ for GeoParquet data (at least that I could find; do let us know if you know better!). Instead, we’re planning to keep things in Python for the time being and use a combination of Dagster and Dask to orchestrate and parallelise our process. If you’re an expert in either of these and want to share your hard-earned knowledge, I’d love to hear from you.
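To make that a little more concrete, here’s a minimal sketch of how a Dagster asset and Dask might fit together. It isn’t our actual pipeline: the asset name, file paths and column name below are placeholders for illustration.

```python
import dask.dataframe as dd
from dagster import Definitions, asset


@asset
def combined_substation_data() -> None:
    """Tidy the raw exports into one partitioned Parquet dataset."""
    # Dask reads the raw files lazily and in parallel, rather than loading
    # everything into memory at once on one machine.
    ddf = dd.read_parquet("data/raw/*.parquet")  # placeholder path

    # A cheap example clean-up step; real validation will be more involved.
    ddf = ddf.dropna(subset=["substation_id"])  # placeholder column

    # Write the tidied dataset back out; Dask handles the partitioning.
    ddf.to_parquet("data/processed/substations")  # placeholder path


# Dagster discovers the asset from this Definitions object, and from here we
# can hang schedules, sensors and tests off it.
defs = Definitions(assets=[combined_substation_data])
```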

The first and most important step in our pipeline is to get hold of, and archive, all the raw data. To build our prototype I just clicked around the various data portals, but we want to automate that and download new data every day. Working this out has again shown me how much friction there is in getting hold of this kind of data. Every DNO runs different data portal software, each with its own API model you have to work through. Even when two DNOs use the same software, they can use it in different ways, so the API requests you need end up being different.
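The shape of the code ends up being a per-portal adapter along these lines. This is just a rough sketch, and the DNO names, URLs and dataset identifiers are made up for illustration:

```python
from datetime import date
from pathlib import Path

import requests

# Each portal needs its own URL pattern (and sometimes its own API model
# entirely), so that knowledge lives in one per-DNO mapping.
PORTAL_URLS = {
    "example-dno-a": "https://data.example-dno-a.test/api/datasets/{dataset}/export/csv",
    "example-dno-b": "https://portal.example-dno-b.test/download?dataset={dataset}&format=csv",
}


def archive_raw_dataset(dno: str, dataset: str, archive_root: Path) -> Path:
    """Download one dataset from one DNO portal and archive it by date."""
    url = PORTAL_URLS[dno].format(dataset=dataset)
    response = requests.get(url, timeout=60)
    response.raise_for_status()

    # Keep an immutable, dated copy of exactly what the portal served, so we
    # can always rebuild the downstream data from the raw archive.
    target = archive_root / dno / dataset / f"{date.today().isoformat()}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(response.content)
    return target
```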

All in all, there’s likely to be several weeks of work in this kind of “plumbing” before we can expand our data’s range and quality. Imagine if every one of the 40 people who’ve downloaded it so far had to make that kind of investment!

Steve
