Week 11 - Preparing for launch
Centre for AI & Climate
Connecting capabilities across technology, policy, & business to accelerate the application of AI to climate challenges.
Jon (Product lead):
Last week I explained the predicament we’d found ourselves in and said that the plan for the week ahead was to get ourselves out of it. Well, the short version of this update is that, thankfully, we managed to do just that.
In fact, we start this week with a clear plan to launch a prototype we’re really happy with next week.
It’s amazing how things can come together after weeks of feeling like we were taking one step forward and two steps back. I’m always surprised at how non-linear making progress can be. I’d go as far as saying that if your rate of progress feels like a straight line, you’re probably doing it wrong.
But anyway, we’ve landed on a concept for a prototype that is very much still a prototype, but is totally aligned to the grander vision and most importantly, will enable us to test our hypothesis.
In short, we’ve taken an incredibly valuable dataset - that is currently really hard to access because of its size, format, and the fact that it’s disaggregated - and made it so much easier to work with. We’ve done this by using a cutting-edge file format and structured the data so that it can be queried with a few lines of code and downloaded in a matter of seconds.
We decided to focus on the newly released smart meter data, which is being published by UK DNOs (Distribution Network Operators). We thought this was a good place to start because access to high resolution energy consumption data is a well known challenge, and due to its nature and inaccessibility, this dataset is currently being underutilised. Our aim is to meaningfully lower the barrier to entry for energy data analysis.
The dataset contains domestic smart meter consumption data at half-hourly resolution, aggregated at LV feeder level. It represents 100,000 LV feeders and 2,000,000 smart meters, and coverage is expected to increase steadily over the year until all smart meters are captured.
There is also talk of increasing the resolution and providing meter-level data, provided consumer privacy can be maintained. This dataset has the potential to be an absolute powerhouse, but as it increases in size and resolution, the data access issues will compound further.
We’re really excited to put this out into the open. We know it’s just the beginning and that there is a lot that needs to be improved. But we really do believe that it offers a leap in value and provides a user experience that is an order of magnitude better than what’s available today.
The plan for this week is to tie up all of the loose ends and gear up for launch. Thankfully there aren’t any major risks that could delay things, so it’s just a case of getting the work done now.
Look out for the launch next week!
Jon
Steve (Engineering lead):
This week felt like we really got back into the swing of things. The change of direction to look at some new file formats had a bit of a learning curve, but once I got up to speed, it felt like a really good decision. After lots and lots of reading documentation, trying different libraries and measuring file sizes, I think we’ve ended up with something quite compelling.
s3://weave.energy/smart-meter.parquet is a single ~700MB GeoParquet file, containing all of the data released by UK DNOs in Feb 2024. We put this data together back in July, as we were exploring the data, but we didn’t release it then, mostly because as a plain CSV, it runs to over 20GB! We’re obviously not going to stop at just February’s data, so it didn’t seem very practical or sustainable to be uploading hundreds of GBs of CSV. Even if you have the bandwidth, at that kind of file size, it’s not very easy to use the data unless you have a cluster of machines or an enormous amount of RAM.
The great thing about parquet files is that a) you don’t have to download the whole thing - you can do both predicate and projection “pushdown” to limit what you get and b) even once you do, it’s highly optimised in terms of compression and memory layout to load it straight into analysis tools like Pandas.
Parquet files have been a staple of analysis workflows for quite a while, but what we’re using here is a relatively new addition - GeoParquet - which allows us to include columns with geospatial information in them. We’re actually using the very latest version of the spec, 1.1.1, which not only allows you to include the geospatial data, but also to do predicate pushdown of bounding box queries. Sounds complex, but it basically means you can download a subset of the data by defining the geographic area you’re interested in. Even better, if you’re not interested in geospatial analysis, it’s backwards compatible, so you can open the file in regular old Pandas and just see the “geometry” column as a series of {x y} objects.
We’ve been putting together some examples of how you can access and use this data in this Jupyter notebook, as well as documentation of how we made it in this GitHub repo. We’d love any feedback you have, or to know if you find it useful. Just start a discussion on our GitHub repo!
Steve