Week 17 - SSEN smart meter data deep-dive

Jon (Product lead):

It's a combined weeknotes this week because we've been working together to better understand the smart meter consumption data, after a few bugs and inconsistencies started to come to light.

This week's edition is a few days late because we wanted to get to the bottom of what we found. But I'll let Steve describe the detail below.

(It's worth noting that we started with a deep dive into SSEN's data because it is generally the most up to date and covers the largest number of substations. We've already raised these potential issues with them, and they were incredibly receptive and have taken them away to investigate.)




Steve (Engineering lead):

This week I’ve been really digging into SSEN’s data and discovering a few bugs along the way, both in my code and, potentially, in theirs.

Bug #1 - Substation locations

As I said I was planning last week, I’ve been building on the basic pipeline I’ve got running with raw SSEN data, trying to recreate the work we did to add substation locations to it. We think being able to locate the consumption data spatially is super important, so it’s a shame SSEN don’t include these directly like other DNOs do. However they do publish it in other places, so we can try to stitch things together to make it more useful. After all, this is kind of what Weave is all about!

For our prototype with February’s data, we did this in a really quick and hacky way:

  1. We built a list of postcodes served by each substation from the postcode -> feeder mapping SSEN publish
  2. We looked up the “centroid” lat/lng location of each of those postcodes from the Office for National Statistics’ Postcode Directory
  3. We then took the “centroid” of all those points as the approximate substation location
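
To make those three steps concrete, here's a minimal sketch in plain Python. It isn't the actual prototype code: the function name and both input shapes are made up for illustration, and it takes a simple average of latitudes and longitudes, which is only a reasonable approximation over small areas.

```python
from collections import defaultdict


def approximate_substation_locations(postcode_to_substation, postcode_centroids):
    """Estimate each substation's location as the mean of its postcodes' centroids.

    postcode_to_substation: iterable of (postcode, substation_key) pairs
        (hypothetical shape, standing in for SSEN's postcode -> feeder mapping).
    postcode_centroids: dict of postcode -> (lat, lng)
        (hypothetical shape, standing in for the ONS Postcode Directory).
    """
    # Step 1: collect the postcode centroids served by each substation.
    points = defaultdict(list)
    for postcode, substation in postcode_to_substation:
        # Step 2: look up the postcode's centroid, skipping unknown postcodes.
        location = postcode_centroids.get(postcode)
        if location is not None:
            points[substation].append(location)

    # Step 3: average the points. A plain mean of lat/lng is an assumption
    # that is fine for a cluster of nearby postcodes, not for large areas.
    return {
        substation: (
            sum(lat for lat, _ in locs) / len(locs),
            sum(lng for _, lng in locs) / len(locs),
        )
        for substation, locs in points.items()
    }
```

Any postcode missing from the lookup is silently skipped here, which is a design choice you'd probably want to log in a real pipeline.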

For the most part, it seemed to work pretty well, so I started rebuilding this “properly” in our pipeline. I promise I’m not on a commission from Dagster to sell their tools, but it does offer a nice visualisation of this in their “asset lineage” graph:


However, I started to feel like we should be able to do better than an approximate centroid based on postcodes, so I started poking around in SSEN’s open data portal again, and visualising the data we already have. This is where I found our first bug.

When we matched up the postcodes to substations, we had to find some way to identify a substation that would work across both the mapping file and the smart meter data. We’re locating substations because they’re the actual physical asset which has a point location - the feeders are basically just the wires that come out and go to individual homes.

The raw smart meter data has two columns we can use for this: secondary_substation_id and secondary_substation_name. In SSEN’s data at least, the id is not unique on its own. The values are simple integers like 20, 30, 80 which get repeated a lot, for clearly different substations. What I erroneously assumed though was that if you combined it with the substation name, it would be unique. It unfortunately turns out that there are rather a lot of “580-MANOR FARM”s and probably a whole host of other dupes too.

What does this mean for our substation locations? Unfortunately, in the February data we’ve already published, it means any substation which doesn’t have a really unique name is probably in the wrong place. This map illustrates:


The blue shapes are the area covered by all the postcodes we’ve assigned to a particular substation. Where it has a unique name (looking at you PIDDLETRENTHIDE VILLAGE) the shape is small and the location is likely pretty good, give or take a few metres. Where it’s not, like with MANOR FARM, we’ll create a massive area that covers all the MANOR FARMs across SSEN’s licence areas and then pick the centre of that as the location. D’oh!
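
One rough way to spot these without eyeballing a map is to measure how far apart the postcodes assigned to each substation are. This is only a sketch, not what we actually ran: the function name, the input shape and the 0.05 degree threshold are all assumptions picked for illustration.

```python
def flag_widely_spread(points_by_key, max_span_degrees=0.05):
    """Return substation keys whose assigned postcodes cover a suspiciously large area.

    points_by_key: dict of substation key -> list of (lat, lng) postcode centroids
        (hypothetical shape).
    max_span_degrees: illustrative threshold, not a tuned or published value.
    """
    flagged = {}
    for key, points in points_by_key.items():
        if not points:
            continue
        lats = [lat for lat, _ in points]
        lngs = [lng for _, lng in points]
        # Width or height of the bounding box, whichever is larger.
        span = max(max(lats) - min(lats), max(lngs) - min(lngs))
        if span > max_span_degrees:
            flagged[key] = span
    return flagged
```

A key that really is one substation should have a tiny bounding box, so anything flagged is a candidate for being several substations that share an id and name.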

Luckily, we can fix this. After a chat with SSEN, we learnt that the dataset_id field in their data is actually a kind of compound id, made up of their internal Network Reference Numbers (NRNs) for each individual asset (substation, feeder, etc). So something like 000200200402 breaks down into:

  • Primary substation 0002
  • High voltage feeder 002
  • Secondary substation 004
  • LV Feeder 02

That means we can take 0002002004 as the substation’s NRN and have a proper unique identifier. What’s more, SSEN do publish exact substation locations, identified by NRNs, for at least their Southern England Power Distribution area, so we can improve the accuracy of our locations too.
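
Here's a small sketch of that breakdown in Python. The field widths (4, 3, 3 and 2 digits) are inferred from the single 12-digit example above and the 10-digit substation NRN, so treat them as an assumption rather than SSEN's documented format. The class and function names are made up for illustration.

```python
from typing import NamedTuple


class NetworkReference(NamedTuple):
    primary_substation: str
    hv_feeder: str
    secondary_substation: str
    lv_feeder: str

    @property
    def substation_nrn(self) -> str:
        # Everything except the LV feeder identifies the secondary substation.
        return self.primary_substation + self.hv_feeder + self.secondary_substation


def parse_dataset_id(dataset_id: str) -> NetworkReference:
    """Split a dataset_id into its parts, assuming fixed widths of 4/3/3/2 digits."""
    if len(dataset_id) != 12 or not dataset_id.isdigit():
        raise ValueError(f"expected a 12-digit dataset_id, got {dataset_id!r}")
    return NetworkReference(
        primary_substation=dataset_id[0:4],
        hv_feeder=dataset_id[4:7],
        secondary_substation=dataset_id[7:10],
        lv_feeder=dataset_id[10:12],
    )
```

Note that the id has to stay a string all the way through: reading it as an integer would drop the leading zeros and break the slicing.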

Bug #2 - The July Dip

The second bug we found was a different kind. Since we had a longer time range of data, Jon was keen to flex his previous skills as an energy data analyst and take a look at it for interesting trends. As he’s talked about before, we think we need to highlight some of the potential of this data since it’s so new, as well as just make it available.

To enable him to do that, I plucked a random sample of 100 feeders from the full SSEN dataset, covering February all the way up to a few days ago in October. With that subset, he was able to produce all kinds of visualisations, which led him to notice something strange. In the middle of July, there was a huge dip in energy consumption across basically every feeder. See the fuzzy dark stripes in this image:

At first, we thought there must be a bug in the data pipeline. Maybe I’d not downloaded all of the days properly, or the downloads had gotten truncated somehow? Going back to the source though you can see that wasn’t the case - the raw data files for those dark days in July are 20-30% smaller - there’s just less data. It turns out that those files are missing data for thousands of feeders - over 23 thousand to be precise. It’s even clearer when you look at it like this:


This isn’t something we can fix so easily, but it’s something we’ve flagged to SSEN so they can take a look. We’ve also documented it here: https://github.com/centre-for-ai-and-climate/weave/blob/main/experiments/ssen-july-dip.ipynb so you can see our working.
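
If you want to run a similar check on your own copy of the data, the idea of counting how many distinct feeders report on each day can be sketched like this. It's a simplified illustration rather than the code in the notebook: the function names, the input shape and the 10% tolerance are all assumptions.

```python
from collections import defaultdict
from statistics import median


def feeders_per_day(rows):
    """Count the distinct feeders that reported on each day.

    rows: iterable of (day, dataset_id) pairs (hypothetical shape).
    """
    seen = defaultdict(set)
    for day, dataset_id in rows:
        seen[day].add(dataset_id)
    return {day: len(ids) for day, ids in seen.items()}


def days_with_missing_feeders(counts, tolerance=0.1):
    """Return days whose feeder count is well below the typical (median) day.

    tolerance: illustrative threshold; 0.1 flags days more than 10% below the median.
    """
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n < typical * (1 - tolerance))
```

Comparing against the median rather than the mean means a handful of bad days don't drag the baseline down with them.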

This kind of stuff is what we hoped to find and figure out when we decided to invest more time in building this data pipeline, so it’s great that it’s coming to light - the data quality of Weave is only going to improve the more of these bugs we find.
