Is “Agile Data” an Oxymoron?

In last week’s article, I discussed how you can “boil the ocean” when building a HOOK data warehouse. Unsurprisingly, the article prompted some interesting and rather robust responses. One response in particular, from Ronald Kunenborg, went into great detail about how this bulk approach violates the principles of Agile and, therefore, does not or cannot deliver value.

Ronald made many highly valid points that got me thinking. For a short time there, I felt that I might have missed the mark and that maybe, just maybe, I had been barking up the wrong tree. That personal crisis was short-lived.

In this article, I will explain why I believe some, maybe not all, aspects of a HOOK data warehouse are scalable and remain faithful to Agile principles. For this amazing feat, I will employ my favourite analogy that I have used countless times to describe the principles of HOOK. So, let’s talk about libraries.


Although libraries seem to be disappearing at an alarming rate, I’m sure most of you reading this article will have first-hand experience of a library. As a child, I was lucky enough to have one just across the road from where I lived, and I spent many an hour in there.

Previously, I talked about how a library is like a data warehouse in my substack article on the Business Glossary, but I’ll give you a quick reminder here. Data assets in the warehouse are analogous to books in a library. Books are stored in the library, and they have metadata attached to them that tells us, in broad terms, what each book is about. Books with similar topics are physically located near one another in a well-defined organising structure. If I’m looking for books on a particular subject, I can search for the topic in a central index and find out where suitable books are stored.

But how do the books get in there in the first place?

I will describe how that process might look if the library operated according to the agile principles of Data Vault and then, by contrast, how the library would work if it were run like a HOOK warehouse.


Data Vault

Data Vault’s adoption of agile principles is vigorously hailed in much of its promotional literature. In fact, I remember receiving some supplementary information on Agile as part of the CDVP2 training, although I do not recall needing to learn any of it to pass the exam.

Agile is all about delivering continuous incremental benefits to the business, where small bite-size chunks are developed and implemented in short sprint cycles. For this discussion, let’s assume that any business modelling is also performed within these sprint cycles.

Switching over to our analogy, what does the library look like on day one? There is a large open space with lots of shelving but no books. The library is empty.

Sprint 1

Our first customer (Sprint 1) walks in the door and is interested in biographies of 1970s rock bands because they want to research an essay they are writing for a school project. There aren’t any books on this (or any other) subject, so the library staff set about acquiring suitable books using their limited budget. They may order books from suppliers or publishers and must wait until those books arrive.

When the books arrive, the staff will categorise them and place them in suitable locations in the library. There are no predefined categories or organising structures for the library; the staff need to begin building one. What are the categories? There are many ways we could categorise these books, for example by:

  • Biographies
  • Era (1970s)
  • Music, or more specifically, rock music.

Are these three separate categories, a single category or something in between? It’s hard to say at this early stage with limited inventory.

Job done. The first sprint is complete, and there are now books in the library that the customer can use to help write their essay.

Sprint 2

A second customer walks in looking for British crime fiction. Again, there is nothing in the library on that subject, so the staff set to work and acquire the necessary literature. When the books arrive, where do they put them? We have some new categories:

  • Fiction
  • British
  • Crime

How do these categories stack up against the categories derived in the first sprint? We now have a ‘Fiction’ category, which probably means there will also be a ‘Non-fiction’ category. If that is the case, then is there not an overlap with the ‘Biographies’ category? Biographies are, by definition, non-fiction (most of the time). Maybe ‘Biographies’ is a sub-category of ‘Non-fiction’. Or perhaps it has its own category, and we keep those books separate.

The staff might decide to be proactive and consider whether the new books can be categorised by other topics. For example, maybe our crime fiction book can be organised by era, say, British 1950s Crime fiction. Suddenly, it becomes increasingly difficult to group the books, further complicated by the fact that the books might need to be physically relocated on the shelves.

Sprint N

Of course, as more and more books are brought into the library, these problems compound and managing the library becomes extremely difficult. The staff will spend more time shuffling the books around and will have less time to go and acquire new books. Furthermore, the library customers will find this constant moving of books makes their experience unpleasant, and they may lose interest.
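To make the re-shelving cost concrete, here is a minimal Python sketch of the analogy as described above (it models the library story, not Data Vault itself). It assumes that a book's shelf slot is derived from the current category scheme, so changing the scheme can move books that were already shelved. The function names, attribute names and book titles are all hypothetical.

```python
def shelve(books, scheme):
    """Assign each book a shelf slot by sorting on the category scheme.

    `books` maps title -> dict of attributes; `scheme` is the ordered list
    of attributes the shelves are organised by (both are hypothetical).
    """
    def sort_key(title):
        return tuple(books[title].get(attr, "") for attr in scheme) + (title,)

    ordered = sorted(books, key=sort_key)
    return {title: slot for slot, title in enumerate(ordered)}


def relocations(old, new):
    # Books that were already shelved and whose slot differs in the new layout.
    return sorted(title for title in old if old[title] != new[title])


# Sprint 1: two rock biographies, shelves organised by topic only.
books = {
    "Rock Biography A": {"kind": "Non-fiction", "topic": "Music"},
    "Rock Biography B": {"kind": "Non-fiction", "topic": "Music"},
}
sprint1 = shelve(books, ["topic"])

# Sprint 2: a crime novel arrives and the scheme gains a fiction/non-fiction level.
books["Crime Novel C"] = {"kind": "Fiction", "topic": "Crime"}
sprint2 = shelve(books, ["kind", "topic"])

moved = relocations(sprint1, sprint2)
print(moved)
```

In this toy run, both existing biographies change slot when the scheme changes, even though nobody asked for them to move. That is the shuffling the staff end up doing in every sprint.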


HOOK

How would the library operate if we took a HOOK-oriented approach?

At the start, the aim is to get as many books into the library as possible (budget permitting). The library staff would place bulk orders for a broad range of publications. When the books arrive, the library staff place them on the shelves in the most convenient location. At this stage, the books are not organised in any way, but the location for each of them has been recorded. From a customer’s perspective, this is unhelpful. There is little point in having all those books if you can’t find any of them.

When the books arrive, there will most likely be some form of manifest listing all of the books along with some useful information about each (metadata). If we’re lucky, there will be enough information to categorise each book in broad terms. It might not be perfect, but it will allow customers to find the types of books they are looking for.

Sprint 1

Now, we have a bunch of books in the library that are roughly categorised. The first customer comes along asking for biographies of 1970s rock bands. The chances are there won’t be specific categories to precisely identify those books, but there will probably be some broader categories that get us close. There will most likely be categories of 'Biographies' and 'Music', and the library staff can provide a list of those books to the customer. Will that list contain biographies of 1970s rock bands? Highly likely.

Once the customer has chosen the books they want, we can refine their categorisation and add new categories where necessary. The customer experience, although not perfect, is certainly quicker than the equivalent Data Vault-like process.

Sprint 2

The second customer arrives looking for British crime fiction. 'Fiction' and 'Crime' are most likely categories the library already knows about from the initial delivery manifest, so the staff can provide a list of books (and their locations). The concept of ‘Nationality’ might be new, so the library staff can add the new category to the central book catalogue.

Even though, in both Sprints 1 and 2, the categorisations for books can change, the location of the books always stays the same. There is no need to shuffle the books around.
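The separation between where a book sits and how it is categorised can be sketched in a few lines of Python. This is only an illustration of the library analogy, not HOOK tooling: the `Library` class, its method names and the book titles are all hypothetical. Each book gets a shelf slot once, on arrival, and categories live in a separate catalogue that can be refined in later sprints.

```python
from collections import defaultdict


class Library:
    """Toy model: shelf locations are fixed, categories are only metadata."""

    def __init__(self):
        self.locations = {}                # title -> shelf slot, assigned once
        self.catalogue = defaultdict(set)  # category -> set of titles

    def receive(self, title, categories=()):
        # Shelve the book in the next free slot; it never moves afterwards.
        self.locations[title] = len(self.locations)
        self.tag(title, *categories)

    def tag(self, title, *categories):
        # Refining the catalogue touches metadata only, never the shelves.
        for category in categories:
            self.catalogue[category].add(title)

    def find(self, *categories):
        # Titles carrying every requested category, with their shelf slots.
        if not categories:
            return {}
        matches = set.intersection(
            *(self.catalogue.get(c, set()) for c in categories)
        )
        return {title: self.locations[title] for title in sorted(matches)}


library = Library()

# Bulk load, roughly categorised from the delivery manifest.
library.receive("Rock Biography A", ["Biographies", "Music"])
library.receive("Rock Biography B", ["Biographies", "Music"])
library.receive("Crime Novel C", ["Fiction", "Crime"])
before = dict(library.locations)

# Sprint 1: broad categories are enough to get the customer close.
sprint1 = library.find("Biographies", "Music")

# Sprint 2: 'British' is new, so it is added to the catalogue.
library.tag("Crime Novel C", "British")
sprint2 = library.find("Fiction", "Crime", "British")

print(sprint1, sprint2, library.locations == before)
```

Adding the 'British' category changes what the catalogue can answer, while the recorded locations are identical before and after.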

Sprint N

Future sprints work precisely the same way, with the organising structure of the library constantly updating and improving. Of course, there will be occasions when the customer cannot find any suitable books, in which case the library staff may need to acquire some new stock. Once received, these new books will be categorised in the same way as before.


Conclusion

Although the end goal for Data Vault and HOOK is the same, the approaches are quite different. Are they both agile?

For Data Vault, I would have to say that it is. For HOOK, the answer isn’t as clear-cut. The initial ‘bulk’ load of data is not agile because there is no immediate customer benefit. However, once in place, we can operate in a more agile way, iterating much faster than an equivalent Data Vault warehouse.

When you also consider that the modelling itself is much more flexible and nimble, there is a compelling argument to say that HOOK is more agile than an equivalent Data Vault data warehouse.

Disagree? Well, there is only one way to find out. Put the two side-by-side and see for yourselves. Any takers?

Shane Gibson

Agile Data Coach. I help data teams change their Way of Working, to deliver more, in less time, while having more fun! | Reach out if I can help your team improve the way they work.

1 year ago

#AgileDataWow “‘Agile Data’ is, perhaps, an oxymoron.” Them’s fighting words!

Nick Pinfold

Principal Data Analyst at Wellington City Council

1 year ago

This discussion is about technical solutions, not about solving business problems. If the book borrower needs to do the research and report the results within the week, and the order process takes two weeks, then this will not meet business requirements. Back to data: if you have a template approach to acquisition, you spend money to set up the template so that data can be acquired immediately. This approach requires you to get the metadata upfront so it can be searched. The problem with not pre-acquiring the data is that you cannot analyze the data, model it or use it for analysis. If you automate the acquisition, metadata and release processes, and your storage costs are cheap, it's a no-brainer to acquire internal source data. In my previous role, we had a metadata store; we acquired the metadata from our internal data stores every night and made it searchable via a dashboard. This gives visibility of what data is available. Acquisition was an extract, standardize and load to SCD2 or transaction tables. As the process was standardized, it was automated. Having data available speeds up the analysis and modeling, making that process quicker, more reliable and thus cheaper. As for the costs of moving data: just like moving a library, it will cost.

Ronald Kunenborg

Enterprise Data Architect, data expert, coach and data modeller

1 year ago

I think it's a good basis for further discussion on this topic. The library example also illustrates several of my issues with this approach. First off, if you are loading a ton of books into your library that you don't know anyone will need, what is the cost? Now, I'm sure books have become cheaper, but they are still pricey. And if you say, "sure, but data is free", then I have to disagree: you're just not yet aware of all the hidden costs. Storage may be getting cheaper all the time, but try to egress that data to a different location and you will soon find out how expensive storage can become. It's not just the storage itself, it's also the cost of inventory management. Hence the Agile principle: don't build inventory. Also, the risks. Suppose you live in Florida and the library is a school library, and you load in all those books without checking them. How will that work out for you if someone finds you just included the complete works of Marquis De Sade, Mao, Stalin and the Kama Sutra Kindergarten edition by accident? People are far too trusting of their data. I'm treating it like toxic waste until proven harmless. But yes, there can be a use case for loading all the data. It's just far rarer than commonly assumed, IMO.

Niko Korvenlaita

Dad | Husband | Co-founder & CTO @ reconfigured

1 year ago

Didn’t see the original comment, but this makes me laugh and cry at the same time: “this bulk approach violates the principles of Agile and, therefore, does not or cannot deliver value.” This kind of dogmatism is one of the biggest problems blocking value.
