Demystifying Data (for Beginners)...

Demystifying Data (for Beginners)...

I started my career as a BI Analyst, a creator of reports, a writer of queries. And now, as a Business Analyst, I seem to have migrated into data as a specialism. I’ve done all sorts of projects, but the number of data projects massively outweighs all the others. If you’ve also done a fair bit of data work, then this post isn’t for you. This post is for all the people, other BAs sometimes, Project Managers, all sorts of other people, who tell me that they get a bit confused by data. They don’t really understand it that well. They get a bit lost when tech people start talking about it. As a Business Analyst, a significant part of my job is to try and make things easy to understand. So that’s what I’m going to try and do here.

Like I said - if you already have a good understanding of data and think you’re going to find the following a bit patronising, then do yourself a favour and go read something else…

The Abstract Database

Forget about No SQL databases for now, forget about Operational Data Stores, we’ll come to them later. For now, let’s just look at a bog standard database, the type which comprises about 90% of the databases in the world. You might hear them called ‘relational databases’. I recently ran through an explanation of these when trying to teach someone SQL, so it might be useful for someone else.

Imagine, for a minute, you live in a world without databases. All you have is Excel. Everything you want to record, about anything, has to be done in an Excel spreadsheet. You run, I don’t know, a video store (remember them?) and you need to keep a track of your customers, the videos you own, and who has which video at any particular point in time. The best way to do this is via three spreadsheets. The first has a list of your customers in it, their names and addresses and phone numbers, maybe some additional information about who else in their family (their husband/wife/kids) can take videos out in their name. The second has a list of all the videos your shop owns - the title of the film, how many copies you have, a value to say how many copies are back in your shop or out with a customer, maybe how much it costs to replace if it gets damaged. The third spreadsheet is a list which combines a customer with a video and maybe a couple of dates to show when the customer took that video out and when they need to bring it back.

If you want to know where a video is at any particular time, you have to look at the video spreadsheet and find the ones that indicate a video is out with a customer. Then you have to look at the rental spreadsheet and find the last record for that video and which customer it is linked to, and then go and look at the customer spreadsheet to get the contact details of that customer so you can phone them up and tell them you want the video back.

Your spreadsheets are, to all intents and purposes, database tables. This is exactly what a set of database tables designed to do the same thing would look like. The benefit a database gives you, over your spreadsheets, is that instead of looking up values from each table and then going to find the corresponding records in another table, you can write a query that does that work for you. You can join your three spreadsheets in the query and get it to retrieve for you all the videos that are out at the moment, which customers these are out with, and the telephone numbers of these customers.

Normalisation

You could do everything I described above in one big spreadsheet, but I suggested using three because you don’t want to have to type out the same information again and again every time a customer takes out a video. You don’t need to record the address and phone number of the customer, and the cost and distributer of the video every time, right? It’s the same address, it’s the same video. You only need the information about these things once. You can go and look them up if you need to. So you put those things in their own sheets. As long as there’s a way to tie the customer renting the video back to their address, and tie the video they’ve taken out to the additional info you have stored for that video, that saves you time in the long run.

The first record you have in your customer spreadsheet is Bob Jones. The first video you have in your video spreadsheet is Ferris Bueller’s Day Off (I know, I’m going full on ‘80s now). When Bob Jones takes out a copy of Ferris Bueller’s Day Off your rental spreadsheet has to have entries on it which say “Bob Jones” and “Ferris Bueller’s Day Off” to tie the two things together. In a database, though, this is a little bit inefficient. Those queries you can write work better and faster with numbers than they do text. Also, you might end up with two people called Bob Jones and you need to be able to tell them apart. So what you do in your customer spreadsheet is add a customer number, and this Bob Jones becomes customer 0001. In your video spreadsheet, you give Ferris Bueller his own unique stock number, let’s say it’s 123456. Your rental spreadsheet/table no longer then has to contain the values “Bob Jones” and “Ferris Bueller’s Day Off”, it contains instead customer ID 0001 and stock number 123456. That way you can still associate the video to the customer, the right customer, your query will work quicker, you won’t find duplicate records, and you don’t need to keep storing long text strings over and over again in different places (thereby reducing the size of your database).

In a nutshell, that’s normalisation.

Primary Keys and Foreign Keys

The ID fields I just described above, the customer number and the stock ID, these are the primary keys of the tables I created them in. All that means is, they are the unique identifying value (usually a number, but not always) for each individual record in that table. They can’t duplicate. They have to be unique. Meaning each record in that table is a distinct record, different from every other record.

A foreign key is effectively the appearance of the primary key of a particular table in another table to enable you to link those two tables together. In our video store example, the rental table contains two foreign keys - the primary key of the customer table and the primary key of the video table. The rental table would typically have its own primary key too, which uniquely identifies that rental, that particular combination of the other two keys on a given date to show the customer/video rental transaction.



To extend our video database out further, you might also store a customer’s title. As above, you don’t want to keep writing “Mr” or “Ms” or “Professor” every time you type a new customer in. So you have a table called Titles which contains the text values and an ID field: Mr = 1001, Ms = 1002, Professor = 1003 etc. Your customer table can then contain the primary key, which for Bob Jones is customer number 0001, and a foreign key for his title which is 1003 and tells you, by lookup to the Titles table, that he is “Professor” Bob Jones.



SQL Queries

You can skip this section and the next if you like. Do you care about SQL? I think it’s interesting, and it gives context to my No SQL bit further on. But feel free to read, scan-read or skip, it’s up to you.

In any case, don’t be put off by SQL, it is easy. To clarify: it can get quite complex, there are lots of different things you can do with it, but the basics, which are all you’ll ever need about 95% of the time, are really pretty simple. I’m going to use the older syntax below, which DBAs and such like will sneer at, but we’re going for simplicity and this is the simplest way. There are five parts to a SQL clause you need to remember, and you can use just two of them for the simplest queries, will need three of them most of the time, and can add the fourth and/or fifth if they’re required:

  • SELECT
  • FROM
  • WHERE
  • GROUP BY
  • ORDER BY

SELECT tells the query what it is you want to retrieve. Ie. field names. The simplest form of this is SELECT * which means ‘select all the fields’. In our video store database, I might want to select some customer information so I might write SELECT Forename, Surname, Postcode, PhoneNumber. Separated by commas you'll notice (but not with a full stop at the end).

FROM tells the query where you want to retrieve from. Ie. table names. To continue the example from above, I write FROM Customer (as this is what I called my customer table in the database).

WHERE tells the query about any filter conditions you want to apply. If I don’t add this, and just use the two lines I’ve written above (SELECT Forename, Surname, Postcode, PhoneNumber FROM Customer) I will get back every record in the Customer table. Here I can filter out the ones I’m not interested in. So let’s say I’m only interested in customers with the surname Jones, I would write WHERE Surname = “Jones”. I could just as easily look at only those customers that have a specific Postcode, WHERE Postcode = “L11 1AU”. I could look for both people called Jones who are also in that postcode by writing WHERE Surname = “Jones” AND Postcode = “L11 1AU”. And I could look for people who are either called Jones or live in that postcode by writing WHERE Surname = “Jones” OR Postcode = “L11 1AU”.

I could also use different functions here like ‘does not equal’, ‘is greater than’, ‘contains the characters’ etc. etc. But I won’t go into them, you can find info on which ones you can use literally anywhere on the internet.

Let’s leave GROUP BY for now and come back to it in a minute. You use this if you want to calculate something in your query. Typically, there’s nothing to calculate from a simple customer lookup so we’ll cover this separately.

ORDER BY lets you, funnily enough, order your results. You can do it in ascending or descending order on any field, or combination of fields, you’ve selected. Let’s say, for example, I know there are a load of customers called Jones. The one I’m looking for though, I can’t remember if it’s Kevin or Keith. I could run my query SELECT Forename, Surname, Postcode, PhoneNumber FROM Customer WHERE Surname = “Jones” and then add ORDER BY Forename ASC. (If you don’t specify ascending or descending, SQL always assumes ascending by default). That might give me 20 records, in alphabetical order on forename, I can scan down to the K’s, I can see there is no Keith, but there is a Kevin. Bingo.

Table Joins

I want to write a more complex query now. I want to know, if Kevin Jones doesn’t bring back the two videos he has, what will it cost me to replace them. And how does that compare to the value of all the videos he did bring back. To do this I need to join all three of my main tables together and I need to do a calculation. Let me show the SQL first and then I’ll explain it:

SELECT V.In_Shop, SUM(V.Value)

FROM

Video V,

Rental R,

Customer C

WHERE

V.Stock_ID = R.Stock_ID

AND R.Customer_No = C.Customer_No

AND C.Surname = “Jones”

AND C.Forename = “Kevin”

GROUP BY

V.In_Shop

Ok, so what I’ve actually done is use the WHERE clause not just to apply filter conditions (like the surname and forename) but also to join primary keys to foreign keys across tables. Remember my little table diagram? The syntax above is how I tell SQL about those primary and foreign key links.



This allows SQL to do the work of looking up the values I need from the different tables rather than me having to do it myself. You can see in the FROM clause I’ve listed multiple tables, not just one, and I’ve given them a pseudonym - C, V or R. This is so when I type the field names into the SELECT and WHERE clauses I can prefix them with the pseudonym and the query knows which of the multiple tables I’m talking about. I could have typed out the whole table name - ie. customer.surname rather than c.surname - but pseudonyms make it quicker to type.

V.In_Shop is a value in the Video table which tells me if a title is in the shop at the moment or not. I’ve returned this so my result shows the videos associated to Kevin Jones that are both in the shop again, and not in the shop at the moment. And I did a sum of the value of the videos that fall into these categories as SUM(V.Value). This should tell me the combined value of all the videos that Kevin has brought back, and the combined value of all the ones he hasn’t.

I’ve used GROUP BY because you need this when calculating something. It essentially tells SQL the name of the field you want to present the results by. Ie. a sum of something broken down by what? Think of it as the X and Y axis on a graph. The Sum is the Y axis, the field you GROUP BY is the X axis. So you can get things like the number of sneezes I’ve done, by day. Or the total money the company has made, by month. In this case I’ve said value of all the videos associated to Kevin Jones, by location (in the shop or not in the shop).

Does that make some kind of sense?

I won’t go into anything else here on SQL. There are loads of other things you can do with it. It can get as complex as you want it to be in all honesty. I’ve written SQL queries that when printed out have run onto thirty or forty pages of paper. Queries so complex than whenever I had to amend them I’d have to spend half a day first trying to work out how I originally wrote them. For the purposes of simple data understanding though, we don’t need to go into any of that nonsense.

Cardinality

Wow, this sounds like a technical concept. It isn’t, honestly. All it means is the nature of the joins between your tables. There are four types of joins: one to one; one to many; many to one; many to many. What on earth does that mean? Well, let’s go back to my table diagram again.



There is only one Bob Jones, or to be more precise, there is only one Bob Jones identified as Customer Number 0001. But Bob Jones comes in and rents videos all the time. So there are lots of records in the Rental table that belong to Bob Jones. One time he took out Ferris Bueller’s Day Off, as we know, another time he took out The Breakfast Club, then he rented Home Alone 2 (and didn’t think it was as good as the first one). The relationship between Bob’s Customer record, of which there is only one, and his Rental records, of which there are many, is, perhaps unsurprisingly, ‘one to many’.

Remember the relationship we created between the Customer table and a Titles table?



Well, Bob can only have one title. He can’t be Professor/Mr/Mrs Bob Jones, he can only be one of these. And there is only one instance of the title he has in the Title table - only one record for ‘Professor’, and similarly only one record for ‘Mr’, ‘Ms’, ‘Mrs’ etc. So you would be forgiven for thinking this means the relationship between these two tables is ‘one to one’. But actually it isn’t. It’s ‘many to one’, and the reason for that is that there is indeed only one instance of each title in the Titles table, but multiple customers will have the same title. So there are many instances of it in the Customer table, across different customer records.

You can have one to one relationships, in systems where a customer has maybe an account record and a bank account record (and you only allow one bank account per person). You can have many to many relationships, although these are rare. They’re rare because they are messy and difficult to query. Under normalisation, a good data modeller will always try to avoid many to many relationships and structure tables deliberately to try and eliminate them. Sometimes, they can’t, and so these things do exist as exceptions.

Anyway, when we say ‘cardinality’, this is what we mean. Those one to one/one to many/many to one concepts. It’s not really any more complicated than that. But if you did read the SQL section previously, it will probably already be dawning on you that knowing these relationships is critical when writing queries. You don’t really want to write a query that is supposed to return one record per account and then join to a table with a one to many relationship because suddenly that one record per account will become multiple records per account. (Incidentally, that’s often called ‘inflation’ of your data). People talk about left joins and right joins, inner joins and outer joins. I won't go into these as it's more information that you need right now, but suffice to say these just mean ways of handling those relationship types in your SQL so you get the result you want.

Enterprise Data Models

I have a colleague who will, if you’re lucky enough, take you through his history of Data Warehousing. It takes about 90 minutes. It's a thrilling ride and not at all tedious. I’m not going to do that here as this post is already going to be the longest one I’ve ever written. Suffice to say, companies typically put data in a Data Warehouse so you can run all the SQL queries you want without impacting the performance of the main database this stuff comes from. You don’t want to be slowing down your customer database with stupid queries when three hundred call centre staff are trying to use it to manage customer accounts in real time. Because of this, the thinking goes, wouldn’t it be even better than instead of just copying the data across as is, we actually modelled it instead in a way that would make reporting on it easier for everyone involved?

My simple video store database example breaks down a bit here because that’s so straightforward there’s not much you can do to simplify it. But imagine a really complicated database with loads of different types of records in it. Then imagine the business has about six of these that have built up over time, and via different company acquisitions, so it has ended up managing different sets of customers in several different places. What you can do, instead of relying on anyone who wants to report on customers remembering there are pockets of them all over the place, put them all together in your data warehouse. Standardise the data for reporting.

You’ll hear people talking about Facts and Dimensions. There is some complexity around these concepts here and there, but to simplify it, a fact is a thing - a number, a count, a sum - and a dimension is the thing you want to present it by - a date range, a geographical area, an account type. It’s those X and Y axis things again. Data Warehouses these days typically model source data in such a way as to align it to the general reporting principles and KPI concepts that a business has. Precisely so you can easily report a sum of sales by a date range, or an average age of customer by the area of the country they live in or the number of times they buy a certain product.

I’ll try and give a better example. I worked on a Data Warehouse project for a company that sold lots of different products. Home Insurance, Car Insurance, Travel Insurance, a ton of other things. We captured different data about a customer depending on what the customer bought. But we were able to abstract out general concepts and model the data to fit these concepts. For example, in every case there was a Customer. So we created a Customer table (or Dimension) which contained things like their name, age, gender etc. (The ages and genders were things that we then used to create counts and averages that became the Facts that we could report against the Dimension). In most cases they provided us with an address, so we were able to model a Property table (or Dimension) with address, utilities meter reference numbers (if we had them from the sale of gas and electricity supply), lock types (if we had them from a Home Insurance sale). Some Customers had a car, so we could model a Vehicle table (or Dimension) containing details about the car/motorbike/van/mobile home etc. In the operational databases this information was scattered all over the place. In our Data Warehouse we could pull it all together and put it in a structure that anyone could wrap their head around.

That’s all an Enterprise Model really is - a logically structured dataset that turns your data asset into something that the wider business can use easily. Building it, however, is another matter. Because that takes a lot of painstaking work to analyse each little bit of data you can find across all your different systems, to understand what it is and what it means, and to abstract it into one of the concepts you think the business will find useful. You can’t do this via shortcuts. You HAVE to do the painstaking sifting and sorting. If you don’t, then you don’t stand any chance of getting it right.

No SQL Databases and Operational Data Stores

As I’ve tried to explain, most databases are relational in nature, with tables you can join up and report from using SQL. Data Warehouses are at the extreme end of this as they’re designed specifically with this in mind above all else. Most other databases are designed to suit this purpose the best they can and support an operational system. But these types of databases in general are not strictly the best when you want to run real time processing and get responses really really quickly. Running a SQL query can take some time, from several seconds to several minutes, if you’re joining and trying to read and return results from lots of different tables. If you’re looking to buy something from Amazon, for example, and you want to click on an image to get a product description, that website will also be running queries to return the data to display that description. You don’t want to click the image and then wait for four minutes for the query to run. Amazon don’t want that to happen either because you’ll get fed up, switch your browser off and go and do something less boring instead (to quote a kid’s TV show from the ‘70’s). They, and you, want the result to be almost instantaneous. Certainly no more than a second or so at the very most. This is where No SQL databases and Operational Data Stores come in.

These are databases that are designed so they carry only the data needed for the website or the system they support, and they hold it all in one place (one big spreadsheet if you remember where we started on normalisation, although typically the data isn’t stored in a standard table format, usually more of a JSON or XML type format). This way the query can execute really really quickly and nobody gets unduly upset. You can’t join tables in a No SQL database, they’re useless for reporting, but that’s the point - they aren’t built for reporting. They’re built to fulfil a different and very specific purpose.

I’m not going to go into all the different ones here. I only have experience looking at a few of them anyway - Mongo DB mostly. There are a load of others. They all work slightly differently, but for the level we’re aiming for here, that’s not too important.

Data Migrations and ETL Explained

ETL stands for Extract, Transform and Load. All it means is, when you write data to a Data Warehouse, or when you migrate data between two systems, you extract it from the source, you make any changes to it that you need to make, and then you load it to the target database. As I explained with Data Warehouses, you’ll probably want to do loads of transformation in this journey as you standardise all your data into the business friendly concepts we discussed. The same will apply for data migrations. For example, your old system stores dates of birth in the format dd/mm/yy. Your new system, the one you are migrating your data to, needs it to be in the format yyyy/mm/dd. One of your transformation rules will be to change the format of these values as the data moves between the two databases.

I’ve done quite a lot of data migrations now and they’re really not that difficult. They’re labour intensive, but not complicated as a rule. It really just involves a lot of gap analysis, and mappings between your source data and your target data. I’ll try a simple example. I want to move my homemade video store database onto a shiny new Oracle database I’ve bought. I look at the Customer table as it exists in both of these. Mine just has Customer Number, Title Code, Forename, Surname, Address (as one value), Postcode, Phone Number. That’s it. The Oracle one requires Customer ID, Title (as a text value), Forename, Middle Initial, Surname, Address Line 1, Address Line 2, Town, County, Postcode, Country, Phone Number, Email Address.

I don’t have all the values the Oracle table needs in my source. Some of them I do, old Forename = new Forename, old Surname = new Surname. Those are easy. I don’t have any initials, so I put a rule in to set all of these as blank. Old postcode = new postcode, but old address needs splitting out via rules into new Address Line 1, new Address Line 2, new Town and new County. I don’t have Country, but all my customers live by my shop so I can put a rule in to set them all as “UK”. I have to derive the Title value from the code my table contains and populate that value in the new table instead. You get the idea.

The complexities come in when the new database needs phone numbers to all have an international prefix and no leading zero, while my database has a mix of mobile numbers and landlines, some with leading zeroes, some without. These are the things that require decisions making on them and some slightly more complicated rules defining in the Transform bit of your ETL. But they’re not overly complicated, just a bit fiddly.

What Else?

Well, I could probably go on all day but I’m tired now, and so are you. What I’ve tried to do above is to give some context and explanation around the things people talk about on data projects and data programmes and try to prove that they aren’t big scary concepts but quite straightforward actually. There’s no need to be scared by or confused by data. In fact, data, because at the end of the day you can print it out and look at it, is arguably one of the easiest things to understand in IT. It doesn’t have to be purely conceptual, stored in your brain, you can make it into a physical thing you can see, and then it becomes self-explanatory and much easier to get your head around. Don’t let anyone bamboozle you with it. Just smile nicely at them, then go into a quiet corner and google what they were banging on about. I guarantee that logically it will fit somewhere into the concepts I mentioned above. And if it doesn’t, then you probably don’t need to worry too much about it.

要查看或添加评论,请登录

Neil Schiller PhD, MBCS的更多文章

  • The changing face of, er, Change

    The changing face of, er, Change

    So, I write a bit on LinkedIn from time to time about the things that interest me in my job: Business Analysis, Agile…

    9 条评论
  • What is the point of Agile?

    What is the point of Agile?

    "What is the point of all this?" If I had a pound for every time I'd heard this since starting to work with Agile I…

    10 条评论
  • Business Analysis and Agile

    Business Analysis and Agile

    I've worked on Agile projects now for about the last five years, on and off. And it's fair to say those projects have…

    8 条评论
  • The Rules of Professional Engagement (not The Art of War)

    The Rules of Professional Engagement (not The Art of War)

    A different kind of post this one really, but it's something I feel quite strongly about. Since I left university in…

  • The Joy of Non-Functional Requirements

    The Joy of Non-Functional Requirements

    Non-Functional Requirements, NFRs, meh. Nobody likes them.

    5 条评论
  • Analysis is just Analysis, right?

    Analysis is just Analysis, right?

    If I've learned anything in the time I've been working in IT it's this: every project is different. The deliverables…

  • Demystifying Business Analysis

    Demystifying Business Analysis

    It’s a truth universally acknowledged that everyone thinks their job is important: it’s tricky, it takes some skill…

    3 条评论
  • The Agile Elephant

    The Agile Elephant

    I remember, way back now at almost the start of my career as a Business Analyst, a company I was at deciding that wow…

    5 条评论
  • The Business Analyst and the Subject Matter Expert

    The Business Analyst and the Subject Matter Expert

    This mainly starts with a conversation I had with a recruitment agent a few months ago. They were looking for a…

    8 条评论
  • Business Analysis: What exactly is it you do?

    Business Analysis: What exactly is it you do?

    I’m biased, obviously, but despite this I still genuinely believe the role of the Business Analyst is the most…

    13 条评论

社区洞察

其他会员也浏览了