Getting Item Data Right

A friend whose opinion I greatly trust recently questioned my use of a negative tone in my blogs. The easy answer is that it is easier to find things that are wrong in the presentation of an item than it is to find things that are right. I am my own confirmation bias, so to speak. I have spent so long looking for data issues that the bright spots no longer stand out to me.

The more difficult answer is that it is genuinely hard to show when a business does item data the right way. If the data is right it is just... "right". There should be nothing special about data that is correct, because correct is the expectation. Incorrect data can stand out like a veggie burger at a Texas barbecue, so it is much easier to point out.

So, to reverse the trend of only showing what is wrong with data, here are some examples of data done correctly. I have included comparisons where the data is not so right, but only to illustrate my point. As usual, the screenshots are limited to avoid calling out individual businesses or websites, and I have pulled from a variety of sites to avoid being accused of bias or partiality.

#1 Case Conundrum

One of the most common errors in item data is mixing case standards. What usually happens is that a business decides to use Title Case and then, mid-stream, decides to change to Sentence case or some other variation. When that happens, this is what you end up with:

Notice that the top three values are in title case and the other two are all lower case. Because of this, the title case values appear on top: the faceting engine sorts capital letters before lower case. This makes it very difficult to find the feature you are looking for, because you cannot know whether it will appear in title case or lower case.
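If you want to see that sorting behavior for yourself, here is a minimal Python sketch of what a case-sensitive sort does to a mixed-case facet list. The values themselves are made up for illustration; they are not pulled from the site in the screenshot.

```python
# A case-sensitive sort puts every capital letter ahead of every lower case
# letter, which is why the title case values float to the top of the facet.
facet_values = ["Adjustable Straps", "Cup Holder", "Machine Washable",
                "canopy", "infant body support"]

print(sorted(facet_values))
# ['Adjustable Straps', 'Cup Holder', 'Machine Washable',
#  'canopy', 'infant body support']

# Normalizing to a single case standard (title case here) keeps the list in
# one predictable alphabetical order, so "Canopy" is always in one place.
print(sorted(value.title() for value in facet_values))
# ['Adjustable Straps', 'Canopy', 'Cup Holder',
#  'Infant Body Support', 'Machine Washable']
```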

Here is an example of this done correctly:

All 42 values in this list are in title case. In fact, I found no instances of mixed case errors anywhere on this website. That is not to say there are none; I just could not find an experience where those errors occurred. This matters because I know exactly where to find Infant Body Support in this list. In the previous example you might have to look in two places, if you even scrolled far enough to notice the case issue.

This issue links directly to item findability. If someone in the first example needed a car seat with a canopy, they might not find it in the list, depending on how the word canopy is cased. In the second example they know exactly where to find it. In the first example the customer has two choices: keep looking through the list if they do not see canopy spelled with a capital "C", or look somewhere else. In the second example the customer never has to make that choice.

Secondly, it simply looks nicer. A single case standard unifies the presentation and makes it look professional and complete. Faceting is an example I like to use a fair amount because those facets come from item data. An enormous amount of time goes into the presentation layer of the website, building the right mix of filtering and container development, and the data has to support that resource spend. Would you say the first example justifies that spend, based on the presentation of the data shown?

#2 Classification Chaos

Another findability enabler that comes from item data is classification on a website. During item data collection, an item is classified into a hierarchy alongside other common items. That classification is then used to place items into the hierarchy of the e-commerce or CMS experience. If the item is classified incorrectly in the data collection hierarchy, there is little chance of it finding its way to the correct location in the presentation. You cannot fix bad data with presentation, and these kinds of errors become very visible very fast. A rough sketch of that flow follows.
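In most systems the site placement is little more than a lookup on the collected classification. The classification names and category paths in this sketch are invented for illustration, not taken from any particular retailer.

```python
# Hypothetical mapping from a collection-side classification to a site
# category path; the classifications and paths are invented for illustration.
SITE_CATEGORY = {
    "Convertible Car Seat": "Baby > Car Seats > Convertible",
    "Infant Car Seat": "Baby > Car Seats > Infant",
    "Booster Seat": "Baby > Car Seats > Boosters",
}

def place_item(item: dict) -> str:
    # If the classification collected for the item is wrong, the item lands
    # on the wrong shelf here; the presentation layer cannot repair that.
    return SITE_CATEGORY.get(item["classification"], "Unmapped")

print(place_item({"name": "Backless Booster", "classification": "Booster Seat"}))
# Baby > Car Seats > Boosters
```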

However, sometimes terminology trips us up. For example, the industry might refer to an item with one classification name while consumers understand it by another. We all know the word plastic, but there are dozens of types of plastic, from ABS (Acrylonitrile Butadiene Styrene) to PTFE (Polytetrafluoroethylene) to PS (Polystyrene) to HDPE (High-Density Polyethylene). If you saw a website listing ABS, PTFE, PS, and HDPE as the materials for an item, it would make far less sense than simply stating Plastic. The same can happen with classifications, such as with induction cooktops.
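One way to bridge that terminology gap in the data itself, rather than on the page, is a consumer-facing rollup. This is a hedged sketch; the mapping below is purely illustrative.

```python
# Hypothetical rollup from industry material codes to the consumer-facing
# term a shopper would actually filter on.
CONSUMER_MATERIAL = {
    "ABS": "Plastic",    # Acrylonitrile Butadiene Styrene
    "PTFE": "Plastic",   # Polytetrafluoroethylene
    "PS": "Plastic",     # Polystyrene
    "HDPE": "Plastic",   # High-Density Polyethylene
}

def display_material(collected_value: str) -> str:
    # Keep the precise industry term in the item record, but present the
    # broader consumer term on the site.
    return CONSUMER_MATERIAL.get(collected_value, collected_value)

print(display_material("HDPE"))   # Plastic
print(display_material("Oak"))    # Oak (no rollup needed)
```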

Let us start once again with the bad example. This experience shows induction cooktops and smooth surface cooktops together in the same presentation. Technically an induction cooktop has a smooth surface, but a smooth surface cooktop is not an induction cooktop. They are two different electric cooking technologies that need to be classified separately. See the image below:

There are two possible reasons this experience looks the way it does. Either the classification system separates smooth surface from induction, but not everyone who selected the induction classification selected the correct cooktop surface type, or smooth surface cooktops have simply been classified as induction. Because induction cooktops are smooth, that is a rational choice in the data as long as the context is missing. Now let's see this done correctly:

You may notice that they do not show separate smooth top and induction categories at the top of this facet. That is because both smooth top and induction are electric, and therefore belong in the same classification. They then appropriately break out electric and gas cooktops by technology in the Cooking Surface attribute, and even allow for combination electric and gas cooktops. There is no confusion over which cooktop belongs to which classification. This is done correctly, avoiding confusing faceting situations and the clutter that occurs when incorrect items appear in a presentation experience.
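A simple consistency rule between the classification and the Cooking Surface attribute can catch mismatches like the ones in the first example before they ever reach the site. The attribute names and allowed values in this sketch are my own assumptions, not taken from either website.

```python
# A consistency check between the collected classification and the
# Cooking Surface attribute; names and allowed values are assumptions.
VALID_SURFACES = {
    "Electric Cooktop": {"Induction", "Radiant Smooth Top", "Coil"},
    "Gas Cooktop": {"Gas Burner"},
    "Dual Fuel Cooktop": {"Gas and Electric"},
}

def surface_matches_classification(item: dict) -> bool:
    allowed = VALID_SURFACES.get(item.get("classification", ""), set())
    return item.get("cooking_surface") in allowed

# An item carrying a surface value its classification does not allow gets
# flagged for review before it clutters the facet.
item = {"classification": "Gas Cooktop", "cooking_surface": "Induction"}
print(surface_matches_classification(item))   # False -> send back for review
```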

There are two important reasons this works. First, the relationship is documented somewhere that both the item data collection program and the website navigation program have access to. Second, the item taxonomy team and the web navigation program are clearly communicating with each other to arrive at the best presentation experience based on the terminology in the data collection taxonomy. Do you think the first example shows good communication or documentation?

#3 Normalization Nonsense

One of the keys to any data collection system is the ability to normalize values across like items. It is important to use the same terminology across your platform, with the same constraints, for the same data point. As comparison engines have grown more complex, they highlight exactly where this normalization process fails, such as below:

Here are three items in a comparison engine with their battery requirements listed. Notice that the middle item has quotes around AAA but the outer two do not. Also notice that (not included) and (sold separately) mean exactly the same thing but look different in this presentation element. It is a small issue, but an issue nonetheless.
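As a small illustration of what cleaning these values up after the fact might look like, here is a sketch that strips the stray quotes and maps the two equivalent phrasings to one. The exact strings are stand-ins for what the comparison engine shows, not copied from it.

```python
import re

# Stand-in values for the battery requirement strings described above.
raw_values = [
    "2 AAA batteries (not included)",
    '2 "AAA" batteries (sold separately)',
    "2 aaa batteries (Not Included)",
]

# Phrases that mean the same thing get mapped to one canonical form.
PHRASE_MAP = {"sold separately": "not included", "not included": "not included"}

def normalize_battery_requirement(raw: str) -> str:
    value = raw.strip().replace('"', "")          # drop stray quotes around AAA
    match = re.match(r"(\d+)\s+(\w+)\s+batteries\s*\((.+)\)", value)
    if not match:
        return value                               # leave anything unexpected alone
    count, battery_type, note = match.groups()
    note = PHRASE_MAP.get(note.strip().lower(), note)
    return f"{count} {battery_type.upper()} batteries ({note})"

for raw in raw_values:
    print(normalize_battery_requirement(raw))      # all three print the same string
```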

Next we can see normalization in a comparison engine working correctly:

Since this is a discussion of normalization, we will leave aside the choice of attributes to display and how to display them. All the Yes/No attributes show up as either "Y" or "N", even if "Yes" or "No" might be preferable. The name of the Print Width attribute is awkward, but the values displayed are consistent. The Shipping Description attribute uses the same value in a consistent format.

This presentation works from an item data perspective because the system constraints on data collection enforce normalization. Many systems give you an open text box you can type anything into, which means item-to-item variation should be expected. A good controlled vocabulary, backed by systems that enforce data normalization, will make even the dullest data at least seem consistent.
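A hedged sketch of what that enforcement might look like at collection time follows. The attribute names and allowed values are illustrative only, not from any particular system.

```python
# Enforcing a controlled vocabulary at entry time instead of accepting free
# text; the attributes and allowed values here are illustrative only.
CONTROLLED_VOCABULARY = {
    "batteries_included": {"Yes", "No"},
    "connectivity": {"Wi-Fi", "Bluetooth", "USB", "Ethernet"},
}

def validate(attribute: str, value: str) -> str:
    allowed = CONTROLLED_VOCABULARY.get(attribute)
    if allowed is None:
        raise KeyError(f"Unknown attribute: {attribute}")
    if value not in allowed:
        raise ValueError(f"{value!r} is not an allowed value for {attribute}; "
                         f"expected one of {sorted(allowed)}")
    return value

validate("batteries_included", "Yes")     # accepted as-is
# validate("batteries_included", "yes")   # rejected before it ever reaches the site
```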

In Summary 

My blogs about the results of item data issues are becoming more difficult to write. Simply put, the industry is starting to see the value of normalized, controlled data. It used to be that I could go to any website, search for 10 minutes, and come away with multiple examples of bad data or bad data practices. Now it takes longer. The issues are harder to find, and it takes someone more trained in looking for them to spot them.

This is great news for businesses that see presentation as more than just making items available on their website. It means a data-centric approach has provided value, and they see their item data as an asset. However, it also comes with a cautionary tale: it is easy to become lax and let data quality slip. Some of the issues I found for this blog were not there months ago. A data quality program is an investment in your data assets, but so is putting the resources into maintaining that program. Treating your data quality systems as a project rather than a process and a program can severely hinder the return on investment in that asset.

 

Charles Meyer Richter

Principal information architect & diagnostician at Ripose Pty Limited

8 years ago

Daniel, thank you for posting this article, and Winston S., thank you for bringing this to my attention. As far as I understand, 'normalised controlled data' is used to remove redundancies. The examples you have provided are 'things' which may or may not have 'data' elements such as colour, heat_source etc. There are two ways of looking at this classification conundrum: one has to do with object orientation and the other with data normalisation.

Object orientation programming (c1990) requires the viewer to use three terms of reference, namely encapsulation (binding an object to a higher level object, e.g. 'Cooking_surface'), polymorphism (finding similarities, e.g. electric_sourced, battery_sourced, carbon_sourced) and inheritance (navigation, e.g. a gas_sourced surface is a type of carbon_sourced surface). The problem with polymorphism is that it only (according to my research) included the mutually exclusive property (the OR) but not the mutually inclusive property (the AND), nor the capability of recursion (or many-to-many relationships between mutually exclusive types). If this had been done properly, then the induction property of the cooking surface would be a mutually inclusive property of the cooking surface, but only if the cooking surface was electric or battery, which then means that the original classification of the Cooking_surface is incorrect.

In data normalisation, where data is known, Ted Codd, the 'father' of data normalisation, taught that the 'value' of the third normal form was to ensure that the data item (for example date_of_birth) depended on the key, the whole key and nothing but the key (i.e. non-redundancy). The problem with this is that the 'key' is an artificially selected attribute, whereas a data item like date_of_birth is a real attribute. The late Ted Codd (1923-2003) may have realised this conundrum in 1974 when he and Raymond F. Boyce (1947-1974) discovered the BCNF, or 3.5 normal form. Unfortunately, with the early death of Boyce, Codd probably had no one else to collaborate with on this nf and so probably left it there. If Ted Codd had realised the similarity with object orientation's flaw in polymorphism, he may have pushed the 3.5 nf to become 4nf (mutually exclusive), 5nf to cover the mutually inclusive property and 6nf to cover recursion. Perhaps Codd would also have discovered that normalisation required data whereas an artifact like knowledge did not; then perhaps he would have avoided the trap he set for the unwary data analyst.

In 1984 I discovered the potential flaw in the 3.5nf conundrum and introduced my intermediate solution, which was to state that a typed entity could also play multiple roles. But in 1990 I dropped the role approach and settled on the mutually inclusive form (5nf). The AI engine I wrote (1990) thus included both the extended polymorphic property and the 5nf, as they were one and the same, as well as the 6nf. These discoveries will (if implemented) greatly reduce the time and effort to produce high-quality, non-redundant databases and processes. However, they require a more advanced form of business modelling than the current enterprise architectural paradigms provide, as these not only use 7 different starting positions but take far too long to produce a strategic plan that DevOps can then focus their attention on to find the relevant data items and thus completely avoid the data normalisation step.
From the pen of a 45+ year veteran in the domains of business information (objectives, knowledge & strategies) & information projects (data & applications)
