Big Data has a Language Problem.

It's not an urban legend: the Inuit really do have over 50 words for snow. This is because the more we deal with something, the more we need to differentiate it. Powdery snow, flaky snow, slushy snow, snow that can be packed into blocks, different thicknesses of sea ice: it all matters greatly to someone who has to hunt, travel, survive and thrive in such an environment, but not as much to someone who has to shovel it twice a year.

Professions are no different. Say you knew nothing of woodworking and commissioned a carpenter to build a table. The carpenter tells you they need some “wood” for the project, so you collect some branches that fell off the local trees. Could that work? How about a collection of Russian nesting dolls? Those are made of wood too; is that any better? Cedar shingles? A chair? The vast majority of the wood you reach for is rejected outright by the honest carpenter. To them, wood isn’t just “wood”: it has compositions, dimensions, and properties that matter greatly to the quality of the project, and the customer may not have the language to understand what they need to provide.

So it is with data. It’s everywhere, yet little of it is ideal for analytics. Data has defining properties: type, accuracy, relations, dimensions, cardinality, and much more. And like the chair or the cedar shingles in the wood analogy, it’s often a finished product resting in a transactional database; much of its value has already been cut away because it was not needed in its “transactional” form. Running ETL on it and piping it into a data warehouse is often like the carpenter accepting all the twigs, chairs and wooden spoons, then either sending them through a wood chipper or cutting out the choice bits, and gluing the result into a table. The customer is left disappointed. They may think they need to find someone with a PhD in carpentry to make the right table (or to design better models of wood chipper!). Sound familiar?

To do data analytics right for an organization, one needs to “go back to the lumber yard” and cut the wood right from the tree. This means modernizing applications and schemas, and even changing operational business processes, so as to collect the state around each meaningful event in a process. Only then can data science truly deliver on its promise. Some industries are doing this quite well. The oil and gas industry, Halliburton in particular, uses the term “SMART” data for devices designed to generate data holistically, with multiple uses in mind. However, other industries use the same term to describe “Dark Data” that has been put to use (put simply, dark data is data that is hoarded and not analyzed). In the software development realm, Karsun Solutions has https://www.golean.io/, where we record event data at low granularity across the software development lifecycle. But these are tools and philosophies that address the problem, rather than terms that describe it.
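To make “collecting the state around each meaningful event” a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any tool named above: the `Event` and `EventLog` names, the order example, and its fields are all hypothetical. It contrasts the “finished product” view a transactional table keeps (only the latest state) with the full history that analytics usually wants.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Event:
    """One meaningful event plus the state observed around it (hypothetical shape)."""
    entity_id: str
    name: str
    state: Dict[str, Any]


@dataclass
class EventLog:
    """Append-only log: nothing is overwritten, so no value is 'cut away'."""
    events: List[Event] = field(default_factory=list)

    def record(self, entity_id: str, name: str, state: Dict[str, Any]) -> None:
        # Copy the state so later mutations by the caller cannot rewrite history.
        self.events.append(Event(entity_id, name, dict(state)))

    def current_state(self, entity_id: str) -> Dict[str, Any]:
        """The 'transactional' view: only the most recent state survives."""
        latest: Dict[str, Any] = {}
        for event in self.events:
            if event.entity_id == entity_id:
                latest = event.state
        return latest

    def history(self, entity_id: str) -> List[Tuple[str, Dict[str, Any]]]:
        """The 'lumber yard' view: every event with the state around it."""
        return [(e.name, e.state) for e in self.events if e.entity_id == entity_id]


# Hypothetical order lifecycle, used only to illustrate the contrast.
log = EventLog()
log.record("order-1", "created", {"status": "open", "items": 3})
log.record("order-1", "item_removed", {"status": "open", "items": 2})
log.record("order-1", "shipped", {"status": "shipped", "items": 2})

print(log.current_state("order-1"))  # what a transactional row would hold
print(len(log.history("order-1")))   # how many events analytics could draw on
```

The transactional view can no longer say that an item was removed before shipping; the event history can. That lost detail is the kind of value the wood analogy describes as already cut away.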

Some of us have identified the problem and are doing what we can about it. Unfortunately, the vocabulary, or “buzzwords”, does not yet span the discipline; at best it is siloed within individual industries. Until customers know this vocabulary, companies will fall short in Big Data.

What words are you using to describe this problem?

 

 
