Cs & As of data

Data is crucial information waiting to be crunched so that fruitful conclusions can be drawn from it. I am struck by the similarity between the English word “data” & the Marathi word “Data”: the entity that drives our world today, the beginning & the end. We start with data & we wrap up with conclusions based on data. The metaphor is my personal belief, with which the reader may choose to agree or disagree. No one would deny the role data plays from the time you get up in the morning & switch on a blue-lit screen till you doze off back in your bed. In today’s era, data is money & a whole realm of knowledge waiting to be unveiled to reach the core of reality. Its usefulness depends on the interpreter. The quality of data is characterized by the following attributes.

Consistency

Consistency means there are no mismatches in data irrespective of the source of information. If, in a manufacturing unit, the DCS (distributed control system, the operating panel where all plant parameters can be seen & controlled) shows an increasing trend in the temperature at a column bottom pump, then the process historian should reflect the same behaviour. Similarly, a mismatch between the birth date on a person’s birth certificate & that on any other national ID is a consistency error.
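As a minimal illustration, the Python sketch below (with made-up DCS & historian readings) checks that two sources agree on the direction of a trend:

```python
# Consistency sketch: do the DCS & the historian show the same trend?
# (all readings are made up)
dcs_temps = [118.2, 118.9, 119.5, 120.1]        # deg C, from the DCS
historian_temps = [118.1, 119.0, 119.4, 120.2]  # deg C, from the historian

def trend(readings, tol=0.05):
    """Classify the first-to-last change as rising, falling or flat."""
    delta = readings[-1] - readings[0]
    if delta > tol:
        return "rising"
    if delta < -tol:
        return "falling"
    return "flat"

if trend(dcs_temps) == trend(historian_temps):
    print("sources agree:", trend(dcs_temps))
else:
    print("inconsistency: the two sources disagree on the trend")
```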

Accuracy

Accuracy is the attribute of data that measures the gap between what the data should be & what it is: the gap between what the data says & what it should have said. To an extent it also points to the reliability of the source of the data. It is the degree of correctness of data, indicating how well it agrees with the facts. How closely census data quantifies the population of a nation speaks of the accuracy of that data. Flowmeters are installed in a process plant to indicate the operating flow of a stream in a manufacturing unit; instrument bias decides how correctly they show that flow.
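A minimal sketch of quantifying accuracy, assuming hypothetical flowmeter readings alongside reference values from a calibrated master meter:

```python
# Accuracy sketch: bias & mean absolute error of a flowmeter against a
# calibrated reference (all values made up).
measured = [101.8, 99.7, 102.3, 100.9]   # t/h, plant flowmeter
reference = [100.0, 98.0, 100.5, 99.2]   # t/h, calibrated master meter

errors = [m - r for m, r in zip(measured, reference)]
bias = sum(errors) / len(errors)                 # systematic offset
mae = sum(abs(e) for e in errors) / len(errors)  # average gap from the facts

print(f"bias = {bias:.2f} t/h, mean absolute error = {mae:.2f} t/h")
```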

Completeness

Completeness ensures that all data points are captured, whether across a data set or across time. For any incident investigation, capturing all data points (completeness w.r.t. time) helps us reach the root cause of the incident. Completeness across factors can be equally crucial: in a customer satisfaction survey, if a factor like availability of the product is not part of the survey, then the assessment of the product’s characteristic features is futile.
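A minimal completeness check, assuming hypothetical hourly plant records in which None marks a value that was never captured:

```python
# Completeness sketch: find missing timestamps & missing fields
# (records are made up; None = not captured).
records = [
    {"hour": 0, "flow": 101.2, "temp": 118.4},
    {"hour": 1, "flow": None, "temp": 119.0},   # flow not captured
    {"hour": 3, "flow": 100.7, "temp": 119.3},  # hour 2 missing entirely
]

expected_hours = set(range(4))
present_hours = {r["hour"] for r in records}
print("missing timestamps:", sorted(expected_hours - present_hours))

for r in records:
    gaps = [k for k, v in r.items() if v is None]
    if gaps:
        print(f"hour {r['hour']}: missing fields {gaps}")
```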

Current nature

The timestamp of data is imperative, since actions must be worked upon with up-to-date data. Process parameters vary immensely with time: there is a significant difference between the start-of-run & end-of-run conditions of a process unit in the manufacturing sector, & its behaviour differs accordingly. Similarly, the stock market changes dynamically with time, so updated information is crucial for taking the correct decision; stale data might land the analyst in the soup.

Availability

Availability, or timeliness, is the ease with which data is readily accessible to the user for analysis & decision-making at the required instant; when data is needed within a specific span & is not at hand, conclusions can be misleading. If no proper incident historian database exists, it is difficult to fetch the details of past incidents with the same root cause for an incident investigation. During the COVID crisis, details of the infected population, the healthy & their locations were readily available in a government app, which helped with tracking & assisted the government in deciding the measures needed to prevent further spread.

Conformity

Data must be valid: it should fit within given boundaries & agree with the facts observed. The information should follow a standard format, guideline, rule or process. For example, the account opening date for any customer must be a date in the past, not the future. Likewise, during an operating run of a heat exchanger, the heat duty pick-up in clean condition will always be more than that in fouled condition.
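A minimal conformity (validation) check for the account-opening rule above; the record & field names are hypothetical:

```python
# Conformity sketch: an opening date must lie in the past (record made up).
from datetime import date

record = {"customer": "C-1042", "opened": date(2031, 5, 1)}  # deliberately invalid

def is_conformed(rec, today=None):
    today = today or date.today()
    return rec["opened"] <= today

if not is_conformed(record):
    print(f"rule violation: opening date {record['opened']} is in the future")
```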

Data Crunching

Data is flexible enough to be wrestled with & presented in the manner the user intends. This is the beauty of numbers: they allow the user to harvest crucial information from raw data. The following techniques are used for such data handling.

· Classification uses predefined classes to assign to items. These classes describe the characteristics of the items or represent what the data points have in common with each other. This data mining technique allows data to be grouped categorically on the basis of similar features (sketch below).

· Clustering resembles classification in many ways. However, clustering identifies similarities between objects & then groups those items based on what makes them different from other items. While classification may identify groups such as “mammals”, “reptiles”, “crustaceans” & “insects”, clustering may group them as “vertebrates” & “invertebrates”. It involves grouping data based on similarities & helps in knowledge discovery, anomaly detection & gaining insight into the internal structure of the data (sketch below).

· Regression is a method to correlate the Xs to a Y, where Y = f(X). The objective is to find the nexus between various parameters & to ascertain whether an effect is on account of an individual X or the synergistic impact of 2 or more Xs together. For linear regression analysis, a linear function (y = mx + c) is fitted to arrive at the equation; similarly, multiple linear regression, quadratic regression, etc., can be used to account for additional kinds of relationships. Simulation models are based on regression. Examples of such scenarios are judging a customer’s class based on the purchases he makes, or correlating a column upset scenario in a manufacturing plant with the available parameters to screen out the factors responsible for the event (sketch below).

· Association rules are a tool to find relationships between various factors, e.g. the relationship between purchasing pizza & a carbonated beverage. The technique is also referred to as market basket analysis. It brings to notice connections within the data set as it explores the pieces of data. For example, association rules would help analyze how purchases of products like board games & food supplies rose together during the COVID scenario (sketch below).

· Decision trees, whose name is self-explanatory, are used to classify or predict an outcome based on a set list of criteria, taken as inputs to cascading questions. The structure resembles a tree: the main query sits at the top, its responses point to a branch, & the leaf gives the outcome. It amounts to drilling down through the options & evaluating each with respect to implementation. Consider, for example, the call to be taken on the clothes to pack for a vacation, based on the weather conditions that will prevail at the destination during the holiday span (sketch below).

· Neural networks map data in the same fashion in which the human brain is interconnected. A neural network is made up of densely connected processing nodes, like neurons in the brain; each node may be connected to different nodes in multiple layers above & below it. These nodes move data through the network in a feed-forward fashion, meaning the data moves in only one direction, & a node “fires” like a neuron when it passes information to the next node. A simple neural network has an input layer, an output layer & one hidden layer between them. A network with more than three layers, including the input & output, is known as a deep learning network; there, each layer of nodes trains on the output of the previous layer, & the more layers, the greater the ability to recognize complex information. In this sense, a neural network is a system of artificial “neurons”, i.e. mathematical functions, that learns to make a prediction or infer an unknown characteristic by training itself on existing data. The network makes decisions by assigning each connection a number known as a “weight”, which represents the value of the information passing through it. When a node receives information from other nodes, it calculates the total weighted value; if the number exceeds a certain threshold, the information is passed on to the next layer, otherwise it is not passed on. In a newly formed neural network, all weights & thresholds are set to random numbers; as training data is fed into the input layer, the weights & thresholds are refined until the network consistently yields correct outputs (sketch below).

Neural networks can adapt to changing input data, enabling the network to generate the best possible result without the output criteria needing to be redefined. The technique is gaining popularity in fields such as fraud prevention, healthcare, credit scoring & trading. Neural networks require lengthy training periods, making them more suitable for applications where long training is feasible. Their strengths include a high level of noise tolerance & the capacity to classify patterns on which they have not been explicitly trained.

· Predictive analysis focuses on utilizing historical data to build graphical or mathematical models that forecast future outcomes; it strives to calculate an unknown future value from the data currently available. The best example that comes to my mind is the astrological prediction of a person’s future by studying his birth chart & hand. In a manufacturing unit, the precursors of an equipment breakdown are used to predict the breakdown, & preventive maintenance is planned before it occurs (sketch below).

· Outlier detection: in a bell-shaped curve, there may be a couple of data points that do not follow the normal distribution; these are called outliers. More generally, a dataset may contain points that do not comply with the general behaviour or model of the data. The analysis of why these points deviate from the observed behaviour is known as outlier mining. An outlier may be detected using statistical analysis that assumes a distribution or probability model for the data; deviation-based techniques also distinguish outliers by inspecting their differences with respect to the rest of the dataset. Outlier detection is useful when the data does not give a clear conclusion: identifying anomalies helps in understanding specific causations or deriving more accurate predictions. For example, suppose the diesel production of your plant always ranges from 10 to 13 KT per day & suddenly, for a week, production drops to 7 KT per day; it is important to analyze this data set & find the reason for the drop in production (sketch below).
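Classification: a minimal sketch, assuming scikit-learn is available & using made-up animal features (number of legs, presence of a backbone); the classes & feature values are illustrative only:

```python
# Classification sketch: assign predefined class labels to new items
# based on labelled examples (features & labels are made up).
from sklearn.neighbors import KNeighborsClassifier

X = [[4, 1], [2, 1], [6, 0], [8, 0], [4, 1], [0, 1]]  # [legs, has_backbone]
y = ["mammal", "bird", "insect", "crustacean", "reptile", "fish"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[6, 0]]))  # a new six-legged invertebrate -> ['insect']
```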
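Clustering: the same made-up features, but without labels; KMeans is asked to find two groups, which tend to split along the vertebrate/invertebrate line:

```python
# Clustering sketch: group unlabelled items by similarity (data made up).
from sklearn.cluster import KMeans

X = [[4, 1], [2, 1], [6, 0], [8, 0], [4, 1], [0, 1]]  # [legs, has_backbone]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. the two leggy invertebrates fall in their own cluster
```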
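Regression: a minimal linear fit of y = mx + c with numpy, using made-up column data (X = reflux rate, Y = overhead purity):

```python
# Regression sketch: fit y = m*x + c by least squares (data made up).
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # reflux rate (illustrative)
y = np.array([92.1, 93.4, 94.8, 96.1, 97.3])  # overhead purity, % (illustrative)

m, c = np.polyfit(x, y, 1)  # first-degree polynomial = straight line
print(f"y = {m:.3f}x + {c:.3f}")
print("predicted purity at x = 15:", round(m * 15 + c, 2))
```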
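Association rules: a minimal support/confidence computation in plain Python for the pizza & carbonated-beverage example; the baskets are made up:

```python
# Association-rule sketch: support & confidence for pizza -> cola,
# computed directly from made-up market baskets.
baskets = [
    {"pizza", "cola"},
    {"pizza", "cola", "chips"},
    {"bread", "milk"},
    {"pizza", "juice"},
    {"cola", "chips"},
]

a, b = "pizza", "cola"
n = len(baskets)
support_ab = sum(1 for t in baskets if a in t and b in t) / n  # P(a & b)
support_a = sum(1 for t in baskets if a in t) / n              # P(a)
confidence = support_ab / support_a                            # P(b | a)

print(f"support({a}, {b}) = {support_ab:.2f}, "
      f"confidence({a} -> {b}) = {confidence:.2f}")
```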
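Decision tree: a minimal hand-written cascade for the vacation-packing call above; the questions & thresholds are made up:

```python
# Decision-tree sketch: cascading questions from root to leaf (rules made up).
def what_to_pack(forecast: str, temp_c: float) -> str:
    if forecast == "rain":              # root question
        return "raincoat & boots"
    if temp_c < 10:                     # branch: dry but cold
        return "heavy jacket & sweaters"
    if temp_c < 25:                     # branch: dry & mild
        return "light jacket"
    return "t-shirts & sunscreen"       # leaf: dry & hot

print(what_to_pack("sunny", 30))  # -> t-shirts & sunscreen
```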
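Neural network: a minimal numpy sketch of one feed-forward pass through a 2-3-1 network with randomly initialized weights, mirroring the description above; training, which would refine the weights, is omitted:

```python
# Neural-network sketch: one feed-forward pass, weights random as in a
# newly formed network (no training shown).
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))  # input layer  -> hidden layer weights
W2 = rng.normal(size=(3, 1))  # hidden layer -> output layer weights

def sigmoid(z):
    # smooth stand-in for the hard "fires above a threshold" rule
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])      # one input sample (made up)
hidden = sigmoid(x @ W1)       # hidden-layer activations
output = sigmoid(hidden @ W2)  # prediction in (0, 1)
print("prediction:", float(output[0]))
```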
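Predictive analysis: a minimal sketch extrapolating a made-up pump-vibration trend (a breakdown precursor) to estimate when it will cross a trip limit, so maintenance can be planned in advance:

```python
# Predictive-analysis sketch: project a rising precursor to its limit
# (vibration values & trip limit are made up).
import numpy as np

days = np.array([0, 7, 14, 21, 28])
vibration = np.array([2.1, 2.4, 2.8, 3.1, 3.5])  # mm/s, slowly rising
limit = 4.5                                      # mm/s, assumed trip limit

m, c = np.polyfit(days, vibration, 1)  # linear trend of the precursor
days_to_limit = (limit - c) / m
print(f"projected to reach {limit} mm/s around day {days_to_limit:.0f}")
```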
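Outlier detection: a minimal two-sigma screen on made-up daily diesel production figures, in the spirit of the 10 to 13 KT example above:

```python
# Outlier-detection sketch: flag points far from the mean (data made up).
import statistics

production = [11.2, 12.5, 10.8, 12.9, 11.7, 10.4, 12.1, 7.1]  # KT/day

mean = statistics.mean(production)
sd = statistics.stdev(production)
outliers = [p for p in production if abs(p - mean) > 2 * sd]
print("flagged for investigation:", outliers)  # -> [7.1]
```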
