7. Mining Internet of Things Data

This chapter focuses on techniques for mining IoT data, including machine learning techniques for IoT. It is therefore devoted to the "analytics" part of IoT analytics, and comes as a natural follow-on to the techniques for collecting and unifying IoT streams prior to their analysis, which were the topic of the previous chapter.

As part of the chapter we will present the main disciplines that comprise data mining and enable the discovery of knowledge from data sets. We will also provide an overview of machine learning techniques, including both supervised and unsupervised learning methods. Moreover, we will present a classification of IoT-oriented data mining approaches, including multi-layer, distributed and grid-based techniques.

7.1 IoT Data: The Mining Process

The process of mining IoT data in order to discover knowledge from them touches on multiple disciplines and therefore requires multiple skills, including databases, machine learning, visualization and statistics. This is also the reason why data-driven knowledge discovery (including BigData techniques) is so challenging and typically requires multi-disciplinary teams.

The selection of a proper machine learning and data mining model for discovering knowledge based on IoT data is very challenging, given the different problems and deployment environments comprising the wealth of IoT applications nowadays. To this end, disciplined methodologies for analyzing the data and evaluating the performance of alternative data mining models are employed. The most popular methodologies for analyzing and mining datasets are CRISP-DM (Cross Industry Standard Process for Data Mining), KDD (Knowledge Discovery in Databases) and SEMMA (Sample, Explore, Modify, Model, and Assess). These methodologies are all iterative and cross-sector, which means that they are commonly used across various application domains, including industrial energy, healthcare, manufacturing, smart cities and more. Furthermore, all of them provide the means for evaluating the performance of a given model on the supplied datasets, prior to the final selection and field deployment of a data mining algorithm. It should also be noted that these methodologies have not been developed specifically for IoT data and applications. Rather, they are general-purpose knowledge discovery models, which are applied to almost all data mining problems, including the mining of IoT data.

The following paragraphs illustrate CRISP-DM, which is the most popular of the three methodologies outlined above. CRISP-DM is an iterative methodology comprising six major phases. The phases are sequential, in the sense that each one builds on the outcomes of the previous one. Nevertheless, as a result of the iterative nature of the methodology, it is possible, and in most cases required, to revert from one phase to a previous one. The six phases of CRISP-DM are as follows:

  • Business Understanding: This initial phase sets the scene and defines the scope of the data mining activities. It establishes the requirements and ultimate goals of the data mining process, including the expected result. To this end, the target business question has to be formulated. This could, for example, be the prediction of a machine’s end-of-life based on vibration and ultrasonic data, or the prediction of traffic along an urban route. In addition to formulating the target business question, a preliminary plan for resolving it is developed, including the datasets to be used and the models that should be explored.
  • Data Understanding: As part of this phase datasets are collected and reviewed. For the success of the data mining process, it is very important to inspect the available datasets in order to identify data quality problems, but also to understand which models could be effective and which not. Even though every problem is different, experienced data scientists can prioritize the methods to be tested and evaluated, simply by reviewing the available data.
  • Data Preparation: In this phase the datasets to be used for extracting and evaluating the data mining model are prepared. This may involve several transformations to the raw data that have been collected by sensors and other IoT sources, including filtering of the datasets (e.g., selecting specific attributes), transforming them in different formats, combining datasets (e.g., joining datasets from different sensing modalities), as well as cleaning them (e.g., getting rid of empty or incomplete fields). The ultimate objective of this phase is to ensure that the data are ready to be loaded and used by data modelling tools.
  • Modeling: As already outlined, there is a variety of models that can be used for classification, prediction or even rules extraction. The purpose of this phase is to apply some of the available methods, while at the same time calibrating them by tuning their parameters. Note that each model to be produced may require different datasets. Therefore, it is very common to go back to the data preparation phase in order to prepare alternative datasets as needed.
  • Evaluation: Following the development of the data model(s) in the previous phase, this one performs a thorough evaluation of the operation of the selected models against the target objectives. The evaluation is conducted in terms of the performance of each model. For example, it is tested whether a model can produce traffic predictions that are very close to the known traffic on a given route. However, apart from evaluating the performance of the model, it is also important to assess (at a higher level) whether the business objectives can be met. This latter assessment will drive the final decision on whether a model can be moved to production. It is quite common for the data mining team to go back from this phase to the business understanding phase in order to reformulate the business problem/question at hand.
  • Deployment: This phase is concerned with the deployment of successful data mining models on the field. It is not confined to the integration of algorithms within platforms and database systems. Rather it also includes the implementation of proper ways for presenting the information to the end-users, including production of reports and dashboards.

The other two methodologies (i.e., KDD and SEMMA) are quite similar and are based on a phased approach as well.

7.2 Common Data Mining Techniques

7.2.1 Definitions

For the study and validation of different machine learning methods, it’s important to introduce the following terms:

  • Instance: An instance (also called item or record) is an example of an entity, described by a number of attributes. For example, a day can be described by temperature, humidity, and cloud status.
  • Attribute: An attribute (also called field) provides the means for measuring an aspect of the Instance, e.g., a day’s temperature.
  • Class (Label): A class refers to a grouping of homogeneous instances, e.g., gold customers within a customer relationship management database.

Given the previous definitions, here are some very common tasks that data miners need/try to perform:

  • Classification, which refers to the task of assigning an item to a class or, in other words, to the process of predicting the class of an item based on available datasets about similar items and their classes.
  • Clustering, which refers to the process of grouping related data (instances) within a given dataset. In other words, clustering refers to finding clusters in data.
  • Associations, which is the task of finding associated items, i.e., items, attributes or conditions on data that occur frequently together.
  • Visualization, which refers to the process of visualizing data in order to facilitate exploration of the datasets by humans and related discovery of knowledge.
  • Summarization, which refers to the extraction of knowledge & attributes about a group of data.
  • Deviation Detection, which refers to the process of finding changes between given datasets and items within them.
  • Estimation, which is the task of using datasets in order to predict a continuous (rather than a discrete) value.
  • Link Analysis, which is about finding relationships between different datasets and their items.

In the following parts of the chapter, we will present more details about some of these tasks, starting with classification.

7.2.2      Classification of Machine Learning Mechanisms

Independently of the specific data mining task that they address, machine learning algorithms can be classified in the following three categories:

1) Supervised learning: In supervised learning, a property (label) is available for a certain dataset (the training set). This same property is missing for new instances and hence needs to be predicted based on (supervised) machine learning. Decision trees, Naïve Bayesian classification and support vector machines are some popular supervised learning techniques.

2) Unsupervised learning: Unsupervised learning is about discovering implicit relationships in a given unlabeled dataset. Hence, in unsupervised learning, items are not pre-assigned to classes and no training dataset is used. Clustering methods provide a prominent example of unsupervised learning.

3) Reinforcement learning: Reinforcement learning is usually used for more complex problems, such as the recognition of complex patterns. It involves taking actions in order to maximize a cumulative reward. Reinforcement learning is typically formulated as a Markov decision process (MDP) and entails dynamic programming techniques. It is used in applications involving game theory, operations research, multi-agent systems, etc. Note that reinforcement learning is a more advanced form of machine learning, which is employed in several AI (Artificial Intelligence) applications, such as Google’s AlphaGo engine, which in March 2016 managed to beat one of the world’s best players of the game of Go.

7.3 Data Mining and Machine Learning Models

7.3.1      Introducing Classification

Classification is about learning a method that can predict the class of an instance based on a dataset of pre-labeled (pre-classified) instances. Referring to the figure/slide, the classification problem can be defined as follows: given a set of points from known classes, what is the class of a new point (i.e., one which is not classified yet)?

There are many approaches for solving a classification problem, including regression, decision trees, Bayesian approaches, neural networks and more.

7.3.2      Decision Trees

Decision trees are one of the most popular, intuitive and easy-to-understand classification methods. From a technical viewpoint, they are decision support tools that use a tree-like graph or model of decisions and their possible consequences. They operate by evaluating chance-event outcomes, resource costs and the utility of decisions at each step. From a business viewpoint, a decision tree can be seen as the minimum number of yes/no questions that one has to ask in order to assess the probability of making a correct decision. Hence, decision trees provide a structured and systematic way to arrive at a logical conclusion. An example of a decision tree is shown in the slide/figure, where different attributes associated with the weather of a day (the item) are assessed in order to determine whether a sports event will take place on that day or will be postponed.

A simple way to perceive a decision-tree-based classifier is to view it as a series of if-then-else statements. This is shown in the figure in a two-dimensional space based on two variables (X, Y). The values of these two variables are assessed in order to classify a point into an appropriate area of the space, since different areas of the space denote different classes.
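To make the if-then-else view concrete, the following is a minimal sketch using scikit-learn's DecisionTreeClassifier on a tiny, made-up weather dataset; the attribute names and values are illustrative assumptions, not those of the figure:

```python
# Minimal decision-tree sketch (assumes scikit-learn and pandas are installed).
# The toy "weather" dataset below is invented for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row is a day (instance); "play" is the class label.
days = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "humidity": [85, 90, 78, 96, 70, 65],
    "windy":    [False, True, False, False, True, True],
    "play":     ["no", "no", "yes", "yes", "yes", "yes"],
})

X = pd.get_dummies(days[["outlook", "humidity", "windy"]])  # one-hot encode the symbolic attribute
y = days["play"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))     # the learned if-then-else structure
```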

7.3.3      Least Squares Regression

Another classification-related method is Least Squares Regression (LSR). It is a method for performing linear regression, i.e., fitting a straight line through a set of points in an optimal way. Optimality is specified in terms of the vertical distances between the points and the line. In particular, the distances of the various points are summed, and the line that gives the smallest possible sum is the fitted line.

Note that the term “linear” in the name of LSR denotes the kind of model used to fit the data, while the term “least squares” denotes the error metric that is optimized/minimized in order to arrive at the optimal fit. In particular, least squares refers to summing and then minimizing the squares of the distances of the given points from the line.
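A minimal sketch of a least-squares line fit on synthetic points is shown below, assuming NumPy; polyfit with degree 1 minimizes exactly the sum of squared vertical distances described above:

```python
# Least-squares line fit: a minimal NumPy sketch with synthetic data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])          # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)        # minimizes the sum of squared vertical distances
residuals = y - (slope * x + intercept)
print(slope, intercept, np.sum(residuals ** 2))   # fitted line and the minimized error
```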

7.3.4      Naïve Bayes Classification

Naïve Bayes classification refers to a family of simple probabilistic classifiers based on Bayes’ theorem. The main equation of Bayes’ theorem is illustrated in the figure/slide: it calculates the posterior probability P(c | x) of a class c given a known event (observation) x, based on the likelihood P(x | c) of the event x for class c, the prior probability P(c) of an item being in class c, and the prior probability P(x) of the observed event x. The relationship is straightforward, but it assumes independence between the features. This assumption is in practice often unrealistic; however, there are cases where strong independence between the features can reasonably be assumed.
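For reference, the equation referred to above is Bayes’ theorem:

P(c | x) = P(x | c) · P(c) / P(x)

where P(x | c) is the likelihood of the event x for class c, P(c) is the class prior probability, and P(x) is the predictor prior probability.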

Classifiers based on the “naïve” version of Bayes’ theorem are very easy to implement. It is nevertheless quite impressive that they give decent results on many occasions. Their operation is based on the calculation of the posterior probability of an item being in class c, given an event x which is known/observed in the available dataset.

Note that Naïve Bayesian classifiers have several applications, such as the following (the first one is sketched in code after the list):

  • Marking an email as spam or not spam, given the previous (known) classification of other emails as spam or non-spam.
  • Classifying a news article as technology, politics or sports, based on some of its attributes (e.g., words) and given the known classification of other articles in the available datasets.
  • Checking whether a piece of text expresses positive or negative emotions, which is known as sentiment analysis.
  • Classifying faces into known categories (face recognition) based on facial features.
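Below is a minimal sketch of the spam-filtering case, assuming scikit-learn's MultinomialNB; the tiny email corpus is invented purely for illustration:

```python
# Naïve Bayes spam-filtering sketch (assumes scikit-learn; the tiny corpus is made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda attached",
          "free offer click now", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)              # word counts as attributes
clf = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["free prize meeting"])
print(clf.predict(new_email), clf.predict_proba(new_email))  # predicted class and posterior probabilities
```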

7.3.5      Logistic Regression

Logistic regression provides a powerful statistical way of modeling a binomial outcome with one or more explanatory variables. It measures the relationship between a categorical dependent variable and one or more independent variables. Accordingly, it estimates probabilities using a logistic function, which is the cumulative logistic distribution. As shown in the slide/figure, it provides probabilities for the classification based on a logistic rather than a linear model. The logistic model can provide a much more accurate probability distribution than a linear model.

Real-world applications of logistic regression include credit scoring, measuring the success rates of marketing campaigns, predicting the revenues of a certain product, predicting a specific weather characteristic on a particular day, and more.
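A minimal logistic regression sketch is shown below, assuming scikit-learn and synthetic data; the "hours studied vs. pass/fail" setting is purely an illustrative binomial outcome:

```python
# Logistic-regression sketch for a binomial outcome (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 10, size=200).reshape(-1, 1)
passed = (hours_studied.ravel() + rng.normal(0, 2, size=200) > 5).astype(int)  # binary outcome

model = LogisticRegression().fit(hours_studied, passed)
print(model.predict_proba([[3.0], [7.0]]))   # class probabilities via the logistic function
```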

7.3.6      Support Vector Machines (SVM)

Support Vector Machines (SVM) is a popular binary classification algorithm. Given a set of points of two types in an N-dimensional space, SVM generates an (N - 1)-dimensional hyperplane to separate those points into two groups. In two dimensions, SVM produces a straight line that separates the points of the two types and is situated as far as possible from all of them.

SVM classification is used in various practical settings, including display advertising, human splice-site recognition, image-based gender detection, large-scale image classification and more.
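The following is a minimal sketch of a linear SVM, assuming scikit-learn and a handful of synthetic 2-D points; the fitted hyperplane is the maximum-margin separator described above:

```python
# Linear SVM sketch: separating two classes with a maximum-margin hyperplane
# (assumes scikit-learn; the 2-D points are synthetic).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)          # the points that define the margin
print(clf.predict([[3, 2], [7, 7]])) # classify two new points
```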

7.3.7      Ensemble Methods

Ensemble methods refer to a class of learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions. Originally, ensemble methods used Bayesian averaging. More recent algorithms, however, include error-correcting output coding, bagging, boosting and other methods. The main advantages of ensemble methods include the following (a minimal sketch is provided after the list):

  • They average out biases and hence provide more balanced results.
  • They reduce variance due to diversification (e.g., as in a stock portfolio).
  • They are unlikely to over-fit, since they combine predictions from non-overfitting models (e.g., average, weighted average, logistic regression).
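A minimal voting-ensemble sketch is given below, assuming scikit-learn and its bundled Iris dataset; three different classifiers are combined with a soft (probability-weighted) vote:

```python
# Minimal voting-ensemble sketch (assumes scikit-learn): three different classifiers
# vote on each prediction, which tends to average out their individual biases.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3)),
], voting="soft")                                    # weighted vote over predicted probabilities

print(cross_val_score(ensemble, X, y, cv=5).mean())  # cross-validated accuracy of the combined model
```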

7.3.8      Clustering

Clustering represents a whole range of methods for grouping a set of objects into categories in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups. Prominent examples of clustering algorithms include centroid-based algorithms, connectivity-based algorithms, density-based algorithms, probabilistic algorithms, dimensionality reduction algorithms, as well as neural networks. Neural networks also underpin the class of deep learning algorithms, which are nowadays very popular for their ability to identify complex patterns in multimedia (IoT) datasets.

Note that clustering algorithms fall into the class of unsupervised learning algorithms. This means that the assignment of instances to groups is not based on the exploitation of existing “labelled” datasets used for training. Rather, clustering algorithms find a “natural” grouping of instances given unlabeled data, as shown in the figure where instances are grouped into three areas in space.

Neural networks are a popular clustering technique, which can be used to select more complex regions as part of the clustering process. Furthermore, neural networks can yield more accurate results. On the downside, they can overfit the data, which means that they may identify patterns in random noise. The following figure shows that the patterns found by neural networks can be quite complex.

Different clustering algorithms have been devised in order to address a variety of different (data mining) cases, including:

  • Cases involving numeric and/or symbolic data.
  • Deterministic and probabilistic data modelling cases.
  • Hierarchical and flat cases.
  • Top-down and bottom-up cases.

The evaluation of the effectiveness of the clustering algorithms can be performed based on different techniques, including manual inspection, benchmarking on existing labels, as well as cluster quality measures based on different metrics (e.g., distance measures, high similarity within a cluster, low across clusters).

K-Means is one of the most popular clustering algorithms. It works with numeric data only and operates as follows (a from-scratch sketch is provided after the steps):

  • It picks a number (K) of cluster centers (at random).
  • It assigns every item to its nearest cluster center (e.g., using Euclidean distance).
  • It moves each cluster center to the mean of its assigned items.
  • The assignment and update steps are repeated until convergence is achieved, i.e., when the change in cluster assignments falls below a threshold.
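The following from-scratch NumPy sketch follows the steps above on synthetic 2-D data; testing convergence on the movement of the cluster centers (rather than on the assignments directly) is a simplifying assumption of this sketch:

```python
# From-scratch K-Means sketch following the steps above (NumPy only; data are synthetic).
import numpy as np

def kmeans(points, k, iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # step 1: pick K centers at random
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                  # step 2: assign items to nearest center
        new_centers = np.array([points[labels == j].mean(axis=0)      # step 3: move centers to cluster means
                                for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:                # step 4: stop when centers stop moving
            break
        centers = new_centers
    return labels, centers

data = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
labels, centers = kmeans(data, k=2)
print(centers)
```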

7.3.9      Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is one more statistical analysis technique used in machine learning contexts. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The transformation is illustrated in the figure in this slide, which depicts the transformation of the instances from the original data space to a principal-component space. Note that the linearly uncorrelated variables are called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. Popular applications of PCA can be found in interest rate derivatives portfolios and in neuroscience.

Domain knowledge is very important when choosing whether to go forward with PCA, as PCA is based on domain-specific assumptions about the underlying structure of the data.
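A minimal PCA sketch is shown below, assuming scikit-learn and synthetic data in which two of the three variables are correlated:

```python
# PCA sketch (assumes scikit-learn): projecting correlated 3-D observations
# onto their first two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               0.8 * base + rng.normal(0, 0.1, size=(200, 1)),  # correlated with the first variable
               rng.normal(size=(200, 1))])                       # independent third variable

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # variance captured by each principal component
X_reduced = pca.transform(X)           # observations expressed in the principal-component space
```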

7.3.10   Independent Component Analysis (ICA)

Independent Component Analysis (ICA) is another statistical technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. It defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed to be non-Gaussian and mutually independent, and they are called the independent components of the observed data. ICA is a special case of Blind Source Separation (BSS), which is very commonly used in problems involving audio signals. A very popular problem is the so-called “cocktail party problem”, which refers to the task of separating different conversations in a noisy room where different groups of people speak simultaneously.

Note that ICA is related to PCA, since it also transforms data into a set of components. Nevertheless, it is considered much more powerful than PCA for several problems. ICA is used in many different applications, including digital image classification, document databases, identification of economic indicators and psychometric measurements.
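A minimal "cocktail party"-style sketch is given below, assuming scikit-learn's FastICA and two synthetic source signals mixed by a mixing matrix that is known only for the purposes of the demo:

```python
# ICA "cocktail party" sketch (assumes scikit-learn): un-mixing two synthetic source signals.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                      # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))             # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix (unknown in a real setting)
X = S @ A.T                             # the two "microphone" recordings

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)        # estimated independent components
print(recovered.shape)
```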

7.4 Models Evaluation

7.4.1      Overview

Given the availability of a host of different methods for mining data, it’s reasonable to wonder which one can be the most effective for a real-life problem at hand. In principle, no model is uniformly the best. When mining data (including IoT data), data scientists are expected to evaluate and test many different methods in order to identify the optimal one for the problem at hand. The comparison of different methods can be performed along different dimensions, depending on the business requirements. Some of these dimensions include:

  • The speed of training of the machine learning technique/model.
  • The speed of model application.
  • The noise tolerance of the method.
  • The extent to which it is intuitive, understandable and explanatory, which all highly depend on the semantics of the selected model.

In real-life problems hybrid integrated models are very commonly used as they yield better results than any single method alone.

7.4.2      Classification Techniques Evaluation

Taking as an example the different classification techniques (e.g., decision trees, linear regression), we need to evaluate how predictive they are, i.e., how effective they are in classifying new instances. To this end, the error on the training data is not a good indicator of performance on future data, since the new data will probably not be exactly the same as the training data. Furthermore, there is the overfitting problem, i.e., fitting the training data too precisely, which usually leads to poor results on new data. Hence, other data should be used for the evaluation.

In order to evaluate different classification methods, various measures can be used (two of them are sketched in code after the list), including:

  • Classification accuracy i.e. the ability of a classifier to correctly/accurately classify new instances.
  • Total cost/benefit, especially when different errors involve different costs. Indeed, in some cases an error is also associated with some business cost (or risk), which can be taken into account in the evaluations.
  • Lift and ROC (Receiver Operating Characteristic) curves, which are graphical plots that illustrate the performance of a binary classifier system as its discrimination threshold is varied.
  • The error in numeric predictions.
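Below is a small sketch of the first and third measures (classification accuracy and a ROC curve with its AUC), assuming scikit-learn and a synthetic dataset:

```python
# Sketch of two evaluation measures: classification accuracy and a ROC curve / AUC
# (assumes scikit-learn; data are synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, scores))
fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve as the threshold varies
```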

 7.4.3      Classifier Error Rate

The use of the Classifier Error Rate is a very popular method for ranking the performance and efficiency of classification methods. Its calculation is based on the following simple and straightforward principles:

  • When an instance’s class is predicted correctly the classifier is successful.
  • When an instance’s class is predicted incorrectly the classifier is erroneous.
  • The error rate is defined as the proportion of errors made over the whole set of instances.

As already outlined, it’s not credible to calculate the error rate on the training set, since this will give overly optimistic results. Rather, additional data are required for computing the error rate.

 7.5 Building a Classifier

In order to alleviate the problem of using the training data set for evaluation, a sufficiently large dataset of instances with known classification will be required.

In cases where many (e.g., >1000) examples are available (including > 100 examples from each class) a simple evaluation will give useful results. To this end, we can randomly split data into training and test sets as follows:

  • 2/3 for training (training set).
  • 1/3 for testing (test set).

Accordingly we can build a classifier using the training set, and evaluate it using the test set.

The process of building a classifier involves using a subset of instances with known classes, the training set, to build a model. This model is evaluated against the test set, which is used to calculate the error rate of the classifier. The latter process involves a comparison of the classifications produced by the classifier with the known classes in the test set. In line with the CRISP-DM data mining process, the evaluation may lead to changes and reconsideration of the model, including the development of alternative models.
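A minimal sketch of this process is shown below, with a random two-thirds/one-third split and the error rate computed as the proportion of misclassified test instances; scikit-learn and a synthetic dataset are assumed:

```python
# Sketch of a 2/3 train, 1/3 test split and the resulting error rate
# (NumPy + scikit-learn; the dataset is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, random_state=0)

idx = np.random.default_rng(0).permutation(len(X))   # random split of the labeled instances
cut = int(len(X) * 2 / 3)
train_idx, test_idx = idx[:cut], idx[cut:]

clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
errors = (clf.predict(X[test_idx]) != y[test_idx]).sum()
print("error rate:", errors / len(test_idx))          # proportion of misclassified test instances
```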

7.6 Dealing with Unbalanced Data

Note that the evaluation of classifiers presented previously can hardly be applied in cases of classes with very unequal frequencies. Indeed, there are many problems where events/facts need to be classified into classes that occur rarely, for example:

  • Attrition prediction: 97% stay, 3% attrite (in a month).
  • Medical diagnosis: 90% healthy, 10% disease.
  • Classification of clients based on their eCommerce behaviour: 99% don’t buy, 1% buy.
  • Classification of individuals in the scope of urban security applications: > 99.99% are not terrorists.

Similar situations apply in cases with multiple classes. In all these cases, the majority-class classifier can be 97% correct. However, this high percentage is useless, since we are mostly interested in classifying the events that occur rarely.

In cases where classification into one of two classes is required, the following approach can be applied in order to deal with unbalanced data (a minimal sketch is provided after the list):

  • Start by building balanced training and test sets, and then train a model on the balanced set.
  • Randomly select the desired number of minority-class instances.
  • Add an equal number of randomly selected majority-class instances.
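A minimal undersampling sketch of the steps above for a two-class problem is given below, using NumPy and a synthetic dataset with roughly 3% minority instances:

```python
# Minimal undersampling sketch for the two-class case (NumPy only; data are synthetic).
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.03).astype(int)            # ~3% minority class, e.g. attrition
X = rng.normal(size=(10_000, 5))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

n = len(minority_idx)                                   # desired number of minority instances
sampled_majority = rng.choice(majority_idx, size=n, replace=False)  # equal number of majority instances
balanced_idx = rng.permutation(np.concatenate([minority_idx, sampled_majority]))

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_balanced))                          # roughly equal class counts
```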

This technique can be generalized to classification across multiple classes (> 2). In particular:

  • The “balancing” can be generalized across all classes.
  • It has to be ensured that each class is represented with approximately equal proportions in train and test datasets.

 7.7 Parameters Tuning

It’s important to note that test data should not be used to create the classifier. Rather, a two-stage approach should be followed, involving:

  • Building the basic structure of the classifier, and then
  • Optimizing the parameter settings.

Test data cannot be used for parameter tuning. Hence, three sets are needed:

  • Training data.
  • Validation data.
  • Test data.

In this case, validation data are used to optimize the parameters. The validation data are different from the training and test data.

A validation set is used in order to carry out parameter tuning and optimization. Once the (machine learning) model is close to its final form, it can be evaluated against the test data. The training, test and validation datasets are non-overlapping (i.e., distinct) datasets.
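A minimal sketch of this three-set workflow is shown below, assuming scikit-learn and a synthetic dataset; the tuned parameter (tree depth) is chosen purely for illustration:

```python
# Sketch of a train / validation / test workflow for parameter tuning
# (assumes scikit-learn; the three sets are distinct, non-overlapping subsets).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_depth, best_score = None, -1.0
for depth in (2, 4, 8, 16):                       # tune the parameter on validation data only
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))  # test data touched only once
```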

 7.8 Data Mining Models for IoT

7.8.1     Overview of IoT Data Mining Models

All the previously presented data mining models and techniques are applicable to all types of datasets, including IoT datasets, other streaming datasets (e.g., Twitter data), traditional batch datasets and more. The deployment of these models in an IoT context can be performed based on three different approaches, which have been identified in the IoT literature. These include:

  • The Multilayer data mining model, which entails data collection, data processing, event processing, and data mining services.
  • The Distributed data mining model, which is used to solve problems associated with the storage of IoT data in different locations.
  • The Grid-based data mining model, which utilizes the potentially unlimited amount of data by using a grid/cloud computing infrastructure.

 We can revisit and consolidate the peculiar challenges of IoT data mining, which are primarily addressed by the previously presented data mining models for IoT. In particular:

  • IoT data are real-time, which introduces timing and streaming challenges.
  • IoT data are provided based on an uninterrupted data flow, which should be addressed at the level of IoT nodes.
  • IoT data can be potentially unlimited, which makes it challenging to store and process them in-memory.
  • IoT applications are dynamic, which calls for data models and machine learning models that are dynamic and adaptive, rather than static and unchanged during the lifetime of the IoT application.
  • Several IoT applications involve actuation and use the data in order to drive decisions and provide intelligent feedback to other systems, which is another factor that differentiates IoT from other types of data-intensive systems.

 7.8.2  The Multi-Layer Model

The multilayer data model for IoT data mining, includes a set of layered components as follows:

  • A data collection layer entails IoT data collection and acquisition based on appropriate interfaces to sensors and other IoT devices.
  • A data management layer undertakes the storage of these data in a centralized or distributed database (e.g., SQL or noSQL database). Furthermore at this layer, data could be structured and persisted in a data warehouse.
  • An event processing layer processes (e.g., filters) the IoT data with a view to producing events.
  • An event mining layer deploys data mining and machine learning techniques (such as the ones presented earlier) in order to produce/extract knowledge from the IoT data.

Note that most IoT platforms employ this multilayer data model (or a variation of it) in order to enable BigData analytics based on IoT data.
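As a purely schematic illustration (not the API of any specific platform), the four layers could be sketched as a simple in-memory pipeline; the function names and the temperature payload are assumptions made for the example:

```python
# Schematic sketch of the multi-layer model as a toy pipeline (illustrative names and data).
from statistics import mean

def collect(readings):                      # data collection layer: acquire and clean raw sensor readings
    return [r for r in readings if r is not None]

def store(db, readings):                    # data management layer: persist readings (here, in memory)
    db.extend(readings)
    return db

def detect_events(readings, threshold=30):  # event processing layer: filter readings into events
    return [r for r in readings if r > threshold]

def mine(db):                               # data mining layer: extract simple knowledge from the data
    return {"mean_temperature": mean(db), "n_readings": len(db)}

db = []
raw = [21.5, None, 34.2, 29.8, 31.0]        # one None simulates a failed reading
clean = collect(raw)
store(db, clean)
print(detect_events(clean), mine(db))
```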

 Each of the layers of the multi-layer data model provides some added value functionalities as follows:

  • The data collection layer deals with energy-saving features, alleviates misreadings, provides the means for repeatable reads, ensures fault tolerance and basic data filtering close to the source, while at the same time undertaking networked communication with the systems supplying the data.
  • The data management layer manages the collected data based on a centralized or distributed database, or even a data warehouse. It also undertakes data abstraction and compression.
  • The event processing layer is used to effectively analyze events and to perform query/inquiry analysis based on those events.
  • The data mining service layer deals with data, data mining and knowledge. In this context, data refers to the data transmitted from the lower layers, data mining emphasizes the extraction of characteristics, and knowledge represents a specific service based on the processing and analytics of the IoT data.

 7.8.3      The Grid-Based Data Mining Model

The grid computing data model for data mining is based on the integration of IoT data and data processing functions into a conventional grid/cloud computing infrastructure. In this context, the usual cloud/grid computing functionalities (such as workflow management, data services and security) become applicable to IoT data-centric services and IoT datasets. This also provides the foundation for the implementation of data mining and knowledge extraction functions.

7.9 Epilogue

This chapter has provided an overview of the IoT data mining process. It has also discussed a range of popular machine learning models and their use in the scope of IoT data mining. We concluded by discussing IoT architectures for data mining, which include modules for data collection and (pre)processing.


