Teaching Notes



Example 1: (KTO1, KTO2, STO1)

a. What are the advanced analytics methods?

b. What are modeling philosophies?


Example 1: Solution

a. Advanced Analytics Methods

1. Linear Regression
2. Logistic Regression
3. Classification and Regression Trees
4. Segmentation and Clustering
5. Time Series Analysis

b. Modelling Philosophies

Whatever your approach to model development, some ideas should be kept in mind.

· Models should start with theory.

· Data used to develop the model should not be used to test it.

· Testing down generally creates less bias; however, it is not always possible.

· Multiple tests create compounding errors; consider using singular tests of multiple constraints where feasible.

· Models should always be subject to specification and other tests before being accepted.

· Model development is an art, not a science.

· Practitioners vary greatly in their approaches to model development.

· Many of these approaches violate the assumptions that are required to assure that inferences can be drawn and tests conducted with appropriate levels of confidence.

· Generally, it is believed that such violations lead to more useful results.

Example 2: (KTO1, KTO2, STO1)

a. What is linear regression?

b. What is the dependent variable Y in linear regression? What are the independent variables (Xs)?

c. What is the model description for the linear regression model?

d. What method is used to estimate the parameters in linear regression?


Example 2: Solution

a. In general, regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest.

b. Often, the outcome variable is called a dependent variable because the outcome depends on the other variables. These additional variables are sometimes called the input variables or the independent variables.

c. As the name of this technique suggests, the linear regression model assumes that there is a linear relationship between the input variables and the outcome variable.

y = β0 + β1x1 + β2x2 + ... + βp-1xp-1 + ε

where y is the outcome variable, x1, x2, ..., xp-1 are the input variables, β0, β1, ..., βp-1 are the unknown model parameters, and ε is the error term.

d. Ordinary least squares (OLS) is a common technique for estimating the parameters. With OLS, the objective is to find the line through the points that minimizes the sum of the squares of the vertical differences between each point and the line. In other words, find the values of β0 and β1 such that the summation shown in the equation is minimized. The least squares estimators are often called BLUE, best linear unbiased estimators; the term best refers to the property of minimum variance. The steps of the OLS method are:

· Take the difference between the dependent variable and its estimate.

· Square the difference.

· Sum the squared differences over all data points.

· To find the parameters that minimize the sum of squared differences, take the partial derivative with respect to each parameter and equate it to zero (the closed-form results for the one-predictor case are sketched below).
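Carrying out these steps for the one-predictor model gives closed-form estimates; this is the standard OLS result, stated here for reference (b0 and b1 denote the estimates of β0 and β1, and x-bar and y-bar the sample means):

b1 = Σ (xi - x-bar)(yi - y-bar) / Σ (xi - x-bar)²

b0 = y-bar - b1 · x-bar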


Example 3: [KTO1, KTO2 , KTO3 , STO1]

a. What is simple linear regression?

b. What are the hypothesis tests in linear regression?

c. What are the assumptions of linear regression?


Example 3: Solution

a. Simple linear regression can be used to more fully describe the relationship between two continuous variables, as opposed to scatter plots and Pearson correlations. Regression model parameter estimates not only define the line of best fit corresponding to the linear association between variables, but they also describe how a change in a predictor corresponds to a change in the response. To practice performing simple linear regression, let's build a model using Lot_Area as the predictor and SalePrice as the response. In simple linear regression, the goal is to identify the equation that characterizes the linear association between the predictor variable and the response variable, and use the model to then estimate the response for a given value of the predictor. The regression line is the expected value, or mean of Y (at any given X), which equals β0 + β1 times X. The intercept is often of less interest than the slope. Sometimes, the intercept corresponds to an impossibility. For example, in a regression of height on weight, the intercept would indicate the height of someone who weighs nothing. Other times, when X=0 is a possible value, it can be outside the range of actual data points. Therefore, be cautious when you interpret the regression relationship outside the range of your data.
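As a minimal sketch of how this model might be fit in SAS (the data set name ameshousing is an assumption used for illustration; the notes do not specify one):

PROC REG DATA = WORK.AMESHOUSING;
 MODEL SalePrice = Lot_Area;
RUN;
QUIT;

The Parameter Estimates table then provides the fitted intercept and slope that define the regression equation described above.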


b. To determine whether the predictor variable explains a significant amount of variability in the response variable, the simple linear regression model is compared to the baseline model. The fitted regression line in a baseline model is just a horizontal line across all values of the predictor variable. The slope of this line is 0, and the y-intercept is the sample mean of Y, which is Y-bar. In a baseline model, the X and Y variables are assumed to have no relationship. This means that for predicting values of the response variable, the mean of the response, Y-bar, does not depend on the values of the X variable.


For each data point, you calculate Y-i minus Y-hat-i and square the difference. Then sum these squared values to find the error sum of squares, or SSE, which is the amount of variability that your model fails to explain. The total variability is the difference between the observed values and the mean of the response variable. For each data point, you calculate Y-i minus Y-bar and square the difference. Then sum these squared values to get the corrected total sum of squares, or SST, which is, of course, the sum of the model and error sum of squares. The SSM and SSE are divided by their corresponding degrees of freedom to produce the mean-square model (MSM) and mean-square error (MSE). The significance of the regression analysis is assessed the same way as ANOVA, that is, by computing the F ratio, the mean squared model divided by the mean squared error, and the corresponding p-value. In fact, you'll see an ANOVA table in your regression output as well.
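In symbols, using the notation of this paragraph (a summary stated here for reference):

SSE = Σ (Yi - Y-hat-i)²   (error sum of squares)
SSM = Σ (Y-hat-i - Y-bar)²   (model sum of squares)
SST = Σ (Yi - Y-bar)² = SSM + SSE   (corrected total sum of squares)

MSM = SSM / (model degrees of freedom), MSE = SSE / (error degrees of freedom), and F = MSM / MSE.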


c. For a simple linear regression analysis to be valid, four assumptions need to be met. The first assumption is that the mean of the response variable is linearly related to the value of the predictor variable. In other words, a straight line connects the means of the response variable at each value of the predictor variable. The error terms are normally distributed with a mean of 0, the error terms have equal variances, and the error terms are independent at each value of the predictor variable.


Example 4: [KTO1, KTO2, STO1]

a. What is multiple linear regression?

b. What are the advantages and disadvantages of multiple linear regression?

c. When do you use multiple linear regression?



Example 4: Solution

a. When you have two predictor variables, you model the relationship of the three variables - three dimensions - with a two-dimensional plane. Let's look at a model with two predictors.

Y = β0 + β1X1 + β2X2 + ε

Y is the response variable, X-1 and X-2 are the predictor variables, ε is the error term, and β0, β1, and β2 are unknown parameters. β0 is the y-intercept and has the same meaning as the intercept in a simple linear regression.


b. Why would you perform multiple linear regression instead of a series of simple linear regressions?

The biggest advantage is that multiple regression enables you to determine the relationship between a predictor and response while controlling for all other predictors included in the model. Sometimes a hidden relationship can be revealed or a strong relationship can disappear when additional predictors are accounted for by including them in the regression model. You can determine whether a relationship exists between the response variable and several predictor variables simultaneously. You can also test for interactions just like in ANOVA.

The more predictors you have, the more complicated interpreting the model becomes. Consider an example with one response variable and seven potential predictor variables. The increased complexity makes it more difficult to interpret the models, and to decide which model to use. Overall, the advantage of performing multiple linear regression over a series of simple linear regression models far outweighs the disadvantages. In practice, the response often depends on multiple factors that might interact in some way.


c. It's a powerful tool for both explanatory analysis and for prediction. In explanatory analysis, you develop a model to test the statistical significance of the parameter coefficients to determine whether a relationship exists between the response variable and the predictor variables. For example, does increasing the number of police officers affect the crime rate? In a situation like this, you're not necessarily concerned about predicting crime. Instead, you're trying to understand what relationship certain factors have on the crime rate.

When you interpret the parameters, you take the magnitudes and signs of the coefficients into account.


Example 5: [KTO1, KTO2 ,STO1]

a. What is logistic regression, and how is it different from linear regression?

b. What are the model descriptions for the logistic regression?

c. How do you fit and diagnose a logistic regression model?

d. What are the reasons for choosing the logistic regression model?


Example 5: Solution

a. In linear regression modeling, the outcome variable is a continuous variable. Suppose a person’s actual income was not of interest, but rather whether someone was wealthy or poor. In such a case, when the outcome variable is categorical in nature, logistic regression can be used to predict the likelihood of an outcome based on the input variables.


b. pi represents the probability that y equals 1 given the inputs. The term posterior means that the probability is calculated after the input information is provided to the model. The probabilities have a nonlinear relationship with the input variables, so a logit transformation is used to linearize the outcome of the logistic regression model. The standard logistic regression model assumes that the logit of the posterior probability is a linear combination of the input variables:

ln( pi / (1 - pi) ) = β0 + β1x1 + β2x2 + ... + βp-1xp-1

where pi = P(y = 1 | x1, x2, ..., xp-1).

c. Model fit statistics:

  1. AIC: Akaike Information Criterion
  2. SC: Schwarz Criterion
  3. -2 Log L
  4. These statistics are used to compare competing models.
  5. The smaller the value, the better the model.


Fitting and Diagnostics of Logistic Regression Model

Fitting a Logistic Regression Model in SAS

Specify the model: Define which variable is the response (dependent variable) and which variables are the predictors (independent variables). The response variable should be binary.

Fit the model: Use PROC LOGISTIC to fit the model. You can specify options for the fitting method and request various types of output, including odds ratios, predicted values, and more. Example:

proc logistic data=mydata;
   class categorical_variable; /* if you have categorical predictors */
   model binary_response(event='1') = predictors / selection=stepwise details;
run;

In this example, binary_response is the dependent variable, and predictors represents one or more independent variables. The event='1' option specifies which outcome of the binary response is considered the 'event' (usually the one of primary interest). The selection=stepwise option performs stepwise variable selection, and details provides additional output information.

Evaluating Model Diagnostics and Fit Statistics

After fitting the model, you'll want to assess its performance and diagnostic measures:

· AIC (Akaike Information Criterion): A measure of the relative quality of a statistical model for a given set of data. It deals with the trade-off between the goodness of fit of the model and the complexity of the model.

· SC (Schwarz Criterion) or BIC (Bayesian Information Criterion): Similar to AIC but with a higher penalty for models with more parameters, making it more stringent against model complexity.

· -2 Log L (-2 Log Likelihood): The log likelihood of the model multiplied by -2. The log likelihood measures how well the model predicts the observed data; thus, -2 Log L is often used in model comparison.

· Model comparison: These statistics are particularly useful when comparing competing models. A smaller value of AIC, SC, or -2 Log L indicates a model with a better balance between goodness of fit and simplicity.

SAS Output

In the output from PROC LOGISTIC, you'll find these statistics under the 'Model Fit Statistics' section. SAS automatically calculates and reports them, allowing you to compare different models or configurations easily.

Additional Diagnostic Tools

Beyond these model fit statistics, it's also important to consider other diagnostics and tests to evaluate the logistic regression model fully:

· Hosmer-Lemeshow test: Tests the goodness of fit specifically for logistic regression models.

· Confidence intervals for estimated odds ratios: Help assess the precision and significance of the predictors.

· ROC curve and AUC: Evaluate the predictive accuracy of the model.

· Residual analysis: Identifies outliers or observations that are not well explained by the model.

Each of these tools and statistics provides different insights into the model's performance, helping ensure that your logistic regression model is both accurate and meaningful for your data.
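As a sketch of how these diagnostics might be requested in SAS, reusing the placeholder names from the example above (LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test, CLODDS=PL requests profile-likelihood confidence limits for the odds ratios, and the PLOTS option requests the ROC curve):

proc logistic data=mydata plots(only)=(roc);
   class categorical_variable;
   model binary_response(event='1') = predictors / lackfit clodds=pl;
run;

The c statistic reported in the Association of Predicted Probabilities and Observed Responses table equals the AUC.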

(SAS output: Association of Predicted Probabilities and Observed Responses, reporting Percent Concordant, Percent Discordant, Percent Tied, and the rank-correlation statistics Somers' D, Gamma, Tau-a, and c.)

  • Percent concordant: 80% and above is generally considered good.
  • The four values in the column (Somers' D, Gamma, Tau-a, and c) are all measures of rank correlation and are computed from the concordant, discordant, and tied pairs.
  • The c statistic ranges from 0.5 to 1 and equals the AUC (area under the ROC curve); the higher, the better.



d. Linear regression is suitable when the input variables are continuous or discrete, including categorical data types, but the outcome variable is continuous. If the outcome variable is categorical, logistic regression is a better choice. Both models assume a linear additive function of the input variables. If such an assumption does not hold true, both regression techniques perform poorly. Furthermore, in linear regression, the assumption of normally distributed error terms with a constant variance is important for many of the statistical inferences that can be considered. If the various assumptions do not appear to hold, the appropriate transformations need to be applied to the data.


Although a collection of input variables may be a good predictor for the outcome variable, the analyst should not infer that the input variables directly cause an outcome. For example, it may be identified that those individuals who have regular dentist visits may have a reduced risk of heart attacks. However, simply sending someone to the dentist almost certainly has no effect on that person’s chance of having a heart attack. It is possible that regular dentist visits may indicate a person’s overall health and dietary choices, which may have a more direct impact on a person’s health. This example illustrates the commonly known expression, “Correlation does not imply causation.”


Example 6: [KTO1, KTO2, STO1]

a. What are the additional regression models?


Example 6: Solution

a. In the case of multicollinearity, it may make sense to place some restrictions on the magnitudes of the estimated coefficients. Ridge regression, which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. Lasso regression is a related modeling technique in which the penalty is proportional to the sum of the absolute values of the coefficients.
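In symbols, the penalized objectives can be written as follows (λ ≥ 0 is the penalty weight; this is the standard formulation, stated here for reference):

Ridge regression: minimize Σ (yi - y-hat-i)² + λ Σ βj²

Lasso regression: minimize Σ (yi - y-hat-i)² + λ Σ |βj|

where the first sum runs over the observations and the second over the coefficients.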

Only binary outcome variables were examined in the use of logistic regression. If the outcome variable can assume more than two states, multinomial logistic regression can be used.


Example 7: (KTO1, KTO2, STO1)

a. What is CART?

b. What is the process of growing a decision tree?

c. What are the advantages and disadvantages of CART?


Example 7: Solution

a. CART: Classification and Regression Tree

The linear regression approaches that we used in our previous course represent some powerful methods. However, when the data contain large numbers of explanatory variables that may interact in very complicated ways, building a global linear model can be difficult and challenging. Decision trees are a data mining method that allows us to explore the presence of potentially complicated interactions within data by creating segmentations or subgroups.


Like linear regression, decision trees are statistical models designed for what are known as supervised prediction problems. Decision trees are so named because the predictive model can be represented in a tree-like structure. Due to their flexibility and easy visualization, decision trees are commonly deployed in data mining applications for classification purposes.


Decision trees can be applied to a variety of situations. They can be easily represented in a visual way, and the corresponding decision rules are quite straightforward. Additionally, because the result is a series of logical if-then statements, there is no underlying assumption of a linear (or nonlinear) relationship between the input variables and the response variable.


b. The process of growing a decision tree

Based on the variables considered by the model, the decision tree method works by making binary splits in a sample to maximize correct classification of those with and without the target of interest. All possible separations or cut points are tested, and the separation yielding the minimum impurity or error is selected. Subgroups showing similar outcomes, but different constellations of explanatory variables, are generated.
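One commonly used impurity measure for evaluating candidate splits is the Gini index (a standard definition, included here for reference; the notes do not prescribe a particular measure):

Gini(t) = 1 - Σ pk²

where pk is the proportion of observations of class k in node t; a perfectly pure node has a Gini index of 0.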

Decision tree methods also commonly employ a cross-validation procedure to guard against an overfit model, in which splits are included in the tree that do not actually improve the fit of the model. Following an initial growing of the tree, a random subset of the data (a validation sample) is tested, and only branches of the tree that improve the correct classification rate are retained.


Generally, the data is divided into training and validation or test sub-samples. And the training sample is used to grow an overly large tree, while the validation or test sample is then used to estimate the rate at which the cases are misclassified. The misclassification rate is calculated for every sized tree, and the selected sub-tree represents the lowest probability of misclassification.


c. Advantages of CART

· Simple to understand, interpret, and visualize.

· Decision trees implicitly perform variable screening or feature selection.

· Can handle both numerical and categorical data; can also handle multi-output problems.

· Decision trees require relatively little effort from users for data preparation.

· Nonlinear relationships between parameters do not affect tree performance.


Disadvantages of CART

· Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.

· Decision trees can be unstable because small variations in the data might result in a completely different tree being generated.

· The algorithms cannot guarantee to return the globally optimal decision tree.

· Decision tree learners create biased trees if some classes dominate.


Example 8: (KTO1, KTO2, STO1)

a. What are random forests?

b. What is the process of growing a random forest?


Example 8: Solution

a. Single decision trees have some well-known weaknesses: decision-tree learners can create over-complex trees that do not generalize the data well (overfitting), they can be unstable because small variations in the data might result in a completely different tree being generated, the algorithms cannot guarantee returning the globally optimal tree, and they create biased trees if some classes dominate.

Random forests address these weaknesses by growing many trees, each built by applying a series of simple rules or criteria over and over again to choose the variables that best predict the target variable. While a decision tree searches for a split on every variable in every node, a random forest searches for a split on only one variable in a node: the variable that has the largest association with the target among the candidate explanatory variables, but only among those explanatory variables that have been randomly selected to be tested for that node.


b. The process of growing random forests

· First, a small subset of explanatory variables is selected at random.

· Next, the node is split with the best variable among the small number of randomly selected variables, not the best variable of all the variables, as is the case when creating only a single decision tree.

· Once the best variable from the eligible random subset has been used to split the node in question, a new list of eligible explanatory variables is selected at random to split the next node.

· This continues until the tree is fully grown, ideally with one observation in each terminal node.


· With a large number of explanatory variables, the set of eligible variables will be quite different from node to node.

· However, important variables will eventually make it into the tree, and their relative success in predicting the target variable will begin to earn them larger and larger numbers of "votes" in their favor.

· Importantly, each tree is grown on a different randomly selected sample of bagged data, with the remaining out-of-bag (OOB) data available to test the accuracy of each tree.

· For each tree, the bagging process selects about 60% of the original sample, and the resulting tree is tested against the remaining 40% of the sample.



Example 9: [KTO1, KTO2 , KTO3 , STO1]

a. What is Bayes' theorem?

b. What is the Naïve Bayes classifier?

c. What are smoothing and diagnostics for the Naïve Bayes classifier?


Example 9: Solution

a. Bayes' theorem defines P(C|A) = P(A|C)P(C)/P(A).

The probability of testing positive, that is P(A), needs to be computed first. That computation, using the law of total probability, is shown in the equation below:

P(A) = P(A|C) P(C) + P(A|¬C) P(¬C)

According to Bayes' theorem, the probability of C given A is then:

P(C|A) = P(A|C) P(C) / [ P(A|C) P(C) + P(A|¬C) P(¬C) ]

b. With two simplifications, Bayes' theorem can be extended to become a naïve Bayes classifier. The first simplification is to use the conditional independence assumption. That is, each attribute is conditionally independent of every other attribute given a class label ci. See Equation 7-13:

P(a1, a2, ..., am | ci) = P(a1|ci) · P(a2|ci) · ... · P(am|ci)

Therefore, this naïve assumption simplifies the computation of P(a1, a2, ..., am | ci).

The second simplification is to ignore the denominator P(a1, a2, ..., am ).

Because P(a1, a2, ..., am ) appears in the denominator of P(ci|A) for all values of i, removing the denominator will have no impact on the relative probability scores and will simplify calculations.

Naïve Bayes classification applies the two simplifications mentioned earlier and, as a result, P(ci | a1, a2, ..., am) is proportional to the product of P(aj|ci) times P(ci).

This is shown in Equation 7-14:

P(ci | a1, a2, ..., am) ∝ P(ci) · P(a1|ci) · P(a2|ci) · ... · P(am|ci)

c. Smoothing and diagnostics for the Naïve Bayes classifier

If one of the attribute values does not appear with one of the class labels within the training set, the corresponding P(aj|ci) will equal zero. When this happens, the resulting P(ci|A) from multiplying all the P(aj|ci)(j∈[1, m]) immediately becomes zero regardless of how large some of the conditional probabilities are.

Therefore overfitting occurs. Smoothing techniques can be employed to adjust the probabilities of P(aj|ci) and to ensure a nonzero value of P(ci|A). A smoothing technique assigns a small nonzero probability to rare events not included in the training dataset.
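A common choice is Laplace (add-one) smoothing, sketched here for reference; the notes do not name a specific technique. With count(·) denoting counts in the training set and mj the number of distinct values of attribute aj:

P(aj|ci) ≈ (count(aj, ci) + 1) / (count(ci) + mj)

With this adjustment, an attribute value never observed with class ci still receives a small nonzero probability.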


Smoothing techniques are available in most standard software packages for naïve Bayes classifiers. However, if for some reason (such as performance concerns) the naïve Bayes classifier needs to be coded directly into an application, the smoothing and logarithm calculations should be incorporated into the implementation.

Diagnostics

Unlike logistic regression, naïve Bayes classifiers can handle missing values. Naïve Bayes is also robust to irrelevant variables, that is, variables that are distributed among all the classes and whose effects are not pronounced.

The model is simple to implement even without using libraries. The prediction is based on counting the occurrences of events, making the classifier efficient to run. Naïve Bayes is computationally efficient and is able to handle high-dimensional data efficiently.


Compared to decision trees, naïve Bayes is more resistant to overfitting, especially with the presence of a smoothing technique. Despite the benefits of naïve Bayes, it also comes with a few disadvantages. Naïve Bayes assumes the variables in the data are conditionally independent. Therefore, it is sensitive to correlated variables because the algorithm may double count the effects. As an example, assume that people with low income and low credit tend to default. If the task is to score "default" based on both income and credit as two separate attributes, naïve Bayes would experience the double-counting effect on the default outcome, thus reducing the accuracy of the prediction.


Although probabilities are provided as part of the output for the prediction, naïve Bayes classifiers in general are not very reliable for probability estimation and should be used only for assigning class labels. Naïve Bayes in its simple form is used only with categorical variables. Any continuous variables should be converted into categorical variables with the process known as discretization, as shown earlier. In common statistical software packages, however, naïve Bayes is implemented in a way that enables it to handle continuous variables as well.


Example 10: [KTO1, KTO2, STO1]

a. How do you perform the diagnostics of classifiers?


Example 10: Solution

a. A confusion matrix is a specific table layout that allows visualization of the performance of a classifier.

True positives (TP) are the number of positive instances the classifier correctly identified as positive. False positives (FP) are the number of instances the classifier identified as positive but that in reality are negative.


· True negatives (TN) are the number of negative instances the classifier correctly identified as negative.

· False negatives (FN) are the number of instances classified as negative but that in reality are positive.


In a two-class classification, a preset threshold may be used to separate positives from negatives. TP and TN are the correct guesses. A good classifier should have large TP and TN and small (ideally zero) numbers for FP and FN.

A generic two-class confusion matrix has the following layout:

                      Predicted positive    Predicted negative
Actual positive       TP                    FN
Actual negative       FP                    TN

It's easy to visually inspect the table for errors, because they will be represented by any nonzero values outside the diagonal.
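From these four counts, several standard rates can be computed (standard definitions, listed here for reference):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True positive rate (recall) = TP / (TP + FN)
False positive rate = FP / (FP + TN)
Precision = TP / (TP + FP)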


Example 11: [KTO1, KTO2 , STO1]

a. What are the additional classification methods?


Example 11: Solution

a. Besides the two classifiers introduced in this chapter, several other methods are commonly used for classification, including bagging, boosting, random forest, and support vector machines (SVM). Bagging, boosting, and random forest are all examples of ensemble methods that use multiple models to obtain better predictive performance than can be obtained from any of the constituent models. Bagging (or bootstrap aggregating) [15] uses the bootstrap technique that repeatedly samples with replacement from a dataset according to a uniform probability distribution. “With replacement” means that when a sample is selected for a training or testing set, the sample is still kept in the dataset and may be selected again. SVM [16] is another common classification method that combines linear models with instance-based learning techniques.


Support vector machines select a small number of critical boundary instances called support vectors from each class and build a linear decision function that separates them as widely as possible. SVM by default can efficiently perform linear classifications and can be configured to perform nonlinear classifications as well.


Example 12: (KTO1, KTO2, STO1)

a. What is the goal of clustering?

b. What is the K-means clustering algorithm?

c. What is the process of K-means clustering?

d. What are the limitations of K-means clustering?


Example 12: Solution

a. The goal of cluster analysis is to group or cluster observations into subsets based on the similarity of responses on multiple variables; that is, to partition the observations in a data set into a smaller set of clusters, where each observation belongs to only one cluster. Cluster analysis is an unsupervised learning method, meaning there is no specific response variable included in the analysis.


b. With cluster analysis, we want to obtain clusters that have less variance within clusters and more variance between clusters. That is, we want observations within clusters to be more similar to each other (homogeneous within) than they are to observations in other clusters (heterogeneous across).


c. The first step in a K-means cluster analysis is to randomly choose two points in the two-dimensional space.

These points start as the center, or centroid, of each of the two clusters. The distance between each observation and each cluster centroid is then calculated, each observation is assigned to the cluster with the closest centroid, and each centroid is relocated so that the sum of the distances for the points assigned to its cluster is at a minimum. Then the process starts all over again: calculating the distances between the points and the new locations of both centroids, reassigning points to the closest centroid, and relocating each centroid to the place where the sum of the new distances for the points assigned to the cluster is at a minimum. This process is repeated over multiple iterations until the locations of the centroids no longer change very much. During the process, observations that were originally assigned to one cluster may end up in a different cluster.
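As a minimal sketch of how K-means clustering might be run in SAS (the data set and variable names here are illustrative assumptions, not from the notes), PROC FASTCLUS performs K-means clustering; MAXCLUSTERS= sets the number of clusters and OUT= saves the cluster assignments:

PROC FASTCLUS DATA = WORK.CUSTOMERS MAXCLUSTERS = 3 OUT = WORK.CLUSTERS;
 VAR INCOME SPEND VISITS;
RUN;

Standardizing the variables first (for example, with PROC STDIZE) is often advisable so that no single variable dominates the distance calculation.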


There are multiple ways to calculate the distance between observations. The distance measure most commonly used in K-means cluster analysis is Euclidean distance. The Euclidean distance measure determines how close observations are to each other by drawing a straight line between pairs of observations and calculating the distance between them based on the length of this line.
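For two observations p = (p1, ..., pn) and q = (q1, ..., qn), the Euclidean distance is (standard definition, stated here for reference):

d(p, q) = sqrt( (p1 - q1)² + (p2 - q2)² + ... + (pn - qn)² )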


d. First, we need to specify the number of clusters, but we don't know the true number of clusters, and figuring out the number of clusters that represents the true number in the population is fairly subjective. On top of that, the results can change depending on which observations are randomly chosen as the initial centroids. K-means cluster analysis is also not recommended if you have a lot of categorical variables; in that case, you need to use a different clustering algorithm that can better handle them.


K-means clustering assumes that the underlying clusters in the population are spherical, distinct, and of approximately equal size; as a result, it tends to identify clusters with these characteristics. It won't work as well if clusters are elongated or unequal in size. K-means cluster analysis is a good starting point because its simplicity makes it easier to convey the concepts.


Example 13: (KTO1, KTO2, STO1)

a. What are the reasons to choose K-means clustering?


Example 13: Solution

a. K-means is a simple and straightforward method for defining clusters. Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid. Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis.

Although k-means is considered an unsupervised method, there are still several decisions that the practitioner must make:

· What object attributes should be included in the analysis?

· What unit of measure (for example, miles or kilometers) should be used for each attribute?

· Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?

· What other considerations might apply?


Example 14: [KTO1, KTO2 , KTO3 , STO1]

a. What are the additional clustering algorithms?


Example 14: Solution

a. The k-means clustering method is easily applied to numeric data where the concept of distance can naturally be applied. However, it may be necessary or desirable to use an alternative clustering algorithm. As discussed at the end of the previous section, k-means does not handle categorical data. In such cases, k-modes [3] is a commonly used method for clustering categorical data based on the number of differences in the respective components of the attributes.


Because k-means and k-modes divide the entire dataset into distinct groups, both approaches are considered partitioning methods. A third partitioning method is known as Partitioning Around Medoids (PAM). In general, a medoid is a representative object in a set of objects. In clustering, the medoids are the objects in each cluster that minimize the sum of the distances from the medoid to the other objects in the cluster. The advantage of using PAM is that the "center" of each cluster is an actual object in the dataset. PAM is implemented in R by the pam() function included in the cluster R package.


Other clustering methods include hierarchical agglomerative clustering and density clustering methods. In hierarchical agglomerative clustering, each object is initially placed in its own cluster. The clusters are then combined with the most similar cluster. This process is repeated until one cluster, which includes all the objects, exists. The R stats package includes the hclust() function for performing hierarchical agglomerative clustering. In density-based clustering methods, the clusters are identified by the concentration of points. The fpc R package includes a function, dbscan(), to perform density-based clustering analysis. Density-based clustering can be useful to identify irregularly shaped clusters.

Example 15: (KTO1, KTO2, STO1)

a. What is association rules?

b. What are some possible questions that association rules can answer?


Example 15: Solution

a. The general logic behind association rules is as follows. Given a large collection of transactions, in which each transaction consists of one or more items, association rules go through the items being purchased to see what items are frequently bought together and to discover a list of rules that describe the purchasing behavior. The goal with association rules is to discover interesting relationships among the items. (The relationship occurs too frequently to be random and is meaningful from a business perspective, which may or may not be obvious.) The relationships that are interesting depend both on the business context and on the nature of the algorithm being used for the discovery.


b. Here are some possible questions that association rules can answer:

· Which products tend to be purchased together?

· Of those customers who are similar to this person, what products do they tend to buy?

· Of those customers who have purchased this product, what other similar products do they tend to view or purchase?


Example 16: (KTO1, KTO2, STO1)

a. What is the Apriori Algorithm?

b. What is an example of Apriori Algorithm?


Example 16: Solution

a. The Apriori algorithm takes a bottom-up, iterative approach to uncovering the frequent itemsets by first determining all the possible items (or 1-itemsets, for example {bread}, {eggs}, {milk}, ...) and then identifying which among them are frequent.

b. Assuming the minimum support threshold (or the minimum support criterion) is set at 0.5, the algorithm identifies and retains those itemsets that appear in at least 50% of all transactions and discards (or "prunes away") the itemsets that have a support less than 0.5, that is, those that appear in fewer than 50% of the transactions. The word prune is used as it would be in gardening, where unwanted branches of a bush are clipped away.
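Here, the support of an itemset is simply the fraction of transactions that contain it (consistent with the worked exercise later in these notes):

Support(X) = (number of transactions containing X) / (total number of transactions)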

In the next iteration of the Apriori algorithm, the identified frequent 1-itemsets are paired into 2-itemsets (for example, {bread,eggs}, {bread,milk}, {eggs,milk}, ...) and again evaluated to identify the frequent 2-itemsets among them.


Example 17: [KTO1, KTO2 , KTO3 , STO1]

a. How do you evaluate candidate rules?


Example 17: Solution

a. Frequent itemsets from the previous section can form candidate rules such as X implies Y (X → Y). This section discusses how measures such as confidence, lift, and leverage can help evaluate the appropriateness of these candidate rules. Confidence [2] is defined as the measure of certainty or trustworthiness associated with each discovered rule. Mathematically, confidence is the percent of transactions that contain both X and Y out of all the transactions that contain X.

Confidence(X → Y) = Support(X ∪ Y) / Support(X)
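Lift and leverage, mentioned above, are commonly defined as follows (standard definitions, stated here for reference):

Lift(X → Y) = Support(X ∪ Y) / (Support(X) · Support(Y))

Leverage(X → Y) = Support(X ∪ Y) - Support(X) · Support(Y)

A lift near 1 (or a leverage near 0) indicates that X and Y occur together about as often as expected under independence, so the rule is likely uninteresting.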


Example 18: [KTO1, KTO2, KTO3, STO1]

a. What are the applications of Association rules?


Example 18: Solution

a. The term market basket analysis refers to a specific implementation of association rules mining that many companies use for a variety of purposes, including these:

· Broad-scale approaches to better merchandising: what products should be included in or excluded from the inventory each month

· Cross-merchandising between products and high-margin or high-ticket items

· Physical or logical placement of product within related categories of products

· Promotional programs: multiple product purchase incentives managed through a loyalty card program

Besides market basket analysis, association rules are commonly used for recommender systems and clickstream analysis.

Many online service providers such as Amazon and Netflix use recommender systems. Recommender systems can use association rules to discover related products or identify customers who have similar interests. For example, association rules may suggest that those customers who have bought product A have also bought product B, or those customers who have bought products A, B, and C are more similar to this customer. These findings provide opportunities for retailers to cross-sell their products.


Example 19: [KTO2, KTO3 , STO1]

a. How do you validate and test the Apriori algorithm?


Example 19: Solution

a. After gathering the output rules, it may become necessary to use one or more methods to validate the results in the business context for the sample dataset. The first approach can be established through statistical measures such as confidence, lift, and leverage. Rules that involve mutually independent items or cover few transactions are considered uninteresting because they may capture spurious relationships.

As mentioned in Section 5.3, confidence measures the chance that X and Y appear together in relation to the chance X appears. Confidence can be used to identify the interestingness of the rules.

Lift and leverage both compare the support of X and Y against their individual support. While mining data with association rules, some rules generated could be purely coincidental. For example, if 95% of customers buy X and 90% of customers buy Y, then X and Y would occur together at least 85% of the time, even if there is no relationship between the two. Measures like lift and leverage ensure that interesting rules are identified rather than coincidental ones.

Another set of criteria can be established through subjective arguments. Even with a high confidence, a rule may be considered subjectively uninteresting unless it reveals any unexpected profitable actions. For example, rules like {paper}→{pencil} may not be subjectively interesting or meaningful despite high support and confidence values. In contrast, a rule like {diaper}→{beer} that satisfies both minimum support and minimum confidence can be considered subjectively interesting because this rule is unexpected and may suggest a cross-sell opportunity for the retailer. This incorporation of subjective knowledge into the evaluation of rules can be a difficult task, and it requires collaboration with domain experts. As seen in, “Data Analytics Lifecycle,” the domain experts may serve as the business users or the business intelligence analysts as part of the Data Science team. In Phase 5, the team can communicate the results and decide if it is appropriate to operationalize them.


C. Classroom Exercises and Solutions

1. Use Cases: Logistic Regression Models [KTO1, KTO2 , STO1][M-medium]

a. Medical: Develop a model to determine the likelihood of a patient’s successful response to a specific medical treatment or procedure. Input variables could include age, weight, blood pressure, and cholesterol levels.

b. Finance: Using a loan applicant’s credit history and the details on the loan, determine the probability that an applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.


c. Marketing: Determine a wireless customer's probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts. With such insight, target the high-probability customers with appropriate offers to prevent churn.


d. Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.


2. Decision Trees Use Cases: [KTO1, KTO2 , STO1][E-easy]

a. Retailers can use decision trees to segment customers or predict response rates to marketing and promotions.

b. Financial institutions can use decision trees to help decide whether a loan application should be approved or denied.

· In the case of loan approval, computers can use the logical if-then statements to predict whether the customer will default on the loan.

· A checklist of symptoms during a doctor's evaluation of a patient is another example of a decision tree.

c. The artificial intelligence engine of a video game commonly uses decision trees to control the autonomous actions of a character in response to various scenarios.


Example of a Decision Tree

(Decision tree diagram: 11 nodes, with 5 internal nodes shown in blue and 6 terminal nodes, or leaves, shown in green.)

In this example, the selected tree contains 11 nodes, with 5 internal nodes in blue and 6 terminal nodes or leaves in green. Each internal node represents a decision point. For each node, the number of observations is presented along with the rate of adolescents who have or have not smoked regularly. The tree also includes the variables and cut points used to form the nodes.


As the tree diagram illustrates, white adolescents who have ever used marijuana and scored greater or equal to 4.05 on the deviant behavior scale are more likely to have smoked regularly.


Looking at the right side of the tree, these groups include, one, adolescents who did not smoke marijuana or drink alcohol, and two, those adolescents who never smoked marijuana but have had alcohol. An additional binary split on grade point average shows that adolescents with a grade point average above 2.68 (roughly between a C+ and a B- average) are at even lower risk for regular smoking than those with a grade point average below that cut point.


3. K-means Algorithm Use Cases: [KTO1, KTO2, STO1][M-medium]

Cluster analysis is often used in marketing to develop targeted advertising campaigns.

For example, cluster analysis using data on the types of groceries people buy can group people together based on their buying patterns. The results can be used to develop individual purchase profiles and to target specific advertisements and incentives to people depending on their buying patterns. This is often referred to as market segmentation.


Health researchers might use cluster analysis to identify individuals at greatest risk for health problems, and to develop targeted health messages based on patterns of health behavior.



4. The 'database' below has four transactions. What association rules can be found in this set, if the minimum support (i.e., coverage) is 60% and the minimum confidence (i.e., accuracy) is 80%?

Trans_id Itemlist

T1 {K, A, D, B}

T2 {D, A, C, E, B}

T3 {C, A, B, E}

T4 {B, A, D}


4. Solution

Let's first make a tabular and binary representation of the data (1 = item present in the transaction):

Trans_id   A  B  C  D  E  K
T1         1  1  0  1  0  1
T2         1  1  1  1  1  0
T3         1  1  1  0  1  0
T4         1  1  0  1  0  0
STEP 1. Form the item sets. Let's start by forming the item set containing one item. The number of occurrences and the support of each item set is given after it. In order to reach a minimum support of 60%, the item has to occur in at least 3 transactions.

A 4, 100%

B 4, 100%

C 2, 50%

D 3, 75%

E 2, 50%

K 1, 25%

STEP 2. Now let's form the item sets containing 2 items. We only take the item sets from the previous

phase whose support is 60% or more.

A B 4, 100%

A D 3, 75%

B D 3, 75%

STEP 3. The item sets containing 3 items. We only take the item sets from the previous phase whose

support is 60% or more.

A B D 3, 75%

STEP4. Lets now form the rules and calculate their confidence (c). We only take the item sets from the

previous phases whose support is 60% or more.

Rules:

A -> B: P(B|A) = |A∩B| / |A| = 4/4, c: 100%

B -> A c: 100%

A -> D c: 75%

D -> A c: 100%

B -> D c: 75%

D -> B c: 100%

AB -> D c: 75%

D -> AB c: 100%

AD -> B c: 100%

B -> AD c: 75%

BD -> A c: 100%

A -> BD c: 75%

The rules with a confidence below the 80% minimum (the 75% rules) are pruned, and we are left with the following rule set:

A -> B

B -> A

D -> A

D -> B

D -> AB

AD -> B

DB -> A


D. Homework Exercises and Solutions

[KTO1, KTO2 , STO1][C-challenging]

1. Using the bodyfat2 data set (see attached), fit a simple linear regression model.


a. Fit a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor.

b. What is the value of the F statistic and the associated p-value? How would you interpret this in connection with the null hypothesis?

c. Write the predicted regression equation.

d. What is the value of R-square? How would you interpret this?


2. In the simple linear regression model, what does β1 represent?

Y = β0 + β1X + ε

a. the variation of X around the line
b. the predictor variable
c. the variation of Y around the line
d. the slope parameter
e. the intercept parameter


3. Given the following PROC REG output and assuming a significance level of 0.05, which of the following statements is true?

a. The model explains approximately 15% of the variation in the response variable.
b. You should reject the null hypothesis.
c. The model explains less than 1% of the variation in the response variable.
d. Height is statistically significant for predicting the values of the response variable.


4. Which statistic is used to test the null hypothesis that all regression slopes are zero, against the alternative hypothesis that they are not all zero?

a. F test in the ANOVA table.
b. F test in the Regression table.
c. Global t test in the parameter estimates table.
d. R square
e. Adjusted R square


5. Using the bodyfat2 table, fit a multiple regression model with multiple predictors, and then modify the model by removing the least significant predictors.

a. Run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.

b. Compare the ANOVA table with the one from the model with only Weight (from question 1). What is different?

c. How do the R-square and the adjusted R-square compare with these statistics for the Weight-only regression?

d. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?

e. To simplify the model, rerun the model from part (a), but eliminate the variable with the highest p-value. Compare the output with the model from part (a). Did the p-value for the model change?

f. Did the R-square and the adjusted R-square values change?

g. Did the parameter estimates and their p-values change?

h. To simplify the model further, rerun the model from part (e), but eliminate the variable with the highest p-value. How did the output change from the previous model?

i. Did the number of parameters with p-values less than 0.05 change?


[KTO1, KTO2 , STO1][C-challenging]

6. Use the “bank marketing dataset” (see attached) to develop decision trees.


a. Load the data set into SAS using either PROC IMPORT or a DATA step.

b. Describe the data set using PROC CONTENTS, and profile the data set for the variables that are important for the decision tree models.

c. Conduct exploratory data analysis (missing values, outliers, and data transformation if necessary; recode the target variable as 0/1).

d. Identify and select the input variables for the target variable "term deposit product" as yes (sale) and no (no sale).

e. Use PROC SURVEYSELECT to randomly split the data into training (70%) and testing (30%).

f. Develop a decision tree in SAS using PROC HPSPLIT. Use appropriate methods to grow and prune the trees.

g. Interpret the model fit statistics and evaluate the importance of variables.

h. Evaluate the model using the test data set to see if the tree adequately predicts which observations will lead to a sale. (Hint: use the CODE FILE= statement in the HPSPLIT procedure to output score code that can be applied to the test data set.)

7. What are the lower and upper bounds for a logit?

a) lower bound = 0, upper bound = 1
b) lower bound = 0, no upper bound
c) no lower bound, no upper bound
d) no lower bound, upper bound = 1

8. The variable Income has the values High, Low, and Medium. You've parameterized the variable with reference cell coding using the default reference level. For which value of Income do both design variables have the value 0?

a) High
b) Low
c) Medium

9. You're modeling the relationship between the variables Gender (with the levels female and male) and Survived (with the levels yes and no). How do you interpret the odds ratio in the output from this PROC LOGISTIC program?


a) The odds of females surviving were 10 times the odds of males surviving.
b) The odds of males surviving were 10 times the odds of females surviving.
c) The probability of a female surviving was 10%.
d) Females aboard the Titanic had a 10% survival rate.

10. The insurance company wants to model the relationship between three of a car's characteristics, weight, size, and region of manufacture, and its safety rating. The safety data set(Attached) contains the data about vehicle safety.


a) Use PROC LOGISTIC to fit a multiple logistic regression model with Unsafe as the response variable and Weight, Size, and Region as the predictor variables.

b) Use the EVENT= option to model the probability of Below Average safety scores.

c) Specify Region and Size as classification variables and use reference cell coding. Specify Asia as the reference level for Region, and 3 (large cars) as the reference level for Size.

d) Request profile likelihood confidence limits, an odds ratio plot, and the effect plot.

e) Do you reject or fail to reject the null hypothesis that all regression coefficients of the model are 0?

f) If you reject the global null hypothesis, then which predictors significantly predict the safety outcome?

g) Interpret the odds ratio for significant predictors.


Solutions

1. Using the bodyfat2 data set (see attached), fit a simple linear regression model.

a. Fit a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor.

PROC REG DATA = AKM.BODYFAT2;
 MODEL PctBodyFat2 = Weight;
RUN;
QUIT;

b. What is the value of the F statistic and the associated p-value? How would you interpret this in connection with the null hypothesis?



c. Write the predicted regression equation.

d. What is the value of R-square? How would you interpret this?



2. In the simple linear regression model, what does β1 represent?

Y = β0 + β1X + ε    (for example, sales = β0 + β1 × ad amount)

Answer: d

a. the variation of X around the line
b. the predictor variable
c. the variation of Y around the line
d. the slope parameter
e. the intercept parameter


3. Given the following PROC REG output and assuming a significance level of 0.05, which of the following statements is true?

(PROC REG output not shown.)

Answer: c

a. The model explains approximately 15% of the variation in the response variable.
b. You should reject the null hypothesis.
c. The model explains less than 1% of the variation in the response variable.
d. Height is statistically significant for predicting the values of the response variable.


4. Which statistic is used to test the null hypothesis that all regression slopes are zero, against the alternative hypothesis that they are not all zero?

Answer: a. The F test in the ANOVA table tests the global hypothesis for the model. The F tests in the Type I and Type III tables, as well as the t tests in the parameter estimates table, only test individual effects. The R-square and adjusted R-square are measures of model fit.

a. F test in the ANOVA table.
b. F test in the Regression table.
c. Global t test in the parameter estimates table.
d. R square
e. Adjusted R square


5. Using the bodyfat2 table, fit a multiple regression model with multiple predictors, and then modify the model by removing the least significant predictors.

Answers

a. Run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.

PROC REG DATA = AKM.BODYFAT2;
 MODEL PctBodyFat2 = Age Weight Height Neck Chest
                     Abdomen Hip Thigh Knee Ankle
                     Biceps Forearm Wrist;
RUN;
QUIT;

b. Compare the ANOVA table with the one from the model with only Weight (from question 1). What is different?

c. How do the R-square and the adjusted R-square compare with these statistics for the Weight-only regression?

d. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?

e. To simplify the model, rerun the model from part (a), but eliminate the variable with the highest p-value. Compare the output with the model from part (a). Did the p-value for the model change?

f. Did the R-square and the adjusted R-square values change?

g. Did the parameter estimates and their p-values change?

h. To simplify the model further, rerun the model from part (e), but eliminate the variable with the highest p-value. How did the output change from the previous model?

i. Did the number of parameters with p-values less than 0.05 change?


Solutions

6. Use the “bank marketing dataset” (see attached) to develop decision trees

a) Load the data set into SAS using either PROC IMPORT or a DATA step.


PROC IMPORT OUT= AKM.BANK
     DATAFILE= "C:\Users\akmar\Desktop\BA Aug 2021\Data\CART\Data\bank-full.csv"
     DBMS=CSV REPLACE;
     DELIMITER=";";
     GETNAMES=YES;
     DATAROW=2;
     GUESSINGROWS=15000;
RUN;


b) Describe the data set using PROC CONTENTS, and profile the data set for the variables that are important for the decision tree models.

PROC CONTENTS DATA = AKM.BANK;RUN;

PROC CONTENTS DATA = AKM.BANK VARNUM SHORT;RUN;

c) Conduct exploratory data analysis (missing values, outliers, and data transformation if necessary; recode the target variable as 0/1).

d) Identify and select the input variables for the target variable "term deposit product" as yes (sale) and no (no sale).


DATA BANK;
 SET AKM.BANK;
 IF Y = "yes" THEN TARGET = 1;
 ELSE TARGET = 0;
RUN;

PROC SORT DATA = BANK;
 BY TARGET;
RUN;


e) Use PROC SURVEYSELECT to randomly split the data into training (70%) and testing (30%).

PROC SURVEYSELECT DATA = BANK RATE = 0.7 OUT = AKM.BANK_SAMPLE SEED = 987654321 OUTALL;
RUN;

PROC FREQ DATA = AKM.BANK_SAMPLE;
 TABLE SELECTED*TARGET;
RUN;


DATA BANK_TRAIN BANK_TEST;
 SET AKM.BANK_SAMPLE;
 IF SELECTED = 1 THEN OUTPUT BANK_TRAIN;
 ELSE OUTPUT BANK_TEST;
RUN;


f) Develop a decision tree in SAS using PROC HPSPLIT. Use appropriate methods to grow and prune the trees.

PROC HPSPLIT DATA = BANK_TRAIN;
 CLASS TARGET marital education default housing loan contact month campaign pdays previous;
 MODEL TARGET (EVENT='1') = age marital education default balance housing loan contact day month duration campaign pdays previous;
 PRUNE COSTCOMPLEXITY;
 PARTITION FRACTION (VALIDATE=0.3 SEED=54321);
 CODE FILE = "C:\Users\akmar\Desktop\BA Aug 2021\Data\CART\BANK_TREE.SAS";
 OUTPUT OUT=DT_SCORED;
RUN;
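To address part h, the score code written by the CODE FILE= statement can then be applied to the test data. A minimal sketch (the output data set name is an assumption; the generated code typically creates predicted-probability variables such as P_TARGET1, so check the generated file for the exact names):

DATA BANK_TEST_SCORED;
 SET BANK_TEST;
 %INCLUDE "C:\Users\akmar\Desktop\BA Aug 2021\Data\CART\BANK_TREE.SAS";
RUN;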


g) Interpret the model fit statistics and evaluate the importance of variables.


7. What are the lower and upper bounds for a logit?

Answer: c. A probability is bounded by 0 and 1. The logit of the probability transforms the probability into a linear function, which has no lower or upper bounds.

a) lower bound = 0, upper bound = 1
b) lower bound = 0, no upper bound
c) no lower bound, no upper bound
d) no lower bound, upper bound = 1

8. The variable Income has the values High, Low, and Medium. You've parameterized the variable with reference cell coding using the default reference level. For which value of Income do both design variables have the value 0?

Answer: c

a) High
b) Low
c) Medium

9. You're modeling the relationship between the variables Gender (with the levels female and male) and Survived (with the levels yes and no). How do you interpret the odds ratio in the output from this PROC LOGISTIC program?

Answer: a. In the output, the odds ratio of survival for females to males is 10.147. This means that the odds of females surviving were 10 times the odds of males surviving.

a) The odds of females surviving were 10 times the odds of males surviving.
b) The odds of males surviving were 10 times the odds of females surviving.
c) The probability of a female surviving was 10%.
d) Females aboard the Titanic had a 10% survival rate.

10. The insurance company wants to model the relationship between three of a car's characteristics, weight, size, and region of manufacture, and its safety rating. The safety data set(Attached) contains the data about vehicle safety.


Answers:

a) Use PROC LOGISTIC to fit a multiple logistic regression model with Unsafe as the response variable and Weight, Size, and Region as the predictor variables.

b) Use the EVENT= option to model the probability of Below Average safety scores.

c) Specify Region and Size as classification variables and use reference cell coding. Specify Asia as the reference level for Region, and 3 (large cars) as the reference level for Size.

d) Request profile likelihood confidence limits, an odds ratio plot, and the effect plot.


PROC LOGISTIC DATA = AKM.SAFETY PLOTS(ONLY) = (ROC ODDSRATIO);
 CLASS Region (PARAM=REF REF="Asia") Size (PARAM=REF REF="3");
 MODEL Unsafe (EVENT="1") = Size Weight Region / CLODDS=PL;
RUN;
QUIT;

e) Do you reject or fail to reject the null hypothesis that all regression coefficients of the model are 0?

The null hypothesis is rejected.


f) If you reject the global null hypothesis, then which predictors significantly predict the safety outcome?

From the Analysis of Effects table, Size is the only predictor that significantly predicts the safety outcome.


g) Interpret the odds ratio for significant predictors.

· Only Size is a significant predictor of the safety outcome.

· The odds of a below-average safety rating for Size = 1 (small or sports) cars are 14.560 times the odds for Size = 3 (large or sport/utility) cars. The 95% confidence interval (3.018, 110.732) does not contain 1, so the effect is statistically significant at the 0.05 level.


?

?
