Text analytics models with RapidMiner, deployment and extensions (Part 3 Advanced Models)

Model setup and analysis

The classification methodology to identify fraudulent claims, built using historical claim details, consists of the following steps:

1) Data preparation: the Claims data created in the previous read and write steps is retrieved from the repository. Those earlier steps renamed some attributes (removing ‘=’ from the flag fields, which otherwise causes problems with the upsampling below). The role of Fraud Flag is set to label, as this is what is to be predicted. Relevant attributes are selected and outliers are tested for using k-NN global anomaly detection, which is acted on differently depending on the model: in the structured-only models, records with an outlier score above 0.9 are treated as anomalies and filtered out, whereas in the text-only models no filtering is done, since many potentially useful claim details would be dropped. Text is parsed using Process Documents from Data, where it is shifted to lower case, split into terms that are reduced to their root form, and stripped of stop words, information-poor terms, and very long or short terms. This is followed by the dimensionality-reduction operators Weight by Information Gain and Select by Weights. TF-IDF (term frequency–inverse document frequency) is used to measure each text attribute’s relative value by diminishing the weight of terms that occur frequently and increasing the weight of terms that occur rarely. To reduce the resulting number of text attributes in the model, weights are computed using entropy (information gain). Checking the attribute weights, approximately 250 variables have a non-zero (useful) weight, so top k is set to 250 in Select by Weights. In the neural network models, the nominal attributes are converted to numeric with the unique integers coding type.

2) Splitting the data into 70/30 training/validation sets, so the model can be built on one portion and then tested on unseen data.

3) Utilising ten-fold cross-validation to assess each model’s predictive performance. This uses ten random partitions, so that multiple iterations of sampling over fresh training/validation splits, with complete coverage of the data, are performed. SMOTE upsampling is applied to the training data to account for the severe imbalance of fraud claims (only 93 out of 3,037 claims, a mere 3%); without it, the model becomes overtrained on non-fraudulent cases. SMOTE generates additional examples in between existing fraud claims and classifies them using k-NN; when a new example falls into the majority class it is discarded, otherwise it is added to the data. The benefit is that the generated examples are not identical to existing fraud claims, so there is no duplication of existing data that could lead to overtraining. The downside is that the added examples are not based on real claims. SMOTE sits inside the training component of cross-validation but not the validation component, because validation should represent the real mix of claims. A small issue, though not one causing much trouble, is that each run of cross-validation will over-sample the training set differently and create new, different data points.

4) For the decision tree, the optimal maximum depth is worked out using a Loop Parameters process and is found to be 10. For the text-only neural network model, grid parameter optimisation is used, with the best results found when training cycles is 350, learning rate is 0.3 and momentum is 0.1. For the structured-only neural network model, the same approach finds the best results when training cycles is 250, learning rate is 0.5 and momentum is 0.7. Note that a fixed local random seed of 1992 is used in both the Split Data and neural network parameter steps to standardise results for consistency.

5) Performance statistics of accuracy and kappa are calculated for each model, as shown in Table 2 below. The neural network on text-only data is found to be the best-performing model, with accuracy of 97.10% and kappa of 42.10%.

6) Additional models, namely Gradient Boosted Trees (GBT), a Vote ensemble (based on the decision tree and GBT) and more advanced neural networks, are considered following a similar process. The best-performing model is found to be the neural network on the mix of text and structured data, giving accuracy of 99.01% and kappa of 85.20% when training cycles is 300, learning rate is 0.5 and momentum is 0.7. (This is when Adjustor Notes is also treated as nominal and subsequently converted to numeric with the unique integers coding type as above; modelling on the individual text alone was also tested, but the results were not as strong.) The false positives and false negatives are also suitably low, at 8 and 1 respectively. This model is found using hyperparameter optimisation with grid search, which outputs the optimal model and its associated performance. Parameters are varied across the grid, with training cycles taking six steps between 100 and 400, learning rate two steps between 0.1 and 0.5, and momentum three steps between 0.1 and 1. Chart 1 shows the advanced chart output showcasing these variations and the optimal model chosen, and a minimal code sketch of the overall pipeline follows it.

[Chart 1: grid optimisation output showing parameter variations and the optimal model]
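For readers who prefer code to operator chains, the overall pipeline (TF-IDF features, top-k feature selection, SMOTE applied only inside training folds, and a grid search over the neural network hyperparameters) can be sketched in Python with scikit-learn and imbalanced-learn. This is a minimal illustrative sketch, not an export of the RapidMiner process; the file and column names are assumptions.

```python
# Minimal sketch of the pipeline described above (names and values illustrative).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE to training folds only

claims = pd.read_csv("claims.csv")                  # assumed repository export
X, y = claims["adjustor_notes"], claims["fraud_flag"]

# 70/30 split, stratified to preserve the ~3% fraud rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1992)

pipe = Pipeline([
    # lower-casing and tokenising, with English stop words removed
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    # keep the ~250 most informative terms (entropy/information-gain style)
    ("select", SelectKBest(mutual_info_classif, k=250)),
    # SMOTE runs inside each training fold only; validation keeps the real class mix
    ("smote", SMOTE(random_state=1992)),
    ("net", MLPClassifier(solver="sgd", random_state=1992)),
])

param_grid = {  # analogous to the grid optimisation ranges above
    "net__max_iter": [100, 250, 350],           # ~ training cycles
    "net__learning_rate_init": [0.1, 0.3, 0.5],
    "net__momentum": [0.1, 0.5, 0.7],
}
search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring="balanced_accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```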

K-means clustering is performed to identify data groups and categories, with the Cluster Model Visualiser used to neatly present the results. Using Loop Parameters and an informal weighted sum-of-squares (elbow) assessment, k of 10 is chosen; a short code sketch of this check follows the cluster interpretation below. The resultant heat map chart shown below describes each cluster:

[Image: cluster heat map describing each of the 10 clusters]

Clusters 1, 4, 7 and 8 have a significantly larger Vehicle Flag (that is, more motor vehicles involved), while Clusters 7 and 9 have a significantly larger Nature of Injury (injuries including Unknown, Amputation, Multiple and Bite types).
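The elbow assessment used to settle on k can be reproduced in a few lines of scikit-learn. This is a sketch under the assumption that X holds the prepared numeric claims attributes.

```python
# Sketch: elbow check on the within-cluster sum of squares to choose k.
# X is assumed to be the prepared numeric claims matrix.
from sklearn.cluster import KMeans

for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=1992).fit(X)
    # inertia_ is the within-cluster sum of squared distances; look for the "elbow"
    print(k, round(km.inertia_, 1))
```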

Incorporating Costs

Claims processing costs are incorporated by using the Performance (Costs) operator in RapidMiner. When the fully automatic system is implemented (that is, no manual labour involved and the system relies on text only), the cost matrix incorporating both costs and savings is shown in Table 1. Note, it is assumed that handling misclassifications adds to the original costs; in other words, the costs are associated with the manual labour of handling misclassifications and so are fixed, rather than relative to the cost of handling under the new automatic process.

[Table 1: cost matrix incorporating costs and savings]
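To make the mechanics concrete, a cost matrix turns a confusion matrix into a single expected-cost figure by weighting each cell. The sketch below illustrates the calculation; the cost and saving values are placeholders, not the figures from Table 1.

```python
# Sketch: total misclassification cost from a confusion matrix and a cost matrix.
# The cost/saving values are placeholders, not the Table 1 figures.
import numpy as np

# rows = actual (non-fraud, fraud); columns = predicted (non-fraud, fraud)
confusion = np.array([[2850,  8],
                      [   1, 85]])
cost = np.array([[  0.0,  50.0],   # e.g. manual handling cost of a false positive
                 [400.0, -20.0]])  # e.g. missed-fraud cost vs. saving per caught fraud

print(float((confusion * cost).sum()))  # total expected cost of this classifier
```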

Figure 1 shows the text-only neural network process with Performance (Costs) incorporated alongside the standard Performance operator (which gives the accuracy, kappa and AUC model details), using Grid Optimisation to both find and output the best model, with the preload steps described in the previous section. The Grid Optimisation details are shown on the right side, with the Performance (Costs) operator configured with the above cost matrix to factor these in.

[Figure 1: text-only neural network process with Performance (Costs) and Grid Optimisation]

Optimisation of the models’ performance to minimise the cost of claims processing is performed, with Table 2 showing both the misclassification costs and the breakdown of positives/negatives. It is important to consider the false positives and negatives: many of the scenarios output showed the most cost-effective option as detecting no fraud claims at all, and these were ignored because they do not solve the business problem. Figure 2 shows how classification thresholds based on costs are incorporated into the performance in RapidMiner, adding the Find Threshold and Apply Threshold operators that incorporate the cost matrix once the model is applied; these are then connected to both Performance operators.

[Figure 2: process with Find Threshold and Apply Threshold operators]
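Outside RapidMiner, the Find Threshold idea can be sketched as a scan over candidate thresholds, keeping the one that minimises total cost on validation scores. The labels, confidences and cost values below are assumed placeholders.

```python
# Sketch: choose the classification threshold that minimises total cost.
# y_true (0/1 labels) and p_fraud (model confidences) are assumed validation
# outputs; fp_cost and fn_cost are placeholder business costs.
import numpy as np

def best_threshold(y_true, p_fraud, fp_cost=50.0, fn_cost=400.0):
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        pred = (p_fraud >= t).astype(int)
        fp = int(((pred == 1) & (y_true == 0)).sum())  # non-fraud flagged as fraud
        fn = int(((pred == 0) & (y_true == 1)).sum())  # fraud missed
        costs.append(fp * fp_cost + fn * fn_cost)
    return thresholds[int(np.argmin(costs))]
```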

The best-performing model (the neural network on the mix of text and structured data, optimising the classification threshold based on costs) gives accuracy of 99.12%, kappa of 86.6% and AUC of 97.5%.


The ROC chart represents the different confusion tables produced at different levels of confidence for the outcome class. The x-axis shows the false positive rate (incorrect classification as fraud when it should have been non-fraud) and the y-axis the true positive rate (correct detection of fraud). By default, the outcome is set to positive (such as Fraud = true) when the confidence of that outcome is greater than or equal to 0.5. AUC stands for "Area Under the ROC Curve" and provides an aggregate measure of performance across all possible classification thresholds: it represents the probability that the model ranks a random fraud claim more highly than a random non-fraud example. The raw numbers of false positives, false negatives, true positives and true negatives are shown in Table 2 for each of the models, with the neural network on the mix of text and structured data performing best. It has only 7 false positives and 1 false negative, an outstanding result, as demonstrated by the large area under the ROC curve in Chart 3 (almost covering all of the y-axis, true positives). The large AUC (97.5%) is also promising, because it demonstrates that the model will rank a randomly chosen fraud claim higher than a randomly chosen non-fraud claim.

[Chart 3: ROC curve for the best-performing model]
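The same chart and statistic can be reproduced with scikit-learn from the validation labels and confidences (variable names assumed, as in the threshold sketch above):

```python
# Sketch: ROC points and AUC from validation labels and fraud confidences.
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_true, p_fraud)  # x-axis = FPR, y-axis = TPR
auc = roc_auc_score(y_true, p_fraud)
print(f"AUC = {auc:.3f}")  # P(random fraud claim ranked above random non-fraud)
```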

Deployment

The model will be deployed by the following steps:

1) RapidMiner will need to be installed on the system user’s PC to produce fraud predictions from the classification model.

2) The user will be sent the model file ‘Model.mod’, generated from the ideal neural network approach using Write Model in RapidMiner, as shown in Figure 3.

[Figure 3: Write Model process]

3) As part of data preparation, reasonableness checks need to be performed on the scoring file to ensure sound data quality. Using RapidMiner’s Statistics tab (Figure 4), the user should examine each attribute’s data type and relevance; for example, Body Part should not contain integers or Cause of Injury data.

[Figure 4: RapidMiner Statistics tab]

4) Ensure the scoring file has all required model attributes, namely Body Part, Nature of Injury, Cause of Injury and Vehicle Flag, as these are used as inputs to the predictive model.

5) Run the model deployment process outlined below (step 8) to get the fraud prediction outcomes, and sort to bring all predicted fraud claims to the top.

6) A manual checking process should be performed on the predicted fraud claims because of the false positive rate, which represents the rate of misclassifying non-fraud claims as fraud. There are also numerous fraud claims missed, so this tool should only be used as a supplement to existing identification techniques.

7) All the steps are documented, with support contact details provided in case the user experiences difficulties. Note that this model has not factored in missing data, as it is assumed the input data systems will not allow users to enter blank information.


8) A separate RapidMiner process is created (shown in Figure 5) that pre-processes the scoring data and reads the model from the received model file ‘Classification_model.mod’ to generate the fraud predictions. The scoring CSV is read in, Vehicle Flag is renamed (removing the ‘=’ sign) and the relevant attributes are selected, excluding Adjustor Notes and Claim Number. The classification model is applied and run to produce the outputs shown in Table 3 (sorted so that predicted fraud claims appear first).

[Table 3: prediction outputs, sorted with predicted fraud claims first]

The Confidence(1) column may be used to judge how probable the model believes it is that a claim is fraudulent, which may be of value when claims are manually checked for error detection. If required, the Write Excel operator can be used at the end of the Apply Model process to easily export the table (the contents of the data view) into an Excel file that can be saved locally; a code sketch of this scoring step follows.
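Note that ‘.mod’ files are RapidMiner’s own serialisation format and are only readable inside RapidMiner. Purely as an illustration, if the equivalent pipeline had been trained and saved in Python with joblib, the scoring step (including the confidence column and Excel export) might look like this; all file and column names are assumptions.

```python
# Sketch: apply a saved pipeline to a scoring file, sort predicted frauds first,
# and export to Excel. File and column names are illustrative.
import joblib
import pandas as pd

model = joblib.load("classification_model.joblib")  # stand-in for the .mod file

scores = pd.read_csv("scoring_file.csv")
scores.columns = [c.replace("=", "") for c in scores.columns]  # Vehicle Flag rename

required = ["Body Part", "Nature of Injury", "Cause of Injury", "Vehicle Flag"]
missing = [c for c in required if c not in scores.columns]
if missing:
    raise ValueError(f"Scoring file is missing required attributes: {missing}")

X = scores[required]  # Adjustor Notes and Claim Number excluded, as in step 8
scores["prediction"] = model.predict(X)
scores["Confidence(1)"] = model.predict_proba(X)[:, 1]  # confidence claim is fraud

out = scores.sort_values("Confidence(1)", ascending=False)  # frauds first
out.to_excel("fraud_predictions.xlsx", index=False)  # Write Excel equivalent
```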

Note that the deployment process uses a number of saved artefacts to ensure the training details are preserved in deployment, namely the pre-processing steps: the unique integers model and the attribute weights used. These are saved and replicated in the deployment pre-processing steps shown in Figure 6.

[Figure 6: deployment pre-processing steps]

Extensions

The RapidMiner neural network model results show a breakdown of the important attributes and their associated weights. It shows Adjustor Notes and Body Part as the most important attributes for predicting fraud, with the others far behind in importance. This could be because, as noted in the analysis, most of the fraud claims are back injuries, while where a vehicle is involved the body part is the neck.

Independent research into fraud claims confirms that back and neck injuries are often misused. There are many ways of manipulating the system, including faking by “invent[ing] injuries: soft-tissue injuries such as muscle problems with the back and neck are popular scams. They're hard to disprove, and thus are easier to get away with.” Another pattern: “Workers get injured off the job, but say they're hurt at work so their workers comp[ensation] policy covers the medical bills. A person might hurt his back at home moving furniture.” Also, “inflated injuries…[where] a worker has a fairly minor job injury - maybe a slight twinge in her lower back - but insists [their] back is seriously sprained. This lets the worker collect more workers comp[ensation] money and stay off the job longer” (Minnesota Comp Advisor 2018).

Lechner (2015) states that according to the “Coalition Against Insurance Fraud, the five most common types of fraudulent workers' compensation claims are:

- Claims for injuries that did not occur in the workplace – This type of fraud occurs as employees misrepresent an off-time injury as a work-related one.

- Exaggerated injury claims – Workers experience a legitimate injury, but exaggerate its severity to collect more benefits and gain more paid, off-the-job time.

- Bogus injury claims – Employees claim injuries that never occurred.

- Claiming old injuries – Workers file claims on old injuries as if they were current ones.

- Malingering – Workers downplay recovery to stay off the job and collect benefits longer than necessary.”

Therefore, more data is needed on these fraud claims to extend the analysis and test each of the types mentioned, such as whether the worker has had a claim before, time off work, doctor certificates, injury status, baselining comparisons, etc. Also, for back and neck claims, more detail could be added on the injury to strengthen fraud prediction.

Research also indicates that a “potential flaw with profiling tools is that they can highlight too many ‘false positives’. Their ‘buzzers go off’ too often, creating a large number of ‘potential’ fraud cases to investigate” (Engleman A. 2000). This is the case in the final model chosen, hence the recommendation for a manual checking process to investigate misclassified non-fraud claims. There are also numerous fraud claims missed, so the recommendation is to use this tool only as a supplement to existing identification techniques. There would be significant costs involved, with duplication of processes and people/system resources, so the cost-benefit may be limited.

Another aspect is that “the people who commit fraud quickly figure out the profiling ‘thresholds’ and modify their behaviour to avoid detection”, and “[s]etting the ‘trigger’ threshold is not easy and may need to vary from jurisdiction to jurisdiction and from time to time, as circumstances change” (Engleman A. 2000). A possible extension to the modelling would be making the threshold time-dependent and possibly location-specific, and adding to the complexity of threshold setting by introducing a “Not too sure” class (shown in Figure 7 vs. the standard in Figure 8, from Hussain A. 2018) so that fraud and non-fraud classifications are more certain; a small sketch of this follows the figures. These changes, as well as modelling behaviour that avoids detection through threshold familiarity, should be investigated to future-proof the models. Another potential operator that could be explored is MetaCost in training, which makes its base classifier cost-sensitive using the specified cost matrix and operates intelligently to find optimal results.

[Figures 7 and 8: thresholding with and without a “Not too sure” class (Hussain A. 2018)]
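One simple way to realise the “Not too sure” class is to map the model’s fraud confidence into three bands, with the band edges as tunable (and potentially jurisdiction- and time-specific) parameters. The cut-offs below are illustrative assumptions.

```python
# Sketch: three-way decision from fraud confidence, with a "Not too sure" band.
# The 0.35 / 0.80 cut-offs are illustrative and would need tuning over time
# and per jurisdiction.
def triage(p_fraud: float, low: float = 0.35, high: float = 0.80) -> str:
    if p_fraud >= high:
        return "Fraud"         # confident enough to flag directly
    if p_fraud >= low:
        return "Not too sure"  # route to manual investigation
    return "Non-fraud"

print(triage(0.55))  # -> "Not too sure"
```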

RapidMiner Performance

Again, RapidMiner is great to use from the perspective of focusing on the data, the business problem and the statistical context without getting into the technical details of the tool itself or coding. Neural network performance can be challenging when dealing with large datasets (runs went overnight, sometimes over multiple nights), and the visualisations are limited and rigid in certain aspects; however, the agility and simplicity of building the processes is admirable. Note also that the RapidMiner trial is capped at 10k dataset rows, which still gives great scope to explore all its functionality.

References

1.    Engleman A. 2000, Fraud Management in Workers Compensation, Engleman Etcetera Pty. Ltd., retrieved 18 July 2018, <https://www.injurynet.com.au/documents/Article_Fraud.pdf>

2.    Minnesota Comp Advisor 2018, Accident Investigation and Fraud, Anderson Insurance, retrieved 3 August 2018, <https://www.minnesotacompadvisor.com/managing-injuries/accident-investigation-fraud>

3.    Lechner D. 2015, Preventing the Most Common Workers' Compensation Frauds, ErgoScience, retrieved 3 August 2018, <https://info.ergoscience.com/employer-blog/preventing-the-most-common-workers-compensation-frauds>

4.    Hussain A. 2018, An Investigation into Real-time Fraud Detection in the Telecommunications Industry, Project Tutor, retrieved 3 August 2018, <https://paul.kinlan.me/telecom-fraud-detection/>
