A 'Best Fit' Line Resistant to Outliers when Assessing Precision in Mining & Exploration
This article is aimed at QAQC Geoscientists who may not have a theoretical background but need to be confident that the tools that they use have a sound statistical basis. However, it will be of interest to anyone who wishes to chart a linear trendline which is resistant to outliers. The article is the first of a series of articles which focus on the practical aspects of Statistical Quality Control when assessing accuracy and precision.
In this discussion I will also focus on two other types of regression lines used for inspecting the linear relationship between paired data (duplicates) when quality controlling precision of analytical results:
- Ordinary Least Squares
- Reduced Major Axis.
The discussion includes no formulae; it concentrates on the practical aspects of using these regression lines. There is plenty of background on the internet for those who want the theory.
One of the most widely used charts for quality controlling the precision of analytical results is the scatter plot. The chart provides an ‘at a glance’ view of precision and, importantly, allows for the visual identification of outliers. For example, the chart below also displays warning and error control lines and an X=Y line:
It is also usual to overlay a regression line (not shown in the above image), which is the ‘best-fit’ line through the data.
Ordinary Least Squares (OLS)
Of the three, OLS is by far the most familiar and the most commonly used. However, a fundamental statistic used in the calculation of the OLS line is the arithmetic mean, which is a measure of central tendency. The robustness of a statistic is its resistance to the influence of outliers. The breakdown point of a statistic is the proportion of outlying values it can tolerate before the result can be distorted without limit. The higher the breakdown point, the more robust the statistic. In the case of the mean, the breakdown point is 0, because the mean can be made arbitrarily large or small by changing just one value in the data from which it is derived. This influence can be seen clearly in the charts below:
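The same effect can be reproduced numerically. The sketch below is illustrative only: the duplicate values are hypothetical (not the data behind the charts) and `ols_fit` is a helper name introduced here. It fits an OLS line to ten pairs lying exactly on X=Y, then corrupts a single pair and fits again.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of the ordinary least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical duplicate pairs that agree perfectly (all on the X=Y line)
sample_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
sample_2 = list(sample_1)
clean_slope, clean_intercept = ols_fit(sample_1, sample_2)

# Corrupt just one of the ten duplicates
sample_2_outlier = list(sample_2)
sample_2_outlier[-1] = 30.0
outlier_slope, outlier_intercept = ols_fit(sample_1, sample_2_outlier)

print("clean data:  slope", round(clean_slope, 3), "intercept", round(clean_intercept, 3))
print("one outlier: slope", round(outlier_slope, 3), "intercept", round(outlier_intercept, 3))
```

With the clean data the fitted line coincides with X=Y. After changing one value out of ten, the slope roughly doubles and the intercept moves well away from zero, which is the behaviour expected of a line built on the mean.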
Another aspect of OLS is the assumption that there is a dependent variable and an independent variable. For example, if I were to plot the selling price of a used car (y) against its age in years (x), the selling price goes down as the car ages:
Selling price is dependent on time (age).
OLS minimises the sum of the squared vertical distances between the observed values of the dependent variable (y) and the line, treating the independent variable (x) as free of error, as shown below:
However, sample 2 is not dependent on sample 1 or vice versa. They are two samples taken (usually) at the same time and place for the analysis of precision, for example two quarter cores from the same interval or two samples from the same sample pile (RC cuttings). Neither is dependent on the other. Sometimes the terms primary sample and secondary sample, or original sample and check sample, are used. Neither pair of terms is appropriate. Sample 1 and sample 2 are duplicates or paired samples.
Reduced Major Axis (RMA)
RMA addresses this limitation of OLS by accounting for the errors in both variables: it minimises the sum of the areas of the right triangles whose legs are the horizontal and vertical deviations of each point from the line. The method is also called geometric mean regression (among other names):
Below I overlay RMA and OLS regression lines:
In general, RMA will perform better than OLS. However, as you can see from the above, it is still influenced by outliers. This is because the mean and standard deviation of the two sets of data feature in the RMA algorithm.
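For readers who want to see where the mean and standard deviation enter, here is a minimal sketch of the usual textbook form of RMA: the slope is the ratio of the two standard deviations (with the sign of the correlation) and the line passes through the two means. The function name `rma_fit` and the duplicate values are hypothetical, chosen only for illustration.

```python
import statistics


def rma_fit(x, y):
    """Return (slope, intercept) of the reduced major axis line."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    # The sign of the slope follows the sign of the covariance (correlation)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sign = 1.0 if cov >= 0 else -1.0
    slope = sign * statistics.stdev(y) / statistics.stdev(x)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical duplicate pairs with a little scatter about X=Y
sample_1 = [0.5, 1.1, 1.9, 3.2, 4.0, 5.1]
sample_2 = [0.6, 1.0, 2.1, 3.0, 4.2, 4.9]

slope_12, intercept_12 = rma_fit(sample_1, sample_2)  # sample 2 plotted against sample 1
slope_21, intercept_21 = rma_fit(sample_2, sample_1)  # axes swapped

print("sample 2 vs sample 1:", round(slope_12, 3), round(intercept_12, 3))
print("sample 1 vs sample 2:", round(slope_21, 3), round(intercept_21, 3))
print("product of slopes:   ", round(slope_12 * slope_21, 6))
```

Swapping the axes gives the reciprocal slope, so both fits describe the same line. That symmetry is what suits RMA to duplicates, where neither sample is the dependent one. Because only means and standard deviations are used, a single extreme pair still shifts the result.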
Robust Least Squares (RLS)
I mentioned earlier that the mean is a measure of central tendency. Another measure of central tendency is the median. The median is the middle value of an ordered set of values. In terms of robustness, half of the values would have to be outliers before the median could be moved without limit. The breakdown point for the median is therefore 50%. For example, consider a set of ordered values:
0.3, 0.4, 0.6, 1.2, 1.3, 1.6, 2.1, 2.1, 2.3
The median value is 1.3. If the last value is changed to 8.7 instead of 2.3, the median remains at 1.3. I could replace the last two values with 8.7 and 10.5 respectively and the median would remain unchanged at 1.3. The median is a robust statistic. It is highly resistant to the influence of outliers. Incorporating the median, instead of the mean, into the computation of a linear regression line allows us to build a line which is largely resistant to outliers, provided, of course, that the breakdown point is not reached. Below are the same charts used above, but this time with the RLS regression line included:
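The median example above can be checked in a few lines using Python's standard `statistics` module. The values are the ones quoted in the text; the variable names are mine.

```python
import statistics

values = [0.3, 0.4, 0.6, 1.2, 1.3, 1.6, 2.1, 2.1, 2.3]

# Replace the last two values with 8.7 and 10.5, as described in the text
with_outliers = values[:-2] + [8.7, 10.5]

median_before = statistics.median(values)
median_after = statistics.median(with_outliers)
mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outliers)

print("median:", median_before, "->", median_after)
print("mean:  ", round(mean_before, 3), "->", round(mean_after, 3))
```

The median stays at 1.3 in both cases, while the mean more than doubles.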
The RLS line exactly overlies the X=Y line. It is unaffected by outliers. Having said that, there will be occasions where RLS will fail, for example when the breakdown point is exceeded.
There are a number of different methods for calculating a regression line using the median (robust regression). The method used in these charts is called the ‘Method of Repeating Medians’. Unlike other methods ('Least Median of Squares' for example), the algorithm is simple, non-iterative and fast.
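As an illustration only, here is a sketch of what is commonly called repeated median regression. I am assuming this is the technique meant by 'Method of Repeating Medians'; the exact implementation behind the charts is not described here and may differ. For each point, take the median of the slopes to every other point; the line's slope is the median of those per-point medians. The intercept below is the median of y minus slope times x, which is one common choice. The name `repeated_median_fit` and the data are hypothetical.

```python
import statistics


def repeated_median_fit(x, y):
    """Return (slope, intercept) using a repeated median estimator (assumed variant)."""
    n = len(x)
    per_point_slopes = []
    for i in range(n):
        # Slopes from point i to every other point (skip pairs with identical x)
        slopes_i = [
            (y[j] - y[i]) / (x[j] - x[i])
            for j in range(n)
            if j != i and x[j] != x[i]
        ]
        if slopes_i:
            per_point_slopes.append(statistics.median(slopes_i))
    slope = statistics.median(per_point_slopes)
    intercept = statistics.median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept


# Hypothetical duplicates on X=Y, with two of the ten pairs corrupted
sample_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
sample_2 = [1.0, 2.0, 0.2, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]

rls_slope, rls_intercept = repeated_median_fit(sample_1, sample_2)
print("slope", rls_slope, "intercept", rls_intercept)
```

With two outliers out of ten pairs, the fitted line still coincides with X=Y. Written this plainly, the method compares every point with every other point, so the work grows with the square of the number of pairs.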
To finish, here are some charts with all three regression lines overlaid, for differing sample check stages:
In the scatter plot for the pulverising check stage, RLS is also affected slightly, but it still outperforms RMA and OLS for robustness. In the case of the sampling stage, it performed well even with sparse data.
The content in this article represents my own personal thoughts. Constructive feedback is always appreciated. Please share and/or 'like' this article if you found it interesting and feel it will be of benefit to others.
All charts produced using the acQuire 4 QAQC object from within GIM Suite. Data is real.
Become Part of the Collaboration
If you have R experience and would like to volunteer to bring cloud QAQC to the industry, click here.
--
4 years ago: Hi Paul. The overall idea is good. But if we use RMA, does it mean that our duplicate analysis results are good? Is it possible to have an example in an Excel sheet?
Senior Geologist - Consultant
5 years ago: Hi Paul, I have two sets of measurements which are expected to be similar, but each pair also has a weighting factor. Do you think we can calculate a weighted RLS? I would be happy if you could send me the workbook. [email protected] Thanks in advance
Exploration Geologist en Independent Consultant
6 years ago: Paul, thanks for this tool. Could you send it to my email? [email protected]
Exploration Consultant - Copper, Gold, Lithium, Anything
7 years ago: Will this work on a LiDAR dataset of 140 million records for instance? Spatial imaging is often a fast way to spot outliers, when we are talking a function of two variables scenario. Then experience can determine which outliers are not realistic data values - they can be removed according to spatial extents or magnitude. Most imaging systems do a top and bottom cut by default so as to optimise dynamic range for most of the data.
Accredited acQuire NOVA Network Partner | GIM Suite Specialist | QAQC Specialist | Power BI
7 years ago: I have an Excel workbook which displays two robust regression lines (Repeating Medians and Least Median of Squares) using VBA. Contact me directly if anyone is interested.