Outlier management in precision agriculture with QGIS
Luis Eduardo Perez Graterol
Analyst and developer of commercial and open source GIS
Happy New Year! I have decided to resume publishing, with data science topics in the context of #GIS.
What motivated me to create this article/tutorial?
I recently evaluated a study in which the methodology I describe is recommended to achieve the desired and literature-reported results.
The study consisted of correlating crop yield values with NDVI values obtained from satellite images, in other words, correlating high resolution data measured in the field with comparatively low resolution data obtained from a remote sensor.
In this case, the NDVI value indicates the "general" crop conditions and, given the spatial resolution of the satellite image, the index value can be considered an average of the land covers found within the area of the pixel.
Therefore, when attempting to correlate the NDVI index with data from a yield monitor it is necessary to exclude outliers to remove the inherent errors and delays of this type of measurement, as well as to obtain an average yield representative of the conditions of the area covered by the pixel extent.
Although I cannot share the data from the study because it is proprietary, I adapted the methodology to a soil test data set, which you can download, so that the reader can replicate the entire process.
Introduction
This time we will talk about #outliers. We will start briefly with the theoretical foundations and then move on to the case studies.
Geospatial data is made up of two components, the spatial (geometry) and the alphanumeric (attribute). Sometimes we will analyze them separately or combined, for example:
- Attributes: using traditional statistical methods. We omit the spatial component. Example: central tendency statistics, variability, inference, etc.
- Geospatial: we apply geostatistical methods to evaluate the distribution and interrelationship between spatial entities. Example: nearest neighbors, cluster.
- Combined (geostatistics): in this case the analysis considers both the spatial distribution and the attribute value; the latter is generally used as a weight or to categorize. Example: hot spot analysis.
In previous articles (QGIS data science, Statistics in QGIS, GIS Zonal Statistics, evaluating QGIS interpolation, evaluating interpolated soil parameters among others) I have highlighted the importance of applying classical statistics in the exploratory analysis and validation of alphanumeric data, before proceeding with more complex hybrid procedures.
Outliers
The term "outlier" refers to an atypical, discordant, anomalous or contaminating observation. An outlier is an observation that is numerically distant from the rest of the data.
How should we handle outliers?
It depends on the subject we are analyzing. Generally such values are removed because they distort the behavior of the variable we are studying. For example, when we are building a model that relates numerical variables, an extreme value can affect the correlation and invalidate the model.
On other occasions, the outlier will be the value of greatest interest and therefore the object of further study, for example: travel times to access services, infractions or crimes, among others.
Causes and significance of outliers
- Measurement errors, due to human, method or equipment causes.
- Extraordinary events.
- Extreme valid values.
- Unknown causes.
These values can have a strong effect on the result of the analysis or model. For this reason it is important to validate them.
How to detect outliers in QGIS?
The most common method, because of its simplicity and results, is the Tukey test, which takes as a reference the difference between the first quartile (Q1) and the third quartile (Q3), known as the interquartile range. In a box plot, a value is considered an outlier when it lies more than 1.5 times that distance below Q1 or above Q3 (mild outlier), or more than 3 times that distance (extreme outlier).
The box-and-whisker plot is a graph that allows us to visualize the outliers considering these criteria. Source: Wikipedia.
Another option is to consider as outliers the values that lie more than three standard deviations above or below the mean. The image shows the relationship between the two criteria.
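As a rough illustration of the three standard deviations criterion outside QGIS, here is a minimal Python sketch. The function name and the sample values are made up for the example and are not part of the exercise data.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mu) > k * sigma]

# Hypothetical data: twenty identical readings plus one very large value.
large_sample = [10.0] * 20 + [100.0]
flagged_large = three_sigma_outliers(large_sample)

# Hypothetical data: the same kind of extreme value in a very small sample.
small_sample = [1.0, 2.0, 3.0, 4.0, 100.0]
flagged_small = three_sigma_outliers(small_sample)
```

In the small sample the extreme value inflates the standard deviation so much that nothing is flagged, which is one reason the quartile-based criterion is often preferred for small data sets.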
Equations
With Q1 and Q3 being the first and third quartiles, and IQR being the interquartile range (Q3 - Q1), a mild outlier is a value that meets either condition:
Value < Q1 - 1.5 * IQR
Value > Q3 + 1.5 * IQR
Q1 and Q3 thus determine the lower and upper limits, beyond which an observation is considered a mild outlier. Replacing 1.5 with 3 gives the limits for extreme outliers.
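To make the limits concrete, the following Python sketch computes the lower limit Q1 - k * IQR and the upper limit Q3 + k * IQR for k = 1.5 (mild) and k = 3 (extreme). The sample values are invented, and the quartile convention used here (Python's "inclusive" method) is an assumption: tools differ in how they compute quartiles, so the numbers may not match QGIS exactly.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical measurements, not the exercise data.
sample = [2.9, 3.1, 3.3, 3.4, 3.5, 3.6, 3.8, 1.9, 0.4]

mild_low, mild_high = tukey_fences(sample, k=1.5)
extreme_low, extreme_high = tukey_fences(sample, k=3.0)

# A value is an outlier when it falls outside the corresponding limits.
mild = [v for v in sample if v < mild_low or v > mild_high]
extreme = [v for v in sample if v < extreme_low or v > extreme_high]

print("mild limits:", mild_low, mild_high, "->", mild)
print("extreme limits:", extreme_low, extreme_high, "->", extreme)
```

Every extreme outlier is also a mild outlier, since the extreme limits are wider than the mild ones.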
Case studies
After this introduction, let's move on to the practical development.
Our case studies focus on precision agriculture, but what we have seen is valid in any field (social, economic, engineering, commercial, etc.).
We will start with a data set that I have used in previous articles published on acolita.com (the popular blog administered by Franz Leonardo): a series of points with the results of soil (edaphological) analyses, provided as exercise data.
Visualization of outliers in QGIS
There are several options to visualize (calculate and plot) the outliers present in a numeric field in QGIS. I describe three alternatives: the first two are straightforward; the third is more complete but requires further study:
- Using the Statistics panel, you can calculate, for any numeric field, the limits beyond which you will consider values to be outliers. More details are in the article Statistics in QGIS.
- Using the Data Plotly plug-in you can generate a box-and-whisker plot for each numeric field (I have not yet added this kind of plots to the Dashboard plug-in).
- Using expressions, a bit of JavaScript and the Plotly graphics library, you can make the same graph and interact with it (content of the Dashboard Course in QGIS View and Forms).
Calculating statistics of a field excluding outliers
One of the objectives of determining the presence of outliers is to exclude them from calculations, especially from the determination of statistics, when we want a representative value of a parameter such as the average (mean). For example, if we want to estimate the average sales of a set of stores in a franchise, we can exclude extremely high or low values.
Determining statistics of a field with expressions
For this we can use aggregate expressions, which we have studied in detail in the articles aggregate functions part I, aggregate functions part II.
To calculate the mean of the Organic Matter parameter, we load the points layer in QGIS, open the attribute table and write the expression:
-- Expression
mean("ORGANIC_M")
As can be seen in the image, at the bottom, in the preview the result is shown, which coincides with that of the Statistics Panel, mean = 3.461891891891891893.
Calculate the statistic excluding outliers
To do so, we will extend the previous expression by introducing conditionals and combining them with the aggregate functions q1 (first quartile), q3 (third quartile), iqr (interquartile range).
-- Expression
mean("ORGANIC_M",
  filter:= "ORGANIC_M" > q1("ORGANIC_M") - 1.5 * iqr("ORGANIC_M")
       and "ORGANIC_M" < q3("ORGANIC_M") + 1.5 * iqr("ORGANIC_M")
)
We obtain a slightly higher value when outliers are excluded: mean = 3.4945205479452066.
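The same idea can be sketched outside QGIS in a few lines of Python. The values below are invented for illustration (they are not the ORGANIC_M field), and the "inclusive" quartile method is an assumption, so results will not necessarily match the QGIS aggregates.

```python
from statistics import mean, quantiles

def mean_without_outliers(values, k=1.5):
    """Mean of the values lying strictly inside the Tukey limits."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Same strict comparisons (> and <) as the filter in the QGIS expression.
    kept = [v for v in values if lower < v < upper]
    return mean(kept)

# Hypothetical organic matter readings with one very low value.
organic_matter = [3.2, 3.4, 3.5, 3.6, 3.7, 3.9, 0.5]

full_mean = mean(organic_matter)
filtered_mean = mean_without_outliers(organic_matter)

print("mean with all values:", full_mean)
print("mean excluding outliers:", filtered_mean)
```

As in the article, removing a single very low value raises the mean.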
Identifying outliers
By adapting the previous expression we can add an attribute that identifies the outliers, so that we can then observe their location. To do this we run the following expression in a text field:
-- Expression
Case
  When "ORGANIC_M" < q1("ORGANIC_M") - 1.5 * iqr("ORGANIC_M")
    or "ORGANIC_M" > q3("ORGANIC_M") + 1.5 * iqr("ORGANIC_M")
  Then 'Atypical'
  Else 'Normal'
End
After running the expression, each point is identified as "Normal" or "Atypical", and we can apply a categorized style for the organic matter variable. We can see that only one of the sampling points has a value that we can consider atypical (highlighted in red in the image), because it is very low.
However, this is only the beginning of our analysis; by replicating the procedure for the other parameters we can see if there is a correspondence that allows us to elucidate the causes or infer that it was a measurement error.
Considering spatial distribution
So far we have performed analyses without considering spatial location; we have treated all our values as independent. By identifying outliers against the whole data set we ignore spatial autocorrelation.
How does spatial autocorrelation affect the analysis?
It is to be expected that nearby points present similar values (First Law of Geography), therefore, we should determine if a value is an outlier by considering its neighbors.
How to consider spatial autocorrelation?
There are several ways to approach the problem:
- Using geoprocesses and expressions: we create a buffer (area of influence) of a given radius for each point, then we apply the expression that will identify if the central point is an outlier considering its neighbors. Then we determine the mean of the data set excluding the outliers.
- Using a grid and expressions: in this case we create a grid that covers the extent of the layer, then we calculate the mean for each cell of the grid excluding outliers.
- Try to write and/or develop an expression that performs the process.
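For readers who want to experiment with the first alternative outside QGIS, here is a minimal Python sketch of the buffer idea: each point is compared against the Tukey limits of the points found within a given radius (the point itself included). The function name, the coordinates, the radius and the minimum number of neighbors are all assumptions made for this example.

```python
from math import hypot
from statistics import quantiles

def local_outliers(points, radius, k=1.5, min_neighbors=4):
    """points: list of (x, y, value). Returns one True/False flag per point."""
    flags = []
    for x, y, value in points:
        # Values of all points inside the circular buffer (the point itself included).
        nearby = [v for (nx, ny, v) in points if hypot(nx - x, ny - y) <= radius]
        if len(nearby) < min_neighbors:
            flags.append(False)  # too few neighbors to estimate quartiles
            continue
        q1, _median, q3 = quantiles(nearby, n=4, method="inclusive")
        iqr = q3 - q1
        flags.append(value < q1 - k * iqr or value > q3 + k * iqr)
    return flags

# Hypothetical points: a low-valued cluster with one high reading,
# and a distant cluster where similar high readings are normal.
points = [
    (0, 0, 3.0), (10, 0, 3.1), (0, 10, 3.2), (10, 10, 3.3), (5, 5, 5.0),
    (1000, 0, 4.9), (1010, 0, 5.0), (1000, 10, 5.1), (1010, 10, 5.2), (1005, 5, 5.0),
]
flags = local_outliers(points, radius=50)
```

The value 5.0 is flagged only in the first cluster, where its neighbors are around 3; in the second cluster the same value is ordinary.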
We will apply the second method because of its simplicity, I invite you to try other alternatives. Below I describe the process step by step:
1.- We create a grid that covers the extent of the points layer. For this we use the Create Grid geoprocess, found under Research Tools in the Vector menu.
2.- In the dialog we set the Grid type to Rectangle and, for the extent, select the points layer. We then define the horizontal and vertical spacing according to the distribution of the points: in this case, 120 meters of horizontal spacing and 155 meters of vertical spacing. We keep the overlap at 0.
3.- Then we assign a name to each cell of the grid. To do so, we open the attribute table of the Grid layer and apply the following expression in the Field Calculator, in a new field that we will call "NAME":
'LOTE ' || "id"
With this configuration we obtain eight (08) cells.
4.- To avoid performing complex spatial operations, we label each point with the name of the cell that contains it. To do this, we open the attribute table of the points layer and run the following expression in the Field Calculator, in a new field that we will call "LOCATION":
-- Expression
array_to_string( overlay_within('Grid', "NAME") )
5.- Now that we have identified the points according to the cell in which they are contained, we can evaluate the presence of outliers in each cell. The following cases may occur:
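Conceptually, assigning a point to a cell of a regular rectangular grid is simple arithmetic. The Python sketch below illustrates it with the 120 m by 155 m spacing used above. The origin coordinates and the "R{row}C{col}" labels are hypothetical and do not reproduce the "id" numbering that QGIS assigns to the grid cells.

```python
import math

def cell_label(x, y, x_min, y_max, h_spacing=120.0, v_spacing=155.0):
    """Label of the grid cell containing (x, y), counting from the top-left corner."""
    col = math.floor((x - x_min) / h_spacing)
    row = math.floor((y_max - y) / v_spacing)
    return f"R{row}C{col}"

# Hypothetical grid origin (top-left corner) and two sample points, in meters.
X_MIN, Y_MAX = 0.0, 1000.0
first = cell_label(130.0, 990.0, X_MIN, Y_MAX)
second = cell_label(10.0, 800.0, X_MIN, Y_MAX)
print(first, second)
```

Points that share a label belong to the same cell, which is what the "LOCATION" field records in the QGIS workflow.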
- No outliers are detected, in this case we can work with the average Organic Matter considering all the data.
- The outlier obtained in the previous case is confirmed, we assume as average the one calculated above.
- More outliers are added, in this case we recalculate the average by omitting them.
To identify the outliers in the layer of points contained in each cell, we execute the following expression, in a new field that we will call "OUTLIER":
with_variable('ubi', attribute($currentfeature, 'LOCATION'),
  Case
    When "ORGANIC_M" < q1("ORGANIC_M", filter:= "LOCATION" = @ubi) - 1.5 * iqr("ORGANIC_M", filter:= "LOCATION" = @ubi)
      or "ORGANIC_M" > q3("ORGANIC_M", filter:= "LOCATION" = @ubi) + 1.5 * iqr("ORGANIC_M", filter:= "LOCATION" = @ubi)
    Then 'Atypical'
    Else 'Normal'
  End
)
The expression calculates the quartiles and interquartile range only for those points whose "LOCATION" is equal to that of the entity ($currentfeature) being evaluated.
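The per-cell logic can be approximated outside QGIS with the following Python sketch, which groups records by their location label, computes the Tukey limits for each group, and labels each record against the limits of its own group. The record layout, the cell names and the values are invented for the example, and the "inclusive" quartile method is an assumption.

```python
from collections import defaultdict
from statistics import quantiles

def label_outliers_by_group(records, value_key, group_key, k=1.5):
    """Return 'Atypical' or 'Normal' for each record, judged within its own group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])

    limits = {}
    for name, vals in groups.items():
        if len(vals) < 2:
            limits[name] = None  # quartiles need at least two values
            continue
        q1, _median, q3 = quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        limits[name] = (q1 - k * iqr, q3 + k * iqr)

    labels = []
    for rec in records:
        lim = limits[rec[group_key]]
        value = rec[value_key]
        is_outlier = lim is not None and (value < lim[0] or value > lim[1])
        labels.append("Atypical" if is_outlier else "Normal")
    return labels

# Hypothetical records: 5.0 is unusual in "LOTE 1" but ordinary in "LOTE 2".
records = (
    [{"LOCATION": "LOTE 1", "ORGANIC_M": v} for v in (3.0, 3.1, 3.2, 3.3, 5.0)]
    + [{"LOCATION": "LOTE 2", "ORGANIC_M": v} for v in (4.8, 5.0, 5.1, 5.2, 5.3)]
)
labels = label_outliers_by_group(records, "ORGANIC_M", "LOCATION")
```

Computing the limits once per group, rather than once per record, is a design choice made here for efficiency; the labels are the same as those from evaluating the filtered aggregates record by record.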
6.- We apply a categorized symbology using the newly created field ("OUTLIER") to verify whether new outliers have been identified and where they are located. We can see that a new outlier appears, in this case a value higher than expected for its group.
Thus we confirm that the statistics by group differ from those obtained considering all values.
7.- Finally we calculate the average of Organic Matter excluding these values, in this case I will do it with the statistics panel deselecting the values labeled as outliers, obtaining a value of 3.47431.
Final notes
Going into the details of the expressions is beyond the scope of this tutorial which is already quite extensive. But in the next QGIS Dashboards in View and Forms Course that I am organizing we will delve into details and develop cases like this one.
Leave your comments, impressions, ideas and recommendations for future articles.