Performance of Spatial Queries in QGIS, Part One
Luis Eduardo Perez Graterol
Specialist in Geographic Information Systems in commercial and open-source environments. Developer of plugins (QGIS) and script tools (ArcGIS). Integration of Artificial Intelligence models with geospatial data.
Introduction
The work of a GIS specialist is exciting and creative. Day to day, a specialist performs queries and spatial analysis, designs and runs models, and finally presents the results as maps, reports, charts and other products.
However, these tasks become more complicated when working with large volumes of data: a simple analysis can take several minutes, or even days. For this reason, the performance of processes in the chosen GIS software is of great importance.
This article evaluates different options to perform spatial queries in QGIS, working with its programming environment (console and editor).
The article is written to be understood by users with little or no programming experience, while remaining useful to advanced users and developers. To achieve this, complex topics are explained simply and code details are omitted; the code may be published later on GitHub and/or in another publication.
Due to the extensive and diverse nature of the topic, I plan to continue it in subsequent publications. Preliminary conclusions derived from the interpretation of the results are presented at the end of the article.
What is a spatial query?
The term query is widely used in databases to refer to a request for a specific set of data. This is a very simplistic definition, given that a query can be complex and require processing a large number of records, fields and related tables.
The same definition applies to spatial queries, except that they operate on a special type of data: spatial data, which has a location (geographical position) that can be absolute or relative. The most explicit way to represent a geographic element is the vector format, which is composed of geometries (points, lines, polygons and their multipart variants).
Spatial queries are a familiar theme and one of the great contributions of GIS, answering questions such as: How many houses are within 100 meters of a body of water, how many schools are in each neighborhood, what is the total population within 100 meters of a railroad?
To answer this type of query, the GIS software must perform complex geometric calculations that generally involve traversing the vertices of the geometric entities and comparing them with other entities. Consequently, query performance depends on the number of entities involved and the number of vertices they have.
The reason for this publication
The development of the Dashboards plugin brings several challenges. As an interactive visualization and query tool, it must be able to update multiple queries efficiently. If the update of the panels that make up the dashboard takes several minutes, it will not fulfill its purpose.
As a result, I have investigated and evaluated the alternatives available in QGIS for performing spatial queries through its programming environment.
What alternatives does QGIS have to perform spatial queries?
QGIS is a synergy of diverse geospatial technologies working in harmony and providing us with an application whose capability is greater than the sum of its parts.
Because of this amalgamation of libraries and developers, a distinctive feature of QGIS is that a problem can be solved, or a process built, in many ways. This will be a recurring phrase in my next publications.
The options also grow with each new version that adds functionality, and again when plugins are considered. We will test various methods available in the QGIS user interface and compare their performance with a series of algorithms.
All processes will be run from the QGIS programming environment, which allows us to compare their execution times. We will use the LTR version, QGIS 3.10, to evaluate the following alternatives:
1.- Processing Toolbox: the first option, since it provides a tool designed specifically for this purpose, the Selection by Location process (named Select by Location in the English interface).
2.- Expressions: integrated throughout QGIS, they allow us, among many other things, to perform spatial queries and analysis.
3.- Python Console: we will test several algorithms using the QGIS API and pre-installed libraries.
3.1. Sequential algorithm
3.2. Sequential algorithm with prior filter by bounding box
3.3. Sequential algorithm with spatial index
3.4. Implementing GEOS and the spatial index
Advanced users will know of other alternatives besides those considered here. I invite you to comment at the end of the article to share your experiences, so that I can include your suggestions in subsequent articles.
The key to excellent spatial process performance
Before proceeding to evaluate the processes, we must review a very important concept, which is the spatial index.
Databases have long indexed their tables for more efficient data management. The spatial index applies the same idea to spatial data: it indexes the bounding boxes of the geometries (not the geometries themselves) according to their location and extent (Erik Westra. 2015. Python Geospatial Analysis Essentials).
The spatial index uses a clustering algorithm to organize the entities in a data structure (usually a tree, such as an R-Tree or a B-Tree, among others) that groups together the entities closest to each other. The groups are defined by their bounding boxes.
When a spatial operation such as an intersection is performed, the index is first queried with the bounding box of the entity of interest, which quickly discards every entity that is not nearby. The exact geometric evaluation is then performed only on the remaining candidates, which avoids unnecessary computation and speeds up the process significantly.
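To make the idea concrete, here is a minimal sketch of a bounding-box index in plain Python. It is only an illustration under stated assumptions: it uses a uniform grid instead of the tree structures mentioned above, it does not use the QGIS API, and the class name GridIndex, the cell size and the sample boxes are all hypothetical.

```python
from collections import defaultdict

# A bounding box is a tuple (xmin, ymin, xmax, ymax).


def boxes_intersect(a, b):
    """Return True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Toy spatial index: stores feature ids in the grid cells their box covers."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # (col, row) -> set of feature ids
        self.boxes = {}                # feature id -> bounding box

    def _cells_for(self, box):
        # Yield every grid cell covered by the bounding box.
        c0, r0 = int(box[0] // self.cell_size), int(box[1] // self.cell_size)
        c1, r1 = int(box[2] // self.cell_size), int(box[3] // self.cell_size)
        for col in range(c0, c1 + 1):
            for row in range(r0, r1 + 1):
                yield col, row

    def insert(self, fid, box):
        self.boxes[fid] = box
        for cell in self._cells_for(box):
            self.cells[cell].add(fid)

    def intersects(self, box):
        """Return ids whose bounding box intersects `box`.

        Only the cells covered by `box` are visited, so features that are
        far away are never compared at all.
        """
        candidates = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        return sorted(f for f in candidates if boxes_intersect(self.boxes[f], box))


index = GridIndex(cell_size=10)
index.insert(1, (1, 1, 3, 3))
index.insert(2, (12, 12, 14, 14))
index.insert(3, (95, 95, 99, 99))

nearby = index.intersects((0, 0, 15, 15))
print(nearby)
```

The query returns only candidate ids. An exact geometric test (contains, intersects and so on) would still have to be run on those candidates, which is the two-step pattern that a spatial index makes possible.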
First test. Polygons contained in polygons
We will start with a spatial query of polygons contained in polygons. For this we will use two polygon layers:
- Unity layer: 3,555 entities
- Parcels layer: 319,352 entities
We will select fifty (50) polygons of the Unity layer and determine how many parcels they contain.
To know the execution time of each process we will use Python's time module, calculating the difference between the initial time and the time after the process.
import time

start_time = time.time()
# ... process code goes here ...
duration = time.time() - start_time
1.- Process selection by location:
Once the processing module has been imported, any available process can be executed through the QGIS API.
Without spatial index
Contained parcels: 3,598. Duration: 34.57313823699951 seconds (about 0.576 minutes)
With a previously created spatial index
Contained parcels: 3,598. Duration: 0.30198025703430176 seconds
The Selection by Location process, with spatial index, performs very well, taking less than one second. The creation of the spatial index may take a little more than a second but is performed only once.
The difference between running the process with and without a spatial index is very large: about 114 times faster (34.57 seconds versus 0.30 seconds)!
In conclusion, if you work with many entities it is advisable to use this process, generating the spatial index beforehand.
Now we will try to match or improve this excellent performance by trying other options.
2.- Expressions
Expressions bring together functions, variables and operators that are accessible from almost everywhere in the QGIS interface.
The selection of the right expression is very important to guarantee a better performance. In this case we will execute the following expression adapted to run in the console using the QgsExpression and QgsExpressionContext classes.
aggregate('parcels','count','id',filter:=contains( geometry(@parent), $geometry))
Contained parcels: 3,598. Duration: 480.52841544151306 seconds (about 8.008 minutes)
3.- Python Console
3.1- Sequential algorithm:
The first script takes each of the selected units and evaluates every parcel, counting how many are contained. To do this we use a for loop and apply the methods of the QgsGeometry class.
The animated image shows the procedure. It is an inefficient algorithm, since all the parcels must be analyzed for each selected unit, but it is useful as a baseline for comparing the other methods.
Contained parcels: 3,598. Duration: 563.859628200531 seconds (about 9.398 minutes)
3.2.- Sequential algorithm with prior filter by minimum bounding box
Determining whether one rectangle intersects another is a much simpler and faster operation than an exact geometric test.
We therefore modify the previous algorithm to incorporate this filter. First we check whether the bounding box of the selected Unity polygon intersects the bounding box of each polygon in the parcel layer; only if it does, do we evaluate whether the parcel is contained.
Contained parcels: 3,598. Duration: 391.0932114124298 seconds (about 6.518 minutes)
The result is very interesting: this small modification saves about 2.9 minutes over the previous algorithm. It also runs about 1.5 minutes faster than the expression, which suggests that expressions perform a sequential traversal and do not take advantage of the spatial index.
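The effect of the bounding box filter can be illustrated with a small, self-contained sketch in plain Python. It is not the script used for the timings in this article and does not use the QGIS API: the polygons are invented, the function names are hypothetical, and the containment test only checks vertices, a simplification that is adequate only when the containing polygon is convex.

```python
def bbox(poly):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_intersect(a, b):
    """Cheap test: four comparisons, regardless of the number of vertices."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(pt, poly):
    """Ray casting test; points exactly on the boundary are not handled specially."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contains(outer, inner):
    """Simplified containment: every vertex of `inner` is inside `outer`.

    Assumption: `outer` is convex. A real GIS predicate is more involved.
    """
    return all(point_in_polygon(p, outer) for p in inner)


def count_sequential(units, parcels):
    """Exact test for every (unit, parcel) pair."""
    count = exact_tests = 0
    for unit in units:
        for parcel in parcels:
            exact_tests += 1
            if contains(unit, parcel):
                count += 1
    return count, exact_tests


def count_with_bbox_filter(units, parcels):
    """Exact test only when the bounding boxes intersect."""
    parcel_boxes = [bbox(p) for p in parcels]  # computed once, reused per unit
    count = exact_tests = 0
    for unit in units:
        unit_box = bbox(unit)
        for parcel, parcel_box in zip(parcels, parcel_boxes):
            if not boxes_intersect(unit_box, parcel_box):
                continue  # rejected cheaply, no exact test needed
            exact_tests += 1
            if contains(unit, parcel):
                count += 1
    return count, exact_tests


def square(x, y, size):
    return [(x, y), (x + size, y), (x + size, y + size), (x, y + size)]


units = [square(0, 0, 10)]
parcels = [square(1, 1, 1), square(4, 4, 2), square(9, 9, 2),
           square(20, 20, 1), square(30, 0, 1)]

sequential = count_sequential(units, parcels)
filtered = count_with_bbox_filter(units, parcels)
print(sequential, filtered)
```

Both functions find the same two contained parcels, but the filtered version runs the exact test three times instead of five. The loop still visits every parcel, which is consistent with the modest gain measured above: the filter makes each comparison cheaper without reducing the number of comparisons.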
3.3.- Sequential algorithm with spatial index
Using the QGIS API we can generate a temporary spatial index with the QgsSpatialIndex class. The QGIS spatial index lets us determine more efficiently which parcels intersect the bounding box of the selected Unity layer polygons.
The spatial index is generated only once at runtime; however, building it takes time that must be taken into account. Creating the spatial index for the parcel layer took 7.563292741775513 seconds, while the query itself gave:
Contained parcels: 3,598. Duration: 5.558135271072388 seconds
Implementing the spatial index significantly improves the performance of this type of query, which now runs in only 5.56 seconds. However, it is still far from the performance of the Selection by Location process.
3.4.- Implementing GEOS and the spatial index
In addition to its rich API and spatial index, QGIS uses the GEOS library for advanced geometry operations such as geometry predicates and geometric properties (official QGIS documentation). GEOS is an OSGeo project.
GEOS (Geometry Engine - Open Source) is the C++ version of the JTS Topology Suite (JTS). It includes the OpenGIS Simple Features for SQL spatial predicate functions and spatial operators, as well as JTS-specific topological functions. GEOS is the most widely used geospatial C++ library, being used in "free" projects such as PostGIS, QGIS, GDAL/OGR, MapServer, and by proprietary products including FME (GEOS official documentation).
We use the same spatial index that we generated in the previous process. This time the process yields:
Contained parcels: 3,598. Duration: 0.4820523262023926 seconds
We obtain a substantial improvement in performance, only 0.482 seconds!
Comparing results
As expected, the sequential algorithm presents the lowest performance, followed by the expressions and then the sequential algorithm with the bounding box filter.
The best performance comes from the Selection by Location process available in the Processing Toolbox. However, by optimizing our own algorithms with the spatial index and GEOS we can achieve similar execution times.
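As a quick way to put the measurements side by side, the following sketch ranks the durations reported in this article and expresses each one as a speed-up over the plain sequential algorithm. The durations are copied from the results above; the dictionary labels and variable names are only illustrative.

```python
# Durations in seconds, as reported in the tests above.
durations = {
    "Selection by Location, no spatial index": 34.57313823699951,
    "Selection by Location, spatial index": 0.30198025703430176,
    "Expression": 480.52841544151306,
    "Sequential algorithm": 563.859628200531,
    "Sequential + bounding box filter": 391.0932114124298,
    "Sequential + spatial index": 5.558135271072388,
    "GEOS + spatial index": 0.4820523262023926,
}

# Use the slowest method, the plain sequential algorithm, as the baseline.
baseline = durations["Sequential algorithm"]

# Sort from fastest to slowest.
ranking = sorted(durations.items(), key=lambda item: item[1])

for name, seconds in ranking:
    speedup = baseline / seconds
    print(f"{name:<42} {seconds:>10.3f} s {speedup:>9.1f}x")
```

These figures cover the query only. The time needed to build the spatial index, such as the 7.56 seconds reported for the parcel layer in section 3.3, is not included.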
Discussion
This article is already quite long, and several aspects remain to be considered, among them:
- We have evaluated only one type of query; other geometries, relations and combinations are still pending.
- There are still options left to try in order to improve performance.
- When evaluating queries on large datasets it is also necessary to consider databases, especially PostgreSQL/PostGIS.
Nevertheless, the results obtained so far allow us to present preliminary conclusions.
Preliminary conclusions:
- The Selection by Location process, and probably other similar processes in the Processing Toolbox, are highly optimized.
- Algorithms executed in the console can achieve performance similar to that of the Processing Toolbox if they are optimized with the spatial index and GEOS.
- Using a spatial index is a fundamental requirement for good performance, both in Processing Toolbox processes and in the algorithms we run through the console.
- Expressions perform poorly when processing many entities and do not appear to use the spatial index.
- It is possible, and recommended, to improve on the performance of expressions by using the options available in the QGIS API, such as the spatial index (permanent or temporary) and the GEOS library.