How to Perform Radius-Based Search using Spark and Haversine Formula for Large-Scale Geospatial Data

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & AWS Glue| Data Lake(Hudi | Iceberg) Specialist | YouTuber

Published: April 30, 2023

Geospatial data analysis has become increasingly important in fields such as transportation, urban planning, and the social sciences. One of its fundamental operations is radius-based search, which finds all points within a given distance of a location. In this blog, we will discuss how to perform radius-based search on large-scale geospatial data using Spark and the Haversine formula.

What is the Haversine Formula?

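The Haversine formula calculates the great-circle distance between two points on a sphere, such as the Earth, from their latitudes and longitudes. Because it takes the curvature of the Earth's surface into account, it is more accurate than a simple Euclidean distance computed on raw coordinates. The formula is as follows:

$$a = \sin^2\!\left(\frac{\Delta\text{lat}}{2}\right) + \cos(\text{lat}_1)\,\cos(\text{lat}_2)\,\sin^2\!\left(\frac{\Delta\text{long}}{2}\right)$$

$$c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)$$

$$d = R \cdot c$$

where lat1, long1, lat2, long2 are the latitude and longitude of the two points (in radians), Δlat = lat2 − lat1 and Δlong = long2 − long1 are the differences between the latitudes and longitudes, R is the mean radius of the Earth (6,371 km), and d is the resulting distance in the same unit as R.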
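The Haversine distance described above can be sketched in plain Python. This is a minimal illustration, not the article's original code: the function name `haversine_km` is an assumption, inputs are taken in degrees, and the result is in kilometres because R is given as 6,371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as quoted in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    a = min(1.0, a)  # guard against rounding pushing a slightly above 1
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c
```

With this radius, one degree of latitude comes out at roughly 111.2 km, which is a quick sanity check for any implementation.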

How to use Spark and the Haversine formula to perform a radius search?

Step 1: Define imports

Step 2: Create Spark Session

Step 3: Load the sample dataset, the Uber Fares Dataset, which has about 0.2M records
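A minimal sketch of Steps 1 to 3, assuming PySpark is installed and the dataset is available as a CSV file with a header row. The function names, the app name, and the file path in the usage note are illustrative assumptions, not the original code.

```python
def create_spark_session(app_name="radius-search"):
    """Create (or reuse) a SparkSession."""
    # Imported lazily so this module can be loaded on a machine without PySpark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_rides(spark, csv_path):
    """Read the rides CSV, treating the first row as a header and inferring column types."""
    return spark.read.csv(csv_path, header=True, inferSchema=True)
```

Typical usage would be `spark = create_spark_session()` followed by `df = load_rides(spark, "uber.csv")`, where `uber.csv` is a placeholder for wherever the dataset is stored.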

Radius Based Search

Suppose we want to analyze the number of rides taken near Continental Army Plaza. To do this, we perform a radius search that retrieves all rides within a 1-mile radius of the plaza.

Add columns for the distance to the search center and whether the ride is within the radius
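One way to add these two columns is to build the Haversine distance as a Spark SQL expression string and attach it with `F.expr`. The sketch below only builds the strings, so it runs without Spark. The column names `pickup_latitude` and `pickup_longitude`, the helper names, and the center coordinates are assumptions for illustration; replace the coordinates with the plaza's actual location.

```python
EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344
EARTH_RADIUS_MILES = EARTH_RADIUS_KM / KM_PER_MILE


def haversine_miles_sql(lat_col, lon_col, center_lat, center_lon):
    """Spark SQL expression: distance in miles from the center to the point in the given columns."""
    return (
        f"{EARTH_RADIUS_MILES!r} * 2 * asin(sqrt("
        f"pow(sin(radians({lat_col} - ({center_lat!r})) / 2), 2) + "
        f"cos(radians({center_lat!r})) * cos(radians({lat_col})) * "
        f"pow(sin(radians({lon_col} - ({center_lon!r})) / 2), 2)))"
    )


def within_radius_sql(distance_col, radius_miles):
    """Spark SQL boolean expression: true when the distance column is inside the radius."""
    return f"{distance_col} <= {radius_miles!r}"


# Placeholder center (approximate, assumed) and a 1-mile radius.
distance_expr = haversine_miles_sql("pickup_latitude", "pickup_longitude", 40.7105, -73.9605)
flag_expr = within_radius_sql("distance_miles", 1.0)
```

With a DataFrame `df` and `from pyspark.sql import functions as F`, the columns could then be added as `df.withColumn("distance_miles", F.expr(distance_expr)).withColumn("within_radius", F.expr(flag_expr))`. Keeping the computation in built-in SQL functions avoids a Python UDF, so the work stays inside Spark's own execution engine.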

Output of dataframe

Filter the Data within the search radius
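The filter keeps only the rides whose distance to the center is at most the radius. In Spark this is typically a single `df.filter(...)` on the flag column; the sketch below mirrors the same logic on plain Python dicts so it can be run, and used to cross-check a small sample, without a cluster. The sample rides, field names, and center coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 6371.0 / 1.609344  # 6,371 km expressed in miles


def distance_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def rides_within_radius(rides, center_lat, center_lon, radius_miles):
    """Return the rides whose pickup point lies within radius_miles of the center."""
    return [
        ride for ride in rides
        if distance_miles(center_lat, center_lon,
                          ride["pickup_latitude"], ride["pickup_longitude"]) <= radius_miles
    ]


# Hypothetical center and sample rides: "b" is about 0.69 miles away, "c" about 1.38 miles.
CENTER = (40.7105, -73.9605)
rides = [
    {"ride_id": "a", "pickup_latitude": 40.7105, "pickup_longitude": -73.9605},
    {"ride_id": "b", "pickup_latitude": 40.7205, "pickup_longitude": -73.9605},
    {"ride_id": "c", "pickup_latitude": 40.7305, "pickup_longitude": -73.9605},
]

nearby = rides_within_radius(rides, *CENTER, radius_miles=1.0)
print([ride["ride_id"] for ride in nearby])  # prints ['a', 'b']
```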

Output DF

Plot on Map

Result

Advantages of using Spark for geospatial data analysis

Using Spark for geospatial data analysis has several advantages:

Scalability: Spark can handle geospatial datasets that do not fit into the memory of a single machine. It partitions the data across multiple machines and processes the partitions in parallel, so a per-row computation such as the Haversine distance can scale out as machines are added.
Speed: Spark is designed for in-memory processing, which can significantly speed up geospatial analysis. With add-ons such as the RAPIDS Accelerator for Apache Spark, it can also use GPUs to speed up the analysis further.

Entire Code

https://soumilshah1995.blogspot.com/2023/04/how-to-perform-radius-based-search.html

Conclusion

In conclusion, performing radius-based searches on large-scale geospatial data can be a challenging task, but using Spark and the Haversine formula can greatly simplify the process. By leveraging Spark's distributed computing capabilities, we can efficiently process and analyze large volumes of data. The Haversine formula is a well-established and accurate method for calculating distances between two points on a sphere, making it a reliable choice for geospatial analysis.

While Spark and the Haversine formula work well for radius-based searches, other approaches are worth considering. For example, Geohash and Uber's H3 library are two popular options for geospatial indexing and querying. Ultimately, the choice of tool will depend on the specific needs of the project and the characteristics of the data being analyzed.
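To give a feel for the Geohash alternative mentioned above, here is a minimal sketch of the standard Geohash encoding in plain Python. The function name and default precision are assumptions; in practice a maintained library would normally be used instead.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (omits a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair in degrees as a Geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)
```

Points that are close together usually share a Geohash prefix, so a prefix match can serve as a cheap pre-filter before the exact Haversine check. Points on either side of a cell boundary can have different prefixes, so neighboring cells need to be checked as well.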

In summary, performing radius-based searches on large-scale geospatial data requires careful consideration of the available tools and methods. By using Spark and the Haversine formula, we can efficiently and accurately analyze large volumes of data to gain valuable insights and make data-driven decisions.
