Using Python and SQL to forward geocode and find distance between two points

Using Python and SQL to forward geocode and find distance between two points

In my current role as a Strategic Marketing Manager, I am relied upon to conduct analysis on the company's sales territories and provide guidance around the optimal sales rep assignments, based on factors like distance from the sales rep to the customer. In the past I have used online tools to process a large number of addresses and return the latitude and longitude, which in some cases have a cost associated with them, and -- more often that not -- are pretty inaccurate. I knew there had to be a better way.

Enter Python and SQL. After scouring the Internet for days and trying different libraries and methodologies that failed to work, I stumbled upon a company named "OpenCage", based out of Germany. Essentially, these folks have a lot of geocoding experience and were the first to commercially implement OpenStreetMap in 2006. You can read more about the company here.

Here's how it works:

Before we can determine distance from our customers to their respective sales reps, we have to get the latitude and longitude for each address using Python.

Step 1 -- You need to sign up for an OpenCage API key. This is very easy to do!

Step 2 -- Open your preferred Python IDE (I like Jupyter Notebooks) and import the appropriate libraries and apply your key.

No alt text provided for this image

Step 2 -- Import the file containing the addresses you need to geocode using Pandas. Tip: OpenCage can only handle 2,500 addresses over a 24 hour period. If you have more than that, you will need to split them up into multiple files. As a best practice, I always check the first few lines of the file using the head() method to make sure everything was imported correctly.

No alt text provided for this image

Step 3 -- Now that you have your data imported and a dataframe containing your list of addresses, you need to convert them to a list.

No alt text provided for this image

Step 4 -- Once your address column has been converted to a list, you are ready to get started with the actual address processing. To do this, we are going to utilize a "for loop" to "iterate" over the entire list of addresses and, then, "indexing" to grab the actual latitude and longitude data points. Each point will be appended to their own list and added as a new column to our original dataframe. Go grab a coffee or walk around and stretch your legs. This may take a while ?.

No alt text provided for this image

Step 5 -- After processing is complete, check the "head" of the original dataframe. You should see the latitude and longitude out to the right of the corresponding address. Note: I always spot check a few of these using a tool like LatLong.net.

No alt text provided for this image

Step 6 -- Export the results using Pandas "to_csv" method. You will need them later to calculate distances in SQL.

No alt text provided for this image

If you've made it to this point, hopefully you've gotten all of your addresses geocoded and have the latitude and longitude data needed to measure the distance between these points. For the next phase of the project, I use SQL Server's "Geography" and "STDistance" function(s). Note: the method I've used returns the distance in meters, so to convert to miles, I divide by 1,609.344. If you don't have SQL installed, you can get a free version of SQLExpress here.

Step 1 -- Import your file into SQL by right clicking on the database you want to use and choosing "Tasks", then "Import Flat File...". If you did it correctly, you should see something resembling the image here. Click "Next >" and follow the prompts.

No alt text provided for this image

Step 2 -- Right click on the database and choose "refresh". Click on "Tables" and make sure your new table is there. You can also do a simple query to check the first few rows to inspect the contents.

No alt text provided for this image
No alt text provided for this image

I'm interested in knowing how far our Missouri rep will have to travel, so I pull the MO data and hard-code the rep's lat / long data into my query. I've blacked out the actual lat and long, as I'm not sure our rep would appreciate people knowing where he lives ??.

No alt text provided for this image

Step 3 -- Create a new field called "mo_distance" and apply the functions mentioned earlier.

No alt text provided for this image

You should now see a new column with the distances ??.

I didn't include the names of our customers, because I don't want our competitors to have that data ??, so my next step would involve copying / pasting results to Excel and then bumping them up against my original customer data, including their name and full address using a v-lookup. The results will be used by the rep to plan / prioritize visits to his customers and the lat / long data makes it easier to map out our customers, utilizing something like Tableau, for example. Important Note: lat / long data is not always very precise, depending on the source. This will affect distance. Be sure to spot check a sample of your data to see how far off you are to appropriately set expectations.

I hope this was as helpful to you as it was to me. My goal is to save you the headaches I had to go through to find the right solution.





Jay Ozer

Driving Data Insights & Conversational AI Solutions

4 年

Good post, thank you. The upper limit for the opencage API call is 2500 per day.

Eusebio Castaneda

Data Analytics | Business Intelligence

5 年

Excellent! I’ve recently implemented a similar process and the only suggestion I think I could make is to leverage python to do the manual loading for you via a sq connection like pymssql if using sql server. This way you don’t need to navigate back and forth between programs and simplifies the time doing manual work. Thanks for writing this article!

Vignesh Jeyaraman

Senior Software Engineer| AWS Certified Cloud Practitioner

5 年

Awesome :D

Omar EL HALLABI

Digital IT Engineer/ Agile Project Management/ Supply chain /Optimization/ E-commerce/ Last-Mile Delivery/ Order Management/ SalesForce/ Odoo/ Django

5 年
Sérgio Mauricio Neves

Engenheiro Eletricista | Mercado Livre de Energia | Energias Renováveis | Eletromobilidade | Front Office | Yellow Belt

5 年

Great work!

要查看或添加评论,请登录

Jennifer Cooper, MBA的更多文章

社区洞察

其他会员也浏览了