Foursquare Places
Discovering real-world places and keeping their information up to date is a challenging problem because of their ever-changing nature. In this post, we will explore how Foursquare builds a robust view of places using a combination of human and machine intelligence.
There are millions of public places and points-of-interest around the world: coffee shops, store locations, airports, bus stops, and much more. Building and maintaining a dataset that tracks all the places in the world is extremely difficult and time-consuming, so organizations that need this data find it far more efficient to rely on a trusted data provider than to assemble the dataset themselves. Foursquare has spent over a decade building the world's largest dataset of over 200 million points-of-interest (POI) around the world. Foursquare Places has 100% coverage of the top 100 retailers and quick-service restaurant (QSR) chains in the US and around the world. Each record includes information such as the address, city/state/zip/country, latitude/longitude, and other rich, relevant attributes that provide full geospatial context for each place.
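To make the shape of such a record concrete, here is a minimal sketch of how a single POI record might be modeled. The field names and example values are hypothetical and simplified; this is not Foursquare's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Place:
    """Hypothetical shape of a single point-of-interest record."""
    place_id: str                       # stable identifier for the POI
    name: str                           # e.g. "Example Coffee"
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    category: Optional[str] = None      # e.g. "Coffee Shop"
    attributes: dict = field(default_factory=dict)  # rich attributes (hours, phone, website, ...)

cafe = Place(
    place_id="poi-001",
    name="Example Coffee",
    address="123 Main St",
    city="Brooklyn", state="NY", postal_code="11201", country="US",
    latitude=40.6892, longitude=-73.9902,
    category="Coffee Shop",
)
```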
Foursquare Places data is used by companies in a wide range of vertical industries across Retail, CPG, Real Estate, Financial Services, Tech, and more. Uber, for example, uses Foursquare Places to power search in its app, helping riders find and get rides to specific locations on a map. Coca-Cola ingests Foursquare Places data into its data warehouse to inform its sales team about new store openings, helping grow its business by adding more stores where Coca-Cola products are sold. Each month, Foursquare adds and updates hundreds of thousands of data points to keep the data correct and consistent with the real world. The physical world is constantly changing — stores open, move, or close — so it's important that the data represents the truth of the world around us.
Here, we'll explore the engineering processes that Foursquare uses to maintain a global dataset of 200M+ POIs and ensure that customers can trust our data.
Overview
Foursquare builds its Places dataset in several stages: a) ingestion, which aggregates data points from various sources including web-crawls, trusted partners, and Foursquare app users; b) resolution, which combines all the inputs from various sources to create a self-consistent record of each place; c) summarization, which applies heuristics and models to build an authoritative view of the place; d) calibration, which scores each record on how much it deviates from its real-world representation along various dimensions such as the reality score, status (open or closed for business), and the accuracy of its attributes; and e) filtration, which removes the records that do not meet our bar of quality. We deliver finalized datasets to our customers as a flat file and through our APIs.
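As a rough mental model of how these stages chain together, here is a heavily simplified sketch in Python. The function names and signatures are hypothetical and the stage bodies are stubbed out; it illustrates the flow, not the production pipeline.

```python
# Hypothetical, simplified sketch of the staged Places pipeline.
# Each stage is a stub; real implementations are far more involved.

def ingest(raw_sources):
    """a) Aggregate source inputs from web-crawls, partners, and app users."""
    return [record for source in raw_sources for record in source]

def resolve(source_inputs):
    """b) Map each input to an existing place or a newly created one."""
    return source_inputs

def summarize(resolved):
    """c) Choose the best value for each attribute of each place."""
    return resolved

def calibrate(summarized):
    """d) Score each record: reality, open/closed status, attribute accuracy."""
    return summarized

def filter_records(scored):
    """e) Drop records that do not meet the quality bar."""
    return scored

def build_places_dataset(raw_sources):
    """Run the five stages end to end and return the releasable dataset."""
    return filter_records(calibrate(summarize(resolve(ingest(raw_sources)))))
```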
Ingestion
Foursquare ingests data from four kinds of sources: a) various websites containing data on points-of-interest that our web-crawlers extract, b) listing syndicators, who work with Foursquare to help businesses improve their online presence, c) trusted partners who specialize in maintaining data on specific categories of places (such as gas stations or restaurants) in specific regions, and d) the active users of Foursquare mobile apps.
Our web-crawler searches thousands of websites to gather all publicly available data about places and stores the data in an internal cache. Then, we apply a set of validation and formatting rules to standardize the representations of attributes in various regions. For example, our rules use text patterns to identify postal codes in Australia, validation lists to confirm townships in Taiwan, and so on. We also periodically receive batches of updates from our trusted data partners and listing syndicators. And finally, the active users of our Foursquare Swarm and City Guide mobile applications provide us with information about new places and updates to existing places.
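In the spirit of the examples above, a per-region rule set might pair a text pattern for Australian postal codes with a lookup list for Taiwanese townships. The sketch below is a hypothetical illustration; the township list is truncated and the function names are made up for this example.

```python
import re

# Australian postcodes are four digits, so a simple text pattern works.
AU_POSTCODE = re.compile(r"^\d{4}$")

# A truncated, illustrative validation list of Taiwanese townships/districts.
TW_TOWNSHIPS = {"Banqiao District", "Xinyi District", "Zhongzheng District"}

def is_valid_postal_code(country: str, postal_code: str) -> bool:
    """Check a postal code against the rule for its country, if one exists."""
    if country == "AU":
        return bool(AU_POSTCODE.match(postal_code))
    return True  # no specific rule in this sketch; accept as-is

def is_valid_township(country: str, township: str) -> bool:
    """Confirm a township against a known list, if one exists for the country."""
    if country == "TW":
        return township in TW_TOWNSHIPS
    return True

print(is_valid_postal_code("AU", "2000"))          # True
print(is_valid_township("TW", "Banqiao District"))  # True
```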
At the end of this ingestion stage, we have a list of canonical inputs corresponding to every place, organized by source (a specific website, a specific user, or a specific data contributor). In the rest of the document, we will refer to these as source inputs.
Resolution
In the resolution stage, we map each source input either to an existing record of a place in our database or to a new place that doesn't yet exist in our database. To do so, we run the source inputs through a multi-step entity resolution process.
At the end of this resolution stage, we have all the source inputs mapped to either an existing place in our database or a newly created place.
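To give a flavor of what one step of such a matcher can look like, here is a generic, hypothetical sketch that scores a source input against candidate places using name similarity and geographic proximity. The scoring weights, distance cutoff, and threshold are illustrative assumptions, not Foursquare's actual resolution logic.

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def match_score(source_input, place):
    """Blend name similarity and proximity into one score in [0, 1]."""
    name_sim = SequenceMatcher(None, source_input["name"].lower(), place["name"].lower()).ratio()
    dist = haversine_m(source_input["lat"], source_input["lng"], place["lat"], place["lng"])
    proximity = max(0.0, 1.0 - dist / 250.0)  # anything beyond ~250 m contributes nothing
    return 0.6 * name_sim + 0.4 * proximity

def resolve_input(source_input, existing_places, threshold=0.75):
    """Return the id of the best-matching place, or None to create a new place."""
    best = max(existing_places, key=lambda p: match_score(source_input, p), default=None)
    if best is not None and match_score(source_input, best) >= threshold:
        return best["place_id"]
    return None
```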
Summarization
Once we've ingested and resolved the groups of inputs related to a place, we use a process called "Summarization" to determine the best attribute values for each place. We run the summarization process separately for each attribute, and the strategies we use for summarization differ from attribute to attribute.
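As a hypothetical illustration of what per-attribute summarization can look like, the sketch below shows two generic strategies: a trust-weighted vote for values such as the category, and a most-recent-value rule for values such as the phone number. The source names, trust weights, and example data are assumptions made for this example, not Foursquare's actual strategies.

```python
from collections import defaultdict

# Illustrative trust weights per source type (made up for this sketch).
SOURCE_TRUST = {"partner_feed": 1.0, "web_crawl": 0.6, "app_user": 0.4}

def weighted_vote(candidates):
    """candidates: (value, source) pairs -> the value with the highest total trust."""
    totals = defaultdict(float)
    for value, source in candidates:
        totals[value] += SOURCE_TRUST.get(source, 0.1)
    return max(totals, key=totals.get)

def most_recent(candidates):
    """candidates: (value, source, timestamp) triples -> the newest value."""
    return max(candidates, key=lambda c: c[2])[0]

category = weighted_vote([("Coffee Shop", "partner_feed"), ("Cafe", "app_user")])
phone = most_recent([("555-0100", "web_crawl", 1690000000), ("555-0199", "partner_feed", 1700000000)])
print(category, phone)  # Coffee Shop 555-0199
```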
Calibration
One of the key challenges of maintaining a reliable location dataset is measuring the quality of every place in the dataset rather than a random or a curated sample. Foursquare has developed several models that help us calibrate how closely each place in our dataset conforms to its real-world reference. We use the following scores to continuously identify gaps in our data and improve the quality in a measurable way:
The above metrics provide a window into the quality of our dataset across different regions, states, and localities, along each of these dimensions. We use Foursquare Studio to visually analyze and identify geographies & categories where we have low-quality POIs – see visuals below. We then use a multitude of techniques, such as pruning bad sources, identifying new sources that specialize in a specific category & geography, and leveraging the network of Foursquare app users, to make quality improvements in a measurable way. We will go into the details of our Quality Framework and the process we use to drive continuous improvements to our Places data in a separate blog post.
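To sketch how such quality metrics can be rolled up to surface weak spots by geography and category, here is a small, hypothetical example. The score name, bucket keys, and threshold are illustrative; they are not Foursquare's actual metrics or cutoffs.

```python
from collections import defaultdict
from statistics import mean

# Toy per-place quality scores (illustrative values only).
places = [
    {"country": "US", "category": "Coffee Shop", "reality_score": 0.95},
    {"country": "US", "category": "Coffee Shop", "reality_score": 0.91},
    {"country": "US", "category": "Gas Station", "reality_score": 0.58},
]

# Roll scores up by (country, category) to find low-quality buckets.
buckets = defaultdict(list)
for place in places:
    buckets[(place["country"], place["category"])].append(place["reality_score"])

LOW_QUALITY_THRESHOLD = 0.7  # illustrative cutoff
for bucket, scores in buckets.items():
    avg = mean(scores)
    if avg < LOW_QUALITY_THRESHOLD:
        print(f"Needs attention: {bucket} (avg reality score {avg:.2f})")
```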
Filtration
After all the place records in our dataset are scored, we perform various checks on each record to determine its eligibility for inclusion in the final dataset delivered to our customers as a flat file or through our APIs. In this step, we verify that a) the key attributes of a place record are populated, b) the reality score of a place and the accuracy of key attributes cross a certain threshold, and c) there is at least one credible source contributing to each of the attribute values. We also perform some additional semantic checks to make sure two attribute values on the same record do not conflict with each other, for instance, a zip code not matching the city. The records passing these verification checks are released to our customers.
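A minimal sketch of such an eligibility check might look like the following. The required fields, source names, score threshold, and lookup table are all hypothetical and exist only to illustrate the shape of the checks described above.

```python
REQUIRED_FIELDS = ("name", "latitude", "longitude", "country")
CREDIBLE_SOURCES = {"partner_feed", "web_crawl"}  # illustrative source names
REALITY_THRESHOLD = 0.7                            # illustrative cutoff

def is_releasable(record: dict, zip_to_city: dict) -> bool:
    """Decide whether a scored place record is eligible for release."""
    # a) key attributes must be populated
    if any(not record.get(f) for f in REQUIRED_FIELDS):
        return False
    # b) quality scores must cross the threshold
    if record.get("reality_score", 0.0) < REALITY_THRESHOLD:
        return False
    # c) at least one credible source must back the record
    if not CREDIBLE_SOURCES & set(record.get("sources", [])):
        return False
    # semantic check: the zip code should agree with the city
    expected_city = zip_to_city.get(record.get("postal_code"))
    if expected_city and expected_city != record.get("city"):
        return False
    return True
```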
Release
Another interesting challenge in maintaining the integrity of our dataset is keeping track of any problems that may occur between releases. In this section, we will walk through some of the guardrails we set up to prevent regressions in our datasets.
Two events trigger a change to the Places dataset: a) a new batch of data becomes available from a specific source or set of sources, or b) a new version of the summarization algorithms or the calibration models becomes available. Each of these changes is tested in a staging pipeline, where all the above steps are run on a specific snapshot of the production data, and a report is generated to understand the delta between the versions of the places before and after the code or data changes are applied.
The report checks for a variety of dataset changes, such as changes to attributes, changes in the distribution of the key quality scores described in the section above, and other custom programmatic checks. The report calculates these metrics and compares them against predetermined thresholds to flag any suspicious changes for additional review. Additionally, any changes to our golden dataset of popular & curated places are also flagged. We then determine whether these changes were expected (for example, as a response to a code change we made or data we ingested), or whether they represent a potential problem that needs further investigation. After reviewing potential regressions, Foursquare's engineers can approve the QA report, and those data changes will be merged into the production dataset.
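As a hypothetical sketch of such a threshold-based comparison, the snippet below flags metrics whose relative change between the production snapshot and the staging run exceeds a limit. The metric names, limits, and example values are assumptions for illustration, not Foursquare's actual QA configuration.

```python
# Illustrative per-metric limits on relative change between releases.
THRESHOLDS = {
    "mean_reality_score": 0.02,  # flag if the mean shifts by more than 2% relative
    "open_rate": 0.03,           # flag if the share of open places shifts by more than 3%
    "record_count": 0.05,        # flag if the dataset grows or shrinks by more than 5%
}

def regression_report(prod_metrics: dict, staging_metrics: dict) -> dict:
    """Return the metrics whose relative change exceeds the configured limit."""
    flagged = {}
    for metric, limit in THRESHOLDS.items():
        prod, staging = prod_metrics[metric], staging_metrics[metric]
        relative_change = abs(staging - prod) / max(abs(prod), 1e-9)
        if relative_change > limit:
            flagged[metric] = {"prod": prod, "staging": staging, "change": relative_change}
    return flagged

flags = regression_report(
    {"mean_reality_score": 0.88, "open_rate": 0.93, "record_count": 205_000_000},
    {"mean_reality_score": 0.84, "open_rate": 0.93, "record_count": 206_000_000},
)
print(flags)  # only mean_reality_score is flagged for review
```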
Conclusion
In this blog post, we outlined the systems and processes that Foursquare employs to generate a reliable dataset of places worldwide. In subsequent blog posts, we will dive into the details of our Data Quality Tracking framework, and how our network of loyal Foursquare app users helps preserve the quality of our data.
Original article published to the Foursquare Developer blog: https://location.foursquare.com/resources/blog/developer/foursquare-places/
Authors:
Foursquare Engineering: Vikram Gundeti, Horace Williams, Jorge Israel Peña, Zisis Petrou, Ph.D. & David Bortnichak
Foursquare Product: Sandhya N. & Jen Foran