Improving OPM Locality Pay Data Access Through Web Scraping and API Creation
Nicholas Galiardo, CWA, TE I, TM I
Creating efficient, sustainable, and powerful technical solutions
In an effort to enhance salary data accuracy and streamline the estimation process for salary and benefits (S&B) as part of the annual budget formulation process, our team set out to improve access to the U.S. Office of Personnel Management (OPM) Locality Pay data. This initiative was driven by the need for more precise salary estimations that account for variations in pay rates based on the locality of each employee's home of record. Here's how we tackled this challenge.
The Challenge
The OPM provides a wealth of salary information, but the data is scattered across various web pages and is not offered in an efficient, easily accessible, structured format. The two key datasets we needed access to were:
Locality Pay Area Definitions: the mapping of counties to their locality pay areas, published as HTML tables across multiple pages.
Pay Tables: the annual salary tables for each locality pay area, published as XML files.
Unfortunately, OPM does not provide API endpoints for these datasets, nor a unified method for parsing this data. Additionally, the data is referenced via locality area codes that are not made publicly available, and the Staffing Plan dataset we were working with only contains city and state information, not the county-level details necessary for proper data matching.
The Solution
To wrangle the necessary data into a usable format for repeated use, we developed a solution that combined web scraping with the creation of a custom API.
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Unlike APIs, which are designed to deliver structured data directly, many websites store their data in HTML tables or text on their pages. Web scraping tools parse through this content and organize the information into a more usable format, such as JSON or CSV.
In our case, we needed to gather OPM's locality pay data from scattered HTML tables across multiple web pages. By using web scraping, we could automate this process, transforming the raw HTML content into structured data that could be easily analyzed and integrated into our systems.
Once the data was scraped and organized, we saved it to several new tables in our Databricks environment.
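As a minimal sketch, assuming the scraped results were exported from the browser as CSV and reusing the same "Databricks" DSN described later in this post (the file name here is hypothetical), the load could look like this:
r
library(DBI)
library(odbc)

con <- dbConnect(odbc(), "Databricks")
definitions <- read.csv("opm_locality_definitions.csv")  # hypothetical export

# Write the scraped definitions into the OPM schema
dbWriteTable(con, Id(schema = "opm_data", table = "opm_locality_definitions"),
             definitions, overwrite = TRUE)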
What is an API?
An API (Application Programming Interface) is a set of rules that allows different software applications to communicate with each other. It defines how requests are made between systems, what data is sent, and how responses are returned. APIs are commonly used to retrieve structured data from a service, database, or external system, making it easier to integrate this data into applications.
In our solution, we built a custom API using the Plumber framework to cross-reference geocoded data (location data like city and state) with OPM's locality pay information. This API allows users to request specific locality pay data for federal employee compensation based on geographic criteria, such as city or county.
Web Scraping OPM Data for Locality Pay Area Definitions
To gather the necessary OPM data for locality pay areas, we utilized web scraping techniques to extract the information from OPM’s website. Since OPM does not provide an easy-to-use API for retrieving this data, we built a custom solution to scrape the locality pay area definitions and organize them in a structured format for further use.
The Scraping Process
We used JavaScript and jQuery to extract the locality pay area definitions from the HTML tables found on the OPM website. Here’s an overview of the process:
Defining the States: We first defined the list of U.S. states that would be relevant for extracting locality pay area definitions. This list included all 50 states, and we iterated through each state to locate the corresponding data.
Locating the State Sections: Using the jQuery :contains() selector, we searched for headers (h4) on the page that contain the state names. These headers mark the beginning of each state’s locality pay data section.
JavaScript
var results = $('h4:contains("' + state + '")');
Extracting Locality and County Data: Once the state sections were located, we used jQuery to traverse the DOM and extract the locality pay area and the corresponding county information. The locality was typically found in a preceding header (h3), and the county data was within a table structure located next to it.
The code below demonstrates this logic:
JavaScript
var locality = $(e).prevAll('h3').first().text(); // Extract locality
var countyName = $(county).find('td').first().text(); // Extract county name
Storing the Data: After extracting the locality and county data, we organized it into a structured object, where the county names were used as keys and the locality pay area was the value. The data was then stored in a dictionary for each state.
JavaScript
counties[countyName] = { LocalityPayArea: locality };
Creating a Comprehensive Data Structure: This process was repeated for each state, resulting in a comprehensive object that mapped states to their respective locality pay areas and county data. This structured data was then used to populate the locality pay area definitions dataset, which was later transformed into a more usable JSON format for analysis.
JavaScript
stateList[state] = counties;
Example of the Scraping Code
Below is a snippet of the JavaScript code that was used to scrape the locality pay area definitions:
JavaScript
var states = ['Alabama', 'Alaska', 'Arizona', ...]; // List of states
var stateList = {};

states.forEach(function(state, i) {
  var results = $('h4:contains("' + state + '")');
  var counties = {};
  $(results).each(function(i, e) {
    var locality = $(e).prevAll('h3').first().text(); // Extract locality
    $(e).next('table').find('tbody').find('tr').each(function(i, county) {
      var countyName = $(county).find('td').first().text(); // Extract county
      counties[countyName] = { LocalityPayArea: locality }; // Store data
    });
  });
  stateList[state] = counties; // Store the final data for the state
});
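For illustration, the resulting stateList object takes roughly this shape (the values below are examples, not actual OPM output):
JavaScript
// Illustrative shape of the scraped result
var stateList = {
  "Alabama": {
    "Jefferson": { LocalityPayArea: "Birmingham-Hoover-Talladega, AL" }
  },
  "California": {
    "Alameda": { LocalityPayArea: "San Jose-San Francisco-Oakland, CA" }
  }
};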
Converting XML Pay Tables to JSON and Joining with Scraped Data
Once we had the Locality Pay Area Definitions in a structured JSON format, we needed to fetch and integrate the Pay Tables data, which was originally available in XML format on the OPM website.
Fetching and Converting XML Pay Tables
To fetch the XML Pay Tables, we used JavaScript to retrieve the XML files directly from the OPM website. This was done through the following function, which fetches the XML content for a given year and one of the locality codes scraped earlier:
JavaScript
var fullPay = [];

function getXML(year, locality) {
  fetch(`https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/salary-tables/xml/${year}/${locality}.xml`)
    .then(response => response.text())
    .then(str => new DOMParser().parseFromString(str, "text/xml"))
    .then(data => {
      fullPay.push(xml2json(data)); // Convert the parsed XML document to JSON
    });
}
This function sends a request to the relevant OPM URL, parses the XML response using DOMParser, converts the parsed document into a usable JSON format by calling the xml2json() function, and pushes the result onto the fullPay array.
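For example, the function can then be invoked for each pay table of interest; the year and locality codes below are illustrative:
JavaScript
// Fetch the 2024 pay tables for a few illustrative locality codes
['DCB', 'SF', 'NY'].forEach(function(code) {
  getXML(2024, code);
});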
The xml2json() Function
The xml2json() function is responsible for converting the XML data into JSON format. Here’s how the function works:
JavaScript
function xml2json(xml) {
  try {
    var obj = {};
    if (xml.children.length > 0) {
      for (var i = 0; i < xml.children.length; i++) {
        var item = xml.children.item(i);
        var nodeName = item.nodeName;
        if (typeof obj[nodeName] == "undefined") {
          obj[nodeName] = xml2json(item);
        } else {
          if (typeof obj[nodeName].push == "undefined") {
            var old = obj[nodeName];
            obj[nodeName] = [];
            obj[nodeName].push(old);
          }
          obj[nodeName].push(xml2json(item));
        }
      }
    } else {
      obj = xml.textContent;
    }
    return obj;
  } catch (e) {
    console.log(e.message);
  }
}
This function recursively converts an XML document into a JSON object. It processes each XML element by checking its children and building a JSON object with the node names as keys. If an element has multiple occurrences, it ensures that the values are stored as arrays.
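A quick way to see this array-collapsing behavior is to run the function on a simplified document; the XML below is illustrative, not the actual OPM schema:
JavaScript
// Repeated elements (two <Grade> nodes) collapse into a JSON array
var doc = new DOMParser().parseFromString(
  "<PayTable><Grades><Grade><Value>1</Value></Grade><Grade><Value>2</Value></Grade></Grades></PayTable>",
  "text/xml");
console.log(JSON.stringify(xml2json(doc)));
// => {"PayTable":{"Grades":{"Grade":[{"Value":"1"},{"Value":"2"}]}}}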
Flattening and Storing Data
Once the data was converted to JSON and the XML pay tables were successfully joined with the scraped locality pay area definitions, we needed to flatten the nested JSON structure into a more usable format, such as a CSV, for easier analysis and importing into tools like Databricks.
We used JavaScript to flatten the dataset by iterating over the nested objects and extracting the relevant information into a tabular format.
Flattening the Nested JSON
Here is the code used to flatten the final dataset. It processes the nested pay table data and extracts details such as PayTable Abbreviation, Locality Pay Area, Grade, Step, and pay rates for Annual, Hourly, Overtime, and other fields:
JavaScript
var flattenData = []; // Holds one flat record per grade/step combination

fullPay.forEach(function(area) {
  area.PayTable.Grades.Grade.forEach(function(grade) {
    grade.Steps.Step.forEach(function(step) {
      flattenData.push({
        Abbreviation: area.PayTable.Abbreviation,
        LocalityPayArea: area.PayTable.LocalityPayArea,
        Description: area.PayTable.Description,
        Notes: area.PayTable.Notes,
        SalaryCap: area.PayTable.SalaryCap,
        Type: area.PayTable.Type,
        Grade: grade.Value,
        Step: step.Value,
        Annual: step.Annual,
        Hourly: step.Hourly,
        Overtime: step.Overtime,
        Statutory_Cap_Applies: step.Statutory_Cap_Applies
      });
    });
  });
});
The code iterates over each pay table in fullPay, then over each grade and each step within that grade, pushing one flat record per grade and step combination onto the flattenData array.
This flattened data was then stored in a CSV format and imported into Databricks for further analysis. The dataset was stored in our OPM schema, making it accessible for various analytical tools.
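The export step itself is straightforward; a minimal sketch of turning the flattened records into CSV text in the browser might look like this:
JavaScript
// Build a CSV string from the flattened records
var headers = Object.keys(flattenData[0]);
var csvRows = flattenData.map(function(row) {
  return headers.map(function(h) {
    return '"' + String(row[h]).replace(/"/g, '""') + '"'; // Escape embedded quotes
  }).join(',');
});
var csv = [headers.join(',')].concat(csvRows).join('\n');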
How This Data is Used
With the locality pay area definitions scraped and organized, this structured data is used by our custom-built API to cross-reference county data from the Staffing Plan dataset. By geocoding the city and state information from the staffing plan, we can determine the associated county and use the scraped locality pay area data to fetch accurate pay rates for federal employees.
Geocoding Address Data with the OSM API
To further enhance the functionality of our API, we integrated OpenStreetMap (OSM)’s Nominatim geocoding API to retrieve county-level information for a given address. This allows users to provide a city and state, and the API will automatically determine the corresponding county using the geocoding API, ensuring that the data can be accurately cross-referenced against OPM’s locality pay tables.
The Geocoding Process
Here’s an overview of how the geocoding process works:
User Inputs: The user provides a city, state, grade, step, and an optional flag (agg) to aggregate data.
Check for Special Cases: If the city is located in special regions like Washington D.C. or certain states (Hawaii, Alaska), we directly fetch locality pay data for those areas.
Geocoding Request: If the city and state are not in the special cases, the system constructs a URL for the OSM Nominatim API to geocode the provided city and state into county data:
JavaScript
const apiUrl = `https://nominatim.openstreetmap.org/search?city=${encodeURIComponent(city)}&state=${encodeURIComponent(state)}&country=USA&format=json&addressdetails=1`;
County Retrieval: After the geocoding request is made, the API receives the response and extracts the county from the data:
JavaScript
const county = data[0].address.county || 'County not found';
Cross-Referencing with OPM Data: Once the county is identified, the locality pay data is retrieved from our custom-built API, using the county as a key to find matching locality pay data. If the county is not found, the system checks for alternative options such as using the city name.
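The sketch below ties these steps together. The path and parameter names of the internal API call are illustrative placeholders, not the exact production endpoint:
JavaScript
// Geocode a city/state to a county, then look up locality pay
// (the /locality-pay endpoint and its parameters are illustrative)
async function getLocalityPay(city, state, grade, step) {
  const apiUrl = `https://nominatim.openstreetmap.org/search?city=${encodeURIComponent(city)}&state=${encodeURIComponent(state)}&country=USA&format=json&addressdetails=1`;
  const data = await (await fetch(apiUrl)).json();
  // Fall back to the city name if no county is returned
  const county = (data[0] && data[0].address.county) || city;
  const payResponse = await fetch(`/locality-pay?county=${encodeURIComponent(county)}&grade=${grade}&step=${step}`);
  return payResponse.json();
}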
How the Custom API Works
Our custom API is built with the Plumber framework in R, providing an efficient and flexible way to query the OPM Locality Pay data. The API exposes an endpoint that allows users to retrieve specific data based on parameters such as year, state, locality, grade, and step.
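As a minimal sketch, a Plumber endpoint along these lines could serve such requests; the path, annotations, and placeholder response are illustrative, while the connection, validation, and query logic are described in the sections that follow:
r
# plumber.R -- illustrative endpoint definition
library(plumber)

#* Retrieve OPM locality pay data for the given criteria
#* @param year Pay table year
#* @param state State name
#* @param locality Locality pay area
#* @param grade GS grade
#* @param step Step within the grade
#* @get /locality-pay
function(year, state, locality, grade, step) {
  # Validation and the parameterized Databricks query (shown below) go here;
  # Plumber serializes the returned object to JSON automatically.
  list(year = year, state = state, locality = locality,
       grade = grade, step = step)
}
Running plumb("plumber.R")$run() then exposes the endpoint locally for testing.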
1. Connection to Databricks Database
The API establishes a connection to a Databricks instance using the ODBC (Open Database Connectivity) interface. We preconfigured the connection between our Posit environment and Databricks and assigned an appropriately named DSN. The connection is initialized with the following code:
r
con <- dbConnect(odbc(), "Databricks")
This connection allows the API to query the OPM Locality Pay data stored in the Databricks environment, retrieving the most up-to-date salary information.
2. Query Construction and SQL Injection Prevention
To ensure data security and prevent SQL injection vulnerabilities, we use parameterized queries rather than directly embedding user input into the SQL query. This method ensures that user inputs are treated as data, not executable code.
Here’s an example of a safe query construction using parameterized queries:
r
query <- "SELECT DISTINCT opt.*, old.State, old.State_Abbreviation, old.Locality
FROM opm_data.opm_locality_definitions old
LEFT JOIN opm_data.opm_pay_tables opt
ON opt.Abbreviation = old.localityCode AND opt.Year = old.Year
WHERE old.Year IN (?)"
# Assuming state, locality, grade, and step are vectors, bind those as parameters
params <- list(year, state, locality, grade, step)
dbGetQuery(con, query, params)
By using parameterized queries, the user inputs are safely handled, preventing any malicious SQL code from being injected.
Additionally, input validation is applied to ensure that parameters like year, state, locality, grade, and step are in the correct format and free from dangerous characters. For example, we can ensure that year is numeric and that state contains only letters and spaces:
r
# Ensure 'year' is a numeric value
if (!is.numeric(year)) {
  stop("Invalid year value")
}

# Ensure 'state' contains only letters and spaces
if (!grepl("^[a-zA-Z ]*$", state)) {
  stop("Invalid state value")
}
3. Dynamic Query Execution
Once the query has been constructed and the parameters safely bound, the API executes the query using the dbGetQuery() function from the DBI package:
r
dbGetQuery(con, query, params = params)
The query results are then returned from Databricks as a data frame, which in turn is transformed into JSON and served back as the response to the initial API call, containing the locality pay data that matches the specified parameters.
4. Returning Filtered Data
The data returned by the API is structured to include the relevant locality pay details, such as the locality pay area, grade, step, and the corresponding annual, hourly, and overtime rates.
This data can then be used for further analysis or integration into other applications, such as R, Python, or Power BI.
Example Workflow
If a user enters "San Francisco" and "California" along with a grade of 12 and step 5, the system first geocodes the location to retrieve the county ("San Francisco County"). Then, the locality pay data for "San Francisco County" is fetched from the OPM database through the custom API.
Practical Application
We’ve successfully made API calls in R, Python, and Power BI Desktop to seamlessly pull OPM locality pay data into our analytical workflows. This development allows for more accurate salary projections and better decision-making based on up-to-date compensation data.
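For instance, a call from R might look like the following; the host name is an illustrative placeholder:
r
library(httr)
library(jsonlite)

# Query the custom API for San Francisco, CA at grade 12, step 5
resp <- GET("https://our-posit-server/locality-pay",
            query = list(year = 2024, city = "San Francisco",
                         state = "California", grade = 12, step = 5))
pay_data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))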
Conclusion
Our custom API provides a secure, efficient, and flexible way to access and filter OPM Locality Pay data, ensuring that users can easily retrieve the information they need based on specific geographic and pay grade criteria. By integrating the OSM geocoding API, we’ve added a powerful feature to the API, enabling users to submit a city and state and receive county-level data that can be cross-referenced with OPM’s locality pay tables. This solution enhances the accuracy and accessibility of salary and benefits data for federal employees.
Take a look at the presentation slides and code below for more details.
Interested in exploring custom API connections or web scraping? Let’s connect!