Improving OPM Locality Pay Data Access Through Web Scraping and API Creation
Nicholas Galiardo, CWA, TE I, TM I
Creating efficient, sustainable, and powerful technical solutions
In an effort to enhance salary data accuracy and streamline the estimation process for salary and benefits (S&B) as part of the annual budget formulation process, our team set out to improve access to the U.S. Office of Personnel Management (OPM) Locality Pay data. This initiative was driven by the need for more precise salary estimations that account for variations in pay rates based on the locality of each employee's home of record. Here's how we tackled this challenge.
The Challenge
The OPM provides a wealth of salary information, but the data is scattered across various web pages and is not offered in an efficient, easily accessible, structured format. The two key datasets we needed access to were:
Locality Pay Area Definitions: the mapping of counties to their locality pay areas, published as HTML tables across multiple pages.
Pay Tables: the annual salary tables for each locality pay area, published as XML files.
Unfortunately, OPM does not provide API endpoints for these datasets, nor a unified method for parsing this data. Additionally, the data is referenced via locality area codes that are not made publicly available, and the Staffing Plan dataset we were working with only contains city and state information, not the county-level details necessary for proper data matching.
The Solution
To wrangle the necessary data into a usable format for repeated use, we developed a solution that combined web scraping with the creation of a custom API.
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Unlike APIs, which are designed to deliver structured data directly, many websites store their data in HTML tables or text on their pages. Web scraping tools parse through this content and organize the information into a more usable format, such as JSON or CSV.
In our case, we needed to gather OPM's locality pay data from scattered HTML tables across multiple web pages. By using web scraping, we could automate this process, transforming the raw HTML content into structured data that could be easily analyzed and integrated into our systems.
Once the data was scraped and organized, we saved it to several new tables in our Databricks environment.
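As a minimal sketch, assuming the scraped results were exported from the browser as CSV and reusing the same "Databricks" DSN described later in this post (the file name here is hypothetical), the load could look like this:
r
library(DBI)
library(odbc)

con <- dbConnect(odbc(), "Databricks")
definitions <- read.csv("opm_locality_definitions.csv")  # hypothetical export

# Write the scraped definitions into the OPM schema
dbWriteTable(con, Id(schema = "opm_data", table = "opm_locality_definitions"),
             definitions, overwrite = TRUE)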
What is an API?
An API (Application Programming Interface) is a set of rules that allows different software applications to communicate with each other. It defines how requests are made between systems, what data is sent, and how responses are returned. APIs are commonly used to retrieve structured data from a service, database, or external system, making it easier to integrate this data into applications.
In our solution, we built a custom API using the Plumber framework to cross-reference geocoded data (location data like city and state) with OPM's locality pay information. This API allows users to request specific locality pay data for federal employee compensation based on geographic criteria, such as city or county.
Web Scraping OPM Data for Locality Pay Area Definitions
To gather the necessary OPM data for locality pay areas, we utilized web scraping techniques to extract the information from OPM’s website. Since OPM does not provide an easy-to-use API for retrieving this data, we built a custom solution to scrape the locality pay area definitions and organize them in a structured format for further use.
The Scraping Process
We used JavaScript and jQuery to extract the locality pay area definitions from the HTML tables found on the OPM website. Here’s an overview of the process:
Defining the States: We first defined the list of U.S. states that would be relevant for extracting locality pay area definitions. This list included all 50 states, and we iterated through each state to locate the corresponding data.
Locating the State Sections: Using the jQuery :contains() selector, we searched for headers (h4) on the page that contain the state names. These headers mark the beginning of each state’s locality pay data section.
JavaScript
var results = $('h4:contains("' + state + '")');
Extracting Locality and County Data: Once the state sections were located, we used jQuery to traverse the DOM and extract the locality pay area and the corresponding county information. The locality was typically found in a preceding header (h3), and the county data was within a table structure located next to it.
The code below demonstrates this logic:
JavaScript
var locality = $(e).prevAll('h3').first().text(); // Extract locality
var countyName = $(county).find('td').first().text(); // Extract county name
Storing the Data: After extracting the locality and county data, we organized it into a structured object, where the county names were used as keys and the locality pay area was the value. The data was then stored in a dictionary for each state.
JavaScript
counties[countyName] = { LocalityPayArea: locality };
Creating a Comprehensive Data Structure: This process was repeated for each state, resulting in a comprehensive object that mapped states to their respective locality pay areas and county data. This structured data was then used to populate the locality pay area definitions dataset, which was later transformed into a more usable JSON format for analysis.
JavaScript
stateList[state] = counties;
Example of the Scraping Code
Below is a snippet of the JavaScript code that was used to scrape the locality pay area definitions:
JavaScript
var states = ['Alabama', 'Alaska', 'Arizona', ...]; // List of states
var stateList = {};

states.forEach(function(state, i) {
  var results = $('h4:contains("' + state + '")');
  var counties = {};
  $(results).each(function(i, e) {
    var locality = $(e).prevAll('h3').first().text(); // Extract locality
    $(e).next('table').find('tbody').find('tr').each(function(i, county) {
      var countyName = $(county).find('td').first().text(); // Extract county
      counties[countyName] = { LocalityPayArea: locality }; // Store data
    });
  });
  stateList[state] = counties; // Store the final data for the state
});
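For illustration, the resulting stateList object takes roughly this shape (the values below are examples, not actual OPM output):
JavaScript
// Illustrative shape of the scraped result
var stateList = {
  "Alabama": {
    "Jefferson": { LocalityPayArea: "Birmingham-Hoover-Talladega, AL" }
  },
  "California": {
    "Alameda": { LocalityPayArea: "San Jose-San Francisco-Oakland, CA" }
  }
};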
Converting XML Pay Tables to JSON and Joining with Scraped Data
Once we had the Locality Pay Area Definitions in a structured JSON format, we needed to fetch and integrate the Pay Tables data, which was originally available in XML format on the OPM website.
Fetching and Converting XML Pay Tables
To fetch the XML Pay Tables, we used JavaScript to retrieve the XML files directly from the OPM website. This was done through the following function, which fetches the XML content for a given year and one of the locality codes scraped earlier:
JavaScript
var fullPay = [];

function getXML(year, locality) {
  fetch(`https://www.opm.gov/policy-data-oversight/pay-leave/salaries-wages/salary-tables/xml/${year}/${locality}.xml`)
    .then(response => response.text())
    .then(str => new DOMParser().parseFromString(str, "text/xml"))
    .then(data => {
      fullPay.push(xml2json(data)); // Convert the parsed XML document to JSON
    });
}
This function sends a request to the relevant OPM URL, parses the XML response using DOMParser, converts the parsed document into a usable JSON format by calling the xml2json() function, and pushes the result onto the fullPay array.
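For example, the function can then be invoked for each pay table of interest; the year and locality codes below are illustrative:
JavaScript
// Fetch the 2024 pay tables for a few illustrative locality codes
['DCB', 'SF', 'NY'].forEach(function(code) {
  getXML(2024, code);
});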
The xml2json() Function
The xml2json() function is responsible for converting the XML data into JSON format. Here’s how the function works:
JavaScript
function xml2json(xml) {
  try {
    var obj = {};
    if (xml.children.length > 0) {
      for (var i = 0; i < xml.children.length; i++) {
        var item = xml.children.item(i);
        var nodeName = item.nodeName;
        if (typeof obj[nodeName] == "undefined") {
          obj[nodeName] = xml2json(item);
        } else {
          if (typeof obj[nodeName].push == "undefined") {
            var old = obj[nodeName];
            obj[nodeName] = [];
            obj[nodeName].push(old);
          }
          obj[nodeName].push(xml2json(item));
        }
      }
    } else {
      obj = xml.textContent;
    }
    return obj;
  } catch (e) {
    console.log(e.message);
  }
}
This function recursively converts an XML document into a JSON object. It processes each XML element by checking its children and building a JSON object with the node names as keys. If an element has multiple occurrences, it ensures that the values are stored as arrays.
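A quick way to see this array-collapsing behavior is to run the function on a simplified document; the XML below is illustrative, not the actual OPM schema:
JavaScript
// Repeated elements (two <Grade> nodes) collapse into a JSON array
var doc = new DOMParser().parseFromString(
  "<PayTable><Grades><Grade><Value>1</Value></Grade><Grade><Value>2</Value></Grade></Grades></PayTable>",
  "text/xml");
console.log(JSON.stringify(xml2json(doc)));
// => {"PayTable":{"Grades":{"Grade":[{"Value":"1"},{"Value":"2"}]}}}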
Flattening and Storing Data
Once the data was converted to JSON and the XML pay tables were successfully joined with the scraped locality pay area definitions, we needed to flatten the nested JSON structure into a more usable format, such as a CSV, for easier analysis and importing into tools like Databricks.
We used JavaScript to flatten the dataset by iterating over the nested objects and extracting the relevant information into a tabular format.
Flattening the Nested JSON
Here is the code used to flatten the final dataset. It processes the nested pay table data and extracts details such as PayTable Abbreviation, Locality Pay Area, Grade, Step, and pay rates for Annual, Hourly, Overtime, and other fields:
JavaScript
var flattenData = []; // Holds one flat record per grade/step combination

fullPay.forEach(function(area) {
  area.PayTable.Grades.Grade.forEach(function(grade) {
    grade.Steps.Step.forEach(function(step) {
      flattenData.push({
        Abbreviation: area.PayTable.Abbreviation,
        LocalityPayArea: area.PayTable.LocalityPayArea,
        Description: area.PayTable.Description,
        Notes: area.PayTable.Notes,
        SalaryCap: area.PayTable.SalaryCap,
        Type: area.PayTable.Type,
        Grade: grade.Value,
        Step: step.Value,
        Annual: step.Annual,
        Hourly: step.Hourly,
        Overtime: step.Overtime,
        Statutory_Cap_Applies: step.Statutory_Cap_Applies
      });
    });
  });
});
The code iterates over each pay table in fullPay, then over each grade and each step within that grade, pushing one flat record per grade and step combination onto the flattenData array.
This flattened data was then stored in a CSV format and imported into Databricks for further analysis. The dataset was stored in our OPM schema, making it accessible for various analytical tools.
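The export step itself is straightforward; a minimal sketch of turning the flattened records into CSV text in the browser might look like this:
JavaScript
// Build a CSV string from the flattened records
var headers = Object.keys(flattenData[0]);
var csvRows = flattenData.map(function(row) {
  return headers.map(function(h) {
    return '"' + String(row[h]).replace(/"/g, '""') + '"'; // Escape embedded quotes
  }).join(',');
});
var csv = [headers.join(',')].concat(csvRows).join('\n');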
How This Data is Used
With the locality pay area definitions scraped and organized, this structured data is used by our custom-built API to cross-reference county data from the Staffing Plan dataset. By geocoding the city and state information from the staffing plan, we can determine the associated county and use the scraped locality pay area data to fetch accurate pay rates for federal employees.
Geocoding Address Data with the OSM API
To further enhance the functionality of our API, we integrated OpenStreetMap (OSM)’s Nominatim geocoding API to retrieve county-level information for a given address. This allows users to provide a city and state, and the API will automatically determine the corresponding county using the geocoding API, ensuring that the data can be accurately cross-referenced against OPM’s locality pay tables.
The Geocoding Process
Here’s an overview of how the geocoding process works:
User Inputs: The user provides a city, state, grade, step, and an optional flag (agg) to aggregate data.
Check for Special Cases: If the city is located in special regions like Washington D.C. or certain states (Hawaii, Alaska), we directly fetch locality pay data for those areas.
Geocoding Request: If the city and state are not in the special cases, the system constructs a URL for the OSM Nominatim API to geocode the provided city and state into county data:
JavaScript
const apiUrl = `https://nominatim.openstreetmap.org/search?city=${encodeURIComponent(city)}&state=${encodeURIComponent(state)}&country=USA&format=json&addressdetails=1`;
County Retrieval: After the geocoding request is made, the API receives the response and extracts the county from the data:
JavaScript
const county = data[0].address.county || 'County not found';
Cross-Referencing with OPM Data: Once the county is identified, the locality pay data is retrieved from our custom-built API, using the county as a key to find matching locality pay data. If the county is not found, the system checks for alternative options such as using the city name.
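The sketch below ties these steps together. The path and parameter names of the internal API call are illustrative placeholders, not the exact production endpoint:
JavaScript
// Geocode a city/state to a county, then look up locality pay
// (the /locality-pay endpoint and its parameters are illustrative)
async function getLocalityPay(city, state, grade, step) {
  const apiUrl = `https://nominatim.openstreetmap.org/search?city=${encodeURIComponent(city)}&state=${encodeURIComponent(state)}&country=USA&format=json&addressdetails=1`;
  const data = await (await fetch(apiUrl)).json();
  // Fall back to the city name if no county is returned
  const county = (data[0] && data[0].address.county) || city;
  const payResponse = await fetch(`/locality-pay?county=${encodeURIComponent(county)}&grade=${grade}&step=${step}`);
  return payResponse.json();
}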
How the Custom API Works
Our custom API is built with the Plumber framework in R, providing an efficient and flexible way to query the OPM Locality Pay data. The API exposes an endpoint that allows users to retrieve specific data based on parameters such as year, state, locality, grade, and step.
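As a minimal sketch, a Plumber endpoint along these lines could serve such requests; the path, annotations, and placeholder response are illustrative, while the connection, validation, and query logic are described in the sections that follow:
r
# plumber.R -- illustrative endpoint definition
library(plumber)

#* Retrieve OPM locality pay data for the given criteria
#* @param year Pay table year
#* @param state State name
#* @param locality Locality pay area
#* @param grade GS grade
#* @param step Step within the grade
#* @get /locality-pay
function(year, state, locality, grade, step) {
  # Validation and the parameterized Databricks query (shown below) go here;
  # Plumber serializes the returned object to JSON automatically.
  list(year = year, state = state, locality = locality,
       grade = grade, step = step)
}
Running plumb("plumber.R")$run() then exposes the endpoint locally for testing.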
1. Connection to Databricks Database
The API establishes a connection to a Databricks instance using the ODBC (Open Database Connectivity) interface. We preconfigured the connection between our Posit environment and Databricks and assigned an appropriately named DSN. The connection is initialized with the following code:
r
con <- dbConnect(odbc(), "Databricks")
This connection allows the API to query the OPM Locality Pay data stored in the Databricks environment, retrieving the most up-to-date salary information.
2. Query Construction and SQL Injection Prevention
To ensure data security and prevent SQL injection vulnerabilities, we use parameterized queries rather than directly embedding user input into the SQL query. This method ensures that user inputs are treated as data, not executable code.
Here’s an example of a safe query construction using parameterized queries:
r
query <- "SELECT DISTINCT opt.*, old.State, old.State_Abbreviation, old.Locality
FROM opm_data.opm_locality_definitions old
LEFT JOIN opm_data.opm_pay_tables opt
ON opt.Abbreviation = old.localityCode AND opt.Year = old.Year
WHERE old.Year IN (?)"
# Assuming state, locality, grade, and step are vectors, bind those as parameters
params <- list(year, state, locality, grade, step)
dbGetQuery(con, query, params)
By using parameterized queries, the user inputs are safely handled, preventing any malicious SQL code from being injected.
Additionally, input validation is applied to ensure that parameters like year, state, locality, grade, and step are in the correct format and free from dangerous characters. For example, we can ensure that year is numeric and that state contains only letters and spaces:
r
# Ensure 'year' is a numeric value
if (!is.numeric(year)) {
  stop("Invalid year value")
}

# Ensure 'state' contains only letters and spaces
if (!grepl("^[a-zA-Z ]*$", state)) {
  stop("Invalid state value")
}
3. Dynamic Query Execution
Once the query has been constructed and the parameters safely bound, the API executes the query using the dbGetQuery() function from the DBI package:
r
dbGetQuery(con, query, params = params)
The query results are then returned from Databricks as a data frame, which in turn is transformed into JSON and served back as the response to the initial API call, containing the locality pay data that matches the specified parameters.
4. Returning Filtered Data
The data returned by the API is structured to include the relevant locality pay details, such as the locality pay area, grade, step, and the corresponding annual, hourly, and overtime rates.
This data can then be used for further analysis or integration into other applications, such as R, Python, or Power BI.
Example Workflow
If a user enters "San Francisco" and "California" along with a grade of 12 and step 5, the system first geocodes the location to retrieve the county ("San Francisco County"). Then, the locality pay data for "San Francisco County" is fetched from the OPM database through the custom API.
Practical Application
We’ve successfully made API calls in R, Python, and Power BI Desktop to seamlessly pull OPM locality pay data into our analytical workflows. This development allows for more accurate salary projections and better decision-making based on up-to-date compensation data.
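For instance, a call from R might look like the following; the host name is an illustrative placeholder:
r
library(httr)
library(jsonlite)

# Query the custom API for San Francisco, CA at grade 12, step 5
resp <- GET("https://our-posit-server/locality-pay",
            query = list(year = 2024, city = "San Francisco",
                         state = "California", grade = 12, step = 5))
pay_data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))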
Conclusion
Our custom API provides a secure, efficient, and flexible way to access and filter OPM Locality Pay data, ensuring that users can easily retrieve the information they need based on specific geographic and pay grade criteria. By integrating the OSM geocoding API, we’ve added a powerful feature to the API, enabling users to submit a city and state and receive county-level data that can be cross-referenced with OPM’s locality pay tables. This solution enhances the accuracy and accessibility of salary and benefits data for federal employees.
Take a look at the presentation slides and code below for more details.
Interested in exploring custom API connections or web scraping? Let’s connect!