Entity Resolution: The Cornerstone of BSA Data Analysis
Christopher Westphal
Entity Resolution (ER) is the process of establishing equivalence among data that refer to the same real-world entity (person, organization, place, etc.), even when those data are duplicated, inconsistent, incomplete, or erroneous. This process is essential for analyzing Bank Secrecy Act (BSA) data used by law enforcement, regulatory bodies, and national security agencies to combat money laundering, terrorist financing, and other illicit financial activities.
More than 260,000 regulated financial institutions submit over 55,000 reports daily (exceeding 20 million annually) to the Financial Crimes Enforcement Network (FinCEN), the federal agency responsible for BSA oversight. These reports detail suspicious transactions, control of foreign financial accounts, beneficial ownership, and bulk cash movement and purchases. However, analyzing this data presents significant challenges: name variations, nicknames, aliases, and misspellings (often intentional) can obscure crucial links between records pertaining to the same individual.
ER directly addresses these challenges by accurately linking records and consolidating data, building a comprehensive view of individuals, organizations, and their financial activities. This "big picture" is essential for understanding the scale of money laundering operations, mapping connections to expose criminal networks, and ultimately aiding law enforcement in targeting key figures.
Each BSA submission made to FinCEN includes core details about the activity (filing, transaction, or event), such as dates, times, amounts, descriptions, types, and other relevant information. Certain data elements, such as phone numbers, addresses, identification numbers, IP addresses, and email addresses, are categorically unique: each value inherently represents a distinct analytical entity and is unlikely to be confused with similar values or content.
While many of these values may have different formats, various transformation and cleanup processes can standardize them for proper identification. For instance, the phone numbers (123) 456-7890, 123.456.7890, and 1234567890 are easily recognized as referring to the same entity. For more complex data like addresses, standard parsers, abbreviation lists, and lookup tables can accurately resolve inconsistencies and help ensure accurate matching.
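A minimal sketch of the phone cleanup described above is shown here. The function name normalize_phone and the assumption that all numbers are 10-digit U.S. numbers are illustrative only and are not part of any FinCEN tooling.

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce a phone number to its digits (assumes 10-digit U.S. numbers)."""
    digits = re.sub(r"\D", "", raw)            # drop everything except digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                     # drop a leading country code
    return digits

# The three formats quoted in the text collapse to a single value.
variants = ["(123) 456-7890", "123.456.7890", "1234567890"]
normalized = {normalize_phone(v) for v in variants}
print(normalized)
```

Addresses need more than this (parsers, abbreviation lists, lookup tables), so the same one-line approach should not be expected to work for them.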
Consolidation becomes more challenging with less distinctive data, such as common names, which often lack the specificity needed for accurate entity identification. The name "John Smith" illustrates this difficulty: without further information, distinguishing between multiple individuals becomes problematic. Is the "John Smith" from New York the same as the one from New Jersey? Is the "John Smith" born on 02/05/1998 the same as the one born on 05/02/1998? Is "John Smith" the same person as "Jon Smiths"?
To resolve these ambiguities, ER must rely on additional parameters or "descriptors" to establish unique identities. These can include demographic information such as gender, date of birth, age; physical characteristics like height and weight (non-BSA data); and distinct identifiers like Social Security Numbers, driver’s licenses, or other types of IDs. The lack of these identifiers associated with names in datasets can lead to inaccuracies and inefficiencies in investigations.
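The "John Smith" questions can be made concrete with a small sketch. It assumes dates are written MM/DD/YYYY and uses Python's standard difflib as a stand-in similarity measure; the helper names name_similarity and dob_relation are hypothetical, and a production ER system would likely use purpose-built name matching.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough 0..1 similarity between two names (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def dob_relation(a: str, b: str) -> str:
    """Compare two MM/DD/YYYY strings: 'exact', 'transposed', or 'mismatch'."""
    if a == b:
        return "exact"
    m1, d1, y1 = a.split("/")
    m2, d2, y2 = b.split("/")
    # Same year with month and day swapped suggests a data-entry transposition.
    if y1 == y2 and m1 == d2 and d1 == m2:
        return "transposed"
    return "mismatch"

print(round(name_similarity("John Smith", "Jon Smiths"), 2))
print(dob_relation("02/05/1998", "05/02/1998"))
```

A high name similarity combined with a transposed date of birth is a hint, not proof, which is why the additional descriptors matter.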
BSA Format & Queries
For over two decades, the Bank Secrecy Act Electronic Filing System (BSA E-filing) has provided a secure, electronic method for regulated entities to submit BSA forms. This system enables filers to create and submit well-formatted XML "batch" files containing one or more forms, ensuring compliance with the specific XML schema for each form type (SAR, CTR, etc.). FinCEN collects and stores all of this data, which is subsequently made available via the FinCEN Query System (FCQ), as shown in Figure 1.
The XML format for all BSA forms uses consistent representations for data types such as addresses, identification numbers, phone numbers, email addresses, URLs, and other supporting information. This consistency extends to the representation of "parties" associated with the filings, including individuals, suspects, owners, institutions, referrals, law enforcement contacts, and bank officials. Each party's role in the reported activity is defined using a simple numerical lookup code.
FinCEN publishes the XML format for each BSA form, ensuring consistent data exchange. These formats provide a structured approach to detailing transactions and related entities. For example, the Suspicious Activity Report (SAR) format is documented in the FinCEN SAR XML User Guide.
To deliver effective analytics, it is important to understand the underlying XML structures and how they affect ER. Each reported ACTIVITY (SAR, CTR, 8300, FBAR, CMIR, DEOP, BOI) is assigned a unique identifier (ActivityID) that encapsulates all associated details. Within this activity wrapper, all parties involved are identified according to form-specific requirements. Every party, whether the filing institution, a bank representative, or the subject of the activity, is represented by a single, unique PARTY structure (PartyID) within a given activity. Each PARTY record can contain multiple sub-entities to reflect all associated addresses, phone numbers, identification numbers, email addresses, and names. Figure 2 shows the representative structure.
As illustrated in the abbreviated FinCEN XML specification shown in Figure 3, an ACTIVITY (e.g., 111…) can contain a PARTY (e.g., 222…) with multiple PARTYNAME entries (e.g., 333… and 444…), representing, for example, a "Legal name" and an "Also known as (AKA)" name. For clarity, other associated XML fields related to the activity, party, and party-name have been omitted.
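The nesting of ACTIVITY, PARTY, and PARTYNAME can be sketched with a deliberately simplified XML fragment. The element names, attribute names, and ID values below are hypothetical placeholders chosen for readability; they are not the actual FinCEN schema, which should be taken from the published user guides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one activity with one party and two names.
SAMPLE = """
<Activity ActivityID="A1">
  <Party PartyID="P1">
    <PartyName PartyNameID="N1" Type="Legal">JOHN SMITH</PartyName>
    <PartyName PartyNameID="N2" Type="AKA">JOHNNY SMITH</PartyName>
  </Party>
</Activity>
"""

root = ET.fromstring(SAMPLE.strip())

# Flatten the hierarchy into one row per name, keeping every parent identifier.
rows = [
    (root.get("ActivityID"), party.get("PartyID"),
     name.get("PartyNameID"), name.get("Type"), name.text)
    for party in root.iter("Party")
    for name in party.iter("PartyName")
]
for row in rows:
    print(row)
```

Flattening to rows like these is one convenient starting point for ER, because every name variation stays tied to the party and activity it came from.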
Each party record will always include a "Legal name" and may also contain zero or more "Also Known As" (AKA) or "Doing Business As" (DBA) entries. Each name representation is assigned a unique PartyNameID to distinguish every variation. FinCEN assigns a distinct PartyID to each party upon processing the form, ensuring a unique identifier for that specific entity. Consequently, even if the same bank submits multiple SARs or CTRs concerning the same individual, each submission generates a new PartyID, resulting in different numbers for every ActivityID for that individual within the FinCEN BSA E-filing database.
While this method of data representation is typical for many collection systems, it creates ER challenges, particularly when compounded by name variations. For instance, if one bank records an individual as JON SMITH and another bank as JOHNNY SMITH, a query searching for JOHN SMITH might overlook crucial data. Although a skilled analyst might try to account for these variations or use other identifying information such as address, phone number, email, identification number, or related entities and accounts, each attempt requires a separate query. This forces the analyst to manually track numerous entities and values and then synthesize all the results into a coherent diagram, a time-consuming and potentially error-prone process.
This limitation affects highly specific queries, such as searches for a particular name. It also affects operations such as SAR review teams and other investigations that use proactive queries, such as "all transactions for a specific region within a defined timeframe" (e.g., all SARs filed in Miami, FL in the past six months). If the underlying data is not properly resolved and aligned, the results may display numerous disconnected networks or overly dense representations due to the sheer number of entities present.
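The JON / JOHNNY / JOHN problem can be illustrated with a small nickname lookup. The NICKNAMES table and the canonical_name helper are hypothetical; a real system would rely on a much larger curated list plus fuzzy matching.

```python
# Hypothetical nickname table mapping first-name variants to one canonical form.
NICKNAMES = {"JON": "JOHN", "JOHNNY": "JOHN"}

def canonical_name(full_name: str) -> str:
    """Upper-case the name and replace a known first-name variant."""
    parts = full_name.upper().split()
    if not parts:
        return ""
    first, rest = parts[0], parts[1:]
    return " ".join([NICKNAMES.get(first, first), *rest])

records = ["JON SMITH", "JOHNNY SMITH", "JANE SMITH"]

# An exact-match query misses both variants; the canonical form finds them.
exact_hits = [r for r in records if r == "JOHN SMITH"]
resolved_hits = [r for r in records
                 if canonical_name(r) == canonical_name("JOHN SMITH")]
print(exact_hits)
print(resolved_hits)
```

With the variants aligned up front, a single query replaces the series of manual queries an analyst would otherwise have to run and reconcile.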
Figure 4 illustrates two very similar entities originating from a SAR and a CTR. While basic data cleaning can readily standardize their phone numbers, identification numbers, and addresses, standard analytical systems would likely treat them as different entities. Applying ER to this combined data would significantly improve the accuracy and reliability of the results. Varying the stringency of the matching criteria by using different combinations of data fields (such as name and address, name and identification number, name and phone number, or any combination) will further enhance the confidence in the resolved entities.
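Varying the stringency of matching can be sketched as a set of field-combination templates. The record values, field names, and template labels below are invented for illustration (they are not taken from Figure 4), and the sketch assumes the values have already been cleaned and standardized.

```python
# Each template lists the fields that must all agree for a match.
TEMPLATES = {
    "name+address": ("name", "address"),
    "name+id": ("name", "id_number"),
    "name+phone": ("name", "phone"),
}

def satisfied_templates(a: dict, b: dict) -> list:
    """Return the labels of every template on which both records agree."""
    return [
        label
        for label, fields in TEMPLATES.items()
        # A missing value never counts as agreement.
        if all(a.get(f) and a.get(f) == b.get(f) for f in fields)
    ]

# Hypothetical party records from a SAR and a CTR.
sar_party = {"name": "JOHN SMITH", "address": "12 MAIN ST",
             "id_number": "X123", "phone": "1234567890"}
ctr_party = {"name": "JOHN SMITH", "address": "12 MAIN ST",
             "id_number": None, "phone": "1234567890"}

matched = satisfied_templates(sar_party, ctr_party)
print(matched)
```

The more templates a pair satisfies, the more confidence there is that both records describe the same entity; a strict policy might require several, a looser one only one.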
The goal is to deliver a more complete, thorough, and reliable representation of all the critical data available for any investigation. Although the examples presented are based entirely on BSA content, ER becomes an analytical multiplier when combining data from other sources, including other government sources, social media, open-source information, and subscription services. Figure 5 depicts how data from SAR and CTR sources are easily integrated using ER by exposing similar names in combination with other core entities such as phones, emails, addresses, accounts, and ID numbers.
Ultimately, to enhance the core BSA XML framework, agencies should implement an ENTITY structure with a unique EntityID. This process, ideally performed daily during data loading or collection, would involve applying ER matching templates set at a "strict" level to ensure only highly probable matches are made. An associated EntityScore could quantify the strength of each match, serving as a filter for results. Less stringent matching criteria could then be employed for specialized analytical contexts, such as counter-terrorism investigations, where broader searches and comparisons are necessary.
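One way to picture the proposed EntityID and EntityScore is the sketch below. It is an assumption-laden illustration, not a FinCEN design: the score is simply the fraction of compared fields that agree, the EntityID is taken to be a representative PartyID of each cluster, and the party records are invented.

```python
from itertools import combinations

# Hypothetical party records, already cleaned, keyed by PartyID.
parties = {
    "P1": {"name": "JOHN SMITH", "phone": "1234567890", "id_number": "X123"},
    "P2": {"name": "JOHN SMITH", "phone": "1234567890", "id_number": "X123"},
    "P3": {"name": "JOHN SMITH", "phone": "5550001111", "id_number": None},
}
FIELDS = ("name", "phone", "id_number")

def entity_score(a: dict, b: dict) -> float:
    """Fraction of FIELDS on which both records share a non-empty value."""
    agree = sum(1 for f in FIELDS if a[f] and a[f] == b[f])
    return agree / len(FIELDS)

def resolve(records: dict, threshold: float) -> dict:
    """Map each PartyID to an EntityID by merging pairs scoring >= threshold."""
    parent = {pid: pid for pid in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups short
            x = parent[x]
        return x

    for p, q in combinations(records, 2):
        if entity_score(records[p], records[q]) >= threshold:
            parent[find(q)] = find(p)       # merge the two clusters
    return {pid: find(pid) for pid in records}

strict = resolve(parties, threshold=1.0)    # only full agreement merges
loose = resolve(parties, threshold=0.3)     # broader, exploratory matching
print(strict)
print(loose)
```

Running the strict pass at load time and reserving the looser threshold for specialized analysis mirrors the two levels of stringency described above.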
FinCEN Query System (FCQ)
During an IT modernization program circa 2012, FinCEN replaced the legacy Web Based Currency and Banking Retrieval System (WebCBRS) with the FinCEN Portal and Query System (FCQ), which permits authorized users to query the BSA data using an online database query application. Today, the FCQ supports more than 2.5 million searches each year across hundreds of agencies, task forces, and investigative units. An early screen capture of an FCQ search on several key fields is shown in Figure 6.
BSA data is a crucial resource for virtually all agencies conducting financial crime investigations and has become a standard component of many government investigations. Reports indicate that nearly 90% of IRS Criminal Investigation (IRS-CI) cases utilize BSA data. The FBI leverages this data in thousands of cases involving transnational criminal activity, public corruption, international terrorism, and organized crime. Homeland Security Investigations (HSI) relies on BSA data for a broad spectrum of criminal investigations, leading to indictments, convictions, and the seizure of billions in assets, including currency, virtual assets, and bulk cash.
FinCEN annually recognizes agencies for notable investigations that used BSA data to combat illicit activities, including the Drug Enforcement Administration (DEA) for interdicting transnational criminal organizations, the Department of Justice’s Civil Rights Division Criminal Section for human trafficking and smuggling investigations, and IRS-CI for detecting fraud, corruption, and the use of synthetic identifications. FinCEN recently published a compilation of cases in which BSA data was instrumental in identifying or supplementing investigations, and abridged synopses of several of these are provided below:
These cases demonstrate the critical role of BSA data in uncovering financial crimes and achieving successful prosecutions. While effective use of this data requires agencies to retrieve, consolidate, and integrate information from various sources, ER offers a significant advantage. By automatically linking disparate records, even those with inconsistencies, ER addresses the inherent fragmentation of BSA data. This process creates a comprehensive view of individuals, organizations, and their financial activities, revealing hidden connections and the true scope of illicit operations like money laundering. Ultimately, ER enhances the quality, efficiency, reliability, and scope of investigations, transforming fragmented data into actionable intelligence and strengthening the fight against financial crime.
Good Data Leads to Good Analysis and Results.