Data Extraction: Key Considerations

Data extraction is the process of retrieving data from multiple sources and consolidating it into a single destination. This allows organisations to aggregate and analyse data from different origins, providing a comprehensive view that supports better decision-making.


Avoiding Error-Prone and Non-Scalable Solutions

One of the most common yet inefficient methods of data extraction is manual entry. This often involves downloading data and copying it into spreadsheets or other tools. Manual entry is, of course, prone to human error. A study conducted by the Journal of Accountancy found that manual data entry can result in error rates between 1% and 5%. It also introduces latency in data availability, as human processing times cannot match automated systems.

The other common method of extracting and combining data is Excel macros. Automation scripts or macros in Excel can fail or produce incorrect results when data formats or structures change unexpectedly, and they are not designed to handle very large datasets efficiently. Excel can also create data silos, where data is stored in isolated files that are not easily accessible or integrable with other systems.

A more effective approach is to centralise your data into a data warehouse/lakehouse, ensuring that all data is accessible from a single point and is easier to manage and analyse.


How to Automate Data Extraction

To automate data extraction, there are two common approaches.

The first is to use EL (Extract and Load) tools like Fivetran, Airbyte, and Stitch, which offer pre-built connectors with intuitive point-and-click interfaces, eliminating the need for complex coding. These tools simplify the process of integrating various data sources, making data management more efficient and less error-prone.

Alternatively, custom Python scripts can be employed for data extraction. This approach is beneficial when dealing with uncommon data sources not covered by existing tools, or when the cost of EL tools is prohibitive. Custom scripts, however, require more resources for development and maintenance over the long term.
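As a minimal illustration, the sketch below shows what a custom extraction script might look like: it pages through a hypothetical REST API and appends the records into a warehouse staging table. The endpoint, field names, table name and connection string are all assumptions for illustration, not a prescribed implementation.

```python
import requests
import sqlalchemy as sa

# Hypothetical source API and warehouse details -- adjust to your own systems.
API_URL = "https://api.example.com/v1/orders"
WAREHOUSE_URI = "postgresql://user:password@warehouse-host:5432/analytics"

def extract_orders(page_size: int = 500):
    """Pull all pages of orders from the (hypothetical) source API."""
    page = 1
    while True:
        response = requests.get(
            API_URL, params={"page": page, "per_page": page_size}, timeout=30
        )
        response.raise_for_status()
        records = response.json()  # assumed to be a list of dicts with id/amount/updated_at
        if not records:
            break
        yield from records
        page += 1

def load_orders(records):
    """Append extracted records into a staging table in the warehouse."""
    engine = sa.create_engine(WAREHOUSE_URI)
    metadata = sa.MetaData()
    staging = sa.Table(
        "stg_orders", metadata,
        sa.Column("id", sa.String, primary_key=True),
        sa.Column("amount", sa.Numeric),
        sa.Column("updated_at", sa.DateTime),
    )
    metadata.create_all(engine)
    rows = list(records)
    if rows:
        with engine.begin() as conn:
            conn.execute(sa.insert(staging), rows)

if __name__ == "__main__":
    load_orders(extract_orders())
```

In practice a script like this would also need logging, retries and scheduling (for example via Airflow or cron), which is exactly the maintenance overhead mentioned above.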


Frequency and Methods of Data Extraction

Data extraction tools allow users to select the frequency of data retrieval, offering three primary methods:

Full Extraction: This method involves pulling all available data from the original source on every execution of the extraction process. You are most likely to do a full extraction when populating your data warehouse/lakehouse for the first time. There are cost implications to extracting data (as a general rule of thumb, the more data you pull, the higher the cost), so it is better not to rely on a full extraction every time.

Incremental Batch Extraction: With this model, only the data that has changed since the previous execution is pulled. This improves efficiency and reduces strain on resources, allowing faster processing of smaller batches and saving time and cost. Every data source has different capabilities for identifying changes and being selective in the data pull, but you would normally accomplish an incremental extraction by tracking the latest timestamp already present in the warehouse (see the sketch after these three methods).

Incremental Stream Extraction: Incremental stream extraction continuously captures and processes data changes as they occur, rather than processing data in predefined chunks or batches. This approach is particularly useful for applications that require real-time or near-real-time data updates. The downsides are that it can be considerably more costly, requires more oversight and, as more data is being processed, also carries a greater environmental cost.
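To make the incremental batch approach concrete, here is a minimal sketch of a timestamp-based ("watermark") extraction. The table and column names, the API parameter and the connection string are illustrative assumptions; the key idea is reading the latest loaded timestamp and only requesting newer records.

```python
import requests
import sqlalchemy as sa

# Hypothetical warehouse and source API -- names are illustrative assumptions.
WAREHOUSE_URI = "postgresql://user:password@warehouse-host:5432/analytics"
API_URL = "https://api.example.com/v1/orders"

def get_watermark(engine) -> str:
    """Read the latest timestamp already loaded, so we only pull newer records."""
    with engine.connect() as conn:
        latest = conn.execute(sa.text("SELECT MAX(updated_at) FROM stg_orders")).scalar()
    # Fall back to a full extraction if the table is empty (first run).
    return latest.isoformat() if latest else "1970-01-01T00:00:00"

def incremental_extract():
    engine = sa.create_engine(WAREHOUSE_URI)
    since = get_watermark(engine)
    # Ask the source only for records changed since the watermark.
    response = requests.get(API_URL, params={"updated_after": since}, timeout=30)
    response.raise_for_status()
    new_records = response.json()
    if new_records:
        with engine.begin() as conn:
            conn.execute(sa.text(
                "INSERT INTO stg_orders (id, amount, updated_at) "
                "VALUES (:id, :amount, :updated_at)"
            ), new_records)

if __name__ == "__main__":
    incremental_extract()
```

The same pattern applies whether the watermark is a timestamp, an auto-incrementing ID or a change-tracking token exposed by the source system.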


Handling PII Data

When dealing with personally identifiable information (PII), it is crucial to take precautions.

One approach is to avoid extracting PII altogether, as the goal of analytics is to provide insights at scale, not about individuals. By excluding PII from the extraction process, you significantly reduce the risk of unauthorised access or data breaches. Where analytics sits in its own infrastructure, it is very unlikely your team would need any PII for metrics.

If PII is required for certain types of analytics, data activation or AI models, we would recommend applying hashing algorithms to the PII fields. For example, if the requirement is only to join your first-party data to data from a third-party system, a simple way to reduce risk is to hash the keys used to join the datasets. If your CRM system uses email as the only way to identify users, you can bring the email across from your different datasets but store only the hashed version, so that in the event of a data breach the value would be of no use to an attacker.
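As a minimal sketch of this technique, the snippet below hashes an email address with SHA-256 before it is stored; the normalisation step (trimming and lower-casing) is an assumption added so that the same email produces the same join key across datasets.

```python
import hashlib

def hash_join_key(email: str) -> str:
    """Return a SHA-256 hash of a normalised email, used as a join key instead of the raw value."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# The same email from two different source systems produces the same join key,
# but the stored value cannot be read back as the raw address.
print(hash_join_key("Jane.Doe@example.com"))
print(hash_join_key(" jane.doe@example.com "))  # identical hash after normalisation
```

Note that a plain hash of a low-entropy value such as an email address can still be brute-forced from a list of candidate addresses; a keyed hash (for example HMAC with a secret held outside the warehouse) is a stronger variant of the same idea.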

If storing PII data is unavoidable, the recommendation is to isolate the data as much as possible. One way to achieve this is by encrypting the sensitive data points and restricting access to the decryption keys to a select few individuals in the organisation.
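A minimal sketch of that isolation pattern is shown below, using the symmetric Fernet scheme from the Python cryptography package. Where the key is stored (for example a secrets manager with tightly restricted access) is left as an assumption, and in practice is the part that matters most.

```python
from cryptography.fernet import Fernet

# In practice the key would live in a secrets manager with tightly restricted
# access, not alongside the data or in source control.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt the sensitive field before it is written to the warehouse...
token = cipher.encrypt(b"jane.doe@example.com")

# ...so only the few individuals holding the key can ever decrypt it.
original = cipher.decrypt(token)
print(original.decode("utf-8"))
```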


Cost Considerations

It is important to understand that higher data volumes and extraction frequency come with increased costs, including downstream storage expenses. Therefore, it is wise not to extract everything but only the data points that relate to your KPIs. 173tech always advises creating a data dictionary to help define what these metrics are, what is considered the source of truth and how they are calculated. In this way you can narrow down exactly what data you need to extract and why.


Get In Touch

The team here at 173tech have extensive experience both in using automated extraction tools and in creating custom Python scripts for a wide range of different data sources. Get in touch today to see if we can help you centralise and model your key data!
