The Story of Data - Part 3

Andy Crellin

Putting your data to work: Data | Analytics | Insight | Information Delivery

Published: 21 December 2022

This is the third post in a multi-part series about how we approach partnering with organisations to drive positive business change. Today we look at the Acquisition part of the Data Collection phase. All of this can be found on our website at www.wearepoweredbydata.com/datastory where we will build up to the full story as we go.

Data Collection: Acquisition

Once we have discovered where the data is, what systems there are, who the relevant people in the organisation are and what processes and access requirements are in place, we can move on to the act of actually acquiring the data.

Acquisition is the final part of the “Collection” phase of the Data Story – and it may be surprising to learn that “acquisition” in this case doesn’t necessarily mean “gathering together”, or in fact moving the data at all. What we mean by acquisition is the process of making the data ready to “prepare” for analysis.

The process we go through for this depends partly on the data maturity of the organisation we are working with. Those that have a high level of data maturity may have great data environments, ready to use to process the data and prepare it for analysis – sometimes in-situ. For organisations that have a low level of data maturity, however, we may need to bring all of the data we have discovered into a new data environment to enable the preparation process to start. Sometimes this will necessitate the implementation of a temporary or permanent (depending on the type of engagement we have) system that will act as a central data store. Acquisition is then about bringing all of the relevant data into this new environment.

Either way, the outcome of the data acquisition process should be an environment that contains all relevant data (still quite raw), structured in a way that allows it to be prepared for analysis. This data preparation is then the next phase of the Data Story, which we’ll talk about next time.

Acquisition methods

With thousands of different potential data sources, it is critical that we have experience with a good range of them. Acquiring data from a nicely constructed database might be relatively easy, but extracting tabular data from poorly scanned documents in PDF format presents some unique challenges.

Some common data acquisition methods are:

Querying structured files/databases

Making up the vast majority of data acquisition tasks, querying structured databases or files involves automated methods to extract data from systems using the relevant interfaces for those systems. One example is extracting large data sets from an existing database using SQL via a script; another is automatically processing 10,000 CSV files held in a document store, recognising and extracting the appropriate data from each. Having a wide range of experience across many different types of datastore is critical, as are the experience and skills to know how to extract that data in the most efficient way.
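As a rough illustration of scripted extraction, the sketch below pulls rows from a database in fixed-size batches (so a large extract never has to sit in memory at once) and keeps only selected columns from a CSV file. It uses Python's standard `sqlite3` and `csv` modules purely as stand-ins; the function names, batch size and column names are hypothetical and not taken from any particular engagement.

```python
import csv
import sqlite3
from typing import Iterable, Iterator


def extract_in_batches(
    conn: sqlite3.Connection, query: str, batch_size: int = 1000
) -> Iterator[list]:
    """Yield query results in lists of at most batch_size rows."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def read_csv_columns(lines: Iterable[str], wanted: list[str]) -> list[dict]:
    """Keep only the wanted columns from CSV input that has a header row.

    `lines` can be an open file or any iterable of text lines.
    Columns missing from the file are skipped rather than raising an error.
    """
    reader = csv.DictReader(lines)
    return [{col: row[col] for col in wanted if col in row} for row in reader]
```

In practice the same loop shape applies to other database drivers that follow the Python DB-API, and `read_csv_columns` would be called once per file while iterating over the document store.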

Web Scraping

Sometimes we find that an organisation has a critical but poorly maintained and documented application that they use to enter and store data. If there are no data export capabilities, then web scraping may be the only way to extract the data. If the application is not web based, then there are tools we can use to interact automatically with it in a scripted way and creatively extract the required data.
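To give a flavour of what scraping involves once a page has been fetched, here is a minimal sketch that pulls the cell text out of an HTML table using only Python's standard `html.parser` module. The class and function names are hypothetical, and it assumes a simple, non-nested table; real applications usually also need login handling, navigation between pages and more defensive parsing.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html: str) -> list[list[str]]:
    """Return the table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```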

API interfaces

Where an API to an application is available, we can leverage that to extract data in a formal way. Sometimes this might require a bit of creativity to work with the limitations of the API, but usually there is a way! This requires an understanding of the technology used for the API and how to construct efficient queries against it. APIs are sometimes “rate limited” (only allow a certain number of queries per second) or “response size” limited (will only return a response up to a maximum size), so for large extracts we may again need to be creative in how we construct the scripts.
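One common way to work within both limits is to request the data a page at a time and to pause between requests. The sketch below shows the idea under some assumptions: `fetch_page` is a hypothetical callable standing in for the real API call, the API is assumed to accept an offset and a page size, and a short page is taken to mean the end of the data. Real APIs differ (cursor tokens, retry headers and so on), so treat this as an outline rather than a recipe.

```python
import time
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int, int], list],
    page_size: int = 100,
    max_per_second: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator:
    """Yield every record, one page at a time, without exceeding the rate limit.

    fetch_page(offset, limit) is a placeholder for the real API request.
    sleep and clock are injectable so the pacing can be tested without waiting.
    """
    min_interval = 1.0 / max_per_second
    offset = 0
    last_call = None
    while True:
        # Wait until at least min_interval has passed since the previous call.
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        page = fetch_page(offset, page_size)
        yield from page
        # A page shorter than page_size is assumed to be the last one.
        if len(page) < page_size:
            break
        offset += len(page)
```

Keeping `page_size` below the API's maximum response size addresses the second limit, while `max_per_second` addresses the first.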

PDF processing

Many PDF files contain both the “image” of the document and the underlying text (which can easily be extracted) – but this isn’t always the case. Even if the underlying text can be extracted, it’s usually not presented in a nice table (even though that is what the document shows). And if the text layer isn’t available, then we need to train OCR (Optical Character Recognition) software to do the heavy lifting for us.
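When a text layer is available, the extracted text often arrives as plain lines in which the columns are separated only by runs of spaces. The sketch below shows one simple heuristic for rebuilding rows from such text: split each line wherever two or more whitespace characters appear. The function name and the two-space threshold are assumptions for illustration, and the input text is assumed to come from whichever PDF library is in use; many real documents need position-aware extraction instead.

```python
import re


def lines_to_rows(text: str, min_gap: int = 2) -> list[list[str]]:
    """Split each non-empty line into cells at runs of min_gap+ whitespace.

    Single spaces are kept, so a value such as "Blue widget" stays in one cell.
    """
    splitter = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows
```

A quick check that every row has the same number of cells is a useful first test of whether the heuristic has worked for a given document.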

Manual Data Entry

It might sound archaic, but sometimes manual data entry is the best that’s available. If you have a cabinet of handwritten documents that need to be processed, it is sometimes much more efficient (and accurate) to transcribe them manually than to configure a cutting-edge scanning and text recognition system to do the job. Humans are still (currently) superior when context is key and content representing similar things is presented in different formats. There are many companies around the world that specialise in providing this service.

And many others

There are thousands of different ways that data can be stored, from hand-written documents to super-fast, niche, in-memory databases and they all have their idiosyncrasies. It is crucial that you have a data partner with the experience and creativity to deal with them all.

At the end of the Acquisition stage, we should have a well-documented set of data sources ready for processing. We may have moved some of the raw data into more appropriate locations, but some may be left where it is (if that location remains the most appropriate for that data). This documentation is now a hugely valuable asset for the organisation and should be widely distributed, particularly among the technical teams. It may contain information about the organisation’s data that has never before been seen and will support projects well beyond the current scope of work.

The acquisition part of Collection gets our data into shape, ready to be processed – and brings us into the final phases of the “realm of enablement”.

In the next article we will focus on the Data Preparation phase of the Data Story.

Wherever you are in your data story, we can help - from acting as a sounding board as you dip your toe in the water, right through to a full data and insight stack implementation. If you’d like to have a chat about your business challenges and opportunities, drop a DM to Anna Blackwell or visit our website www.wearepoweredbydata.com – we love meeting new people and hearing about your data plans!
