5 Data Analytics Challenges Companies Will Face in 2021
Benjamin Rogojan
Fractional Head Of Data | Reach Out For Data Infra And Strategy Consults
Integrating data into strategy is proving to be a differentiator for businesses of all sizes. The clichéd term “Data-Driven” isn’t just for billion-dollar tech companies. Companies like DiscoverOrg and MVF are using data to drive decisions and create better products.
Even smaller companies are finding savings and new revenue opportunities left and right thanks to data.
However, this is all easier said than done.
Just pulling data from all your different data sources isn’t always sufficient. There are a lot of problems that can come up when developing your data strategy and products.
In this article, I will outline some of the problems you’re likely to run into when working with data, including growing data volumes, keeping data and definitions consistent, and the time it takes to get data from third-party systems into data warehouses, data lakes, or lake-houses.
I’ll also provide some solutions.
Too Much Data Leading to Poor Performance
Big data. It solves all your bad algorithm problems, right? Well, sort of.
In fact, big data can cause a host of new problems. In particular, with performance.
If you don’t have systems that can handle big data, you will quickly run into problems.
Even the largest corporations run into this. Data keeps growing and, in turn, slows down dashboards, models, and reports. Waiting two minutes for a Tableau dashboard isn’t an option. No executive, director, or manager wants to wait that long.
So, although big data offers a lot of benefits as far as possible insights go, it also quickly becomes a burden (and I haven’t even touched on the problem of pruning unnecessary data).
So, how do we improve the performance of big data systems?
Solution 1: Pay For More Compute
One way to solve the problem is just to spend more money — buy more compute on your clusters or get bigger machines. But there’s always a limit.
I’ve worked with plenty of clients who quickly calculated that improving their performance by the more-compute route would cost far too much. It may seem like an easy step, but it’s arguably not the best solution.
Sometimes, it’s less about compute and more about design, as well as the underlying system.
Solution 2: Migrate to a Cloud Data Warehouse/Lake-house Designed For Speed
Through a combination of technical debt and simple age, systems begin to slow. There are several ways to improve performance without just buying bigger machines.
For example, if your team is using databases like Postgres or standard SQL Server for your data warehouse, then it might be time to migrate to a cloud data warehouse that was designed for analytical queries.
You could look into cloud data warehouses like Azure Synapse, Redshift, or BigQuery. These can help improve performance depending on how you design your data warehouse.
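As a rough sketch of what “designed for analytical queries” can look like in practice, here is a hypothetical BigQuery table that is partitioned and clustered so dashboard queries only scan the slices of data they actually need. The dataset, table, and column names are made up for illustration:

```python
from google.cloud import bigquery

# Hypothetical fact table: partitioning by date and clustering by a common
# filter column means dashboard queries scan only the partitions they need.
ddl = """
CREATE TABLE IF NOT EXISTS analytics.fact_orders (
  order_id    STRING,
  customer_id STRING,
  order_date  DATE,
  region      STRING,
  revenue     NUMERIC
)
PARTITION BY order_date
CLUSTER BY region
"""

client = bigquery.Client()   # uses your default project and credentials
client.query(ddl).result()   # run the DDL and wait for it to complete
```

Redshift (distribution and sort keys) and Synapse (distributions and partitions) have roughly analogous options. The point is that the physical design, not just the raw compute, determines how fast analytical queries run.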
Solution 3: Improve Your Data Warehouse Design
If you’re already in a cloud data warehouse, then you can look into other possible solutions, such as better design, aggregate tables, or indexes.
These are all solutions that would require looking at your design, looking at bottlenecks, and assessing the best solution.
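For example, an aggregate table can be rebuilt on a schedule so dashboards read a small rollup instead of scanning every raw row on every load. A minimal sketch, assuming a SQLAlchemy-compatible warehouse; the connection string and table names are made up:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; swap in your own warehouse's URL.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

# Pre-aggregate once so dashboards query a small table, not the full fact table.
rollup_sql = text("""
    CREATE TABLE daily_revenue AS
    SELECT order_date,
           region,
           SUM(revenue) AS total_revenue,
           COUNT(*)     AS order_count
    FROM fact_orders
    GROUP BY order_date, region
""")

# Rebuild the rollup on a schedule (e.g., nightly).
with engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS daily_revenue"))
    conn.execute(rollup_sql)
```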
Improving your overall design is a much more difficult topic to discuss. I would be happy to discuss this on a free consultation call or during our open office hours that I hold every week.
You can also look at some of these articles on scaling as there are a lot of ways to approach the problem.
Managing Complex Business Decisions In Excel
Excel and spreadsheets continue to drive billion-dollar decisions in companies across the globe. This reliance on Excel has also led to million- and even billion-dollar mistakes by some of the smartest companies.
For example, in 2008 Barclays agreed to purchase Lehman Brothers, but spreadsheet errors led them to eat losses on contracts they did not intend to buy. The detailed spreadsheet of Lehman assets contained approximately 1,000 rows that needed to be converted into a PDF. However, the original Excel version had hidden rows containing 179 contracts that Barclays did not want. The notes marking those contracts as unwanted were not transferred to the PDF, but the hidden rows were. As a result, Barclays had to buy 179 contracts they did not want.
And in 2012, JP Morgan lost nearly 6 billion dollars due to what came down mostly to Excel errors.
Excel is a very versatile data tool that can help teams manage a lot of workflows. However, it is also very error-prone because of complex designs, human error, and how Excel operates in general.
To avoid these errors, there are a few strategies your team can put into place.
Solution 1: Treat Excel Like Engineers Treat Code
If your Excel is being used to make large decisions, then you should treat it the same way engineers treat code.
This means there should be Excel reviews and test cases. It may seem like overkill, but Excel is very much like code. Some people even consider it a fourth-generation coding language. That means it’s prone to errors caused by logic, the process, and the inputs provided.
So, Excel should be treated as code.
Don’t just trust that an analyst, regardless of how smart they are, will make a perfect Excel sheet. Brilliant programmers write bugs, and so do brilliant analysts.
A logic review, test cases, and sanity checks should be put in place to reduce these errors. This all may seem unnecessary until your company loses massive amounts of money due to bad information.
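To make that concrete, here is a rough sketch of what a couple of spreadsheet “test cases” could look like in Python. The workbook, sheet, and column names (and the expected row count) are entirely hypothetical:

```python
import pandas as pd

# Hypothetical workbook used to support a large decision.
contracts = pd.read_excel("deal_contracts.xlsx", sheet_name="contracts")
summary = pd.read_excel("deal_contracts.xlsx", sheet_name="summary")

# Test case 1: the row count matches what the business signed off on,
# so hidden or stray rows can't slip in unnoticed.
assert len(contracts) == 821, f"Unexpected row count: {len(contracts)}"

# Test case 2: every contract is explicitly marked as included or excluded.
assert contracts["include_in_deal"].isin(["yes", "no"]).all(), "Unmarked contracts found"

# Test case 3: the summary sheet's grand total matches an independent recomputation.
recomputed = contracts.loc[contracts["include_in_deal"] == "yes", "value"].sum()
assert abs(recomputed - summary["grand_total"].iloc[0]) < 0.01, "Totals do not match"
```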
Solution 2: Automate Excel Using Python And SQL
Automate and develop clean data workflows for business processes that are well-defined and can be converted into SQL and code.
A lot of reporting requires copy-pasting data from a report or database table. Instead of copy-pasting, there are methods that can provide the data outputs automatically.
This can be done with a combination of SQL, code, and spreadsheets with limited manual manipulation. Code is still error-prone; however, it can often be written in ways that limit errors and constrain inputs. That is not the case with Excel: its flexibility is exactly what makes it error-prone.
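A minimal sketch of what that could look like, assuming a SQLAlchemy-compatible warehouse; the connection string, query, and file names are made up for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; swap in your own.
engine = create_engine("postgresql://user:password@warehouse-host:5432/warehouse")

# Pull the numbers straight from the warehouse instead of copy-pasting them.
monthly_sales = pd.read_sql(
    """
    SELECT region, product, SUM(revenue) AS revenue
    FROM sales
    WHERE sale_date >= DATE '2021-01-01'
    GROUP BY region, product
    """,
    engine,
)

# Write the report workbook directly, with the raw data and a pivot view.
with pd.ExcelWriter("monthly_sales_report.xlsx") as writer:
    monthly_sales.to_excel(writer, sheet_name="raw_data", index=False)
    monthly_sales.pivot_table(
        index="region", columns="product", values="revenue", aggfunc="sum"
    ).to_excel(writer, sheet_name="pivot")
```

Run on a schedule, a script like this removes the manual copy-paste step entirely, and the logic can be reviewed and tested like any other code.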
Which solution is best depends on the complexity of the process, how repetitive it is, and how big the decisions being made from the data are.
Getting Your Data Into Your Data Warehouses Fast
I have talked to data scientists and analysts at companies across the US, and they all have one major complaint: they can’t get their data fast enough.
This is often because there aren’t enough data engineers with free time to pull in every new data set. This drastically slows down the ability of data scientists and analysts to answer questions that could be saving the business thousands, if not millions of dollars.
There’s a constant balance to be found between creating well-governed data systems and providing data quickly so business executives can make decisions.
But by the time the data is added, it might be too late.
Solution 1: Use An ELT Instead of an ETL For Ad-hoc Work
The truth is that, given the speed modern companies need to move at, working with data that’s good enough is sometimes better than waiting to work with perfect data.
Just to clarify — I do believe that data that goes into your central data warehouse should be as accurate and dependable as possible. However, thanks to alternative central data storage systems, data scientists and analysts can start to get access to new data sets faster without having to wait for data to be 100% set up — with the caveat that the data could be funky.
This is where ELTs come in.
For those unfamiliar with ELTs and ETLs: these are methods of extracting, transforming, and loading data into data warehouses; the first letters of those steps make up the acronym ETL.
I have gone over ETLs in the past. The problem is, ETLs are slow to develop because the transform is often code-heavy. There are pros and cons to that, but I won’t be discussing these now.
ELTs move the transform to the end of the process, which allows data analysts and scientists to start working with the data before it’s fully processed. This does pose some risk as far as data governance goes.
However, I believe ELTs play a role when it comes to ad-hoc analysis as well as attempting to figure out which data sets to model for your core data layer.
There are plenty of tools and managed services built around the ELT approach.
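To make the idea concrete, here is a minimal ELT-style sketch: the raw export is loaded into the warehouse as-is so analysts can query it right away, and the modeling happens later in SQL. The source file, connection string, and table and field names are all hypothetical:

```python
import json
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/warehouse")

# Extract + Load: land the raw records untouched so they're queryable immediately.
with open("salesforce_opportunities_export.json") as f:
    records = json.load(f)   # assumed to be a list of flat records
pd.DataFrame(records).to_sql(
    "raw_salesforce_opportunities", engine, if_exists="replace", index=False
)

# Transform (later, inside the warehouse): cleaning and modeling happen in SQL
# once the team knows which fields are actually worth modeling.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS opportunities AS
        SELECT id,
               account_id,
               CAST(amount AS NUMERIC)  AS amount,
               CAST(close_date AS DATE) AS close_date
        FROM raw_salesforce_opportunities
        WHERE stage = 'Closed Won'
    """))
```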
Solution 2: Only Bring in the Data You Need
One of the big problems people run into is having too many data sources to pull from. Instead of pulling all of the data, your team should focus on only pulling data that aligns with your business goals.
For example, Salesforce and Workday can have hundreds of tables depending on how custom your team’s processes are. So don’t create a pipeline for every possible table — only pull in the data you need.
This helps allocate resources and ensure that you get your data quickly.
Consistent Data Values and Definitions
Inconsistent data definitions and values across teams can cause major issues.
In many ways, this problem conflicts with the solution I provided above. But here I am focusing on the core data models of a business, so I believe this problem is slightly different.
For those who are unfamiliar with this problem, let me provide a simple example.
Let’s say you are creating a metric that compares multi-day events vs. single-day events.
What defines a multi-day event? Is it an event that lasts more than 24 hours, one that spans more than two days, or something in between?
Defining the key concepts that your business uses on a daily basis is important — chances are that every business team will be using that term, KPI, and definition. If there isn’t a consistent definition across teams, you will probably see inconsistent reporting: one team will say there were 100 multi-day events while another will say there were 90.
I have seen this cause chaos in meetings. Suddenly, instead of focusing on the actual context and impact of some strategic change, managers and directors are trying to discover the reason for the ten-event discrepancy. That completely derails the strategic purpose of the meeting and turns it into a waste of time as people attempt to reconcile the numbers.
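One lightweight way to prevent that 100-versus-90 split is to pin the definition down in a single shared function that every report and pipeline uses. A minimal sketch, using a calendar-date rule purely as an example of one possible definition:

```python
from datetime import datetime

def is_multi_day_event(start: datetime, end: datetime) -> bool:
    """The one shared definition: an event is multi-day if it spans more than
    one calendar date, regardless of how many hours it lasts. (This rule is an
    example choice; the point is that there is exactly one of it.)"""
    return end.date() > start.date()

# Every team's report imports this function, so "multi-day" always means
# the same thing in every dashboard.
print(is_multi_day_event(datetime(2021, 3, 1, 20, 0), datetime(2021, 3, 2, 9, 0)))  # True
print(is_multi_day_event(datetime(2021, 3, 1, 8, 0), datetime(2021, 3, 1, 23, 0)))  # False
```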
Similarly, another problem occurs with all the various integrated systems, like Workday and Salesforce: fields can fall out of sync. For example, a company’s Salesforce instance might pull employees’ job positions from Workday once a week. That reduces manual intervention, but it poses a risk for reporting and for having accurate information. If a data analyst pulls from Salesforce and assumes the data is always up to date, they could be wrong.
There are many risks in lacking clear definitions and clear sources of truth.
Solution #1: Implement a data governance strategy
One great solution is to set up a data governance process.
Data governance is the process of managing the availability, usability, integrity, and security of the data in enterprise systems, based on internal data standards and policies that also control data usage.
Data governance is far from sexy: It’s not data science or machine learning. But effective data governance ensures that data is consistent and trustworthy and doesn’t get misused.
Let’s look at that last part in particular — making sure the data is consistent and trustworthy.
Data governance, as its name suggests, unavoidably adds more process and bureaucracy to moving data around. Companies set up data governance committees to ensure that the data that exists and is reported on is consistent.
Data in modern companies is often highly integrated and automatically populates various third-party systems. I gave the example above of Workday and Salesforce integrations.
Data governance helps define which fields should come from which sources, so you don’t pull the same data from different sources and risk the numbers not matching.
In other words, data governance is a human process.
Solution #2: Pick a data lineage tool
Other, more automated, processes can help your teams track data sources and how data goes from point A to point B.
For example, tools like TreeSchema, Octopai and Kylo automatically scrape your team’s data sources to help track metadata — who owns which table, what the schema of your data is, and where the data comes from.
Not only do these tools help track where the data comes from, they also help track changes in the data and bad data practices.
For example, TreeSchema offers features like alerts when new data sources or fields are added. In a typical company, a new data object gets created but might not have any documentation telling users what that data is; if documentation does exist, it might sit in some shared folder or SharePoint site. With TreeSchema, you centralize your data documentation and get updates when there are holes in it.
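The underlying idea is simple enough to sketch in a tool-agnostic way. Nothing below uses TreeSchema’s (or any vendor’s) actual API; it only illustrates what tracking metadata and flagging documentation holes means:

```python
from dataclasses import dataclass

@dataclass
class TableMetadata:
    name: str
    owner: str
    source_system: str
    columns: dict        # column name -> description ("" means undocumented)
    description: str = ""

# A tiny, hypothetical catalog of the team's data sources.
catalog = [
    TableMetadata(
        name="fact_orders",
        owner="data-engineering",
        source_system="salesforce",
        columns={"order_id": "Primary key", "revenue": ""},
        description="One row per closed order.",
    ),
]

def documentation_holes(catalog: list) -> list:
    """Flag tables and columns that exist but have no documentation,
    which is roughly what a metadata tool alerts you about."""
    holes = []
    for table in catalog:
        if not table.description:
            holes.append(table.name)
        holes += [f"{table.name}.{col}" for col, desc in table.columns.items() if not desc]
    return holes

print(documentation_holes(catalog))   # ['fact_orders.revenue']
```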
Having a method to track all your metadata can simplify things and eliminate confusion about data. It also creates a more reliable core data layer.
Lack of a Clear Data Strategy
One of the biggest challenges faced by most companies is a lack of clear direction. There’s so much data to use, analyze, integrate, and build data products from that it can be very difficult to know where to start.
Setting up a clear data strategy is generally step one.
This means looking at your overall business goals and then seeing where you can align them with your data goals — not the other way around.
I’ve worked with clients whose goals were to integrate AI into their services, but they didn’t really think through the role AI would play in their overall business strategy.
Solution 1: Start with your business goals first
Before you spend thousands, if not hundreds of thousands, of dollars on some new machine learning model or data lake house, make sure that there are benefits to your business. There needs to be alignment.
First, plan your business goals and then see what data you have that can help those business goals.
Step one is to go through a basic data analytics strategy checklist.
How Will You Improve Your Data Analytics Process Today?
Using data to make better decisions gives companies a competitive advantage. However, this depends on the quality of the data and the robustness of the processes set up around it.
Simply creating dashboards, data warehouses, and machine learning models is not sufficient to make data-driven decisions. In developing your future data products, there are many snags that can catch your team.
Hopefully, this list of five challenges and solutions can help your team make informed decisions on how to improve your data analytics strategy.