Patching Holes
While I would love to live in a world where all data is perfect, I know that’s not reality. And not even close to reality. The real world contains data that are far messier and more chaotic than almost any toy dataset that we can find online.
Cleaning data is a task much like running on sand or walking through a few inches of rainfall. On the surface, it looks straightforward and innocuous, but once we jump into starting the process, it turns out there’s a lot more resistance than we thought. Cleaning data isn’t a single-step process either. I can’t even remember how many times I did the standard steps for selecting the columns, filtering the rows, and formatting the data types, only to revisit the ETL framework later to make updates. These additional updates include transformations like changing the local time zone into UTC, fixing spelling errors, or removing leading and trailing white spaces.
Time series data is another beast of its own. These time series data points represent values measured on a regular basis by the year, month, day, or even hour (to name a few examples of time intervals). However, there might be gaps in the data. The stock market, for example, closes on the weekend, so we don’t have measurements then. So we need to figure out a way to bridge this two-day hole between Friday and Monday. If it’s a holiday weekend (Monday off for example), then the data needs to bridge this three-day gap between the measurements instead of two days.
OFFSET DAX Function
Accounting for null or empty values in any dataset plays an imperative role in the future success of data models. We might want to remove them entirely from the data, or we might want to somehow bridge the gap between dates for example over the weekend. One way to bridge the gap over the weekend for stock market data for example is by using the OFFSET DAX function in Power BI in conjunction with the ORDERBY function to move the DAX measure filter context to the date before, even on non-contiguous dates like we see below. It follows the same logic as many other languages (like SQL) do for dealing with these scenarios.
Check out how to use the new DAX OFFSET and ORDERBY functions in one of my latest Power BI Weekly videos!
Filling in holes in R
Another way we can deal with holes in our data is by replacing the null values with the last known value for that measurement field. Time series data measure these data points on a regular basis. We want to have evenly spaced data points over time so we can use them in algorithms like ARIMA. Even if we have fastidious data records, there's a good chance that there might be holes in the data for irregular event occurrences. These missing data points occur for (often unintentional) reasons like when tools go offline, incomplete measurements, or simply they didn't take measurements at that point in time. The line chart below (created by referencing a zoo object in R) shows what these missing data points look like on a visual.
Instead, we can use built-in functions within the zoo or xts library in R to fill in the missing data point values with the value from the last non-null measurement. We can see what this looks like in our updated line chart (again using the zoo library in R).
In my latest course published in the LinkedIn Learning library (Time Series Modeling in Excel, R, and Power BI), I show how to fill in these missing values in the exact same way we see above through R code (using the RStudio IDE).
领英推荐
Free Courses!
Looking to learn more about AI? Check out these free courses (for a few more weeks) in the LinkedIn Learning library! I loved creating this course content because they explored how to build AI models and algorithms in different tools, whether that's Power BI, Excel, or R.
Logistic Regression
If you're looking to create the signature S-shaped model curves for logistic regression, this course shows how to build them without writing any code other than formulas in Excel (and builds the same model with R code in RStudio and Power BI Desktop as well).
Data Reduction
Dendrograms look like heatmaps built in tandem with tree branch structures on the sides. This data reduction course covers how to build these visuals using hierarchical clustering, along with the KMeans and PCA algorithms as well.
Power BI
Power BI has some amazing built-in functions for running AI models without having to write any code. This includes natural language processing (NLP), forecasting, anomaly detection, and linear regression, all of which the course below explores using really interesting weather and energy data models!
Thanks for subscribing to my newsletter! Stay tuned for upcoming projects and courses that I’m working on right now and in the near future!
-HW
Systems Engineer | Project Engineer
1 年Thanks for sharing!
Data Enthusiast | People & Process Champion | Cross-Functional Wrangler | Collaborative Player-Leader | Trusted Advisor | Adventurer
1 年Helen, always learning from your posts, thank you! Anyone else out there who finds joy in data cleaning/prep? I often feel a bit of a kook that this is one of my favored activities. I believe for me it's simply deriving pleasure from getting to a point of symmetry, much like the sense of preparatory accomplishment by getting puzzle pieces organized into logical groups. Plus, it seems I find more surprising connections and correlations at this step, more "swimming" and feeling your way through the data... discovery. Who else is in this tribe?
Next Trend Realty LLC./wwwHar.com/Chester-Swanson/agent_cbswan
1 年Thanks for sharing.