20 Advanced Methods For Doing Data Analysis in Excel
Like Atlas, Excel carries much of the world's business, financial, and scientific institutions on its shoulders.

The Excel-Acolyte's Ultimate Guide to Data Science (previous working title)

I love Python, and I couldn't imagine doing Data Analysis without it, but I wish I had known how to do many of my favorite Pandas/SKLearn methods in Excel before committing to several books and boot camps to learn Data Analysis/Science with Python. That learning curve stopped me from jumping in because I thought I wasn't ready. Looking back, much of what I was learning to accomplish could've been performed in Excel, a software I'd already been using for a decade.

So I spent some time writing the guide I wish I had when I first started learning Data Analysis/Science. You shouldn't let Python's, albeit modest, learning curve stop you from getting your feet wet in a Data Analysis career.

The following 20 methods have been curated to take your Excel game to the next level. If you use Excel for work, I'm positive the information here can help you, no matter the field you work in.


From Non-Believer to Acolyte

Microsoft Excel does not have the sexiest name recognition in the world... but maybe it should.

When I was first taught Excel in my middle school computer class, I saw it as little more than an overrated calculator. Yeah, smart-filling was cool, but what good was that outside of an accounting department? Man, how times change. These days, I truly believe that Microsoft Excel represents one of the greatest technical achievements humanity has ever made. It's such a powerful tool that it still blows my mind anyone can use it online for free. It should come as no surprise that, like Atlas carrying the world, Excel carries most of the world's business, financial, and scientific endeavors on its shoulders.

I wanted to write an article examining many of the features that converted me into an Excel Acolyte: someone who wholeheartedly believes most work tasks can be solved in Excel. For example, the field of Data Science is complex and has a steep learning curve, but with a little know-how and creativity, many basic Data Science methods can be recreated right from Excel. The methods and features I'm about to explore can easily replace more specialized software tools across many industries. If you need to make predictions or visualizations, find yourself doing repetitive tasks, know the basics of Excel, and are unwilling or unable to learn Python, then read on, friend.

I will assume you know the basics (sums, fills, etc.) and launch right into the good stuff.

Make sure to save this article for later reference!


What I Will Cover:

(In order of ascending difficulty)

  1. Formatting for Tables - Takes a collection of cells and in a couple of clicks, formats them as a table with colored header rows and adjustable totals at the bottom.
  2. Remove Duplicates - Cleans your data by removing identical rows, ensuring each piece of information is unique.
  3. Text to Columns - Splits text from a single column into multiple columns, based on specific delimiters, to organize data better.
  4. Conditional Formatting with Color Scales - Color your cells in a gradient or scale to easily spot trends, variations, and patterns in your data.
  5. Data Validation - Restricts the type of data or the values that users can enter into a cell, helping prevent data entry errors.
  6. Dynamic Arrays - Allows formulas to return multiple values that automatically expand to neighboring cells.
  7. Table Slicers for Filtering - Provides an intuitive and visual way to filter table data, making it easier to segment and analyze.
  8. Text Functions - Manipulate and transform text data in cells using formulas, like changing cases, combining text, or extracting parts of it.
  9. Formulas for Data Cleaning - Utilize Excel formulas to correct or remove unwanted data, streamlining your dataset for analysis.
  10. X/V/H Lookup - Finds and retrieves information from a specific column or row in a table or range by matching a key value.
  11. Combination Charts - Combines two or more chart types into a single graph, enhancing the representation of different data sets.
  12. Pivot Tables - Summarize large data sets with customizable reports, allowing for easy data analysis and insight discovery.
  13. Advanced Filter - Offers complex filtering options beyond standard filters, including using criteria from outside the dataset.
  14. Geographical Data Types - Converts geographical identifiers into rich, interactive objects that can display extra information and maps.
  15. Power Query - Enables you to connect, combine, and refine data sources to meet your analysis needs.
  16. Correlation - Measures the relationship between two data sets, indicating how one may predict or affect the other.
  17. Manual Linear Regression - Analyzes the relationship between two variables to predict future values, using a straight-line formula.
  18. The Analysis Toolpak Add-On - A suite of additional functions and tools for statistical, engineering, and financial analyses.
  19. Descriptive Stats - Provides a quick statistical summary of your data, including mean, median, mode, and standard deviation.
  20. Logistic Regression - A statistical method for analyzing datasets in which there are one or more independent variables that determine an outcome, used especially for binary outcomes.


Let's get started!

1. Format as Table

Format as Table

When you first bring a .CSV or similar data file into Excel, your data will be a simple spreadsheet of unconnected values. If you'd like to sum a particular column, you'll need to manually select the column you'd like (minus any potential header rows) and write a sum formula.

Formatting for Table's Theme Selection

By selecting all your cells (Command + A for Mac, Ctrl + A for PC) and clicking the Format as Table dropdown, you can instantly tell Excel that the values in your spreadsheet are related through columns and have it turn everything into a colored table. This is especially useful if your data has a header row, as it makes the header more obvious and allows for more intuitive sorting.

Header Sorting Options in Tables

Another great bonus of having your data formatted as a table is access to the "Total Row" option. Under the "Table" tab, by checking the "Total Row" box, a new row will appear at the bottom of your table which automatically calculates the sum, mean, count, etc. of any of your numerical columns.

Selecting Aggregation for Row Totals
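
Under the hood, the Total Row builds these aggregations with the SUBTOTAL() function, using function numbers in the 101–111 range so that rows hidden by a filter are ignored. A quick sketch, assuming a table named SalesTable with a numeric Sales column (the table and column names are just examples):

```excel
=SUBTOTAL(109, SalesTable[Sales])
=SUBTOTAL(101, SalesTable[Sales])
=SUBTOTAL(102, SalesTable[Sales])
```

Here 109 is SUM, 101 is AVERAGE, and 102 is COUNT; swapping the first argument changes the aggregation, which is exactly what the Total Row's dropdown does for you.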

2. Remove Duplicates

Remove Duplicates Button Location

One of the best features in the Python data frame library "Pandas" is its built-in duplicate identification feature. It was one I wished I had access to in Excel... until I did a little digging and realized I already did!

To check whether your data table contains duplicates, select all from inside your table (this grabs all of the values without the headers), click on the "Data" tab, and then click the small button with a red X (shown in the above photo).

Duplicate Found Alert

A prompt will appear asking which columns you'd like to check. Usually, you'll want to select all your columns to check for duplicate entries, but you can always narrow that down based on your needs. After clicking OK, a new prompt will appear informing you whether any duplicates were found and eliminated.

3. Text to Columns

Text to Columns Button

This feature will take columns with cells containing multiple words and split them into individual columns. This doesn't sound like much but is vital in situations where you need to isolate a single part of many values like email domains or address numbers.

Example of Strings with Delimiters

Let's say, for example, you have a column that contains a series of addresses in long form and on a single line and you'd like to section out each of the elements within the addresses. Text to columns is the perfect tool. First, select all your data and click the "Text to Columns" button from the "Data" tab. Excel will then ask you three simple questions on how you'd like it to go about separating your data.

Text to Columns Wizard Page 1

The first window simply confirms whether Excel is correct in its assumption about how your data is separated. In our case, it is. Each of the data points within each address is separated by a comma, so we would select "Delimited." If, instead, each element occupied a fixed number of characters, we'd select "Fixed width" (or confirm it was automatically selected by Excel's analysis).

Text to Columns Wizard Page 2

The next window confirms what symbol is being used to delimit each individual part. In our case, we want the street address, the state abbreviation, and the zip code to all be separate columns. Each is separated by a comma so we make sure the comma is selected. Data comes in all shapes and sizes so make sure you pay attention to what kind of delimiter your data is using.

The next screen simply asks you to confirm each of the new columns' data types and the destination for the new columns. After confirming both, you should see a new column for each element, split on your selected delimiter.

Newly Separated Columns
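
If you're on a newer, Microsoft 365 version of Excel, the TEXTSPLIT() function can perform the same split as a formula rather than a wizard. A sketch, assuming a full address string sits in A2:

```excel
=TEXTSPLIT(A2, ", ")
```

This spills each comma-separated piece of the address into its own column to the right, essentially a dynamic-array take on Text to Columns.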

4. Conditional Formatting with Color Scales

Conditional Formatting Options

Conditional cell formatting is easily the quickest way to make an Excel spreadsheet look more professional. What this feature does is embed certain color scales or small bar charts directly into the cell itself. You can find the conditional formatting button near the middle of the ribbon on the "Home" tab.

Cells with Conditional Formatting Applied

The tools here range from automatic to very intricate depending on your needs. At the top, you'll find features for indicating precisely how you'd like certain cells to be shaded depending on their values. However, I believe the middle options are the most exciting as they function automatically. By selecting the range you'd like to format and then, say, the Color Scales option, you can quickly show your underperforming sales rows in red, dynamically contrasted against your high sales amounts in green. Additionally, if you'd like to further illustrate differences in quantity at a glance, adding a Data Bar to your cells is incredibly intuitive, especially for dollar amounts.

5. Data Validation

Data Validation Options

This is an easily overlooked feature that makes it easy to "future-proof" your dataset. If you're going to set up a pipeline that automatically adds data to your dataset from a database or another source, data validation allows you to set each column to only accept a particular kind of data, whether that means a proper type or an allowed range. This helps quickly flag future situations where data is input incorrectly, making it easy to spot outliers and potential entry errors.

Validation Error Example

For example, let's say you know you usually never sell more than $1,000.00 per transaction. You can set your Revenue column to automatically throw an error and/or add a red circle when values are over this amount. That way, if you load up the latest data from your data source, but someone accidentally added some zeros when inputting an amount, it makes it very easy to identify and assess that error.

Data Validation Error Circling Enabled
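
You can set up a rule like that from the Data Validation dialog by choosing "Decimal" with a maximum value, or by choosing "Custom" and supplying your own formula. A sketch of a custom rule, assuming the revenue values start in B2 (the cell reference and limit are just examples):

```excel
=AND(ISNUMBER(B2), B2 >= 0, B2 <= 1000)
```

The formula is evaluated per cell; entries where it returns FALSE are rejected on input, or circled if you use the error-circling option on existing data.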

6. Dynamic Arrays

We're going to be stepping up the complexity now. Dynamic arrays are exactly what they sound like. An array is simply a collection of values and the addition of the word "Dynamic" means that it's able to adapt. In this case, the addition of Dynamic Arrays to Excel gives you the ability to write formulas that can adapt to different kinds of inputs and give you adjusted outputs. Basically, instead of returning a value in a single cell based on a formula, you can return a series of cells.

Excel's UNIQUE() Function Creates a Dynamic Array

For example, let's say I wanted to return one value per cell for every unique country in my dataset and then I wanted to sort the results alphabetically. The above formula does just that. The UNIQUE() function returns and fills a series of cells as opposed to a single cell.

A Dynamic Array

As you can see from the above photo, the results total seven countries so the returned values spill over into seven cells. The SORT() function that surrounds the UNIQUE() function sorts the values alphabetically.
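
Putting that together, the formula from the example could look something like this, assuming the Country values live in A2:A100 (the range is illustrative):

```excel
=SORT(UNIQUE(A2:A100))
```

UNIQUE() spills one cell per distinct country, and the surrounding SORT() orders the spilled results alphabetically.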

Here's a short list of more Excel functions that can return Dynamic Arrays:

  • FILTER(): Returns a filtered version of the source range, based on a provided condition.
  • SORTBY(): Sorts a range or array based on the values in one or more corresponding ranges or arrays.
  • SEQUENCE(): Generates a sequence of numbers in an array, based on the specified count, start, and step.
  • RANDARRAY(): Returns an array of random numbers between 0 and 1.
  • TRANSPOSE(): Returns a transposed version of an array or range (switches the rows and columns).
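
A few hedged sketches of these in action, with illustrative ranges and thresholds:

```excel
=FILTER(A2:C100, C2:C100 > 500, "No matches")
=SEQUENCE(12)
=TRANSPOSE(A2:A8)
```

FILTER() returns only the rows whose value in column C exceeds 500 (or the fallback text when none do), SEQUENCE(12) spills the numbers 1 through 12 down a column, and TRANSPOSE() flips a vertical range into a horizontal one.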

7. Table Slicers for Filtering

Table slicers allow a user to quickly, simply, and dynamically filter a table. They require your data set to be formatted as a table, so if you skipped the first section and don't know how to do that, go back and review now!

The Insert Slicer Button

Once your data is formatted as a table, make sure a cell within your table is selected, then locate the "insert slicer" button on the "Table" tab.

Let's say, for the sake of our example, that we wanted to be able to quickly limit the table based on the country the sales took place in. In that case, we'd check the box next to Country.

A Slicer to Filter a Table by Countries

As you can see from the above image, our table is automatically filtered to show only the data that matches the selected field. We could add another slicer for the Date of Sale if we wanted to limit our table to sales from France on a particular day.

Another amazing benefit is that charts based on our data will be automatically updated by the slicer. It helps to build any calculated cells, slicers, and charts referencing the table on a new sheet, since using the slicer changes the number of visible rows on the original sheet and thus the positioning of any other objects on it.

A Table's Distribution of Sales in All Countries

Above is the distribution for the Sales in all countries. I've placed the slicer and the histogram on a separate worksheet.

A Table's Distribution of Sales Now Filtered to Only Show France

In the above image, we see that since the chart and slicer both reference the same table, clicking on "France" limits the chart to only sales within that country and automatically refreshes the histogram to reflect the new distribution. This feature is a must for building dashboards.

8. Text Functions

Sometimes you need to filter or select only parts of a series of characters or words. For instance, maybe you have a column containing phone numbers and would like to extract all the area codes. If the numbers have no delimiter, then the Text to Columns feature wouldn't work.

Enter the LEFT(), RIGHT(), and MID() functions. Each will return a certain number of characters when counted from the left, right, or middle of a cell. For example, if I had a cell that contained the word "CATSKILLS" and I wanted to return the word "CAT," in a new cell I would reference it with the formula "=LEFT(A1, 3)" and that new cell would contain the word "CAT."

In the below example, I have created random phone numbers in column A. This gives us a great opportunity to combine the UNIQUE() dynamic array function with MID() to identify all of the unique area codes that appear within those phone numbers.

Phone Numbers with All Unique Area Codes Extracted with MID()

The RIGHT() and LEFT() functions take only two arguments: the cell you're referencing and the number of characters to return from that end. MID() has one extra argument: first the reference cell, then the starting position, and then the number of characters it should return from that position. In the above formula, I know I have 50 numbers and a header, so I first have MID() reference A2 to A51; then, since I know they're all US numbers, I skip the leading digit by starting at position 2; and then I return the three characters starting there. That alone would get me all of the area codes, but I only want the unique ones, so I wrap the whole thing in a UNIQUE() function, which eliminates any duplicates.
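
The full formula described above could look like this, matching the ranges from the example:

```excel
=UNIQUE(MID(A2:A51, 2, 3))
```

MID() grabs three characters starting at position 2 of each phone number, producing an array of area codes, and UNIQUE() collapses that array down to the distinct values.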

Okay, now we're cooking with gas! Congrats on making it this far. If you've followed everything we've covered so far, then you're ready for a bit more of a challenge. Let's check out some more advanced functions typically used for data cleaning in more advanced tools like Pandas for Python or SQL.

9. Formulas for Data Cleaning

In this section, I'm going to cover three functions commonly used in other data-cleaning tools: TRIM(), CLEAN(), and SUBSTITUTE().

Use of TRIM()

TRIM() removes extra white space, not only from the ends of a string but also from inside it, collapsing repeated interior spaces down to single spaces. That interior cleanup is something many equivalent methods in Python and SQL don't do. This function is great for standardizing string entries, especially names.

The CLEAN() function is hard to display here because it removes "non-printable" characters, which are usually artifacts of data corruption, and I can't exactly print non-printable characters. CLEAN() is good for data validation or when importing error-prone data from external sources.

The SUBSTITUTE() function is exactly what it sounds like. It replaces a given string wherever it occurs in a given range.

Use of the SUBSTITUTE() function

SUBSTITUTE() also takes an optional fourth argument to indicate which instance of the given string should be replaced.

Use of SUBSTITUTE() with the additional argument
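
A few sketches of all three functions side by side (the cell references and replacement strings are hypothetical):

```excel
=TRIM(A2)
=CLEAN(A2)
=SUBSTITUTE(A2, "-", "/")
=SUBSTITUTE(A2, "-", "/", 1)
```

The first SUBSTITUTE() replaces every hyphen in A2 with a slash, while the second, thanks to the optional fourth argument, replaces only the first occurrence.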

10. XLOOKUP

The LOOKUP functions are some of the most common and some of the more confusing Excel functions for Data Analysts. Every time I've had to put together an Excel dashboard for a client, I inevitably end up using XLOOKUP even when I try my best to avoid it. Chances are, at some point in your career, you'll need to use it too.

In a nutshell, what the various LOOKUP functions do is search for a given value within a given range, but they differ in the details of the arguments they accept. This is possibly the most valuable function you can learn from this article.

XLOOKUP is the newest version of the lookup function and it is generally recommended you use it whenever it is available due to its flexibility and enhanced capabilities. Unlike its predecessors, XLOOKUP can search in any direction, not just "vertically" or "horizontally". More on that in a bit.

XLOOKUP simplifies finding a value within a range or array and returning a related value from a specified array or range. Its syntax, XLOOKUP(lookup_value, lookup_array, return_array, [if_not_found], [match_mode], [search_mode]), introduces many optional parameters for added versatility, such as specifying a return value when the lookup value is not found (if_not_found), and more precise match types (match_mode) and search modes (search_mode). This functionality makes it not just a replacement for VLOOKUP and HLOOKUP but a significant improvement, offering a more intuitive handling of various lookup scenarios.

A usage of XLOOKUP()

In the above example, I've created a simple lookup tool for anyone who might want to see a specific row in the data without having to scroll to find it. This formula first looks at the cell at K4 which I've given a thick border. Whatever the user types here becomes the value XLOOKUP will search for. In this case, the user enters a sale ID they'd like to see, and then XLOOKUP checks the entire A column from top to bottom for the first match by default. When a match is found, it dynamically returns all info in the corresponding columns A to H. If the user types in an invalid or non-existent ID, XLOOKUP will display "Sale ID Not Found."
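
A sketch of what the formula behind that lookup tool might look like, assuming the sale IDs live in column A and the full records span columns A through H:

```excel
=XLOOKUP(K4, A:A, A:H, "Sale ID Not Found")
```

Because the return array spans eight columns, the match spills the entire corresponding row; the fourth argument supplies the fallback text for invalid IDs.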

There are further ways of personalizing XLOOKUP. What follows is a list of the additional arguments that can be passed that affect what it considers a match and how it goes looking for said match.

XLOOKUP MATCH MODES:

  1. Exact Match (0): Looks for an exact match to the lookup value. If no exact match is found, the function returns an error. This is the default mode.
  2. Exact Match or Next Smaller Item (-1): If an exact match isn't found, the function returns the next smaller item. This is useful for sorted data where you want to find a close match without going over.
  3. Exact Match or Next Larger Item (1): If an exact match isn't found, the function returns the next larger item. Like the previous mode but in the opposite direction.
  4. Wildcard Match (2): Allows the use of wildcard characters (* for any number of characters, ? for a single character) in the lookup value.

XLOOKUP SEARCH MODES:

  1. Search First-to-Last (1): Starts searching from the beginning of the lookup array and moves towards the end. This is the default search mode.
  2. Search Last-to-First (-1): Starts searching from the end of the lookup array and moves towards the beginning. Useful for finding the last occurrence of a value.
  3. Binary Search for Ascending Order (2): Uses a binary search algorithm for faster searches on large data sets sorted in ascending order. This mode is efficient but requires the data to be sorted.
  4. Binary Search for Descending Order (-2): Similar to the previous mode but for data sorted in descending order.
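
Two hedged sketches showing those extra arguments in action (the ranges, values, and column layout are hypothetical):

```excel
=XLOOKUP(K4, B:B, A:H, "Not Found", 0, -1)
=XLOOKUP("*phone*", C2:C100, D2:D100, "Not Found", 2)
```

The first searches last-to-first, returning the most recent exact match for the value in K4; the second uses wildcard mode to match any entry in C2:C100 containing the word "phone".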

For completeness: VLOOKUP, or Vertical Lookup, the predecessor to XLOOKUP, searches for a value in the first column of a table or range and returns a value in the same row from a specified column. When data is organized vertically, each row represents a unique record or item, and each column represents a different attribute or variable of that record. VLOOKUP is best suited for situations where your data is organized vertically and you need to retrieve information based on the leftmost column. Its syntax, VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]), includes an option for exact or approximate matches but requires the lookup value to be in the first column of the data range.

HLOOKUP, or Horizontal Lookup, operates similarly but is designed for horizontal data layouts. In horizontally organized data, each column represents a unique record or item, and each row represents a different attribute or variable of that record. HLOOKUP searches the first row of a table or range to return a value in the same column from a row you specify. The syntax, HLOOKUP(lookup_value, table_array, row_index_num, [range_lookup]), mirrors that of VLOOKUP, adjusted for horizontal data.
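
For comparison, hedged sketches of the two older functions (the ranges and index numbers are hypothetical):

```excel
=VLOOKUP(K4, A2:H100, 3, FALSE)
=HLOOKUP("Q2", A1:E5, 4, FALSE)
```

VLOOKUP() matches K4 against the first column of A2:H100 and returns the value from the third column of the matched row; HLOOKUP() matches "Q2" against the first row of A1:E5 and returns the value from the fourth row of the matched column. FALSE forces an exact match in both.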

And now for something completely different!

11. Combination Charts

So far, everything I've mentioned can be accomplished in Excel for both Mac and PC; however, for Combo Charts, there are noticeably more customization options on PC. Those options include not only more aesthetic choices but also the ability to use a secondary Y-axis, which is useful for relatively larger values like cumulative sums. So, word to the wise: if you're working on a Mac and want more customization options, use the web version of Excel included with Microsoft 365 (formerly Office 365) instead!

Types of Combination Charts

Combination charts combine two charts into one, but you probably guessed that! They are especially useful when plotting financial information because Combo Charts allow for multiple data sources to be graphed in multiple manners on multiple axes within the same graph.

An example of a Bar/Line Combination Chart

As you can see in the above example, they're especially powerful when trying to illustrate that "one of these things is not like the others." That can mean abnormalities or significantly larger or smaller values overall.

12. Pivot Tables

This section gets my vote for the second most valuable thing one can take away from this article. Pivot Tables were the thing that finally made me an Excel Believer. A Pivot Table is essentially a real-time report page or a highly interactive and customizable group-by. If you want to impress a stakeholder, build them a Pivot Table and show them how to use it.

There have been times when I've been on a call with clients, going over EDA findings, and they've raised a hand to ask, "Hey, this is great, but I'm curious how many Xs per Y we've had," as a sinking feeling grew in my stomach because I hadn't done that exact grouping yet. The only thing that saved my bacon in those moments was knowing how to build a Pivot Table: being able to construct that exact grouping from scratch in mere seconds and then manipulate it in real-time as more questions came through.

Where to locate the PivotTable button

If you've already got your data formatted as a table, you're halfway there. Just select all, head to the "Insert" tab, and click Pivot Table on the far left. You'll get a prompt asking you where you'd like to insert the new Pivot Table and I recommend always putting it in a new worksheet.

From there, you need to ask yourself some questions about your data. What grouping would I like to see? More precisely, what are the categories I want to see and what would I like to divide by those categories? When I was first learning it, this part was confusing for me.

PivotTable Fields

The Pivot Table Fields can look intimidating if you've never used them before, but they're very simple after you've used them a couple of times. At the top, your Field Names are simply all of the columns within your table. If you're just learning, ignore Filters and Columns for now. Rows represent the categories you'd like to use to group your data and Values represent what data you'd like to see within those categories. For instance, if you want to see Sales by Date, then Date becomes your rows and Sales becomes your Values. If you'd like to see your Revenue by Product, Product is your Rows and Revenue is your Values. See, easy.

Two important things to keep in mind. First, for Rows, you'll ideally want to choose a field from your data with relatively few unique values; in technical terms, you want a discrete data category, one that only has a few possible options. Otherwise, the Pivot Table will get very long very fast. Second, for each Value, Excel needs to know exactly how you want those values grouped, or aggregated in technical terms. These aggregations can be sums, averages, maxes, mins, or simple counts. To choose which you'd like, simply click the small "i" button next to your field after you've added it to the Values panel. In the below example, I'm showing the average revenue made by product type.

Basic 2-field PivotTable

Now, the really amazing thing about Pivot Tables is the ability to stack various groupings. For instance, if you wanted to combine the examples I mentioned and see the average sales by product by date, all you need to do is drag the "Date" field above Product within Rows. As you can see below, Excel automatically recognizes our Date column and splits it into Years, Quarters, Months, and Individual dates. We can then see each of the products sold on those days and the average number of sales for each. By looking closely at the groupings on the top left side, we can see those averages broken down by Month, Quarter, and Year.

A multi-row PivotTable

Now let's say I wanted to give this to a client and let them quickly see the same grouping only limited, let's say, to a given country of their choosing. That sounds a lot like the Slicer we used recently. In fact, that's exactly what Filters are in Pivot Tables. If we add the Country field to the Filter panel, we can let the user quickly limit the existing grouping to the country of their choosing.

A PivotTable with a filter

Finally, let's say I wanted to see at a glance the average sales per product per quarter, limited to China, and I wanted to see it broken down by year, but I didn't want to have to collapse a bunch of rows in the report to get there. That's where columns come in. Columns are the trickiest to wrap your mind around, but they're basically the highest grouping in the sense that they can show parallel groupings based on what you have so far. They duplicate the hierarchy you've built across a given category. Take a look below.

A PivotTable with columns

Since there are Quarters and Products for every Year (and only a few possible years), it makes sense to move Year into Columns. This essentially duplicates the Pivot Table, performing the exact same grouping (average sales per product per date for China) for each Year and showing the results side by side.

Pivot Tables are one of those things that only really make sense after you've had a chance to play around with them. So the next time you're working with a table in Excel, I highly recommend making a Pivot Table and just dropping random Fields into random panels to see what you can generate!

I'd also recommend checking out the "Pivot Chart" option, which automatically generates visualizations based on your Pivot Table in real time.

13. Advanced Sort & Filter

Pivot Tables were a lot to take in, so let's do something a little simpler now. Advanced Filters for tables act like static slicers layered on top of regular sorts. Let's say you have a table like the example I've been working with, and it's sorted descending by sale ID. That's all fine and well, but it would be nice to have it display the top sales first when we hand it to a client.

An example of tabled Sorting feature

What I've done in the example above is click inside my table, go to the "Data" tab, click the "Sort" button, and create a hierarchy of sorts: the highest sales are on top; where sales are equal, the highest revenue is on top; and where revenue is equal, the highest profit margins are on top. This also has the added benefit of making our cell highlights look even nicer.

Now let's say I only wanted to show transactions where customer satisfaction was "Very High." I could simply right-click on that value and set it as a filter. The resulting table will only display rows where that value matches. Kind of like a slicer that's always on.

Filtering a table by right-clicking a value

Clicking the "Advanced" button next to the filter allows you to further set your criteria range and specify the conditions data must meet to be included in your filtered results.

Table Filtering options

14. Geographical Data Types

Now here's a fun one. Excel has a few "special" data types that it can work with. You probably already know the basics like numbers, dates, and text, but you can also specify that a cell's value contains Stock, Currency, or Geography data. Doing so allows Excel to display special info: it can automatically grab real-time stock info for stock tickers or real-time exchange rates for currencies. However, the one I end up using most is the Geography data type because it allows you to build filled map charts quickly and dynamically.

An example of the Geographical Data button and a column of Geographical data

Above you can see I've gone to my "Data" tab, selected my table's column which contains country info, and changed its data type to "Geography." Now a little map appears next to each of the values. This works for full addresses, city names, states, and zip codes. Clicking on the little map icon gives you info about that place. You can even generate more columns from information from Bing such as location descriptions.

Now, to get info onto a filled map in Excel for Mac, I tend to make a simple Pivot Table where the columns are the geographical locations and the values are... well the values. We do this because the Filled Map function needs each individual location to be its own column so we need to "pivot" our data. After creating the simple Pivot Table, I copy and paste the resulting table into the worksheet where I want the filled map and convert the location names back to the geographical data type as shown below.

A small PivotTable export containing Geographical Data

Then, it's very easy to select this new small table, go to the "Insert" tab, and click Filled Map as shown below.

The Filled Map button

From here, you can customize the look of your filled map! This is great for those who want Tableau-style geographical visuals but don't know how to use Tableau or don't want to pay for it.

A Filled Map using Geographical data values

Use caution if you're using more granular location data, as the Filled Map function becomes more complicated the more geographical locations you're working with. Six countries is no problem; 1,000 individual zip codes might cause your laptop to catch fire.

15. Power Query

The "Get Data" drop down menu

One of the reasons Excel wasn't taken seriously as a tool for data science for a long time was its limitations on what kinds of data sources you could connect to and how much of that data you could import at a time. Not that long ago, you couldn't connect to a server and you were limited to roughly a million rows. Additionally, any changes you made to your data weren't replicable like they would be if you wrote a Python script to perform your cleaning automatically.

Power Query solves all of that. Power Query is really the key feature that elevates Microsoft Excel to the status of a true Data Science tool because it allows you to parse "Big Data" sources far larger than simple .CSVs.

Now, I could probably write an entire article roughly the same length as this one dedicated solely to Power Query, so I'm going to limit this section to the basic reasons you should learn to use Power Query and some of its best features. At the highest level, you would want to use Power Query when you need replicability (the ability to perform the same processes on every newly imported dataset) or when you need to perform complex data cleaning or validation on a particularly large dataset. Power Query is also quite useful when it comes to feature engineering, a term for deriving new data columns from existing ones.

Power Query's Data Source options

To get started importing data with Power Query, go to your "Data" tab, then click the Get Data button on the far left side of the ribbon. That will display all of the various ways you can connect to external data sources. For our purposes here, I've generated a .CSV with many, many data validation problems.

Power Query data preview

The first thing you'll want to do is check that all of your data has been properly parsed, that is, that all of your columns are properly sectioned off. You can change the delimiter if they aren't. Once that's done, click "Transform Data" to launch the Power Query Editor.

The Power Query Editor

As mentioned, I don't want this section to go on forever, so I'll just quickly explain what you're seeing here and what I'd like to do with this data. First, I'd like to make sure all of our data is valid, meaning it's formatted consistently, and that new data entered into the same dataset will be subject to the same formatting if I run the same processes again. Then I'd like to create a new column that contains a rolling 3-month average based on sales.
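As a quick aside for the Pandas-curious: a rolling 3-month average like that is a one-liner in Python, and seeing it spelled out can help you map the Power Query version onto tools you may already know. Every number and column name below is made up for illustration:

```python
import pandas as pd

# Made-up monthly sales figures; the column names are illustrative,
# not taken from my actual dataset.
sales = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Apr", "May"],
                      "Sales": [100, 120, 90, 150, 130]})

# Rolling 3-month average: each row averages itself and the two rows
# before it; the first two rows have no full window, so they stay empty.
sales["Rolling3"] = sales["Sales"].rolling(window=3).mean()
print(sales)
```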

From the main editor page, you can see all of the external sources you're connected to in the left side panel. On the right side, you can see a history of every transformation that has been applied to your data so far (this is where the replicability comes from). In the center you see your data, your data types per column, and the percentage of valid rows, empty rows, or rows imported with errors. As you can see, ours are 100% valid across the board because we haven't told Power Query what valid data looks like yet, and it has imported the "Not a Number" values from our .CSV file as strings that read "nan," which count as valid here.

On the ribbon itself, we see two tabs that have a simple but crucial difference. Clicking on both reveals that the Transform tab and the Add Column tab have very similar tools; however, Transform applies changes to your data in place, whereas Add Column creates a new column that contains new data based on whatever change you've made.

Row removal options for Data Validation

First things first for our validation: removing duplicate values. From the "Home" tab I click Remove Rows and find the Remove Duplicates button. Power Query removes all duplicate rows and then adds that process to the history panel on the right. If I ever wanted to "go back in time" on our analysis, I'd just click on a higher step and our data would revert.

Text standardization with the Transform tab

Next, I want each one of my customer's names to be capitalized. I go to the "Transform" tab, click Format, then Capitalize Each Word. It does so and I see a new step added to our Applied Steps.

Replacing values in Power Query

Now usually, our nans would register as missing values and be flagged as "empty" cells. Since our nans are just strings, I use a simple right click and replace to set those cells to "Unknown" instead.

Splitting a column on a delimiter

I've also noticed that my "Categories" column has subcategories for products in Electronics. They're separated by a pipe symbol, "|". I right-click my Categories column and select "Split Column." Then I choose my custom separator and hit OK. This splits my column into two. It also means that I now have null values in my newly created subcategories column, so I'll go back to my replace option and swap those with "n/a."

Replacing periods with spaces

I've also noticed that there are periods in between some of the names. Using the "Replace Values" button from the ribbon in the "Transform" tab allows me to swap that with a simple whitespace.

Getting the absolute value of a column

I know that I have no negative sales, but it appears that my dataset contains some negative values, so I right-click the column, select Transform Column, then convert each of the values to its absolute value.

That looks good enough to me! Now that all of the data is cleaned and validated, I go to the "Home" tab and select Close and Load Data to add my data to my main Excel file.
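If you're coming from Python, here's roughly what that same cleaning pipeline looks like in Pandas. Every column name and value below is made up to mirror the problems we just fixed; it's a sketch for comparison, not a drop-in script:

```python
import pandas as pd

# A toy frame with the same problems the walkthrough cleans: duplicates,
# lowercase dotted names, "nan" strings, piped categories, negative sales.
df = pd.DataFrame({
    "Customer": ["jane.doe", "jane.doe", "john smith", "nan"],
    "Category": ["Electronics|Phones", "Electronics|Phones", "Toys", "Toys"],
    "Sales": [-51, -51, 20, 35],
})

df = df.drop_duplicates()                              # Remove Rows > Remove Duplicates
df["Customer"] = (df["Customer"]
                  .str.replace(".", " ", regex=False)  # Replace Values: "." -> " "
                  .str.title()                         # Format > Capitalize Each Word
                  .replace("Nan", "Unknown"))          # Replace: nan -> "Unknown"
split = df["Category"].str.split("|", expand=True)     # Split Column by delimiter
df["Category"], df["Subcategory"] = split[0], split[1].fillna("n/a")
df["Sales"] = df["Sales"].abs()                        # Transform > Absolute Value
print(df)
```

The nice part about Power Query is that you get the same step-by-step replicability as a script like this without writing any code.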

Finally, the thing that makes Power Query so worthwhile: if you're on a PC, it's possible to export a .pq file which saves your query for reuse later. If you're on a Mac, unfortunately, this isn't possible, but you can always revise your previous steps and copy the M-code step by step onto new data.

16. Correlation

Okay, that was rough. Let's do another fun one.

Correlation in the field of statistics is a complicated process, but the idea behind it is very simple. If I have two series of data and a number goes up in one series, how likely is it that the corresponding number in the other series will go up, go down, or stay the same? You run a series of calculations, which I won't get into here, and you're left with a single number known as a "correlation coefficient." That number ranges from -1 to 1. A correlation coefficient close to 1 means that when one series goes up, so does the other. A correlation coefficient close to -1 means the opposite: when one series goes up, the other goes down. A correlation coefficient of 0 means there's no noticeable relationship. There are also various kinds of correlation calculations, some more rigorous than others, but we'll leave it there for now.

While knowing how correlated two sets of data are doesn't tell us everything (and in fact can even be misleading), it does give us a strong indication that two data sets share some kind of relation that should be further investigated.

There are three ways to do correlation analyses in Excel, two of which I'll cover now and one of which I'll save for a little later.

Excel's built-in Pearson Correlation function

Firstly, there's a function. This doesn't look like much, but it will come in handy later. Using Excel's CORREL() function, all I need to do is pass two arrays of data and Excel immediately returns the Pearson correlation coefficient. As we can see from the near-zero score, my randomly generated data are not very correlated.
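For reference, the NumPy counterpart to CORREL() is a one-liner too. The arrays below are randomly generated stand-ins for my columns:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the result is reproducible
x = rng.random(100)              # two unrelated random arrays, standing in
y = rng.random(100)              # for the two columns of my data

# np.corrcoef returns the full correlation matrix; the [0, 1] entry is
# the same Pearson coefficient CORREL() would return for these arrays.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))
```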

A Scatter Plot visualizing (a lack of) correlation

We can confirm that with the second method: plotting a scatter plot using the two arrays. Typically, if the data points are arranged more or less in what looks to be a diagonal line, then there's likely a correlation. From this jumbled mess, I can tell at a glance there's no correlation here. For reference, I'm attaching below what it would look like if there were a correlation.

A highly-correlated Scatter Plot

17. Linear Regression (Manual)

Pay no attention to the word inside of the parentheses for now.

A Linear Regression is a way of taking two correlated data series and using a "line of best fit" on their values to predict future values to a degree of certainty. Linear Regression is the most basic machine learning algorithm, so simple in fact that it can be done with a pen and paper. That is exactly what we're going to attempt here, more or less.

Why would we do that? Well, I'm always a fan of teaching folks to fish as opposed to giving them a fish. In your future or current data reports and presentations, it may help you in the long run to explain to any non-technical stakeholders how machine learning algorithms make predictions. Visualizing the most basic version helps.

The statistical formula for performing a Linear Regression is y=mx+b where:

  • y: The value you're trying to predict, also known as the dependent variable because its value depends on the value of x.
  • x: A given variable within your dataset, known as the independent variable because it is assumed to influence or predict the value of y.
  • m: The slope of the regression line, quantifying the relationship between the independent variable x and the dependent variable y. It represents how much y changes when x changes by one unit.
  • b: The y-intercept of the regression line, which is the predicted value of y when x is 0. It represents the point where the regression line intersects the y-axis.

We can calculate m and b using Excel functions. All we need to do after that is plug those values into the formula. For this example, let's see if Profit Margin Percentage is a good predictor of Sales (I have a suspicion it isn't).

Excel's SLOPE() function

Since we're trying to predict Sales, it is our dependent variable or "known y's" so it goes first in Excel's SLOPE() function. Profit Margin comes next. Poof. We have our slope.

Excel's INTERCEPT() function

We then do the same exact thing, only using Excel's INTERCEPT() function. Congratulations, you just technically built a machine-learning model! Now we have all the pieces to perform a Linear Regression and try to predict a sales figure.
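If you'd like to double-check Excel's answers, SLOPE() and INTERCEPT() are easy to reproduce with the closed-form least-squares formulas. The numbers below are made up, not taken from my dataset:

```python
import numpy as np

# Made-up stand-ins for Profit Margin (x) and Sales (y).
x = np.array([10.0, 12.5, 8.0, 15.0, 11.0])
y = np.array([40.0, 55.0, 35.0, 60.0, 48.0])

# Closed-form least squares, matching what SLOPE() and INTERCEPT() compute:
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# y = mx + b: predict the first row's Sales from its Profit Margin.
prediction = m * x[0] + b
print(m, b, prediction)
```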

Okay. Let's try to use the first sale in our data set to test how accurate our model is. Plugging in all the numbers we get:

A manual Linear Regression prediction

Hey! 49.01 is pretty close to our real value of 51 for that sale! Our model might actually be usable after all.

The way we'd test that is by checking the R2 (R-squared) score, which measures how much of the variation in our known dependent variable the model can explain. Excel actually has a function for this (RSQ()), and running it gets us R^2 = 0.00025. Scores closer to 1 mean the model predicts well, and scores closer to 0 mean the model doesn't predict much of anything, sooo...
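If you're curious where that score comes from: for a simple one-variable regression, R-squared is just the Pearson correlation coefficient squared. You can verify that with a few made-up (and, unlike my random data, strongly related) numbers:

```python
import numpy as np

x = np.array([10.0, 12.5, 8.0, 15.0, 11.0])  # made-up, roughly linear data
y = np.array([40.0, 55.0, 35.0, 60.0, 48.0])

# Square the Pearson coefficient to get R-squared, the same quantity
# RSQ() reports for a single independent variable.
r_squared = np.corrcoef(x, y)[0, 1] ** 2
print(round(r_squared, 4))
```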

Pobodies nerfect! At least we learned something. On to the next one.

18. The Analysis Toolpak Add-in

I really saved the best for nearly last here. If you haven't messed around with Excel's Add-in features, I highly recommend doing so. The Add-in I'd like to focus on right now is called the Analysis Toolpak, and it lets you perform many advanced statistical and machine learning analyses very quickly from within Excel.

See I told you there was a third way to do Linear Regression.

Excel's Add-Ins window within the Tools drop down menu.

If you navigate to the Tools drop-down menu and then to "Excel Add-ins..." you'll see an option to check a box next to Analysis Toolpak. Doing so will enable new "Data Analysis" menus elsewhere in Excel.

Using the Analysis Toolpak's Regression option

One location is back in the Tools drop-down menu listed as "Data Analysis..." and clicking that button will launch the above menu with many advanced data analysis and statistical methods you can apply to arrays. Be careful to be exact with your array ranges. The Toolpak doesn't automatically ignore headers.

The Analysis Toolpak's Regression window.

Clicking Regression allows us to select our y and x variable arrays as well as an output cell. Clicking OK gets us this...

The Analysis Toolpak's Regression output.

Much faster than our previous method, but seeing this also lets us confirm that our previous measurements were correct!

19. Descriptive Statistics

Using the Analysis Toolpak, we can very quickly calculate many descriptive statistics, including measures of central tendency, for a given array. I mention it here to emphasize that this replaces a very common tool: Pandas' describe() method in Python.

Running it on our Sales column returns this:

An example of the Analysis Toolpak's Descriptive Statistics output.
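For comparison, here's the Pandas call this mirrors, run on a made-up Sales column:

```python
import pandas as pd

sales = pd.Series([51, 20, 35, 44, 67, 29], name="Sales")  # made-up values

# count, mean, std, min, quartiles, and max in one call, much like the
# Toolpak's Descriptive Statistics output.
summary = sales.describe()
print(summary)
```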

20. Logistic Regression

Okay! Let's go ahead and put it all together with one last big swing.

First things first, what is a Logistic Regression and how does it differ from a Linear Regression? Put simply, whereas a Linear Regression predicts a single continuous numerical value, a Logistic Regression predicts which of two binary categories an observation belongs to. In our case, I'd like to build a Logistic Regression model that predicts whether or not an observation came from a given country.

To do this, we'll need a new binary column that is 0 when an observation did not come from said country and 1 when it did. Then we'll need a little bit of Excel trickery to get the program to perform the calculations we want, but if you've made it this far, it should be a piece of cake.

Encoding my Country column with an IF() function.

To get started, I convert my Geography data back into regular text by selecting the country column, right-clicking, going to Data Type, and selecting "Convert to text." Then, I make a new column with a formula that takes every cell in the country column, removes non-printable characters, removes any trailing whitespace, and then records a 1 if the remaining text says "United States" and a 0 for any other value. This column is going to represent our training data. Ideally, once the model is finished, we'd be able to pass in some Sales, Revenue, and Profit Margin values, and it would return, to a degree of certainty, either a 1 or a 0 for whether or not that sale came from the US. Then, for simplicity, I delete every column that isn't my dependent variable, independent variables, or reference.
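In Pandas terms, that whole encoding step boils down to a strip-and-compare. The country values below are made up, with the same stray whitespace the formula strips out before comparing:

```python
import pandas as pd

# Made-up country values with stray whitespace, the kind of mess the
# spreadsheet formula cleans up before the comparison.
country = pd.Series(["United States ", "Canada", " United States", "Mexico"])

# Strip the whitespace, then flag United States rows with 1 and
# everything else with 0 to build the binary training column.
is_us = (country.str.strip() == "United States").astype(int)
print(list(is_us))
```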

Just to get the technical aspects out of the way, the logistic function can be defined as P(Y=1) = 1 / (1 + e^(-(β0 + β1X1 + ... + βnXn))), where P(Y=1) is the probability of the dependent variable equaling 1 (success), e is the base of the natural logarithm, X1, ..., Xn are your independent variables, and β0, β1, ..., βn are the coefficients to be estimated.

Adding placeholders for coefficients and the intercept.

Let's start with the placeholders for the coefficients and intercept. We need starting values that can be adjusted for each input. I'll create a new cell for each coefficient at the bottom of the table so they don't get in the way and give them starting values of 0.01 (don't think too hard about that; just choose any number close to zero for now). I'll also make a cell for the intercept.

Adding statistical columns.

Now we need some new columns for the statistical methods we'll use to optimize the cells we just made. I'll add those to my slimmed-down table. Logit is the log-odds of the country being the US, eLogit is e raised to the power of the Logit, Probability is the probability of it being the US expressed as a real number between 0 and 1, and Log Likelihood is a measurement of how well the model fits.

Calculating the Logit.

We use the placeholder values to calculate Logit.

Using EXP() to calculate the eLogit.

Then Excel's EXP() function to calculate eLogit.

Calculating the Probability.

Then we use eLogit to calculate Probability, which is simply eLogit / (1 + eLogit).

Using LN() to calculate the Log Likelihood.

Then we use Excel's LN() function on the Probability to get the Log Likelihood and use the table's total row to get the sum of our Log Likelihood column.

Launching the Solver Add-in

Now we navigate to the "Data" tab and launch our newly installed "Solver" from the right side of the ribbon. We set the objective to the sum of the Log Likelihoods, set "To" to "Max," set "By Changing Variable Cells" to our coefficient and intercept placeholders, uncheck the "Make Unconstrained Variables Non-Negative" box, and lastly make sure our solving method is GRG Nonlinear. Then hit Solve!
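To tie the whole section together, here's a rough Python sketch of everything Solver is doing for us: the Logit, Probability, and Log Likelihood columns, plus an optimizer that nudges the coefficients until the summed log-likelihood is as large as it gets. The data is made up, there's a single independent variable for simplicity, and plain gradient ascent stands in for Solver's GRG Nonlinear engine:

```python
import numpy as np

# Made-up training data: one feature (a stand-in for Sales) and the
# 0/1 "came from the US" flag we encoded earlier.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 0, 1, 1])

beta0, beta1 = 0.01, 0.01   # the same near-zero placeholder starting values

# Maximize the summed log-likelihood, the same objective we gave Solver,
# by repeatedly stepping the coefficients along its gradient.
for _ in range(50000):
    logit = beta0 + beta1 * X            # the "Logit" column
    prob = 1 / (1 + np.exp(-logit))      # the "Probability" column
    beta0 += 0.02 * np.sum(y - prob)     # gradient of the log-likelihood
    beta1 += 0.02 * np.sum((y - prob) * X)

# Recompute the columns with the fitted coefficients and sum the fit.
logit = beta0 + beta1 * X
prob = 1 / (1 + np.exp(-logit))
log_likelihood = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
print(beta0, beta1, log_likelihood)
```

A summed log-likelihood closer to zero means a better fit, which is exactly what Solver's "Max" objective is chasing.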

Now, I hate to end this anticlimactically, but since my data is randomly generated, proceeding any further is going to get pretty nonsensical. At this point, Solver will update the placeholder values you gave it with the fitted coefficients.

If you'd like to proceed to make predictions using your newly fitted logistic regression model, please check out the article that helped me put together what I have so far!

Check out this article on how to do it with your actual data: https://www.datamation.com/big-data/logistic-regression-in-excel/


So long and sorry about your headache!

If you're new to using Excel, I'm sure that was a lot to take in, assuming you didn't just skip to the bottom. I hope this shows you the value of Excel and has officially converted you into an Acolyte like me!

If you use Excel for your job, no matter your field of work, I hope you took away at least one new method you can use to make your life easier or your reports and dashboards more fantastical!

Did you enjoy this article? Did I miss something? Got a better method? Leave a comment below and let me know! And make sure we're connected so you see the next article I have planned when it comes out!

Got a data or media production related question? I'm your guy! Drop me a line and I'd be happy to chat.
