20 Advanced Methods For Doing Data Analysis in Excel
The Excel-Acolyte's Ultimate Guide to Data Science (previous working title)
I love Python; I couldn't imagine doing Data Analysis without it. But I wish I had known how to do many of my favorite Pandas/SKLearn methods in Excel before committing to several books and boot camps to learn Data Analysis/Science with Python. That learning curve stopped me from jumping in because I thought I wasn't ready. Looking back, much of what I was learning to accomplish could've been performed in Excel, a piece of software I'd already been using for a decade.
So I spent some time writing the guide I wish I had when I first started learning Data Analysis/Science. Don't let Python's – albeit modest – learning curve stop you from getting your feet wet in a Data Analysis career.
The following 20 methods have been curated to take your Excel-game to the next level. If you use Excel for work, I'm positive the information here can help you no matter the field you work in.
From Non-Believer to Acolyte
Microsoft Excel does not have the sexiest name recognition in the world... but maybe it should.
When I was first taught Excel in my middle school computer class, I saw it as little more than an overrated calculator. Yeah, smart-filling was cool, but what good was that outside of an accounting department? Man, how times change. These days, I truly believe that Microsoft Excel represents one of the greatest technical achievements humanity has ever made. It's such a powerful tool, it still blows my mind that anyone can use the web version for free. It should come as no surprise that, like Atlas carrying the world, Excel carries most of the world's business, financial, and scientific endeavors on its shoulders.
I wanted to write an article examining many of the features that converted me into an Excel Acolyte: someone who wholeheartedly believes most work tasks can be solved in Excel. For example, the field of Data Science is complex and has a steep learning curve, but with a little know-how and creativity, many basic Data Science methods can be recreated right from Excel. Many of the methods and features I'm about to explore can easily replace other software tools across other industries. If you find yourself needing to make predictions or visualizations, or to automate repetitive tasks, and you know the basics of Excel but are unwilling or unable to learn Python, then read on, friend.
I will assume you know the basics (sums, fills, etc.) and launch right into the good stuff.
Make sure to save this article for later reference!
What I Will Cover:
(In order of ascending difficulty)
1. Format as Table
2. Remove Duplicates
3. Text to Columns
4. Conditional Formatting with Color Scales
5. Data Validation
6. Dynamic Arrays
7. Table Slicers for Filtering
8. Text Functions
9. Formulas for Data Cleaning
10. XLOOKUP
11. Combination Charts
12. Pivot Table
13. Advanced Sort & Filter
14. Geographical Data Types
15. Power Query
16. Correlation
17. Linear Regression (Manual)
18. The Analysis Toolpak Add-In
19. Descriptive Statistics
20. Logistic Regression
Let's get started!
1. Format as Table
When you first bring a .CSV or similar data file into Excel, your data will be a simple spreadsheet of unconnected values. If you'd like to sum a particular column, you'll need to manually select the column you'd like (minus any potential header rows) and write a sum formula.
By selecting all your cells (command + A for Mac, Ctrl + A for PC) and clicking the Format as Table dropdown, you can instantly tell Excel that the values in your spreadsheet are related through columns and to turn everything into a colored table. This is especially useful if your data has a header row as it makes the header more obvious and allows for more intuitive sorting.
Another great bonus to having your data formatted as a table is having access to the "total row" option. Under the "Table" tab, by checking the "Total Row" box, a new row will appear at the bottom of your table which automatically calculates the sum, mean, count, etc. of any of your numerical columns.
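Formatting as a table also unlocks structured references, which make formulas read like plain English. A quick sketch, assuming a hypothetical table named SalesTable with a Revenue column:

=SUM(SalesTable[Revenue])

The total row generates similar formulas behind the scenes using SUBTOTAL (for example, =SUBTOTAL(109, SalesTable[Revenue]) for a sum), which has the nice property of ignoring rows hidden by filters.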
2. Remove Duplicates
One of the best features in the Python data frame library "Pandas" is its built-in duplicate identification feature. It was one I wished I had access to in Excel... until I did a little digging and realized I already did!
To check whether your data table contains duplicates, select all from inside your table (which selects the values without the headers), click on the "Data" tab, and then click the Remove Duplicates button (the small button with a red X, shown in the above photo).
A prompt will appear asking which columns you'd like to check. Usually, you'll want to select all your columns to check for duplicate entries, but you can always narrow that down based on your needs. After clicking OK, a new prompt will appear telling you whether any duplicates were found and eliminated.
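If you'd rather flag duplicates for review before deleting anything, a helper column works too. A sketch, assuming your data occupies A2:C51 (a hypothetical range):

=IF(COUNTIFS($A$2:$A$51, A2, $B$2:$B$51, B2, $C$2:$C$51, C2) > 1, "Duplicate", "")

Any row labeled "Duplicate" appears more than once when matched across all three columns, and you can filter on that label before deciding what to remove.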
3. Text to Columns
This feature will take columns with cells containing multiple words and split them into individual columns. This doesn't sound like much but is vital in situations where you need to isolate a single part of many values like email domains or address numbers.
Let's say, for example, you have a column that contains a series of addresses in long form and on a single line and you'd like to section out each of the elements within the addresses. Text to columns is the perfect tool. First, select all your data and click the "Text to Columns" button from the "Data" tab. Excel will then ask you three simple questions on how you'd like it to go about separating your data.
The first window simply confirms whether Excel is correct in its assumption about how your data is separated. In our case, it is: each of the data points within each address is separated by a comma, so we select "Delimited." If each field instead occupied a fixed number of characters, we'd select "Fixed width" or confirm it was automatically selected by Excel's analysis.
The next window confirms what symbol is being used to delimit each individual part. In our case, we want the street address, the state abbreviation, and the zip code to all be separate columns. Each is separated by a comma so we make sure the comma is selected. Data comes in all shapes and sizes so make sure you pay attention to what kind of delimiter your data is using.
The next screen simply asks you to confirm each of the new columns' data types and the destination for the new columns. After confirming both, you should see a new column for each element that was separated by your chosen delimiter.
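If you're on a recent Microsoft 365 build, the TEXTSPLIT() function offers a formula-based alternative that spills the pieces into adjacent cells as a dynamic array. A sketch, assuming a comma-separated address sits in A2:

=TEXTSPLIT(A2, ", ")

Unlike Text to Columns, the formula recalculates automatically if the source cell changes.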
4. Conditional Formatting with Color Scales
Conditional cell formatting is easily the quickest way to make an Excel spreadsheet look more professional. What this feature does is embed certain color scales or small bar charts directly into the cell itself. You can find the conditional formatting button near the middle of the ribbon on the "Home" tab.
The tools here range from automatic to very intricate depending on your needs. At the top, you'll find features for indicating precisely how you'd like certain cells to be shaded depending on their values. However, I believe the middle options are the most exciting because they function automatically. By selecting the range you'd like to format and then, say, the Color Scales option, you can quickly show your underperforming sales rows in red, dynamically contrasted against your high sales amounts, which will appear in green. Additionally, if you'd like to further illustrate differences in quantity at a glance, adding a Data Bar to your cells is incredibly intuitive, especially for dollar amounts.
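For finer control, the "New Rule" option accepts a formula that decides which cells get highlighted. A sketch, assuming sale amounts in column B starting at row 2: applying a rule with the formula

=$B2 >= 1000

to the data range highlights every row whose sale meets or beats $1,000, and the highlighting updates on its own as values change.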
5. Data Validation
This is an easily overlooked feature that makes it easy to "future-proof" your dataset. If you're going to be setting up a pipeline to automatically add data to your dataset from a database or other source, data validation allows you to set each column to only accept a particular kind of data. This helps quickly flag future situations where data is input incorrectly, whether that's an improper type or an out-of-range value, which makes data validation an easy way to spot outliers and potential input errors.
For example, let's say you know you usually never sell more than $1,000.00 per transaction. You can set your Revenue column to automatically throw an error and/or add a red circle when values are over this amount. That way, if you load up the latest data from your data source, but someone accidentally added some zeros when inputting an amount, it makes it very easy to identify and assess that error.
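Setting this up is quick: select the Revenue column, go to Data > Data Validation, set Allow to "Decimal" and Data to "between," and enter the bounds. A custom-formula sketch for the same rule, assuming revenue values in column B:

=AND(B2 >= 0, B2 <= 1000)

Afterwards, choosing "Circle Invalid Data" from the Data Validation dropdown draws the red circles around any existing cells that break the rule.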
6. Dynamic Arrays
We're going to be stepping up the complexity now. Dynamic arrays are exactly what they sound like. An array is simply a collection of values and the addition of the word "Dynamic" means that it's able to adapt. In this case, the addition of Dynamic Arrays to Excel gives you the ability to write formulas that can adapt to different kinds of inputs and give you adjusted outputs. Basically, instead of returning a value in a single cell based on a formula, you can return a series of cells.
For example, let's say I wanted to return one value per cell for every unique country in my dataset and then I wanted to sort the results alphabetically. The above formula does just that. The UNIQUE() function returns and fills a series of cells as opposed to a single cell.
As you can see from the above photo, the results total seven countries so the returned values spill over into seven cells. The SORT() function that surrounds the UNIQUE() function sorts the values alphabetically.
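For reference, the full formula looks something like this, assuming the countries live in B2:B51 (a hypothetical range standing in for the one in the screenshot):

=SORT(UNIQUE(B2:B51))

UNIQUE() spills one cell per distinct country, and SORT() puts the spilled results in alphabetical order.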
Here's a short list of more Excel functions that can return Dynamic Arrays: FILTER(), SORTBY(), SEQUENCE(), TRANSPOSE(), and RANDARRAY().
7. Table Slicers for Filtering
Table slicers allow a user to quickly, simply, and dynamically filter a table. It requires you to format your data set as a table so if you skipped the first bullet point and don't know how to do that, go back and review now!
Once your data is formatted as a table, make sure a cell within your table is selected, then locate the "insert slicer" button on the "Table" tab.
Let's say for the sake of our example that we wanted to be able to quickly limit the table based on the country the sales took place in. In that case, we'd check the box next to Country.
As you can see from the above image, our table is automatically filtered to show only the data that matches the selected field. We could add another slicer for the Date of Sale if we wanted to limit our table to sales from France on a particular day.
Another amazing benefit is that charts based on our data will be automatically updated by the slicer. It helps to build any calculated cells, slicers, and charts referencing the table on a new sheet, as using the slicer changes the number of visible rows on the original sheet and thus the positioning of any other objects there.
Above is the distribution for the Sales in all countries. I've placed the slicer and the histogram on a separate worksheet.
In the above image, we see that since the chart and slicer both reference the same table, clicking on "France" limits the chart to only sales within that country and automatically refreshes the histogram to reflect the new distribution. This feature is a must for building dashboards.
8. Text Functions
Sometimes you need to filter or select only parts of a series of characters or words. For instance, maybe you have a column containing phone numbers and would like to extract all the area codes. If the numbers have no delimiter, then the Text to Columns feature from earlier wouldn't help.
Enter the LEFT(), RIGHT(), and MID() functions. Each returns a certain number of characters counted from the left, right, or middle of a cell. For example, if I had a cell containing the word "CATSKILLS" and I wanted just "CAT," I would enter the formula "=LEFT(A1, 3)" in a new cell, and that cell would contain the word "CAT."
In the below example, I have created random phone numbers in column A. This gives us a great opportunity to combine the UNIQUE() dynamic array function with MID() to identify all of the unique area codes that appear within those phone numbers.
The RIGHT() and LEFT() functions take just two arguments: the cell you're referencing and the number of characters to return from that end. MID() takes one extra argument: first the reference cell, then the starting position, then the number of characters to return from that position. In the above formula, I know I have 50 numbers and a header, so I first have MID() reference A2 to A51. Then, since I know they're all US numbers, I skip the first character by starting at position 2 and return the three characters from that position. That alone would get me all of the area codes, but I only want the unique ones, so I wrap that function in a UNIQUE() function, which eliminates any duplicates.
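Put together, the formula described above looks like this (assuming the phone numbers sit in A2:A51):

=UNIQUE(MID(A2:A51, 2, 3))

MID() spills a three-character area code for every phone number, and UNIQUE() collapses the spill down to the distinct values.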
Okay, now we're cooking with gas! Congrats on making it this far. If you've followed everything we've covered so far, then you're ready for a bit more of a challenge. Let's check out some more advanced functions typically used for data cleaning in more advanced tools like Pandas for Python or SQL.
9. Formulas for Data Cleaning
In this section, I'm going to cover three functions commonly used in other data-cleaning tools: TRIM(), CLEAN(), and SUBSTITUTE().
TRIM() removes white space not only from the ends of a string but also from inside the string itself, collapsing runs of extra spaces down to single spaces. This is something many equivalent methods in Python and SQL do not do. This function is great for standardizing string entries, especially names.
The CLEAN() function is hard to display here because it removes "non-printable" characters, which are usually artifacts of data corruption; I cannot show an example because I can't print non-printable characters. The CLEAN() function is good for data validation or when importing error-prone data from external sources.
The SUBSTITUTE() function is exactly what it sounds like. It replaces a given string wherever it occurs in a given range.
SUBSTITUTE() also takes an optional fourth argument to indicate which instance of the given string should be replaced.
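A few sketches of all three in action, assuming messy text in A2:

=TRIM(A2)                          removes leading, trailing, and extra internal spaces
=CLEAN(A2)                         strips non-printable characters
=SUBSTITUTE(A2, ".", " ")          replaces every period with a space
=SUBSTITUTE(A2, "-", "/", 2)       replaces only the second hyphen

They also nest nicely: =TRIM(CLEAN(A2)) is a common one-two punch when importing data from unreliable sources.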
10. XLOOKUP
The LOOKUP functions are some of the most common and some of the more confusing Excel functions for Data Analysts. Every time I've had to put together an Excel dashboard for a client, I inevitably end up using XLOOKUP even when I try my best to avoid it. Chances are, at some point in your career, you'll need to use it too.
In a nutshell, what the various LOOKUP functions do is search for a given value within a given range, but they differ in the details of the arguments they accept. This is possibly the most valuable function you can learn from this article.
XLOOKUP is the newest version of the lookup function and it is generally recommended you use it whenever it is available due to its flexibility and enhanced capabilities. Unlike its predecessors, XLOOKUP can search in any direction, not just "vertically" or "horizontally". More on that in a bit.
XLOOKUP simplifies finding a value within a range or array and returning a related value from a specified array or range. Its syntax, XLOOKUP(lookup_value, lookup_array, return_array, [if_not_found], [match_mode], [search_mode]), introduces many optional parameters for added versatility, such as specifying a return value when the lookup value is not found (if_not_found), and more precise match types (match_mode) and search modes (search_mode). This functionality makes it not just a replacement for VLOOKUP and HLOOKUP but a significant improvement, offering a more intuitive handling of various lookup scenarios.
In the above example, I've created a simple lookup tool for anyone who might want to see a specific row in the data without having to scroll to find it. This formula first looks at the cell at K4 which I've given a thick border. Whatever the user types here becomes the value XLOOKUP will search for. In this case, the user enters a sale ID they'd like to see, and then XLOOKUP checks the entire A column from top to bottom for the first match by default. When a match is found, it dynamically returns all info in the corresponding columns A to H. If the user types in an invalid or non-existent ID, XLOOKUP will display "Sale ID Not Found."
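The formula behind this little tool looks something like the following sketch (matching the layout described above, with the search box in K4 and the data in columns A through H):

=XLOOKUP(K4, A:A, A:H, "Sale ID Not Found")

Because the return array spans columns A to H, the single XLOOKUP spills the entire matching row, and the fourth argument supplies the friendly message when no match exists.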
There are further ways of personalizing XLOOKUP. What follows is a list of the additional arguments that can be passed that affect what it considers a match and how it goes looking for said match.
XLOOKUP MATCH MODES:
0 - Exact match (the default)
-1 - Exact match, or the next smaller value if none is found
1 - Exact match, or the next larger value if none is found
2 - Wildcard match, using the *, ?, and ~ characters
XLOOKUP SEARCH MODES:
1 - Search first-to-last (the default)
-1 - Search last-to-first, handy for grabbing the most recent entry
2 - Binary search on data sorted ascending
-2 - Binary search on data sorted descending
For posterity: VLOOKUP, or Vertical Lookup, the predecessor of XLOOKUP, searches for a value in the first column of a table or range and returns a value in the same row from a specified column. When data is organized vertically, each row represents a unique record or item, and each column represents a different attribute or variable of that record. VLOOKUP is best suited for situations where your data is organized vertically and you need to retrieve information based on the leftmost column. Its syntax, VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]), includes an option for exact or approximate matches but requires the lookup value to be in the first column of the data range.
HLOOKUP, or Horizontal Lookup, operates similarly but is designed for horizontal data layouts. In horizontally organized data, each column represents a unique record or item, and each row represents a different attribute or variable of that record. HLOOKUP searches the first row of a table or range to return a value in the same column from a row you specify. The syntax, HLOOKUP(lookup_value, table_array, row_index_num, [range_lookup]), mirrors that of VLOOKUP, adjusted for horizontal data.
And now for something completely different!
11. Combination Charts
So far, everything I've mentioned can be accomplished in Excel for both Mac and PC; for Combo Charts, however, there are noticeably more customization options on PC. Those options include not only more aesthetic choices but also the ability to use a secondary Y-axis, which is useful for relatively larger values like cumulative sums. So, word to the wise: if you're working on a Mac and want more customization options, use Microsoft's online Office 365 version of Excel instead!
Combination charts combine two charts into one, but you probably guessed that! They are especially useful when plotting financial information because Combo Charts allow for multiple data sources to be graphed in multiple manners on multiple axes within the same graph.
As you can see in the above example, they're especially powerful when trying to illustrate that "one of these things is not like the others." That can mean abnormalities or significantly larger or smaller values overall.
12. Pivot Table
This section gets my vote for the second most valuable thing one can take away from this article. Pivot Tables were the thing that finally made me an Excel Believer. A Pivot Table is essentially a real-time report page or a highly interactive and customizable group-by. If you want to impress a stakeholder, build them a Pivot Table and show them how to use it.
There have been times when I've been on a call with clients, going over EDA findings, and they've raised a hand to ask, "Hey, this is great, but I'm curious how many Xs per Y we've had," as a sinking feeling grew in my stomach because I hadn't done that exact grouping yet. The only thing that saved my bacon in those moments was knowing how to build a Pivot Table: I could construct that exact grouping from scratch in mere seconds and then manipulate it in real-time as more questions came through.
If you've already got your data formatted as a table, you're halfway there. Just select all, head to the "Insert" tab, and click Pivot Table on the far left. You'll get a prompt asking you where you'd like to insert the new Pivot Table and I recommend always putting it in a new worksheet.
From there, you need to ask yourself some questions about your data. What grouping would I like to see? More precisely, what are the categories I want to group by, and what values would I like broken down by those categories? When I was first learning, this part confused me.
The Pivot Table Fields can look intimidating if you've never used them before, but they're very simple after you've used them a couple of times. At the top, your Field Names are simply all of the columns within your table. If you're just learning, ignore Filters and Columns for now. Rows represent the categories you'd like to use to group your data and Values represent what data you'd like to see within those categories. For instance, if you want to see Sales by Date, then Date becomes your rows and Sales becomes your Values. If you'd like to see your Revenue by Product, Product is your Rows and Revenue is your Values. See, easy.
Two important things to keep in mind. First, for Rows, you'll ideally want to choose a field from your data with relatively few unique values (in technical terms, a discrete data category, one that only has a few possible options); otherwise, the Pivot Table will get very long very fast. Second, for each value, Excel needs to know exactly how you want those values grouped, or aggregated in technical terms. These aggregations can be sums, averages, maxes, mins, or simple counts. To choose which you'd like, simply click the small "i" button next to your field after you've added it to the Values panel. In the below example, I'm showing the average revenue made by product type.
Now, the really amazing thing about Pivot Tables is the ability to stack various groupings. For instance, if you wanted to combine the examples I mentioned and see the average sales by product by date, all you need to do is drag the "Date" field above Product within Rows. As you can see below, Excel automatically recognizes our Date column and splits it into Years, Quarters, Months, and Individual dates. We can then see each of the products sold on those days and the average number of sales for each. By looking closely at the groupings on the top left side, we can see those averages broken down by Month, Quarter, and Year.
Now let's say I wanted to give this to a client and let them quickly see the same grouping only limited, let's say, to a given country of their choosing. That sounds a lot like the Slicer we used recently. In fact, that's exactly what Filters are in Pivot Tables. If we add the Country field to the Filter panel, we can let the user quickly limit the existing grouping to the country of their choosing.
Finally, let's say I wanted to see at a glance the average sales per product per quarter, limited to China, and I wanted to see it broken down by year, but I didn't want to have to collapse a bunch of rows in the report to get there. That's where columns come in. Columns are the trickiest to wrap your mind around, but they're basically the highest grouping in the sense that they can show parallel groupings based on what you have so far. They duplicate the hierarchy you've built across a given category. Take a look below.
Since there are Quarters and Products for every Year (and only a few possible years), it makes sense to move Year into Columns. This basically duplicates the Pivot Table, performing the exact same grouping (average sales per product per date for China) for each Year and showing the results side by side.
Pivot Tables are one of those things that only really make sense after you've had a chance to play around with them. So the next time you're working with a table in Excel, I highly recommend making a Pivot Table and just dropping random Fields into random panels to see what you can generate!
I'd also recommend checking out the "Pivot Chart" option, which automatically generates visualizations based on your Pivot Table in real time.
13. Advanced Sort & Filter
Pivot Tables were pretty rough so let's do something a little simpler now. Advanced Filters for tables act like static slicers on top of regular sorts. Let's say you have a table like the example I've been working with and it's sorted descending by sale ID. That's all fine and well, but it would be nice to have it already displaying the top sales first when we hand it to a client.
What I've done in the example above is click inside my table, go to the "Data" tab, click the "Sort" button, and create a hierarchy of sorts so that the highest sales are on top; where sales are equal, the highest revenue is on top; and where revenue is equal, the highest profit margins are on top. This also has the added benefit of making our cell highlights look even nicer.
Now let's say I only wanted to show transactions where customer satisfaction was "Very High." I could simply right-click on that value and set it as a filter. The resulting table will only display rows where that value matches. Kind of like a slicer that's always on.
Clicking on the "Advanced" button next to the filter allows you to further set your criteria range and specify the conditions data must meet to be included in your filtered results.
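If you'd prefer a formula-driven version of the same idea (one that pairs nicely with the dynamic arrays from earlier), FILTER() can spill the matching subset into a fresh location. A sketch, assuming the table data occupies A2:H51 with customer satisfaction in column G (hypothetical ranges):

=FILTER(A2:H51, G2:G51 = "Very High", "No matches")

The third argument is what displays when nothing qualifies, and the result recalculates whenever the source data changes.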
14. Geographical Data Types
Now here's a fun one. Excel has a few "special" data types it can work with. You probably already know the basics, like whole numbers, phone numbers, or addresses, but you can also specify when a cell's value contains Stock, Currency, or Geography data. Doing so allows Excel to display special info: it can automatically grab real-time stock info from stock tickers or real-time exchange rates for currencies. However, the one I end up using most is Geography data because it allows you to build filled map charts quickly and dynamically.
Above you can see I've gone to my "Data" tab, selected my table's column which contains country info, and changed its data type to "Geography." Now a little map appears next to each of the values. This works for full addresses, city names, states, and zip codes. Clicking on the little map icon gives you info about that place. You can even generate more columns from information from Bing such as location descriptions.
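Linked data types also support dot notation in formulas, letting you pull fields directly into cells. A sketch, assuming a geography-typed country name in A2:

=A2.Population

This returns that country's population as an ordinary numeric value you can chart or calculate with; other fields work the same way, depending on what Bing provides for the location.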
Now, to get info onto a filled map in Excel for Mac, I tend to make a simple Pivot Table where the columns are the geographical locations and the values are... well the values. We do this because the Filled Map function needs each individual location to be its own column so we need to "pivot" our data. After creating the simple Pivot Table, I copy and paste the resulting table into the worksheet where I want the filled map and convert the location names back to the geographical data type as shown below.
Then, it's very easy to select this new small table, go to the "Insert" tab, and click Filled Map as shown below.
From here, you can customize the look of your filled map! This is great for those who want Tableau-style geographical visuals but don't know how to use Tableau or don't want to pay for it.
Use caution if you're using more granular location data, as the Filled Map function becomes more complicated the more geographical locations you're working with. Six countries is no problem; 1,000 individual zip codes might cause your laptop to catch on fire.
15. Power Query
One of the reasons Excel wasn't taken seriously as a tool for data science for a long time was its limitations on what kinds of data sources you could connect to and how much of that data you could import at a time. Not that long ago, you couldn't connect to a server and you were limited to roughly a million rows. Additionally, any changes you made to your data weren't replicable like they would be if you wrote a Python script to perform your cleaning automatically.
Power Query solves all of that. Power Query is really the key feature that elevates Microsoft Excel to the status of a true Data Science tool because it allows you to parse "Big Data" sources far larger than simple .CSVs.
Now, I could probably write an entire article roughly the same length as this one dedicated solely to Power Query, so I'm going to limit this section to the basic reasons you should learn to use Power Query and some of its best features. At the highest level, you'd want to use Power Query when you need replicability (the ability to perform the same processes over and over for every newly imported dataset) or when you need to perform complex data cleaning or validation on a particularly large dataset. Power Query is also quite useful for feature engineering, which is a term for deriving new data columns from existing ones.
To get started importing data with Power Query, go to your "Data" tab, then click the Get Data button on the far left side of the ribbon. That will display all of the various ways you can connect to external data sources. For our purposes here, I've generated a .CSV with many, many data validation problems.
The first thing you'll want to do is check to make sure all of your data has been properly parsed. That's to say that all of your columns are properly sectioned off. You can change the delimiter if they aren't. Once that's done, click "Transform Data" to launch the Power Query Editor.
As mentioned, I don't want this section to go on forever, so I'll just quickly explain what you're seeing here and what I'd like to do with this data. I'd like to make sure all of our data is valid, meaning it's all formatted the same way and that new data entered into the same dataset will be subject to the same formatting if I run the same processes again. Then I'd like to create a new column that contains a rolling 3-month average based on sales.
From the main editor page, you can see all of the external sources you're connected to in the left side panel. On the right side, you can see a history of every transformation applied to your data so far (this is where the replicability comes from). In the center, you see your data and your data types per column, as well as the percentage of valid rows, empty rows, and rows imported with errors. As you can see, ours are 100% valid across the board because we haven't told Power Query what valid data looks like yet, and it has imported the "Not a Number" values from our .CSV file as strings that read "nan," which count as perfectly valid text here.
On the ribbon itself, we see two tabs with a simple but crucial difference. Clicking on both reveals that the Transform tab and the Add Column tab have very similar tools; however, Transform applies changes to your data in place, whereas Add Column creates a new column containing new data based on whatever change you've made.
First things first for our validation: removing duplicate values. From the "Home" tab, I click Remove Rows and find the Remove Duplicates button. Power Query removes all duplicate rows and then adds that process to the history panel on the right. If I ever wanted to "go back in time" on our analysis, I'd just click on a higher step and our data would revert.
Next, I want each one of my customer's names to be capitalized. I go to the "Transform" tab, click Format, then Capitalize Each Word. It does so and I see a new step added to our Applied Steps.
Now, usually our nans would register as missing values and be flagged as "empty" cells. Since our nans are just strings, I use a simple right-click and replace to set those cells to "Unknown" instead.
I've also noticed that my "Categories" column has subcategories for products in Electronics. They're separated by a pipe symbol, "|". I right-click my Categories column and select "Split Column." Then I choose my custom separator and hit OK. This splits my column into two. It also means that I now have null values in my newly created subcategories column, so I'll go back to my replace option and swap those with "n/a."
I've also noticed that there are periods in between some of the names. Using the "Replace Values" button from the ribbon in the "Transform" tab allows me to swap those with a simple whitespace.
I know that I have no negative sales, but it appears that my dataset contains some negative values, so I right-click the column, select Transform Column, then convert each of the values to its absolute value.
That looks good enough to me! Now that all of the data is cleaned and validated, I go to the "Home" tab and select Close & Load to add my data to my main Excel file.
Finally, the thing that makes Power Query so worthwhile: if you're on a PC, it's possible to export a .pq file which saves your query for reuse later. If you're on a Mac, unfortunately, this isn't possible, but you can always revise your previous steps and copy the M-code step by step onto new data.
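For the curious, the M code recorded behind steps like these reads fairly naturally. A sketch only; the file path and column name below are placeholders, not my actual dataset:

let
    Source = Csv.Document(File.Contents("C:\data\sales.csv"), [Delimiter = ","]),
    Promoted = Table.PromoteHeaders(Source),
    NoDuplicates = Table.Distinct(Promoted),
    Capitalized = Table.TransformColumns(NoDuplicates, {{"Customer Name", Text.Proper, type text}}),
    NoNans = Table.ReplaceValue(Capitalized, "nan", "Unknown", Replacer.ReplaceValue, {"Customer Name"})
in
    NoNans

Each Applied Step in the right-hand panel corresponds to one of these named expressions, which is exactly why the steps can be replayed on fresh data.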
16. Correlation
Okay, that was rough. Let's do another fun one.
Correlation in the field of statistics is a complicated process, but the idea behind it is very simple. If I have two series of data and a number goes up in one series, how likely is it that the corresponding number in the other series will go up, go down, or stay the same? You run a series of calculations, which I won't get into here, and you're left with a single number known as a "correlation coefficient." That number ranges between -1 and 1. A correlation coefficient close to 1 means that when one series goes up, so does the other. A correlation coefficient close to -1 means the opposite: when one series goes up, the other goes down. A correlation coefficient of 0 means there's no noticeable relationship. There are also various kinds of correlation calculations, some more rigorous than others, but we'll leave it there for now.
While knowing how correlated two sets of data are doesn't tell us everything (and in fact can even be misleading), it does give us a strong indication that two data sets share some kind of relation that should be further investigated.
There are three ways to do correlation analyses in Excel, two of which I'll cover now and one of which I'll save for a little later.
Firstly, there's a function. This doesn't look like much, but it will come in handy later. Using Excel's CORREL() function, all I need to do is pass two arrays of data and Excel immediately returns the Pearson correlation coefficient. As we can see from the near-zero score, my randomly generated data are not very correlated.
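The call itself is a one-liner. A sketch, assuming the two series live in B2:B51 and C2:C51 (hypothetical ranges):

=CORREL(B2:B51, C2:C51)

Swapping the two arguments gives the same result, since the Pearson coefficient is symmetric.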
We can confirm that with the second method, plotting a scatter plot using the two arrays. Typically, if the data points are arranged more or less in what looks to be a diagonal line, then there's likely a correlation. From this jumbled mess, I can tell from a glance there's no correlation here. I'm attaching below what it would look like if there was a correlation for reference.
17. Linear Regression (Manual)
Pay no attention to the word inside of the parentheses for now.
A Linear Regression is a way of taking two correlated data series and using a "line of best fit" on their values to predict future values to a degree of certainty. Linear Regression is the most basic machine learning algorithm, so simple in fact that it can be done with a pen and paper. That is exactly what we're going to attempt here, more or less.
Why would we do that? Well, I'm always a fan of teaching folks to fish as opposed to giving them a fish. In your future or current data reports and presentations, it may help you in the long run to explain to any non-technical stakeholders how machine learning algorithms make predictions. Visualizing the most basic version helps.
The statistical formula for performing a Linear Regression is y = mx + b, where:
y is the value we're predicting (the dependent variable);
m is the slope of the line of best fit;
x is the known input value (the independent variable);
b is the y-intercept, the value of y when x is 0.
We can calculate m and b using Excel functions. All we need to do after that is plug those values into the formula. For this example, let's see if Profit Margin Percentage is a good predictor of Sales (I have a suspicion it isn't).
Since we're trying to predict Sales, it is our dependent variable or "known y's" so it goes first in Excel's SLOPE() function. Profit Margin comes next. Poof. We have our slope.
We then do the same exact thing only using Excel's INTERCEPT() function. Congratulations, you just technically built a machine-learning model! Now we have all the pieces to perform a Linear Regression and try to predict a price.
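Here's what those two calls might look like, assuming Sales in B2:B51 and Profit Margin Percentage in E2:E51 (hypothetical columns), with the slope placed in H2 and the intercept in H3:

=SLOPE(B2:B51, E2:E51)
=INTERCEPT(B2:B51, E2:E51)

A prediction is then just the formula assembled by hand: for the profit margin in E2, that's =H2*E2 + H3.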
Okay. Let's try to use the first sale in our data set to test how accurate our model is. Plugging in all the numbers we get:
Hey! 49.01 is pretty close to our real value of 51 for that sale! Our model might actually be usable after all.
The way we'd test that is by checking the R² (R-squared) score, which measures how much of the variation in our known dependent variable the model actually explains. Excel has a function for this, RSQ(), and running it gets us R² = 0.00025, with scores closer to 1 meaning the model predicts well and scores closer to 0 meaning the model doesn't predict anything sooo...
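That check has the same shape as the earlier ones, assuming the same hypothetical columns:

=RSQ(B2:B51, E2:E51)

RSQ() takes the known y's first and the known x's second, just like SLOPE() and INTERCEPT().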
Pobodies nerfect! At least we learned something. On to the next one.
18. The Analysis Toolpak Add-In
I really saved the best for nearly last here. If you haven't messed around with Excel's Add-ins, I highly recommend doing so. The Add-in I'd like to focus on right now is called the Analysis Toolpak, and it lets you perform many advanced statistical and machine learning analyses very quickly from within Excel.
See I told you there was a third way to do Linear Regression.
If you navigate to the Tools drop-down menu and then to "Excel Add-ins..." you'll see an option to check a box next to Analysis Toolpak. Doing so will enable new "Data Analysis" menus elsewhere in Excel.
One location is back in the Tools drop-down menu listed as "Data Analysis..." and clicking that button will launch the above menu with many advanced data analysis and statistical methods you can apply to arrays. Be careful to be exact with your array ranges. The Toolpak doesn't automatically ignore headers.
Clicking Regression allows us to select our y and x variable arrays as well as an output cell. Clicking okay gets us this...
Much faster than our previous method, but seeing this also lets us confirm that our previous measurements were correct!
19. Descriptive Statistics
Using the Analysis Toolpak, we can very quickly calculate many descriptive statistics, including the central tendency of a given array. I mention it here to emphasize that this replicates a very common tool: Pandas' describe() method in Python.
Running it on our Sales column returns this:
20. Logistic Regression
Okay! Let's go ahead and put it all together with one last big swing.
First things first, what is a Logistic Regression and how does it differ from a Linear Regression? Put simply, whereas a Linear Regression predicts a single continuous numerical value, a Logistic Regression predicts which of two binary categories an observation belongs to. In our case, I'd like to build a Logistic Regression model that predicts whether or not our observation came from a given country.
To do this, we'll need a new binary column that is 0 when an observation did not come from the given country and 1 when it did. Then we'll need to do a little bit of Excel trickery to get the program to perform the calculations we want, but if you've made it this far, it should be a piece of cake.
To get started, I convert my Geography data back into regular text by selecting the country column, right-clicking, going to Data Type, and selecting "Convert to text." Then, I make a new column with a formula that takes every cell in the country column, removes non-printable characters, removes any trailing whitespace, and then, if the remaining text says "United States," records a 1, with any other value getting a 0. This column is going to represent our training labels. Ideally, once the model is finished, we'd be able to pass it some Sales, Revenue, and Profit Margin values, and it would return, with a degree of certainty, either a 1 or a 0 for whether or not that sale came from the US. Then, for simplicity, I delete every column that isn't my dependent variable, independent variables, or reference.
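That labeling formula might look like the following sketch, assuming the country text sits in column B:

=IF(TRIM(CLEAN(B2)) = "United States", 1, 0)

CLEAN() strips the non-printable characters, TRIM() drops the stray whitespace, and IF() converts the comparison into our 1/0 label.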
Just to get the technical aspects out of the way, the logistic function can be defined as P(Y=1) = 1 / (1 + e^(-(β0 + β1X1 + ... + βnXn))), where P(Y=1) is the probability of the dependent variable equaling 1 (success), e is the base of the natural logarithm, X1, ..., Xn are your independent variables, and β0, β1, ..., βn are the coefficients to be estimated.
Let's start with the placeholders for the coefficients and intercept. We need starting values that can be adjusted for each input. I'll create new cells to hold the coefficients at the bottom of the table so they don't get in the way, giving them starting values of 0.01 (don't think too hard about that; just choose any number close to zero for now). I'll also make a cell for the intercept.
Now we need some new columns for the statistical quantities we'll use to optimize the cells we just made. I'll add those to my slimmed-down table. Logit is the log-odds of the country being the US, eLogit is e raised to the Logit (the odds), Probability is the probability of it being the US expressed as a real number between 0 and 1, and Log Likelihood is a measurement of fit.
We use the placeholder values to calculate Logit.
Then Excel's EXP() function to calculate eLogit.
Then we use eLogit to calculate Probability.
Then we use Excel's LN() function to get the Log Likelihood (LN(Probability) for rows where the target is 1, and LN(1 - Probability) for rows where it's 0) and use the table's total row to get the sum of our Log Likelihood column.
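A sketch of the four helper columns, assuming the 0/1 target in column A, the Sales/Revenue/Profit Margin predictors in B through D, the coefficients parked in $K$2:$K$4, and the intercept in $K$5 (all hypothetical placements), with Logit in E, eLogit in F, and Probability in G:

Logit:           =$K$5 + $K$2*B2 + $K$3*C2 + $K$4*D2
eLogit:          =EXP(E2)
Probability:     =F2 / (1 + F2)
Log Likelihood:  =A2*LN(G2) + (1 - A2)*LN(1 - G2)

The last formula collapses the two LN() cases into one expression, since one of the two terms is always multiplied by zero.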
Now we navigate to the "Data" tab and launch our newly enabled "Solver" from the right side of the ribbon (it's turned on from the same Add-ins menu as the Analysis Toolpak). We set the objective to the sum of the Log Likelihoods, set "To" to "Max," set "By Changing Variable Cells" to our coefficient and intercept placeholders, uncheck the "Make Unconstrained Variables Non-Negative" box, and lastly make sure our solving method is GRG Nonlinear. Then hit Solve!
Now, I hate to end this anticlimactically, but since my data is randomly generated, proceeding any further is going to get pretty nonsensical. At this point, Solver will have replaced your placeholder values with the fitted coefficients.
If you'd like to proceed to make predictions using your newly fitted logistic regression model, please check out the article that helped me put together what I have so far!
Check out this article on how to do it with your actual data: https://www.datamation.com/big-data/logistic-regression-in-excel/
So long and sorry about your headache!
If you're new to using Excel, I'm sure that was a lot to take in, assuming you didn't just skip to the bottom. I hope this shows you the value of Excel and has officially converted you into an Acolyte like me!
If you use Excel for your job, no matter your field of work, I hope you took away at least one new method you can use to make your life easier or your reports and dashboards more fantastical!
Did you enjoy this article? Did I miss something? Got a better method? Leave a comment below and let me know! And make sure we're connected so you see the next article I have planned when it comes out!
Got a data or media production related question? I'm your guy! Drop me a line and I'd be happy to chat.