Interactive Simulations in Excel

Interactive Simulations in Excel

Data Scientists and Data analysts often gravitate towards the same kinds of visualization and interactivity tools. You will find them creating Power BI dashboards, Tableau reports, or even using R and its Shiny package to deliver interactivity.

The issue is that all of these require some kind of enterprise level buy in. You will need somewhere to host your reports and your dashboards. In large organizations that have the time, money, and staff to implement and maintain these solutions it may not be an issue. But for smaller organizations or for work that is specific to a single team you may not have the luxury of a large-scale or commercial solution.

The question becomes how can data analysts and scientists provide information and interactivity to a non-technical audience with standard business software.?

Turning Excel Workbooks into Dashboards

The answer is that many of the reports and dashboards that use static data can be recreated in excel. Doing this may require some creativity in breaking down problems into excel formula digestible steps or creating multiple columns and sheets to work through calculations.

While this can be a little more work upfront, it can be beneficial in that excel has the ability to update tables and charts with new input. While also creating report ready (copy and paste into word) output. And it does all of this with an interface your business users already know.?

Creating A Simple Example Simulator

You may be asked to take a dataset and apply some rules for performing a calculation. Where the business unit wants to know what would happen if some part of the calculation were changed.

Let’s use an example where we are asked to bin or group some data. They want to know the number of data points that would fall into the different bins if they were to change the thresholds for those bins.?

Step 1. Load the Data

We have a static dataset that we want to group into different bins. In this instance it is a single column of percentages. So we copy that data into a column in our sheet.?

Step 2. Define our Bins

With the data loaded we want to create an initial definition of our bins. This will be a minimum and maximum value that determines what bin a data point goes into. Your business users may have an initial set of ranges they are interested in. If that is the case, you can use that as your default values. Or you can set the ranges yourself. Since the end result is a set of ranges that are adjustable the initial values are not overly important.

A helpful tip is to give each bin or group an easy to remember label. Here we will use the labels Bin A, Bin B, Bin C, etc. creating five bins and creating bins with 20% ranges where the final bin will capture any remaining data points down to 0%.?

Step 3. Count How Many Data Points Are In Each Bin

This is where the magic happens in terms of what the users want to know. We have our data and our boundaries or minimum and maximum values. How do we find the number of elements that fit inside of our bins?

In this case we can use the COUNTIFS function. This will allow us to count how many of our data points meet multiple criteria. In our example the criteria would be finding any percentages that are above our minimum bin value AND less than or equal to our maximum bin value.

As a formula that looks like

=COUNTIFS(
  [DATA COLUMN],
  ">[Min Bin Value]", 
  [DATA COLUMN],
  "<=[Max Bin Value]"
)        

This returns a simple count of the number of items that fall into a given bin or group. This count is often all the information business users want. I typically like to add some extra context to this information. Things like a total count of the data elements, and what percentage of the total data each bin makes up.

This extra information does two things. First it helps me check my work by making sure I am not double counting our missing any data. Second it gives the business unit the ability to understand the magnitude of any changes they make or see how the underlaying data may be skewed.?

Step 4. Adding Additional Details

As mentioned above a simple check you can add to help make sure you captured all of your data is to total the number of items in all of the bin. This should match the number data points you feed in. If it does not, it tells you that either your formula is incorrect or your bin ranges may be missing data points and it is something you should investigate. This check can be implemented with a SUM function over the range where you have your bin counts.

Calculating the percent of the total number of data points allows you to easily see how much of your data the each bin or group represent. You could infer this with just the count and the total, but having the computer show you this number makes it easier to understand at a glance.

The percentage can calculated with another SUM function where you divide the Bin Count by the total number of data points. The total number of the data points if obtained by using the COUNT function on your data column. As a note this function will only count the number of cells in your data column that contain numeric values. This is beneficial because it will ignore any column headings. As a final check you can then also sum your bin or group percentages to make sure they equal 100%.?

The picture below shows what this would look like in practice. You have a data column, A small set of instructions for your business users. The table labeling and defining your bins. Then the columns showing your counts and percentages.?

No alt text provided for this image

we can have excel show us the formulas used in each of the cells to identify how this all work. It is all created using the COUNT, COUNTIFS, and SUM functions.

No alt text provided for this image

This may be a very straightforward use case for people who are reasonably skilled at using Microsoft Excel. But this level of interactivity has saved me from countless requests to rerun simulations with small variations in ranges. Business users are able to take this and work though different scenarios in their meetings rather than stopping and coming to you with a new set of ranges to run for them.

Building A More Complex Example Simulation

The simple use cases are great but what if you need more information? What if your simulation requires more decisions points, or uses more data, or has more complex calculations before you can return simple number to your users? What if they want visualizations to go along with the numbers?

We can do all of these things, it will just require a little more work and using a few more of excels functions and data structures.

Defining our data set

In this example we will look at three different groups of data.

  1. The first group is entities that aggregate up from entities in the second group.
  2. The second group is a list of all of the individual entities in our data.
  3. The third is a subset of entities in the second group that have a common characteristic.

The diagram below shows this visually to help conceptualize our different groupings. In your specific cases these groupings could be anything. Group 1 could be industries you serve, group 2 could be each company you do business with, and group 3 could be any of the businesses in group 2 that have a similar characteristic. Something like the companies that make annual purchases over a certain dollar amount with you.?

No alt text provided for this image

Measures

Now that we have our entities grouped, we want to select some common measures we can use for our analysis. In this case we will call them Measure 1, 2, and 3. These will depend on what kind of comparisons you are hoping to make. In this case we are going to weight and combine our measures to create an overall weighted score that we will then use to group our entities into categories.

A complication we will run into is that we have what is known as “sparse data” which means that not every entity has a value for all of our measures. Group 1 will have the most complete set of data as it aggregates data from group 2. Meanwhile, groups 2 and 3 may not have data for one or all of our measures. We can reconcile some of this by using more than one calculation method, but there will be cases where some entities are not included in our final results due to missing data.

Weighting

If we are going to combine these measure we need some sort of formula that we can apply across all of our values to come up with a single number for them. In this case we are going to take the measure and apply a set of weights to them. That means we will multiply them by a percentage and those percentages will total up to 100. We then sum them together to get a combined weighted score.?

No alt text provided for this image

We will be using two sets of weights. The first is applied if the entities have data for all three measures.?The second set is applied if an entity is missing a value for measure 1 but sill has data for measures 2 and 3. As we mentioned above there will be cases that do not match this table and these weights, so they will not be included in our final results.

This exclusion will be important to remember because later when looking at the total number of results you may wonder why they do not match the total number of entities in the data.?You may need to investigate if that difference can be explained by how many entities are missing measure data.

Desired results and visuals

Now that we have defined our data, our measures, and our weights. What are our users expecting to see in the end? Like the simple example we will want to show them the number of entities that fall into each category, and calculate the percentage each bin or category holds.

We will also add an additional dimension to this by including the county that each entity resides in and counting the number of entities in each category by county. In our case there are eight counties.

We will want to show a bar chart comparing the different groups and how many are in each category. We will want to show a histogram of weighted scores for Group 1. We will also want to show the average category rating for each county (using numerical representation).

The picture below shows what we expect to deliver to the user in the end.

No alt text provided for this image

We see all of our desired visuals and tables. As well as some instructions on how to use this simulator. Like our simple example when the business users change the table values in red all the other fields and charts will recalculate and update automatically.

Setting up the Simulation - Adding in our data

In this case we have two separate data sets we will be looking at. The first is Group 1 data which is the aggregated numbers from our second group. This data set will have the following columns.

  • Agg_Entity_UID - This is a numeric value used to identify and separate entities in the data
  • Agg_Entity_Name – This is a human readable name for the entity
  • Measure_1 – This is a numeric value for our first measure
  • Measure_2 – This is a numeric value for our second measure
  • Measure_3 – This is a numeric value for our third measure
  • Weighted_points – This is a calculated column that we will create in this article using formulas
  • Category_points – This is another calculated column that we will create using our weighted points and table to associate a point value with a category
  • County – This is a numeric value that represents one of our eight counties. For Group 1 the county is determined by which one holds the majority of our aggregated entities

We will create a new sheet in our workbook with this data and call it “Group 1”

No alt text provided for this image

For group two we will also create a new sheet and name it “Group 2” and it will contain similar data but with a few additional columns.?

  • Agg_Entity_UID - This is a numeric value used to identify and separate aggregate entities in the data
  • Agg_Entity_Name – This is a human readable name for the aggregate entity
  • Entity_UID - This is a numeric value used to identify and separate entities in the data
  • Entity_Name – This is a human readable name for the entity
  • Measure_1 – This is a numeric value for our first measure
  • Measure_2 – This is a numeric value for our second measure
  • Measure_3 – This is a numeric value for our third measure
  • Weighted_points – This is a calculated column that we will create in this article using formulas
  • Category_points – This is a calculated column that we will create using our weighted points and table to associate a point value with a category
  • County – This is a numeric value that represents one of our eight counties where the entity resides
  • Subset – This is a numeric value that identifies the entity as belonging to a specific subset. This is how we will identify entities that belong to Group 3

No alt text provided for this image

Data Structure

A very useful feature in excel is converting a range of cells into a Table. You can do this with the “Format As Table” button in the Home ribbon.

No alt text provided for this image

The reason this becomes useful is that it allows us to reference Tables and Columns by name rather than using Sheet and column letter/number combinations. This turns something ‘SheetName’!A1:A200 into TableName[ColumnName] which will be much easier to work with when writing formulas. As well as being very helpful when trying to track down errors.

Tables also make sure that a constant formula is used across all rows in a column and will warn you if it finds any inconsistency.

Named Variables

In addition to using tables we can specify specific cells to name so that we are able to call them in formulas by that name. We will use this feature to name all of the variables that our users enter. That way we are able to reference them easily with names that we have defined.?

To do this we select the cell we want to name and use the Define Name button in the Formulas ribbon.

No alt text provided for this image
No alt text provided for this image

You will be shown a pop up window where you can define a name and a cell reference. Now we will be able to reference whatever value is in this cell by using the name ‘Measure_1’. In the end we should wind up with a set of named fields similar to the ones in the image below

No alt text provided for this image

Calculating our Weighted Points

We started with the data from our three measures but now we need to combine them into a single weighted score. We also know that we will need to apply different weights for different sets of data availability.

This may sound difficult initially, but we can use a simple IF function to tell excel to do different sets of multiplication. The difference between the two sets of weights is whether or not Measure_1 is in our data set for a given entity. So, we tell the IF statement to test if the value in Measure_1 is #N/A by using the ISNA function. If it is #N/A we use our set of partial weights. If the Value is not #N/A then we use the full set of weights.

=IF(
  ISNA([@[Measure_1]])=TRUE,
  SUM(([@[Measure_2]]*Measure_2_partial)+([@[Measure_3]]*Measure_3_partial),
  SUM(([@[Measure_1]]*Measure_1)+([@[Measure_2]]*Measure_2)+ ([@[Measure_3]]*Measure_3)
)        

Notice that since we have the data formatted as a table. We are able to reference data columns as [@[Column_Name]] when doing calculations inside of that same table. This is helpful as it will allow us to copy and paste this formula in both of our group data tables and it will calculate the weighted points for that table.

Converting Weighted Points into Categories

Now that we have a single number that represents our measures. We want to use this to group our entities into categories. This is similar to our simple example in that we will have defined ranges that our users will be able to adjust for each category. Since we will want a little more information than just counts, we will need to create a new column where we label each entity with what category it would fall into. To do this we will create a new table with the user defined ranges, we will create a numeric value to represent each category, and we will use a Ranged VLOOKUP function.?

No alt text provided for this image

So looking at the picture above we can see that we defined the categories, assigned a numeric value to each category, and had the users define the minimum value for each weighted point range. The reason we do not have them enter a maximum value is that the maximum will be defined by 100 at the top and then the minimum value for each of the next categories. When they change the minimum value, the maximums will adjust automatically. This gives them one less thing to think about and also prevents them from creating overlapping ranges. The formula for this is just a reference to the appropriate cell.

No alt text provided for this image

With our ranges defined w can now create a range lookup table. The reason we do this is because on the sheet that users will see we have them enter ranges from high to low. But when using a VLOOKUP excel we will need it to be arranged low to high. A simple solution for this is to name the minimum value cells the same way we did for the weights. Then create a second table on the sheet that holds our group data where we have these minimum values and their associated points arranged the way excel needs them (low to high).

The image below hopefully helps show that we are just inverting the table the users enter the minimum ranges into.

No alt text provided for this image

Creating the RANGED VLOOKUP

We have our user defined ranges and a lookup table that we can use to associate point values. So now we need to create a column where we will use a VLOOKUP to convert the weighted points into the category points.

=IFERROR(
  VLOOKUP(
    [@[Weighted_points]],
    'Group 2'!$M$3:$N$7,
    2,
    TRUE
    ),
  #N/A
  )        

To walk though this formula we see it is a VLOOKUP that is wrapped up in an IFERROR formula. We do this because we want to make sure we return #N/A for any values that having missing data. Those #N/As will not be included when we count our totals.?If there was no error handling they may return a 0 which would be counted by our other formulas later.

In the VLOOKUP function itself we specify the column in our data table that we want to use to search our lookup table for (here its our weighted points). Then we define the Sheet and Cells where our lookup table exists. In this case it is Sheet “Group 2” and our lookup table is in cells M3 to N7. The number 2 represents what column in our lookup table we want to return the value for. This would be the second column which means our VLOOKUP will return the category point values associated with the weighted point value. The last argument TRUE tells the VLOOKUP that we want it to use the first column in our lookup table as a range. Meaning any value between say 0 and 60 will be associated with a point value of 1. Or any value between 60 and 70 will be associated with a point value of 2, etc.

Just to show how this looks on the sheet with the data tables. We create this lookup table next to our data table. And with the Ranged VLOOKUP it is able to identify what point value should be associated with the weighted points. Resulting in a column in our data table that shows the category point value.?

No alt text provided for this image

This would be the same process on the Group 1 sheet.

Creating our Counts and Percentages

We have now brought in our data, combined our measures into a single metric, and used that metric to separate the entities in our data into category groups. Now we want count how many of them fall into each category for our different groups.

To do this we can use a COUNTIF function. This function allows us to count the number of data elements that match a given criteria. In our case we would tell the function to count the number of entities in our data that match a categories numeric point value.

An example of what that would look like for Group 1 and Category A would be

=COUNTIF(
  group1[Category_points],
  "=5"
)        

We would then copy this formula down and change what point value we want to match on to reflect our different categories. We then calculate the percentage of the included entities in each category wish a SUM function similar to our simple example

?To show what this looks like the image below shows the formulas for Group 1.?

No alt text provided for this image

We would do the same for Group 2. However, when it comes to Group 3 things get slightly more complex. We need to use a COUNTIFS formula instead. This allows us to add multiple criteria that must be true in order to be included in the count. Since group 3 is a subset of entities in Group 2 we need to use the subset identifiers to keep only the entities who belong to Group 3.

In our example Group 3 is made up of Group 2 entities that are either in subset 7 or subset 8. So we add these conditions to our COUNTIFS formula so only results that are equal to our category points AND are in a subset greater than or equal to 7 AND in a subset Less than or Equal to 8 are counted. This ensures that we only count entities that are in subset 7 and 8.?

=COUNTIFS(
  group2[Category_points],"=5”,
  group2[Subset],">=7",
  group2[Subset],"<=8"
)        

This completes the first part of our interactive simulation. We now have a table where users can adjust the category ranges to see how many entities in our data would be sorted into the different categories. Next we will work on creating visualizations to help users conceptualize the differences between the groups.

Visualization of the Group Counts

Given our completed count table we want to add a simple visualization for our business users to quickly compare these results across groups. To do this we can add a simple bar chart with multiple series.

From our table we will select the Category, Group 1, Group 2, and Group 3 columns. And Inset a bar chart.

No alt text provided for this image

The result should look similar to the one above. We want our axis labels to be our categories, and for there to be three data series, one for each group.

Histogram of Group 1 Weighted Points

In order to help the users target the category ranges to different parts of the data it would be useful to show them the shape of that data. We do this by giving them a histogram of the Group 1 weighted points. This will let them get a rough estimate of how many entities exist in a range before entering new ranges into the simulator to get a full count. This will also let them see how changes to the weights will affect the shape of the data.

To do this we go to the sheet with our Group 1 data and select the Weighted Points column and insert a histogram. We can then format it and move it to our dashboard sheet.

No alt text provided for this image
No alt text provided for this image

County Table

One of the additional data breakdowns requested was to show the number of entities in each category by county. So, in our data we have the county each entity is in, and we can use the same kind of COUNTIFS formula from above to display this breakdown.

=COUNTIFS(
  group1[Category_points],
  "=5",
  group1[County],
  "=1"
)        

Looking at this formula we see that it will only count entities that match a given category point value AND match a given county number. In this example we are looking for category A which has a point value of 5, and county 1.

When we expand this out we can create a table that shows the number of entities for all combinations of county and category.

No alt text provided for this image

As an additional piece of data I do include the Average points per county. This provides a quick estimation of what category an entity is likely to belong to if you know its county. It is calculated using an AVERAGEIFS formula.

=AVERAGEIF(
  group1[County],
  "=1",
  group1[Category_points]
)        

Here we see the formula will average the category points for any entity that has a county equal to 1. We can then write that formula for all of our counties by adjusting the county numeric value.

To create the graph for these averages we again select the average values and insert a bar chart.

No alt text provided for this image

We can repeat that process for Group 2 as well.

Final Notes

We have now created all of the parts needed for our simulator. We can send this over to our business users and walk then though how to use it. They should be able to use this to guide their discussions and decisions. They will no longer need to come to you for every set of numbers they wish to run. They only time you will need to be involved is when they decide to change how a calculation is performed.

Even in situations where the calculation methods change, and you are needed. It can be faster to make those updates in excel if you have set up your simulations using data structures and variable naming in ways that let you work in higher level abstraction.

Hopefully this has been helpful in in demonstrating ways to use excel in place of some of the more advanced tools. Either to navigate technical and infrastructure challenges or just simply to meet your users where they are.

要查看或添加评论,请登录

Mark Richards的更多文章

社区洞察

其他会员也浏览了