登录查看更多内容

Pirate Play Part I: Creating the Data in R

Dr. Aria Fredman (ADHD, ASD, PTSD)

UX Researcher

发布日期: 2020年8月18日

In this article I want to create a dataset with attributes of fictional pirates. Note that I am aware that this article has a ridiculous premise, but I did it because while producing this type of content (that is, demonstrating that I can do data science, not data science with silly stories) while job hunting is important for visibility (plus I remember it so much better after writing about it), at some point it is more fun if I let some silliness in; and if I can add some fun to my life while job hunting during a pandemic, I am going to seize opportunity.

It’s also high time (or is it high tide?) someone wrote a data science blog post filled with terrible pirate puns and R code. Please note that these data are fictional, and do not come from any real company. I think of this article both as a portfolio and instructional piece, and therefore I think that the validity of these data are far less important compared to the code, and what I show that we can learn from potential findings (actionable insights FTW!).

The scenario I’ll be working with for all of the parts of this exercise is as follows: Imagine that you discovered the true negative correlation between number of pirates in the world, and global warming. Also imagine that you are unfamiliar with the idea that while correlation might be causation, it might not be. If that was the case, I would recommend you watch the short talk I gave on the topic at a recent data science conference, because we would not want you coming to the erroneous conclusion that you can fight global warming by getting volunteers to take up piracy.

Now imagine that you did not watch my talk, came to the erroneous conclusion, and have started a non-profit promoting piracy for volunteers. Unfortunately (for you; the rest of us will find your pirate woos quite lucky), despite training your pirate volunteers, you are finding that churn is a real problem, and you would like to understand it better, so that you can create interventions to increase pirate retention. To clarify, you are focusing your investigation on the pirates that quit out of their own volition, and not those who died in battle.

You only recently understood the value of continuous listening, and ongoing feedback from your volunteers, but have not yet had the chance to implement such a program within your organization. You also do not have any qualitative data, which is unfortunate, and it is often what we need to understand WHY people are doing what they are doing; quantitative research tends to be much better at explaining WHAT people are doing. Since you need some actionable insights despite not have the luxury of collecting more data, you turn to the data that you do have, and to your favorite pirate coding language: Argh! Um, I mean R.

Since this is the first part in the series, we will first create the data that we will analyze in subsequent parts. We want to end up with 13 columns:

First mission date
Last mission date
Crew number (which is assigned at the start of the volunteering)
Pirate ID
Mission count of how many missions were completed
Total number of mission hours completed
The level the pirate reached, which is based on how many missions were completed
Age
Occupation, with options being student, working, retired, or other
Whether there is a preference for day missions, night missions, or both
A satisfaction rating, which is given by other crew members post mission, and then averaged
What US state the pirate hails from
Whether any of the pirate’s crew members died during a mission

The code for this article can be found here.

We need to start by loading the libraries we’ll be using during this exercise.

We also want to set a seed so that we can all create replicable data, and get the same results.

Let’s start with creating the first column, which is the date of the first mission. We want it to be some time between ‘2010/01/01’ and ‘2020/08/15’, but limit the dates to be only on the first of each month. Therefore, we create a sequence between those dates (and make sure to specify that the strings should be treated as dates), specifying that we only want one date per month. We then randomly sample those dates so we have a vector that is 30k dates long, and make a column out of all those dates. Note that we have to specify that we want to sample the dates with replacement, because we want a column 30k variables long, but are only providing 128 dates. Therefore, we obviously need dates to appear more than once, and need to replace every date we pick back into the pool of dates that we are picking from, so that we have the ability to choose it more than once.

For our second variable we will create a column with the date of the last mission. To make sure that the last date is after the first date, and does not come before it, I will start by making the last date the 1st date, and then add a certain amount of days to it, so that the two dates are not identical. To increase the variability between people, I’ll add a random number of days between 0 and 4k to each person’s start date. To get a random number for each pirate, I will first create a variable that is 30k integers long of random numbers between 0 and 4000 (and once again we need the replace parameter set to TRUE, because we need 30k, when the range of 0 to 4000 is much less than that, so we need to sample with replacement).

I then add this number of days to the first date. If the resulting date is earlier than 08/16 (which I want to be my latest date), I keep the new end date as the old end date + the random number, otherwise I force it to be 08/15.

My problem now though is that I have too many dates that are 08/15, and I would like fewer of those and to increase the variance of the final end dates. To increase the variance I now want to subtract a random number of days from each final date, somewhere between 0 and 65. This leads me to create another variable with 30k random numbers, this time between 0 and 65.

I check to see whether the last date is not 08/15, or if it is 08/15 but subtracting the random number from it would then make it earlier than the first date. If either of those conditions are met, I leave the final date as is. Otherwise, I subtract the random number from the final date.

I now combine the two column dates together in a dataframe, and arrange it by the first mission date.

I run a quick sanity check to confirm that I don’t have any cases where the last date comes before the first date. If I don’t have any then I should product 0 rows with this filtering, which is what I find.

I also want to confirm that I have increased variability and to not have too many final 08/15 date cases by creating an additional variable that is a 1 if the person’s final date was 08/15, and 0 if it was not, and then counting how many 0s and 1s I have, by just counting the rows after grouping by that new variable. We can see that we do not have too many people with 08/15 date as their final date.

I also want to know which crew each pirate is part of. Each group of pirates that started on the same date will be part of the same crew. To create the crews I’m ranking the dates and assigning crew names by rank.

Next, I want to give each pirate a unique id, which in effect will just be the row numbers.

To create a variable of how many missions were completed I start by finding how many days the person was active by calculating the difference in days between their first and last mission dates, then dividing that number by 30 and rounding it, so about a mission a month.

However, I want to increase variability a little bit, which is why I am going to again create a vector of random numbers–this time they’ll be between -2 and 2–and then add that vector to the mission counts variable so long as that number does not result in a negative number, in which case I’ll set it to 1.

I now run a quick sanity check to see if I have any negative mission counts, and see that I do not.

I’m going to assume that each mission takes 6 hours, plus or minus 5 (so between 1 and 11). Therefore, to create a variable of how many hours a person have been volunteering for, I’ll take the number of missions that they have been on, multiply it by 6, and then add to this product a number from a vector that is full of randomly generated numbers between -5 and 5. However, I only do this if adding that number doesn’t result in a negative number (negative volunteer hours does not make sense). If it does result in a negative number I’ll assign mission hours to 1.

Similar to above, a quick sanity check to confirm that we don’t have any negative mission hours.

I also want to give pirate levels based on how many missions they have accomplished. If they accomplished:

up to 10 missions they’ll be at level 1
11 to 30 level 2
31 to 50 level 3
51 to 80 level 4
81 to 120 level 5
over 120 level 6 I will use case_when within mutate to create this variable that is constructed with a handful of rules.

I do want to mess up the accuracy of pirateLevel a little bit, so that we can investigate it in the data wrangling blog post. The first step to doing this is creating a vector that is 0 96% of the time, and 1 the remaining 4% of the time.

Then every time it’s 1 I will add a level to the assigned level, but leave the pirate level accurate and as-is when the variable is 0. This way, I will only mess up 9% of the levels.

I am using the which function to get the index of the messed up levels. I then once subtract 1 from the first of those indices, and once add 1 to it, so that I can pull the row before the first index, the first index, and the row after the first index it, using bracket subsetting notation. I can now look at the levels and make sure only the middle one is wrong as a sanity check.

The same sanity check as above, but looking at the index of the second messed up level.

To create the age variable I’m making 3 vectors and meshing them together. One of those is 15k long with random numbers between 18 and 30; the second is 10k with random numbers between 31 and 55; and the third is 5k with random numbers between 56 and 75. I then resample that meshed-up vector without replacement, so that I get all the ages I already created, but I get them in random order in the dataframe, so that I do not have ages strongly correlated with start dates.

For the occupation variable, I want all the people to be assigned an occupation based on age, and so I turn to case_when again, and make the people under 22 students, non-students under 65 working, and the rest retired.

However, I want to add some variance to those occupation - age combos. To do that I am first choosing from the IDs associated with retired or student 20% random IDs that I will eventually convert to working occupation. I use pull to pull those IDs into a vector.

I am also selecting 11% of those between ages 40 and 65–that will be in the dataset as working–for early retirement.

Then a random 10% of those who aren’t students, or in either of the above ID lists to, convert to student.

Finally, I select 12% of everyone who is not in any of the above groups to convert to other occupation type.

To convert all the occupations I specify in a case_when that any occupation whose line has an: 1) id in the working list should be converted to working 2) id in the retired list should be converted to retired 3) id in the student list should be converted to student 4) id in the other list should be converted to other. All other IDs just retrain their previous occupation.

I also want to run a small sanity check where I am eyeballing the rows in the dataframe where the ids are in the student id list, because originally they would not have been students, and so I want to see if they’re all students now. By asking for counts by occupation, I see that I only have students in that group, so they were all converted as expected.

Similar to above, I am selecting random IDs to create my column specifying whether someone cares about pirating during a certain time of day. I first pull random 70% IDs of anyone who isn’t retired, to make them into night pirates.

I then select random 70% of the remaining IDs to turn them into day pirates.

I am now ready to create a new variable, by specifying that if a row’s ID is in the night list they are tagged as preferring the the night shift; if it is in the day list they prefer the day shift; and if it’s neither of those they’re tagged as not having a preference.

Another small sanity check, here counting number of shift preferences by occupation, to see if they’re ordered by what I’d expect. I expect students and working to be mostly night shift, and retired to be mostly day shift.

For the satisfaction rating column, I want to randomly assign a satisfaction rating to each row that is ostensibly given to them by their crew. I want less students to get very high ratings compared to non-students, so I am starting by choosing random 65% of students’ IDs.

and 75% of non-student IDs.

I am next calculating the length of the vector that is 65% of students, which is really just the length of the vector of the random students’ IDs.

I also want to calculate the length of the remaining 35% of students, which is the length of all students minus the above calculated number.

We now do the same for the non-students, which means that we are looking for the length of the 75% non-students vector.

and also the length of all non-students minus the above vector to get the length of the remaining 25%.

I then create 2 sequences: one that has all the integers between 85 and 100, and the second that has the integers between 0 and 84.

To create the actual satisfaction ratings, I create four dataframes that I eventually bind vertically, gluing the rows on top of each other. For the first dataframe I start with a vector where I randomly sample the high ratings sequence (i.e. 85 to 100) for the length of 65% of students, then putting it in a dataframe with the associated IDs. I rename the columns so that when I eventually join this to the original dataframe it will be easy to do.

for the second dataframe I randomly sample the low ratings sequence (i.e. 0 to 84) for the length of 35% of the remaining students, then create the dataframe with IDs that do not have a student occupation and are not in the 65% IDs.

Similarly, I am sampling the high ratings sequence for the length of 75% of non-students, and then createing the dataframe with their IDs.

I now sample the low ratings sequence for the length of the remaining 25% of non-students, then make the dataframe with the non-students, non-75% IDs.

The next step is to bind the dataframes vertically. That is, we are stacking the rows on top of each other, so we are binding the rows together.

As a sanity check I am choosing the IDs in the original pirates dataframe that are NOT in the crew satisfaction dataframe, and counting them. Since I should have used all the IDs, this should be 0.

I am also choosing the IDs that are shared between the dataframes, and counting them; since this should be all of them I expect it to be 30k.

As I passed my sanity check I’m joining the crew satisfaction dataframe to the pirates one by the ID, and renaming the joined dataframe pirates.

I will now do an additional sanity check, as I want to make sure that the proportions of satisfaction ratings vary across all the occupations. This is easy to do by plotting. The left and right plots are identical, with the exception of the scales = “free” parameter, which lets the axes of the different plots vary so it’s easier to visually compare apples to apples, and will only be in the left plots. I am plotting the satisfaction rating by occupation. The rating goes on the x-axis, and the occupation will each be a separate plot, so it goes in facets. I am using grid.arrange to bind the two plots together, and since I want them next to each other, I specify that the number of columns that I want in my combined plot is 2.

And we can also eyeball descriptive statistics.

For my state variable I want a column telling me which state the pirate hailed from. I searched the internet for the state names, and then copied and pasted them into a string.

I next replaced \r \n \t the in the states string with a space, and then took that output and added a comma before and after wherever I had two capital letters in a row (in effect subbing that for a comma, first capital letter, second capital letter, comma).

I also stripped spaces before and after commas.

I now needed to split the string on the commas, and save it as a list. The split saves all the elements in a single list, so to access each bit to be saved in a list as individual elements we need to use double square brackets to select a single bit as a single element.

Finally, I am making the list all lowercase, and adding to it four, random, non-states, to make our data a little messy.

Eyeballing the list.

I am creating the states column by sampling the list I created. I do, however, want to make sure that I have way more states than non-states. Therefore, I set the prob parameter to pick the first 102 list elements (50 states + dc + district of columbia) at an equal probability, and the last four elements–which are the not states–to be to also be picked at an equal probability to each other, but only a fraction of the likelihood compared to the states.

For my sanity check I’m ensuring that the four non-states were picked much less than the states.

For my final variable I want to create a column of whether the person had a crew member who died. I want most people to not have a dead crewmember. Having one will be indicated with 1, and not having one will be indicated with a 0. Therefore, I am creating a column with random 0s or 1s that will be the length of the dataframe, but the 0 will be 20 times more likely to get picked than the 1.

The dead crewmember sanity check entails grouping by the variable, and then counting how many rows we have per category.

Since I want to have an article on wrangling these data, I do want these data to be a little messy. One way to do that is to mess up the data by duplicating ID 20,000 four times at the end of the dataframe, and then making:

ID 30,001’s mission 1 date a month later
ID 30,002’s mission 1 date NA
ID 30,003’s last mission 1 date a month later
ID 30,004’s crew 50

Eyeballing the duplicated and modified rows to make sure it did what we expected as a sanity check

Another way to mess up the data is by adding NAs. Here I am selecting different percentages of random IDs for each column to convert to NA.

For each column in the dataframe, if the IDs are in the randomly selected IDs above, we convert that column value to NA; otherwise we leave it as the value in that column.

You may have noticed that the above code does not include the date columns. What I learned in this process is that using the ifelse function I can change your input class, and if you do this with a column full of dates, you will end up with a column that has been completely altered. However, you can use data.table::fifelse or if_else instead, which will preserve your date type, while completing the same transformation as ifelse; the two even use the same type of syntax as ifelse.

The problem is that while it is useful for the ifelse to force the datatype to be the same across the input and output, it then makes it impossible to convert one datatype into another, including NA. What I ended up using as a workaround, was assigning NA after selecting what I wanted to convert using the square bracket notation. For my use case I wanted to convert a bunch of dates to NA that were associated with IDs in a list, so I found the indices in my dataframe of the people whose ID was in that list using the which function, and then selected the dates associated with those indices, and just assigned NA to them like so:

The sanity check is plotting the missing values by column.

Finally, we save the data as a CSV, then read it in a looking at it for our sanity check.

We now have our dataframe, and can proceed to wrangling it, which is what the topic of the next article will be. Stay tuned!

要查看或添加评论，请登录

Dr. Aria Fredman (ADHD, ASD, PTSD)的更多文章

Then they came for me—and there was no one left to speak for me: The personal needs to be political

2021年4月9日

Then they came for me—and there was no one left to speak for me: The personal needs to be political

Sometimes I see on LinkedIn folx opining that this is a professional space, so non-professional topics like race should…

3 条评论
A Spock among humans: Personal thoughts of my ongoing journey to self-acceptance

2021年4月2日

A Spock among humans: Personal thoughts of my ongoing journey to self-acceptance

I went through a really rough time in grad school. Like most divorces, mine was difficult for all parties involved, and…
Reading in Very Large CSV Data into Python

2020年8月12日

Reading in Very Large CSV Data into Python

I'm currently doing data science on my laptop, practicing while I look for a new job during the pandemic. This means…

2 条评论
Survival Analysis Models

2020年8月10日

Survival Analysis Models

The other week I had the pleasure of examining some real-life churn data, trying to understand the causes of it. I…
Human motivation: Psychological principles relating to reward and punishment

2020年6月20日

Human motivation: Psychological principles relating to reward and punishment

***Hi! I'm Aria. I wrote this article for a couple of reasons.
My articles relating to diversity, mental health, UX, and gaming

2019年7月2日

My articles relating to diversity, mental health, UX, and gaming

Looking at my LinkedIn profile, it becomes quickly apparent that my interests are quite heterodox. One of my passions…
Never data driven, always data guided: What (partial) correlations and p-values are, and why critical thinking trumps all numbers

2019年6月25日

Never data driven, always data guided: What (partial) correlations and p-values are, and why critical thinking trumps all numbers

As the above graphic very clearly (and, yes, annoyingly ;-) ) emphasizes, if you are given numbers, you can come up…
Kintsugi

2019年6月25日

Kintsugi

“Becoming a father of a son with special needs was the turning point in my life that has shaped who I am today. It has…
Representation in Games and Technology: The Holiday Edition

2018年12月27日

Representation in Games and Technology: The Holiday Edition

I will start this post with a confession: I am the mom who tells her kids that the tooth fairy is not real. That being…

2 条评论
Pokemon Go is a Gateway Game for Gamer Aria (now say that 5 times fast!): Positive UX aspects of Pokemon Go

2018年12月11日

Pokemon Go is a Gateway Game for Gamer Aria (now say that 5 times fast!): Positive UX aspects of Pokemon Go

My earliest memory of playing a #videogame is from 1989 (gosh, do I feel old now!) I remember sitting at a friend’s…

1 条评论

See all articles

Pirate Play Part I: Creating the Data in R

Dr. Aria Fredman (ADHD, ASD, PTSD)

UX Researcher

Dr. Aria Fredman (ADHD, ASD, PTSD)的更多文章

社区洞察

其他会员也浏览了

Crossroads of GIS and Public Safety: My Journey with Gistic Research

Simple Hydromet Exchange Format: Streamlining Hydrometeorological Data Sharing

Finding Suspects with OSINT Mapping: A Game-Changing Tool for Investigations

4 Breakthrough Uses for Big Data and Why You'll Benefit From Them

IBM and Data Science are Helping Save the World through Call for Code

Monitoring Rancher cluster health with kube-state-metrics

Model AND Map Connections in Real-World Systems

Learn how to Effectively report statistical data

Wildfire Data in British Columbia with Tableau

Personal summer challenge on MS Excel

Dr. Aria Fredman (ADHD, ASD, PTSD)的更多文章

Then they came for me—and there was no one left to speak for me: The personal needs to be political

A Spock among humans: Personal thoughts of my ongoing journey to self-acceptance

Reading in Very Large CSV Data into Python

Survival Analysis Models

Human motivation: Psychological principles relating to reward and punishment

My articles relating to diversity, mental health, UX, and gaming

Never data driven, always data guided: What (partial) correlations and p-values are, and why critical thinking trumps all numbers

Kintsugi

Representation in Games and Technology: The Holiday Edition

Pokemon Go is a Gateway Game for Gamer Aria (now say that 5 times fast!): Positive UX aspects of Pokemon Go

社区洞察

其他会员也浏览了

Crossroads of GIS and Public Safety: My Journey with Gistic Research

Simple Hydromet Exchange Format: Streamlining Hydrometeorological Data Sharing

Finding Suspects with OSINT Mapping: A Game-Changing Tool for Investigations

4 Breakthrough Uses for Big Data and Why You'll Benefit From Them

IBM and Data Science are Helping Save the World through Call for Code

Monitoring Rancher cluster health with kube-state-metrics

Model AND Map Connections in Real-World Systems

Learn how to Effectively report statistical data

Wildfire Data in British Columbia with Tableau

Personal summer challenge on MS Excel