Pirate Play Part I: Creating the Data in R
Photo from https://www.yourlifeindata.com/2019/12/20/is-it-time-to-live-the-pirate-life/

Pirate Play Part I: Creating the Data in R

In this article I want to create a dataset with attributes of fictional pirates. Note that I am aware that this article has a ridiculous premise, but I did it because while producing this type of content (that is, demonstrating that I can do data science, not data science with silly stories) while job hunting is important for visibility (plus I remember it so much better after writing about it), at some point it is more fun if I let some silliness in; and if I can add some fun to my life while job hunting during a pandemic, I am going to seize opportunity.

It’s also high time (or is it high tide?) someone wrote a data science blog post filled with terrible pirate puns and R code. Please note that these data are fictional, and do not come from any real company. I think of this article both as a portfolio and instructional piece, and therefore I think that the validity of these data are far less important compared to the code, and what I show that we can learn from potential findings (actionable insights FTW!).

The scenario I’ll be working with for all of the parts of this exercise is as follows: Imagine that you discovered the true negative correlation between number of pirates in the world, and global warming. Also imagine that you are unfamiliar with the idea that while correlation might be causation, it might not be. If that was the case, I would recommend you watch the short talk I gave on the topic at a recent data science conference, because we would not want you coming to the erroneous conclusion that you can fight global warming by getting volunteers to take up piracy.

Now imagine that you did not watch my talk, came to the erroneous conclusion, and have started a non-profit promoting piracy for volunteers. Unfortunately (for you; the rest of us will find your pirate woos quite lucky), despite training your pirate volunteers, you are finding that churn is a real problem, and you would like to understand it better, so that you can create interventions to increase pirate retention. To clarify, you are focusing your investigation on the pirates that quit out of their own volition, and not those who died in battle.

You only recently understood the value of continuous listening, and ongoing feedback from your volunteers, but have not yet had the chance to implement such a program within your organization. You also do not have any qualitative data, which is unfortunate, and it is often what we need to understand WHY people are doing what they are doing; quantitative research tends to be much better at explaining WHAT people are doing. Since you need some actionable insights despite not have the luxury of collecting more data, you turn to the data that you do have, and to your favorite pirate coding language: Argh! Um, I mean R.

Since this is the first part in the series, we will first create the data that we will analyze in subsequent parts. We want to end up with 13 columns:

  1. First mission date
  2. Last mission date
  3. Crew number (which is assigned at the start of the volunteering)
  4. Pirate ID
  5. Mission count of how many missions were completed
  6. Total number of mission hours completed
  7. The level the pirate reached, which is based on how many missions were completed
  8. Age
  9. Occupation, with options being student, working, retired, or other
  10. Whether there is a preference for day missions, night missions, or both
  11. A satisfaction rating, which is given by other crew members post mission, and then averaged
  12. What US state the pirate hails from
  13. Whether any of the pirate’s crew members died during a mission

The code for this article can be found here.

We need to start by loading the libraries we’ll be using during this exercise.

No alt text provided for this image

We also want to set a seed so that we can all create replicable data, and get the same results.

No alt text provided for this image

Let’s start with creating the first column, which is the date of the first mission. We want it to be some time between ‘2010/01/01’ and ‘2020/08/15’, but limit the dates to be only on the first of each month. Therefore, we create a sequence between those dates (and make sure to specify that the strings should be treated as dates), specifying that we only want one date per month. We then randomly sample those dates so we have a vector that is 30k dates long, and make a column out of all those dates. Note that we have to specify that we want to sample the dates with replacement, because we want a column 30k variables long, but are only providing 128 dates. Therefore, we obviously need dates to appear more than once, and need to replace every date we pick back into the pool of dates that we are picking from, so that we have the ability to choose it more than once.

No alt text provided for this image

For our second variable we will create a column with the date of the last mission. To make sure that the last date is after the first date, and does not come before it, I will start by making the last date the 1st date, and then add a certain amount of days to it, so that the two dates are not identical. To increase the variability between people, I’ll add a random number of days between 0 and 4k to each person’s start date. To get a random number for each pirate, I will first create a variable that is 30k integers long of random numbers between 0 and 4000 (and once again we need the replace parameter set to TRUE, because we need 30k, when the range of 0 to 4000 is much less than that, so we need to sample with replacement).

No alt text provided for this image

I then add this number of days to the first date. If the resulting date is earlier than 08/16 (which I want to be my latest date), I keep the new end date as the old end date + the random number, otherwise I force it to be 08/15.

No alt text provided for this image

My problem now though is that I have too many dates that are 08/15, and I would like fewer of those and to increase the variance of the final end dates. To increase the variance I now want to subtract a random number of days from each final date, somewhere between 0 and 65. This leads me to create another variable with 30k random numbers, this time between 0 and 65.

No alt text provided for this image

I check to see whether the last date is not 08/15, or if it is 08/15 but subtracting the random number from it would then make it earlier than the first date. If either of those conditions are met, I leave the final date as is. Otherwise, I subtract the random number from the final date.

No alt text provided for this image

I now combine the two column dates together in a dataframe, and arrange it by the first mission date.

No alt text provided for this image

I run a quick sanity check to confirm that I don’t have any cases where the last date comes before the first date. If I don’t have any then I should product 0 rows with this filtering, which is what I find.

No alt text provided for this image

I also want to confirm that I have increased variability and to not have too many final 08/15 date cases by creating an additional variable that is a 1 if the person’s final date was 08/15, and 0 if it was not, and then counting how many 0s and 1s I have, by just counting the rows after grouping by that new variable. We can see that we do not have too many people with 08/15 date as their final date.

No alt text provided for this image

I also want to know which crew each pirate is part of. Each group of pirates that started on the same date will be part of the same crew. To create the crews I’m ranking the dates and assigning crew names by rank.

No alt text provided for this image

Next, I want to give each pirate a unique id, which in effect will just be the row numbers.

No alt text provided for this image

To create a variable of how many missions were completed I start by finding how many days the person was active by calculating the difference in days between their first and last mission dates, then dividing that number by 30 and rounding it, so about a mission a month.

No alt text provided for this image

However, I want to increase variability a little bit, which is why I am going to again create a vector of random numbers–this time they’ll be between -2 and 2–and then add that vector to the mission counts variable so long as that number does not result in a negative number, in which case I’ll set it to 1.

No alt text provided for this image

I now run a quick sanity check to see if I have any negative mission counts, and see that I do not.

No alt text provided for this image

I’m going to assume that each mission takes 6 hours, plus or minus 5 (so between 1 and 11). Therefore, to create a variable of how many hours a person have been volunteering for, I’ll take the number of missions that they have been on, multiply it by 6, and then add to this product a number from a vector that is full of randomly generated numbers between -5 and 5. However, I only do this if adding that number doesn’t result in a negative number (negative volunteer hours does not make sense). If it does result in a negative number I’ll assign mission hours to 1.

No alt text provided for this image

Similar to above, a quick sanity check to confirm that we don’t have any negative mission hours.

No alt text provided for this image

I also want to give pirate levels based on how many missions they have accomplished. If they accomplished:

  1. up to 10 missions they’ll be at level 1
  2. 11 to 30 level 2
  3. 31 to 50 level 3
  4. 51 to 80 level 4
  5. 81 to 120 level 5
  6. over 120 level 6 I will use case_when within mutate to create this variable that is constructed with a handful of rules.
No alt text provided for this image

I do want to mess up the accuracy of pirateLevel a little bit, so that we can investigate it in the data wrangling blog post. The first step to doing this is creating a vector that is 0 96% of the time, and 1 the remaining 4% of the time.

No alt text provided for this image

Then every time it’s 1 I will add a level to the assigned level, but leave the pirate level accurate and as-is when the variable is 0. This way, I will only mess up 9% of the levels.

No alt text provided for this image

I am using the which function to get the index of the messed up levels. I then once subtract 1 from the first of those indices, and once add 1 to it, so that I can pull the row before the first index, the first index, and the row after the first index it, using bracket subsetting notation. I can now look at the levels and make sure only the middle one is wrong as a sanity check.

No alt text provided for this image

The same sanity check as above, but looking at the index of the second messed up level.

No alt text provided for this image

To create the age variable I’m making 3 vectors and meshing them together. One of those is 15k long with random numbers between 18 and 30; the second is 10k with random numbers between 31 and 55; and the third is 5k with random numbers between 56 and 75. I then resample that meshed-up vector without replacement, so that I get all the ages I already created, but I get them in random order in the dataframe, so that I do not have ages strongly correlated with start dates.

No alt text provided for this image

For the occupation variable, I want all the people to be assigned an occupation based on age, and so I turn to case_when again, and make the people under 22 students, non-students under 65 working, and the rest retired.

No alt text provided for this image

However, I want to add some variance to those occupation - age combos. To do that I am first choosing from the IDs associated with retired or student 20% random IDs that I will eventually convert to working occupation. I use pull to pull those IDs into a vector.

No alt text provided for this image

I am also selecting 11% of those between ages 40 and 65–that will be in the dataset as working–for early retirement.

No alt text provided for this image

Then a random 10% of those who aren’t students, or in either of the above ID lists to, convert to student.

No alt text provided for this image

Finally, I select 12% of everyone who is not in any of the above groups to convert to other occupation type.

No alt text provided for this image

To convert all the occupations I specify in a case_when that any occupation whose line has an: 1) id in the working list should be converted to working 2) id in the retired list should be converted to retired 3) id in the student list should be converted to student 4) id in the other list should be converted to other. All other IDs just retrain their previous occupation.

No alt text provided for this image

I also want to run a small sanity check where I am eyeballing the rows in the dataframe where the ids are in the student id list, because originally they would not have been students, and so I want to see if they’re all students now. By asking for counts by occupation, I see that I only have students in that group, so they were all converted as expected.

No alt text provided for this image

Similar to above, I am selecting random IDs to create my column specifying whether someone cares about pirating during a certain time of day. I first pull random 70% IDs of anyone who isn’t retired, to make them into night pirates.

No alt text provided for this image

I then select random 70% of the remaining IDs to turn them into day pirates.

No alt text provided for this image

I am now ready to create a new variable, by specifying that if a row’s ID is in the night list they are tagged as preferring the the night shift; if it is in the day list they prefer the day shift; and if it’s neither of those they’re tagged as not having a preference.

No alt text provided for this image

Another small sanity check, here counting number of shift preferences by occupation, to see if they’re ordered by what I’d expect. I expect students and working to be mostly night shift, and retired to be mostly day shift.

No alt text provided for this image

For the satisfaction rating column, I want to randomly assign a satisfaction rating to each row that is ostensibly given to them by their crew. I want less students to get very high ratings compared to non-students, so I am starting by choosing random 65% of students’ IDs.

No alt text provided for this image

and 75% of non-student IDs.

No alt text provided for this image

I am next calculating the length of the vector that is 65% of students, which is really just the length of the vector of the random students’ IDs.

No alt text provided for this image

I also want to calculate the length of the remaining 35% of students, which is the length of all students minus the above calculated number.

No alt text provided for this image

We now do the same for the non-students, which means that we are looking for the length of the 75% non-students vector.

No alt text provided for this image

and also the length of all non-students minus the above vector to get the length of the remaining 25%.

No alt text provided for this image

I then create 2 sequences: one that has all the integers between 85 and 100, and the second that has the integers between 0 and 84.

No alt text provided for this image

To create the actual satisfaction ratings, I create four dataframes that I eventually bind vertically, gluing the rows on top of each other. For the first dataframe I start with a vector where I randomly sample the high ratings sequence (i.e. 85 to 100) for the length of 65% of students, then putting it in a dataframe with the associated IDs. I rename the columns so that when I eventually join this to the original dataframe it will be easy to do.

No alt text provided for this image

for the second dataframe I randomly sample the low ratings sequence (i.e. 0 to 84) for the length of 35% of the remaining students, then create the dataframe with IDs that do not have a student occupation and are not in the 65% IDs.

No alt text provided for this image

Similarly, I am sampling the high ratings sequence for the length of 75% of non-students, and then createing the dataframe with their IDs.

No alt text provided for this image

I now sample the low ratings sequence for the length of the remaining 25% of non-students, then make the dataframe with the non-students, non-75% IDs.

No alt text provided for this image

The next step is to bind the dataframes vertically. That is, we are stacking the rows on top of each other, so we are binding the rows together.

No alt text provided for this image

As a sanity check I am choosing the IDs in the original pirates dataframe that are NOT in the crew satisfaction dataframe, and counting them. Since I should have used all the IDs, this should be 0.

No alt text provided for this image

I am also choosing the IDs that are shared between the dataframes, and counting them; since this should be all of them I expect it to be 30k.

No alt text provided for this image

As I passed my sanity check I’m joining the crew satisfaction dataframe to the pirates one by the ID, and renaming the joined dataframe pirates.

No alt text provided for this image

I will now do an additional sanity check, as I want to make sure that the proportions of satisfaction ratings vary across all the occupations. This is easy to do by plotting. The left and right plots are identical, with the exception of the scales = “free” parameter, which lets the axes of the different plots vary so it’s easier to visually compare apples to apples, and will only be in the left plots. I am plotting the satisfaction rating by occupation. The rating goes on the x-axis, and the occupation will each be a separate plot, so it goes in facets. I am using grid.arrange to bind the two plots together, and since I want them next to each other, I specify that the number of columns that I want in my combined plot is 2.

No alt text provided for this image

And we can also eyeball descriptive statistics.

No alt text provided for this image

For my state variable I want a column telling me which state the pirate hailed from. I searched the internet for the state names, and then copied and pasted them into a string.

No alt text provided for this image

I next replaced \r \n \t the in the states string with a space, and then took that output and added a comma before and after wherever I had two capital letters in a row (in effect subbing that for a comma, first capital letter, second capital letter, comma).

No alt text provided for this image

I also stripped spaces before and after commas.

No alt text provided for this image

I now needed to split the string on the commas, and save it as a list. The split saves all the elements in a single list, so to access each bit to be saved in a list as individual elements we need to use double square brackets to select a single bit as a single element.

No alt text provided for this image

Finally, I am making the list all lowercase, and adding to it four, random, non-states, to make our data a little messy.

No alt text provided for this image

Eyeballing the list.

No alt text provided for this image

I am creating the states column by sampling the list I created. I do, however, want to make sure that I have way more states than non-states. Therefore, I set the prob parameter to pick the first 102 list elements (50 states + dc + district of columbia) at an equal probability, and the last four elements–which are the not states–to be to also be picked at an equal probability to each other, but only a fraction of the likelihood compared to the states.

No alt text provided for this image

For my sanity check I’m ensuring that the four non-states were picked much less than the states.

No alt text provided for this image

For my final variable I want to create a column of whether the person had a crew member who died. I want most people to not have a dead crewmember. Having one will be indicated with 1, and not having one will be indicated with a 0. Therefore, I am creating a column with random 0s or 1s that will be the length of the dataframe, but the 0 will be 20 times more likely to get picked than the 1.

No alt text provided for this image

The dead crewmember sanity check entails grouping by the variable, and then counting how many rows we have per category.

No alt text provided for this image

Since I want to have an article on wrangling these data, I do want these data to be a little messy. One way to do that is to mess up the data by duplicating ID 20,000 four times at the end of the dataframe, and then making:

  1. ID 30,001’s mission 1 date a month later
  2. ID 30,002’s mission 1 date NA
  3. ID 30,003’s last mission 1 date a month later
  4. ID 30,004’s crew 50
No alt text provided for this image

Eyeballing the duplicated and modified rows to make sure it did what we expected as a sanity check

No alt text provided for this image

Another way to mess up the data is by adding NAs. Here I am selecting different percentages of random IDs for each column to convert to NA.

No alt text provided for this image

For each column in the dataframe, if the IDs are in the randomly selected IDs above, we convert that column value to NA; otherwise we leave it as the value in that column.

No alt text provided for this image

You may have noticed that the above code does not include the date columns. What I learned in this process is that using the ifelse function I can change your input class, and if you do this with a column full of dates, you will end up with a column that has been completely altered. However, you can use data.table::fifelse or if_else instead, which will preserve your date type, while completing the same transformation as ifelse; the two even use the same type of syntax as ifelse.

The problem is that while it is useful for the ifelse to force the datatype to be the same across the input and output, it then makes it impossible to convert one datatype into another, including NA. What I ended up using as a workaround, was assigning NA after selecting what I wanted to convert using the square bracket notation. For my use case I wanted to convert a bunch of dates to NA that were associated with IDs in a list, so I found the indices in my dataframe of the people whose ID was in that list using the which function, and then selected the dates associated with those indices, and just assigned NA to them like so:

No alt text provided for this image

The sanity check is plotting the missing values by column.

No alt text provided for this image

Finally, we save the data as a CSV, then read it in a looking at it for our sanity check.

No alt text provided for this image

We now have our dataframe, and can proceed to wrangling it, which is what the topic of the next article will be. Stay tuned!






.

要查看或添加评论,请登录

Dr. Aria Fredman (ADHD, ASD, PTSD)的更多文章

社区洞察

其他会员也浏览了