Tying it all together with stringr
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
Manipulating strings and pulling patterns of text is a frequent coding task and can be a challenge. Among the many options available for manipulating strings is the "stringr" package. Today's tip dives into some of the functions stringr has to offer.
Today we will explore:
- str_locate() - identifying the start and end character indices of your search text
- word() - pulling entire words based on their index or range of indices. "Words" are defined by a separator (default is a single space)
- str_split() - separating a string into a list, using a provided separator
- str_extract() - returning matched text from a string
[We will also touch on the use of the collapse option from paste() - allowing for quick creation of long strings with separator text.]
Install and load the stringr package before getting started:
install.packages("stringr")
library(stringr)
Follow along by running the code in RStudio. Ready! Set! Go!
_________________________________________________________________________
1) Let's try to locate a string of text within some sentences
str_locate() will provide us with the start and end indices of our desired search text.
str_locate() takes in the string of text you are searching within and the pattern you are searching for.
It will give the location of matched text within the string as output. To ignore the case of the text, wrap the pattern in the regex() function and use ignore_case = TRUE.
We can use example lyrics to look for occurrences of the text "some sing":
exampleSong <- "All God's creatures got a place in the choir. Some sing low and some sing higher. Some sing out loud on a telephone wire. Some just clap their hands, or paws, or anything they've got now."
toMatch <- "some sing"
# Find the FIRST location, NOT ignoring case
str_locate(exampleSong, toMatch)
# Find the FIRST location, IGNORING case
str_locate(exampleSong, regex(toMatch, ignore_case = TRUE))
# Find ALL locations, IGNORING case
str_locate_all(exampleSong, regex(toMatch, ignore_case = TRUE))
Note the difference in output when we add the option to ignore case:
NOT ignoring case will miss the first capitalized "Some"
Ignoring case will properly find the first location of "Some sing"
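One way to sanity-check the positions that str_locate() reports is to feed them back into str_sub() and see which text comes out. The sketch below does that on a shortened, made-up string (lyric is an illustrative stand-in, not the full example lyrics), so treat it as a quick check rather than part of the original walkthrough.

```r
library(stringr)

# Hypothetical short string standing in for the full lyrics
lyric <- "Some sing low and some sing higher."

# Case-sensitive search skips the capitalized "Some"
locExact <- str_locate(lyric, "some sing")

# Case-insensitive search finds the very first occurrence
locAnyCase <- str_locate(lyric, regex("some sing", ignore_case = TRUE))

# str_locate() returns a matrix with "start" and "end" columns,
# which can be handed to str_sub() to pull the matched text back out
foundExact <- str_sub(lyric, locExact[, "start"], locExact[, "end"])
foundAnyCase <- str_sub(lyric, locAnyCase[, "start"], locAnyCase[, "end"])

foundExact   # the lowercase match later in the string
foundAnyCase # the capitalized match at the very start
```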
__________________________________________________________________________
2) Capture an unknown word or words based on position
word() will let us pull from a string based on a set position or range, with the ability to specify the separator as well. This returns the entire contents from start to finish, including the separator.
word() takes in the string you are searching within, the start and end locations of the word(s), and the separator text used to split the string into words (by default, a single space).
To see how this works, let's first use our example song text again.
First, pull the 10th and 11th words. Notice that this is the first instance of "some sing" that we found above when ignoring case, but this time we are counting the indices based on WORD count, not CHARACTER.
# Pull the first instance of "some sing"
word(exampleSong, start = 10, end = 11)
# You can simplify the command if only providing the start position
word(exampleSong, 10) # Pull the 10th word
A negative start or end will allow you to count backwards from the end of the string.
word(exampleSong, start = -6, end = -1) # Pull from the end of the string
Now that we have an understanding of how word() works, we can take a look at a different application.
In this case, we have a string of test dates from 3 patients, and we want to pull certain test dates out. The list of dates looks like this:
Since our dates are stored using ".." as a separator, we must specify this when using word()
Remember that "." is a regex symbol with its own meaning (it matches any single character)! To keep it from being interpreted that way, we must escape it. A double backslash before a symbol tells R to read it as a literal character.
"\\" means "escape"
Therefore ".." becomes "\\.\\." when we specify the separator in the command to word().
We start by creating our data table, then test out the separator while looking for the oldest and newest test for each patient.
We can use a loop to iterate over each row in the range 1 to number of rows. Note what happens if you do not escape the dots.
# Create our table with a row for each patient
testDates <- rbind(
  A = "03/23/2020..04/02/2020..05/14/2020..09/25/2020",
  B = "05/04/2020..07/15/2020..08/21/2020",
  C = "09/13/2020..09/24/2020..10/01/2020..11/25/2020"
)
# Find the first test date for each patient
for(patient in 1:dim(testDates)[1]){
  print(word(testDates[patient], start = 1, sep = "\\.\\."))
}
### For comparison, what happens if you forget to escape the dots?
word(testDates[1], start = 1, sep = "..")
# Find the most recent test date for all patients
for(patient in 1:dim(testDates)[1]){
  print(word(testDates[patient], start = -1, sep = "\\.\\."))
}
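If escaping every dot feels error-prone, stringr's fixed() helper offers another route: it marks the separator as literal text, so no backslashes are needed. The snippet below is a small sketch of that alternative on a single illustrative string (patientB is just a stand-in in the same date format), and it compares the result with the escaped version.

```r
library(stringr)

# Illustrative single string in the same ".."-separated date format
patientB <- "05/04/2020..07/15/2020..08/21/2020"

# Separator written as an escaped regex
firstEscaped <- word(patientB, start = 1, sep = "\\.\\.")

# Separator written as literal text via fixed(), no escaping required
firstFixed <- word(patientB, start = 1, sep = fixed(".."))
lastFixed <- word(patientB, start = -1, sep = fixed(".."))

firstEscaped
firstFixed
lastFixed
```

Both forms should pick out the same date. Which one reads better is mostly a matter of taste, though fixed() can be handy when the separator contains several regex symbols.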
__________________________________________________________________________
3) Separate a single string into a list of words by splitting on separator text.
str_split() will return a list of split strings, split on the separator text provided. The split pattern itself is not returned.
str_split() takes in the string that will be split and the pattern to split on.
In part 2, we looked at how word() captures individual dates as words. But as you may remember from the start of that part, word() also takes ranges of words when you supply both a start and an end position. What does that mean for our date list, where we don't want to see the separator?
Using our same test date table, for Patient A, we will:
- Find the 1st through 3rd dates
- Print the non-split captured text from word()
- Split the dates on the same separator used to find them, and print as a list
# Assign the captured word range for Patient A (row 1) to a variable
dates <- word(testDates[1], start = 1, end = 3, sep = "\\.\\.")
dates # View
# List as split dates
print(str_split(dates, pattern = "\\.\\."))
The results will look like this:
Note the dots included in the non-split dates. word() prints everything from start to end.
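Because str_split() hands back a list (one element per input string), it often helps to unwrap that list before working with the dates. Here is a minimal sketch, using an illustrative dateRange string, of two ways to do that: indexing with [[1]], or asking for a matrix with simplify = TRUE.

```r
library(stringr)

# Illustrative range of dates, still joined by the ".." separator
dateRange <- "03/23/2020..04/02/2020..05/14/2020"

# Default behavior: a list with one character vector per input string
splitList <- str_split(dateRange, pattern = "\\.\\.")

# Unwrap the first (and only) element to get a plain character vector
splitVector <- splitList[[1]]

# Or request a character matrix with one row per input string
splitMatrix <- str_split(dateRange, pattern = "\\.\\.", simplify = TRUE)

splitVector
splitMatrix
```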
* But what if we need only certain dates as a list and don't want the entire range?
Instead of using word() and str_split() together, we can replace the start and end positions by passing a vector of indices to the word() function itself!
Using c() to provide a vector of positions to the word() function, we can select certain test dates and return them as a character vector.
Let's pick out test 1 and 3 for each patient:
for(patient in 1:dim(testDates)[1]){
  print(word(testDates[patient], start = c(1, 3), sep = "\\.\\."))
}
The results will be character vectors of only test 1 and 3 for each patient:
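To see this on a single patient without the loop, here is a small sketch using an illustrative string (patientA mirrors the Patient A dates). It relies on the fact that end defaults to start in word(), so each position in the vector is pulled as its own one-word selection.

```r
library(stringr)

# Illustrative string mirroring Patient A's test dates
patientA <- "03/23/2020..04/02/2020..05/14/2020..09/25/2020"

# A vector of positions returns one word per position
picked <- word(patientA, start = c(1, 3), sep = "\\.\\.")

picked         # tests 1 and 3 only, with no separator text
length(picked) # one element per requested position
```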
__________________________________________________________________________
4) Find specific text patterns in a string
str_extract() allows us to return matched text found within a string.
str_extract() takes in the string you are searching within and a regex pattern you are searching for
Let's say we have a list of hospital diagnosis codes, but we only want to pull out certain ones. The codes we are looking for start with "T" and are followed by the numbers 36-50. (ex: T36). Before we can match using str_extract(), we want to create a regex pattern to match against which will:
- Be one string of text (required for regex)
- Have each desired code separated by the or symbol "|", so that we pull matches to any of the possible codes in our desired list
- Not require us to type out each code by hand
Starting our pattern, paste() can list out all the numbers in our range:
paste(36:50)
But that is not quite what we want the numbers to look like for use as a pattern. Remember that to be a regex pattern it must be one string of text. And we also want a leading "T" on each number. If we use the collapse option within paste, we can get one string and can specify how our numbers are separated.
Using collapse within paste() lets you create any separator you wish for your list, collapsing it into one string.
Improve our pattern:
- The "|" is used as "or" in the regex, so each code must be separated by this.
- By adding the letter "T" to our collapse, we add the leading T to each trailing number
paste(paste(36:50), collapse ="|T")
Almost there! But we are still missing the first "T". This is an easy fix. We can use paste0() to quickly add the leading T without any additional separators. It looks long, but the paste0() code will simply surround our previous pattern: paste0("T", previousPattern)
# Save pattern to a variable
pat <- paste0("T", paste(paste(36:50), collapse = "|T"))
# View our finished pattern
pat
Let's review the development of this pattern:
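The three stages can be replayed one at a time in the console. The variable names below (step1, step2, step3, shortcut) are only for illustration. The last line is an optional shortcut, not part of the walkthrough above: since paste0() is vectorized and also accepts collapse, it can likely build the same pattern in a single call.

```r
# Stage 1: the numbers as a character vector, "36" through "50"
step1 <- paste(36:50)

# Stage 2: collapse into one string, with "|T" between the numbers
step2 <- paste(step1, collapse = "|T")

# Stage 3: add the missing leading "T"
step3 <- paste0("T", step2)

# Optional shortcut: prefix every number with "T", then join with "|"
shortcut <- paste0("T", 36:50, collapse = "|")

step1
step2
step3
identical(step3, shortcut)
```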
Now we are ready to use str_extract() to find our matching codes!
# String of codes to pull matches from
codes <- "T35..T42..J12..H14..M44..T55..A23..T47..A36..T43..M21..H51"
str_extract(codes, pat) # Print the FIRST match
str_extract_all(codes, pat) # Print ALL matches
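Like str_split(), str_extract_all() returns a list, so a common follow-up is to unwrap it into a plain character vector. The sketch below rebuilds the pattern so it runs on its own (the one-call paste0() form is an assumption of convenience and should give the same pattern as above), then pulls the matches out with [[1]].

```r
library(stringr)

# Rebuild the pattern so this snippet is self-contained
pat <- paste0("T", 36:50, collapse = "|")

# Same string of codes as in the example above
codes <- "T35..T42..J12..H14..M44..T55..A23..T47..A36..T43..M21..H51"

# First match only, returned as a single string
firstMatch <- str_extract(codes, pat)

# All matches, unwrapped from the list into a character vector
allMatches <- str_extract_all(codes, pat)[[1]]

firstMatch
allMatches
```

Note that T35 and T55 fall outside the 36-50 range and A36 lacks the leading "T", so none of them should appear in the results.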
__________________________________________________________________________