Tying it all together with stringr

Samantha Bell

Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics

Published: December 3, 2020

Manipulating strings and pulling patterns of text is a frequent coding task and can be a challenge. Among the many options available for manipulating strings is the "stringr" package. Today's tip dives into some of the functions stringr has to offer.

Today we will explore:

str_locate() - identifying the start and end character indices of your search text
word() - pulling entire words based on their index or range of indices. "Words" are defined by a separator (default is a single space)
str_split() - separating a string into a list, using a provided separator
str_extract() - returning matched text from a string

[We will also touch on the use of the collapse option from paste() - allowing for quick creation of long strings with separator text.]

Install and load the stringr package before getting started:

install.packages("stringr")
library(stringr)

Follow along by running the code in RStudio. Ready! Set! Go!

_________________________________________________________________________

1) Let's try to locate a string of text within some sentences

str_locate() will provide us with the start and end indices of our desired search text.

str_locate() takes in the string of text you are searching within and the pattern you are searching for.

It will give the location of matched text within the string as output. To ignore the case of the text, wrap the pattern in the regex() function and use ignore_case = TRUE.

We can use example lyrics to look for occurrences of the text, "some sing"

exampleSong <- "All God's creatures got a place in the choir. Some sing low and some sing higher. Some sing out loud on a telephone wire. Some just clap their hands, or paws, or anything they've got now."


toMatch <- "some sing"

# Find the FIRST location, NOT ignoring case
str_locate(exampleSong, toMatch) 

# Find the FIRST location, IGNORING case
str_locate(exampleSong, regex(toMatch, ignore_case = TRUE) ) 

# Find ALL locations, IGNORING case
str_locate_all(exampleSong, regex(toMatch, ignore_case = TRUE) )

Note the difference in output when we add the option to ignore case:

NOT ignoring case will miss the first capitalized "Some"

Ignoring case will properly find the first location of "Some sing"
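
As a quick check of what these indices mean, here is a minimal sketch on a shortened lyric. The `lyric` variable and the `loc_*` names are made up for this illustration and are not part of the original example. It passes the located positions to str_sub() to recover the matched text.

```r
library(stringr)

# Shortened lyric used only for this illustration
lyric <- "Some sing low and some sing higher."

# Case-sensitive: skips the capitalized "Some sing" and finds the lowercase one
loc_exact <- str_locate(lyric, "some sing")
loc_exact

# Case-insensitive: finds the match at the very start of the string
loc_any <- str_locate(lyric, regex("some sing", ignore_case = TRUE))
loc_any

# The start/end columns can be passed to str_sub() to pull out the matched text
str_sub(lyric, loc_exact[, "start"], loc_exact[, "end"])
str_sub(lyric, loc_any[, "start"], loc_any[, "end"])

# str_locate_all() returns a list with one matrix of matches per input string
all_locs <- str_locate_all(lyric, regex("some sing", ignore_case = TRUE))
nrow(all_locs[[1]])
```

Each row of the matrix is one match, so counting the rows is a simple way to count occurrences.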

__________________________________________________________________________

2) Capture an unknown word or words based on position

word() will let us pull from a string based on a set position or range, with the ability to specify the separator as well. This returns the entire contents from start to finish, including the separator.

word() takes in the string you are searching within, the start and end locations of the word(s), and the separator text used to split the string into words (by default this is a single space).

To see how this works, let's first use our example song text again.

First, pull the 10th and 11th words. Notice that this is the first instance of "Some sing" that we searched for above, but this time we are counting the indices based on WORD count, not CHARACTER.

# Pull the first instance of "some sing"
word(exampleSong, start = 10, end = 11) 


# You can simplify the command if only providing the start position 
word(exampleSong, 10) # Pull the 10th word

A negative start or end will allow you to count backwards from the end of the string.

word(exampleSong, start = -6, end = -1) # Pull from the end of the string
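
To make the positive and negative positions easier to compare, here is a small sketch on a single line of the song. The `sentence` variable and the result names are assumptions introduced only for this illustration.

```r
library(stringr)

# One line of the lyrics, used only for this illustration
sentence <- "Some sing out loud on a telephone wire."

# Positive positions count from the start of the string
first_two <- word(sentence, start = 1, end = 2)
first_two

# Negative positions count from the end; -1 is the last word
last_word <- word(sentence, start = -1)
last_word

# A negative range pulls the final two words, including the space between them
last_two <- word(sentence, start = -2, end = -1)
last_two
```

Punctuation stays attached to its word, because only the separator (a single space here) marks word boundaries.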

Now that we have an understanding of how word() works, we can take a look at a different application.

In this case, we have a string of test dates for each of 3 patients, and we want to pull certain test dates out. Each patient's dates are stored as one string, for example "03/23/2020..04/02/2020..05/14/2020..09/25/2020".

Since our dates are stored using ".." as a separator, we must specify this when using word()

Remember that "." is a regex symbol with its own meaning (it matches any single character)! To keep the separator from being interpreted as a regex wildcard, we must escape each dot. A double backslash before a symbol tells R to read it as a normal character.

"\\" means "escape"

Therefore ".." becomes "\\.\\." when we specify the separator in the command to word().
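
The difference between the escaped and unescaped separator can be sketched as follows. The variable names are made up for this illustration, and the use of stringr's fixed() as an alternative to escaping is an addition that the original tip does not cover.

```r
library(stringr)

# One pair of dates, used only for this illustration
x <- "03/23/2020..04/02/2020"

# Unescaped: each "." matches any character, so ".." matches any two characters
unescaped_hit <- str_detect("ab", "..")
unescaped_hit

# Escaped: only two literal dots match, so "ab" no longer matches
escaped_hit <- str_detect("ab", "\\.\\.")
escaped_hit

# The escaped separator splits the string into dates as intended
first_escaped <- word(x, start = 1, sep = "\\.\\.")
first_escaped

# fixed() treats the pattern as literal text, so no escaping is needed
first_fixed <- word(x, start = 1, sep = fixed(".."))
first_fixed
```

Escaping keeps the full power of regex available for the rest of the pattern, while fixed() is convenient when the separator is plain text.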

We start by creating our data table, then test out the separator while looking for the oldest and newest test for each patient.

We can use a loop to iterate over each row in the range 1 to number of rows. Note what happens if you do not escape the dots.

# Create our table with a row for each patient
testDates <- rbind( A = "03/23/2020..04/02/2020..05/14/2020..09/25/2020",
                    B = "05/04/2020..07/15/2020..08/21/2020",
                    C = "09/13/2020..09/24/2020..10/01/2020..11/25/2020" )


# Find the first test date for each patient
for(patient in 1:dim(testDates)[1]){ 
  print(word(testDates[patient], start = 1, sep = "\\.\\."))
}


### For comparison, what happens if you forget to escape the dots?
word(testDates[1], start = 1, sep = "..")

# Find the most recent test date for all patients
for(patient in 1:dim(testDates)[1]){ 
  print(word(testDates[patient], start = -1, sep = "\\.\\."))
}
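
Because stringr functions are vectorized over their input strings, the loops above can also be written as single calls. This is a sketch of that alternative, not the approach used in the rest of this tip. The names `dateStrings`, `firstTests`, and `lastTests` are assumptions introduced here.

```r
library(stringr)

# Same data as above, rebuilt here so the sketch runs on its own
testDates <- rbind(A = "03/23/2020..04/02/2020..05/14/2020..09/25/2020",
                   B = "05/04/2020..07/15/2020..08/21/2020",
                   C = "09/13/2020..09/24/2020..10/01/2020..11/25/2020")

# Flatten the one-column matrix into a plain character vector
dateStrings <- as.vector(testDates)

# One call handles every patient: first date, then most recent date
firstTests <- word(dateStrings, start = 1, sep = "\\.\\.")
lastTests <- word(dateStrings, start = -1, sep = "\\.\\.")

firstTests
lastTests
```

The negative start is what makes this work even though patient B has fewer tests than the others.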

__________________________________________________________________________

3) Separate a single list of words by applying separator text.

str_split() will return a list of split strings, split on the separator text provided. The split pattern itself is not returned.

str_split() takes in the string that will be split and the pattern to split the string on.

In part 2, we looked at how word() captures individual dates as words. But as we also saw in part 2, word() takes ranges of words when you supply both a start and end position. What does that mean for our date list, where we don't want to see the separator?

Using our same test date table, for Patient A, we will:

Find the 1st through 3rd dates
Print the non-split captured text from word()
Split the dates on the same separator used to find them, and print as a list

# Assign the captured word range to a variable
dates <- word(testDates[1], start = 1, end = 3, sep = "\\.\\.") # Row 1 is Patient A
dates # View


# List as split dates 
print(str_split(dates, pattern = "\\.\\."))

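In the results, note the dots included in the non-split dates: word() returns everything from start to end, separators included. str_split() then returns the three dates as separate elements of a list, with the separator removed.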
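
Since str_split() returns a list, it can help to see how to reach the individual dates. This is a minimal sketch on a copy of Patient A's first three dates. The names `splitDates` and `dateVector` are assumptions made for this illustration.

```r
library(stringr)

# Patient A's first three dates, as captured by word() above
dates <- "03/23/2020..04/02/2020..05/14/2020"

# str_split() returns a list with one element per input string
splitDates <- str_split(dates, pattern = "\\.\\.")
splitDates

# [[1]] pulls out the character vector for the first (and only) input string
dateVector <- splitDates[[1]]
length(dateVector)

# Individual dates can then be picked by position
dateVector[2]
```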

* But what if we need only certain dates as a list and don't want the entire range?

Instead of using word() and str_split() together, we can replace the start and end positions by passing a vector of positions to the word function itself!

Using c() to provide several positions to the word() function, we can select certain test dates and return them together, one result per position.

Let's pick out test 1 and 3 for each patient:

for(patient in 1:dim(testDates)[1]){
  print(word(testDates[patient], start = c(1, 3), sep = "\\.\\."))
}

The results will contain only tests 1 and 3 for each patient.

__________________________________________________________________________

4) Find specific text patterns in a string

str_extract() allows us to return matched text found within a string.

str_extract() takes in the string you are searching within and a regex pattern you are searching for.

Let's say we have a list of hospital diagnosis codes, but we only want to pull out certain ones. The codes we are looking for start with "T" and are followed by the numbers 36-50. (ex: T36). Before we can match using str_extract(), we want to create a regex pattern to match against which will:

Be one string of text (required for regex)
Have each desired code separated by the or symbol "|", so that we pull matches to any of the possible codes in our desired list
Not require us to type out each code by hand

Starting our pattern, paste() can list out all the numbers in our range:

paste(36:50)

But that is not quite what we want the numbers to look like for use as a pattern. Remember that to be a regex pattern it must be one string of text. And we also want a leading "T" on each number. If we use the collapse option within paste, we can get one string and can specify how our numbers are separated.

Using collapse within paste() lets you create any separator you wish for your list, collapsing it into one string.

Improve our pattern:

The "|" is used as "or" in the regex, so each code must be separated by this.
By adding the letter "T" to our collapse, we add the leading T to each trailing number

paste(paste(36:50), collapse ="|T")

Almost there! But we are still missing the first "T". This is an easy fix. We can use paste0() to quickly add the leading T without any additional separators. It looks long, but the paste0() code will simply surround our previous pattern: paste0("T", previousPattern)

# Save pattern to a variable
pat <- paste0("T", paste(paste(36:50), collapse = "|T") ) 

# View our finished pattern
pat

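Let's review the development of this pattern:

paste(36:50) gives 15 separate strings: "36" "37" ... "50"
paste(paste(36:50), collapse = "|T") gives one string: "36|T37|...|T50"
paste0("T", ...) adds the leading T: "T36|T37|...|T50"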

Now we are ready to use str_extract() to find our matching codes!

# String of codes to pull matches from
codes <- "T35..T42..J12..H14..M44..T55..A23..T47..A36..T43..M21..H51" 


str_extract(codes, pat) # Print the FIRST match

str_extract_all(codes, pat) # Print ALL matches
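
For comparison, the same range of codes can also be described with a hand-written regex instead of a pasted list. This is a sketch of an alternative, not the method used in this tip. The `rangePat`, `fromPaste`, and `fromRange` names are assumptions introduced here.

```r
library(stringr)

# Same codes as above, repeated so the sketch runs on its own
codes <- "T35..T42..J12..H14..M44..T55..A23..T47..A36..T43..M21..H51"

# Pattern built as in the text: "T36|T37|...|T50"
pat <- paste0("T", paste(36:50, collapse = "|T"))

# Hand-written equivalent: T followed by 36-39, 40-49, or 50
rangePat <- "T(3[6-9]|4[0-9]|50)"

# str_extract_all() returns a list, so [[1]] gives the matches for our one string
fromPaste <- str_extract_all(codes, pat)[[1]]
fromRange <- str_extract_all(codes, rangePat)[[1]]

fromPaste
fromRange
```

Both patterns skip T35 and T55 (outside the range) and A36 (wrong letter). Neither checks what comes after the number, so a longer code such as "T421" would still yield a "T42" match. The pasted version is easier to generate for arbitrary lists of codes, while the range version is shorter to read.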

__________________________________________________________________________

HAPPY PROGRAMMING!
