Apply Functions
piping
Much like the | operator in bash, the %>% operator in R pipes the output of the expression on its left into the function on its right, as that function's first argument. For example, c(1,2,3) %>% sum() is equivalent to:
sum(c(1,2,3))
[1] 6
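A minimal sketch of the equivalence, assuming the magrittr package (which supplies %>% and is also made available by dplyr) is installed. The variable names are illustrative only.

```r
library(magrittr)  # supplies %>%; dplyr makes it available as well

# The left-hand side becomes the first argument of the right-hand call.
nested <- sum(c(1, 2, 3))
piped <- c(1, 2, 3) %>% sum()

# Extra arguments stay in the call: x %>% f(y) is f(x, y).
rounded_nested <- round(c(1.234, 5.678), 1)
rounded_piped <- c(1.234, 5.678) %>% round(1)

identical(nested, piped)
identical(rounded_nested, rounded_piped)
```

Both comparisons should come back TRUE, since the pipe only changes how the call is written, not what is computed.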
It is extremely common practice in the tidyverse to pipe output from one function to another. For example:
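library(dplyr)  # provides %>%, mutate, and select
# Keep rows with Sepal.Length > 5, then add a squared-length column.
# The result is named iris_subset so it does not shadow base R's subset().
iris_subset <- iris %>%
  subset(Sepal.Length > 5) %>%
  mutate(Sepal.Length.Sq = Sepal.Length^2)
head(iris_subset)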
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.Sq
1 5.1 3.5 1.4 0.2 setosa 26.01
2 5.4 3.9 1.7 0.4 setosa 29.16
3 5.4 3.7 1.5 0.2 setosa 29.16
4 5.8 4.0 1.2 0.2 setosa 33.64
5 5.7 4.4 1.5 0.4 setosa 32.49
6 5.4 3.9 1.3 0.4 setosa 29.16
select
select is a handy function used to select columns from a data.frame or tibble. For example:
iris %>% select(Sepal.Length, Species) %>% head()
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
That alone is not that impressive, as we could easily do something like:
iris[, c("Sepal.Length", "Species")] %>% head()
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
However, in the same way you can write 1:4 to represent a vector of numbers from 1 to 4, you can select columns from Sepal.Length to Petal.Length (and everything in between) by using Sepal.Length:Petal.Length.
iris %>% select(Sepal.Length:Petal.Length) %>% head()
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
select is particularly useful when paired with selection helpers, as you can select certain columns based on their names:
iris %>% select(contains("length")) %>% head()
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
# or case sensitive
iris %>% select(contains("Length", ignore.case = FALSE)) %>% head()
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
selection helpers
Selection helpers are functions that make selecting variables easier. They are particularly easy to use with select.
everything matches all variables. For example:
iris %>% select(everything()) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
It is primarily useful when used in combination with functions like pivot_longer and pivot_wider.
last_col selects the last variable, possibly with an offset.
iris %>% select(last_col()) %>% head()
Species
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
6 setosa
Or, with an offset:
iris %>% select(1:last_col(2)) %>% head()
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
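To make the offset concrete, here is a small sketch, assuming dplyr is installed. The names by_offset and by_position are made up for this illustration.

```r
library(dplyr)

# last_col(2) picks the column two positions before the last one.
# iris has 5 columns, so an offset of 2 lands on column 3 (Petal.Length).
by_offset <- iris %>% select(last_col(2)) %>% names()

# The same column, found by counting back from ncol(iris) in base R.
by_position <- names(iris)[ncol(iris) - 2]

identical(by_offset, by_position)
```

This is why 1:last_col(2) keeps the first three columns of iris.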
contains selects columns whose names contain a given string. Matching ignores case by default. For example:
iris %>% select(contains("sepal")) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
In the same way that contains looks for a string within the column names of a data.frame, starts_with and ends_with select columns whose names start or end, respectively, with any of the supplied strings. For example, to get the columns starting with "Sepal":
iris %>% select(starts_with("sepal")) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
Or to get columns that end in "width":
iris %>% select(ends_with("width")) %>% head()
Sepal.Width Petal.Width
1 3.5 0.2
2 3.0 0.2
3 3.2 0.2
4 3.1 0.2
5 3.6 0.2
6 3.9 0.4
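The helpers also accept more than one string at a time. A short sketch, assuming dplyr is installed; the prefixes and the variable name are chosen only for illustration.

```r
library(dplyr)

# A character vector of prefixes selects columns that start with any of them.
# Matching ignores case by default, so "sepal" matches "Sepal.Length".
sepal_or_species <- iris %>%
  select(starts_with(c("sepal", "spec"))) %>%
  names()

sepal_or_species
```

Here the two Sepal columns and Species should be selected, while the Petal columns are left out.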
For more fine-grained control, matches behaves the same way, but instead of matching a literal string, it takes a regular expression. For example, we could get all columns whose names contain one or more ".":
iris %>% select(matches("\\.+")) %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
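Regular expressions become more useful once anchors are involved. The following is a sketch under the assumption that dplyr is installed; the patterns are examples, not the only way to write them.

```r
library(dplyr)

# "^Petal\\." anchors the match to the start of the column name.
petal_cols <- iris %>% select(matches("^Petal\\.")) %>% names()

# "Length$" anchors the match to the end of the column name.
length_cols <- iris %>% select(matches("Length$")) %>% names()

petal_cols
length_cols
```

Like the other helpers, matches ignores case unless ignore.case = FALSE is supplied.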
Sometimes, you’ll have datasets with columns labeled sequentially, for example:
head(billboard)
# A tibble: 6 x 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
# … with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
# wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
# wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
# wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
# wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
# wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
Here, we have columns labeled wk1 all the way until wk76. Using num_range and select we can get any number of those specific columns:
billboard %>% select(num_range("wk", 70:75)) %>% head()
# A tibble: 6 x 6
wk70 wk71 wk72 wk73 wk74 wk75
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 NA NA NA NA NA NA
5 NA NA NA NA NA NA
6 NA NA NA NA NA NA
all_of is a selection helper designed to select strictly the columns whose names are inside the provided vector.
my_values <- c("Sepal.Length", "Sepal.Width")
iris %>% select(all_of(my_values)) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
However, if even one name in your vector isn’t a column in the data, an error is thrown.
my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(all_of(my_values)) %>% head()
Error: Cannot subset columns that do not exist.
✖ Column `Sepal.Weight` does not exist.
When you would like to select the columns that exist and skip the ones that don’t, any_of is more useful. It is similar to all_of, but it silently ignores names that are not present in the data.
my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(any_of(my_values)) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
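One place where any_of tends to be handy is dropping columns that may or may not be present. A minimal sketch, assuming dplyr is installed; to_drop is a hypothetical list of unwanted columns.

```r
library(dplyr)

# Hypothetical list of columns to remove; "Sepal.Weight" is not in iris.
to_drop <- c("Species", "Sepal.Weight")

# Negating any_of() drops the listed columns that exist and ignores the rest.
remaining <- iris %>% select(-any_of(to_drop)) %>% names()

remaining
```

The same call with all_of would fail, because Sepal.Weight does not exist in iris.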
transmute
transmute is a useful function that adds new variables and drops all existing ones. If a variable already exists, it overwrites the variable. For example, let’s say we wanted to capitalize the values of Species in the iris dataset:
iris %>%
transmute(Species = toupper(Species)) %>%
head()
Species
1 SETOSA
2 SETOSA
3 SETOSA
4 SETOSA
5 SETOSA
6 SETOSA
Here, the values in the Species column are overwritten with the fully capitalized version, and all of the other columns are dropped. One way to keep other columns is to include them in the transmute call:
iris %>%
transmute(Species = toupper(Species), Sepal.Length, Sepal.Width) %>%
head()
Species Sepal.Length Sepal.Width
1 SETOSA 5.1 3.5
2 SETOSA 4.9 3.0
3 SETOSA 4.7 3.2
4 SETOSA 4.6 3.1
5 SETOSA 5.0 3.6
6 SETOSA 5.4 3.9
Alternatively, you could use mutate, which has the same behavior, but preserves existing variables.
mutate
mutate is just like transmute, but the original data is preserved. For example:
iris %>%
mutate(Species = toupper(Species)) %>%
head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 SETOSA
2 4.9 3.0 1.4 0.2 SETOSA
3 4.7 3.2 1.3 0.2 SETOSA
4 4.6 3.1 1.5 0.2 SETOSA
5 5.0 3.6 1.4 0.2 SETOSA
6 5.4 3.9 1.7 0.4 SETOSA
Here, since Species already exists as a column, the column is overwritten by our new capitalized values. If the name of the new column does not already exist, the original Species column will remain untouched. For example:
iris %>%
mutate(Species_Cap = toupper(Species)) %>%
head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_Cap
1 5.1 3.5 1.4 0.2 setosa SETOSA
2 4.9 3.0 1.4 0.2 setosa SETOSA
3 4.7 3.2 1.3 0.2 setosa SETOSA
4 4.6 3.1 1.5 0.2 setosa SETOSA
5 5.0 3.6 1.4 0.2 setosa SETOSA
6 5.4 3.9 1.7 0.4 setosa SETOSA
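mutate can also create several columns in one call. The sketch below assumes dplyr is installed; Sepal.Area and Sepal.Ratio are made-up column names used only for illustration.

```r
library(dplyr)

# Columns are created in order, so Sepal.Ratio can refer to Sepal.Area,
# which was defined one line earlier in the same mutate() call.
iris_extra <- iris %>%
  mutate(
    Sepal.Area = Sepal.Length * Sepal.Width,
    Sepal.Ratio = Sepal.Area / Petal.Length
  )

head(iris_extra, 3)
```

The original five columns are kept, and the two new ones are appended at the end.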
mutate is extremely useful. The closest equivalent in pandas in Python is DataFrame.assign, which is arguably less intuitive to use.
case_when
case_when is a function that lets you vectorize multiple if_else statements. For example, let’s say we want to create a new column in our iris dataset called size, where the value is Large if Sepal.Length is greater than 5, and Not Large otherwise:
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large",
Sepal.Length <= 5 ~ "Not Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa Not Large
3 4.7 3.2 1.3 0.2 setosa Not Large
4 4.6 3.1 1.5 0.2 setosa Not Large
5 5.0 3.6 1.4 0.2 setosa Not Large
6 5.4 3.9 1.7 0.4 setosa Large
Here, mutate is responsible for creating a new column called size, and case_when assigns the value Large when Sepal.Length is greater than 5 and Not Large when Sepal.Length is less than or equal to 5. In this case the conditions are exhaustive: every possible value of Sepal.Length satisfies one of them, so every row receives a value (Large or Not Large). In practice, it is not always possible to list every case. For example, let’s remove the second case:
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa <NA>
3 4.7 3.2 1.3 0.2 setosa <NA>
4 4.6 3.1 1.5 0.2 setosa <NA>
5 5.0 3.6 1.4 0.2 setosa <NA>
6 5.4 3.9 1.7 0.4 setosa Large
As you can see, by default, if no cases match, NA is the resulting value. One common technique to handle "all other cases" is the following:
new_iris <- iris %>%
mutate(size = case_when(
Sepal.Length > 5 ~ "Large",
TRUE ~ "Not Large"
))
head(new_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa Large
2 4.9 3.0 1.4 0.2 setosa Not Large
3 4.7 3.2 1.3 0.2 setosa Not Large
4 4.6 3.1 1.5 0.2 setosa Not Large
5 5.0 3.6 1.4 0.2 setosa Not Large
6 5.4 3.9 1.7 0.4 setosa Large
Here, the cases are evaluated in order, and the first one that matches determines the result. Because TRUE always matches, any row that reaches the last case receives Not Large.
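Because the first matching case wins, the order of the cases matters. A small sketch, assuming dplyr is installed; sepal_lengths and the "Very Large" label are hypothetical.

```r
library(dplyr)

sepal_lengths <- c(4.5, 5.5, 7.5)  # hypothetical values

# The most specific condition comes first, so 7.5 is labelled "Very Large".
specific_first <- case_when(
  sepal_lengths > 7 ~ "Very Large",
  sepal_lengths > 5 ~ "Large",
  TRUE ~ "Not Large"
)

# With the broad condition first, 7.5 already matches "> 5",
# and the "Very Large" case is never reached.
broad_first <- case_when(
  sepal_lengths > 5 ~ "Large",
  sepal_lengths > 7 ~ "Very Large",
  TRUE ~ "Not Large"
)

specific_first
broad_first
```

A reasonable rule of thumb is to list cases from most specific to most general, with the TRUE catch-all last.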
between
between is a simple function from dplyr that is an efficiently implemented shortcut for x >= left & x <= right, with both bounds included. For example:
x <- 5
print(x >= 4 & x <= 10)
[1] TRUE
# instead you can use between
between(x, 4, 10)
[1] TRUE
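between also works on whole vectors, which makes it convenient inside filter. A brief sketch, assuming dplyr is installed; the bounds are arbitrary.

```r
library(dplyr)

x <- c(3, 4, 7, 10, 11)

# between() is vectorized and includes both end points.
in_range <- between(x, 4, 10)

# Keep only the rows of iris whose Sepal.Length lies in [5, 5.5].
mid_sepals <- iris %>% filter(between(Sepal.Length, 5, 5.5))

in_range
range(mid_sepals$Sepal.Length)
```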
group_by
group_by is a function commonly used in conjunction with mutate, transmute, and summarize. It is useful when you want to perform a tapply-like operation on a data.frame. For example, let’s say you wanted to get the average Petal.Length by Species. Using tapply, you would do something like:
tapply(iris$Petal.Length, iris$Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
While useful, tapply’s end result isn’t in a format that is conducive to further analysis or wrangling. For example, what if we wanted to calculate and then plot (with ggplot2) the difference between the mean Petal.Length and the mean Sepal.Length by Species? Using tapply, you would have to do something like:
diff <- tapply(iris$Petal.Length, iris$Species, mean) - tapply(iris$Sepal.Length, iris$Species, mean)
myDF <- data.frame(Species = names(diff), diff = unname(diff))
ggplot(myDF, aes(x=diff, y=Species)) + geom_bar(stat="identity")
The tapply version works, but it is harder to read than the dplyr version below, and each additional operation would make it more awkward. With group_by, the result stays a data frame that we can continue to build on:
library(ggplot2)
# summarize returns one row per Species, so each bar shows a single difference.
# (mutate would keep all 150 rows, and the stacked bars would overstate it.)
myDF <- iris %>%
  group_by(Species) %>%
  summarize(diff = mean(Petal.Length) - mean(Sepal.Length))
myDF %>% ggplot(aes(x = diff, y = Species)) + geom_bar(stat = "identity")
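As a rough check that the grouped approach gives the same numbers as tapply, the sketch below compares the two, assuming dplyr is installed. The names by_tapply, by_dplyr, and avg_petal_length are illustrative.

```r
library(dplyr)

# Base R: a named array with one mean per species.
by_tapply <- tapply(iris$Petal.Length, iris$Species, mean)

# dplyr: a data frame with one row per species.
by_dplyr <- iris %>%
  group_by(Species) %>%
  summarize(avg_petal_length = mean(Petal.Length))

# Same values; only the container differs.
all.equal(as.vector(by_tapply), by_dplyr$avg_petal_length)
```

Since by_dplyr is a data frame, it can be piped straight into further dplyr or ggplot2 calls.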
summarize
summarize is a useful function to get a new, tidy, data frame that is a summary of some other data. It’s particularly useful in conjunction with group_by, when you want to compare groups.
For example, let’s say you wanted to do the following:
- Create a new column called Sepal.Length.Cat with values small when Sepal.Length < 5.1, large when Sepal.Length >= 5.8, and medium otherwise.
- Get a summary containing the average Sepal.Width by Sepal.Length.Cat and Species.
- Get a summary containing the variation in averages for each Species.
iris %>%
mutate(Sepal.Length.Cat = case_when(
Sepal.Length < 5.1 ~ "small",
Sepal.Length >= 5.8 ~ "large",
TRUE ~ "medium"
)) %>%
group_by(Sepal.Length.Cat, Species) %>%
summarize(avg_sepal_width_grouped = mean(Sepal.Width)) %>%
group_by(Species) %>%
summarize(std_of_avgs = sd(avg_sepal_width_grouped))
`summarise()` regrouping output by 'Sepal.Length.Cat' (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
Species std_of_avgs
<fct> <dbl>
1 setosa 0.402
2 versicolor 0.329
3 virginica 0.255
As you can see, it has some pretty powerful functionality that would be more difficult to replicate (and harder to read) using base R.
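The regrouping messages printed by summarise can be avoided by stating the grouping of the result explicitly. This is a sketch assuming dplyr 1.0.0 or later, where the .groups argument is available; avg_widths is a made-up name.

```r
library(dplyr)

# .groups = "drop" returns an ungrouped data frame, so no message is printed.
avg_widths <- iris %>%
  mutate(Sepal.Length.Cat = case_when(
    Sepal.Length < 5.1 ~ "small",
    Sepal.Length >= 5.8 ~ "large",
    TRUE ~ "medium"
  )) %>%
  group_by(Sepal.Length.Cat, Species) %>%
  summarize(avg_sepal_width_grouped = mean(Sepal.Width), .groups = "drop")

is_grouped_df(avg_widths)
```

Dropping the groups explicitly also makes it clearer which grouping a later summarize call is working with.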
str_extract and str_extract_all
str_extract and str_extract_all are useful functions from the stringr package. You can install the package by running:
install.packages("stringr")
str_extract extracts the text that matches the provided regular expression or pattern. This differs from grep in a major way. In R, grep returns the indices of the elements in which a match was found (or, with value = TRUE, the entire matching elements), much as command-line grep prints the entire line containing a match. str_extract returns only the part of each element that matches the pattern. For example:
text <- c("cat", "mat", "spat", "spatula", "gnat")
# All 5 "lines" of text were a match.
grep(".*at", text)
[1] 1 2 3 4 5
text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at")
[1] "cat" "mat" "spat" "spat" "gnat"
As you can see, although all 5 words match our pattern and would be returned by grep, str_extract only returns the actual text that matches the pattern. In this case "spatula" is not a "full" match: the pattern .*at only captures the "spat" part of "spatula". In order to capture the rest of the word, you would need to add something like .* to the end of the pattern:
text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at.*")
[1] "cat" "mat" "spat" "spatula" "gnat"
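When an element contains no match at all, str_extract returns NA for it. A short sketch, assuming stringr is installed; the words vector is made up.

```r
library(stringr)

words <- c("cat", "dog", "spatula")

# "dog" has no "at" in it, so its result is NA. The output always
# has the same length as the input.
extracted <- str_extract(words, ".*at")

extracted
```

Keeping one result per input element makes it easy to store the output as a new column next to the original text.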
Examples
How can I extract the text between parentheses in a vector of texts?
text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) (ok?)")
# Search for a literal "(", followed by any number of characters other than parentheses ([^()]*), followed by a literal ")".
stringr::str_extract(text, "\\([^()]*\\)")
[1] "(you)" "(are)" "(really awesome)"
To get all matches, not just the first match:
text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) more text (ok?)")
# Search for a literal "(", followed by any number of characters other than parentheses ([^()]*), followed by a literal ")".
stringr::str_extract_all(text, "\\([^()]*\\)")
[[1]]
[1] "(you)"
[[2]]
[1] "(are)"
[[3]]
[1] "(really awesome)" "(ok?)"
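If you want the text without the surrounding parentheses, one option is a capture group with str_match, which is also part of stringr. This is a sketch of one possible approach, not the only one.

```r
library(stringr)

text <- c("this is easy for (you)", "there (are) challenging ones")

# str_match() returns a character matrix: column 1 holds the full match,
# column 2 holds the first capture group (the text inside the parentheses).
inside <- str_match(text, "\\(([^()]*)\\)")[, 2]

inside
```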
lubridate
lubridate is a fantastic package that makes the typical tasks one would perform on dates much easier.
Examples
How do I convert a string "07/05/1990" to a Date?
library(lubridate)
dat <- "07/05/1990"
dat <- mdy(dat)
class(dat)
[1] "Date"
How do I convert a string "31-12-1990" to a Date?
my_string <- "31-12-1990"
dat <- dmy(my_string)
dat
[1] "1990-12-31"
class(dat)
[1] "Date"
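Once a string has been parsed into a Date, lubridate's accessor functions can pull out its components. A minimal sketch, assuming lubridate is installed; ymd is the year-month-day counterpart of mdy and dmy.

```r
library(lubridate)

dat <- ymd("1990-12-31")

# Accessor functions return individual components of the date.
parts <- c(year(dat), month(dat), day(dat))

# Adding a number to a Date moves it forward by that many days.
next_day <- dat + 1

parts
next_day
```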
nchar
nchar is a function which counts the number of characters and symbols in a word or a string. Punctuation and blank spaces are counted as well.
Examples
How can I find the number of characters and/or symbols in the word "Protozoa"?
nchar("Protozoa")
[1] 8
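To see that blank spaces and punctuation are counted too, here is a short base R sketch with made-up strings.

```r
# The space between "a" and "b" counts as a character.
with_space <- nchar("a b")

# The comma, the space, and the exclamation mark all count.
with_punctuation <- nchar("a, b!")

with_space
with_punctuation
```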
How can I find the number of characters and/or symbols for the following strings all at once: "pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@"?
string_vector <- c("pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@")
nchar(string_vector)
[1] 45 33
Fun Fact: pneumonoultramicroscopicsilicovolcanoconiosis, at 45 letters, is often cited as the longest word to appear in a major English dictionary.