Manipulating Strings in R Programming
Strings are the second most widely used data type in R programming (numeric data types are still at the top), and sooner or later you will run into a case where you need to work with them. In our previous article, I explained some functions that help us manipulate numeric values. In this article, I will cover some functions that help us manipulate string values. The functions covered here are not the only ones available for this purpose; there are others that are equally helpful but were left out as a matter of preference. Most of the functions used here are commonly used ones.
String Manipulation: Definition
String manipulation is the process of analyzing strings. It is not limited to analysis, though; it also includes changing, slicing, and parsing strings. The R programming language has built-in functions that handle these tasks most of the time. The following are the ones we are going to cover in this article.
- The paste() Function
- The substr() Function
- The strsplit() Function
- The grep() Function
- The abbreviate() Function
- The toupper() Function
- The tolower() Function
- The casefold() Function
The paste() Function
The paste() function in R programming allows us to concatenate multiple string values to create a longer string. The “sep =” argument sets the separator placed between the values. The “collapse =” argument joins the resulting strings into a single long string. The syntax for the paste() function is as shown below:
paste(..., sep = " ", collapse = NULL)
Where,
- ...: one or more R objects, which are converted to character vectors and concatenated.
- sep: the separator placed between the objects. It is an optional argument and defaults to a single space.
- collapse: an optional string used to join the elements of the result into one long string. The default, NULL, leaves the result as a vector.
Let us see some examples for the paste() function.
> #Using paste() function to concatenate strings
> paste("My", "name", "is", "Lalit") #Without sep = argument
[1] "My name is Lalit" #Output
>
> paste("My", "name", "is", "Lalit", sep = ",") #With sep = argument
[1] "My,name,is,Lalit"
In the first example, we haven’t specified the “sep =” argument, so the default separator (a single space) is used. In the second example, we have used the comma as a separator.
> a <- paste(c(2:5), c(10:13), sep = "--")
> print(a)
[1] "2--10" "3--11" "4--12" "5--13"
Here, the first and second arguments are vectors of the same length, so each element of the first is concatenated with the corresponding element of the second, with the separator “--” between them.
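The examples above do not exercise the “collapse =” argument described earlier. The following is a minimal sketch of how it behaves; the variable names (words, separate, sentence, joined) are made up for illustration.

```r
# Hypothetical input vector, used only for illustration
words <- c("My", "name", "is", "Lalit")

# Given one vector and no collapse, paste() returns one string per element
separate <- paste(words)
print(length(separate))   # [1] 4

# collapse joins the elements of the result into a single string
sentence <- paste(words, collapse = " ")
print(sentence)           # [1] "My name is Lalit"

# sep is applied element-wise first, then collapse joins the pieces
joined <- paste(2:5, 10:13, sep = "--", collapse = ", ")
print(joined)             # [1] "2--10, 3--11, 4--12, 5--13"
```

In short, sep works between the arguments, while collapse works between the elements of the result.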
The substr() Function
The substr() function in R helps us extract or replace a substring of a given string or of each string in a vector. The extraction or replacement is controlled by a starting and an ending position: the substring is taken from the given string based on the starting index value and the ending index value.
The syntax for this function is as shown below:
substr(x, start, stop)
Where,
- x: a single string or a vector of strings.
- start: the starting index from where the function should start extracting/replacing the substring.
- stop: the ending index at which the function stops extracting/replacing the substring.
Let us see some examples of the substr() function:
> #Extracting a substring out of a string
>
> x <- "Programming is fun"
> substr(x, 16, 18)
[1] "fun" #substring extracted from index 16 upto index 18
Similarly, we can replace a substring of a given string using a combination of the assignment operator and the substr() function. Let us replace “fun” with “luv”.
> #Replacing a substring out of a string
> substr(x, 16, 18) <- "luv"
> print(x)
[1] "Programming is luv"
This code replaces the substring from the 16th to the 18th position (i.e. “fun”) with the new string “luv”.
> p <- c("I", "Like", "R", "Programming")
> substr(p, 1, 1) <- "$" #Replacing character at first position from each string under p
> print(p)
[1] "$" "$ike" "$" "$rogramming"
In this example, the first character of each string in the original vector is replaced with “$”.
> p <- c("I", "Like", "R", "Programming")
> #Replacing first character of each string with respective symbols as shown below:
> substr(p, 1, 1) <- c("@", "#", "$", "%")
> print(p)
[1] "@" "#ike" "$" "%rogramming"
Here, the first character of each string in the original vector is replaced with the corresponding symbol from the replacement vector.
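The examples above extract from a single string and replace within a vector. As a complement, here is a small sketch of extraction from a vector, plus a replacement whose length does not match the target range. The variable names first3 and last_char are made up for illustration.

```r
# Same vector of strings as in the examples above
p <- c("I", "Like", "R", "Programming")

# Extract the first three characters of every element;
# elements shorter than three characters are returned unchanged
first3 <- substr(p, 1, 3)
print(first3)      # [1] "I"   "Lik" "R"   "Pro"

# start and stop can be vectors too: nchar() gives each string's length,
# so this picks the last character of every element
last_char <- substr(p, nchar(p), nchar(p))
print(last_char)   # [1] "I" "e" "R" "g"

# Replacement never changes the length of the string: a replacement
# longer than the target range is cut to fit
x <- "Programming is fun"
substr(x, 16, 18) <- "great"
print(x)           # [1] "Programming is gre"
```

Because of that last behavior, substr() replacement suits same-length swaps; for replacements of a different length, a function such as sub() is usually the better fit.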
The strsplit() Function
When you need to split a string or a vector of strings into multiple substrings at a literal delimiter or a regular expression, you can use the strsplit() function. This function takes a string or a vector of strings as input and returns a list with one element per input string; each element is a character vector of the pieces produced by the split.
This function can be very useful in text mining or text analytics and has the syntax shown below:
strsplit(x, split, fixed = FALSE)
Here,
- x: a string or a vector of character strings to split.
- split: the delimiter at which the split happens. It is interpreted as a regular expression by default. It can also be a character vector, in which case it is recycled along x.
- fixed: if set to TRUE, split is matched literally, exactly as written, instead of being treated as a regular expression.
> z <- "Next few months will be important to keep COVID-19 down."
> strsplit(z, " ") #Making a split at every space.
[[1]]
[1] "Next" "few" "months" "will" "be" "important"
[7] "to" "keep" "COVID-19" "down."
In this code, the split happens at every space.
> o <- "Let's-split-text-at-every-hyphen"
> strsplit(o, "-") #Making split at every hyphen.
[[1]]
[1] "Let's" "split" "text" "at" "every" "hyphen"
In the example above, the split happens at each hyphen. As you may have noticed, the output generated by strsplit() is a list, with the pieces of the original string stored as a character vector inside it.
> class(strsplit(o, "-"))
[1] "list"
> class(strsplit(z, " "))
[1] "list"
If you want to convert the output into a simple atomic vector, you can wrap the strsplit() call in the unlist() function.
> unlist(strsplit(z, " "))
[1] "Next" "few" "months" "will" "be" "important"
[7] "to" "keep" "COVID-19" "down."
>
> unlist(strsplit(o, "-"))
[1] "Let's" "split" "text" "at" "every" "hyphen"
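Since the split argument is treated as a regular expression unless “fixed = TRUE” is given, delimiters that are regex metacharacters need some care. The sketch below illustrates the difference; the strings and variable names are made up for illustration.

```r
# Hypothetical input string for illustration
version <- "3.6.1"

# By default split is a regular expression, and "." matches any character,
# so every character acts as a delimiter and only empty strings remain
regex_split <- strsplit(version, ".")[[1]]
print(all(regex_split == ""))   # [1] TRUE

# fixed = TRUE treats "." as a literal full stop
literal_split <- strsplit(version, ".", fixed = TRUE)[[1]]
print(literal_split)            # [1] "3" "6" "1"

# A regular expression can split on several delimiters at once
mixed <- "a,b;c d"
multi_split <- strsplit(mixed, "[,; ]")[[1]]
print(multi_split)              # [1] "a" "b" "c" "d"
```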
The grep() Function
The grep() function searches for a pattern in R. Used on a string or a vector of strings, it looks for the given pattern in each element and returns the indices of the elements in which a match is found. Please note that, by default, it returns only the index of each matching element and not the string itself. The syntax of this function is as shown below:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
Where,
- pattern: the string pattern (a regular expression by default) to be matched in the given string vector.
- x: a string or a vector of strings in which we want to search for the pattern.
- ignore.case: if FALSE, the pattern match is case sensitive; if TRUE, case is ignored.
- perl: a logical argument that specifies whether Perl-compatible regular expressions should be used.
- value: a logical argument. If FALSE (the default), the indices of the matching elements are returned; if TRUE, the matching elements themselves are returned.
- fixed: a logical argument. If TRUE, the pattern is matched literally, exactly as written. This overrides any conflicting arguments.
- useBytes: if set to TRUE, the matching is done byte by byte rather than character by character.
- invert: if set to TRUE, the function returns the indices (or values) of the elements that do not match the pattern.
Time for the examples associated with grep() function:
> #Using grep() to return indices of matched strings
> vect <- c("Mohan", "Sam", "Ben", "Eliza", "Tucker")
> grep("a", vect, value = FALSE)
[1] 1 2 4 #Returning indices of strings where pattern matched.
>
> #using grep() to return the strings with matched pattern instead of indices
> vect <- c("Mohan", "Sam", "Ben", "Eliza", "Tucker")
> grep("a", vect, value = TRUE)
[1] "Mohan" "Sam" "Eliza" #Returning string where pattern matched.
In the first example, the grep() function returns the indices of the elements where the pattern (i.e. “a”) is found. In the second example, it returns the matching values themselves.
> #Inverting the results of grep() function
> vect <- c("Mohan", "Sam", "Ben", "Eliza", "Tucker")
> grep("a", vect, value = TRUE, invert = TRUE)
[1] "Ben" "Tucker"
In this example, the “invert = TRUE” argument is used, which makes the function work inversely: it returns all the string values where the pattern does not match.
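The ignore.case and fixed arguments, and the fact that the pattern is a regular expression, are not shown in the examples above. Here is a short sketch of each, reusing the same vector of names; the other variable names are made up for illustration.

```r
vect <- c("Mohan", "Sam", "Ben", "Eliza", "Tucker")

# Case-sensitive by default: "Eliza" has only an upper case "E"
lower_e <- grep("e", vect)
print(lower_e)    # [1] 3 5

# ignore.case = TRUE also matches the upper case "E" in "Eliza"
any_e <- grep("e", vect, ignore.case = TRUE)
print(any_e)      # [1] 3 4 5

# The pattern is a regular expression: ^ anchors to the start, $ to the end
starts_s <- grep("^S", vect, value = TRUE)
ends_n <- grep("n$", vect, value = TRUE)
print(starts_s)   # [1] "Sam"
print(ends_n)     # [1] "Mohan" "Ben"

# fixed = TRUE matches the pattern literally, so "." means a real full stop
files <- c("a.b", "ab")
regex_dot <- grep(".", files)                  # "." matches any character
literal_dot <- grep(".", files, fixed = TRUE)  # only "a.b" contains a "."
print(regex_dot)    # [1] 1 2
print(literal_dot)  # [1] 1
```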
The abbreviate() Function
You can abbreviate strings in R using the abbreviate() function. By default, this function abbreviates strings to a minimum length of four characters (you can reduce this length and make abbreviations of two letters as well) in such a way that the abbreviations remain unique (unless the original strings contain duplicates).
Following is the syntax for this function:
abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE)
Here in this function:
- names.arg: a vector of strings/names to be abbreviated.
- minlength: the minimum length of the abbreviations. By default, it produces four-letter abbreviations.
- use.classes: specifies whether lower case letters should be truncated first. It is a logical argument and is ignored by R right now.
- dot: specifies whether a dot should be appended to the abbreviated texts or not. This is a logical argument.
- strict: if TRUE, the minlength argument above is observed strictly, even if the abbreviations are then no longer unique. The default value is FALSE.
- method: specifies how the truncation should happen: either characters are kept from the left ("left.kept"), or truncation happens from both sides ("both.sides").
- named: if TRUE, the result is a named vector whose names are the original strings.
> #using abbreviate() function with default arguments
> continents <- c("Asia", "Europe", "North America", "South America", "Antartica", "Africa", "Oceania")
> abbreviate(continents)
Asia Europe North America South America Antartica
"Asia" "Eurp" "NrtA" "SthA" "Antr"
Africa Oceania
"Afrc" "Ocen"
You can see the abbreviations with a minimum length of four characters. Also, the original names of the objects are kept in the output.
> #abbreviating the string to two characters
> abbreviate(continents, minlength = 2)
Asia Europe North America South America Antartica
"As" "Er" "NA" "SA" "An"
Africa Oceania
"Af" "Oc"
Here, the minimum length is set to 2 characters, so the names are abbreviated to two characters.
> #We can also remove the original names
> abbreviate(continents, minlength = 2, named = FALSE)
[1] "As" "Er" "NA" "SA" "An" "Af" "Oc"
Since the “named =” argument is specified as FALSE, the original names are dropped from the output.
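Because the default result is a named vector, the original strings can be used to look up their abbreviations. The sketch below reuses the continents vector from the examples above; the variable abbr is made up for illustration.

```r
# Same vector as in the examples above
continents <- c("Asia", "Europe", "North America", "South America",
                "Antartica", "Africa", "Oceania")

abbr <- abbreviate(continents)

# The names of the result are the original strings
print(identical(names(abbr), continents))   # [1] TRUE

# So an abbreviation can be looked up by its original name
print(abbr[["Europe"]])                     # [1] "Eurp"

# unname() drops the names afterwards, much like named = FALSE
print(identical(unname(abbr), abbreviate(continents, named = FALSE)))
```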
The toupper() Function
When we need to convert an entire string into upper case, we can use the toupper() function, which is built into R. This function takes a string or a vector of strings as an argument and converts it into upper case. The syntax of the toupper() function is as below:
toupper(x)
Where,
- x: a character/string vector.
> #The toupper() function converts the string into upper case
> x <- "Let's manipulate the strings in R using different Functions"
> toupper(x)
[1] "LET'S MANIPULATE THE STRINGS IN R USING DIFFERENT FUNCTIONS"
This function, as said above, converts the string into upper case.
The tolower() Function
Contrary to the toupper() function, the tolower() function converts the given string or vector of strings into lower case. The syntax for this function is as shown below:
tolower(x)
Where,
- x: a character vector whose upper case characters are to be converted.
> # Converting string into lower case
> x <- "LET'S MANIPULATE THE STRINGS IN R USING DIFFERENT FUNCTIONS"
> tolower(x)
[1] "let's manipulate the strings in r using different functions"
The casefold() Function
The casefold() function is not much different from the tolower() and toupper() functions. It converts the given string argument into lower or upper case depending on the value of the argument “upper”. If upper is specified as TRUE, it converts the string into upper case; otherwise, it converts the string into lower case, which is the default. The syntax of the casefold() function is as shown below:
casefold(x, upper = FALSE)
Where,
- x: the character vector that needs to be converted to lower or upper case.
- upper: specifies whether the string should be converted into upper case or lower case. It is a logical argument with a default value of FALSE.
> #Example of casefold() Function
> x <- "Let's use the casefold function"
> casefold(x, upper = FALSE) #Converting to lower case
[1] "let's use the casefold function"
> casefold(x, upper = TRUE) #Converting to upper case.
[1] "LET'S USE THE CASEFOLD FUNCTION"
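The three case functions also work element-wise on vectors, and casefold() gives the same results as tolower() and toupper(). A brief sketch follows; the vector langs and the other variable names are made up for illustration.

```r
# Hypothetical vector with mixed case, used only for illustration
langs <- c("R", "Python", "julia")

up <- toupper(langs)
low <- tolower(langs)
print(up)    # [1] "R"      "PYTHON" "JULIA"
print(low)   # [1] "r"      "python" "julia"

# casefold() produces the same results as tolower() and toupper()
same_lower <- identical(casefold(langs), low)
same_upper <- identical(casefold(langs, upper = TRUE), up)
print(same_lower)   # [1] TRUE
print(same_upper)   # [1] TRUE

# A common use: comparing strings without regard to case
same_name <- tolower("RStudio") == tolower("rstudio")
print(same_name)    # [1] TRUE
```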
Well, that is it for this article. In the next article in this series, we will cover more interesting topics from the world of R programming.