Introduction to R Programming: Part 1
Whenever I think about a programming language best suited for data analysis, statistics, and scientific computing, the name that always comes to mind is R. The language was designed and developed specifically for statistical analysis and computing during the early 1990s. It is one of the most popular languages around the globe, used by statisticians, data analysts, data scientists, and researchers to clean data, analyze it, produce graphics from it, and carry out further statistical analysis that can lead to a valid prediction or conclusion.
According to IEEE Spectrum, which publishes a ranking of the top programming languages every year, R ranked fifth in 2019. Although it is designed for a specific domain (data and statistical analysis), the language has maintained its popularity thanks to the recent growth of big data and analytics. An expressive syntax, an easy-to-use interface, and a range of versatile graphics packages are what make this language popular among data geeks.
Why R Programming Language?
For any aspiring user, this is the question of interest. Why should we use R for statistical analysis and computing? What makes it stand apart from all the others in the market?
Well, here are the reasons why we should go for it:
1) It is free of cost and open-source
R is licensed under the GNU General Public License (GPL), so it is free to download and install on any platform. Even better, most of the packages used with R for advanced analysis are released under the GPL as well. This gives R a rich library of packages that will help you in your analysis. You can use all of these for commercial purposes, within the terms of their licenses, without worrying about anyone filing a case against you.
2) Runs on All Platforms Smoothly
R can be downloaded and installed on all major platforms, whether Windows, Mac, or Linux, and it runs smoothly on each of them. R code written on one platform can also be run on another without hassle (known as cross-platform interoperability), which makes it extremely easy to move and run R code across platforms.
3) High Demand Equals High Chances
As per Forbes, the median annual salary for data scientists at Individual Contributor Level I (less than 3 years of experience) in 2019 was USD 95,000, which is a handsome package if you ask me. Of course, knowing R alone will not put you on the list of those who earn that much. But having it in your arsenal will increase your chances of landing a dream job that pays you handsomely.
I think that is enough to show what a powerhouse this language is. Doesn’t it tempt you to dive deep into the ocean of this programming language, which has a bunch of opportunities wide open for you?
In this article, I am not going to explain how to install R and RStudio on your local system. Instead, I will take you straight on a tour of the language itself. This will introduce you to R and to some of its nuts and bolts that need to be tightened before you dive deeper. Let’s make it count!
R as a calculator
Like most programming languages, R handles arithmetic and scientific computation with ease, so you can use it as a calculator.
> #Addition
> 3 + 3
[1] 6
> #Subtraction
> 4 - 2
[1] 2
> #Multiplication
> 2 * 4
[1] 8
> #Division
> 8/2
[1] 4
> #Log
> log(10)
[1] 2.302585
> #square root
> sqrt(2.3)
[1] 1.516575
Well, that was a lot of new things in a single piece of code, wasn’t it? But don’t worry, we will discuss them one by one.
First of all come the comments. R has its own way of representing comments: you put the symbol “#”, commonly known as the hash, number sign, or pound sign, at the start of the comment, and R treats everything from that symbol to the end of the line as a comment.
You may have noticed that the “>” prompt appears at the start of every new line; it is R’s way of telling you that it is ready for a new line of input. This symbol only appears when you are working in the R console. The console is an environment in which every line is executed as soon as you complete it by pressing the Enter key. This is the best part of R for those who want to learn interactively: every piece of code you write is executed immediately and you see its output right away, which makes the learning process much simpler. However, you can also write hundreds of lines of code in an R script and submit them to the system at once.
The usual mathematical symbols “+” and “-” are used for addition and subtraction respectively, as can be seen in the code.
For multiplication, most programming languages use an asterisk (“*”) as the operator, and so does R. It multiplies the numbers on either side of the asterisk.
The forward slash (“/”) represents mathematical division in R.
In R, “log” and “sqrt” are functions that return the logarithm and the square root of the numeric value passed to them as an argument. By default, log() computes the natural logarithm (base e), which is why log(10) returns 2.302585. Functions are reusable procedures, many of them built into the language, that let you do a certain task without much hassle. All you need to do is call the function by name and specify an argument in parentheses. Here, “sqrt” is a function that finds the square root of any number provided as an argument.
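To make the idea of calling a function with arguments a little more concrete, here is a small sketch. The variable names are my own and are used only for illustration; the functions themselves (log, log10, log2, sqrt) are part of base R.

```r
# log() uses base e (the natural logarithm) unless told otherwise
natural_log <- log(10)

# A different base can be requested through the "base" argument
log_base_10 <- log(100, base = 10)

# Shortcut functions exist for the two most common bases
log10_value <- log10(100)
log2_value  <- log2(8)

# sqrt() takes a single numeric argument
root <- sqrt(16)

print(natural_log)
print(c(log_base_10, log10_value, log2_value, root))
```

The last line should show that log(100, base = 10) and log10(100) agree, which is a handy check when you are unsure which base a function uses.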
You may have noticed that every line of output starts with [1]. This is nothing but an index: it gives the position of the first element shown on that line of output.
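The index becomes more useful once the output no longer fits on one line. The sketch below prints forty numbers (the name values is just an illustrative choice); where the lines break depends on the width of your console, so I am not showing the exact output here.

```r
# Forty consecutive whole numbers, enough to wrap over several output lines
values <- 101:140

# Each printed line starts with the position of its first element,
# e.g. [1] on the first line and a larger index on the following lines
print(values)

# The same positions can be used to pick out single elements
first_value <- values[1]    # element at position 1
tenth_value <- values[10]   # element at position 10
print(c(first_value, tenth_value))
```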
Assigning value to a variable
Variables are an integral part of any programming language, as they let the system store a value under a name that can later be used multiple times within the code. In most programming languages, the equals-to (“=”) operator is used to assign values to variables.
You are probably aware that X = 5 means the value 5 has been assigned to the variable X. We can assign a value to a variable this way in R as well. However, this is not recommended in R programming.
In R, the primary assignment operator is the arrow-like symbol “<-” (a less-than sign followed by a hyphen), which assigns the value on its right to the variable on its left. The equals-to operator is not recommended, since it can cause confusion when it is also being used for another purpose (e.g. matching function arguments by name). Therefore it is a better choice to use the “<-” operator for variable assignment in R.
The following is an example of variable assignment in R.
> #Variable assignment using equals-to operator (not recommended)
> x = 5 #Value 5 is assigned to variable x
> #Proper variable assignment using an arrow operator (Use this method)
> x <- 5 #Value 5 is assigned to variable x
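The confusion mentioned above is easiest to see inside a function call. The sketch below uses a made-up function square() and a made-up argument name input_value, and it assumes no variable called input_value exists in your session beforehand.

```r
# A hypothetical helper used only for this illustration
square <- function(input_value) input_value^2

# "=" inside a call only matches the argument by name;
# it does not create a variable in your workspace
a <- square(input_value = 4)
exists_after_equals <- exists("input_value")   # FALSE under the assumption above

# "<-" inside a call creates the variable first and then passes its value on
b <- square(input_value <- 5)
exists_after_arrow <- exists("input_value")    # TRUE

print(c(a, b))
print(c(exists_after_equals, exists_after_arrow))
```

Keeping “=” for function arguments and “<-” for assignment makes it obvious at a glance which of the two things a line of code is doing.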
Combining elements to form an array (Concatenation in R)
In R, the most commonly used data structure is the vector, which plays the role that a one-dimensional array plays in many other languages. To create a vector, or in layman’s terms, to combine multiple elements, R provides the combine function “c”. This function lets us combine multiple elements (of the same data type) under a single variable.
> #Trying to combine multiple elements to form a vector
> my_vector <- (1, 2, 5, 2, 1, 4, 3, 5, 6)
Error: unexpected ',' in "my_vector <- (1,"
> #Actual way of combining multiple elements
> my_vector <- c(1, 2, 5, 2, 1, 4, 3, 5, 6)
> print(my_vector)
[1] 1 2 5 2 1 4 3 5 6
As you can see, in the first piece of code I tried to combine multiple elements into a vector without using the combine function “c”. This produced an error, since the code is not valid according to the grammar of the R language. In the second example, I added “c” before the parentheses while defining the vector, and I was then able to combine the elements into a vector named my_vector.
I have also used the print() function. As the name suggests, this function prints whatever is passed to it as an argument.
Note that when you use “c” in R, every element of the resulting vector must be of the same data type; if the elements differ, R converts them to a common type.
Suppose you try creating a vector in R using “c” as shown below. See what happens.
> #creating a vector with different data types
> my_vector2 <- c(1, 2, TRUE)
> print(my_vector2)
[1] 1 2 1
The code above combines two numeric values and one Boolean value (TRUE/FALSE). When we print my_vector2, we see that the system has automatically converted the Boolean value TRUE into the numeric value 1 (for the Boolean data type, TRUE = 1 and FALSE = 0). This is called vector coercion in R. There are separate functions in R to coerce an entire vector explicitly as well. However, that is not part of our current journey. This concept of coercion leads us to our next topic, Data Types.
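For the curious, here is a short sketch of both kinds of coercion. The variable names are illustrative; c(), class() and the as.* functions are all part of base R.

```r
# Automatic coercion: c() converts everything to the most general type present
logical_and_number <- c(1, 2, TRUE)     # logical becomes numeric
number_and_text    <- c(1, TRUE, "a")   # everything becomes character

print(class(logical_and_number))   # "numeric"
print(number_and_text)             # "1" "TRUE" "a"

# Explicit coercion: the as.* functions convert a whole vector at once
text_to_number  <- as.numeric(c("1", "2.5"))
number_to_text  <- as.character(c(1, 2))
number_to_logic <- as.logical(c(0, 1, 5))   # 0 is FALSE, any other number is TRUE

print(text_to_number)
print(number_to_text)
print(number_to_logic)
```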
Data Types in R
R is considered an object-oriented programming language, and like any language it needs to know the type of data held in a variable or a vector. If you have worked with C or C++, you may be well aware that a variable must be declared as, say, an integer or a character type before a value is assigned to it, so that the system knows what kind of value it holds. R, on the other hand, is flexible. It doesn’t ask you to declare the type of a variable before assigning a value to it. Instead, it is smart enough to identify the data type on its own.
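A quick way to see this flexibility is to reuse one variable for values of different types. This is only a sketch with a made-up variable name, item; class() is the base R function that reports the type R has worked out.

```r
# No type declaration is needed; R infers the type from the value
item <- 42
first_class <- class(item)     # "numeric"

# The same name can later hold a value of a different type
item <- "forty-two"
second_class <- class(item)    # "character"

item <- TRUE
third_class <- class(item)     # "logical"

print(c(first_class, second_class, third_class))
```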
Here comes the concept of data types. In R, we have five basic data types or classes as listed below.
- Numeric (Integers and floating-point numbers)
- Integers (Whole numbers only)
- Complex
- Character (text values)
- Logical (TRUE/FALSE)
Each data type has its own characteristics and properties to follow. See the code below with an explanation for a better understanding.
Numeric Data Type
> #Numeric Data Type
> x <- 3
> y <- 1.4
> print(class(x))
[1] "numeric"
> print(class(y))
[1] "numeric"
In the example above, we have defined two variables, x and y, which hold a whole number and a floating-point number respectively. To find out the data type of these two variables, we need something that reports it.
In R, the class() function returns the data type of the argument passed to it. When we use this function in combination with print(), we learn that the data type of both variables is “numeric”. R places both whole numbers written without a suffix and floating-point numbers in the “numeric” class.
> #Integers only data type
> x <- 3L
> print(class(x))
[1] "integer"
If you need a variable that holds only whole numbers, that is, one with the integer data type, you need to add “L” at the end of the number in R (see the example above).
The reason behind using “L” is quite interesting. R stores integers as 32-bit values, so a sensible explanation is that the “L” stands for “long”, as in a long integer. However, I have not found concrete documentation that confirms this claim. It is just a plausible reason.
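The difference between a plain whole number and a true integer can be checked with a few base R functions. The variable names below are my own; typeof(), is.integer(), is.numeric(), as.integer() and .Machine$integer.max come with base R.

```r
whole_number <- 3     # written without "L": stored as a floating-point value
int_number   <- 3L    # written with "L": stored as a 32-bit integer

print(typeof(whole_number))      # "double"
print(typeof(int_number))        # "integer"

print(is.integer(whole_number))  # FALSE
print(is.integer(int_number))    # TRUE

# Integers still count as numbers
print(is.numeric(int_number))    # TRUE

# The largest value a 32-bit integer can hold
print(.Machine$integer.max)      # 2147483647

# as.integer() drops the fractional part
truncated <- as.integer(3.9)
print(truncated)                 # 3
```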
Complex Data type
Integers and floating-point numbers are not the only kinds of numbers, though. We also have a type that combines a real part and an imaginary part. This data type is known as “complex” in R. Complex data looks like the values shown in the code below.
> #Complex Data Type
> z <- 2 + 3i
> print(class(z))
[1] "complex"
> p <- 3.6 + 9i
> print(class(p))
[1] "complex"
When defining a complex variable, the first part is the real part, which can be either numeric or integer. The second part, a number followed by the letter “i”, is the imaginary part. Together they form a complex value.
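If you want to take a complex number apart again, base R has functions for that. The sketch below reuses the value 2 + 3i from the example; the variable names are only illustrative.

```r
z <- 2 + 3i

real_part      <- Re(z)         # 2
imaginary_part <- Im(z)         # 3
conjugate      <- Conj(z)       # 2-3i
modulus        <- Mod(3 + 4i)   # 5, the distance from zero

# complex() builds the same kind of value from its two parts
rebuilt <- complex(real = 2, imaginary = 3)

print(c(real_part, imaginary_part, modulus))
print(conjugate)
print(rebuilt == z)             # TRUE
```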
Character Data Type
The next data type is a well-known one that we are all familiar with: the character data type. Any combination of characters, strings, or sentences is treated as character data in R. Note that anything enclosed in double quotes (be it a number, an integer, a complex value, anything) is also treated as character data.
> #Character Data Type
> w <- "Hello World!"
> print(class(w))
[1] "character"
> s <- "3 + 5i"
> print(class(s))
[1] "character"
> q <- c("1", "5L")
> print(class(q))
[1] "character"
In the example above, variable w is a classic example of the character data type, as it holds a plain string. Interestingly, variables “s” and “q” hold what look like a complex value and a combination of a numeric and an integer value.
However, the class of these two variables is character, because their values are enclosed in double quotes. In short, anything enclosed in single or double quotes is treated as character data in R.
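Here is a small sketch of what that means in practice. The variable names are my own; identical(), as.numeric(), nchar() and paste() are base R functions.

```r
# Single and double quotes produce exactly the same character value
single_quoted <- 'R'
double_quoted <- "R"
print(identical(single_quoted, double_quoted))   # TRUE

# A number inside quotes is text until it is converted back
number_text <- "42"
print(class(number_text))                        # "character"
converted <- as.numeric(number_text) + 1
print(converted)                                 # 43

# A couple of everyday helpers for character data
greeting_length <- nchar("Hello World!")         # 12 characters
joined <- paste("Hello", "R")                    # "Hello R"
print(greeting_length)
print(joined)
```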
Logical Data Type
The logical data type is one of the most important data types in R. The logical values TRUE and FALSE, which correspond to the numbers 1 and 0 respectively, belong to this data type. See the example below:
> #Logical Data Type (the name b is used rather than c, which is the combine function)
> b <- TRUE
> print(class(b))
[1] "logical"
> d <- FALSE
> print(class(d))
[1] "logical"
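Because TRUE and FALSE behave like 1 and 0, logical values are handy for counting and for combining conditions. The sketch below uses made-up variable names and only base R operators and functions.

```r
# Comparisons produce logical values
is_bigger <- 5 > 3
print(class(is_bigger))          # "logical"

# TRUE counts as 1 and FALSE as 0 in arithmetic
answers <- c(TRUE, FALSE, TRUE, TRUE)
true_count <- sum(answers)       # 3
print(true_count)
print(as.numeric(FALSE))         # 0

# Logical operators: and (&), or (|), not (!)
both   <- TRUE & FALSE           # FALSE
either <- TRUE | FALSE           # TRUE
negated <- !TRUE                 # FALSE
print(c(both, either, negated))
```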
This concludes part one of the series of articles “Introduction to R”. In this part, I have tried to cover the nuts and bolts of the R programming language that need to be tightened by anyone who wants to dive deep into the ocean of R. Stay tuned for the next one!
Until then take care! Stay home, Stay Safe!