Introduction to R Programming: Part 3 – Data Frames, Arrays and Factors
The previous article in this series introduced data structures in the R programming language: how data structures differ from data types, how they work in R, and the first three of them, namely vectors, lists, and matrices.
You can find the previous articles of this series at
Introduction to R Programming – Part 1
Introduction to R Programming – Part 2
In this article, we continue the journey and discuss the remaining three data structures, namely data frames, arrays, and factors, in detail. We will see how each of them is created in R and look at some common operations that can be performed on them. Let’s dive in!
Data Frames in R Programming
First of all, it is worth discussing the origin of the data frame. Where did it come from? The data frame is a result of the constantly evolving nature of R in its early days. Most of the data we work with comes in a two-dimensional tabular form. For example, the employee details of an organization may consist of Emp ID, Emp Name, Address, Salary, and some other details in a tabular form, with each column being of a different data type. R did not have a structure of this kind. Yes, there were lists, but those cannot be considered tabular data, and matrices can hold only a single data type. Therefore, there was a need for a two-dimensional structure that can store columns of different data types at the same time without changing the type of any column.
In more formal terms, a data frame is a two-dimensional tabular structure (with rows and columns as dimensions) that allows you to store vectors or lists of different data types as columns without coercing them to a common type. However, one thing we always need to keep in mind while defining a data frame is the length of the columns: every column should be of the same length.
In R, a data frame is created using the built-in function “data.frame()”. Let’s see how it works.
> #Creating vectors that can be used as arguments under dataframe
> emp_ID <- c(1:4)
> emp_name <- c("John", "Prasun", "Martha", "Jenny")
> Working_since_years <- c(2, 4, 8, 1)
> salary_USD <- c(1500, 1800, 2500, 1000)
> #creating data frame
> my_emp_details <- data.frame(emp_ID, emp_name, Working_since_years, salary_USD)
> print(my_emp_details)
  emp_ID emp_name Working_since_years salary_USD
1      1     John                   2       1500
2      2   Prasun                   4       1800
3      3   Martha                   8       2500
4      4    Jenny                   1       1000
In the example above, we first created four vectors named emp_ID, emp_name, Working_since_years, and salary_USD. All of these vectors have the same length (that is the requirement of a data frame). The four vectors are then passed as arguments to the data.frame() function. Once we run this code, we get a two-dimensional tabular structure that consists of four columns and four rows.
There are certain characteristics of a data frame that we should take into consideration while creating one.
Characteristics of a Data Frame
A data frame has the following characteristics:
- It can store data of multiple data types in a two-dimensional table. Meaning, each column can be of any data type that is valid in R.
- Every column should have a name (column names must not be empty).
- Every column should have the same number of rows.
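The characteristics listed above can be checked directly in R. The following is only a minimal sketch: the data frame `df` and its columns are illustrative names that are not part of the example above, and the text returned from the error handler is our own wording, not R's actual error message.

```r
# Illustrative data frame (all names here are assumptions made for this sketch)
df <- data.frame(id = 1:3,
                 name = c("A", "B", "C"),
                 score = c(9.5, 7.2, 8.1),
                 stringsAsFactors = FALSE)

# Each column keeps its own data type
col_types <- sapply(df, class)
print(col_types)

# Every column has a non-empty name
print(names(df))

# All columns share the same number of rows
print(nrow(df))

# Columns whose lengths cannot be matched up (3 values vs. 2 values)
# are rejected with an error, which we catch here
result <- tryCatch(
  data.frame(id = 1:3, name = c("A", "B")),
  error = function(e) "error: column lengths differ"
)
print(result)
```

Wrapping the failing call in tryCatch() keeps the script running, so the rejection can be observed without stopping the session.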
Since a data frame can consist of columns of different data types, and for that matter different data structures as well, it is always a good idea to check its structure. To inspect the structure of a data frame, R provides a function named “str()”. Let us check the structure of the data frame we have just created.
> str(my_emp_details)
'data.frame': 4 obs. of  4 variables:
 $ emp_ID             : int  1 2 3 4
 $ emp_name           : Factor w/ 4 levels "Jenny","John",..: 2 4 3 1
 $ Working_since_years: num  2 4 8 1
 $ salary_USD         : num  1500 1800 2500 1000
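The function first reports the data structure of my_emp_details (that is, a data frame). It then shows how many observations (rows) and variables (columns) the data frame has, followed by the data type/structure of each variable along with its first few values. Note that emp_name is shown as a Factor because R versions before 4.0.0 converted character vectors to factors by default; from R 4.0.0 onward the same column is shown as chr unless stringsAsFactors = TRUE is passed to data.frame().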
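Besides str(), a few other base R functions report individual pieces of the same information. This is a small sketch under the assumption of a made-up data frame `df`; it is not the employee data frame from the article.

```r
# Illustrative data frame (names are assumptions made for this sketch)
df <- data.frame(id = 1:3, score = c(9.5, 7.2, 8.1))

dims   <- dim(df)            # number of rows, then number of columns
n_rows <- nrow(df)           # number of observations
n_cols <- ncol(df)           # number of variables
types  <- sapply(df, class)  # data type of each column

print(dims)
print(types)
```

These are handy when only one fact about the data frame is needed, for example inside a condition, whereas str() is meant for reading by eye.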
Extracting Data from a Data Frame
Extracting data from a data frame is as simple as extracting elements from a vector, list, or matrix. What is new here is that you can extract particular columns from a data frame through their names. The example below extracts specific columns from a data frame.
> slice <- data.frame(my_emp_details$emp_name, my_emp_details$salary_USD)
> print(slice)
  my_emp_details.emp_name my_emp_details.salary_USD
1                    John                      1500
2                  Prasun                      1800
3                  Martha                      2500
4                   Jenny                      1000
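In the above example, we have created a new data frame named “slice”, which consists of two columns of the original data frame “my_emp_details”. A column is referred to by writing its name after the name of the data frame, separated by a dollar sign. Because the columns were passed to data.frame() as expressions, the new column names are derived from those expressions (my_emp_details.emp_name and my_emp_details.salary_USD). This is nothing but slicing the original data frame into a new one.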
You can also use the common slicing method that includes indexing to get the specific rows and columns from a data frame. See the example code below:
> #Slicing all rows and first two columns from the data frame
> print(my_emp_details[ , 1:2])
  emp_ID emp_name
1      1     John
2      2   Prasun
3      3   Martha
4      4    Jenny
We can also pick specific rows and columns by passing their positions through the “combine” function c(). See the code below for a better understanding.
> #Extracting third and fourth row with second and fourth column.
> print(my_emp_details[c(3, 4), c(2, 4)])
  emp_name salary_USD
3   Martha       2500
4    Jenny       1000
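Positions are not the only way to slice a data frame. The sketch below shows selection by column name and by a logical condition. It rebuilds a reduced copy of the employee data under the name `emp` (an assumption made so the snippet runs on its own), and the 1500 threshold is arbitrary.

```r
# Reduced copy of the employee data, rebuilt so the snippet is self-contained
emp <- data.frame(emp_name = c("John", "Prasun", "Martha", "Jenny"),
                  salary_USD = c(1500, 1800, 2500, 1000),
                  stringsAsFactors = FALSE)

# Select columns by name; the result keeps the original column names
by_name <- emp[, c("emp_name", "salary_USD")]

# Extract a single column as a plain vector
salaries <- emp[["salary_USD"]]

# Keep only the rows that satisfy a condition (threshold chosen arbitrarily)
high_paid <- emp[emp$salary_USD > 1500, ]
print(high_paid)
```

Selecting by name tends to be more robust than selecting by position, because the code keeps working if the column order changes.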
Adding a New Column to the Existing Data Frame
A data frame can also be expanded after it has been created. We will see how to add a new column to an existing data frame.
> #Adding a new column under existing data frame.
> my_emp_details$dept <- c("IT", "HR", "Finance", "Operations")
> print(my_emp_details)
  emp_ID emp_name Working_since_years salary_USD       dept
1      1     John                   2       1500         IT
2      2   Prasun                   4       1800         HR
3      3   Martha                   8       2500    Finance
4      4    Jenny                   1       1000 Operations
Here, we have added a new column named “dept” to the existing data frame named “my_emp_details” with four department names to which each employee belongs.
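A new column can also be computed from an existing one, and rows can be appended in a similar spirit. The following is a sketch only: `emp`, `salary_thousands`, and `new_row` are hypothetical names introduced for illustration.

```r
# Small illustrative data frame (names are assumptions made for this sketch)
emp <- data.frame(emp_name = c("John", "Prasun"),
                  salary_USD = c(1500, 1800),
                  stringsAsFactors = FALSE)

# Add a column derived from an existing column
emp$salary_thousands <- emp$salary_USD / 1000

# Append a row with rbind(); the new row must supply every column
new_row <- data.frame(emp_name = "Martha",
                      salary_USD = 2500,
                      salary_thousands = 2.5,
                      stringsAsFactors = FALSE)
emp <- rbind(emp, new_row)
print(emp)
```

rbind() matches the columns of the two data frames by name, which is why the new row is built as a one-row data frame with the same column names.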
Arrays in R Programming
Until now, we have looked at data structures that were either one-dimensional or two-dimensional. But do we have a data structure in R that can hold data of more than two dimensions? Well, the answer is a definite yes!
In R, we have a data structure named “array” that can be used to store data with more than two dimensions. A three-dimensional array can be thought of as a stack of matrices, each with the same number of rows and columns. Meaning, an array with dimensions (2, 2, 4) is a combination of 4 matrices, each with 2 rows and 2 columns.
Syntax of an Array
Creating an array is very simple. We can use the array() function, which is built into R. Let us see its syntax.
name_of_array <- array(data, dim = c(row_size, column_size, matrices), dimnames = list(row_names, column_names, matrix_names))
Where –
- data – is the input data, in the form of one or more vectors, passed to the array function.
- row_size – specifies the number of rows in each matrix of the resultant array.
- column_size – specifies the number of columns in each matrix of the resultant array.
- matrices – specifies the number of matrices in the resultant array.
- dimnames – an optional list that can be used to set the row names, column names, and matrix names as per user preferences.
Let us see how an array can be created in R.
We will create an array with two rows, four columns, and two matrices.
> #Creating two vectors for an array
> vect1 <- c(4, 1, 2, 5)
> vect2 <- c(1, 8, 6, 3, 5, 2, 7, 9)
> #Creating an array using vect1 and vect2 as arguments
> #Array with 2 rows, 4 columns and 2 matrices
> array1 <- array(c(vect1, vect2), dim = c(2, 4, 2))
> print(array1)
, , 1
     [,1] [,2] [,3] [,4]
[1,]    4    2    1    6
[2,]    1    5    8    3
, , 2
     [,1] [,2] [,3] [,4]
[1,]    5    7    4    2
[2,]    2    9    1    5
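Here, we first created two vectors of length 4 and 8 respectively. We then combined them and used the result as the “data” argument of the array() function. Then we specified the number of rows, columns, and matrices through the “dim =” argument to get an array with two rows, four columns, and two matrices. Note that the array has 2 × 4 × 2 = 16 cells while only 12 values were supplied, so R recycles the data from the beginning to fill the last four cells. This is why the last two columns of the second matrix repeat the values 4, 1, 2, 5 of vect1.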
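The way R fills the array can be verified with a few checks. This sketch reuses the vectors from the example above; the name `values` is introduced here only for convenience.

```r
vect1 <- c(4, 1, 2, 5)
vect2 <- c(1, 8, 6, 3, 5, 2, 7, 9)
values <- c(vect1, vect2)
array1 <- array(values, dim = c(2, 4, 2))

print(length(values))  # number of values supplied
print(length(array1))  # number of cells in the array
print(dim(array1))     # rows, columns, matrices

# Cells are filled column by column, matrix by matrix.
# The last two columns of the second matrix hold the recycled values.
recycled <- as.vector(array1[, 3:4, 2])
print(recycled)
```

Comparing `recycled` with `vect1` shows that the data restarted from its first element once the supplied values ran out.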
Adding Names to Rows, Columns and Matrices of an Array
We will now see how to add row names, column names, and matrix names as labels within an array. The dimnames argument helps us assign these labels.
> vect1 <- c(4, 1, 2, 5)
> vect2 <- c(1, 8, 6, 3, 5, 2, 7, 9)
> #Creating vectors with named ranges for rows, columns and matrices of array
> col_names <- c("Column1", "Column2", "Column3", "Column4") #Column Names Vector
> row_names <- c("Row1", "Row2") #Row Names Vector
> matr_names <- c("Matrix1", "Matrix2") #Matrix Names Vector
> array1 <- array(c(vect1, vect2),
dim = c(2, 4, 2),
dimnames = list(row_names, col_names, matr_names)) #Array Creation with labels
> print(array1)
, , Matrix1
     Column1 Column2 Column3 Column4
Row1       4       2       1       6
Row2       1       5       8       3
, , Matrix2
     Column1 Column2 Column3 Column4
Row1       5       7       4       2
Row2       2       9       1       5
Accessing Elements from an Array
Accessing elements of an array works the same way as accessing rows and columns of a matrix. The only difference is that, for an array, you have three indices to specify: row_num, col_num, and matrix_num. Let’s see how we can access the elements of an array.
> #Accessing element at first row and third column of second matrix from "array1"
> print(array1[1, 3, 2])
[1] 4
> #Accessing second row from first matrix in "array1"
> print(array1[2, , 1])
Column1 Column2 Column3 Column4
      1       5       8       3
> #Accessing entire second matrix from "array1"
> print(array1[, , 2])
     Column1 Column2 Column3 Column4
Row1       5       7       4       2
Row2       2       9       1       5
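When an array has dimension names, its elements can also be reached by label instead of by position. The sketch below rebuilds a labelled array so that it runs on its own; the second matrix is labelled "Matrix2" here, which is an assumption of this sketch, and apply() is used only as one possible way to summarise each matrix.

```r
# Rebuild a labelled 2 x 4 x 2 array so the snippet is self-contained
array1 <- array(c(4, 1, 2, 5, 1, 8, 6, 3, 5, 2, 7, 9),
                dim = c(2, 4, 2),
                dimnames = list(c("Row1", "Row2"),
                                c("Column1", "Column2", "Column3", "Column4"),
                                c("Matrix1", "Matrix2")))

# Access a single element by its labels
by_name <- array1["Row1", "Column3", "Matrix2"]
print(by_name)

# Third column of every matrix: rows stay rows, matrices become columns
third_cols <- array1[, "Column3", ]
print(third_cols)

# Sum of all values in each matrix (the third dimension)
matrix_sums <- apply(array1, 3, sum)
print(matrix_sums)
```

Labels make the indexing self-documenting, which helps once an array has more than a handful of rows, columns, or matrices.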
The next data structure that comes into the picture is the factor. Let’s see how you can create a factor in R.
Factors in R Programming
Factors in R are a data structure that stores categorical data along with the set of levels (the distinct categories) the data can take. The underlying values can be either character or integer; however, the most widely used factors are the character ones. An example of factors in our day-to-day lives is the star rating we give to apps on the Play Store. We have a predefined set of levels there (star ratings of one, two, three, four, and five). If we take the entire star rating data associated with an app, every value will be at one of these five levels. Another example is data that has marital status as a column. The data can only contain responses from the predefined levels for marital status (single, married, divorced, separated, widowed), which means we have five levels for marital status.
Inside R environment, factor() is a function that allows us to create a factor. We will see how a factor can be created in R.
> #Creating a data vector with multiple values
> data <- c("Married", "Single", "Single", "Divorced", "Separated", "Widowed", "Widowed",
"Married", "Divorced")
> print(data)
[1] "Married"   "Single"    "Single"    "Divorced"  "Separated"
[6] "Widowed"   "Widowed"   "Married"   "Divorced"
> #Check if the data consist of factors
> print(is.factor(data))
[1] FALSE
> #Convert data into factors
> factor_data <- factor(data)
> print(factor_data)
[1] Married   Single    Single    Divorced  Separated Widowed
[7] Widowed   Married   Divorced
Levels: Divorced Married Separated Single Widowed
> #Check whether the given vector is a factor or not
> print(is.factor(factor_data))
[1] TRUE
First of all, we created a vector named “data”, which consists of multiple values for marital status. We then checked whether “data” is stored as a factor. is.factor() is a function in R that tells us whether the given object is a factor or not. We get FALSE as the result, since the vector named “data” is a plain character vector. Thereafter, we used the factor() function, which creates a factor from the vector. The printed factor shows the values without quotes, followed by its levels, which factor() sorts alphabetically by default. Finally, we applied is.factor() to the newly created object and got TRUE, which means the vector has been converted into a factor.
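Once a factor exists, a few base R functions reveal what it stores. The sketch below uses the same marital status values under the hypothetical name `status`. It also builds an ordered factor for the star rating example mentioned earlier; the rating values themselves are made up for illustration.

```r
status <- c("Married", "Single", "Single", "Divorced", "Separated",
            "Widowed", "Widowed", "Married", "Divorced")
factor_status <- factor(status)

# The distinct categories, sorted alphabetically by default
print(levels(factor_status))

# How many observations fall into each level
counts <- table(factor_status)
print(counts)

# Internally, each value is stored as the integer position of its level
codes <- as.integer(factor_status)
print(codes)

# An ordered factor: the levels have a meaningful order (made-up ratings)
ratings <- factor(c("three", "five", "one", "five"),
                  levels = c("one", "two", "three", "four", "five"),
                  ordered = TRUE)
is_lower <- ratings[1] < ratings[2]  # compares "three" with "five"
print(is_lower)
```

Passing the levels explicitly matters for the ordered factor, since the default alphabetical order would not match the intended ranking of the ratings.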
This is how we can create a factor in R. We have now covered all of the data structures, and the next article in this series will be about “Control Statements in R”. Until then, let’s stop here.
Please post your feedback, or any questions about the concepts discussed here, in the comments section below.
Stay safe, Stay Healthy! 🙂