Introduction
In this article, I am going to demonstrate how to use the Dplyr package in R along with a plane dataset. We will be using various functions provided with the Dplyr package to manipulate and transform the data and to create a subset of data as well. The functions that we will be using are filter(), arrange(), and select().
Loading package and dataset
We will be using a predefined plane dataset that belongs to a package named nycflights13. Therefore we need to load the package first as follows.
# Loading the nycflights13 library
library(nycflights13)
Now we need to load the dataset planes as we will be using it to transform and manipulate the data.
planes
The above code will generate the following output.
# Displaying the 'planes' tibble with formatted columns
planes
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
2 N102UW 1998 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
3 N103US 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
4 N104UW 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
6 N105UW 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
7 N107US 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
8 N108UW 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
9 N109UW 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
10 N110UW 1999 Fixed wing multi engine AIRBUS INDUS... A320-214 2 182 NA Turbo-fan
# ... with 3,312 more rows
Now we need to load the library named Dplyr to use the functions of the package Dplyr. We can use the below syntax to load the Dplyr library.
# Loading the dplyr library
library(dplyr)
The above code will generate the following output.
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
Filter() function to filter data on the basis of variable names
The filter function is used to select and display observations according to the values of the arguments included inside the function. The filter function creates subsets of observations. Inside the function, the data frame name is the first argument and the second argument is the name of the variable of the dataset whose values we will be using to filter the data frame.
Let us discuss this function with the help of the example below.
# Filtering 'planes' tibble for year 1999
filter(planes, year == 1999)
# A tibble: 206 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
2 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
3 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
4 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
5 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
6 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
7 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
8 N111US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
9 N112US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
10 N113UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
# ... with 196 more rows
In the argument part of the filter function, we have mentioned year == 1999, therefore a new dataset containing all the observations of year 1999 will be displayed.
# Filtering 'planes' tibble for year 1999 and seats == 55
filter(planes, year == 1999, seats == 55)
# A tibble: 23 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N12967 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
2 N13964 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
3 N13965 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
4 N13968 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
5 N13969 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
6 N13970 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
7 N13975 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
8 N13978 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
9 N13979 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
10 N14972 1999 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
# ... with 13 more rows
In the argument part of the filter function, we have mentioned the value of the year as 1999 and the value of seats as 55, therefore filtering operation will be executed by Dplyr library and a new dataset containing all the observations of 1999 and containing 55 seats will be displayed.
Using assignment operator
The result of the new subset of the dataset that has been created can also be saved in another variable using the assignment operator.
# Filtering 'planes' tibble for year 2000 and seats == 55, assigning result to df
df <- filter(planes, year == 2000, seats == 55)
If we want to save the subset in a variable and print the result at the same time then we can include the entire assignment syntax inside parentheses as follows.
# Filtering 'planes' tibble for year 2000 and seats == 55, storing result in 'df'
(df <- filter(planes, year == 2000, seats == 55))
The above code will generate the following output.
# Filtering 'planes' tibble for year 2000 and seats == 55, storing result in df
(df <- filter(planes, year == 2000, seats == 55))
# A tibble: 30 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N12921 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
2 N12922 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
3 N12924 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
4 N12996 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
5 N13988 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
6 N13989 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
7 N13992 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
8 N13994 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
9 N13995 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
10 N13997 2000 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
# ... with 20 more rows
The arrange function to arrange data in different orders
arrange() function is used to select and display rows either in ascending or descending order according to the order mentioned inside the arrange() function as an argument. It takes the names of the variables as arguments and displays data in two different orders.
# Sorting 'planes' tibble by year, seats, and speed
arrange(planes, year, seats, speed)
The above code will generate the following output.
# Arranging 'planes' tibble by year, seats, and speed
arrange(planes, year, seats, speed)
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N381AA 1956 Fixed wing multi engine DOUGLAS DC-7BF 4 102 232 Reciprocating
2 N201AA 1959 Fixed wing single engine CESSNA 150 1 2 90 Reciprocating
3 N567AA 1959 Fixed wing single engine DEHAVILLAND OTTER DHC-3 1 16 95 Reciprocating
4 N378AA 1963 Fixed wing single engine CESSNA 172E 1 4 105 Reciprocating
5 N575AA 1963 Fixed wing single engine CESSNA 210-5(205) 1 6 NA Reciprocating
6 N14629 1965 Fixed wing multi engine BOEING 737-524 2 149 NA Turbo-fan
7 N615AA 1967 Fixed wing multi engine BEECH 65-A90 2 9 202 Turbo-prop
8 N425AA 1968 Fixed wing single engine PIPER PA-28-180 1 4 107 Reciprocating
9 N383AA 1972 Fixed wing multi engine BEECH E-90 2 10 NA Turbo-prop
10 N364AA 1973 Fixed wing multi engine CESSNA 310Q 2 6 167 Reciprocating
# ... with 3,312 more rows
We can also use the desc() function as an argument inside the arrange function to display the rows and columns of the new data frame reordered in descending order.
# Arranging 'planes' tibble by descending year
arrange(planes, desc(year))
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N150UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
2 N151UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
3 N152UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
4 N153UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
5 N154UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
6 N155UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
7 N156UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
8 N157UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
9 N198UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
10 N199UW 2013 Fixed wing multi engine AIRBUS A321-211 2 199 NA Turbo-fan
# ... with 3,312 more rows
Select() function to select columns of a table
The select function is used to display a subset of a dataset containing only those columns that are mentioned inside the select function as arguments. The select() function also displays data based on the conditions mentioned inside the select() function. These conditions are applied to the variables of the dataset loaded at the beginning.
Displaying columns by including column names as arguments. To select columns by names we can use the below syntax.
# Selecting specific columns 'year', 'manufacturer', and 'seats' from 'planes' tibble
select(planes, year, manufacturer, seats)
The above code will generate the following output.
# Selecting specific columns 'year', 'manufacturer', and 'seats' from 'planes' tibble
select(planes, year, manufacturer, seats)
# A tibble: 3,322 x 3
year manufacturer seats
<int> <chr> <int>
1 2004 EMBRAER 55
2 1998 AIRBUS INDUSTRIE 182
3 1999 AIRBUS INDUSTRIE 182
4 1999 AIRBUS INDUSTRIE 182
5 2002 EMBRAER 55
6 1999 AIRBUS INDUSTRIE 182
7 1999 AIRBUS INDUSTRIE 182
8 1999 AIRBUS INDUSTRIE 182
9 1999 AIRBUS INDUSTRIE 182
10 1999 AIRBUS INDUSTRIE 182
# ... with 3,312 more rows
To display all the columns between type and engines, we can use the below syntax.
# Selecting columns 'type' through 'engines' from 'planes' tibble
select(planes, type:engines)
The above code will generate the following output.
# Selecting columns from 'type' to 'engines' from 'planes' tibble
select(planes, type:engines)
# A tibble: 3,322 x 4
type manufacturer model engines
<chr> <chr> <chr> <int>
1 Fixed wing multi eng… EMBRAER EMB-145XR 2
2 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
3 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
4 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
5 Fixed wing multi eng… EMBRAER EMB-145LR 2
6 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
7 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
8 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
9 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
10 Fixed wing multi eng… AIRBUS INDUST… A320-214 2
# ... with 3,312 more rows
To display all the selected columns excluding those from type to engines, we can use the below syntax.
# Selecting all columns except from 'type' to 'engines' in 'planes' tibble
select(planes, -(type:engines))
The above code will generate the following output.
# Selecting all columns except 'type' to 'engines' from 'planes' tibble
select(planes, -(type:engines))
# A tibble: 3,322 x 5
tailnum year seats speed engine
<chr> <int> <int> <int> <chr>
1 N10156 2004 55 NA Turbo-fan
2 N102UW 1998 182 NA Turbo-fan
3 N103US 1999 182 NA Turbo-fan
4 N104UW 1999 182 NA Turbo-fan
5 N10575 2002 55 NA Turbo-fan
6 N105UW 1999 182 NA Turbo-fan
7 N107US 1999 182 NA Turbo-fan
8 N108UW 1999 182 NA Turbo-fan
9 N109UW 1999 182 NA Turbo-fan
10 N110UW 1999 182 NA Turbo-fan
# ... with 3,312 more rows
Summary
In this article, I demonstrated how to use the Dplyr package in R along with the plane dataset. We have used various functions provided by the Dplyr package to manipulate and transform the data and to create a subset of data as well. Various functions such as filter(), arrange(), and select() are used. Proper coding snippets and outputs are also provided.