How To Use Dplyr Package To Manipulate Data In R

Introduction

In this article, I am going to demonstrate how to use the Dplyr package in R along with a plane dataset. We will be using various functions provided with the Dplyr package to manipulate and transform the data and to create a subset of data as well. The functions that we will be using are filter(), arrange(), and select().

Loading package and dataset

We will be using a predefined plane dataset that belongs to a package named nycflights13. Therefore we need to load the package first as follows.

# Loading the nycflights13 library
library(nycflights13)

Now we need to load the dataset planes as we will be using it to transform and manipulate the data.

planes

The above code will generate the following output.

# Displaying the 'planes' tibble with formatted columns
planes
# A tibble: 3,322 x 9
   tailnum year  type                    manufacturer   model      engines seats speed engine    
   <chr>   <int> <chr>                   <chr>          <chr>       <int>  <int> <int> <chr>     
 1 N10156   2004 Fixed wing multi engine EMBRAER        EMB-145XR       2     55    NA Turbo-fan 
 2 N102UW   1998 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 3 N103US   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 4 N104UW   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 5 N10575   2002 Fixed wing multi engine EMBRAER        EMB-145LR       2     55    NA Turbo-fan 
 6 N105UW   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 7 N107US   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 8 N108UW   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
 9 N109UW   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
10 N110UW   1999 Fixed wing multi engine AIRBUS INDUS... A320-214        2    182    NA Turbo-fan 
# ... with 3,312 more rows

Now we need to load the library named Dplyr to use the functions of the package Dplyr. We can use the below syntax to load the Dplyr library.

# Loading the dplyr library
library(dplyr)

The above code will generate the following output.

library(dplyr)

# Attaching package: ‘dplyr’

# The following objects are masked from ‘package:stats’:

#     filter, lag

# The following objects are masked from ‘package:base’:

#     intersect, setdiff, setequal, union

Filter() function to filter data on the basis of variable names

The filter function is used to select and display observations according to the values of the arguments included inside the function. The filter function creates subsets of observations. Inside the function, the data frame name is the first argument and the second argument is the name of the variable of the dataset whose values we will be using to filter the data frame.

Let us discuss this function with the help of the example below.

# Filtering 'planes' tibble for year 1999
filter(planes, year == 1999)
# A tibble: 206 x 9
   tailnum year  type                    manufacturer    model       engines seats speed engine    
   <chr>   <int> <chr>                   <chr>           <chr>        <int>  <int> <int> <chr>     
 1 N103US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 2 N104UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 3 N105UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 4 N107US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 5 N108UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 6 N109UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 7 N110UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 8 N111US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
 9 N112US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
10 N113UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214         2    182    NA Turbo-fan 
# ... with 196 more rows

In the argument part of the filter function, we have mentioned year == 1999, therefore a new dataset containing all the observations of year 1999 will be displayed.

# Filtering 'planes' tibble for year 1999 and seats == 55
filter(planes, year == 1999, seats == 55)
# A tibble: 23 x 9
   tailnum year  type                    manufacturer model     engines seats speed engine    
   <chr>   <int> <chr>                   <chr>        <chr>       <int>  <int> <int> <chr>     
 1 N12967   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 2 N13964   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 3 N13965   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 4 N13968   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 5 N13969   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 6 N13970   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 7 N13975   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 8 N13978   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 9 N13979   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
10 N14972   1999 Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
# ... with 13 more rows

In the argument part of the filter function, we have mentioned the value of the year as 1999 and the value of seats as 55, therefore filtering operation will be executed by Dplyr library and a new dataset containing all the observations of 1999 and containing 55 seats will be displayed.

Using assignment operator

The result of the new subset of the dataset that has been created can also be saved in another variable using the assignment operator.

# Filtering 'planes' tibble for year 2000 and seats == 55, assigning result to df
df <- filter(planes, year == 2000, seats == 55)

If we want to save the subset in a variable and print the result at the same time then we can include the entire assignment syntax inside parentheses as follows.

# Filtering 'planes' tibble for year 2000 and seats == 55, storing result in 'df'
(df <- filter(planes, year == 2000, seats == 55))

The above code will generate the following output.

# Filtering 'planes' tibble for year 2000 and seats == 55, storing result in df
(df <- filter(planes, year == 2000, seats == 55))
# A tibble: 30 x 9
   tailnum year  type                    manufacturer model     engines seats speed engine    
   <chr>   <int> <chr>                   <chr>        <chr>       <int>  <int> <int> <chr>     
 1 N12921  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 2 N12922  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 3 N12924  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 4 N12996  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 5 N13988  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 6 N13989  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 7 N13992  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 8 N13994  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
 9 N13995  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
10 N13997  2000  Fixed wing multi engine EMBRAER      EMB-145LR       2     55    NA Turbo-fan 
# ... with 20 more rows

The arrange function to arrange data in different orders

arrange() function is used to select and display rows either in ascending or descending order according to the order mentioned inside the arrange() function as an argument. It takes the names of the variables as arguments and displays data in two different orders.

# Sorting 'planes' tibble by year, seats, and speed
arrange(planes, year, seats, speed)

The above code will generate the following output.

# Arranging 'planes' tibble by year, seats, and speed
arrange(planes, year, seats, speed)
# A tibble: 3,322 x 9
   tailnum year  type                     manufacturer model          engines seats speed engine        
   <chr>   <int> <chr>                    <chr>        <chr>            <int>  <int> <int> <chr>         
 1 N381AA   1956  Fixed wing multi engine  DOUGLAS      DC-7BF               4    102   232 Reciprocating 
 2 N201AA   1959  Fixed wing single engine CESSNA       150                  1      2    90 Reciprocating 
 3 N567AA   1959  Fixed wing single engine DEHAVILLAND  OTTER DHC-3          1     16    95 Reciprocating 
 4 N378AA   1963  Fixed wing single engine CESSNA       172E                 1      4   105 Reciprocating 
 5 N575AA   1963  Fixed wing single engine CESSNA       210-5(205)           1      6    NA Reciprocating 
 6 N14629   1965  Fixed wing multi engine  BOEING       737-524              2    149    NA Turbo-fan     
 7 N615AA   1967  Fixed wing multi engine  BEECH        65-A90               2      9   202 Turbo-prop    
 8 N425AA   1968  Fixed wing single engine PIPER        PA-28-180            1      4   107 Reciprocating 
 9 N383AA   1972  Fixed wing multi engine  BEECH        E-90                 2     10    NA Turbo-prop    
10 N364AA   1973  Fixed wing multi engine  CESSNA       310Q                 2      6   167 Reciprocating 
# ... with 3,312 more rows

We can also use the desc() function as an argument inside the arrange function to display the rows and columns of the new data frame reordered in descending order.

# Arranging 'planes' tibble by descending year
arrange(planes, desc(year))
# A tibble: 3,322 x 9
   tailnum year  type                    manufacturer model     engines seats speed engine    
   <chr>   <int> <chr>                   <chr>        <chr>       <int>  <int> <int> <chr>     
 1 N150UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 2 N151UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 3 N152UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 4 N153UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 5 N154UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 6 N155UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 7 N156UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 8 N157UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
 9 N198UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
10 N199UW  2013  Fixed wing multi engine  AIRBUS       A321-211       2    199    NA Turbo-fan 
# ... with 3,312 more rows

Select() function to select columns of a table

The select function is used to display a subset of a dataset containing only those columns that are mentioned inside the select function as arguments. The select() function also displays data based on the conditions mentioned inside the select() function. These conditions are applied to the variables of the dataset loaded at the beginning.

Displaying columns by including column names as arguments. To select columns by names we can use the below syntax.

# Selecting specific columns 'year', 'manufacturer', and 'seats' from 'planes' tibble
select(planes, year, manufacturer, seats)

The above code will generate the following output.

# Selecting specific columns 'year', 'manufacturer', and 'seats' from 'planes' tibble
select(planes, year, manufacturer, seats)
# A tibble: 3,322 x 3
    year manufacturer     seats
   <int> <chr>            <int>
 1  2004 EMBRAER             55
 2  1998 AIRBUS INDUSTRIE   182
 3  1999 AIRBUS INDUSTRIE   182
 4  1999 AIRBUS INDUSTRIE   182
 5  2002 EMBRAER             55
 6  1999 AIRBUS INDUSTRIE   182
 7  1999 AIRBUS INDUSTRIE   182
 8  1999 AIRBUS INDUSTRIE   182
 9  1999 AIRBUS INDUSTRIE   182
10  1999 AIRBUS INDUSTRIE   182
# ... with 3,312 more rows

To display all the columns between type and engines, we can use the below syntax.

# Selecting columns 'type' through 'engines' from 'planes' tibble
select(planes, type:engines)

The above code will generate the following output.

# Selecting columns from 'type' to 'engines' from 'planes' tibble
select(planes, type:engines)
# A tibble: 3,322 x 4
   type                  manufacturer   model       engines
   <chr>                 <chr>          <chr>         <int>
 1 Fixed wing multi eng… EMBRAER        EMB-145XR         2
 2 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 3 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 4 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 5 Fixed wing multi eng… EMBRAER        EMB-145LR         2
 6 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 7 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 8 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
 9 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
10 Fixed wing multi eng… AIRBUS INDUST… A320-214          2
# ... with 3,312 more rows

To display all the selected columns excluding those from type to engines, we can use the below syntax.

# Selecting all columns except from 'type' to 'engines' in 'planes' tibble
select(planes, -(type:engines))

The above code will generate the following output.

# Selecting all columns except 'type' to 'engines' from 'planes' tibble
select(planes, -(type:engines))
# A tibble: 3,322 x 5
   tailnum  year seats speed engine    
   <chr>   <int> <int> <int> <chr>     
 1 N10156   2004    55    NA Turbo-fan 
 2 N102UW   1998   182    NA Turbo-fan 
 3 N103US   1999   182    NA Turbo-fan 
 4 N104UW   1999   182    NA Turbo-fan 
 5 N10575   2002    55    NA Turbo-fan 
 6 N105UW   1999   182    NA Turbo-fan 
 7 N107US   1999   182    NA Turbo-fan 
 8 N108UW   1999   182    NA Turbo-fan 
 9 N109UW   1999   182    NA Turbo-fan 
10 N110UW   1999   182    NA Turbo-fan 
# ... with 3,312 more rows

Summary

In this article, I demonstrated how to use the Dplyr package in R along with the plane dataset. We have used various functions provided by the Dplyr package to manipulate and transform the data and to create a subset of data as well. Various functions such as filter(), arrange(), and select() are used. Proper coding snippets and outputs are also provided.


Similar Articles