Introduction
In this article, I demonstrate how to create subsets of data using logical values, so that relevant data can be extracted from a dataset before building a machine learning model. A logical condition, combined with the dollar operator and the single square brackets operator, keeps only those observations for which the condition evaluates to TRUE and drops those for which it evaluates to FALSE.
Extracting data from datasets, or creating subsets of data, is part of data pre-processing in R. It yields clean, relevant data on which a machine learning model can make more accurate predictions.
R provides several objects, such as data frames, vectors, arrays, and lists, that can hold a dataset or a subset of it, and there are different methods for creating subsets of each.
Pre-processing is one of the most important steps of data analysis in R, and several operators can be used to create a subset of a dataset.
Different types of operators for creating subsets of data
There are three kinds of operators that can be used to create subsets: the dollar operator, the double square brackets operator, and the single square brackets operator.
Dollar operator
The dollar operator selects a single variable of a dataset by name. Writing the dataset name, the dollar sign, and the variable name returns that variable alone, and when the dataset is a data frame the result is a vector.
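As a small illustration of this behaviour (a sketch that assumes only the quakes data frame shipped with base R; the variable name mag_values is chosen here for illustration), the dollar operator turns one column of a data frame into a plain vector:

```r
# quakes is a built-in data frame (datasets package, attached by default)
mag_values <- quakes$mag             # select the single variable mag by name

is.data.frame(quakes)                # TRUE: the dataset itself is a data frame
is.data.frame(mag_values)            # FALSE: the result is no longer a data frame
is.numeric(mag_values)               # TRUE: it is a numeric vector
length(mag_values) == nrow(quakes)   # TRUE: one element per observation
```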
The following examples show how to combine the dollar operator with logical values to create subsets of a dataset. We start by creating a smaller subset of the quakes dataset:
- > data = quakes[-(30:990),]
- > data
- lat long depth mag stations
- 1 -20.42 181.62 562 4.8 41
- 2 -20.62 181.03 650 4.2 15
- 3 -26.00 184.10 42 5.4 43
- 4 -17.97 181.66 626 4.1 19
- 5 -20.42 181.96 649 4.0 11
- 6 -19.68 184.31 195 4.0 12
- 7 -11.70 166.10 82 4.8 43
- 8 -28.11 181.93 194 4.4 15
- 9 -28.74 181.74 211 4.7 35
- 10 -17.47 179.59 622 4.3 19
- 11 -21.44 180.69 583 4.4 13
- 12 -12.26 167.00 249 4.6 16
- 13 -18.54 182.11 554 4.4 19
- 14 -21.00 181.66 600 4.4 10
- 15 -20.70 169.92 139 6.1 94
- 16 -15.94 184.95 306 4.3 11
- 17 -13.64 165.96 50 6.0 83
- 18 -17.83 181.50 590 4.5 21
- 19 -23.50 179.78 570 4.4 13
- 20 -22.63 180.31 598 4.4 18
- 21 -20.84 181.16 576 4.5 17
- 22 -10.98 166.32 211 4.2 12
- 23 -23.30 180.16 512 4.4 18
- 24 -30.20 182.00 125 4.7 22
- 25 -19.66 180.28 431 5.4 57
- 26 -17.94 181.49 537 4.0 15
- 27 -14.72 167.51 155 4.6 18
- 28 -16.46 180.79 498 5.2 79
- 29 -20.97 181.47 582 4.5 25
- 991 -20.73 181.42 575 4.3 18
- 992 -15.45 181.42 409 4.3 27
- 993 -20.05 183.86 243 4.9 65
- 994 -17.95 181.37 642 4.0 17
- 995 -17.70 188.10 45 4.2 10
- 996 -25.93 179.54 470 4.4 22
- 997 -12.28 167.06 248 4.7 35
- 998 -20.13 184.20 244 4.5 34
- 999 -17.40 187.80 40 4.5 14
- 1000 -21.59 170.56 165 6.0 119
- >
As the output shows, a subset of the quakes dataset has been created. It contains all five variables but only those observations whose row numbers are not listed inside the parentheses preceded by the negative sign, that is, every row except rows 30 to 990.
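The effect of the negative row index can be checked with a few lines. This is a minimal sketch using the same built-in quakes data; the checks are added here for illustration and are not part of the original session:

```r
data <- quakes[-(30:990), ]    # drop rows 30 to 990, keep every column

nrow(data)                     # 39: rows 1-29 plus rows 991-1000
ncol(data) == ncol(quakes)     # TRUE: all variables are retained
rownames(data)[c(29, 30)]      # "29" "991": original row labels are kept
```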
- > data = quakes[-(40:980),-(2:4)]
- > data
- lat stations
- 1 -20.42 41
- 2 -20.62 15
- 3 -26.00 43
- 4 -17.97 19
- 5 -20.42 11
- 6 -19.68 12
- 7 -11.70 43
- 8 -28.11 15
- 9 -28.74 35
- 10 -17.47 19
- 11 -21.44 13
- 12 -12.26 16
- 13 -18.54 19
- 14 -21.00 10
- 15 -20.70 94
- 16 -15.94 11
- 17 -13.64 83
- 18 -17.83 21
- 19 -23.50 13
- 20 -22.63 18
- 21 -20.84 17
- 22 -10.98 12
- 23 -23.30 18
- 24 -30.20 22
- 25 -19.66 57
- 26 -17.94 15
- 27 -14.72 18
- 28 -16.46 79
- 29 -20.97 25
- 30 -19.84 17
- 31 -22.58 21
- 32 -16.32 30
- 33 -15.55 42
- 34 -23.55 10
- 35 -16.30 10
- 36 -25.82 13
- 37 -18.73 17
- 38 -17.64 17
- 39 -17.66 17
- 981 -20.82 67
- 982 -22.95 21
- 983 -28.22 49
- 984 -27.99 22
- 985 -15.54 17
- 986 -12.37 16
- 987 -22.33 51
- 988 -22.70 27
- 989 -17.86 12
- 990 -16.00 33
- 991 -20.73 18
- 992 -15.45 27
- 993 -20.05 65
- 994 -17.95 17
- 995 -17.70 10
- 996 -25.93 22
- 997 -12.28 35
- 998 -20.13 34
- 999 -17.40 14
- 1000 -21.59 119
- >
As the output shows, this subset excludes both the observations and the variables listed inside the parentheses preceded by negative signs: rows 40 to 980 and columns 2 to 4 (long, depth, and mag) are dropped, leaving the lat and stations variables.
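A negative index says what to leave out; the same subset can also be written by listing what to keep. The sketch below assumes the built-in quakes data, and the name same_subset is hypothetical:

```r
data <- quakes[-(40:980), -(2:4)]   # drop rows 40-980 and columns 2-4

dim(data)                           # 59 2
names(data)                         # "lat" "stations"

# Equivalent selection written with positive indices and column names
same_subset <- quakes[c(1:39, 981:1000), c("lat", "stations")]
isTRUE(all.equal(data, same_subset))  # TRUE
```

Negative indices are usually shorter when only a few rows or columns have to go, while positive indices are clearer when only a few are wanted.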
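Now we will use the dollar operator with a logical value on the lat variable. The outputs in the rest of this section each draw on 39 observations, so they correspond to the first subset, data = quakes[-(30:990),], rather than the second one.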
- > ds = data$lat[data$lat<20]
- > ds
- [1] -20.42 -20.62 -26.00 -17.97 -20.42 -19.68 -11.70 -28.11 -28.74 -17.47
- [11] -21.44 -12.26 -18.54 -21.00 -20.70 -15.94 -13.64 -17.83 -23.50 -22.63
- [21] -20.84 -10.98 -23.30 -30.20 -19.66 -17.94 -14.72 -16.46 -20.97 -20.73
- [31] -15.45 -20.05 -17.95 -17.70 -25.93 -12.28 -20.13 -17.40 -21.59
- >
As the output shows, combining the dollar operator with a logical condition creates a subset of the lat variable, which is stored in a variable named ds. The condition data$lat<20 is evaluated for every observation, and only the observations for which it returns TRUE are kept. Here all 39 latitudes are negative, so every observation satisfies the condition and none is excluded.
- > df = data$stations[data$stations<40]
- > df
- [1] 15 19 11 12 15 35 19 13 16 19 10 11 21 13 18 17 12 18 22 15 18 25 18 27 17
- [26] 10 22 35 34 14
- >
The same pattern applied to the stations variable stores its result in a variable named df. Only the 30 observations reported by fewer than 40 stations return TRUE and are kept; the remaining observations return FALSE and are excluded.
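It can help to look at the logical vector itself before using it as an index. The following sketch assumes the 39-row subset created earlier; the intermediate name keep is introduced here for illustration:

```r
data <- quakes[-(30:990), ]

keep <- data$stations < 40     # logical vector: one TRUE/FALSE per observation
head(keep, 3)                  # FALSE TRUE FALSE (stations are 41, 15, 43)

df <- data$stations[keep]      # same result as data$stations[data$stations<40]
length(df) == sum(keep)        # TRUE: one kept value for every TRUE in keep
```

Storing the condition in its own variable makes it easy to inspect or reuse the same filter on other columns.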
- > dn = data$depth[data$depth<600]
- > dn
- [1] 562 42 195 82 194 211 583 249 554 139 306 50 590 570 598 576 211 512 125
- [20] 431 537 155 498 582 575 409 243 45 470 248 244 40 165
- >
Here the subset contains the depth variable and is stored in a variable named dn. The 33 observations with a depth below 600 return TRUE and are kept, and the observations with a depth of 600 or more are excluded.
- > da = data$mag[data$mag<10]
- > da
- [1] 4.8 4.2 5.4 4.1 4.0 4.0 4.8 4.4 4.7 4.3 4.4 4.6 4.4 4.4 6.1 4.3 6.0 4.5 4.4
- [20] 4.4 4.5 4.2 4.4 4.7 5.4 4.0 4.6 5.2 4.5 4.3 4.3 4.9 4.0 4.2 4.4 4.7 4.5 4.5
- [39] 6.0
- >
For the mag variable, the result is stored in a variable named da. Every magnitude in the data is below 10, so the condition returns TRUE for all 39 observations and the subset contains the whole variable.
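Before relying on a condition, it is worth checking whether it filters anything at all. A minimal sketch, again assuming the 39-row subset of quakes and using the base functions all(), sum(), and mean():

```r
data <- quakes[-(30:990), ]

all(data$mag < 10)        # TRUE: the condition keeps every observation
all(data$lat < 20)        # TRUE: likewise, nothing is filtered out
all(data$depth < 600)     # FALSE: some observations fail this condition
sum(data$depth < 600)     # 33: number of observations that are kept
mean(data$depth < 600)    # proportion of observations that are kept
```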
Double square brackets operator
The double square brackets operator creates a subset containing either all observations of a single variable of a dataset or just a single observation of a particular variable. The variable can be given either by its name or by its index position. The operator can be used with a data frame.
- > data[['long']]
- [1] 181.62 181.03 184.10 181.66 181.96 184.31 166.10 181.93 181.74 179.59 180.69 167.00 182.11 181.66 169.92 184.95 165.96 181.50 179.78 180.31 181.16 166.32 180.16
- [24] 182.00 180.28 181.49 167.51 180.79 181.47 182.37 179.24 166.74 185.05 180.80 186.00 179.33 169.23 181.28 181.40 169.33 176.78 186.10 179.82 186.04 169.41 182.30
- [47] 181.70 166.32 180.08 185.25
As the output shows, the snippet creates a subset containing the single variable long. The argument inside the double square brackets is the variable name.
- > data[[3]]
- [1] 562 650 42 626 649 195 82 194 211 622 583 249 554 600 139 306 50 590 570 598 576 211 512 125 431 537 155 498 582 328 553 50 292 349 48 600 206 574 585 230
- [41] 263 96 511 94 246 56 329 70 493 129
As the output shows, the snippet creates a subset containing the single variable depth. The argument inside the double square brackets is the index position of the depth variable.
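Selecting by name and selecting by position return the same vector. The sketch below checks this on the full built-in quakes data; the names by_name and by_index are hypothetical:

```r
by_name  <- quakes[["depth"]]      # select the variable by its name
by_index <- quakes[[3]]            # select the variable by its position

which(names(quakes) == "depth")    # 3: depth is the third column
identical(by_name, by_index)       # TRUE
identical(by_name, quakes$depth)   # TRUE: same vector as the dollar operator
```

Selecting by name is generally the safer choice, because it keeps working if the column order of the dataset changes.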
The operator can also return a single observation of the depth variable when both the row and the column index positions of that observation are given inside the double square brackets.
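Since no snippet accompanies the single-observation case, here is one possible sketch of it. It assumes the full quakes data frame, whose first depth value is 562 as shown in the earlier listing:

```r
quakes[[1, 3]]          # 562: row 1, column 3 (depth)
quakes[[1, "depth"]]    # 562: the column can also be given by name
quakes[["depth"]][1]    # 562: select the column, then its first element
```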
Single square brackets operator
The single square brackets operator creates a subset containing all observations of one or more variables of a dataset. The following examples use the single square brackets operator together with logical values and the dollar sign to create subsets of a dataset:
- > data = quakes
- > data
- > da = data$depth[data$depth<100]
- > da
- [1] 42 82 50 50 48 96 94 56 70 46 84 40 96 75 69 50 72 42 42 46 64 82 81 49 94
- [26] 63 53 42 97 48 56 69 93 42 59 40 99 67 45 93 90 65 71 57 74 44 48 46 97 65
- [51] 82 67 55 74 49 93 83 61 42 56 68 69 45 43 65 80 51 68 69 61 69 51 55 54 59
- [76] 56 65 60 40 48 56 44 52 41 40 99 66 47 70 57 80 82 90 45 45 95 65 54 47 94
- [101] 80 54 57 49 62 63 51 45 63 66 58 70 50 58 69 70 41 51 64 45 50 44 68 47 40
- [126] 85 98 58 89 49 40 42 76 63 93 64 83 40 62 75 44 63 40 70 41 82 50 70 74 89
- [151] 53 68 52 66 51 67 64 47 49 75 60 75 56 48 53 85 57 79 82 93 47 98 61 83 55
- [176] 86 78 45 50 57 66 57 89 85 50 75 46 50 80 86 83 70 74 40 87 63 47 71 42 97
- [201] 56 43 93 66 70 54 82 43 77 68 71 68 99 40 62 94 56 49 42 69 48 47 76 61 90
- [226] 57 69 51 44 51 63 87 61 60 63 82 41 40 60 43 54 68 42 43 42 75 71 60 69 45
- [251] 40
- >
As the output shows, the less-than operator returns TRUE for every observation with a depth below 100, and only those observations are kept. The result is stored in a variable named da.
- > ds = data$long[data$long<170]
- > ds
- [1] 166.10 167.00 169.92 165.96 166.32 167.51 166.74 169.23 169.33 169.41
- [11] 166.32 166.22 166.20 167.06 167.53 167.06 169.71 166.54 169.49 167.40
- [21] 169.48 166.97 167.89 168.98 168.02 169.46 167.10 167.62 165.80 167.68
- [31] 166.07 169.84 166.24 167.16 169.42 169.31 169.09 166.66 166.53 166.00
- [41] 169.50 166.26 167.24 169.33 169.01 167.24 168.80 166.20 169.32 169.28
- [51] 169.58 169.63 169.24 167.10 167.32 166.36 165.77 166.24 166.60 166.29
- [61] 166.47 169.21 167.95 167.14 167.33 165.99 166.14 167.51 169.14 167.26
- [71] 167.26 169.15 169.48 166.37 168.52 167.70 167.32 167.50 166.06 169.04
- [81] 166.87 165.98 165.96 165.76 166.02 167.38 167.18 167.01 167.01 166.83
- [91] 166.94 167.25 166.69 167.34 167.42 166.90 166.85 166.80 166.91 167.54
- [101] 166.18 168.71 166.62 166.49 167.26 167.16 166.36 168.75 167.15 166.28
- [111] 169.76 166.78 168.98 168.69 165.67 167.39 167.91 166.07 166.10 167.10
- [121] 169.37 169.10 167.32 167.18 167.91 168.08 169.71 167.24 169.66 167.03
- [131] 167.43 166.75 167.41 166.55 165.80 166.64 169.46 169.52 167.10 168.93
- [141] 166.90 168.63 169.44 169.90 166.56 167.23 167.24 166.66 169.63 167.02
- [151] 167.05 167.01 166.20 166.30 169.50 167.11 166.53 169.53 165.97 169.75
- [161] 167.95 167.32 166.01 167.44 166.72 166.98 169.05 166.93 167.06
- >
As the output shows, the less-than operator returns TRUE for every observation with a longitude below 170, and only those observations are kept. The result is stored in a variable named ds.
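The examples above filter one variable at a time. Placing the same logical condition in the row position of the single square brackets, followed by a comma, keeps whole observations instead. A minimal sketch on the built-in quakes data, with the name shallow chosen for illustration:

```r
# Rows go before the comma, columns after it; an empty column slot keeps all
shallow <- quakes[quakes$depth < 100, ]

class(shallow)                            # "data.frame"
ncol(shallow)                             # 5: every variable is retained
nrow(shallow) == sum(quakes$depth < 100)  # TRUE: one row per TRUE value
all(shallow$depth < 100)                  # TRUE
```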
The difference between the double square brackets operator and the single square brackets operator lies in how many variables they index. The [[ operator creates a subset of a single variable and its observations, whereas the [ operator can create a subset of multiple variables, and the type of that subset is the same as the type of the dataset. For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
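The difference in the type of the result can be seen directly with class(). This sketch uses the built-in quakes data and a small list, lst, made up here for illustration:

```r
class(quakes["mag"])               # "data.frame": [ keeps the dataset's type
class(quakes[c("lat", "long")])    # "data.frame": several variables at once
class(quakes[["mag"]])             # "numeric": [[ extracts the variable itself

lst <- list(a = 1:3, b = "x")      # hypothetical list for illustration
class(lst["a"])                    # "list": a list containing element a
class(lst[["a"]])                  # "integer": element a itself
```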
The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list or other recursive object, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
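The recursive behaviour described above can be sketched with a small nested list. The list nested and its element names are hypothetical and exist only for this example:

```r
nested <- list(outer = list(inner = c(10, 20, 30)))

# Index one level at a time
step_by_step <- nested[["outer"]][["inner"]]

# Index with a vector: each element is applied in turn
recursive <- nested[[c("outer", "inner")]]

identical(step_by_step, recursive)        # TRUE
identical(nested[[c(1, 1)]], recursive)   # TRUE: integer indices work too
```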
Next we will use the operators described above to create subsets using logical values, working with datasets that contain all of their variables and observations.
Creating subsets using logical values
The single square brackets operator can create a subset containing more than one variable. To do so, the required variables are named inside the single square brackets.
A subset using logical values is created by placing a logical condition inside the single square brackets, where the condition refers to a variable through the dataset name and the dollar operator. Such a subset contains only those observations for which the condition returns TRUE. Observations for which the condition returns FALSE are excluded from the resulting subset.
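Both ideas can be combined in one expression: a logical condition selects the observations and a vector of names selects the variables. A minimal sketch on the built-in quakes data; the threshold of 6 and the name strong are assumptions made for this example:

```r
# Keep only strong earthquakes, and only three of the five variables
strong <- quakes[quakes$mag >= 6, c("lat", "long", "mag")]

names(strong)            # "lat" "long" "mag"
all(strong$mag >= 6)     # TRUE: every remaining observation passes the test
```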
Now we will create subsets using logical values from two predefined datasets available in R. The first is rock, a data frame containing four variables and 48 observations:
- > str(rock)
- 'data.frame': 48 obs. of 4 variables:
- $ area : int 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
- $ peri : num 2792 3893 3931 3869 3949 ...
- $ shape: num 0.0903 0.1486 0.1833 0.1171 0.1224 ...
- $ perm : num 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
- >
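The subsets using logical values for the above rock dataset are as follows. The dataset is first assigned to data so that the following snippets refer to rock:
- > data = rock
- > da = data$area[data$area<10000]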
- > da
- [1] 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 9364 8624 8868 9417 8874
- [16] 9867 7838 8233 6360 4193 7416 5246 6509 4895 6775 7894 5980 5318 7392 7894
- [31] 3469 1468 3524 5267 5048 1016 5605 8793 3475 1651 5514 9718
- >
As the output shows, the less-than operator returns TRUE for the 42 observations with an area below 10000, and only those observations are kept. The result is stored in a variable named da.
- > ds = data$peri[data$peri<4000]
- > ds
- [1] 2791.900 3892.600 3930.660 3869.320 3948.540 3682.040 3098.650 3986.240
- [9] 3518.040 3999.370 3629.070 3428.740 3518.440 1977.390 1379.350 1916.240
- [17] 1585.420 1851.210 1239.660 1728.140 1461.060 1426.760 990.388 1350.760
- [25] 1461.060 1376.700 476.322 1189.460 1644.960 941.543 308.642 1145.690
- [33] 2280.490 1174.110 597.808 1455.880 1485.580
- >
As the output shows, the less-than operator returns TRUE for the 37 observations with a perimeter below 4000, and only those observations are kept. The result is stored in a variable named ds.
- > dn = data$shape[data$shape<1]
- > dn
- [1] 0.0903296 0.1486220 0.1833120 0.1170630 0.1224170 0.1670450 0.1896510
- [8] 0.1641270 0.2036540 0.1623940 0.1509440 0.1481410 0.2285950 0.2316230
- [15] 0.1725670 0.1534810 0.2043140 0.2627270 0.2000710 0.1448100 0.1138520
- [22] 0.2910290 0.2400770 0.1618650 0.2808870 0.1794550 0.1918020 0.1330830
- [29] 0.2252140 0.3412730 0.3116460 0.2760160 0.1976530 0.3266350 0.1541920
- [36] 0.2760160 0.1769690 0.4387120 0.1635860 0.2538320 0.3286410 0.2300810
- [43] 0.4641250 0.4204770 0.2007440 0.2626510 0.1824530 0.2004470
- >
As the output shows, every shape value is below 1, so the less-than operator returns TRUE for all 48 observations and none is excluded. The result is stored in a variable named dn.
- > dh = data$perm[data$perm<500]
- > dh
- [1] 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119.0 119.0 119.0 119.0 82.4 82.4 82.4 82.4 58.6 58.6 58.6 58.6 142.0 142.0 142.0 142.0 100.0 100.0 100.0
- [28] 100.0
- >
As the output shows, the less-than operator returns TRUE for the 28 observations with a permeability below 500, and only those observations are kept. The result is stored in a variable named dh.
The second dataset is mtcars. Its structure is as follows:
- > str(mtcars)
- 'data.frame': 32 obs. of 11 variables:
- $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
- $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
- $ disp: num 160 160 108 258 360 ...
- $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
- $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
- $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
- $ qsec: num 16.5 17 18.6 19.4 17 ...
- $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
- $ am : num 1 1 1 0 0 0 0 0 0 0 ...
- $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
- $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
- >
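The subsets using logical values of the mtcars dataset are as follows. The dataset is first assigned to data so that the following snippets refer to mtcars:
- > data = mtcars
- > ds = data$mpg[data$mpg<30]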
- > ds
- [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5 15.5 15.2 13.3 19.2 27.3 26.0 15.8 19.7 15.0 21.4
As the output shows, the less-than operator returns TRUE for the 28 observations with an mpg value below 30, and only those observations are kept. The result is stored in a variable named ds.
- > da = data$cyl[data$cyl<10]
- > da
- [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
As the output shows, every car has fewer than 10 cylinders, so the less-than operator returns TRUE for all 32 observations and none is excluded. The result is stored in a variable named da.
- > df = data$disp[data$disp<1000]
- > df
- [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0 400.0 79.0 120.3
- [28] 95.1 351.0 145.0 301.0 121.0
As the output shows, every disp value is below 1000, so the less-than operator returns TRUE for all 32 observations and none is excluded. The result is stored in a variable named df.
- > dn = data$hp[data$hp<300]
- > dn
- [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 109
As the output shows, the less-than operator returns TRUE for the 31 observations with an hp value below 300, and only those observations are kept. The result is stored in a variable named dn.
- > dg = data$drat[data$drat<10]
- > dg
- [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62 3.54 4.11
- >
As the output shows, every drat value is below 10, so the less-than operator returns TRUE for all 32 observations and none is excluded. The result is stored in a variable named dg.
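The individual conditions used above can also be combined. The sketch below joins two of them with the element-wise AND operator & and applies the result to whole rows of mtcars; the names keep and sub are introduced here for illustration:

```r
data <- mtcars

# TRUE only where both conditions hold for the same car
keep <- data$mpg < 30 & data$hp < 300

sub <- data[keep, c("mpg", "hp")]    # filtered rows, two selected variables

nrow(sub) == sum(keep)               # TRUE: one row per TRUE value
all(sub$mpg < 30 & sub$hp < 300)     # TRUE: every kept car passes both tests
```

Using & rather than && matters here: & compares the two logical vectors element by element, which is what row filtering needs.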
Summary
In this article, I demonstrated how to create subsets of a dataset using logical values in order to extract relevant data for analysis. The dollar, double square brackets, and single square brackets operators were applied to the quakes, rock, and mtcars datasets, with code snippets and their outputs shown for each.