Introduction
In this article, I demonstrate how to create a blank subset of a dataset in R, that is, a subset that keeps every variable and observation of the dataset. Creating subsets is a data pre-processing step used in R to extract clean, relevant data so that a machine learning model can make accurate predictions.
Pre-processing in R often involves creating subsets of a dataset. R provides several object types, such as data frames, vectors, arrays and lists, that can be subsetted and that can store the values of a subset. Each of these object types can be subsetted in more than one way.
Subsetting is one of the most important pre-processing tasks in R. The operators that can be used to create a subset of a dataset are described below.
Different types of operators for creating subsets of data
There are three operators that can be used to create subsets.
Dollar operator ($)
The dollar operator selects a single variable of a dataset by name. Writing the dataset name, the dollar operator and the variable name returns that variable alone. When the dollar operator is used with a data frame, the result is a vector.
The following examples show how to use the dollar operator to create subsets of a dataset.
We will use the built-in mtcars dataset to demonstrate each operator.
- > data = mtcars
- > data
- mpg cyl disp hp drat wt qsec vs am gear carb
- Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
- Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
- Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
- Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
- Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
- Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
- Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
- Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
- Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
- Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
- Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
- Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
- Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
- Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
- Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
- Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
- Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
- Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
- Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
- Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
- Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
- AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
- Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
- Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
- Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
- Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
- Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
- Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
- Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
- Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
- Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
- >
Now we will use dollar operator with mpg variable.
- > ds = data$mpg
- > ds
- [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
- [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
- [31] 15.0 21.4
As the above output shows, using the dollar operator with the dataset name and a variable name creates a subset of the mtcars dataset. The subset contains the observations of the mpg variable and is stored in a variable named ds.
- > df = data$cyl
- > df
- [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
In the same way, the above code creates a subset containing the observations of the cyl variable. The subset is stored in a variable named df.
- > da = data$disp
- > da
- [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0 400.0 79.0 120.3
- [28] 95.1 351.0 145.0 301.0 121.0
Here the subset contains the observations of the disp variable and is stored in a variable named da.
- > dn = data$hp
- > dn
- [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
Here the subset contains the observations of the hp variable and is stored in a variable named dn.
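The claim that the dollar operator returns a vector rather than a data frame can be checked directly. The following is a minimal sketch using the same mtcars dataset; the checks with is.data.frame(), is.numeric() and identical() are additions for illustration and are not part of the original examples.

```r
# Copy the built-in dataset, as in the examples above
data <- mtcars

# Select a single variable with the dollar operator
ds <- data$mpg

# The result is a plain numeric vector, not a data frame
is_df <- is.data.frame(ds)        # FALSE
is_num <- is.numeric(ds)          # TRUE

# It holds one value per observation (row) of the dataset
n_values <- length(ds)            # 32, the same as nrow(data)

# The double square brackets operator with the name gives the same vector
same_as_double <- identical(ds, data[["mpg"]])   # TRUE
```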
Double square brackets operator ([[)
The double square brackets operator creates a subset containing either all observations of a single variable of a dataset or a single observation of a particular variable. A variable can be identified by its name or by its index position, and a single observation by its row and column index positions. The double square brackets operator can be used with a data frame.
- > data[['mpg']]
- [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
The above code creates a subset containing the single variable mpg. The argument inside the double square brackets is the variable name.
- > data[[3]]
- [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0 400.0 79.0 120.3
- [28] 95.1 351.0 145.0 301.0 121.0
The above code creates a subset containing the single variable disp. The argument inside the double square brackets is the index position of the variable disp.
A single observation of a variable can be selected in the same way by supplying both a row index position and a column index position inside the double square brackets. For example, data[[1, 3]] selects the first observation of the variable disp.
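Selecting a single observation with row and column positions can be sketched as follows. The choice of row 1 is only an illustration, and the variable names first_disp and same_by_name are hypothetical.

```r
data <- mtcars

# Row 1, column 3: the first observation of disp
first_disp <- data[[1, 3]]          # 160

# The column can also be given by name instead of by position
same_by_name <- data[[1, "disp"]]   # 160

# Both forms return a single value, not a data frame
n_selected <- length(first_disp)    # 1
```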
Single square brackets operator ([)
The single square brackets operator creates a subset containing all observations of one or more variables of a dataset. The following examples show how to use the single square brackets operator to create subsets of a dataset.
- > data[]
- mpg cyl disp hp drat wt qsec vs am gear carb
- Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
- Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
- Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
- Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
- Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
- Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
- Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
- Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
- Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
As the above output shows, the single square brackets operator with nothing inside the brackets creates a subset of the mtcars dataset containing all the variables and observations (only the first ten rows of the output are shown here).
- > data[c(3,2,5)]
- disp cyl drat
- Mazda RX4 160.0 6 3.90
- Mazda RX4 Wag 160.0 6 3.90
- Datsun 710 108.0 4 3.85
- Hornet 4 Drive 258.0 6 3.08
- Hornet Sportabout 360.0 8 3.15
- Valiant 225.0 6 2.76
- Duster 360 360.0 8 3.21
- Merc 240D 146.7 4 3.69
- Merc 230 140.8 4 3.92
- Merc 280 167.6 6 3.92
- Merc 280C 167.6 6 3.92
- Merc 450SE 275.8 8 3.07
- Merc 450SL 275.8 8 3.07
- Merc 450SLC 275.8 8 3.07
- Cadillac Fleetwood 472.0 8 2.93
- Lincoln Continental 460.0 8 3.00
- Chrysler Imperial 440.0 8 3.23
- Fiat 128 78.7 4 4.08
- Honda Civic 75.7 4 4.93
- Toyota Corolla 71.1 4 4.22
- Toyota Corona 120.1 4 3.70
- Dodge Challenger 318.0 8 2.76
- AMC Javelin 304.0 8 3.15
- Camaro Z28 350.0 8 3.73
- Pontiac Firebird 400.0 8 3.08
- Fiat X1-9 79.0 4 4.08
- Porsche 914-2 120.3 4 4.43
- Lotus Europa 95.1 4 3.77
- Ford Pantera L 351.0 8 4.22
- Ferrari Dino 145.0 6 3.62
- Maserati Bora 301.0 8 3.54
- Volvo 142E 121.0 4 4.11
The above code selects the variables at index positions 3, 2 and 5 (disp, cyl and drat), in the order given, and creates a subset containing all observations of those variables.
- > data[c(4,3,6)]
- hp disp wt
- Mazda RX4 110 160.0 2.620
- Mazda RX4 Wag 110 160.0 2.875
- Datsun 710 93 108.0 2.320
- Hornet 4 Drive 110 258.0 3.215
- Hornet Sportabout 175 360.0 3.440
- Valiant 105 225.0 3.460
- Duster 360 245 360.0 3.570
- Merc 240D 62 146.7 3.190
- Merc 230 95 140.8 3.150
- Merc 280 123 167.6 3.440
- Merc 280C 123 167.6 3.440
- Merc 450SE 180 275.8 4.070
- Merc 450SL 180 275.8 3.730
- Merc 450SLC 180 275.8 3.780
- Cadillac Fleetwood 205 472.0 5.250
- Lincoln Continental 215 460.0 5.424
- Chrysler Imperial 230 440.0 5.345
- Fiat 128 66 78.7 2.200
- Honda Civic 52 75.7 1.615
- Toyota Corolla 65 71.1 1.835
- Toyota Corona 97 120.1 2.465
- Dodge Challenger 150 318.0 3.520
- AMC Javelin 150 304.0 3.435
- Camaro Z28 245 350.0 3.840
- Pontiac Firebird 175 400.0 3.845
- Fiat X1-9 66 79.0 1.935
- Porsche 914-2 91 120.3 2.140
- Lotus Europa 113 95.1 1.513
- Ford Pantera L 264 351.0 3.170
- Ferrari Dino 175 145.0 2.770
- Maserati Bora 335 301.0 3.570
- Volvo 142E 109 121.0 2.780
- >
The above code selects the variables at index positions 4, 3 and 6 (hp, disp and wt), in the order given, and creates a subset containing all observations of those variables.
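The same subset can also be requested by variable name instead of index position, which tends to be more robust when the column order of a dataset changes. The sketch below compares the two forms; the names by_position and by_name are hypothetical.

```r
data <- mtcars

# Select columns 4, 3 and 6 by index position, as in the example above
by_position <- data[c(4, 3, 6)]

# Select the same columns by name, in the same order
by_name <- data[c("hp", "disp", "wt")]

# Both calls return the same data frame
same_subset <- identical(by_position, by_name)   # TRUE

# The subset keeps every observation and only the three chosen variables
subset_dim <- dim(by_name)                       # 32 rows, 3 columns
```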
The difference between the double square brackets operator and the single square brackets operator lies in how many variables they select and in the type of the result. The [[ operator selects a single variable and returns its observations as a vector, whereas [ can select multiple variables and returns a subset of the same type as the dataset.
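The difference in result type is easiest to see by selecting the same single variable with both operators. This is a small illustrative check; the names one_col_df and one_col_vec are hypothetical.

```r
data <- mtcars

# Single square brackets: the result is still a data frame (with one column)
one_col_df <- data["mpg"]
class_single <- class(one_col_df)    # "data.frame"

# Double square brackets: the result is the column itself, a numeric vector
one_col_vec <- data[["mpg"]]
class_double <- class(one_col_vec)   # "numeric"

# The values are the same; only the container differs
same_values <- identical(one_col_df$mpg, one_col_vec)   # TRUE
```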
For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list or other recursive object, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
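The behaviour described for lists can be illustrated with a small nested list. The list x and its element names are made up for this sketch and do not come from any dataset used in the article.

```r
# A hypothetical nested list, used only for illustration
x <- list(a = 1:3, b = list(c = "inner", d = 4.5))

# [[ returns the selected element itself
element <- x[["a"]]            # the integer vector 1 2 3

# [ returns a list containing the selected elements
sub_list <- x["a"]             # a list of length 1
is_sub_list <- is.list(sub_list)   # TRUE

# A vector index inside [[ is applied in turn: first "b", then "c" within it
nested_by_name <- x[[c("b", "c")]]   # "inner"
nested_by_pos <- x[[c(2, 2)]]        # 4.5
```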
So far, the operators have been used to create subsets containing a specified number of variables of a dataset. Next, we discuss how to create blank subsets, which contain all the variables and observations of a dataset.
Creating blank subsets of a dataset
The single square brackets operator can create a subset containing more than one variable. To do so, list the required variables inside the single square brackets.
A blank subset is created by writing the dataset name followed by single square brackets with nothing inside them. A blank subset contains all the variables and observations of the dataset. Listing variables inside the brackets instead restricts the resulting subset to those variables.
We will first use the predefined rock dataset, a data frame containing four variables and 48 observations, and then the mtcars dataset, to create blank subsets containing all the variables of each dataset.
The structure of the rock dataset is as follows,
- > str(rock)
- 'data.frame': 48 obs. of 4 variables:
- $ area : int 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
- $ peri : num 2792 3893 3931 3869 3949 ...
- $ shape: num 0.0903 0.1486 0.1833 0.1171 0.1224 ...
- $ perm : num 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
- >
The blank subset for the above rock dataset is as follows,
- > rock[]
- area peri shape perm
- 1 4990 2791.900 0.0903296 6.3
- 2 7002 3892.600 0.1486220 6.3
- 3 7558 3930.660 0.1833120 6.3
- 4 7352 3869.320 0.1170630 6.3
- 5 7943 3948.540 0.1224170 17.1
- 6 7979 4010.150 0.1670450 17.1
- 7 9333 4345.750 0.1896510 17.1
- 8 8209 4344.750 0.1641270 17.1
- 9 8393 3682.040 0.2036540 119.0
- 10 6425 3098.650 0.1623940 119.0
- 11 9364 4480.050 0.1509440 119.0
- 12 8624 3986.240 0.1481410 119.0
- 13 10651 4036.540 0.2285950 82.4
- 14 8868 3518.040 0.2316230 82.4
- 15 9417 3999.370 0.1725670 82.4
- 16 8874 3629.070 0.1534810 82.4
- 17 10962 4608.660 0.2043140 58.6
- 18 10743 4787.620 0.2627270 58.6
- 19 11878 4864.220 0.2000710 58.6
- 20 9867 4479.410 0.1448100 58.6
- 21 7838 3428.740 0.1138520 142.0
- 22 11876 4353.140 0.2910290 142.0
- 23 12212 4697.650 0.2400770 142.0
- 24 8233 3518.440 0.1618650 142.0
- 25 6360 1977.390 0.2808870 740.0
- 26 4193 1379.350 0.1794550 740.0
- 27 7416 1916.240 0.1918020 740.0
- 28 5246 1585.420 0.1330830 740.0
- 29 6509 1851.210 0.2252140 890.0
- 30 4895 1239.660 0.3412730 890.0
- 31 6775 1728.140 0.3116460 890.0
- 32 7894 1461.060 0.2760160 890.0
- 33 5980 1426.760 0.1976530 950.0
- 34 5318 990.388 0.3266350 950.0
- 35 7392 1350.760 0.1541920 950.0
- 36 7894 1461.060 0.2760160 950.0
- 37 3469 1376.700 0.1769690 100.0
- 38 1468 476.322 0.4387120 100.0
- 39 3524 1189.460 0.1635860 100.0
- 40 5267 1644.960 0.2538320 100.0
- 41 5048 941.543 0.3286410 1300.0
- 42 1016 308.642 0.2300810 1300.0
- 43 5605 1145.690 0.4641250 1300.0
- 44 8793 2280.490 0.4204770 1300.0
- 45 3475 1174.110 0.2007440 580.0
- 46 1651 597.808 0.2626510 580.0
- 47 5514 1455.880 0.1824530 580.0
- 48 9718 1485.580 0.2004470 580.0
- >
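Because a blank subset contains every variable and observation, it should be indistinguishable from the dataset itself. The following sketch checks this for rock; the name all_rock is hypothetical.

```r
# Blank subset of the built-in rock dataset
all_rock <- rock[]

# The blank subset is identical to the original data frame
is_same <- identical(all_rock, rock)   # TRUE

# Same dimensions and variables as reported by str(rock)
rock_dim <- dim(all_rock)              # 48 rows, 4 columns
rock_vars <- names(all_rock)           # "area" "peri" "shape" "perm"
```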
The structure of mtcars dataset is as follows,
- > str(mtcars)
- 'data.frame': 32 obs. of 11 variables:
- $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
- $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
- $ disp: num 160 160 108 258 360 ...
- $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
- $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
- $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
- $ qsec: num 16.5 17 18.6 19.4 17 ...
- $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
- $ am : num 1 1 1 0 0 0 0 0 0 0 ...
- $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
- $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
- >
The blank subset of mtcars dataset is as follows,
- > mtcars[]
- mpg cyl disp hp drat wt qsec vs am gear carb
- Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
- Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
- Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
- Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
- Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
- Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
- Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
- Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
- Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
- Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
- Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
- Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
- Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
- Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
- Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
- Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
- Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
- Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
- Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
- Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
- Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
- Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
- AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
- Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
- Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
- Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
- Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
- Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
- Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
- Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
- Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
- Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
- >
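One place where a blank subset is commonly useful is on the left-hand side of an assignment: replacing the contents of every variable while keeping the data frame structure, row names and variable names. The sketch below is an illustration under that assumption and is not part of the original examples; the name scaled and the choice of dividing each variable by its maximum are hypothetical.

```r
# Work on a copy so the built-in dataset is left untouched
scaled <- mtcars

# Assigning into the blank subset replaces every column's values
# but keeps the object a data frame with the same row and column names
scaled[] <- lapply(mtcars, function(col) col / max(col))

still_df <- is.data.frame(scaled)                           # TRUE
same_shape <- identical(dim(scaled), dim(mtcars))           # TRUE
same_rows <- identical(rownames(scaled), rownames(mtcars))  # TRUE
```

Without the empty brackets, scaled <- lapply(...) would replace the data frame with a plain list.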
Summary
In this article, I demonstrated how to create a blank subset of a dataset in R so as to extract relevant data for analysis. The dollar, double square brackets and single square brackets operators were applied to the mtcars and rock datasets, and code snippets with their outputs were provided for each.