Wrangling a big dataset is like sumo wrestling -
you have to use leverage
last update: 2021-08-08
which() and subsetting
data.frame objects
aggregate()
Vectors of data are like houses on a street. Each house contains a data value and each house has an address.
A vector of data has ‘addresses’ 1 to i.
my_vector[1:i] indicates all the addresses in a vector.
my_vec <- c(1,7,3,5) # 4 addresses
length(my_vec)
## [1] 4
my_vec[1:4]
## [1] 1 7 3 5
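Addresses do not have to be a full 1:i range. The following is a minimal sketch, reusing my_vec from above, of picking specific addresses, dropping one, and selecting by a condition; the result names (second_and_fourth, without_first, above_two) are made up for illustration.

```r
my_vec <- c(1, 7, 3, 5)

# pick individual addresses
second_and_fourth <- my_vec[c(2, 4)]  # 7 5

# a negative address drops that element
without_first <- my_vec[-1]           # 7 3 5

# a logical condition keeps the elements where it is TRUE
above_two <- my_vec[my_vec > 2]       # 7 3 5
```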
Matrices of data are like houses on multiple streets. Each house has an address and so does each “row” of houses.
A matrix of data has rows and columns: row “addresses” 1 to i and column “addresses” 1 to j.
my_matrix[1:i, 1:j] indicates all the addresses in a matrix.
my_mat <- matrix(c(1,7,3,5), nrow = 2) # 2 rows
dim(my_mat)
## [1] 2 2
my_mat[1:2, 1:2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    7    5
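Leaving a position empty inside the brackets means "every address on that side". A short sketch with the same my_mat (the result names are illustrative only); note that matrix() fills column by column, which is why 7 lands in row 2 of column 1.

```r
my_mat <- matrix(c(1, 7, 3, 5), nrow = 2)

first_row  <- my_mat[1, ]   # all columns of row 1: 1 3
second_col <- my_mat[, 2]   # all rows of column 2: 3 5
one_cell   <- my_mat[2, 1]  # row 2, column 1: 7
```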
Arrays of data are like skyscrapers (well, ones with 3 dimensions are…). Each array viewed from above has rows and columns of rooms, but also depth consisting of floors.
So, arrays can have more than 2 dimensions.
A 3 dimensional array has row “addresses” 1 to i, column “addresses” 1 to j, and depth “addresses” 1 to k.
my_arr[1:i, 1:j, 1:k] indicates all the addresses in the array.
my_arr <- array(c(1,7,3,5, 2,5,3,4, 3,6,7,8), dim = c(2,2,3)) # i,j,k
dim(my_arr)
## [1] 2 2 3
my_arr[1:2, 1:2, 1:3]
## , , 1
##
##      [,1] [,2]
## [1,]    1    3
## [2,]    7    5
##
## , , 2
##
##      [,1] [,2]
## [1,]    2    3
## [2,]    5    4
##
## , , 3
##
##      [,1] [,2]
## [1,]    3    7
## [2,]    6    8
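In the skyscraper picture, fixing k picks out one floor, and fixing all three addresses picks out one room. A small sketch with the same my_arr (floor_two and one_room are illustrative names):

```r
my_arr <- array(c(1, 7, 3, 5,  2, 5, 3, 4,  3, 6, 7, 8), dim = c(2, 2, 3))

# one whole "floor": all rows and columns at depth 2 (a 2 x 2 matrix)
floor_two <- my_arr[, , 2]

# one "room": row 2, column 1, depth 3
one_room <- my_arr[2, 1, 3]  # 6
```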
data.frame indices
Data frames are like matrices, where the columns have names
data("OrchardSprays")
head(OrchardSprays)
##   decrease rowpos colpos treatment
## 1       57      1      1         D
## 2       95      2      1         E
## 3        8      3      1         B
## 4       69      4      1         H
## 5       92      5      1         G
## 6       90      6      1         F
data.frame indices
Access data with numerical indices
# rows 1:6 of column 1
OrchardSprays[1:6, 1]
## [1] 57 95 8 69 92 90
# by column name
OrchardSprays[1:6, "treatment"]
## [1] D E B H G F
## Levels: A B C D E F G H
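Columns of a data frame can also be reached with $ or [[ ]], and several columns can be requested at once by name. A brief sketch using the same built-in OrchardSprays data (first_six, same_col and small_df are illustrative names):

```r
data("OrchardSprays")

# $ and [[ ]] both return a whole column as a vector
first_six <- OrchardSprays$decrease[1:6]
same_col  <- OrchardSprays[["decrease"]][1:6]

# rows 1 to 3 of two named columns; the result is still a data frame
small_df <- OrchardSprays[1:3, c("decrease", "treatment")]
dim(small_df)  # 3 2
```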
which() function
which() returns the addresses (indices) of the elements for which a logical condition is TRUE
OrchardSprays$treatment == "A"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [37] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE  TRUE FALSE
which(OrchardSprays$treatment == "A")
## [1] 8 14 19 26 36 41 53 63
which()
OrchardSprays$decrease
##  [1]  57  95   8  69  92  90  15   2  84   6 127  36  51   2  69  71  87  72   5
## [20]  39  22  16  72   4 130   4 114   9  20  24  10  51  43  28  60   5  17   7
## [39]  81  71  12  29  44  77   4  27  47  76   8  72  13  57   4  81  20  61  80
## [58] 114  39  14  86  55   3  19
(my_selec <- which(OrchardSprays$treatment == "A"))
## [1] 8 14 19 26 36 41 53 63
OrchardSprays$decrease[my_selec]
## [1] 2 2 5 4 5 12 4 3
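The addresses from which() can select whole rows as well as single values, and a logical vector placed directly inside the brackets gives the same values here. A sketch under the assumption that the treatment column has no missing values (true for OrchardSprays); with NAs present, which() silently drops them while logical indexing returns NA entries. The names rows_a, via_which, via_logical and spray_a are illustrative.

```r
data("OrchardSprays")

rows_a <- which(OrchardSprays$treatment == "A")

# the same values, by address and by logical condition
via_which   <- OrchardSprays$decrease[rows_a]
via_logical <- OrchardSprays$decrease[OrchardSprays$treatment == "A"]
identical(via_which, via_logical)  # TRUE

# whole rows for treatment A (8 rows, all 4 columns)
spray_a <- OrchardSprays[rows_a, ]
dim(spray_a)  # 8 4
```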
aggregate() function
Summarizes a data object with FUN (a function you choose).
E.g. the mean of a variable for each factor level
aggregate(x = OrchardSprays$decrease, by = list(OrchardSprays$treatment), FUN = mean)
##   Group.1      x
## 1       A  4.625
## 2       B  7.625
## 3       C 25.250
## 4       D 35.000
## 5       E 63.125
## 6       F 69.000
## 7       G 68.500
## 8       H 90.250
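aggregate() also accepts a formula, which keeps the original column names in the output instead of Group.1 and x. A sketch of the same per-treatment mean written that way; as a cross-check, the treatment A mean equals the mean of the values picked out with which() earlier (means_by_trt and mean_a are illustrative names).

```r
data("OrchardSprays")

# formula interface: response ~ grouping variable
means_by_trt <- aggregate(decrease ~ treatment, data = OrchardSprays, FUN = mean)
names(means_by_trt)  # "treatment" "decrease"

# cross-check treatment A against the which() selection
mean_a <- mean(OrchardSprays$decrease[which(OrchardSprays$treatment == "A")])
mean_a  # 4.625
```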