last update: 2021-08-08

Data sumo

Wrangling a big dataset is like sumo wrestling -

you have to use leverage

What you will learn

  • Indexing concept
  • Using which() and subsetting
  • Selection on data.frame objects
  • Using aggregate()
  • Practice exercises

“Tidy Data” concept

Indexing concept

Vectors of data are like houses on a street. Each house contains a data value and each house has an address.

A vector of data has ‘addresses’ 1 to i, where i is the length of the vector.

my_vec[1:i] selects all the addresses in the vector my_vec.

my_vec <- c(1,7,3,5) # 4 addresses
length(my_vec)
## [1] 4
my_vec[1:4]
## [1] 1 7 3 5
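
The address idea also covers picking out one house or a few houses at a time. A minimal sketch, reusing the values of my_vec from above; the result names (second_value and so on) are made up for illustration:

```r
# Same values as my_vec above
my_vec <- c(1, 7, 3, 5)

second_value   <- my_vec[2]         # one address
first_and_last <- my_vec[c(1, 4)]   # several addresses at once
without_first  <- my_vec[-1]        # a negative index drops that address
all_values     <- my_vec[seq_along(my_vec)]  # every address, 1 to length(my_vec)
```

seq_along(my_vec) is used here in place of 1:length(my_vec) because it also behaves sensibly for a vector of length 0.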

Matrix indices

Matrices of data are like houses on multiple streets. Each house has an address and so does each “row” of houses.

A matrix of data has rows and columns.

Row “addresses” run from 1 to i and column addresses from 1 to j. my_mat[1:i, 1:j] indicates all the addresses in a matrix.

my_mat <- matrix(c(1,7,3,5), nrow = 2) # 2 rows
dim(my_mat)
## [1] 2 2
my_mat[1:2, 1:2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    7    5
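
Leaving the row or the column position empty selects a whole row or a whole column. A small sketch using the same my_mat as above; the result names are hypothetical:

```r
# Same matrix as above; values fill column by column
my_mat <- matrix(c(1, 7, 3, 5), nrow = 2)

one_house  <- my_mat[2, 1]   # row 2, column 1
first_row  <- my_mat[1, ]    # all columns of row 1
second_col <- my_mat[, 2]    # all rows of column 2
```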

Array indices

Arrays of data are like skyscrapers (well, ones with 3 dimensions are…). Each array viewed from above has rows and columns of rooms, but also depth consisting of floors.

So, arrays can have more than 2 dimensions.

A 3 dimensional array has row “addresses” 1 to i, column addresses 1 to j, and depth of 1 to k.

my_arr[1:i, 1:j, 1:k] indicates all the addresses in the array.

my_arr <- array(c(1,7,3,5,
                  2,5,3,4,
                  3,6,7,8), 
                dim = c(2,2,3)) # i,j,k
dim(my_arr)
## [1] 2 2 3

my_arr[1:2, 1:2, 1:3]
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    7    5
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    2    3
## [2,]    5    4
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    3    7
## [2,]    6    8
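
The same pattern reaches a single room or a whole floor of the skyscraper. A minimal sketch with the same my_arr as above; one_room and second_floor are illustrative names:

```r
# Same array as above: 2 rows, 2 columns, 3 floors
my_arr <- array(c(1, 7, 3, 5,
                  2, 5, 3, 4,
                  3, 6, 7, 8),
                dim = c(2, 2, 3))

one_room     <- my_arr[2, 1, 3]  # row 2, column 1, floor 3
second_floor <- my_arr[, , 2]    # every row and column on floor 2 (a 2 x 2 matrix)
```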

data.frame indices

Data frames are like matrices where the columns have names, and each column can hold a different type of data

data("OrchardSprays")
head(OrchardSprays)
##   decrease rowpos colpos treatment
## 1       57      1      1         D
## 2       95      2      1         E
## 3        8      3      1         B
## 4       69      4      1         H
## 5       92      5      1         G
## 6       90      6      1         F

Access data with numerical indices or by column name

# rows 1:6 of column 1
OrchardSprays[1:6, 1]
## [1] 57 95  8 69 92 90
# by column name
OrchardSprays[1:6, "treatment"]
## [1] D E B H G F
## Levels: A B C D E F G H
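
A few other ways to reach the same data, sketched with the built-in OrchardSprays data; the result names are hypothetical:

```r
data("OrchardSprays")

# $ pulls out one column as a vector, which can then be indexed like any vector
first_six <- OrchardSprays$decrease[1:6]

# Several columns at once, by name; the result is still a data frame
small_df <- OrchardSprays[1:3, c("decrease", "treatment")]

# drop = FALSE keeps a single column as a data frame instead of a vector
one_col_df <- OrchardSprays[1:6, "treatment", drop = FALSE]
```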

The which() function

which() returns the addresses (indices) of the elements for which a logical condition is TRUE

OrchardSprays$treatment == "A"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [37] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE  TRUE FALSE
which(OrchardSprays$treatment == "A")
## [1]  8 14 19 26 36 41 53 63
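
Conditions can be combined with & (and) or | (or) before they are passed to which(). A minimal sketch on a small made-up vector, which also shows one way which() differs from indexing with the logical vector directly: it skips NA values.

```r
# Hypothetical values for illustration
my_vec <- c(1, 7, 3, 5)

big_addresses <- which(my_vec > 2)               # addresses where the value exceeds 2
mid_addresses <- which(my_vec > 2 & my_vec < 6)  # both conditions must be TRUE

# which() skips NA; a logical index keeps it as NA in the result
with_na    <- c(1, NA, 7)
by_which   <- with_na[which(with_na > 2)]
by_logical <- with_na[with_na > 2]
```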

Powerful Sumo with which()

OrchardSprays$decrease
##  [1]  57  95   8  69  92  90  15   2  84   6 127  36  51   2  69  71  87  72   5
## [20]  39  22  16  72   4 130   4 114   9  20  24  10  51  43  28  60   5  17   7
## [39]  81  71  12  29  44  77   4  27  47  76   8  72  13  57   4  81  20  61  80
## [58] 114  39  14  86  55   3  19
(my_selec <- which(OrchardSprays$treatment == "A"))
## [1]  8 14 19 26 36 41 53 63
OrchardSprays$decrease[my_selec]
## [1]  2  2  5  4  5 12  4  3
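
The addresses from which() can also go in the row position of a data frame to keep whole rows. A sketch, assuming we want treatment "A" rows with a decrease above 4 (the cutoff of 4 is an arbitrary choice for illustration):

```r
data("OrchardSprays")

# Addresses of rows that meet both conditions
rows <- which(OrchardSprays$treatment == "A" & OrchardSprays$decrease > 4)

# Addresses go in the row position; the empty column position keeps every column
a_over_4 <- OrchardSprays[rows, ]

# The same kind of selection can feed a summary function
mean_a <- mean(OrchardSprays$decrease[which(OrchardSprays$treatment == "A")])
```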

The aggregate() function

Summarizes a data object by group with FUN (a function you choose).

E.g. the mean of a variable for each level of a factor

aggregate(x = OrchardSprays$decrease,
          by = list(OrchardSprays$treatment),
          FUN = mean)
##   Group.1      x
## 1       A  4.625
## 2       B  7.625
## 3       C 25.250
## 4       D 35.000
## 5       E 63.125
## 6       F 69.000
## 7       G 68.500
## 8       H 90.250
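
aggregate() also accepts a formula, which gives the output columns meaningful names instead of Group.1 and x. A sketch of the same summary written that way; mean_by_trt and max_by_trt are hypothetical names:

```r
data("OrchardSprays")

# Read the formula as: decrease, grouped by treatment
mean_by_trt <- aggregate(decrease ~ treatment,
                         data = OrchardSprays,
                         FUN = mean)

# Any function that returns one value per group can be swapped in
max_by_trt <- aggregate(decrease ~ treatment,
                        data = OrchardSprays,
                        FUN = max)
```

Naming the list element in the by argument, for example by = list(treatment = OrchardSprays$treatment), is another way to get a readable group column.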

Live coding
