r - How to combine columns with identical names in a large sparse Matrix -

- March 15, 2014

I have a sparse dgTMatrix from the Matrix package, which has picked up some duplicate colnames. I want to combine these by summing the columns with the same names, forming a reduced matrix.

I found this post, which I adapted for sparse matrix operations. But it is still very slow on large objects. I am wondering if someone has a better solution that operates directly on the indexed elements of the sparse matrix and would be faster. For instance, A@j indexes (from zero) the labels in A@Dimnames[[2]], which could be compacted and used to reindex A@j. (Note: this is why I used the triplet sparse matrix form rather than the Matrix default of column-sparse matrices, since figuring out that p value makes my head hurt every time.)
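For readers who do have a column-compressed matrix, the p slot mentioned in the note can be expanded back into per-entry column indexes. The following is a minimal sketch, not part of the original post; the names M, j1 and Tm are made up for illustration.

```r
library(Matrix)

# Small column-compressed (dgCMatrix) example; M is a hypothetical name
M <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2))

# p holds cumulative counts of stored entries per column, starting at 0,
# so diff(M@p) is the number of stored entries in each column
M@p

# Repeating each column number by its entry count gives a 1-based
# column index for every stored value
j1 <- rep(seq_len(ncol(M)), diff(M@p))

# The triplet form stores the same indexes directly (0-based) in its j slot
Tm <- as(M, "TsparseMatrix")
identical(j1, Tm@j + 1L)
```

Converting to the triplet form, as the question does, sidesteps this expansion entirely.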

    require(Matrix)

    # set up a (triplet) sparseMatrix
    # (newer Matrix releases deprecate giveCsparse in favour of repr = "T")
    A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                      giveCsparse = FALSE,
                      dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
    A
    ## 2 x 6 sparse Matrix of class "dgTMatrix"
    ##    a b c a b c
    ## r1 1 . 3 . 2 .
    ## r2 . 2 . 1 . 3

    str(A)
    ## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i       : int [1:6] 0 1 0 1 0 1
    ##   ..@ j       : int [1:6] 0 1 2 3 4 5
    ##   ..@ Dim     : int [1:2] 2 6
    ##   ..@ Dimnames:List of 2
    ##   .. ..$ : chr [1:2] "r1" "r2"
    ##   .. ..$ : chr [1:6] "a" "b" "c" "a" ...
    ##   ..@ x       : num [1:6] 1 2 3 1 2 3
    ##   ..@ factors : list()

    # my matrix-based attempt: multiply by a 0/1 indicator matrix that maps
    # each original column to its unique name
    OP1 <- function(x) {
        nms <- colnames(x)
        if (any(duplicated(nms)))
            x <- x %*% Matrix(sapply(unique(nms), "==", nms))
        x
    }

    OP1(A)
    ## 2 x 3 sparse Matrix of class "dgCMatrix"
    ##    a b c
    ## r1 1 2 3
    ## r2 1 2 3

It worked fine, but seems quite slow on the huge sparse objects on which I intend to use it. Here's a larger item:

    # now something bigger, for testing
    set.seed(10)
    nr <- 10000     # rows
    nc <- 26*100    # columns - 100 repetitions of a-z
    nonZeroN <- round(nr * nc / 3)  # two-thirds sparse
    B <- sparseMatrix(i = sample(1:nr, size = nonZeroN, replace = TRUE),
                      j = sample(1:nc, size = nonZeroN, replace = TRUE),
                      x = round(runif(nonZeroN)*5+1),
                      giveCsparse = FALSE,
                      dimnames = list(paste0("r", 1:nr), rep(letters, nc/26)))
    print(B[1:5, 1:10], col.names = TRUE)
    ## 5 x 10 sparse Matrix of class "dgTMatrix"
    ##    a b c  d e f g h i  j
    ## r1  . . 5  . . 2 . . .  .
    ## r2  . . .  . . . . . .  4
    ## r4  . . .  . . . . 3 3  .
    ## r3  2 2 .  3 . . . 3 .  .
    ## r5  3 . .  1 . . . . .  5

    require(microbenchmark)
    microbenchmark(OPmatrixCombine1 = OP1(B), times = 30)
    ## Unit: milliseconds
    ##              expr      min       lq     mean   median       uq      max neval
    ##  OPmatrixCombine1 578.9222 619.3912 665.6301 631.4219 646.2716 1013.777    30

Is there a better way, where better means faster and, if possible, not requiring the construction of additional large objects?
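One way to avoid a large intermediate object while keeping the matrix-product idea is to build the 0/1 indicator matrix directly in sparse form, so that it stores a single entry per original column instead of a dense logical matrix. The sketch below is an illustration only: the helper name combine_cols_sparse_indicator is hypothetical, and it has not been benchmarked against the timings above.

```r
library(Matrix)

# Hypothetical helper: sum same-named columns through a sparse indicator matrix
combine_cols_sparse_indicator <- function(x) {
    nms <- colnames(x)
    unms <- unique(nms)
    # One stored 1 per original column:
    # row = original column position, column = position of its unique name
    ind <- sparseMatrix(i = seq_along(nms),
                        j = match(nms, unms),
                        x = rep(1, length(nms)),
                        dims = c(length(nms), length(unms)),
                        dimnames = list(NULL, unms))
    x %*% ind
}

# Same small example as in the question (default column-compressed form)
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
res <- combine_cols_sparse_indicator(A)
res
```

The indicator has exactly as many stored entries as the input has columns, whatever the number of unique names.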

Here's an attempt using the index reindexing I had in mind, which I figured out with a friend's help (Patrick, is that you?). It reindexes the j values, and uses the very handy feature of sparseMatrix() that adds together the x values of elements whose index positions are the same.

    OP2 <- function(x) {
        nms <- colnames(x)
        uniquenms <- unique(nms)
        # build the sparseMatrix again: x's with the same index values are
        # automatically added together, keeping in mind that indexes are
        # stored from 0 but built from 1
        sparseMatrix(i = x@i + 1,
                     j = match(nms, uniquenms)[x@j + 1],
                     x = x@x,
                     # explicit dims so that empty trailing rows are not dropped
                     dims = c(nrow(x), length(uniquenms)),
                     dimnames = list(rownames(x), uniquenms),
                     giveCsparse = FALSE)
    }
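The summing behaviour this relies on can be seen in isolation. A minimal sketch with a made-up matrix D, in which two triplets address the same cell:

```r
library(Matrix)

# Two triplets point at cell (1, 1); a third points at cell (2, 2)
D <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(2, 3, 4))

# The values for the duplicated (1, 1) position are added: 2 + 3
D[1, 1]
D[2, 2]
```

Remapping several column indexes onto one therefore produces the column sums without any explicit aggregation step.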

The results are the same:

    OP2(A)
    ## 2 x 3 sparse Matrix of class "dgCMatrix"
    ##    a b c
    ## r1 1 2 3
    ## r2 1 2 3

    all.equal(as(OP1(B), "dgTMatrix"), OP2(B))
    ## [1] TRUE
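On a matrix small enough to densify, the result can also be cross-checked against base R. This is a sketch only: the names dense, ref and sp are made up, and the reindexing step is repeated inline so that the block runs on its own.

```r
library(Matrix)

A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))

# Dense reference: rowsum() sums rows by group, so transpose, group by
# column name, then transpose back. reorder = FALSE keeps the names in
# order of first appearance, as unique() does.
dense <- as.matrix(A)
ref <- t(rowsum(t(dense), group = colnames(dense), reorder = FALSE))

# Sparse result via the same reindexing idea, on the triplet form
At <- as(A, "TsparseMatrix")
nms <- colnames(At)
unms <- unique(nms)
sp <- sparseMatrix(i = At@i + 1, j = match(nms, unms)[At@j + 1], x = At@x,
                   dims = c(nrow(At), length(unms)),
                   dimnames = list(rownames(At), unms))

# Compare values only, ignoring dimnames
all.equal(unname(as.matrix(sp)), unname(ref))
```

This is only practical for small inputs, since as.matrix() allocates the full dense matrix.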

But it is faster, by roughly a third at the median:

    require(microbenchmark)
    microbenchmark(OPmatrixCombine1 = OP1(B),
                   OPreindexSparse = OP2(B),
                   times = 30)
    ## Unit: relative
    ##              expr      min       lq     mean   median       uq      max neval
    ##  OPmatrixCombine1 1.756769 1.307651 1.360487 1.341814 1.346864 1.460626    30
    ##   OPreindexSparse 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30
