r - How to combine columns with identical names in a large sparse Matrix -


I have a sparse dgTMatrix from the Matrix package, which has picked up some duplicate colnames. I now want to combine these by summing the columns with the same names, forming a reduced matrix.

I found this post, which I adapted for sparse matrix operations. But it's still slow on large objects. I am wondering if someone has a better solution that operates directly on the indexed elements of the sparse matrix and would be faster. For instance, a@j indexes (from zero) the labels in a@Dimnames[[2]], which could be compacted and used to reindex a@j. (Note: this is why I used a triplet sparse matrix form rather than the Matrix default of column-sparse matrices, since figuring out the p value makes my head hurt every time.)
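For readers unsure how the p slot relates to j, here is a minimal sketch of the relationship. The matrix m and the variable j_from_p are made up for illustration and are not part of the original question.

```r
library(Matrix)

# Small column-compressed matrix (the Matrix default): 2 rows, 3 columns,
# with stored values at (1,1), (2,1) and (1,3)
m <- sparseMatrix(i = c(1, 2, 1), j = c(1, 1, 3), x = c(10, 20, 30))

# p has ncol + 1 entries; diff(p) is the number of stored values per column
m@p
# 0 2 2 3

# Expand p back to one zero-based column index per stored value,
# which is the form the j slot takes in a triplet matrix
j_from_p <- rep.int(seq_len(ncol(m)), diff(m@p)) - 1L

# The triplet representation of the same matrix, for comparison
mt <- as(m, "TsparseMatrix")
mt@j
```

In the triplet form, each stored value carries its own column index, which is what makes reindexing columns by name straightforward.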

require(Matrix)

# set up a (triplet) sparseMatrix
a <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
a
## 2 x 6 sparse Matrix of class "dgTMatrix"
##    a b c a b c
## r1 1 . 3 . 2 .
## r2 . 2 . 1 . 3

str(a)
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:6] 0 1 0 1 0 1
##   ..@ j       : int [1:6] 0 1 2 3 4 5
##   ..@ Dim     : int [1:2] 2 6
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:2] "r1" "r2"
##   .. ..$ : chr [1:6] "a" "b" "c" "a" ...
##   ..@ x       : num [1:6] 1 2 3 1 2 3
##   ..@ factors : list()

# my matrix-based attempt: multiply by an indicator matrix that maps
# each original column to the column of its unique name
op1 <- function(x) {
    nms <- colnames(x)
    if (any(duplicated(nms)))
        x <- x %*% Matrix(sapply(unique(nms), "==", nms))
    x
}

op1(a)
## 2 x 3 sparse Matrix of class "dgCMatrix"
##    a b c
## r1 1 2 3
## r2 1 2 3

It worked fine, but seems quite slow on the huge sparse objects on which I intend to use it. Here's a larger item:

# something bigger, for testing
set.seed(10)
nr <- 10000     # rows
nc <- 26*100    # columns - 100 repetitions of a-z
nonzeron <- round(nr * nc / 3)  # two-thirds sparse
b <- sparseMatrix(i = sample(1:nr, size = nonzeron, replace = TRUE),
                  j = sample(1:nc, size = nonzeron, replace = TRUE),
                  x = round(runif(nonzeron)*5+1),
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:nr), rep(letters, nc/26)))
print(b[1:5, 1:10], col.names = TRUE)
## 5 x 10 sparse Matrix of class "dgTMatrix"
##     a b c  d e f g h i  j
## r1  . . 5  . . 2 . . .  .
## r2  . . .  . . . . . .  4
## r4  . . .  . . . . 3 3  .
## r3  2 2 .  3 . . . 3 .  .
## r5  3 . .  1 . . . . .  5

require(microbenchmark)
microbenchmark(opmatrixcombine1 = op1(b), times = 30)
## Unit: milliseconds
##             expr      min       lq     mean   median       uq      max neval
## opmatrixcombine1 578.9222 619.3912 665.6301 631.4219 646.2716 1013.777    30
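The multiplier in op1 is an indicator matrix with one row per original column and one column per unique name. sapply() builds it as a dense logical matrix before Matrix() converts it. Below is a minimal sketch of building the same indicator directly in sparse form. The helper name combine_by_indicator and the toy matrix m are hypothetical, and whether this is any faster on large inputs was not measured here.

```r
library(Matrix)

# Hypothetical helper: same multiplication idea as op1, but the indicator
# matrix is created sparse from the start instead of via a dense sapply()
combine_by_indicator <- function(x) {
    nms <- colnames(x)
    unique_nms <- unique(nms)
    # Row k of the indicator has a single 1, in the column that
    # corresponds to the unique name of original column k
    indicator <- sparseMatrix(i = seq_along(nms),
                              j = match(nms, unique_nms),
                              x = 1,
                              dims = c(length(nms), length(unique_nms)),
                              dimnames = list(NULL, unique_nms))
    x %*% indicator
}

# Toy input with duplicated column names (default column-compressed form)
m <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
res <- combine_by_indicator(m)
as.matrix(res)
```

This still allocates a second matrix for the multiplication, so it does not address the wish to avoid additional large objects.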

Is there a better way, where better means faster and, if possible, not requiring the construction of additional large objects?

Here's an attempt using the index reindexing I had in mind, which I figured out with a friend's help (Patrick, is that you?). It reindexes the j values, and uses the handy feature of sparseMatrix() that adds together the x values of elements whose index positions are the same.

op2 <- function(x) {
    nms <- colnames(x)
    uniquenms <- unique(nms)
    # build the sparseMatrix again: x's with the same index values are
    # automatically added together, keeping in mind that indexes are
    # stored from 0 but built from 1
    sparseMatrix(i = x@i + 1,
                 j = match(nms, uniquenms)[x@j + 1],
                 x = x@x,
                 dimnames = list(rownames(x), uniquenms),
                 giveCsparse = FALSE)
}
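The two ingredients op2 relies on can be seen in isolation in the short sketch below. The vectors and the matrix s are invented for illustration only.

```r
library(Matrix)

nms <- c("a", "b", "c", "a", "b", "c")

# Position of each column's name among the unique names:
# this becomes the new (one-based) column index
new_col <- match(nms, unique(nms))
new_col
# 1 2 3 1 2 3

# sparseMatrix() adds the x values of repeated (i, j) pairs:
# the first two entries both land on row 1, column 1
s <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(5, 7, 1))
as.matrix(s)
# row 1: 12 0
# row 2:  0 1
```

Indexing the match() result by x@j + 1, as op2 does, applies this name-to-position lookup to every stored value at once.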

The results are the same:

op2(a)
## 2 x 3 sparse Matrix of class "dgCMatrix"
##    a b c
## r1 1 2 3
## r2 1 2 3

all.equal(as(op1(b), "dgTMatrix"), op2(b))
## [1] TRUE

But it is faster:

require(microbenchmark)
microbenchmark(opmatrixcombine1 = op1(b),
               opreindexsparse = op2(b),
               times = 30)
## Unit: relative
##              expr      min       lq     mean   median       uq      max neval
##  opmatrixcombine1 1.756769 1.307651 1.360487 1.341814 1.346864 1.460626    30
##   opreindexsparse 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30
