r - How to combine columns with identical names in a large sparse Matrix
I have a sparse dgTMatrix from the Matrix package, which has picked up some duplicate colnames. I want to combine these by summing the columns with the same names, forming a reduced matrix.

I found this post, which I adapted for sparse matrix operations. But it is still slow on large objects. I'm wondering if someone has a better solution that operates directly on the indexed elements of the sparse matrix and would be faster. For instance, A@j indexes (from zero) the labels in A@Dimnames[[2]], which could be compacted and used to reindex A@j. (Note: this is why I used the triplet sparse matrix form rather than the Matrix default of column-sparse matrices, since figuring out the p value makes my head hurt every time.)
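For readers who do hold a column-compressed matrix, the relationship between the p slot and the per-entry column index can be sketched as follows. This is only an illustration, not part of the original post; the matrix M and the name j_from_p are made up for the example.

```r
library(Matrix)

# Small column-compressed (dgCMatrix) example; the values are illustrative only.
M <- sparseMatrix(i = c(1, 2, 1, 2), j = c(1, 1, 3, 4), x = c(5, 6, 7, 8))

# p has ncol(M) + 1 entries; diff(p) is the number of stored entries per column.
M@p

# Expanding p gives one zero-based column index per stored entry, which is
# the same information the triplet form keeps directly in its j slot.
j_from_p <- rep(seq_len(ncol(M)) - 1L, times = diff(M@p))

# Cross-check against the triplet representation of the same matrix.
MT <- as(M, "TsparseMatrix")
identical(sort(MT@j), j_from_p)
```

Coercing with as(x, "TsparseMatrix") is usually simpler than expanding p by hand, so the expansion above is mainly useful for seeing what p encodes.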
    require(Matrix)

    # set up a (triplet) sparseMatrix
    # (recent Matrix releases deprecate giveCsparse in favour of repr = "T")
    A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                      giveCsparse = FALSE,
                      dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
    A
    ## 2 x 6 sparse Matrix of class "dgTMatrix"
    ##    a b c a b c
    ## r1 1 . 3 . 2 .
    ## r2 . 2 . 1 . 3

    str(A)
    ## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
    ##   ..@ i       : int [1:6] 0 1 0 1 0 1
    ##   ..@ j       : int [1:6] 0 1 2 3 4 5
    ##   ..@ Dim     : int [1:2] 2 6
    ##   ..@ Dimnames:List of 2
    ##   .. ..$ : chr [1:2] "r1" "r2"
    ##   .. ..$ : chr [1:6] "a" "b" "c" "a" ...
    ##   ..@ x       : num [1:6] 1 2 3 1 2 3
    ##   ..@ factors : list()

    # my matrix-based attempt: multiply by a 0/1 indicator matrix that maps
    # each original column to its unique name
    OP1 <- function(x) {
        nms <- colnames(x)
        if (any(duplicated(nms)))
            x <- x %*% Matrix(sapply(unique(nms), "==", nms))
        x
    }
    OP1(A)
    ## 2 x 3 sparse Matrix of class "dgCMatrix"
    ##    a b c
    ## r1 1 2 3
    ## r2 1 2 3
It worked fine, but seems quite slow on the huge sparse objects on which I intend to use it. Here's a larger item:
    # now something bigger, for testing
    set.seed(10)
    nr <- 10000                     # rows
    nc <- 26*100                    # columns - 100 repetitions of a-z
    nonZeroN <- round(nr * nc / 3)  # two-thirds sparse
    B <- sparseMatrix(i = sample(1:nr, size = nonZeroN, replace = TRUE),
                      j = sample(1:nc, size = nonZeroN, replace = TRUE),
                      x = round(runif(nonZeroN)*5+1),
                      giveCsparse = FALSE,
                      dimnames = list(paste0("r", 1:nr), rep(letters, nc/26)))
    print(B[1:5, 1:10], col.names = TRUE)
    ## 5 x 10 sparse Matrix of class "dgTMatrix"
    ##    a b c d e f g h i j
    ## r1 . . 5 . . 2 . . . .
    ## r2 . . . . . . . . . 4
    ## r4 . . . . . . . 3 3 .
    ## r3 2 2 . 3 . . . 3 . .
    ## r5 3 . . 1 . . . . . 5

    require(microbenchmark)
    microbenchmark(OPmatrixCombine1 = OP1(B), times = 30)
    ## Unit: milliseconds
    ##              expr      min       lq     mean   median       uq      max neval
    ##  OPmatrixCombine1 578.9222 619.3912 665.6301 631.4219 646.2716 1013.777    30
Is there a better way, where better means faster and, if possible, not requiring the construction of additional large objects?
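One way to keep the matrix-product idea while avoiding a dense indicator is to build the 0/1 mapping matrix as a sparse matrix with exactly one entry per original column. The sketch below is not from the original post: the function name combine_cols_sparse_indicator is hypothetical, and no timing claim is made for it.

```r
library(Matrix)

# Hypothetical helper (the name is illustrative): same product as the
# matrix-based attempt, but the indicator is sparse from the start.
combine_cols_sparse_indicator <- function(x) {
  nms <- colnames(x)
  unique_nms <- unique(nms)
  if (length(unique_nms) == length(nms)) return(x)  # nothing to combine
  # Row k of the indicator has a single 1 in the column of its unique name.
  indicator <- sparseMatrix(i = seq_along(nms),
                            j = match(nms, unique_nms),
                            x = 1,
                            dims = c(length(nms), length(unique_nms)),
                            dimnames = list(NULL, unique_nms))
  x %*% indicator
}

# Same small example as in the question (default column-compressed form).
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
res <- combine_cols_sparse_indicator(A)
res
```

The indicator holds only as many stored values as there are original columns, so it stays small even when the number of columns is large.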
Here's an attempt using the index reindexing I had in mind, which I figured out with a friend's help (Patrick, is that you?). It reindexes the j values, and uses the handy feature of sparseMatrix() that adds together the x values of elements whose index positions are the same.
    OP2 <- function(x) {
        nms <- colnames(x)
        uniquenms <- unique(nms)
        # build the sparseMatrix again: x's with the same index values are
        # automatically added together, keeping in mind that indexes are
        # stored from 0 but built from 1
        sparseMatrix(i = x@i + 1,
                     j = match(nms, uniquenms)[x@j + 1],
                     x = x@x,
                     dimnames = list(rownames(x), uniquenms),
                     giveCsparse = FALSE)
    }
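The two ingredients of this approach can be seen in isolation in the small sketch below, which is an illustration rather than part of the original answer. The names S, dense and new_j are made up for the example, and it uses the default column-compressed output of sparseMatrix().

```r
library(Matrix)

# Ingredient 1: sparseMatrix() sums the x values of duplicated (i, j) pairs.
# Position (1, 1) is given twice below, with values 10 and 5.
S <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(10, 5, 7),
                  dims = c(2, 2))
dense <- as.matrix(S)
dense[1, 1]  # 15: the two duplicated entries were added together

# Ingredient 2: match() maps every column name to the position of its
# first occurrence among the unique names, giving the new column index.
nms <- c("a", "b", "c", "a", "b", "c")
new_j <- match(nms, unique(nms))
new_j  # 1 2 3 1 2 3
```

Reindexing the column of every stored entry through new_j creates duplicated (i, j) pairs wherever two same-named columns had a value in the same row, and the summing behaviour then merges them.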
Results are the same:

    OP2(A)
    ## 2 x 3 sparse Matrix of class "dgCMatrix"
    ##    a b c
    ## r1 1 2 3
    ## r2 1 2 3

    all.equal(as(OP1(B), "dgTMatrix"), OP2(B))
    ## [1] TRUE

But it is faster:

    require(microbenchmark)
    microbenchmark(OPmatrixCombine1 = OP1(B),
                   OPreindexSparse = OP2(B),
                   times = 30)
    ## Unit: relative
    ##              expr      min       lq     mean   median       uq      max neval
    ##  OPmatrixCombine1 1.756769 1.307651 1.360487 1.341814 1.346864 1.460626    30
    ##   OPreindexSparse 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30