Calculating intersection of lots of sets in R

- March 15, 2015

I've got a long list of authors and words, like:

    author1,word1
    author1,word2
    author1,word3
    author2,word2
    author3,word1

The actual list has hundreds of authors and thousands of words. It exists as a CSV file that I have read into a data frame and de-duplicated, like:

    > typeof(x)
    [1] "list"
    > colnames(x)
    [1] "author"   "word"

The last bit of dput(head(x)) looks like:

    ), class = "factor")), .Names = c("author", "word"), row.names = c(NA,
    6L), class = "data.frame")

What I'm trying to calculate is how similar the word lists are between authors, based on the intersection of two authors' word lists as a percentage of one author's total vocabulary. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)
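To make the measure concrete, here is a minimal sketch in base R using the words of author1 and author2 from the sample rows above. The variable names (`a`, `b`, `overlap_a`, `overlap_b`) are made up for illustration. The point it shows is that the measure is generally not symmetric, because each direction divides by a different author's vocabulary size.

```r
# Vocabularies of author1 and author2, taken from the sample rows above
a <- c("word1", "word2", "word3")  # author1
b <- c("word2")                    # author2

# Words that both authors use
shared <- intersect(a, b)

# Shared words as a fraction of each author's own vocabulary
overlap_a <- length(shared) / length(unique(a))  # share of author1's words also used by author2
overlap_b <- length(shared) / length(unique(b))  # share of author2's words also used by author1
```

With these two vectors, author2's single word is fully covered by author1, while only a third of author1's vocabulary is covered by author2.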

In Python or Perl I would group the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how to do it in R. I have a feeling that "use apply" is going to be the answer. If it is, can you please explain it in small words for newbies like me?

Here's one way using data.table:

    library(data.table)

    ## 1: generate test data
    set.seed(1L);
    wordlist <- paste0('word',1:5);
    authorlist <- paste0('author',1:5);
    rs <- sample(1:5,length(authorlist),replace=TRUE);
    aw <- data.table(
        author=factor(rep(authorlist,rs)),
        word=factor(do.call(c,lapply(rs,function(r) sort(sample(wordlist,r))))),
        key='author'
    );
    aw;
    ##      author  word
    ##  1: author1 word4
    ##  2: author1 word5
    ##  3: author2 word3
    ##  4: author2 word4
    ##  5: author3 word1
    ##  6: author3 word4
    ##  7: author3 word5
    ##  8: author4 word1
    ##  9: author4 word2
    ## 10: author4 word3
    ## 11: author4 word4
    ## 12: author4 word5
    ## 13: author5 word2
    ## 14: author5 word5

    ## 2: initialize intersection table with unique combinations of authors
    ai <- aw[,setkey(setNames(nm=c('a1','a2'),as.data.table(t(combn(unique(author),2L)))))];

    ## 3: compute word intersection size for each combination of authors
    ai[,int:=length(intersect(aw[a1,word],aw[a2,word])),key(ai)];
    ##          a1      a2 int
    ##  1: author1 author2   1
    ##  2: author1 author3   2
    ##  3: author1 author4   2
    ##  4: author1 author5   1
    ##  5: author2 author3   1
    ##  6: author2 author4   2
    ##  7: author2 author5   0
    ##  8: author3 author4   3
    ##  9: author3 author5   1
    ## 10: author4 author5   2

    ## 4: compute percentages
    ai[,`:=`(p1=int/aw[a1,.N],p2=int/aw[a2,.N]),key(ai)];
    ##          a1      a2 int        p1        p2
    ##  1: author1 author2   1 0.5000000 0.5000000
    ##  2: author1 author3   2 1.0000000 0.6666667
    ##  3: author1 author4   2 1.0000000 0.4000000
    ##  4: author1 author5   1 0.5000000 0.5000000
    ##  5: author2 author3   1 0.5000000 0.3333333
    ##  6: author2 author4   2 1.0000000 0.4000000
    ##  7: author2 author5   0 0.0000000 0.0000000
    ##  8: author3 author4   3 1.0000000 0.6000000
    ##  9: author3 author5   1 0.3333333 0.5000000
    ## 10: author4 author5   2 0.4000000 1.0000000
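Since the question asks about the "use apply" route, here is a rough base R sketch of the same idea without data.table. It is an alternative, not part of the approach above. It assumes a data frame `x` with `author` and `word` columns as described in the question and rebuilds the question's five sample rows so that it runs on its own. The names `words_by_author`, `shared` and `pct` are made up for illustration.

```r
# Sample rows from the question, rebuilt here so the sketch is self-contained
x <- data.frame(
  author = c("author1", "author1", "author1", "author2", "author3"),
  word   = c("word1", "word2", "word3", "word2", "word1"),
  stringsAsFactors = FALSE
)

# Group the words by author: a named list with one character vector per author
words_by_author <- lapply(split(x$word, x$author), unique)
authors <- names(words_by_author)

# shared[i, j] = number of words that authors i and j have in common.
# The two nested sapply() calls play the role of the nested loops.
shared <- sapply(authors, function(a2)
  sapply(authors, function(a1)
    length(intersect(words_by_author[[a1]], words_by_author[[a2]]))))

# pct[i, j] = shared words as a fraction of author i's vocabulary
# (sweep() divides each row i by that author's vocabulary size)
pct <- sweep(shared, 1, lengths(words_by_author), "/")
```

Here `pct["author1", "author2"]` is the share of author1's vocabulary that author2 also uses, and `pct["author2", "author1"]` is the reverse direction. This builds the full author-by-author matrix, including the diagonal, so with hundreds of authors it does roughly twice the work of the `combn`-based version.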
