Calculating intersection of lots of sets in R
I've got a long list of authors and words, like:
    author1,word1
    author1,word2
    author1,word3
    author2,word2
    author3,word1
The actual list has hundreds of authors and thousands of words. It exists as a CSV file that I have read into a data frame and de-duplicated:
    > typeof(x)
    [1] "list"
    > colnames(x)
    [1] "author" "word"
The last bit of dput(head(x)) looks like:

    ), class = "factor")), .Names = c("author", "word"), row.names = c(NA, 6L), class = "data.frame")
What I'm trying to calculate is how similar the word lists are between authors, based on the intersection of two authors' word lists as a percentage of one author's total vocabulary. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)
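To make the measure concrete, here is a minimal sketch for a single pair of authors. The two word vectors are made up for illustration and are not taken from the real data.

```r
# Hypothetical word lists for two authors (made up for illustration)
words_a <- c("word1", "word2", "word3")
words_b <- c("word2", "word3", "word4", "word5")

# Words the two authors have in common
shared <- intersect(words_a, words_b)

# Overlap as a fraction of each author's own vocabulary
overlap_a <- length(shared) / length(unique(words_a))  # share of a's words that b also uses
overlap_b <- length(shared) / length(unique(words_b))  # share of b's words that a also uses
```

Note that the two fractions generally differ, because each is divided by a different author's vocabulary size.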
In Python or Perl I would group the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how to do it in R. I have a feeling that "use apply" is going to be the answer; if so, can you please explain it in small words for newbies like me?
Here's one way using data.table:
    ## 0: load the package that provides data.table(), setkey() and :=
    library(data.table);

    ## 1: generate test data
    set.seed(1L);
    wordlist <- paste0('word',1:5);
    authorlist <- paste0('author',1:5);
    rs <- sample(1:5,length(authorlist),replace=TRUE);
    aw <- data.table(
        author=factor(rep(authorlist,rs)),
        word=factor(do.call(c,lapply(rs,function(r) sort(sample(wordlist,r))))),
        key='author'
    );
    aw;
    ##      author  word
    ##  1: author1 word4
    ##  2: author1 word5
    ##  3: author2 word3
    ##  4: author2 word4
    ##  5: author3 word1
    ##  6: author3 word4
    ##  7: author3 word5
    ##  8: author4 word1
    ##  9: author4 word2
    ## 10: author4 word3
    ## 11: author4 word4
    ## 12: author4 word5
    ## 13: author5 word2
    ## 14: author5 word5
    ## 2: initialize intersection table with unique combinations of authors
    ai <- aw[,setkey(setNames(nm=c('a1','a2'),as.data.table(t(combn(unique(author),2L)))))];
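For readers new to combn(), this small sketch shows what step 2 relies on: every unordered pair appears exactly once. The author names are hypothetical and only serve the illustration.

```r
# Hypothetical author names, for illustration only
authors <- c("author1", "author2", "author3")

# combn() returns one column per unordered pair; t() turns each pair into a row
pairs <- t(combn(authors, 2))
print(pairs)

# Number of pairs grows as n * (n - 1) / 2
n_pairs <- choose(length(authors), 2)
```

With hundreds of authors the number of pairs grows quadratically, so the pair table can get large, but each pair is only compared once.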
    ## 3: compute word intersection size for each combination of authors
    ai[,int:=length(intersect(aw[a1,word],aw[a2,word])),key(ai)];
    ai;
    ##          a1      a2 int
    ##  1: author1 author2   1
    ##  2: author1 author3   2
    ##  3: author1 author4   2
    ##  4: author1 author5   1
    ##  5: author2 author3   1
    ##  6: author2 author4   2
    ##  7: author2 author5   0
    ##  8: author3 author4   3
    ##  9: author3 author5   1
    ## 10: author4 author5   2
    ## 4: compute percentages
    ## p1 = share of a1's vocabulary that a2 also uses; p2 = the reverse
    ai[,`:=`(p1=int/aw[a1,.N],p2=int/aw[a2,.N]),key(ai)];
    ai;
    ##          a1      a2 int        p1        p2
    ##  1: author1 author2   1 0.5000000 0.5000000
    ##  2: author1 author3   2 1.0000000 0.6666667
    ##  3: author1 author4   2 1.0000000 0.4000000
    ##  4: author1 author5   1 0.5000000 0.5000000
    ##  5: author2 author3   1 0.5000000 0.3333333
    ##  6: author2 author4   2 1.0000000 0.4000000
    ##  7: author2 author5   0 0.0000000 0.0000000
    ##  8: author3 author4   3 1.0000000 0.6000000
    ##  9: author3 author5   1 0.3333333 0.5000000
    ## 10: author4 author5   2 0.4000000 1.0000000
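Since the question asks about the "use apply" route, here is a rough base R sketch of the same idea. It is an alternative, not part of the data.table approach above. It rebuilds the five sample rows from the question as a data frame named x; the names words_by_author and overlap are made up for this example.

```r
# The five sample rows from the question, as a plain data frame
x <- data.frame(
  author = c("author1", "author1", "author1", "author2", "author3"),
  word   = c("word1", "word2", "word3", "word2", "word1"),
  stringsAsFactors = FALSE
)

# Group the words by author: a named list with one character vector per author
words_by_author <- split(x$word, x$author)

# Nested sapply() plays the role of the nested loops:
# overlap[i, j] = share of author i's vocabulary that author j also uses
overlap <- sapply(words_by_author, function(b) {
  sapply(words_by_author, function(a) length(intersect(a, b)) / length(unique(a)))
})
print(round(overlap, 2))
```

The result is a square matrix with one row and one column per author. It is not symmetric, because each row is divided by that row author's own vocabulary size. Unlike the pair table above, it computes every ordered pair (including each author against themselves), which is simpler to read but does roughly twice the work.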