python - How can I divide the frequency of a bigram pair by the frequency of a unigram word?
Below is my code.
from __future__ import division
import nltk
import re

f = open('c:/python27/brown_a1_half.txt', 'rU')
w = open('c:/python27/brown_a1_half_out.txt', 'w')

# read the whole file using read()
filecontents = f.read()

from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(filecontents)

for sentence in sent_tokenize_list:
    # add sentence boundary markers before splitting into tokens
    sentence = "start " + sentence + " end"
    tokens = sentence.split()
    bigrams = (tuple(nltk.bigrams(tokens)))
    bigrams_frequency = nltk.FreqDist(bigrams)
    for k, v in bigrams_frequency.items():
        print k, v
It then prints the result as "(bigram), frequency". Here, for each bigram pair, I want to divide the bigram frequency by the frequency of the unigram word that appears first in the pair. (For example, if there is a bigram ('red', 'apple') with frequency 3, I want to divide that by the frequency of 'red'.) This is for obtaining the MLE probability, that is, "MLE prob = count of (w1, w2) / count of (w1)". Please help me.
You can add the following in the loop (after print k, v):
        number_unigrams = tokens.count(k[0])
        prob = v / number_unigrams
That should give you the MLE prob for each bigram.
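Note that the loop in the question builds its counts one sentence at a time, so the probabilities above are per sentence. If counts pooled over the whole file are wanted instead, a minimal sketch could look like the following. It uses collections.Counter rather than nltk.FreqDist so that it runs without NLTK. The function name bigram_mle and the toy sentences are made up for illustration and are not from the question.

```python
from collections import Counter


def bigram_mle(sentences):
    """Return {(w1, w2): count(w1, w2) / count(w1)}, pooled over all sentences.

    `sentences` is a list of token lists (hypothetical input format).
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        # zip(tokens, tokens[1:]) yields adjacent pairs, like nltk.bigrams(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    # float() keeps the division exact on Python 2 even without
    # "from __future__ import division"
    return {bg: n / float(unigram_counts[bg[0]])
            for bg, n in bigram_counts.items()}


# Toy data (an assumption for illustration), already tokenized and
# wrapped in the same start/end markers the question uses.
toy_sentences = [
    "start red apple end".split(),
    "start red car end".split(),
    "start green apple end".split(),
]

probs = bigram_mle(toy_sentences)
for bigram, p in sorted(probs.items()):
    print("%s %.3f" % (bigram, p))
```

Here 'red' occurs twice and ('red', 'apple') once, so that bigram gets 1 / 2 = 0.5. Counting in a single pass also avoids calling tokens.count() once per bigram.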