scikit-learn - CountVectorizer gives empty vocabulary error if document is a cardinal number
I have encountered a problem using sklearn's CountVectorizer when a document consists of the single word 'one'. I have worked out that the error occurs when a document consists only of words with the POS tag CD (cardinal number). The following documents result in an empty vocabulary error: ['one', 'two'] and ['hundred']
from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])
I get the error: ValueError: empty vocabulary; perhaps the documents only contain stop words
The following does not result in an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']
Interestingly though, in this case only 'people' is added to the vocabulary; 'one' and 'two' are not added:
cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}
As an example of a single-word document, ['hello'] works fine because it is not a cardinal number:
cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}
Since the words 'one' and 'two' are not stop words, I would like them to be processed by CountVectorizer. How can I work with these words?
Addition: I got the same error for the word 'system'. Why does it error out on this word?
cv_array = cv.fit_transform(['system'])
ValueError: empty vocabulary; perhaps the documents only contain stop words
The reason you are getting the empty vocabulary error is that these words belong to the list of stop words that sklearn uses. You can check the list here or test with:
>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> 'one' in ENGLISH_STOP_WORDS
True
>>> 'two' in ENGLISH_STOP_WORDS
True
>>> 'system' in ENGLISH_STOP_WORDS
True
If you want to process these words, just init CountVectorizer like this:
cv = CountVectorizer(stop_words=None, ...
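As a minimal sketch of the two options (not necessarily how the original poster ended up solving it): passing stop_words=None keeps every token, while passing a custom list keeps most of the built-in English filtering but lets chosen words through. The names keep_words, custom_stop_words, cv_all and cv_custom are made up for this illustration, and the sample documents are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Option 1: disable stop-word filtering entirely.
cv_all = CountVectorizer(stop_words=None, token_pattern=r"[\w']+")
cv_all.fit(["one", "two", "system"])
print(sorted(cv_all.vocabulary_))  # ['one', 'system', 'two']

# Option 2: keep the built-in English list, minus the words we want to retain.
# keep_words is a hypothetical choice for this example.
keep_words = {"one", "two", "system"}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep_words)  # stop_words expects a list

cv_custom = CountVectorizer(stop_words=custom_stop_words, token_pattern=r"[\w']+")
cv_custom.fit(["one", "two", "the people"])
print(sorted(cv_custom.vocabulary_))  # ['one', 'people', 'two']
```

The second option is useful when you still want words such as 'the' removed but need number words or 'system' to count as features.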