scikit-learn - CountVectorizer gives empty vocabulary error if document is a cardinal number

- August 15, 2014

I have encountered a problem when using sklearn's CountVectorizer on a document that consists of the single word 'one'. I have worked out that the error occurs when the document consists only of words with the POS tag CD (cardinal number). The following documents result in an empty vocabulary error: ['one', 'two'] and ['hundred'].

    from sklearn.feature_extraction.text import CountVectorizer

    ngram_code = 1
    cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                         token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
    cv_array = cv.fit_transform(['one', 'two'])

I get the error:

    ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does not result in an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']

Interestingly though, in this case only 'people' is added to the vocabulary; 'one' and 'two' are not added:

    cv_array = cv.fit_transform(['one', 'two', 'people'])
    cv.vocabulary_
    Out[143]: {'people': 0}

As an example of a single-word document, ['hello'] works fine because it is not a cardinal number:

    cv_array = cv.fit_transform(['hello'])
    cv.vocabulary_
    Out[147]: {'hello': 0}

Since the words 'one' and 'two' are not stop words, I would like them to be processed by CountVectorizer. How can I work with these words?

Addition: I also got the same error for the word 'system'. Why does it error out on this word?

    cv_array = cv.fit_transform(['system'])

    ValueError: empty vocabulary; perhaps the documents only contain stop words

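The reason you are getting the empty vocabulary error is that these words belong to the list of stop words that sklearn uses, so they are removed before the vocabulary is built; it has nothing to do with their being cardinal numbers. You can check the list here or test with: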

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> 'one' in ENGLISH_STOP_WORDS
    True
    >>> 'two' in ENGLISH_STOP_WORDS
    True
    >>> 'system' in ENGLISH_STOP_WORDS
    True
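The same membership check can be applied to a whole batch of single-word documents at once. The sketch below is only an illustration; the helper name `stop_word_docs` is hypothetical and not part of sklearn, and it assumes each document is a single word, as in the question.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def stop_word_docs(docs):
    """Return the single-word documents found in sklearn's English stop list.

    Hypothetical helper for illustration only; it lowercases each document
    before the lookup, mirroring lowercase=True in the vectorizer.
    """
    return [doc for doc in docs if doc.lower() in ENGLISH_STOP_WORDS]


flagged = stop_word_docs(['one', 'two', 'people', 'hello', 'system'])
print(flagged)
```

Any word this reports would be dropped by a vectorizer created with stop_words='english'.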

If you want to process these words, just initialize CountVectorizer like this:

    cv = CountVectorizer(stop_words=None, ...
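As a rough, self-contained sketch of that suggestion, the snippet below fits the question's vectorizer settings with stop_words=None. It also shows one possible alternative that the answer does not mention: passing a custom list built from ENGLISH_STOP_WORDS with a few words taken out. The variable names (`docs`, `keep`, `custom_stop_words`) and the extra 'the' document are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative documents; 'the' is added to show a word that stays filtered
# when a stop list is in use.
docs = ['one', 'two', 'people', 'system', 'the']

# Option 1: turn stop-word filtering off entirely.
cv_all = CountVectorizer(stop_words=None, analyzer='word', lowercase=True,
                         token_pattern=r"[\w']+", ngram_range=(1, 1))
cv_all.fit(docs)
print(sorted(cv_all.vocabulary_))

# Option 2 (an assumption, not from the answer): keep the built-in list
# but remove the words that should survive.
keep = {'one', 'two', 'system'}
custom_stop_words = list(ENGLISH_STOP_WORDS - keep)
cv_custom = CountVectorizer(stop_words=custom_stop_words, analyzer='word',
                            lowercase=True, token_pattern=r"[\w']+",
                            ngram_range=(1, 1))
cv_custom.fit(docs)
print(sorted(cv_custom.vocabulary_))
```

With the first option every token is kept, including very common words such as 'the'. The second option still filters those while letting 'one', 'two' and 'system' through.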
