scikit learn - CountVectorizer gives empty vocabulary error if document is a cardinal number


I have encountered a problem using sklearn's CountVectorizer when a document consists of the single word 'one'. I have worked out that the error occurs when a document consists only of words with the POS tag CD (cardinal number). The following documents result in an empty vocabulary error: ['one', 'two'] and ['hundred'].

    from sklearn.feature_extraction.text import CountVectorizer

    ngram_code = 1
    cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                         token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
    cv_array = cv.fit_transform(['one', 'two'])

I get the error: ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does not result in an error because (I think) the cardinal number words are mixed with other words: ['one', 'two', 'people']

Interestingly though, in this case only 'people' is added to the vocabulary; 'one' and 'two' are not added:

    cv_array = cv.fit_transform(['one', 'two', 'people'])
    cv.vocabulary_
    Out[143]: {'people': 0}

As an example of a single-word document, ['hello'] works fine because it is not a cardinal number:

    cv_array = cv.fit_transform(['hello'])
    cv.vocabulary_
    Out[147]: {'hello': 0}

Since the words 'one' and 'two' are not stop words, I would like them to be processed by CountVectorizer. How can I make it work with these words?

Addition: I got the same error for the word 'system'. Why does it error out on this word?

    cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

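The reason you are getting the empty vocabulary error is that those words belong to the list of English stop words that sklearn uses, so with stop_words='english' they are removed before the vocabulary is built. You can check the list in sklearn's source or test with: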

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> 'one' in ENGLISH_STOP_WORDS
    True
    >>> 'two' in ENGLISH_STOP_WORDS
    True
    >>> 'system' in ENGLISH_STOP_WORDS
    True
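To see at a glance which words from a set of documents would be dropped, the same membership test can be run over a list. This is only a rough sketch: the `candidates` list is made up from the words mentioned in the question, and it checks whole lowercase words rather than running the vectorizer's own tokenizer.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical list of words to check, taken from the question's examples.
candidates = ["one", "two", "hundred", "system", "people", "hello"]

# Words that stop_words='english' would remove before building the vocabulary.
removed = [w for w in candidates if w in ENGLISH_STOP_WORDS]
# Words that would survive and end up in the vocabulary.
kept = [w for w in candidates if w not in ENGLISH_STOP_WORDS]

print("removed as stop words:", removed)
print("kept:", kept)
```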

If you want to process those words, just init your CountVectorizer like this:

    cv = CountVectorizer(stop_words=None, ...
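If you would rather keep most of the stop word filtering and only let a few words through, one possible alternative (not part of the answer above, just a sketch) is to pass your own list built from the built-in one. The `keep` set and the example documents below are assumptions chosen for illustration; the other parameters are copied from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Assumption: these are the words we want to keep even though they are
# in sklearn's built-in English stop word list.
keep = {"one", "two", "hundred", "system"}

# Custom stop word list: the built-in list minus the words to keep.
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep)

cv = CountVectorizer(
    stop_words=custom_stop_words,
    analyzer="word",
    lowercase=True,
    token_pattern=r"[\w']+",
    ngram_range=(1, 1),
)

# 'the' is still filtered out; 'one', 'two' and 'system' are now kept.
cv_array = cv.fit_transform(["one", "two", "system", "the"])
vocabulary = cv.vocabulary_
print(vocabulary)
```

With stop_words=None nothing is filtered at all, so the custom list is mainly useful when you still want common words such as 'the' removed.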
