Python - Beautiful Soup parser can't find links


I am trying to parse an HTML document and find the links in it using Beautiful Soup, and I found some weird behavior. The page is http://people.csail.mit.edu/gjtucker/ . Here's my code:

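    from bs4 import BeautifulSoup
    import requests

    url = 'http://people.csail.mit.edu/gjtucker/'
    user_agent = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}

    # Fetch the page with a browser-like User-Agent header
    t = requests.get(url, headers=user_agent).text

    # Parse with Python's built-in parser and print every link target
    soup = BeautifulSoup(t, 'html.parser')
    for link in soup.find_all('a'):
        print(link['href'])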

This prints only 2 links: http://www.amazon.jobs/team/speech-amazon and https://scholar.google.com/citations?user=-gjkphiaaaaj&hl=en, whereas there are many more links in the page.

Can you reproduce this? Is there a specific reason why this is happening with this URL? A few other URLs worked fine.

The HTML of the page is not well-formed, so you should use a more lenient parser, such as html5lib:

    soup = BeautifulSoup(t, 'html5lib')
    for link in soup.find_all('a'):
        print(link['href'])

This prints:

    http://www.amazon.jobs/team/speech-amazon
    https://scholar.google.com/citations?user=-gjkphiaaaaj&hl=en
    http://www.linkedin.com/pub/george-tucker/6/608/3ba
    ...
    http://www.hsph.harvard.edu/alkes-price/
    ...
    http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html
    http://www.biomedcentral.com/1471-2105/14/299
    pdfs/journal.pone.0029095.pdf
    pdfs/es201187u.pdf
    pdfs/sigtrans.pdf
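
To check whether the parser is the culprit on some other page, it can help to count the links each installed parser finds. The following is only a rough sketch: the helper names extract_links and count_links_per_parser are made up for illustration, and including lxml in the comparison is an assumption that goes beyond the answer above. It also resolves relative hrefs such as pdfs/sigtrans.pdf against the page URL, and skips anchors that have no href at all, which would otherwise raise a KeyError on link['href'].

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, FeatureNotFound


def extract_links(html, base_url, parser="html.parser"):
    """Return absolute URLs for every <a> tag that has an href (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    # href=True filters out anchors without an href attribute,
    # so a["href"] cannot raise KeyError.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def count_links_per_parser(html, base_url,
                           parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the number of links it finds."""
    counts = {}
    for name in parsers:
        try:
            counts[name] = len(extract_links(html, base_url, name))
        except FeatureNotFound:
            # The parser library is not installed; leave it out of the result.
            continue
    return counts


if __name__ == "__main__":
    # Small well-formed sample; real pages would come from requests.get(...).text
    sample = '<p><a href="pdfs/sigtrans.pdf">paper</a> <a name="top">no href</a></p>'
    base = "http://people.csail.mit.edu/gjtucker/"
    print(extract_links(sample, base))
    print(count_links_per_parser(sample, base))
```

If the counts differ widely between parsers for the same page, the markup is probably malformed and the more lenient parser is the safer choice.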
