Python - Beautiful Soup parser can't find links
I am trying to parse an HTML document to find its links using Beautiful Soup, and I found some weird behavior. The page is http://people.csail.mit.edu/gjtucker/ . Here's my code:
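from bs4 import BeautifulSoup
import requests

url = 'http://people.csail.mit.edu/gjtucker/'
user_agent = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}

# Fetch the page and parse it with Python's built-in HTML parser
t = requests.get(url, headers=user_agent).text
soup = BeautifulSoup(t, 'html.parser')

# Print the target of every anchor tag that was found
for link in soup.find_all('a'):
    print(link['href'])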
This prints only 2 links:
http://www.amazon.jobs/team/speech-amazon
https://scholar.google.com/citations?user=-gjkphiaaaaj&hl=en
whereas there are many more links in the page.
Can anyone reproduce this? Is there a specific reason this is happening with this URL? A few other URLs worked fine.
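The HTML of the page is not well-formed, so html.parser gives up on most of it. You should use a more lenient parser, such as html5lib (a separate package that has to be installed alongside bs4):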
soup = BeautifulSoup(t, 'html5lib')
for link in soup.find_all('a'):
    print(link['href'])
This prints:
http://www.amazon.jobs/team/speech-amazon
https://scholar.google.com/citations?user=-gjkphiaaaaj&hl=en
http://www.linkedin.com/pub/george-tucker/6/608/3ba
...
http://www.hsph.harvard.edu/alkes-price/
...
http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html
http://www.biomedcentral.com/1471-2105/14/299
pdfs/journal.pone.0029095.pdf
pdfs/es201187u.pdf
pdfs/sigtrans.pdf
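Two things in that output are worth handling in real code: the last few links are relative (pdfs/...), and link['href'] raises a KeyError for any anchor without an href attribute. Below is a minimal sketch of one way to deal with both, assuming bs4 is installed. The helper name extract_links and its parser preference order are illustrative choices, not part of the original answer. It tries the lenient parser first and falls back to whichever parser is available.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, FeatureNotFound


def extract_links(html, base_url, parsers=("html5lib", "lxml", "html.parser")):
    """Return absolute URLs for every <a href=...> in html.

    Hypothetical helper: parsers are tried in order, and any that are
    not installed are skipped.
    """
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
        # href=True skips anchors that have no href attribute, so the
        # lookup a["href"] below cannot raise KeyError.
        return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    raise RuntimeError("none of the requested parsers is installed")


if __name__ == "__main__":
    sample = '<p><a href="pdfs/sigtrans.pdf">pdf</a> <a name="top">no href</a></p>'
    for link in extract_links(sample, "http://people.csail.mit.edu/gjtucker/"):
        print(link)
```

With the page from the question you would pass the downloaded text t as html and the page URL as base_url. Keeping html.parser last in the list means the function still works when neither html5lib nor lxml is installed, although on malformed pages like this one it may then find fewer links.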