python - Searching unique web links
I wrote a program to extract web links from http://www.stevens.edu/. I am facing the following problems with the program.
1 - I want only the links starting with http or https.
2 - I am getting a parser warning from bs4 concerning the lack of a parser specification - solved
How can I fix these problems? I am not getting a proper direction to solve them.
My code -
import urllib2
from bs4 import BeautifulSoup as bs

url = raw_input('Please enter the URL whose unique web links you want to see - ')
print "\n"

# urllib2 opens URLs (mostly HTTP) in a complex world
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = bs(html)

# collect the href of every <a> tag on the page
tags = soup('a')
count = 0
web_link = []
for tag in tags:
    count = count + 1
    store = tag.get('href', None)
    web_link.append(store)

print "Total no. of extracted web links are", count, "\n"
print web_link
print "\n"

# a set keeps only the unique links
unique_list = set(web_link)
unique_list = list(unique_list)
print "No. of unique web links after using set method", len(unique_list), "\n"
For the second problem, you need to specify the parser when creating the bs object for the page.
soup = bs(html,"html.parser")
This should remove the warning.
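For the first problem, one possible approach is to check the scheme of each href and keep only http and https. The following is a minimal sketch, not a tested fix for the program above: `filter_http_links` is a hypothetical helper name, and the `sample` list is made-up data standing in for the `web_link` list built in the question. It uses only the standard library's `urlparse`, with a fallback import so it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in the question


def filter_http_links(hrefs):
    """Return the unique hrefs whose scheme is http or https, in first-seen order.

    Hypothetical helper, not part of the original program.
    """
    seen = set()
    result = []
    for href in hrefs:
        # tag.get('href', None) gives None for <a> tags without an href
        if not href:
            continue
        # relative links have an empty scheme; mailto:, javascript: etc. are skipped
        if urlparse(href).scheme.lower() not in ('http', 'https'):
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Made-up sample values standing in for the web_link list from the question.
sample = [
    'http://www.stevens.edu/',
    'https://www.stevens.edu/about',
    '/admissions',
    'mailto:info@example.com',
    None,
    'http://www.stevens.edu/',
]
filtered = filter_http_links(sample)
print(filtered)
```

In the original program this would be called as filter_http_links(web_link) after the loop. Unlike list(set(web_link)), it keeps the links in the order they appear on the page, and it also drops the None entries that tag.get('href', None) can produce.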