How can I filter images out of HTML in Scrapy with XPath?
I'm trying to scrape the HTML of various articles using Scrapy. These articles include images that I want to process separately.
If I have article HTML that looks like this:

    <div class="article">
      <p>this sentence.</p>
      <p>this sentence.</p>
      <img src="/path/to/image.jpg"/>
      <p>this sentence.</p>
      <p>this sentence.</p>
    </div>

How can I scrape only the non-image HTML, like this:

    <div class="article">
      <p>this sentence.</p>
      <p>this sentence.</p>
      <p>this sentence.</p>
      <p>this sentence.</p>
    </div>

I've tried:

    article = response.xpath("//div[@class='article'][not(img)]").extract()

...but this still includes the images.

XPath is for selection, not transformation or rearrangement.
You can select div elements that have no img children:

    //div[@class='article' and not(img)]

or that have no img descendants:

    //div[@class='article' and not(.//img)]

Or, you can select the children of such div elements that are p:

    //div[@class='article']/p

or that are not img:

    //div[@class='article']/*[not(self::img)]

But you cannot select the requested HTML,

    <div class="article">
      <p>this sentence.</p>
      <p>this sentence.</p>
      <p>this sentence.</p>
      <p>this sentence.</p>
    </div>

because that would be a rearrangement, not a selection, of the markup that exists in the input document.