xml - How can I filter images out of HTML in Scrapy with XPath?
I'm trying to scrape the HTML of various articles using Scrapy. These articles include images that I want to process separately.
If I have article HTML that looks like this:
<div class="article"> <p>this sentence.</p> <p>this sentence.</p> <img src="/path/to/image.jpg"/> <p>this sentence.</p> <p>this sentence.</p> </div>
How can I scrape only the non-image HTML, or this:
<div class="article"> <p>this sentence.</p> <p>this sentence.</p> <p>this sentence.</p> <p>this sentence.</p> </div>
I've tried:
article = response.xpath("//div[@class='article'][not(img)]").extract()
...but the result still includes the images.
XPath is for selection, not transformation or rearrangement.
You can select div elements that have no img children:
//div[@class='article' and not(img)]
or that have no img descendants:
//div[@class='article' and not(.//img)]
Or, you can select the contents of such div elements that are p:
//div[@class='article']/p
or that are not img:
//div[@class='article']/*[not(self::img)]
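To make the "selection" point concrete, here is a minimal sketch using only Python's standard library rather than Scrapy itself. It assumes the page is well-formed XML (the sample is; real pages often are not), and the PAGE string and variable names are made up for illustration. ElementTree supports only a subset of XPath, so the not(self::img) test is done in Python here instead of in the path expression:

```python
import xml.etree.ElementTree as ET

# Hypothetical page wrapping the sample article from the question.
PAGE = (
    '<html><body><div class="article">'
    '<p>this sentence.</p>'
    '<p>this sentence.</p>'
    '<img src="/path/to/image.jpg"/>'
    '<p>this sentence.</p>'
    '<p>this sentence.</p>'
    '</div></body></html>'
)

root = ET.fromstring(PAGE)

# Select every child element of the article div.
children = root.findall(".//div[@class='article']/*")

# Keep only the children that are not img (the not(self::img) test, done in Python).
non_images = [el for el in children if el.tag != "img"]

# Each selected node serializes as its own fragment; the enclosing div is not part of it.
fragments = [ET.tostring(el, encoding="unicode") for el in non_images]
```

What comes back is a list of separate p fragments, not a div with the image removed.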
But you cannot select the requested HTML,
<div class="article"> <p>this sentence.</p> <p>this sentence.</p> <p>this sentence.</p> <p>this sentence.</p> </div>
because that would be a rearrangement, not a selection, of the markup that exists in the input document.
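If the rearranged markup is what you actually need, one option is to do the transformation in code after parsing. The following is only a sketch under stated assumptions: it uses Python's standard-library ElementTree rather than Scrapy, it assumes the article markup is well-formed XML (an HTML-tolerant parser would be needed for real pages), and strip_images is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

ARTICLE_HTML = (
    '<div class="article">'
    '<p>this sentence.</p>'
    '<p>this sentence.</p>'
    '<img src="/path/to/image.jpg"/>'
    '<p>this sentence.</p>'
    '<p>this sentence.</p>'
    '</div>'
)


def strip_images(markup: str) -> str:
    """Return the markup with every img element removed (hypothetical helper)."""
    root = ET.fromstring(markup)
    # Walk a snapshot of the tree so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "img":
                # Caveat: text that directly follows the img (its .tail) is dropped too;
                # the sample markup has none.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_images(ARTICLE_HTML)
```

Here the image removal is a modification of the parsed tree followed by re-serialization, which is exactly the step XPath alone does not perform.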