xml - How can I filter images out of HTML in Scrapy with XPath?


I'm trying to scrape the HTML of various articles using Scrapy. These articles include images that I want to process separately.

If I have article HTML that looks like this:

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <img src="/path/to/image.jpg"/>   <p>this sentence.</p>   <p>this sentence.</p> </div>

How can I scrape only the non-image HTML, that is, something like this:

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p> </div>

I've tried:

article = response.xpath("//div[@class='article'][not(img)]").extract() 

...but this still includes the images.

XPath is for selection, not transformation or rearrangement.

You can select div elements that have no img children:

//div[@class='article' and not(img)]

or that have no img descendants:

//div[@class='article' and not(.//img)]

Or you can select the children of those div elements that are p elements:

//div[@class='article']/p 

or that are not img elements:

//div[@class='article']/*[not(self::img)] 
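As a rough illustration of what the last two selections return, here is a minimal sketch. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy, and it assumes the markup is well-formed XML. The `PAGE` string and the variable names are made up for the example. ElementTree supports only a subset of XPath, so the `not(self::img)` test is emulated with a Python filter; in a spider the expressions above could be passed to `response.xpath(...)` unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical page wrapping the article from the question.
PAGE = """<body>
<div class="article">
  <p>this sentence.</p>
  <p>this sentence.</p>
  <img src="/path/to/image.jpg"/>
  <p>this sentence.</p>
  <p>this sentence.</p>
</div>
</body>"""

root = ET.fromstring(PAGE)

# Counterpart of //div[@class='article']/p
paragraphs = root.findall(".//div[@class='article']/p")

# Counterpart of //div[@class='article']/*[not(self::img)].
# ElementTree has no not() or self:: axis, so the filter is done in Python.
non_images = [
    el for el in root.findall(".//div[@class='article']/*")
    if el.tag != "img"
]

# Each selected node serializes on its own; the enclosing div is not included.
fragments = [ET.tostring(el, encoding="unicode").strip() for el in non_images]
print(fragments)
```

Either way the result is a list of separate child elements, not the original div with the image taken out.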

But you cannot select the requested HTML,

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p> </div> 

because that would be a rearrangement, not a selection, of the markup that exists in the input document.
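If the image-free div itself is what you need, one option is to do the transformation outside XPath: parse the markup, remove the img elements from the tree, and serialize the result. The sketch below is one possible approach, not the only one. It uses only the standard library and assumes the fragment is well-formed XML; `strip_images` and `ARTICLE_HTML` are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET

# The article markup from the question, with the class attribute quoted.
ARTICLE_HTML = """<div class="article">
  <p>this sentence.</p>
  <p>this sentence.</p>
  <img src="/path/to/image.jpg"/>
  <p>this sentence.</p>
  <p>this sentence.</p>
</div>"""


def strip_images(markup: str) -> str:
    """Return the markup with every <img> element removed."""
    root = ET.fromstring(markup)
    # Snapshot the elements first so the tree can be modified while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "img":
                # Note: text directly following the image (its tail)
                # is dropped together with the element.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_images(ARTICLE_HTML)
print(cleaned)
```

The removed images can still be collected separately beforehand, for example with `//div[@class='article']//img/@src`.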

