xml - How can I filter images out of HTML in Scrapy with XPath?

- May 15, 2014

I'm trying to scrape the HTML of various articles using Scrapy. These articles include images that I want to process separately.

If I have article HTML that looks like this:

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <img src="/path/to/image.jpg"/>   <p>this sentence.</p>   <p>this sentence.</p> </div>

How can I scrape only the non-image HTML, that is, this:

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p> </div>

I've tried:

article = response.xpath("//div[@class='article'][not(img)]").extract()

...but this still includes the images.

XPath is for selection, not transformation or rearrangement.

You can select div elements that have no img children:

//div[@class='article' and not(img)]

or that have no img descendants:

//div[@class='article' and not(.//img)]
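The difference between the two predicates can be checked with a small sketch. It assumes lxml is installed (Scrapy's selectors are built on it); the sample markup and the `ids` helper are made up for illustration and are not part of the question.

```python
from lxml import html

# Hypothetical sample: one article with a direct img child, one with an
# img nested inside a p, and one with no images at all.
DOC = html.document_fromstring("""
<html><body>
  <div class="article" id="a"><p>text</p><img src="/a.jpg"/></div>
  <div class="article" id="b"><p>text <img src="/b.jpg"/></p></div>
  <div class="article" id="c"><p>text</p></div>
</body></html>
""")

def ids(xpath):
    """Return the id attribute of every element matched by `xpath`."""
    return [div.get("id") for div in DOC.xpath(xpath)]

# not(img) only looks at direct children, so "b" still qualifies.
no_img_children = ids("//div[@class='article' and not(img)]")

# not(.//img) looks at all descendants, so only "c" qualifies.
no_img_descendants = ids("//div[@class='article' and not(.//img)]")
```

In both cases a div that passes the predicate is returned whole, so images nested deeper inside it (as in div `b`) still show up in the extracted HTML. That may be why the attempt in the question still included images.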

Or you can select the contents of the div elements that are p elements:

//div[@class='article']/p

or those that are not img elements:

//div[@class='article']/*[not(self::img)]

But you cannot select your requested HTML,

<div class="article">   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p>   <p>this sentence.</p> </div>

because that is a rearrangement, not a selection, of the markup that exists in the input document.
