scrapy: Remove elements from an xpath selector

Question

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.

Here's the gist.

<div id="easy-id">
  <stuff I don't want>
  text I don't want
  <div id="another-easy-id" more stuff I don't want>

  text I want
  <stuff I want>
  ...
  <more stuff I want>
  text I want
  ...

  <div id="one-more-easy-id" more stuff I *don't* want>
  <more stuff I *don't* want>

NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"

Because text and element nodes are mixed, I haven't been able to figure out a simple XPath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the XPath result as an lxml.etree element tree, and then hack at it using the .remove() method.

Any suggestions?

Answer 1:

I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.

Stack Overflow has not preserved the indenting, so I do not know where the first div element ends, but I'm going to guess it ends before the text.

In that case you might want

//div[@id = 'another-easy-id']/following::node()[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]

If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
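To make the prefix binding concrete, here is a rough sketch using lxml's `namespaces` argument to `xpath()`. The sample document and the variable names are invented for illustration; the prefix `h` is arbitrary, only the namespace URI matters.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML snippet; every element lives in the XHTML namespace.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="easy-id">'
    '<div id="another-easy-id">x</div>'
    'text I want'
    '<div id="one-more-easy-id">y</div>'
    '</div>'
    '</body></html>'
)

doc = etree.fromstring(SAMPLE)

# An unprefixed "div" only matches elements in no namespace, so this finds nothing.
unprefixed = doc.xpath("//div[@id='another-easy-id']")

# Binding the prefix "h" to the XHTML namespace lets h:div match.
prefixed = doc.xpath(
    "//h:div[@id='another-easy-id']",
    namespaces={"h": XHTML_NS},
)

print(len(unprefixed), len(prefixed))
```

This should only matter when the page is parsed as XML. As far as I know, HTML parsers generally ignore namespaces, in which case plain div is what you want.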

EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)

//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
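For anyone who wants to try that last expression outside a spider, here is a minimal sketch using lxml directly. The sample markup, the `describe` helper and the variable names are made up for illustration; in Scrapy you would presumably pass the same expression to `response.xpath(...)`.

```python
from lxml import etree

# Hypothetical markup mirroring the layout described in the question.
SAMPLE = """\
<div id="easy-id">
  <p>unwanted</p>
  text I do not want
  <div id="another-easy-id">unwanted</div>
  text I want 1
  <span>wanted</span>
  text I want 2
  <div id="one-more-easy-id">unwanted</div>
  <p>unwanted too</p>
</div>"""

XPATH = (
    "//div[@id='easy-id']/div[@id='one-more-easy-id']"
    "/preceding-sibling::node()"
    "[preceding-sibling::div[@id='another-easy-id']]"
)


def describe(node):
    """Render a result node as a short string for inspection."""
    # lxml returns text nodes as str subclasses and elements as Element objects.
    if isinstance(node, str):
        return node.strip()
    return "<%s>" % node.tag


doc = etree.fromstring(SAMPLE)
nodes = doc.xpath(XPATH)

# Drop any whitespace-only text nodes (none in this sample, common in real pages).
wanted = [d for d in map(describe, nodes) if d]
print(wanted)
```

The result is a mixed list: text nodes come back as strings and element nodes as elements, so whatever consumes it has to handle both.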

Source: https://stackoverflow.com/questions/12179821/scrapy-remove-elements-from-an-xpath-selector

Tags

xpath

lxml

scrapy