xpath: string manipulation

Posted by 人走茶凉 on 2019-12-30 07:28:11

Question


In my Scrapy project I was able to isolate some particular fields. One of the fields returns something like:

[Rank Info] on 2013-06-27 14:26 Read 174 Times

which was selected by expression:

(//td[@class="show_content"]/text())[4]

I usually do post-processing to extract the datetime information, i.e., 2013-06-27 14:26. Now that I've learned a little more about XPath substring manipulation, I am wondering whether it is possible to extract that piece of information in the first place, i.e., in the XPath expression itself.

Thanks,


Answer 1:


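Scrapy uses XPath 1.0, which has very limited string manipulation capabilities; in particular, it does not support regular expressions. There are two ways to cut down a string, and I demonstrate both with an example that strips the text down to the substring you're looking for. Note that in XPath 1.0 a string function applied to a node-set only looks at the first node in document order, so if your target is not the first text node, pass your own (//td[@class="show_content"]/text())[4] selection as the first argument instead.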

By Character Index

This works if the character positions stay fixed, even if the contents change. Positions in XPath are 1-based.

substring($string, $start, $len)
substring(//td[@class="show_content"]/text(), 16, 16)
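
The start and length values can be checked outside XPath. The following is a minimal Python sketch, assuming the sample string from the question has no leading whitespace; xpath_substring is a hypothetical helper that mimics substring() for positive integer arguments only and is not part of any library.

```python
# Sample string taken from the question.
sample = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"


def xpath_substring(s: str, start: int, length: int) -> str:
    """Hypothetical helper: XPath 1.0 substring() for positive integer arguments."""
    # XPath positions are 1-based, Python indices are 0-based.
    return s[start - 1 : start - 1 + length]


start = sample.index("2013") + 1   # 1-based position of the first digit
length = len("2013-06-27 14:26")   # number of characters to keep

print(start, length)
print(xpath_substring(sample, start, length))  # date and time
print(xpath_substring(sample, start, 10))      # date only
```

A length of 16 covers the date and the time, while a length of 10 stops after the date.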

By pre-/suffix Search

This works if the position can change but the text immediately before and after the target string stays the same:

substring-before($string, $needle)
substring-after($string, $needle)
substring-before(
  substring-after(//td[@class="show_content"]/text(), 'on '), ' Read')
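
Both expressions can be tried outside a spider. This is a rough sketch, assuming lxml is installed (Scrapy depends on it) and using a made-up one-cell table, since the real page markup is not shown in the question.

```python
from lxml import etree  # assumption: lxml is available

# Hypothetical minimal markup; the real page is not shown in the question.
doc = etree.fromstring(
    '<table><tr><td class="show_content">'
    '[Rank Info] on 2013-06-27 14:26 Read 174 Times'
    '</td></tr></table>'
)

# Fixed-position variant: start at character 16, keep 16 characters.
by_index = doc.xpath('substring(//td[@class="show_content"]/text(), 16, 16)')

# Marker variant: keep what lies between "on " and " Read".
by_marker = doc.xpath(
    'substring-before('
    'substring-after(//td[@class="show_content"]/text(), "on "), " Read")'
)

print(by_index)
print(by_marker)
```

Inside a Scrapy callback, passing the same expression to response.xpath(...) and calling .get() should yield the same string, though that is not tested here.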



Answer 2:


In all of the other answers so far, the /text() step is not only unhelpful, it is potentially (or even likely) a problem. Readers of the archive should be aware of the problems with using /text() in the node addresses passed as function arguments. In my professional work, there are very (very!) few requirements for addressing text() directly.

I'm speaking of these expressions from the other posts:

substring-after(//td[@class='show_content']/text(), 'on ')

and

substring(//td[@class='show_content']/text(), 16, 10)

Let's put aside the issue that "//" is used when it shouldn't be used. In XSLT 1.0 only the first node selected (the first text node of the first <td>) would be considered, and in XSLT 2.0 a run-time error would be triggered whenever the first argument is more than a singleton.

Consider this modified XML if it were the input:

   <td>[<emphasis>Rank Info</emphasis>] on 2013-06-27 14:26 Read 174 Times</td>

... where the " on " is in the second text node (the first text node contains only "["). In XSLT 1.0, both expressions return the empty string. In XSLT 2.0 both expressions trigger run-time errors.

Consider this modified XML if it were the input:

   <td>[Rank Info]<emphasis> on </emphasis>2013-06-27 14:26 Read 174 Times</td>

Here, for both expressions, the text() children of <td> do not include the string "on" at all, because it sits in a descendant text node, not a child text node.

For both expressions, then, the following rewrites would work for both of the modified inputs, because one is then dealing with the string value of the element, not the value of individual text nodes. The string value of an element is the concatenation of all its descendant text nodes.

So:

substring-after(td[@class='show_content'], 'on ')

and

substring(td[@class='show_content'], 16, 10)

would act on the entire string value found in the element. But even the above is going to have cardinality problems if there is more than one <td> child, so the expression will have to be rewritten anyway.
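
The difference between the first child text node and the element's string value can be observed directly. This is a small sketch, assuming lxml is installed; it evaluates the expressions with each <td> as the context node, so "." stands in for td[@class='show_content'], and the variable names are illustrative only.

```python
from lxml import etree  # assumption: lxml is available

# The two modified inputs from the text above.
first = etree.fromstring(
    '<td>[<emphasis>Rank Info</emphasis>]'
    ' on 2013-06-27 14:26 Read 174 Times</td>'
)
second = etree.fromstring(
    '<td>[Rank Info]<emphasis> on </emphasis>'
    '2013-06-27 14:26 Read 174 Times</td>'
)

results = []
for td in (first, second):
    # XPath 1.0 converts only the first child text node to a string.
    via_text = td.xpath('substring-after(text(), "on ")')
    # "." is the element itself, so its full string value is searched.
    via_value = td.xpath('substring-after(., "on ")')
    results.append((via_text, via_value))
    print(repr(via_text), repr(via_value))
```

For both inputs the text() variant yields an empty string, while the string-value variant yields everything after "on ".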

My point is that the use of text() caught my eye, and I tell my students that if they think they need to use text() in an XPath expression, they need to think again, because in most cases they do not.




Answer 3:


This should work (a length of 10 returns only the date, 2013-06-27; use 16 to include the time as well):

substring(//td[@class="show_content"]/text(), 16, 10)

But I agree with Blender, in-code postprocessing is better for this purpose.
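
For comparison, in-code post-processing might look like the sketch below. It assumes the field always contains a timestamp in the YYYY-MM-DD HH:MM layout shown in the question; the variable names are illustrative.

```python
import re
from datetime import datetime

# Raw field value as returned by the original XPath selection.
raw = "[Rank Info] on 2013-06-27 14:26 Read 174 Times"

# Match the timestamp wherever it appears, independent of character position.
match = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}", raw)
posted_at = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M") if match else None

print(posted_at)
```

Unlike the XPath variants, this does not depend on fixed positions or on the surrounding words, and it produces a datetime object rather than a string.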



Source: https://stackoverflow.com/questions/17374219/xpath-string-manipulation
