Nokogiri XPath to retrieve text after <BR> within <TD> and <SPAN>

血红的双手。 Submitted on 2019-12-04 20:29:34

Here's a concise way:

name, nick, email, *addr = doc.search('//td/text()[preceding-sibling::br]')

puts name, nick, email, "--", addr

The XPath does exactly what you stated: it selects every text node that is a direct child of a td and has a br somewhere before it among its siblings. The splat slurps all of the address lines into one array (addr), but you can get the components separately if you want.

Output:

FirstName LastName
NickName
First.Last@SomeCompany.com
--
FirstName LastName
Attn: FirstName
1234 Main St.
TheCity, TheState, 12345
United States

<br> tags are a bit of a unique problem when dealing with HTML. They are rarely used for anything but formatting the content in the page, i.e., breaking lines the way a new-line would in a *nix text file. So my tactic when extracting text is to transform them into new-lines.

Parse the content into a Nokogiri::HTML document:

doc = Nokogiri::HTML(html_doc_to_parse)

Convert each <br> to a new-line:

doc.search('br').each { |br| br.replace("\n") }

Then, find the cells you want:

doc.search('//td').map { |td| td.content }

which, written with the equivalent shorthand map(&:content), returns something like:

doc.search('//td').map(&:content)
=> ["\n  Buyer\nFirstName LastName\nNickName\nFirst.Last@SomeCompany.com",
 "\n  Shipping address - confirmed\nFirstName LastName\nAttn: FirstName\n1234 Main St.\nTheCity, TheState, 12345\nUnited States\n"]

which looks like this when printed:

puts doc.search('//td').map(&:content)

  Buyer
FirstName LastName
NickName
First.Last@SomeCompany.com

  Shipping address - confirmed
FirstName LastName
Attn: FirstName
1234 Main St.
TheCity, TheState, 12345
United States

From there, it's a matter of picking the array elements you want and splitting each one on the new-lines, e.g., with some_string.split("\n").
