I have a snippet of code I'm trying to parse with Nokogiri that looks like this:
-
If your data really is that regular and you don't need the attributes from the link elements, then you could parse the text form of each table cell without having to worry about those elements at all.
Given some HTML like this in html:
Link 1 (info1), Blah 1,
Link 2 (info1), Blah 1,
Link 3 (info2), Blah 1 Foo 2,
Link 4 (info1), Blah 2,
Link 5 (info1), Blah 2,
Link 6 (info2), Blah 2 Foo 2,
Link 7 (info1), Blah 3,
Link 8 (info1), Blah 3,
Link 9 (info2), Blah 3 Foo 2,
Link A (info1), Blah 4,
Link B (info1), Blah 4,
Link C (info2), Blah 4 Foo 2,
You could do this:
doc = Nokogiri::HTML(html)
chunks = doc.search('.j').map { |td| td.text.strip.scan(/[^,]+,[^,]+/).map(&:strip) }
and have this:
[
[ "Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2" ],
[ "Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2" ],
[ "Link 7 (info1), Blah 3", "Link 8 (info1), Blah 3", "Link 9 (info2), Blah 3 Foo 2" ],
[ "Link A (info1), Blah 4", "Link B (info1), Blah 4", "Link C (info2), Blah 4 Foo 2" ]
]
in chunks. Then you could convert that to whatever hash form you needed.
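As a rough sketch of that last step, each string could be split into a hash with a regular expression. The key names (name, info, rest), the ENTRY pattern, and the entry_to_hash helper are assumptions made for illustration, not anything from the question, and the sample data is just the first two rows of the chunks output above:

```ruby
# Sample data copied from the chunks output above (first two rows only).
chunks = [
  ["Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2"],
  ["Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2"]
]

# Hypothetical pattern for entries shaped like "<name> (<info>), <rest>".
ENTRY = /\A(?<name>.+?)\s*\((?<info>[^)]*)\),\s*(?<rest>.+)\z/

# Hypothetical helper: returns a hash for one entry, or nil if it doesn't match.
def entry_to_hash(entry)
  m = ENTRY.match(entry)
  return nil unless m
  { name: m[:name], info: m[:info], rest: m[:rest] }
end

# One array of hashes per table cell; entries that don't match are dropped.
rows = chunks.map { |row| row.map { |entry| entry_to_hash(entry) }.compact }
```

If your real data is less regular than the sample, you would probably want to log or raise on entries that don't match instead of silently dropping them.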