Question
I'm trying to parse http://www.pro-medic.ru/index.php?ht=246&perpage=all with Nokogiri, but unfortunately I can't get all items from the page.
My simple test code is:
require 'open-uri'
require 'nokogiri'
html = Nokogiri::HTML open('http://www.pro-medic.ru/index.php?ht=246&perpage=all')
p html.css('ul.products-grid-compact li .goods_container').count
It returns only 83 items but the real count is about 186.
I thought that the problem could be in open, but it seems that function reads the HTML page correctly.
Has anybody faced the same problem?
Answer 1:
The file seems to exceed Nokogiri's parser limits. You can relax the limits by adding the HUGE flag:
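require 'open-uri'
require 'nokogiri'
url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# On Ruby 3.0+, open-uri no longer patches Kernel#open, so use URI.open (available since Ruby 2.5).
html = Nokogiri::HTML(URI.open(url)) do |config|
  # Add the HUGE flag to the default parse options instead of replacing them.
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186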
Note that |= is a bitwise OR assignment operator; don't confuse it with the logical operator ||=.
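To make the difference concrete, here is a minimal sketch using made-up flag values. The names recover, nonet and huge and their bit patterns are assumptions for illustration only, not Nokogiri's actual constants.

```ruby
# Hypothetical flag bits, chosen only for illustration.
recover = 0b001
nonet   = 0b010
huge    = 0b100

defaults = recover | nonet   # two flags already set

# Bitwise OR assignment: sets the huge bit and keeps the existing bits.
with_bitwise = defaults
with_bitwise |= huge

# Conditional assignment: assigns only when the left side is nil or false,
# so an existing integer is left untouched and the huge bit is never set.
with_logical = defaults
with_logical ||= huge

puts with_bitwise.to_s(2)
puts with_logical.to_s(2)
```

With ||= the options value would stay exactly as it was, so the parser limits would not be relaxed.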
According to the Parse Options documentation, you can also set this flag via config.huge.
Source: https://stackoverflow.com/questions/37542491/parsing-large-html-files-with-nokogiri