Parsing large HTML files with Nokogiri

Submitted by 时光毁灭记忆、已成空白 on 2019-12-11 00:42:44

Question


I'm trying to parse http://www.pro-medic.ru/index.php?ht=246&perpage=all with Nokogiri, but unfortunately I can't get all items from the page.

My simple test code is:

require 'open-uri'
require 'nokogiri'

html = Nokogiri::HTML open('http://www.pro-medic.ru/index.php?ht=246&perpage=all')
p html.css('ul.products-grid-compact li .goods_container').count

It returns only 83 items, but the real count is about 186.

I thought the problem might be in open, but that function seems to read the HTML page correctly.

Has anybody faced the same problem?


Answer 1:


The file seems to exceed the built-in limits of libxml2, the parser Nokogiri uses under the hood. You can relax those limits by adding the HUGE flag:

require 'open-uri'
require 'nokogiri'

url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# URI.open is needed on Ruby 3.0+, where open-uri no longer extends Kernel#open
html = Nokogiri::HTML(URI.open(url)) do |config|
  # Relax the parser's hardcoded size limits
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186

Note that |= is the bitwise OR assignment operator. Don't confuse it with the logical operator ||=, which assigns only when the left-hand side is nil or false.
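To make the difference concrete, here is a small sketch using made-up flag values (flag_a and flag_b are hypothetical and are not Nokogiri constants):

```ruby
# Hypothetical bit flags, chosen only for illustration.
flag_a = 0b0001
flag_b = 0b0100

options = flag_a
options |= flag_b     # bitwise OR: the flag_b bit is added to the existing bits
bitwise_result = options

options = flag_a
options ||= flag_b    # logical OR: options is already truthy, so nothing changes
logical_result = options

puts bitwise_result   # both bits set
puts logical_result   # only the original bit set
```

With ||= the HUGE flag would silently never be applied, because config.options already holds an integer.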

According to the Nokogiri::XML::ParseOptions documentation, you can also set this flag via config.huge.



Source: https://stackoverflow.com/questions/37542491/parsing-large-html-files-with-nokogiri
