Nokogiri Scraping Misses HTML

Submitted by 微笑、不失礼 on 2020-01-25 10:25:07

Question


Nokogiri isn't grabbing anything beneath the iframe tag.

doc.search("iframe") returns only the iframe tag itself, doc.search("body.content-frame") returns an empty result, and doc.errors is empty as well. Why isn't Nokogiri picking up the HTML beneath the iframe, and how can I grab it?

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

    <head></head>
    <body onunload="clearMyTimeInterval()">
       <iframe id="content-frame" frameborder="0" src="/sportsbook/betting-lines/baseball/2014-08-21/?range=day" onload="javascript:checkLoadedFrame(this);" style="background-color: rgb(34, 34, 34); height: 1875px;" name="content-frame" scrolling="no" border="0">
           #document
           <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
           <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
            <head></head>
            <body class="content-frame">
             #ETC.......

Answer 1:


That's because the contents of the iframe are not part of the page. In fact, they are in a completely different location (note the src attribute of the iframe). You'll have to fetch that content separately, which is how a browser would do it.




Answer 2:


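Mechanize can follow the iframe for you and return the framed page:

require 'mechanize'
page = Mechanize.new.get "http://page_u_need"   # placeholder URL of the outer page
page.iframe_with(id: 'beatles').content          # 'beatles' is a placeholder id; returns the iframe's page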


Source: https://stackoverflow.com/questions/25436818/nokogiri-scraping-misses-html
