How to scrape data generated by a JavaScript function using Ruby?

Submitted by 人盡茶涼 on 2019-12-11 16:31:57

Question


I am trying to scrape the data URL from the row with the latest date (the first row of the table) on this page. However, the table's content appears to be generated by a JavaScript function. I tried Nokogiri, but in vain, because Nokogiri does not execute JavaScript. I then tried extracting just the scripts with Nokogiri:

    require 'open-uri'
    require 'nokogiri'

    url = "http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data"
    doc = Nokogiri::HTML(URI.open(url))  # bare open() no longer accepts URLs in Ruby 3.0+
    js = doc.css("script").text
    puts js

In the output I found the table I wanted, with the class name sgxTableGrid. The problem is that the JavaScript function contains no trace of the data URL; everything is generated dynamically. Does anyone know a better way to approach this problem?


Answer 1:


Looking through the HTML for that page, the table is built from JSON that the page receives in response to a JavaScript request.

You can figure out what's going on by tracing backwards through the page's source code. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript; there'll still be work needed to actually do something with it:

  1. Starting with this code:

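    require 'open-uri'
    require 'nokogiri'
    
    # URI.open replaces the bare open(), which no longer accepts URLs in Ruby 3.0+
    doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
    scripts = doc.css('script').map(&:text)
    
    # Print only the scripts that mention the table's class name
    puts scripts.select { |s| s['sgxTableGrid'] }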
    

    Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

    var tableHeader =  "<table width='100%' class='sgxTableGrid'>"
    

    Look down a little farther and you'll see:

    var totalRows = data.items.length - 1;
    

    data comes from the parameter to the function being called, so that's where we start.

  2. Get a unique part of the containing function's name, loadGridns_, and search for it. Each time you find it, look for the parameter data, then look to see where data is defined. If it's passed into that function, search for what calls it. Repeat until you reach a function where the variable isn't passed in; at that point you know you're in the function that creates it.

  3. I ended up in a function whose name starts with loadGridDatans, where data is part of a block that makes an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function and trace the calls where the URL is passed in, as you did in the previous step.

  4. That search ended up on a line that looks like:

    var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...
    
  5. At that point you can start reconstructing the URL you need. Open a JavaScript debugger, such as Firebug, and put a breakpoint on that line. Reload the page and JavaScript should stop executing at that line. Single-step, or set further breakpoints, and watch the url variable being built until it's in its final form. At that point you have something you can use with OpenURI, which should retrieve the JSON you want.
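
Once the debugger has shown you the final URL, fetching and decoding the JSON could look roughly like the sketch below. The json_url argument is a placeholder for whatever URL you recover, and both helper names are hypothetical. The only detail taken from the page's script is that the payload has an items array (from data.items.length); the shape of each item is unknown here and needs to be confirmed against the real response.

```ruby
require 'open-uri'
require 'json'

# Hypothetical helper: download the raw JSON text from the URL recovered
# in the debugger. Nothing in here is specific to the site.
def fetch_json_text(json_url)
  URI.open(json_url, &:read)
end

# Parse the payload and return its "items" array. The "items" key is taken
# from the page's own script (data.items.length); an empty array is
# returned if the key is missing.
def extract_items(json_text)
  data = JSON.parse(json_text)
  data.fetch('items', [])
end

# Usage, once you have the real URL:
#   items = extract_items(fetch_json_text(json_url))
#   puts items.first.inspect   # look at one row to learn its field names
```

OpenURI only issues GET requests. Since the page uses xhrPost, the server may insist on a POST, in which case something like Net::HTTP.post_form would be needed in place of fetch_json_text.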

Note that their function names might be generated dynamically; I didn't check, so relying on the full name of a function might fail.

They might also be serializing a datetime stamp or a session key into the function names to make them unique or more opaque, for any number of reasons.
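
If the suffix on the function names does change between page loads, one workaround is to match on the stable prefix instead of the full name. This is a minimal sketch, assuming the prefixes loadGridns_ and viewByDailyns_ seen above stay constant; find_function_names is a hypothetical helper, not something from the page.

```ruby
# Return the unique identifiers in script_text that begin with prefix.
# Whatever follows the prefix is treated as an opaque, possibly changing suffix.
def find_function_names(script_text, prefix)
  script_text.scan(/\b#{Regexp.escape(prefix)}\w*/).uniq
end

# Usage with the scripts array collected earlier:
#   names = scripts.flat_map { |s| find_function_names(s, 'loadGridns_') }.uniq
#   puts names
```

The names it returns can then be used as search terms for the tracing steps, whatever suffix the current page load happens to carry.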

Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.



Source: https://stackoverflow.com/questions/19713224/how-to-scrape-data-using-ruby-which-is-generated-by-a-javascript-function
