Data scraping multiple page clicks loops

回眸只為那壹抹淺笑 submitted on 2019-12-08 07:13:00

Question


We are trying to use a single Mechanize instance to scrape all of the data we want from the UCAS website and add it to arrays. We are currently struggling to code the link clicks in Mechanize. Three successive link clicks, nested inside loops, are needed to progress through all of the search result pages. Can anyone help?

The first link, which displays all courses for a university, is inside the div with class morecourseslink.

The second link, which displays course names, duration and qualification, is inside the div with class coursenamearea.

The third link is inside the div with class coursedetailsshowable; the a element has the id coursedetailtab_entryreqs.

Currently we are scraping the university names with the code below:

require 'mechanize'

class PagesController < ApplicationController
  def home
    mechanize = Mechanize.new
    @uninames_array = []

    page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')

    # University names on the first result page
    page.search('li.result h3').each do |h3|
      @uninames_array.push(h3.text)
    end

    # Follow the ">" pager link until there is no next page
    while (next_page_link = page.at('.pager a[text()=">"]'))
      page = mechanize.get(next_page_link['href'])

      page.search('li.result h3').each do |h3|
        @uninames_array.push(h3.text)
      end
    end

    puts @uninames_array.to_s
  end
end

And the course names, durations and qualifications with the code below:

require 'mechanize'

mechanize = Mechanize.new
@duration_array = []
@qual_array = []
@courses_array = []

page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

# Visit each result page once and collect all three fields from it.
# Three separate pagination loops would not work: the first loop ends on
# the last page, so the second and third would find no ">" link to follow.
loop do
  page.search('div.courseinfoduration').each do |x|
    @duration_array.push(x.text.strip)
    puts x.text.strip
  end

  page.search('div.courseinfooutcome').each do |y|
    @qual_array.push(y.text.strip)
    puts y.text.strip
  end

  page.search('div.coursenamearea h4').each do |h4|
    @courses_array.push(h4.text.strip)
    puts h4.text.strip
  end

  next_page_link = page.at('.pager a[text()=">"]')
  break unless next_page_link

  page = mechanize.get(next_page_link['href'])
end

Answer 1:


If you want to do this with one Mechanize instance, why not string the steps together and store the pages you need to jump to and from in variables?
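The three successive clicks from the question could be sketched like this, keeping each page in its own variable. This is only a sketch under assumptions: the method names `hrefs` and `entry_requirement_pages` are hypothetical, the divs named in the question are assumed to contain plain `a` elements with an `href`, and the entry requirements tab is assumed to be a normal link rather than a JavaScript-only control (Mechanize does not run JavaScript). Pagination of each level is left out for brevity.

```ruby
# Returns the href of every anchor found inside nodes matching `selector`.
def hrefs(page, selector)
  page.search("#{selector} a").map { |a| a['href'] }.compact
end

# `agent` is expected to behave like Mechanize.new. The page a link was found
# on is passed as the referer so relative hrefs resolve against that page.
def entry_requirement_pages(agent, providers_page)
  hrefs(providers_page, 'div.morecourseslink').flat_map do |courses_href|
    # Click 1: all courses for one university
    courses_page = agent.get(courses_href, [], providers_page)

    hrefs(courses_page, 'div.coursenamearea').map do |course_href|
      # Click 2: the page for a single course
      course_page = agent.get(course_href, [], courses_page)

      # Click 3: the entry requirements tab, if the page has one
      tab = course_page.at('div.coursedetailsshowable a#coursedetailtab_entryreqs')
      tab ? agent.get(tab['href'], [], course_page) : course_page
    end
  end
end
```

With a real agent, `providers_page` would be the page returned by `mechanize.get` for the provider search URL, and each returned page could then be searched for whatever entry requirement markup the site uses.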

If all of your code works, you can simply combine it into one method:

require 'mechanize'

def home
  mechanize = Mechanize.new

  @uninames_array = []
  @duration_array = []
  @qual_array = []
  @courses_array = []

  # Step 1: university names from every provider result page
  page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')

  loop do
    page.search('li.result h3').each do |h3|
      @uninames_array.push(h3.text)
    end

    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link

    page = mechanize.get(next_page_link['href'])
  end

  # Step 2: durations, qualifications and course names, collected in a single
  # pass so that each result page is fetched only once
  page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

  loop do
    page.search('div.courseinfoduration').each do |x|
      @duration_array.push(x.text.strip)
      puts x.text.strip
    end

    page.search('div.courseinfooutcome').each do |y|
      @qual_array.push(y.text.strip)
      puts y.text.strip
    end

    page.search('div.coursenamearea h4').each do |h4|
      @courses_array.push(h4.text.strip)
      puts h4.text.strip
    end

    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link

    page = mechanize.get(next_page_link['href'])
  end
end
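The code above repeats the same "follow the > link" loop for each kind of page. One way to factor that out is sketched below; `each_page`, `texts` and `scrape_courses` are hypothetical helper names, and `agent` is assumed to be a Mechanize instance (or anything with a compatible `get`). The selectors are the ones from the question.

```ruby
# Yields every result page starting at start_url, following the ">" pager
# link until the page no longer has one.
def each_page(agent, start_url)
  page = agent.get(start_url)
  loop do
    yield page
    next_link = page.at('.pager a[text()=">"]')
    break unless next_link

    page = agent.get(next_link['href'])
  end
end

# Collects the stripped text of every node matching `selector` on `page`.
def texts(page, selector)
  page.search(selector).map { |node| node.text.strip }
end

# Gathers names, durations and qualifications in one pass over the pages.
def scrape_courses(agent, start_url)
  courses = { names: [], durations: [], quals: [] }
  each_page(agent, start_url) do |page|
    courses[:names].concat(texts(page, 'div.coursenamearea h4'))
    courses[:durations].concat(texts(page, 'div.courseinfoduration'))
    courses[:quals].concat(texts(page, 'div.courseinfooutcome'))
  end
  courses
end
```

With a real agent this would be called as `scrape_courses(Mechanize.new, results_url)`, where `results_url` is the course results URL used above; the same `each_page` helper could also drive the university name loop.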


Source: https://stackoverflow.com/questions/37681359/data-scraping-multiple-page-clicks-loops
