How to scrape images from eBay and Amazon using XPath in Nokogiri from JSON

懵懂的女人 submitted on 2019-12-05 07:29:06
shota

This is an alternative way to do what you want: you can use Capybara and Poltergeist.

With this approach, you should not need to dig into the page's JavaScript yourself.

If you are scraping, I recommend considering Capybara with Poltergeist; there are plenty of resources you can refer to.

This is the code I tried:

require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
require 'nokogiri'

# Make `visit` and `page` available at the top level of this script
include Capybara::DSL

# Register a Poltergeist (PhantomJS) driver with the remote inspector enabled
Capybara.register_driver :poltergeist_debug do |app|
  Capybara::Poltergeist::Driver.new(app, inspector: true)
end

Capybara.javascript_driver = :poltergeist_debug
Capybara.current_driver = :poltergeist_debug

# Amazon case: load the page, then parse the rendered HTML with Nokogiri
visit('https://www.amazon.com/dp/B00T46V758/?tag=stackoverfl08-20')
doc_amazon = Nokogiri::HTML.parse(page.html)
doc_amazon.xpath("//img/@src").each do |src|
  p src.value
end

# eBay case: same steps against the eBay listing
visit('https://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing-Short-Sleeve-Loose-T-Shirt-Blouse-/351411949784?pt=LH_DefaultDomain_0&var=&hash=item51d1c8d0d8')
doc_ebay = Nokogiri::HTML.parse(page.html)
doc_ebay.xpath("//img/@src").each do |src|
  p src.value
end
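
The loops above print every src value exactly as it appears in the page, which can include relative paths, protocol-relative URLs, duplicates, and inline data: URIs. The following is a minimal sketch of cleaning such a list with Ruby's standard uri library. The helper name normalize_image_urls, the sample values, and the /static/logo.png path are assumptions made for illustration, not something taken from the real pages.

```ruby
require 'uri'

# Hypothetical helper: turn raw src values into a unique list of absolute URLs.
def normalize_image_urls(srcs, base_url)
  srcs.filter_map do |src|
    src = src.to_s.strip
    # Skip empty values and inline placeholders such as data: URIs
    next if src.empty? || src.start_with?('data:')

    begin
      # Resolves relative and protocol-relative values against the page URL
      URI.join(base_url, src).to_s
    rescue URI::Error
      nil # skip values that cannot be parsed as a URI
    end
  end.uniq
end

# Sample values only; in the script above they would come from src.value
raw = [
  'https://i.ebayimg.com/images/g/dtAAAOSwpdpVZuU~/s-l300.jpg',
  '//i.ebayimg.com/images/g/dtAAAOSwpdpVZuU~/s-l300.jpg',
  '/static/logo.png',
  'data:image/gif;base64,R0lGODlhAQABAAAAACw=',
  ''
]

urls = normalize_image_urls(raw, 'https://www.ebay.com/itm/351411949784')
urls.each { |url| puts url }
```

In the scraping script, the first argument could be built with something like doc_ebay.xpath("//img/@src").map(&:value).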

If you want to dig into it:

doc_amazon.xpath("//div[@id='imgTagWrapperId']/img").attribute('src').value
# => "https://images-na.ssl-images-amazon.com/images/I/81%2BTW8762BL._UX453_.jpg"

doc_ebay.xpath("//div[@id='mainImgHldr']/img[@id='icImg']").attribute('src').value
# => "https://i.ebayimg.com/images/g/dtAAAOSwpdpVZuU~/s-l300.jpg"

Are you trying to generate a database of competitors' items with pricing, etc.?
Are you trying to grab entire categories or individual sellers? I ask because you can get an RSS feed of the items each seller lists, if the seller has turned that feature on. That way, you do not have to waste time scraping a page when you can get the central data from an RSS feed.
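
As a rough illustration of the RSS route, here is a minimal sketch that reads item titles, links, and image URLs from a feed. It uses REXML from Ruby's standard distribution so it runs without extra gems. The feed content, the example.com URLs, and the feed_items helper are all hypothetical; a real seller feed may use different elements, so check the actual feed before relying on these paths.

```ruby
require 'rexml/document'

# Hypothetical helper: extract title, link, and image URL from each <item>.
def feed_items(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item').map do |item|
    enclosure = item.elements['enclosure']
    {
      title: item.elements['title']&.text,
      link:  item.elements['link']&.text,
      # Not every item carries an enclosure, so this may be nil
      image: enclosure && enclosure.attributes['url']
    }
  end
end

# Made-up RSS 2.0 document standing in for a downloaded seller feed
sample_feed = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example seller listings</title>
      <item>
        <title>Chiffon blouse</title>
        <link>https://example.com/items/1</link>
        <enclosure url="https://example.com/images/1.jpg" type="image/jpeg" length="0"/>
      </item>
      <item>
        <title>Batwing top</title>
        <link>https://example.com/items/2</link>
      </item>
    </channel>
  </rss>
XML

items = feed_items(sample_feed)
items.each do |item|
  puts "#{item[:title]} -> #{item[:image] || '(no image)'}"
end
```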

When parsing web pages, what you find depends on where you are in the page. You mentioned a carousel: the indices you are encountering come from the set of thumbnails that represent the larger images.
I recommend looking at the eBay API and the Amazon API, and finding the RSS feeds for the sellers first.

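As for getting past the JavaScript issues: the page loads rotating slideshows and carousels dynamically, so you need a tool that renders the page in a real or headless browser, such as Selenium, to get a fully rendered page in which all images are in a scrapable state. Mechanize (as RAJ suggested) and Beautiful Soup fetch and parse HTML but do not run JavaScript on their own, so they only help once the rendered HTML is available.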

Feel free to post your source if there is anything else I can help with.

Sorry, I am posting this answer from a mobile phone, so I can't write full code right away; however, I can point you in a direction. You should use Mechanize with selenium-webdriver and Watir instead of Nokogiri alone.

Mechanize on its own does not execute JavaScript; it is the browser driven by selenium-webdriver or Watir that lets you handle elements generated by JavaScript. You can mimic real actions in the browser: click links and buttons, wait for an image to load, and then scrape it. With this combination, all of that is fairly easy to do.
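
The "wait for the image to load, then scrape it" step can be sketched independently of any browser library as a small polling loop. The wait_until helper below and the fake lookup block are hypothetical stand-ins; browser automation libraries generally ship their own waiting features, which are usually the better choice in a real script.

```ruby
# Hypothetical helper: call the block repeatedly until it returns a truthy
# value, or raise once the timeout (in seconds) has passed.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a browser lookup: the image src "appears" on the third check.
# In a real script the block would ask the browser for the element instead.
attempts = 0
src = wait_until(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? 'https://example.com/images/1.jpg' : nil
end

puts src
```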
