Selenium code is not able to scrape ofashion.com.cn

心不动则不痛 submitted on 2020-04-11 18:12:06

Question


I am building a web scraper with Python and Selenium. The script scrapes sites such as Amazon, Stack Overflow, and Flipkart, but it fails on ofashion.com.cn: it always produces a blank .csv file.

Here is my code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.132 Safari/537.36'

driver_exe = 'chromedriver'
options = Options()
#options.add_argument("--headless")
options.add_argument(f'user-agent={user_agent}')
options.add_argument("--disable-web-security")
options.add_argument("--allow-running-insecure-content")
options.add_argument("--allow-cross-origin-auth-prompt")

driver = webdriver.Chrome(executable_path=r"C:\Users\intel\Downloads\Setups\chromedriver.exe", options=options)
driver.get("https://www.ofashion.com.cn/goods/10001?t=15777838840003")
class_Name = "." + "ellipsis-single ware-brand"
x = driver.find_elements_by_css_selector(class_Name.replace(' ','.'))
web_content_list = []

for i in x:
    web_content_dict = {}
    web_content_dict["Title"] = i.text
    web_content_list.append(web_content_dict)

df = pd.DataFrame(web_content_list)
df.to_csv(r'C:\Users\intel\Desktop\data_file.csv',
         index=False, mode='a', encoding='utf-8')

Any help would be appreciated!


Answer 1:


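This happens because the site renders its content with JavaScript. Notice the loading indicator (the clothes hanger): the browser tab already reports the page as loaded while the content is still being fetched, so your script queries the DOM before the elements exist and gets back an empty list. To wait until the content is actually there, use Selenium's wait methods, for example an explicit wait with WebDriverWait.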

NOTE: Put driver.close() at the end of your code so the chromedriver window closes properly (driver.quit() additionally ends the whole browser session).
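
The explicit wait mentioned in this answer could look roughly like the sketch below. It is not the answerer's code: the helper names css_selector_for, scrape_titles, and main are made up for illustration, and the 20-second timeout is an arbitrary assumption. It also assumes that the elements with the classes from the question eventually appear in the DOM; if the site needs more than that (scrolling, login, and so on), this alone will not be enough.

```python
def css_selector_for(class_attr):
    """Turn a space-separated class attribute into a CSS selector.

    "ellipsis-single ware-brand" becomes ".ellipsis-single.ware-brand".
    """
    return "." + ".".join(class_attr.split())


def scrape_titles(driver, url, class_attr, timeout=20):
    """Open url and return the text of every element matching class_attr.

    Blocks until at least one matching element is in the DOM. Selenium raises
    TimeoutException if none appears within `timeout` seconds (assumed value).
    """
    # Imported here so css_selector_for stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    locator = (By.CSS_SELECTOR, css_selector_for(class_attr))
    # An empty result is falsy, so until() keeps polling until elements exist.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )
    return [element.text for element in elements]


def main():
    """Hypothetical entry point; starts a real Chrome window when called."""
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes a matching chromedriver can be found
    try:
        titles = scrape_titles(
            driver,
            "https://www.ofashion.com.cn/goods/10001?t=15777838840003",
            "ellipsis-single ware-brand",
        )
        print(titles)
    finally:
        driver.quit()  # end the browser session even if the wait times out
```

Calling main() opens the page from the question and prints whatever titles were found. If the text is still empty at that point, switching the condition to EC.visibility_of_all_elements_located is one thing to try.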




Answer 2:


You should check out BeautifulSoup. Personally, I've never enjoyed scraping with Selenium, so I'd encourage you to try requests together with BeautifulSoup for web scraping.

from bs4 import BeautifulSoup
import requests

HEADERS = requests.utils.default_headers()
HEADERS.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

url = "https://www.ofashion.com.cn/goods/10001?t=15777838840003"
load = requests.get(url, headers=HEADERS)
page_content = load.content
soup = BeautifulSoup(page_content,'lxml')
print(soup)

I wrote the sample script above for you. It fetches the page and prints the parsed HTML, including all of its text. It's just something I wrote up quickly, but it does scrape your page. I'd encourage you to read more about BS4 and requests. I'll also link some web scrapers I've written previously in case you want a reference.

BS4 docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Requests docs: https://2.python-requests.org/en/v1.1.0/api
Some web scrapers I've written: https://github.com/backslash/WebScrapers/
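
To go from printing the whole page to pulling out specific elements, something like the following sketch might help. The function name extract_titles and the sample markup are hypothetical; only the class names come from the question. It assumes the target elements are present in the HTML the server returns. For a page whose content is filled in by JavaScript, as the first answer describes, the response fetched by requests may not contain them at all.

```python
from bs4 import BeautifulSoup


def extract_titles(html, selector=".ellipsis-single.ware-brand"):
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


# Hypothetical markup, used only to demonstrate the call.
sample_html = """
<div class="ellipsis-single ware-brand"> Brand A </div>
<div class="ellipsis-single ware-brand">Brand B</div>
<div class="ware-price">100</div>
"""

print(extract_titles(sample_html))
```

In the script above, the same function could be called as extract_titles(load.content). An empty list would suggest the elements are not in the static HTML.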



Source: https://stackoverflow.com/questions/60573846/selenium-code-is-not-able-to-scrape-ofashion-com-cn
