Python Web Scraping: Using selenium + Firefox

Driver downloads:

ChromeDriver: http://chromedriver.storage.googleapis.com/index.html

geckodriver (Firefox): https://github.com/mozilla/geckodriver/releases

I. Starting the Firefox browser with selenium


from selenium import webdriver
# from selenium.webdriver.firefox.firefox_profile import FirefoxProfile


user_agent = 'Mozilla/5.0 (Linux; Android 7.0; BND-AL10 Build/HONORBND-AL10; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044304 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070331) NetType/4G Language/zh_CN Process/tools'

proxy = '127.0.0.1:5000'
proxy = proxy.split(':')    # -> ['127.0.0.1', '5000']


# Run selenium in headless mode
options = webdriver.FirefoxOptions()
options.add_argument('-headless')


# Step 1: create a FirefoxProfile instance
profile = webdriver.FirefoxProfile()
# Step 2: enable manual proxy configuration
profile.set_preference('network.proxy.type', 1)
# Step 3: set the proxy IP
profile.set_preference('network.proxy.http', proxy[0])
# Step 4: set the proxy port; note that the port must be an int, not a string
profile.set_preference('network.proxy.http_port', int(proxy[1]))

# Step 5: make HTTPS use this proxy as well
# profile.set_preference('network.proxy.ssl', proxy[0])
# profile.set_preference('network.proxy.ssl_port', int(proxy[1]))

# Step 6: share one IP and port across all protocols; skip this if each protocol is configured separately, since it defaults to False
profile.set_preference("network.proxy.share_proxy_settings", True)
# Step 7: override the User-Agent request header
profile.set_preference("general.useragent.override", user_agent)

# By default local addresses (localhost) bypass the proxy; to exclude other domains from the proxy, use a setting like the one below
# profile.set_preference("network.proxy.no_proxies_on", "localhost")


# Start firefox with the proxy settings.
# This is the Selenium 3 call signature; Selenium 4 deprecated and later removed
# the positional profile and executable_path arguments.
firefox = webdriver.Firefox(profile, executable_path='/opt/geckodriver', options=options)

firefox.get('http://www.baidu.com')


# Quit
firefox.quit()
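
With Selenium 4, the same preferences can be set directly on the options object, and the geckodriver path is passed through a Service object. The sketch below is one possible arrangement, not the only one. `build_proxy_prefs` and `start_firefox` are hypothetical helper names introduced here, the `use_for_https` switch is an assumption about how you might want to handle step 5, and the driver path is the one used above.

```python
def build_proxy_prefs(proxy, user_agent=None, use_for_https=True):
    """Turn 'host:port' into a dict of Firefox preferences (hypothetical helper)."""
    host, port = proxy.rsplit(':', 1)
    prefs = {
        'network.proxy.type': 1,               # 1 = manual proxy configuration
        'network.proxy.http': host,
        'network.proxy.http_port': int(port),  # the port must be an int
        'network.proxy.share_proxy_settings': True,
    }
    if use_for_https:
        # Step 5 from the listing: route HTTPS through the same proxy
        prefs['network.proxy.ssl'] = host
        prefs['network.proxy.ssl_port'] = int(port)
    if user_agent is not None:
        prefs['general.useragent.override'] = user_agent
    return prefs


def start_firefox(prefs, driver_path='/opt/geckodriver', headless=True):
    """Launch Firefox with the given preferences (assumes selenium >= 4)."""
    # Imported here so build_proxy_prefs stays usable without selenium
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument('-headless')
    for name, value in prefs.items():
        options.set_preference(name, value)
    return webdriver.Firefox(service=Service(executable_path=driver_path),
                             options=options)
```

Calling `start_firefox(build_proxy_prefs('127.0.0.1:5000', user_agent))` should then give a driver configured much like the one in the listing. Keeping the preference dict separate from the launch makes it easy to print and check the settings before a browser is started.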

II. Handling load timeouts

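1. pageLoadTimeout

pageLoadTimeout sets the timeout for a page to load completely, meaning the page is fully rendered and both synchronous and asynchronous scripts have finished running. If you do not set it, the driver waits for the page to finish loading before moving on to the next step (the W3C WebDriver default limit is 300 s). With a timeout of 3 s, the load is interrupted at the 3 s mark and an exception is thrown. In the Java binding:

driver.manage().timeouts().pageLoadTimeout(3, TimeUnit.SECONDS);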

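2. setScriptTimeout
Sets the timeout for asynchronous scripts; it is used the same way as pageLoadTimeout. The asynchronous scripts in question are the ones the driver injects through executeAsyncScript, not script tags in the page that carry the async attribute.
(I usually use this one to fix read-timeout problems; for some reason pageLoadTimeout does not work for me.)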

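3. implicitlyWait
Sets the timeout for locating an element: if the element is not found within the configured time, a NoSuchElementException is thrown. Usage and parameters are the same as above.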
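
In the Python binding, the three settings above are methods on the driver itself. A minimal sketch, assuming a driver object that exposes the standard `set_page_load_timeout`, `set_script_timeout` and `implicitly_wait` methods. `apply_timeouts` is a hypothetical helper name, and the 3-second defaults are placeholders:

```python
def apply_timeouts(driver, page_load=3, script=3, implicit=3):
    """Apply the three WebDriver timeouts, all in seconds (hypothetical helper)."""
    # Counterpart of pageLoadTimeout: how long driver.get() may block
    driver.set_page_load_timeout(page_load)
    # Counterpart of setScriptTimeout: limit for execute_async_script()
    driver.set_script_timeout(script)
    # Counterpart of implicitlyWait: how long element lookups keep retrying
    driver.implicitly_wait(implicit)
    return driver
```

Because the helper only calls methods on whatever it is given, it works the same for `webdriver.Firefox` and `webdriver.Chrome`, for example `apply_timeouts(webdriver.Firefox(), page_load=10)`.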

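4. driver.set_page_load_timeout

I was using selenium to scrape the Renmin University Economics Forum (人大经济论坛). At login, the page never finished loading and kept refreshing, presumably because the forced-login page refreshes itself continuously.
webdriver's get method only returns once the page has finished loading, so nothing after it can run; I therefore set a page-load timeout.
With this setting, get raises a timeout error; once that is swallowed with pass, the rest of the script carries on.
The later steps then need to look things up in what the webdriver returned,
and those lookups raised exceptions again.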

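from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(10)

try:
    driver.get('http://bbs.pinggu.org/plugin.php?id=dsu_paulsign:sign')
except TimeoutException:    # the page took longer than the configured time to load
    pass                    # carry on with the following steps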
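
The lookups that follow a timed-out get may fail because the browser is still busy loading the page. One commonly used workaround, sketched here under that assumption, is to catch only `TimeoutException` and then ask the browser to abandon the load with `window.stop()` before continuing. `get_or_stop` is a hypothetical helper name, and whether `window.stop()` is honoured after a timeout can vary between driver versions, so treat this as something to try rather than a guaranteed fix.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Stand-in so the sketch can be exercised where selenium is not installed
    class TimeoutException(Exception):
        pass


def get_or_stop(driver, url, timeout=10):
    """Load url; on a page-load timeout, stop loading and return False."""
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
        return True
    except TimeoutException:
        # Ask the browser to give up the pending load so later calls can proceed
        driver.execute_script('window.stop()')
        return False
```

The boolean return value lets the caller decide whether the page content is worth querying, instead of silently continuing as the bare pass does.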