Web scraping using Selenium and BeautifulSoup: trouble parsing and clicking a button


Question


I am trying to scrape the website url='https://angel.co/life-sciences'. The site lists more than 8000 entries. From this page I need the company name and link, the joined date, and the number of followers. Before collecting that, I need to sort the followers column by clicking its header, then load more rows by clicking the "more hidden" button. That button can be clicked at most 20 times, after which no further rows load, so by sorting first I can at least capture the entries with the most followers. I have implemented the click() event, but it raises an error:

Unable to locate element: {"method":"xpath","selector":"//div[@class="column followers sortable sortable"]"} #before edit this was my problem, using wrong class name

Do I need to give it more sleep time here? (I tried that, but got the same error.)

After parsing all of the above information, I need to visit each company's individual link and scrape only the content div of that HTML page.

Please suggest a way to do this.

Here is my current code; I have not yet added the HTML parsing part with BeautifulSoup.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = 'https://angel.co/life-sciences'
driver.get(url)
sleep(10)

# Sort the table by followers (edited: I was using the wrong class name before)
driver.find_element_by_xpath('//div[@class="column followers sortable"]').click()
sleep(5)

# Load more rows by clicking the "more hidden" button
for i in range(2):
    driver.find_element_by_xpath('//div[@class="more hidden"]').click()
    sleep(8)

sleep(8)
element = driver.find_element_by_id("root").get_attribute('innerHTML')
#driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'more hidden')))
'''
# wait for the page to load
results = driver.find_elements_by_xpath('//div[@class="name"]')

for result in results:
    startup = result.find_element_by_xpath('.//a')
    link = startup.get_attribute('href')
    print(link)
'''
page_source = driver.page_source

html = BeautifulSoup(element, 'html.parser')
#for link in html.find_all('a', {'class': 'startup-link'}):
#    print(link)

divs = html.find_all("div", class_=" dts27 frw44 _a _jm")

The above code was working and gave me the HTML source before I added the Followers click event.

My final goal is to export all five pieces of information, that is, the company name, its link, the joined date, the number of followers, and the company description (obtained after visiting each individual link), into a CSV or XLS file.

Help and comments are appreciated. This is my first Python and Selenium work, so I am a little confused and need guidance.

Thanks :-)


Answer 1:


The click method is intended to emulate a mouse click; it is meant for elements that can be clicked, such as buttons, drop-down lists, check boxes, and so on. You have applied this method to a div element, which is not clickable. Elements like div, span, and frame are used to organise the HTML and provide decoration, fonts, etc.

To make this code work you will need to identify the elements in the page that are actually clickable.
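When the visible element itself does not respond to a native Selenium click (for instance, because the click handler sits on a child element), one widely used workaround is to trigger the click through JavaScript. This is only a sketch; `js_click` is a helper name I invented, not part of Selenium:

```python
def js_click(driver, element):
    """Click an element via JavaScript instead of a native mouse event.
    This can work even when Selenium refuses to click the element directly."""
    driver.execute_script("arguments[0].click();", element)

# Usage:
# el = driver.find_element_by_xpath('//div[@class="column followers sortable"]')
# js_click(driver, el)
```

Note that this bypasses Selenium's visibility checks, so it should be a fallback rather than the default approach.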




Answer 2:


Oops, a typing mistake, or rather a silly mistake, on my part: I was using the wrong div class name. It is "column followers sortable", but I was using "column followers sortable selected". :-( The code above now works well, but can anyone guide me through the BeautifulSoup HTML parsing part?
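For the BeautifulSoup part, here is a minimal, self-contained sketch of the pattern. The sample markup and class names below are invented for illustration (the real AngelList markup will differ, so inspect the page and substitute the actual classes); it extracts the name, link, joined date, and follower count, then writes them to a CSV as the question describes:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one company row on the page.
sample_html = """
<div class="base startup">
  <div class="name"><a href="https://angel.co/example-startup">Example Startup</a></div>
  <div class="column joined"><div class="value">Jan 2017</div></div>
  <div class="column followers"><div class="value">1,234</div></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in soup.find_all("div", class_="startup"):
    # class_ matches any element whose class list contains the given name
    link_tag = card.find("div", class_="name").find("a")
    name = link_tag.get_text(strip=True)
    link = link_tag["href"]
    joined = card.find("div", class_="joined").find("div", class_="value").get_text(strip=True)
    followers = card.find("div", class_="followers").find("div", class_="value").get_text(strip=True)
    rows.append([name, link, joined, followers])

with open("startups.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Link", "Joined", "Followers"])
    writer.writerows(rows)
```

In the real script you would pass `driver.page_source` (or the innerHTML you already extract) to BeautifulSoup instead of `sample_html`, and fetch the company description in a second pass by visiting each collected link.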



Source: https://stackoverflow.com/questions/46752309/web-scrapping-using-selenium-and-beautifulsoup-trouble-in-parsing-and-selectin
