Question
Hello, I am trying to scrape some data from a website that keeps its data in 'dl' tags. Here is what the page structure looks like:
<div class="ecord-overview col-md-5">
<h2><span itemprop="name">Donald Duck</span></h2>
<dl class="row">
</dd>
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">myemail.com</dd>
</dl>
<div class="ecord-overview col-md-5">
<h2><span itemprop="name">Mickey mouse</span></h2>
<dl class="row">
</dd>
<dt class="col-md-4">Email</dt>
<dd class="col-md-8">youremail.com</dd>
</dl>
... the data goes on like this, but the values differ
To scrape this I am using Selenium. My scraping code:
for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):  # here I'm scraping the name
    #print(Style.RESET_ALL)
    print(Fore.RED + element.text + Style.RESET_ALL)
    #print(Style.RESET_ALL)
    time.sleep(1)
    dl = driver.find_element_by_tag_name('dl')  # scraping data under the dl tag
    print(dl.text)
    print('-----------------------')  # separator
What happens is that whenever I run the program, it prints the same dl content for every name, like this:
donald duck
Email
myemail.com
-------------
mickey mouse
Email
myemail.com
I have already tried putting dl in a for loop, the same way I do to print the name, but then it also prints other things that I don't want.
What can I do?
Answer 1:
driver.find_element_by_tag_name('dl') searches the whole page, so it will always return the first matching element. You need to use element to locate the <dl>s, so that each search is limited to the current record:
for element in driver.find_elements_by_class_name('ThatsThem-record-overview'):
    dl = element.find_element_by_tag_name('dl')  # scraping data under this record's dl tag
    print(dl.text)
Or just locate those elements directly:
for element in driver.find_elements_by_css_selector('.ThatsThem-record-overview dl'):
    print(element.text)
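The difference between a page-wide lookup and a lookup scoped to one record can be reproduced without a browser. The sketch below is only an analogy: it uses Python's standard xml.etree.ElementTree rather than Selenium, on a hypothetical, cleaned-up copy of the markup from the question (the <body> wrapper, the closing </div> tags and the exact class name are assumptions). Here root.find(...) plays the role of driver.find_element(...), and record.find(...) plays the role of element.find_element(...).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the markup from the question.
PAGE = """
<body>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Donald Duck</span></h2>
    <dl class="row">
      <dt class="col-md-4">Email</dt>
      <dd class="col-md-8">myemail.com</dd>
    </dl>
  </div>
  <div class="ThatsThem-record-overview col-md-5">
    <h2><span itemprop="name">Mickey mouse</span></h2>
    <dl class="row">
      <dt class="col-md-4">Email</dt>
      <dd class="col-md-8">youremail.com</dd>
    </dl>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)
records = root.findall("div")  # one element per record

# Page-wide lookup repeated inside the loop: always the first <dl> in the document.
page_wide = [root.find(".//dl/dd").text for _ in records]

# Lookup scoped to each record: the <dl> that belongs to that record.
scoped = [record.find(".//dl/dd").text for record in records]

print(page_wide)  # ['myemail.com', 'myemail.com']
print(scoped)     # ['myemail.com', 'youremail.com']
```

The first list repeats the first record's email, which is the symptom described in the question; the second list follows the records.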
Answer 2:
It seems you were close. Using the class record-overview should have fetched all the required data. However, it would be better to target the individual name and email by traversing to the child tags. Additionally, using WebDriverWait makes the program more reliable, because it waits until the elements are actually visible before reading them.
So ideally you should use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.record-overview>h2>span")))]
emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.record-overview dl.row dd")))]
for name, email in zip(names, emails):
    print("{} Email is {}".format(name, email))
Using XPATH:
names = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]/h2/span")))]
emails = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'record-overview')]//dl[@class='row']//dd")))]
for name, email in zip(names, emails):
    print("{} Email is {}".format(name, email))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
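One caveat with collecting names and emails as two separate lists: zip() pairs them purely by position, so a record without an email would shift every later pairing. Whether that ever happens on the real page cannot be told from the question. A minimal sketch of the per-record alternative follows, again using the standard library's xml.etree.ElementTree as a stand-in for Selenium. The page content is hypothetical, including the extra "Goofy" record that has no email.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the middle record has no <dl> at all.
PAGE = """<body>
  <div class="record"><h2><span>Donald Duck</span></h2>
    <dl><dt>Email</dt><dd>myemail.com</dd></dl></div>
  <div class="record"><h2><span>Goofy</span></h2></div>
  <div class="record"><h2><span>Mickey mouse</span></h2>
    <dl><dt>Email</dt><dd>youremail.com</dd></dl></div>
</body>"""

root = ET.fromstring(PAGE)

# Two independent lists, paired by position (the zip() approach).
names = [span.text for span in root.findall("./div/h2/span")]
emails = [dd.text for dd in root.findall("./div/dl/dd")]
zipped = list(zip(names, emails))

# One lookup per record, so each email stays attached to its own name.
per_record = []
for record in root.findall("./div"):
    name = record.find("./h2/span").text
    dd = record.find("./dl/dd")
    per_record.append((name, dd.text if dd is not None else None))

print(zipped)
print(per_record)
```

With this input, the zip() version attaches the last email to the wrong name and drops the last record, while the per-record version reports the missing email as None.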
Source: https://stackoverflow.com/questions/59751428/selenium-printing-same-information-repeatedly