Question
I need to extract data from the grid on https://applipedia.paloaltonetworks.com (columns: NAME, CATEGORY, SUBCATEGORY, RISK, TECHNOLOGY), but only for apps that have subdomains. For example, the entry on line 5 of the grid, 2ch, has two subdomains: 2ch-base and 2ch-posting. I want a list of all apps like this that have subdomains.
Right now, whenever I try changing anything in this line:
table =wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'tbody#bodyScrollingTable tr')))
I get a timeout error.
Below is my current script, which fetches all the data from the grid, but I need only the apps that have subdomains, together with those subdomains (for example: 2ch, 2ch-base, 2ch-posting). Using inspect element, I found a pattern: all apps that don't have subdomains have ( ), or we can go by the () field, which is common to all apps that have subdomains. Any help with solving this problem would be much appreciated.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path = r'/Users/am/Downloads/chromedriver')
driver.maximize_window()
driver.get("https://applipedia.paloaltonetworks.com/")
wait = WebDriverWait(driver,30)
table =wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'tbody#bodyScrollingTable tr')))
for tab in table:
    print(tab.text)
Answer 1:
For https://applipedia.paloaltonetworks.com/, to get the list of all apps that have subdomains, you need to use WebDriverWait to wait until the desired elements are visible. You can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')
driver.get('https://applipedia.paloaltonetworks.com/')
# Wait for the name links in every row whose ottawagroup is neither '0' nor '2'
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='btmTable' and @id='dataTable']//tbody[@id='bodyScrollingTable']//tr[not(@ottawagroup='0') and not(@ottawagroup='2')]/td/a")))
for element in elements:
    print(element.get_attribute("innerHTML"))
Console Output:
DevTools listening on ws://127.0.0.1:12927/devtools/browser/d4a5d576-a4b0-4a3d-959b-9d37aff36fc6
2ch 51.com adobe-connect adobe-connectnow adobe-creative-cloud aim aim-express ali-wangwang amazon-cloud-drive amazon-music ameba-now assembla autodesk360 avaya-webalive bacnet baidu-hi bebo bitbucket boxnet buddybuddy chinaren cisco-spark cloudapp cloudforge cloudinary concur confluence convo cyph daum dcinside diameter dnp3 dochub docstoc docusign draw.io dropbox egnyte evernote facebook fetion filestack flickr flixwagon fuze-meeting gatherplace genesys git github gitlab glassdoor globalmeet gmail google-calendar google-cloud-storage google-docs google-hangouts google-plus google-spaces google-talk google-translate google-video gotomypc gotowebinar gtp hadoop hightail hipchat hootsuite huddle hulu hyves iccp icloud iec-60870-5-104 imeet imgur instagram instan-t ip-messenger ipsec irc issuu itunes jira join-me jumpshare kaixin kaixin001 kakaotalk laiwang landesk linkedin live-mesh lotus-notes lotuslive lucidpress mail.ru mail.ru-agent maytech meebo meetup mega mendeley mercurial mixi modbus ms-ds-smb ms-lync ms-office365 ms-onedrive msn myspace nateon-im netease-webdisk netflix ning noteworthy now-tv odnoklassniki onehub owncloud paltalk pastebin pcanywhere pinterest pivotaltracker powow prezi proofhub qik qliksense-cloud qq quip quora rally-software readytalk reddit rediffbol renren rtp salesforce sap-jam screencast scribd second-life secure-data-space sendthisfile service-now sharefile sharepoint sharevault showmax siemens-s7 signiant sina-uc sina-weibo skydrive slack slideshare smartsheet snmp softros-messenger solarwinds soundcloud sourceforge spark-im ss7-map stocktwits storify subversion surveymonkey syncplicity tableau teamdrive teamup-calendar teamviewer thwapr torch-browser trello tumblr twitter uc-yun viber vimeo vine virustotal vkontakte vnc watchdox webex wechat weiyun whatsapp windows-azure windows-defender-atp workday yahoo-im yammer youku yousendit youtube yunpan360 yy-voice zalo zendesk zenefits zettahost
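The output above lists only the parent apps, while the question also asks for each app's subdomains. A minimal sketch of the grouping step follows, run against a small static table so it can be tried without a browser. It rests on an assumption inferred from the selectors in this thread, not confirmed by the site: rows with ottawagroup="1" are apps that have subdomains, rows with ottawagroup="2" are the subdomains listed directly beneath them, and rows with ottawagroup="0" are standalone apps. The SAMPLE markup, the standalone row names, and the group_apps helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the grid markup.
# ASSUMPTION: ottawagroup="1" marks an app with subdomains, "2" marks a
# subdomain row that follows its parent, and "0" marks a standalone app.
SAMPLE = """
<tbody id="bodyScrollingTable">
  <tr ottawagroup="0"><td><a>standalone-example-a</a></td></tr>
  <tr ottawagroup="1"><td><a>2ch</a></td></tr>
  <tr ottawagroup="2"><td><a>2ch-base</a></td></tr>
  <tr ottawagroup="2"><td><a>2ch-posting</a></td></tr>
  <tr ottawagroup="0"><td><a>standalone-example-b</a></td></tr>
</tbody>
"""


def group_apps(rows):
    """Map each parent app to its subdomains.

    rows: iterable of (ottawagroup, name) pairs in table order.
    """
    grouped = {}
    current = None  # name of the parent app whose children we are collecting
    for group, name in rows:
        if group == "1":
            current = name
            grouped[current] = []
        elif group == "2" and current is not None:
            grouped[current].append(name)
        else:
            # Standalone row (or unknown marker): stop collecting children.
            current = None
    return grouped


# Build (ottawagroup, name) pairs from the sample rows, in document order.
rows = [
    (tr.get("ottawagroup"), tr.findtext("td/a", default="").strip())
    for tr in ET.fromstring(SAMPLE).iter("tr")
]
print(group_apps(rows))
```

With Selenium, the same pairs could presumably be built from the located tr elements using get_attribute("ottawagroup") and the text of each row's first link, then passed to group_apps unchanged.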
Answer 2:
With the code below you can get the list of apps that have subdomains quickly and cleanly:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[ottawagroup='1'] a")))
domains = driver.execute_script("return [...document.querySelectorAll(\"[ottawagroup='1'] a\")].map(e=>e.textContent.trim())")
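For reuse, the one-liner above can be wrapped in a small helper. This is only a sketch: parent_app_names and FakeDriver are hypothetical names, the helper assumes nothing about the driver beyond an execute_script method, and the de-duplication step is an addition that the original answer does not perform.

```python
# JavaScript run in the page; it uses the same selector as the answer above.
JS_PARENT_APPS = """
return [...document.querySelectorAll("[ottawagroup='1'] a")]
    .map(e => e.textContent.trim());
"""


def parent_app_names(driver):
    """Return non-empty, de-duplicated app names in page order.

    `driver` is any object exposing execute_script(str), such as a
    Selenium WebDriver that has already loaded the page.
    """
    names = driver.execute_script(JS_PARENT_APPS) or []
    seen = set()
    result = []
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


class FakeDriver:
    """Hypothetical stand-in so the helper can be tried without a browser."""

    def execute_script(self, script):
        return ["2ch", " 51.com ", "", "2ch"]


print(parent_app_names(FakeDriver()))
```

In a real session, the FakeDriver instance would be replaced by the driver created earlier, after the WebDriverWait call has confirmed that the rows are present.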
Source: https://stackoverflow.com/questions/52337888/scraping-javascript-data-within-a-grid-of-a-webpage-using-selenium-and-python