scrape

Scrape yt-formatted strings with BeautifulSoup

寵の児 submitted on 2020-07-22 21:34:33
Question: I've tried to scrape yt-formatted strings with BeautifulSoup, but it always gives me an error. Here is my code:

import requests
import bs4
from bs4 import BeautifulSoup

r = requests.get('https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA')
soup = bs4.BeautifulSoup(r.text, "html.parser")

def onoroff():
    onoroff = soup.find('yt-formatted-string', {'id', 'subscriber-count'}).text
    return onoroff

print("Subscribers: " + str(onoroff().strip()))

This is the error I get: AttributeError: 'NoneType'
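The AttributeError means find() returned None, i.e. no yt-formatted-string element was in the downloaded HTML. A minimal defensive sketch of the same lookup, assuming the element is built by JavaScript and therefore usually absent from the static HTML that requests sees (the dict syntax for the id filter is also an assumption about what the original set literal intended):

import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# {'id': 'subscriber-count'} is a dict (attribute -> value); the original
# code passed a set, which does not filter on the id value.
tag = soup.find("yt-formatted-string", {"id": "subscriber-count"})
if tag is None:
    # YouTube renders this element with JavaScript, so the static HTML from
    # requests usually does not contain it; a browser-driven tool such as
    # Selenium, or an official API, would be needed instead.
    print("subscriber-count element not found in the static HTML")
else:
    print("Subscribers: " + tag.text.strip())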

Scrapy Body Text Only

六月ゝ 毕业季﹏ submitted on 2020-06-11 20:12:23
Question: I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet. Hoping someone can help me scrape all the text from the <body> tag.

Answer 1: Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:

x.select("//body").extract()  # extract body

You can find more information
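The x.select() call in the answer is the older selector API; a minimal sketch of the same idea with the current spider API (response.xpath and getall(), which assume a reasonably recent Scrapy version; the start URL is a placeholder):

import scrapy

class BodyTextSpider(scrapy.Spider):
    name = "body_text"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # //body//text() selects every text node inside <body>; join and
        # strip them to get the page text without any tags.
        text = " ".join(
            t.strip()
            for t in response.xpath("//body//text()").getall()
            if t.strip()
        )
        yield {"body_text": text}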

Extract HTML Table Based on Specific Column Headers - Python

こ雲淡風輕ζ submitted on 2020-05-28 06:56:20
Question: I am trying to extract HTML tables from the following URL. For example, the 2019 Director Compensation Table that is on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table' etc. To extract the table I can only think of matching column names or keywords such as "Stock Awards" or "All Other Compensation" and then grabbing the associated table. Is there an easy way to extract these tables based on column names? Or maybe an easier way? Thanks! I am relatively new at
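Matching on a column header is exactly what pandas.read_html supports through its match argument, which keeps only the tables whose text contains the given string. A hedged sketch, assuming the filing is plain HTML and that pandas (with lxml installed) is acceptable; the URL here is a placeholder:

import pandas as pd

# match= filters to tables containing this text, so a distinctive column
# header such as "All Other Compensation" picks out the compensation table
# without needing a table id.
tables = pd.read_html(
    "https://example.com/proxy-statement.htm",  # placeholder URL
    match="All Other Compensation",
)
for df in tables:
    print(df.columns.tolist())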

How can I scrape the Google search result 2nd page only?

情到浓时终转凉″ submitted on 2020-03-05 03:23:18
Question: I'm having trouble scraping the second page of the Google search results. After scraping, the result is saved to a file, but it only scrapes the first page. Here is my code:

$url = 'http://www.google.com/search?q='.$in;
$datenbank = "proxy_work.php";
$datei = fopen($datenbank, "w+");
$datenbank = "proxy_work.php";
fwrite($datei, $url);
fwrite($datei, "\r\n");
fclose($datei);
// echo file_get_contents("proxy_work.php");
$html = file_get_html("proxy_work.php");
foreach($html->find('a'
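Whatever language is used, the key detail is that Google paginates with the start query parameter (start=10 is the second page at 10 results per page). A rough sketch of requesting just that page, shown in Python to keep the added examples in one language; note that automated Google scraping is frequently blocked and may violate Google's terms:

import requests

query = "web scraping"
# start=10 asks for the second page of results (10 results per page).
resp = requests.get(
    "https://www.google.com/search",
    params={"q": query, "start": 10},
    headers={"User-Agent": "Mozilla/5.0"},  # Google rejects empty user agents
)
print(resp.status_code, len(resp.text))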

PHP scrape an HTML page

ⅰ亾dé卋堺 submitted on 2020-01-30 03:29:29
Question: So I am just trying to scrape an HTML page with PHP. I looked on Google for how to do it, and I use the file_get_contents() method. I wrote a little bit of code, but I am already getting an error that I cannot figure out:

$page = file_get_contents('http://php.net/supported-versions.php');
$doc = new DOMDocument($page);
//print_r($page);
foreach ($doc->getElementsByTagName('table') as $node) {
    print_r($node);
}

The first, commented-out print_r statement DOES print the page, but
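One likely issue is that the DOMDocument constructor takes an XML version string, not the page markup; the HTML is normally passed to loadHTML() after construction. As a rough cross-language sketch of the same fetch-and-walk-the-tables task, written in Python to keep the added examples in one language:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://php.net/supported-versions.php").text
soup = BeautifulSoup(page, "html.parser")

# Walk every <table> element, the same traversal the PHP loop attempts.
for table in soup.find_all("table"):
    print(table.get("class"), len(table.find_all("tr")))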

Node.js web scraper for a password-protected website

我们两清 submitted on 2020-01-29 05:21:05
Question: I am trying to scrape a website using Node.js, and it works perfectly on sites that do not require any authentication. But whenever I try to scrape a site with a form that requires a username and password, I only get the HTML of the authentication page (that is, the HTML you would see by clicking 'view page source' on the authentication page itself). I am able to get the desired HTML using curl:

curl -d "username=myuser&password=mypw&submit=Login" URL

Here is my code...

var express =
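The curl command works because it POSTs the login form fields first; a scraper has to do the same and then reuse the resulting session cookie for the follow-up request. A rough sketch of that pattern in Python (keeping the added examples in one language, since the Node code above is cut off); the endpoint URLs are placeholders and the form field names are copied from the curl command:

import requests

LOGIN_URL = "https://example.com/login"     # placeholder login endpoint
PROTECTED_URL = "https://example.com/data"  # placeholder authenticated page

with requests.Session() as s:
    # POST the same fields the curl command sends; the Session keeps the
    # authentication cookie for the next request.
    s.post(LOGIN_URL, data={"username": "myuser",
                            "password": "mypw",
                            "submit": "Login"})
    html = s.get(PROTECTED_URL).text
    print(html[:200])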

Scraping 'N' pages with BeautifulSoup and Requests (how to obtain the true page number)

梦想与她 submitted on 2020-01-14 05:57:09
Question: I want to get all the titles on the website http://www.shyan.gov.cn/zwhd/web/webindex.action. Right now, my code successfully scrapes only one page, but there are multiple pages on the site that I would like to scrape. For example, when I click the link to "page 2" at the URL above, the overall URL does NOT change. I looked at the page source and saw JavaScript code to advance to the next page, like this: javascript:gotopage(2) or javascript:void(0). My code is here
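A javascript:gotopage(2) link usually re-submits a form or fires an XHR with the page number, so the request to reproduce has to be read from the browser's developer-tools network tab. A hedged sketch of that pattern; the currentPage field name is an assumption for illustration, not something confirmed from this site:

import requests
from bs4 import BeautifulSoup

url = "http://www.shyan.gov.cn/zwhd/web/webindex.action"

for page in range(1, 4):
    # gotopage(n) typically posts the page number back to the same action;
    # the real field name must be taken from the network tab.
    resp = requests.post(url, data={"currentPage": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a"):
        title = a.get_text(strip=True)
        if title:
            print(page, title)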

Python scraper unable to scrape img src

こ雲淡風輕ζ submitted on 2020-01-13 06:52:14
Question: I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 and the Requests and BeautifulSoup libraries. The scraped image tags give a blank "src". SRC:

from bs4 import BeautifulSoup
import requests
import cfscrape  # needed for the create_scraper() call below

scraper = cfscrape.create_scraper()
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
response = requests.get(url)
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
for img in divImage.findAll('img'):
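Blank src attributes usually mean the real image URLs are filled in later by JavaScript or kept in a lazy-load attribute, and this site also sits behind Cloudflare, which is why a cfscrape scraper is created above (the plain requests.get bypasses it). A hedged sketch that uses the cfscrape session for the request and checks common lazy-load attributes; the data-src and data-original names are guesses, not confirmed for this site:

import cfscrape
from bs4 import BeautifulSoup

url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"

# cfscrape wraps a requests session and solves Cloudflare's challenge page,
# which a plain requests.get() would return instead of the real page.
scraper = cfscrape.create_scraper()
soup = BeautifulSoup(scraper.get(url).text, "html.parser")

div_image = soup.find("div", {"id": "divImage"})
if div_image is not None:
    for img in div_image.find_all("img"):
        # Try src first, then common lazy-load attributes (assumed names).
        link = img.get("src") or img.get("data-src") or img.get("data-original")
        print(link)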