scrape

Scrape yt-formatted strings with BeautifulSoup

寵の児 submitted on 2020-07-22 21:34:33
Question: I've tried to scrape yt-formatted strings with BeautifulSoup, but it always gives me an error. Here is my code:

import requests
import bs4
from bs4 import BeautifulSoup

r = requests.get('https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA')
soup = bs4.BeautifulSoup(r.text, "html.parser")

def onoroff():
    onoroff = soup.find('yt-formatted-string', {'id', 'subscriber-count'}).text
    return onoroff

print("Subscribers: " + str(onoroff().strip()))

This is the error I get: AttributeError: 'NoneType'
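The AttributeError means find() returned None, i.e. no yt-formatted-string element was in the downloaded HTML. A minimal defensive sketch of the same lookup, assuming the element is built by JavaScript and therefore usually absent from the static HTML that requests sees (the dict syntax for the id filter is also an assumption about what the original set literal intended):

import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/channel/UCPyMcv4yIDfETZXoJms1XFA"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# {'id': 'subscriber-count'} is a dict (attribute -> value); the original
# code passed a set, which does not filter on the id value.
tag = soup.find("yt-formatted-string", {"id": "subscriber-count"})
if tag is None:
    # YouTube renders this element with JavaScript, so the static HTML from
    # requests usually does not contain it; a browser-driven tool such as
    # Selenium, or an official API, would be needed instead.
    print("subscriber-count element not found in the static HTML")
else:
    print("Subscribers: " + tag.text.strip())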

Scrapy Body Text Only

六月ゝ 毕业季﹏ submitted on 2020-06-11 20:12:23
Question: I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet. Hoping someone can help me scrape all the text from the <body> tag.

Answer 1: Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector:

x.select("//body").extract()  # extract body

You can find more information
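The x.select() call in the answer is the older selector API; a minimal sketch of the same idea with the current spider API (response.xpath and getall(), which assume a reasonably recent Scrapy version; the start URL is a placeholder):

import scrapy

class BodyTextSpider(scrapy.Spider):
    name = "body_text"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # //body//text() selects every text node inside <body>; join and
        # strip them to get the page text without any tags.
        text = " ".join(
            t.strip()
            for t in response.xpath("//body//text()").getall()
            if t.strip()
        )
        yield {"body_text": text}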

Extract HTML Table Based on Specific Column Headers - Python

こ雲淡風輕ζ submitted on 2020-05-28 06:56:20
Question: I am trying to extract HTML tables from the following URL. For example, the 2019 Director Compensation Table that is on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table' etc. To extract the table I can only think of matching column names or keywords such as "Stock Awards" or "All Other Compensation" and then grabbing the associated table. Is there an easy way to extract these tables based on column names? Or maybe an easier way? Thanks! I am relatively new at
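Matching on a column header is exactly what pandas.read_html supports through its match argument, which keeps only the tables whose text contains the given string. A hedged sketch, assuming the filing is plain HTML and that pandas (with lxml installed) is acceptable; the URL here is a placeholder:

import pandas as pd

# match= filters to tables containing this text, so a distinctive column
# header such as "All Other Compensation" picks out the compensation table
# without needing a table id.
tables = pd.read_html(
    "https://example.com/proxy-statement.htm",  # placeholder URL
    match="All Other Compensation",
)
for df in tables:
    print(df.columns.tolist())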

How can I scrape the Google search result 2nd page only?

情到浓时终转凉″ submitted on 2020-03-05 03:23:18
Question: I'm having trouble scraping the second page of the Google search results. After scraping, the result is saved to a file, but it only scrapes the first page. Here is my code:

$url = 'http://www.google.com/search?q='.$in;
$datenbank = "proxy_work.php";
$datei = fopen($datenbank, "w+");
$datenbank = "proxy_work.php";
fwrite($datei, $url);
fwrite($datei, "\r\n");
fclose($datei);
// echo file_get_contents("proxy_work.php");
$html = file_get_html("proxy_work.php");
foreach($html->find('a'
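Whatever language is used, the key detail is that Google paginates with the start query parameter (start=10 is the second page at 10 results per page). A rough sketch of requesting just that page, shown in Python to keep the added examples in one language; note that automated Google scraping is frequently blocked and may violate Google's terms:

import requests

query = "web scraping"
# start=10 asks for the second page of results (10 results per page).
resp = requests.get(
    "https://www.google.com/search",
    params={"q": query, "start": 10},
    headers={"User-Agent": "Mozilla/5.0"},  # Google rejects empty user agents
)
print(resp.status_code, len(resp.text))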

PHP scrape an HTML page

ⅰ亾dé卋堺 submitted on 2020-01-30 03:29:29
Question: So I am just trying to scrape an HTML page with PHP. I looked on Google for how to do it, and I use the file_get_contents() method. I wrote a little bit of code, but I am already getting an error that I cannot figure out:

$page = file_get_contents('http://php.net/supported-versions.php');
$doc = new DOMDocument($page);
//print_r($page);
foreach ($doc->getElementsByTagName('table') as $node) {
    print_r($node);
}

The first, commented-out print_r statement DOES print the page, but
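One likely issue is that the DOMDocument constructor takes an XML version string, not the page markup; the HTML is normally passed to loadHTML() after construction. As a rough cross-language sketch of the same fetch-and-walk-the-tables task, written in Python to keep the added examples in one language:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://php.net/supported-versions.php").text
soup = BeautifulSoup(page, "html.parser")

# Walk every <table> element, the same traversal the PHP loop attempts.
for table in soup.find_all("table"):
    print(table.get("class"), len(table.find_all("tr")))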

Node.js web scraper for a password-protected website

我们两清 submitted on 2020-01-29 05:21:05
Question: I am trying to scrape a website using Node.js, and it works perfectly on sites that do not require any authentication. But whenever I try to scrape a site with a form that requires a username and password, I only get the HTML of the authentication page (that is, the HTML you would see by clicking 'view page source' on the authentication page itself). I am able to get the desired HTML using curl:

curl -d "username=myuser&password=mypw&submit=Login" URL

Here is my code...

var express =
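The curl command works because it POSTs the login form fields first; a scraper has to do the same and then reuse the resulting session cookie for the follow-up request. A rough sketch of that pattern in Python (keeping the added examples in one language, since the Node code above is cut off); the endpoint URLs are placeholders and the form field names are copied from the curl command:

import requests

LOGIN_URL = "https://example.com/login"     # placeholder login endpoint
PROTECTED_URL = "https://example.com/data"  # placeholder authenticated page

with requests.Session() as s:
    # POST the same fields the curl command sends; the Session keeps the
    # authentication cookie for the next request.
    s.post(LOGIN_URL, data={"username": "myuser",
                            "password": "mypw",
                            "submit": "Login"})
    html = s.get(PROTECTED_URL).text
    print(html[:200])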

Scraping 'N' pages with BeautifulSoup and Requests (how to obtain the true page number)

梦想与她 submitted on 2020-01-14 05:57:09
Question: I want to get all the titles on the website http://www.shyan.gov.cn/zwhd/web/webindex.action. Right now, my code successfully scrapes only one page, but there are multiple pages on the site that I would like to scrape. For example, when I click the link to "page 2" at the URL above, the overall URL does NOT change. I looked at the page source and saw JavaScript code to advance to the next page, like this: javascript:gotopage(2) or javascript:void(0). My code is here
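A javascript:gotopage(2) link usually re-submits a form or fires an XHR with the page number, so the request to reproduce has to be read from the browser's developer-tools network tab. A hedged sketch of that pattern; the currentPage field name is an assumption for illustration, not something confirmed from this site:

import requests
from bs4 import BeautifulSoup

url = "http://www.shyan.gov.cn/zwhd/web/webindex.action"

for page in range(1, 4):
    # gotopage(n) typically posts the page number back to the same action;
    # the real field name must be taken from the network tab.
    resp = requests.post(url, data={"currentPage": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a"):
        title = a.get_text(strip=True)
        if title:
            print(page, title)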

Python scraper unable to scrape img src

こ雲淡風輕ζ submitted on 2020-01-13 06:52:14
Question: I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 and the Requests and BeautifulSoup libraries. The scraped image tags give a blank "src". SRC:

from bs4 import BeautifulSoup
import requests
import cfscrape  # needed for the create_scraper() call below

scraper = cfscrape.create_scraper()
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
response = requests.get(url)
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
for img in divImage.findAll('img'):
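Blank src attributes usually mean the real image URLs are filled in later by JavaScript or kept in a lazy-load attribute, and this site also sits behind Cloudflare, which is why a cfscrape scraper is created above (the plain requests.get bypasses it). A hedged sketch that uses the cfscrape session for the request and checks common lazy-load attributes; the data-src and data-original names are guesses, not confirmed for this site:

import cfscrape
from bs4 import BeautifulSoup

url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"

# cfscrape wraps a requests session and solves Cloudflare's challenge page,
# which a plain requests.get() would return instead of the real page.
scraper = cfscrape.create_scraper()
soup = BeautifulSoup(scraper.get(url).text, "html.parser")

div_image = soup.find("div", {"id": "divImage"})
if div_image is not None:
    for img in div_image.find_all("img"):
        # Try src first, then common lazy-load attributes (assumed names).
        link = img.get("src") or img.get("data-src") or img.get("data-original")
        print(link)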