web-crawler

I get no data when downloading images

笑着哭i submitted on 2021-02-11 14:53:27
Question: I am trying to download images with this code. The downloads succeed, but the images come out without any data and corrupted, as if they have 0 bytes.

    function get_chapter_images(){
        include('simple_html_dom.php');
        $url = 'http://localhost/wordpress/manga/manga-name-ain/chapter-4/';
        $html = file_get_html($url);
        $images_url = array();
        foreach($html->find('.page-break img') as $e){
            $image_links = $e->src;
            array_push($images_url, $image_links);
        }
        return $images_url;
    }
    $images_links = get_chapter…
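The excerpt cuts off before the download step itself. As a minimal sketch of the usual fix, written in Python rather than the asker's PHP and assuming the URLs returned by get_chapter_images() are available as a list: fetch each image URL and write the raw response bytes in binary mode, sending a Referer header, since hotlink protection answering with an empty body is a common cause of 0-byte files.

```python
import os
import requests

# Minimal sketch (not the asker's PHP): fetch each scraped image URL and
# write the raw bytes. The chapter URL comes from the question; the output
# folder name and file naming scheme are made up for illustration.
chapter_url = "http://localhost/wordpress/manga/manga-name-ain/chapter-4/"
image_urls = []  # fill with the list returned by get_chapter_images()

os.makedirs("images", exist_ok=True)
for i, src in enumerate(image_urls):
    resp = requests.get(src,
                        headers={"Referer": chapter_url,
                                 "User-Agent": "Mozilla/5.0"},
                        timeout=30)
    resp.raise_for_status()
    with open(os.path.join("images", f"page_{i}.jpg"), "wb") as f:
        f.write(resp.content)  # write binary content, not decoded text
```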

CrawlSpider seems not to follow its rule

柔情痞子 submitted on 2021-02-11 14:32:22
Question: Here's my code. I followed the example in "Recursively Scraping Web Pages With Scrapy", and it seems I have made a mistake somewhere. Can someone help me find it, please? It's driving me crazy: I want the results from all of the result pages, but instead it only gives me the results from page 1. Here's my code:

    import scrapy
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.http.request import Request
    from scrapy.contrib.linkextractors.sgml…
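For reference, a minimal sketch of a CrawlSpider whose rule does follow pagination; the start URL, allow pattern, and selectors are placeholders. Two points matter here: scrapy.contrib.linkextractors.sgml (used in the excerpt) is long deprecated in favour of scrapy.linkextractors.LinkExtractor, and the callback must not be named parse, or it silently overrides CrawlSpider's own link-following logic.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ResultsSpider(CrawlSpider):
    name = "results"
    start_urls = ["https://example.com/results?page=1"]  # placeholder

    # Follow every pagination link and parse each result page.
    rules = (
        Rule(LinkExtractor(allow=r"page=\d+"),
             callback="parse_results",  # deliberately NOT "parse"
             follow=True),
    )

    def parse_results(self, response):
        for row in response.css(".result"):  # placeholder selector
            yield {"title": row.css("a::text").get()}
```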

Crawling a JavaScript table, but the request is considered abnormal

我怕爱的太早我们不能终老 submitted on 2021-02-10 15:35:42
Question: I am trying to crawl this currency-rate website: https://banking.nonghyup.com/servlet/PGEF0011I.view (just click the exchange-rate check and you'll see a table like the one pictured). I want to crawl this dynamic table, and via Inspect -> Network I found its URL: https://banking.nonghyup.com/servlet/PGEF0012R.frag. However, when I request it with Selenium, it returns an error about abnormal requests: "We're sorry for causing you any inconvenience. You will not be able to use the…
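A hedged sketch of the usual requests-only approach to a dynamically loaded table: open the hosting page in a Session first so the server's cookies are reused, then call the fragment URL found in the Network tab with browser-like headers. The form payload that PGEF0012R.frag expects is not visible in the excerpt, so it is left as an empty placeholder here.

```python
import requests

# Sketch under assumptions: the fragment endpoint may still reject requests
# without the exact parameters a real browser sends; this only shows the
# session/cookie/Referer pattern, with an empty placeholder payload.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://banking.nonghyup.com/servlet/PGEF0011I.view",
})
session.get("https://banking.nonghyup.com/servlet/PGEF0011I.view")  # pick up cookies
resp = session.post("https://banking.nonghyup.com/servlet/PGEF0012R.frag",
                    data={})  # placeholder payload
print(resp.status_code, resp.text[:300])
```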

How can I make sure that I am on the About Us page of a particular website

笑着哭i submitted on 2021-02-08 11:50:10
Question: Here's a snippet of code I am trying to use to retrieve all the links from a website, given the URL of its homepage.

    import requests
    from BeautifulSoup import BeautifulSoup

    url = "https://www.udacity.com"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))

    def getURL(page):
        start_link = page.find("a href")
        if start_link == -1:
            return None, 0
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        url = page[start_quote + 1: end_quote]…
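A minimal sketch using BeautifulSoup 4 (bs4) instead of manual string searching: collect every anchor on the homepage, resolve it to an absolute URL, and keep the ones that look like an "About" page. The keyword test ("about" in the URL or anchor text) is only an assumption about how such pages are usually named.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.udacity.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

about_links = []
for a in soup.find_all("a", href=True):
    full = urljoin(url, a["href"])          # make relative hrefs absolute
    text = a.get_text(strip=True).lower()
    if "about" in full.lower() or "about" in text:
        about_links.append(full)

print(about_links)
```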

DatabaseError: "not all arguments converted during string formatting" when I use pandas.io.sql.to_sql()

时光总嘲笑我的痴心妄想 submitted on 2021-02-08 11:26:59
Question: I have a table, and I try to import it with SQLAlchemy. The code is:

    import sqlalchemy as db
    import pandas.io.sql as sql

    username = 'root'
    password = 'root'
    host = 'localhost'
    port = '3306'
    database = 'classicmodels'
    engine = db.create_engine(f'mysql+pymysql://{username}:{password}@{host}:{port}/{database}')
    con = engine.raw_connection()

    # read into dataframe
    df = pd.read_sql(f'SELECT * FROM `{database}`.`offices`;', con)
    print(df[:2])

    df_append = pd.DataFrame([{'officeCode': 8,…
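One commonly suggested remedy, sketched here with the credentials from the question: hand pandas the SQLAlchemy engine itself rather than engine.raw_connection(). With a raw DBAPI connection pandas falls back to its legacy SQL path and builds "?"-style placeholders that pymysql cannot format, which is a frequent source of "not all arguments converted during string formatting".

```python
import pandas as pd
import sqlalchemy as db

# Sketch, reusing the connection string from the question.
engine = db.create_engine("mysql+pymysql://root:root@localhost:3306/classicmodels")

df = pd.read_sql("SELECT * FROM offices", engine)
print(df[:2])

# Only the one column visible in the excerpt is shown here; a real append
# needs values for every required column of `offices`.
df_append = pd.DataFrame([{"officeCode": 8}])
df_append.to_sql("offices", engine, if_exists="append", index=False)
```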

Loop over URLs from a dataframe and download PDF files in Python

☆樱花仙子☆ submitted on 2021-02-08 10:16:36
Question: Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here. Now I would like to go further and follow each URL: for each one I need to open and save the PDF file. How could I do that in Python? Any help would be greatly appreciated. Code for reference:

    import shutil
    from bs4 import BeautifulSoup
    import requests
    import os
    from urllib.parse import urlparse

    url = 'xxx'
    for page in range(6):
        r = requests…
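A minimal sketch under stated assumptions: the crawled links were saved to an Excel file named transactions.xlsx with a column named url (both names are invented here). Each saved link is requested and, when the response is a PDF, written to a local folder.

```python
import os
import pandas as pd
import requests
from urllib.parse import urlparse

df = pd.read_excel("transactions.xlsx")   # hypothetical file and column names
os.makedirs("pdfs", exist_ok=True)

for link in df["url"].dropna():
    r = requests.get(link, timeout=30)
    # Keep only responses that actually serve a PDF document.
    if r.ok and "pdf" in r.headers.get("Content-Type", "").lower():
        name = os.path.basename(urlparse(link).path) or "document.pdf"
        with open(os.path.join("pdfs", name), "wb") as f:
            f.write(r.content)
```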

Extract text from 200k domains with Scrapy

喜欢而已 submitted on 2021-02-08 07:51:28
Question: My problem is this: I want to extract all valuable text from a given domain, for example www.example.com. So I visit the website, follow all links down to a maximum depth of 2, and write the text to a CSV file. I wrote a Scrapy module that solves this with one process yielding multiple crawlers, but it is inefficient: I can crawl roughly 1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC running for some time, I found that my network connection…
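A hedged sketch of one way around a single-process CPU ceiling: shard the domain list and give every shard its own OS process running its own Scrapy CrawlerProcess. TextSpider below is only a minimal stand-in (front page only) for the asker's depth-2 spider, which is not shown in the excerpt.

```python
from multiprocessing import Pool

import scrapy
from scrapy.crawler import CrawlerProcess

class TextSpider(scrapy.Spider):
    """Minimal stand-in spider: fetch each domain's front page and yield its text."""
    name = "text"

    def __init__(self, domains=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [f"http://{d}" for d in (domains or [])]

    def parse(self, response):
        yield {"url": response.url,
               "text": " ".join(response.css("body *::text").getall())}

def crawl_shard(domains):
    # Each worker process runs its own reactor and crawler.
    process = CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})
    process.crawl(TextSpider, domains=domains)
    process.start()  # blocks until this shard is done

if __name__ == "__main__":
    domains = ["example.com", "example.org"]  # placeholder for the 200k-domain list
    n_workers = 8
    shards = [domains[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        pool.map(crawl_shard, [s for s in shards if s])
```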

How to get the full web address with BeautifulSoup

时光怂恿深爱的人放手 submitted on 2021-02-08 07:05:13
Question: I cannot figure out how to get the full address of a link: I get, for example, "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply prepend the page URL to the link, as that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make this work for any website, so I am looking for a general solution. Here is the code:

    from bs4 import BeautifulSoup
    import requests

    url = "https://en.wikipedia.org/wiki/WKIK"
    r = requests.get(url)
    data = r…
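A minimal sketch of the standard solution: resolve every href against the page it was found on with urllib.parse.urljoin, which handles relative paths like "/wiki/Main_Page" as well as links that are already absolute, so it works for any site.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://en.wikipedia.org/wiki/WKIK"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.find_all("a", href=True):
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    print(urljoin(url, a["href"]))
```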
