scrapy

Unbalanced parenthesis error with Regex

主宰稳场 submitted on 2021-02-16 05:32:24
Question: I am using the following regex to obtain all data from a website's JavaScript data source that is contained within the character pattern [[]]); The code I am using is this: regex = r'\[\[.*?\]]);' match2 = re.findall(regex, response.body, re.S) print match2 This throws the error message: raise error, v # invalid expression sre_constants.error: unbalanced parenthesis I think I am fairly safe in assuming that this is caused by the closing bracket within my regex. How can
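The error comes from the unescaped ")" at the end of the pattern: ")" is a regex metacharacter, and with no matching "(" the compiler reports an unbalanced parenthesis. A minimal sketch of the likely fix (the sample string here is made up, standing in for response.body):

import re

sample = 'callback([["a", 1], ["b", 2]]); trailing text'

regex = r'\[\[.*?\]\]\);'   # escape the closing "]" and the ")"
match2 = re.findall(regex, sample, re.S)
print(match2)               # ['[["a", 1], ["b", 2]]);']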

Scraping gallery images with Scrapy

此生再无相见时 submitted on 2021-02-15 13:26:40
Crawl the image data of an entire site with Scrapy, launching the crawl with CrawlerProcess.

# -*- coding: utf-8 -*-
import scrapy
import requests
from bs4 import BeautifulSoup

from meinr.items import MeinrItem


class Meinr1Spider(scrapy.Spider):
    name = 'meinr1'
    # allowed_domains = ['www.baidu.com']
    # start_urls = ['http://m.tupianzj.com/meinv/xiezhen/']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }

    def num(self, url, headers):  # get the page count and URL format of each category
        html = requests.get(url=url, headers=headers)
        if html
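The CrawlerProcess launch mentioned above is not shown in the excerpt; a minimal sketch of what it typically looks like, assuming the spider name 'meinr1' from the snippet and a standard project layout:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('meinr1')   # spider name registered above
process.start()           # blocks until the crawl is finished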

Python-Scrapy: scraping Meizitu gallery images

↘锁芯ラ submitted on 2021-02-15 12:25:19
Written before starting — study links:
https://segmentfault.com/a/1190000003870052
scrapy.ItemLoader: http://docs.pythontab.com/scrapy/scrapy0.24/topics/loaders.html
workflow: http://www.jianshu.com/p/5b6fbf9245f8
pagination: http://blog.sina.com.cn/s/blog_737463190102wk8x.html
logging: http://blog.csdn.net/arbel/article/details/7781121
Points to note about the site: each list page contains several gallery topics, and list pages follow the pattern http://www.meizitu.com/a/list_1_1.html; an individual gallery page follows the pattern http://www.meizitu.com/a/5460.html; each gallery page contains several images whose actual URLs look like http://mm.howkuai.com/wp-content/uploads/2016a/09/02/01.jpg. The list-page format has changed, so len(pages) can end up in an infinite loop, and the site now checks requests, so headers must be added. item: use ItemLoader to gather the data (a sketch follows below). spider: yield acts as an iterator.
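A minimal sketch of the ItemLoader pattern mentioned above; the field names and XPaths here are illustrative, not the original project's:

import scrapy
from scrapy.loader import ItemLoader

class GalleryItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()

class GallerySpider(scrapy.Spider):
    name = 'gallery_example'
    start_urls = ['http://www.meizitu.com/a/5460.html']

    def parse(self, response):
        loader = ItemLoader(item=GalleryItem(), response=response)
        loader.add_xpath('title', '//h2/a/text()')
        loader.add_xpath('image_urls', '//div[@id="picture"]//img/@src')
        yield loader.load_item()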

Scraping gallery images with the Scrapy framework

孤人 submitted on 2021-02-15 12:09:52
First, create a project (the complete code can be downloaded from this GitHub account: https://github.com/connordb/scrapy-jiandan2): scrapy startproject jiandan2. Open the project files in PyCharm and create a new spider file in the terminal: scrapy genspider jiandan jandan.net/ooxx. Configure the fields we need in Items:

import scrapy

class Jiandan2Item(scrapy.Item):
    # define the fields for your item here like:
    img_url = scrapy.Field()    # link to the image
    img_name = scrapy.Field()

In the jian_pan file we begin parsing the page:

import base64
from jiandan2 import item

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/ooxx']

    def parse(self, response):
        img = response.xpath('//div[@id="comments
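The excerpt cuts off inside parse(); a hedged sketch of how it might continue — the img-hash XPath and the base64 decoding are guesses suggested by the base64 import, not the repository's verified code:

import base64
import scrapy
from jiandan2.items import Jiandan2Item

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/ooxx']

    def parse(self, response):
        # image addresses on the page are assumed to be base64-encoded; decode to get the URL
        for img_hash in response.xpath('//span[@class="img-hash"]/text()').getall():
            item = Jiandan2Item()
            item['img_url'] = 'http:' + base64.b64decode(img_hash).decode('utf-8')
            item['img_name'] = item['img_url'].split('/')[-1]
            yield item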

Scraping nationwide Python job postings from Lagou and analysing salary, experience, and education data

柔情痞子 submitted on 2021-02-14 07:44:29
First go to the Lagou pages for crawler-related positions and confirm that the page is loaded via JavaScript. By analysing the real requests with the Chrome developer tools, the actual data turns out to live behind links starting with position.Ajax, requested via POST. Fetching it with requests' post method did not return the desired data, which means headers (and a delay between requests) are needed. The list-page information that Lagou loads with JS generally appears under XHR and JS in the network panel; sending the same Ajax POST request, built from the fields shown in Form Data of the Ajax request, returns the page information. Below is the Scrapy spider code:

import scrapy
import json
from lagou.items import LagouItem

class LagoupositionSpider(scrapy.Spider):
    name = 'lagouposition'
    allowed_domains = ['lagou.com']
    kd = input('请输入你要搜索的职位信息:')
    ct = input('请输入要搜索的城市信息')
    page = 1
    start_urls = ["https://www.lagou.com/jobs/list_" + str(kd) + "&city=" + str
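A minimal sketch of the Ajax POST step described above using scrapy.FormRequest; the endpoint, form fields, and headers follow the post's description but are assumptions rather than values verified against Lagou today:

import json
import scrapy

class LagouAjaxSpider(scrapy.Spider):
    name = 'lagou_ajax_example'

    def start_requests(self):
        # assumed Ajax endpoint and form fields, mirroring the Form Data described above
        url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
        headers = {
            'Referer': 'https://www.lagou.com/jobs/list_python',
            'User-Agent': 'Mozilla/5.0',
        }
        formdata = {'first': 'true', 'pn': '1', 'kd': 'python'}
        yield scrapy.FormRequest(url, formdata=formdata, headers=headers,
                                 callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        for job in data['content']['positionResult']['result']:
            yield {'positionName': job['positionName'], 'salary': job['salary']}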

Python multithreading

假如想象 submitted on 2021-02-13 12:02:35
Python's standard library provides two modules: thread and threading. thread is the low-level module; threading is the high-level module, which wraps thread. 1. Creating threads with the threading module. The first way is to pass a function in, create a Thread instance, and then call its start method to run it:

#!coding:utf-8
import random
import time, threading

# code executed by the new thread
def thread_run(urls):
    print 'Current %s in running...' % threading.current_thread().name
    for url in urls:
        print '%s --->>> %s ' % (threading.current_thread().name, url)
        time.sleep(random.random())
    print('%s ended. ' % threading.current_thread().name)

print('%s is running... ' % threading.current_thread().name)
t1 = threading.Thread(target=thread_run, name='Thread_1', args=(['url_1','url_2','url_3'],))
t2 =
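For reference, a Python 3 sketch of the same pattern (the excerpt above is Python 2 and cuts off at t2; the second thread and the joins here are illustrative):

import random
import threading
import time

def thread_run(urls):
    print('Current %s is running...' % threading.current_thread().name)
    for url in urls:
        print('%s --->>> %s' % (threading.current_thread().name, url))
        time.sleep(random.random())
    print('%s ended.' % threading.current_thread().name)

if __name__ == '__main__':
    print('%s is running...' % threading.current_thread().name)
    t1 = threading.Thread(target=thread_run, name='Thread_1',
                          args=(['url_1', 'url_2', 'url_3'],))
    t2 = threading.Thread(target=thread_run, name='Thread_2',
                          args=(['url_4', 'url_5', 'url_6'],))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print('%s ended.' % threading.current_thread().name)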

Scrapy - how to manage pagination without 'Next' button?

烈酒焚心 submitted on 2021-02-11 18:03:36
Question: I'm scraping the content of articles from a site like this where there is no 'Next' button to follow. ItemLoader is passed from parse_issue in the response.meta object, as well as some additional data like section_name. Here is the function: def parse_article(self, response): self.logger.info('Parse function called parse_article on {}'.format(response.url)) acrobat = response.xpath('//div[@class="txt__lead"]/p[contains(text(), "Plik do pobrania w wersji (pdf) - wymagany Acrobat Reader")]')
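A hedged sketch of one common way to handle pagination when there is no 'Next' link: build the page URLs yourself and stop when a page comes back empty. The URL pattern and selectors are assumptions, not the asker's actual site:

import scrapy

class IssueSpider(scrapy.Spider):
    name = 'issue_example'
    base_url = 'https://example.com/issue?page={}'

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(1),
                             callback=self.parse_issue, cb_kwargs={'page': 1})

    def parse_issue(self, response, page):
        links = response.xpath('//article//a/@href').getall()
        if not links:            # an empty page means we ran out of results
            return
        for href in links:
            yield response.follow(href, callback=self.parse_article)
        # request the next page by incrementing the page number ourselves
        yield scrapy.Request(self.base_url.format(page + 1),
                             callback=self.parse_issue,
                             cb_kwargs={'page': page + 1})

    def parse_article(self, response):
        yield {'url': response.url}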

I want to print a proper table out of data scraped using Scrapy

匆匆过客 submitted on 2021-02-11 17:20:38
Question: So I have written all the code to scrape the table from [http://www.rarityguide.com/cbgames_view.php?FirstRecord=21][1], but I am getting output like this: # the output that I get {'EXG': (['17.00', '10.00', '90.00', '9.00', '13.00', '17.00', '16.00', '43.00', '125.00', '16.00', '11.00', '150.00', '17.00', '24.00', '15.00', '24.00', '21.00', '36.00', '270.00', '280.00'],), 'G': ['8.00', '5.00', '38.00', '2.00', '6.00', '7.00', '6.00', '20.00', '40.00', '7.00', '5.00', '70.00', '6.00', '12.00', '7.00',
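A minimal sketch (not the asker's code) of printing such column lists as an aligned table; note that 'EXG' in the output above is wrapped in a tuple, probably from a stray trailing comma, so it is unwrapped first:

data = {
    'EXG': (['17.00', '10.00', '90.00'],),   # tuple-wrapped, as in the output
    'G': ['8.00', '5.00', '38.00'],
}

# unwrap single-element tuples so every value is a plain list of strings
columns = {k: (v[0] if isinstance(v, tuple) else v) for k, v in data.items()}

print(''.join(h.ljust(10) for h in columns))
for row in zip(*columns.values()):
    print(''.join(cell.ljust(10) for cell in row))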

parse xpath from xml file should contain '

烂漫一生 submitted on 2021-02-11 15:55:49
Question: This is my XML file: <Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"></Item> I parse it like this: self.doc = etree.parse(xmlFile) masterItemsFromXML = self.doc.findall('MasterPage/MasterItems/Item') for oneItem in masterItemsFromXML: print 'master item xpath = {0}'.format(oneItem.attrib['xpath']) and I can see the result printed in the cmd like this: master item xpath =p[@class="date"]/text() My problem: the xpath is not valid because it should start with ' and
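The quotes in xpath='...' are only XML attribute delimiters, so the parsed attribute value never contains them; if surrounding quotes are needed, they have to be added back explicitly. A minimal sketch reproducing the behaviour, assuming lxml as the etree calls above suggest:

from lxml import etree

xml = '''<MasterPage><MasterItems>
  <Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"/>
</MasterItems></MasterPage>'''

doc = etree.fromstring(xml)
for oneItem in doc.findall('MasterItems/Item'):
    raw = oneItem.attrib['xpath']          # no surrounding quotes here
    print('master item xpath = {0}'.format(raw))
    print("with quotes added back: '{0}'".format(raw))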