web-crawler

Naver Crawler: Combining DataFrame per each loop Python

放肆的年华 submitted on 2019-12-14 03:02:14
Question: I am working on my Naver crawler (it's a Korean Google :P). I have been working on this code for a week now, and I have one last task to solve! The code below crawls data through the Naver API and receives the response into "js" on each loop. All I need to do is combine each DataFrame (dfdfdf) and stack them at the bottom, but my result always shows only the last loop's data. Bottom line: I want to add a DataFrame for each loop iteration. I tried merge and join, but neither seems to work. Please
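
A minimal sketch of one common way to do this with pandas, assuming each loop produces one DataFrame: collect the frames in a list and concatenate once at the end. The dict below is stand-in data; in the question it would be built from the Naver API response ("js") for that iteration.

```python
import pandas as pd

# Keep each loop's DataFrame in a list instead of overwriting it,
# then concatenate all of them once after the loop finishes.
frames = []
for page in range(1, 4):
    js = {"page": [page], "title": ["result from page %d" % page]}  # stand-in for the API response
    frames.append(pd.DataFrame(js))

combined = pd.concat(frames, ignore_index=True)
print(combined)
```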

Parsing webpages to extract contents

被刻印的时光 ゝ submitted on 2019-12-14 00:03:47
Question: I want to design a crawler, using Java, that crawls a webpage and extracts certain content from the page. How should I do this? I am new to this and need guidance on getting started with crawler design. For example, I want to access the content "red is my favorite color" from a webpage in which it is embedded like this: <div>red is my favorite color</div> Answer 1: Suggested readings. Static pages: java.net.URLConnection and java.net.HttpURLConnection; jsoup, an HTML parser and content manipulation library
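
The answer points at Java libraries; since the rest of the code on this page is Python, here is a sketch of the same fetch-then-parse idea in Python (requests and BeautifulSoup stand in for URLConnection and jsoup, and the URL is a placeholder).

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page, parse the HTML, then look for the element whose text we want.
html = requests.get("https://example.com/").text
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div"):
    if div.get_text(strip=True) == "red is my favorite color":
        print("found it:", div)
```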

Trouble getting correct Xpath

无人久伴 submitted on 2019-12-13 22:22:49
Question: I am trying to pull all product links and image links out of a shopping widget using general XPaths. This is the site: http://www.stopitrightnow.com/ This is the XPath I have: xpath('.//*[@class="shopthepost-widget"]/a/@href').extract() I would have thought this would pull all the links, but it returns nothing. The following is the beginning of the widget source for reference: class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls
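
Two things worth checking, shown in the sketch below: the original expression only matches <a> elements that are direct children of the widget node, so the descendant axis (//a) may be what is wanted; and if the widget content is injected by JavaScript, the anchors will not be in the HTML the crawler downloads at all. The HTML here is a stand-in.

```python
from scrapy import Selector

# Stand-in markup with an anchor nested one level inside the widget.
html = '''
<div class="shopthepost-widget" data-widget-id="708473">
  <div class="stp-outer"><a href="http://example.com/product-1">product 1</a></div>
</div>
'''
sel = Selector(text=html)

# Direct-child axis: no <a> is a direct child of the widget, so nothing matches.
print(sel.xpath('//*[@class="shopthepost-widget"]/a/@href').extract())   # []
# Descendant axis: reaches the nested anchor.
print(sel.xpath('//*[@class="shopthepost-widget"]//a/@href').extract())  # ['http://example.com/product-1']
```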

Cannot crawl and access a particular div in the text file

☆樱花仙子☆ submitted on 2019-12-13 21:21:13
Question: I have the following code and I want to access the text of a particular div.

from bs4 import BeautifulSoup
import requests
import urlparse

example = open('example.txt')
html = example.read()

def gettext(htmltext):
    soup = BeautifulSoup(htmltext, "lxml")
    for div in soup.findAll('div', attrs={'class': '_5pbx userContent'}):
        print div.text

gettext(html)

At first, I tried it through a link to a Facebook profile, but it didn't work. But now I have copied the whole source code and saved it in example.txt
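
A small self-contained variant of the same extraction on stand-in HTML (the markup below is invented; only the class names mirror the question). Selecting by CSS classes avoids depending on the exact order of the class attribute string.

```python
from bs4 import BeautifulSoup

# Invented markup carrying the same two classes as the question's target div.
html = '<div class="_5pbx userContent"><p>hello from the saved page</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Match on both CSS classes, in any order, rather than the literal class string.
for div in soup.select("div._5pbx.userContent"):
    print(div.get_text(strip=True))
```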

Web crawler Using Twisted

半世苍凉 submitted on 2019-12-13 20:08:44
Question: I am trying to create a web crawler with Python and Twisted. The problem is that at the time of calling reactor.run() I don't yet know all the links to fetch. The code goes like this:

def crawl(url):
    d = getPage(url)
    d.addCallback(handlePage)
    reactor.run()

and the page handler has something like:

def handlePage(output):
    urls = getAllUrls(output)

So now I need to apply crawl() to each of the URLs in urls. How do I do that? Should I stop the reactor and start it again? If I am missing something obvious
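
A minimal sketch of the usual Twisted pattern: run the reactor exactly once and schedule further fetches from inside the page callback, rather than stopping and restarting it. It keeps the question's getPage (deprecated in newer Twisted in favour of Agent/treq), and the regex link extractor is only a stand-in for the question's own getAllUrls.

```python
import re
from twisted.internet import reactor
from twisted.web.client import getPage  # deprecated in newer Twisted; Agent/treq replace it

seen = set()
pending = [0]  # number of in-flight requests


def getAllUrls(body):
    # Stand-in for the question's link extractor.
    return re.findall(rb'href="(http[^"]+)"', body)


def crawl(url):
    if url in seen:
        return
    seen.add(url)
    pending[0] += 1
    d = getPage(url)
    d.addCallback(handlePage)
    d.addBoth(finished)


def handlePage(body):
    for next_url in getAllUrls(body):
        crawl(next_url)          # schedule more fetches; the reactor keeps running
    return body


def finished(_):
    pending[0] -= 1
    if pending[0] == 0:
        reactor.stop()           # stop only when nothing is left in flight


crawl(b"http://example.com/")
reactor.run()                    # called exactly once
```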

How to ignore already crawled URLs in Scrapy

家住魔仙堡 submitted on 2019-12-13 20:05:39
Question: I have a crawler that looks something like this:

def parse(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse2)

def parse2(self, response):
    ...
    yield Request(url=nextUrl, callback=self.parse3)

def parse3(self, response):
    ...

I want to add a rule that ignores a URL if it has already been crawled when invoking parse2, but keeps the rule for parse3. I am still exploring the requests.seen file to see if I can manipulate that.

Answer 1: Check out the dont_filter request parameter at http://doc
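
A sketch of what the answer is pointing at: Scrapy's scheduler drops duplicate requests by default, and dont_filter=True opts an individual request out of that check, so the flag can be set (or left off) per callback as needed. The spider and URLs below are stand-ins.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        next_url = response.urljoin("/page/2/")
        # Default behaviour: a URL already seen by the dupe filter is dropped.
        yield scrapy.Request(url=next_url, callback=self.parse2)

    def parse2(self, response):
        next_url = response.urljoin("/page/3/")
        # dont_filter=True lets this request through even if the URL was seen before.
        yield scrapy.Request(url=next_url, callback=self.parse3, dont_filter=True)

    def parse3(self, response):
        self.logger.info("crawled %s", response.url)
```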

Apache Nutch REST api

自闭症网瘾萝莉.ら submitted on 2019-12-13 20:04:28
Question: I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome tool "Advanced Rest Client" I'm trying to build up this POST payload, but the response I get is a 400 Bad Request.

POST - http://localhost:8081/job/create
Payload:
{
  "crawl-id": "crawl-01",
  "type": "INJECT",
  "config-id": "default",
  "args": { "path/to/seedlist/directory" }
}

My problem is in the args; I think more is needed but I'm not sure. On the NutchRESTAPI page this is the sample it gives
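
For reference, a sketch of sending the same job-create request with Python instead of a browser REST client. Note that the args value shown above, { "path/to/seedlist/directory" }, is a value with no key and so is not valid JSON; args needs key/value pairs. The "seedDir" key below is only a placeholder, the exact key the Nutch INJECT job expects should be taken from the Nutch REST API documentation.

```python
import requests

# Placeholder payload: "seedDir" is an assumed key name, not confirmed against Nutch docs.
payload = {
    "crawl-id": "crawl-01",
    "type": "INJECT",
    "config-id": "default",
    "args": {"seedDir": "path/to/seedlist/directory"},
}

resp = requests.post("http://localhost:8081/job/create", json=payload)
print(resp.status_code, resp.text)
```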

Failed to crawl element of specific website with scrapy spider

狂风中的少年 submitted on 2019-12-13 19:25:03
Question: I want to get the website addresses of some jobs, so I wrote a Scrapy spider. I want to get all of the values with the XPath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I execute the spider with the command scrapy spider auseek -a addsthreshold=3, the variable "urls" used to store the values is empty. Can someone help me figure it out? Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import
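
For reference, a sketch of pulling those href values with the same XPath in a parse callback, written against the current Scrapy API rather than the old scrapy.contrib modules imported above. The spider name and start URL are stand-ins, and if the job list on the real site is rendered by JavaScript, the XPath will match nothing in the downloaded HTML.

```python
import scrapy


class JobLinkSpider(scrapy.Spider):
    name = "joblinks"
    start_urls = ["http://quotes.toscrape.com/"]  # stand-in URL

    def parse(self, response):
        # Same XPath as the question; extract() returns a list of matched hrefs.
        urls = response.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        for url in urls:
            yield {"url": response.urljoin(url)}
```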

Scrapy - How to crawl website & store data in Microsoft SQL Server database?

别来无恙 submitted on 2019-12-13 18:08:03
Question: I'm trying to extract content from a website created by our company. I've created a table in MSSQL Server for the Scrapy data, and I've set up Scrapy and configured Python to crawl and extract webpage data. My question is: how do I export the data crawled by Scrapy into my local MSSQL Server database? This is the Scrapy code for extracting the data:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
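
One common approach is an item pipeline that inserts each scraped item into SQL Server via pyodbc, sketched below. The connection string, table name, and column names are placeholders for whatever the real database uses, and the pipeline would be enabled through ITEM_PIPELINES in settings.py.

```python
import pyodbc


class MSSQLPipeline:
    def open_spider(self, spider):
        # Placeholder connection string; adjust driver, server and database.
        self.conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=ScrapyDB;Trusted_Connection=yes;"
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Placeholder table and columns; map them to the real schema.
        self.cursor.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            item.get("text"), item.get("author"),
        )
        self.conn.commit()
        return item
```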

How to use Python's HTMLParser to extract specific links

醉酒当歌 submitted on 2019-12-13 17:42:53
Question: I've been working on a basic web crawler in Python using the HTMLParser class. I fetch my links with a modified handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links. How would I go about only fetching links that are between
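
The excerpt cuts off here, but a common way to restrict HTMLParser-based link collection is to track state and only record hrefs while the parser is inside a chosen container element. The container (a div with class "content") and the sample HTML below are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SectionLinkParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.baseUrl = base_url
        self.links = []
        self.depth = 0          # > 0 while inside the chosen container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # Enter the container, or track nesting while already inside it.
            if self.depth or 'content' in (attrs.get('class') or '').split():
                self.depth += 1
        elif tag == 'a' and self.depth and 'href' in attrs:
            self.links.append(urljoin(self.baseUrl, attrs['href']))

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


html = '<a href="/skip">skip</a><div class="content"><a href="/keep">keep</a></div>'
parser = SectionLinkParser("http://example.com/")
parser.feed(html)
print(parser.links)   # ['http://example.com/keep']
```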