web-crawler

I want to get all article content from all the links inside a website

╄→尐↘猪︶ㄣ submitted on 2019-12-11 11:53:13
Question: I want to extract all article content from a website using any web crawling/scraping method. The problem is that I can get the content of a single page, but not of the pages it links to. Can anyone please suggest a proper solution?

    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.net.URI;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.swing.text.EditorKit;
    import javax.swing.text.html.HTMLDocument;
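The question's code is Java, but the crawl logic is language-independent: fetch the start page, collect its links, then fetch each linked page. Below is a minimal Python sketch of that idea; "http://example.com/" is a placeholder URL, and a real crawler would also need duplicate filtering and politeness delays.

    from urllib.request import urlopen
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag it sees."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    start = "http://example.com/"  # placeholder start URL
    html = urlopen(start).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)

    for href in collector.links:
        page = urlopen(urljoin(start, href)).read()
        # ...extract the article content from `page` here...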

Scrapy crawls only part of a website

。_饼干妹妹 submitted on 2019-12-11 11:47:45
Question: Hello there. I have the following code to scan all links in a given site.

    from scrapy.item import Field, Item
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    class SampleItem(Item):
        link = Field()

    class SampleSpider(CrawlSpider):
        name = "sample_spider"
        allowed_domains = ["domain.com"]
        start_urls = ["http://domain.com"]

        rules = (
            Rule(LinkExtractor(), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            item = SampleItem()          # the source cut off after "item ="; this
            item['link'] = response.url  # ending is reconstructed from context
            yield item
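To the title's actual question, crawling only part of a website: the usual approach is to restrict the LinkExtractor to the URL patterns you care about. A sketch, where r'/articles/' is a placeholder pattern rather than anything from the question; the deny and restrict_xpaths arguments work the same way:

    rules = (
        # only follow (and parse) URLs whose path matches /articles/
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_page', follow=True),
    )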

Having trouble understanding where to look in the source code in order to create a web scraper

≯℡__Kan透↙ submitted on 2019-12-11 11:37:34
Question: I am a noob with Python; I've been teaching myself on and off since this summer. I am going through the Scrapy tutorial, and occasionally reading more about HTML/XML to help me understand Scrapy. My project is to imitate the Scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread titles along with the thread URLs; it should be simple! My problem lies in not understanding XPath, and also HTML, I guess. When viewing the source code for the …
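scrapy shell is the usual tool for experimenting with XPath against a live page. The selectors below are illustrative guesses, not GameFAQs' real markup; inspect the page in a browser to find the actual tag and class names:

    # run: scrapy shell http://www.gamefaqs.com/boards/916373-pc
    # "topic" is a hypothetical class name; check the real page source
    for link in response.xpath('//td[@class="topic"]//a'):
        title = link.xpath('text()').extract_first()
        url = link.xpath('@href').extract_first()
        print(title, url)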

Problem with Ruby Regular Expression

独自空忆成欢 submitted on 2019-12-11 10:53:44
Question: I have this HTML code, all on a single line:

    <h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>

Here is the line-friendly version (which I can't use):

    <h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
    <h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>

And I'm trying to extract just the URLs with this regex:

    /<h3 class="r"><a href="(.*)">(.*)<\/a>/

And it returns www.google.com">fkdsafjldsajl</a><…
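Two things go wrong here: the pattern matches class="r" with double quotes while the HTML has class='r', and the greedy (.*) runs to the last closing quote on the line instead of the first. In Ruby the fix is the non-greedy form (.*?); the same behavior can be shown in Python, which the other snippets on this page use:

    import re

    line = """<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>"""
    # Non-greedy .*? stops at the first closing quote/tag instead of the last,
    # and the class attribute is matched with single quotes, as in the HTML.
    pattern = r"""<h3 class='r'><a href="(.*?)">(.*?)</a>"""
    print(re.findall(pattern, line))
    # -> [('www.google.com', 'fkdsafjldsajl'), ('www.google.com', 'fkdsafjldsajl')]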

Getting all hrefs from a page

人盡茶涼 submitted on 2019-12-11 10:33:52
Question: I'm making a web crawler. To find the links in a page I was using XPath in Selenium:

    driver = webdriver.Firefox()
    driver.get(side)
    Listlinker = driver.find_elements_by_xpath("//a")

This worked fine. Testing the crawler, however, I found that not all links come under the a tag: href is sometimes used in area or div tags as well. Right now I'm stuck with:

    driver = webdriver.Firefox()
    driver.get(side)
    Listlinkera = driver.find_elements_by_xpath("//a")
    Listlinkerdiv = driver.find_elements_by_xpath("//div")
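Rather than querying each tag type separately, a single XPath can match any element that carries an href attribute, whatever its tag. A sketch against the Selenium 3-era API the question uses; `side` stands in for the question's URL variable:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get(side)  # `side` is the page URL, as in the question
    # "//*[@href]" selects every element with an href: a, area, div, ...
    elements = driver.find_elements_by_xpath("//*[@href]")
    hrefs = [el.get_attribute("href") for el in elements]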

file_get_contents gets the wrong page

你。 submitted on 2019-12-11 10:29:24
Question: I am learning to spider website content with PHP's file_get_contents, but something is wrong. The site I want is "http://www.jandan.net", but with file_get_contents() I get the content of "http://i.jandan.net" (the phone version; they are different pages). Setting user_agent doesn't help either.

    <?php
    ini_set("user_agent", "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6");
    $url = 'http://www.jandan.net/';
    /* $opt = array(
        'http' => array(
            'method' => "GET", …
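A likely cause (an assumption; the question doesn't confirm it) is that the server inspects the User-Agent and redirects to the mobile host. One way to check is to make the request without following redirects and look at the Location header; in Python with the third-party requests package, to match the other snippets on this page:

    import requests

    headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) "
                             "Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6"}
    # allow_redirects=False exposes any server-side redirect to i.jandan.net
    resp = requests.get("http://www.jandan.net/", headers=headers, allow_redirects=False)
    print(resp.status_code, resp.headers.get("Location"))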

Scrapy isn't extracting the title correctly

我怕爱的太早我们不能终老 submitted on 2019-12-11 10:23:00
Question: In this code I want to scrape the title, subtitle and the data inside the links, but I'm having issues on pages beyond 1 and 2, as only one item gets scraped. I also want to extract only those entries whose title is "delhivery".

    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from urlparse import urljoin
    from delhivery.items import DelhiveryItem

    class criticspider(CrawlSpider):
        name = "delh…
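For the "only entries titled delhivery" part, the usual pattern is to filter inside the callback before yielding. A sketch; the XPath is a placeholder, not the real markup of the site being scraped:

    def parse_item(self, response):
        item = DelhiveryItem()
        # placeholder XPath; substitute the page's real title selector
        item['title'] = response.xpath('//h1/text()').extract_first()
        if item['title'] and 'delhivery' in item['title'].lower():
            yield item  # entries whose title doesn't mention delhivery are dropped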

Scrapy crawl spider doesn't follow links

…衆ロ難τιáo~ submitted on 2019-12-11 10:17:29
Question: For this I used the crawl spider example in the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/spiders.html. I want to get links from a web page and follow them to parse a table with statistics, but somehow no links get grabbed and followed to the page that has the data. Here is my script:

    from basketbase.items import BasketbaseItem
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector  # import cut off in the source; Selector assumed
    …
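A frequent cause of a CrawlSpider that never follows links is overriding parse(): CrawlSpider uses parse() internally to drive its rules, so defining your own parse() silently disables link following. A sketch with a placeholder domain and URL pattern, in the same scrapy.contrib style as the question:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class StatsSpider(CrawlSpider):
        name = "stats"
        allowed_domains = ["example.com"]     # placeholder
        start_urls = ["http://example.com/"]  # placeholder
        rules = (
            # follow=True keeps crawling; the callback must NOT be named "parse"
            Rule(SgmlLinkExtractor(allow=r"/stats/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            pass  # parse the statistics table here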

How to prevent bots from creating sessions in CodeIgniter?

Deadly submitted on 2019-12-11 09:43:35
Question: I am using CodeIgniter with sessions stored in my database. Over a short period of time, a large number of sessions are created by bots/spiders, etc. Is there a way of preventing this? Perhaps via .htaccess?

Answer 1: First and foremost, you should create a robots.txt file in the web root of the domain to address two issues. First, to control the rate at which the website is crawled, which can help prevent a bot/spider from creating a massive number of database connections at the same time. …
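A sketch of such a robots.txt, placed at the domain's web root. Crawl-delay is a non-standard directive: some crawlers (e.g. Bing, Yandex) honor it, Googlebot ignores it, and misbehaving bots ignore robots.txt entirely, so it only limits the well-behaved ones:

    # robots.txt at http://yourdomain.com/robots.txt (placeholder domain)
    User-agent: *
    # ask compliant bots to wait 10 seconds between requests
    Crawl-delay: 10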

Programmatically download text that doesn't appear in the page source

戏子无情 submitted on 2019-12-11 09:32:39
Question: I'm writing a crawler in Python. Given a single web page, I extract its HTML content in the following manner:

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()

But some text components don't appear in the HTML page source. For example, on this page (it redirects to the index; please access one of the dates and view a specific mail), if you view the page source you will see that the mail text doesn't appear in the source but seems to be loaded by JS. How can I download this text programmatically?
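One common approach (an assumption; the question leaves the answer open) is to drive a real browser with Selenium, which already appears elsewhere on this page, so the JS-generated text ends up in the DOM before you read it:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://www.example.com/')  # placeholder for the mail-archive URL
    html = driver.page_source              # includes JavaScript-generated content
    driver.quit()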