web-crawler

Determining an a priori ranking of what sites a user has most likely visited

雨燕双飞 submitted on 2019-12-11 06:51:15
Question: This is for http://cssfingerprint.com. I have a largish database (~100M rows) of websites. It includes both main domains (2LD and 3LD) and particular URLs scraped from those domains (whether hosted there [like most blogs] or only linked from them [like Digg], with a reference to the host domain). I also scrape the Alexa top million, Bloglines top 1000, Google PageRank, Technorati top 100, and Quantcast top million rankings. Many domains will have no ranking though, or only a partial
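A minimal sketch of one way partial rank lists could be merged into a single prior score per domain; the source names, weights, and the neutral default for unranked domains are illustrative assumptions, not from the question:

```python
# Sketch: combine partial rank signals into one prior score per domain.
# Source names, weights, and defaults are illustrative assumptions.

def rank_to_score(rank, list_size):
    """Map a 1-based rank to a score in (0, 1]; smaller rank -> higher score."""
    return 1.0 - (rank - 1) / list_size

def prior_score(domain_ranks, weights, list_sizes, default=0.0):
    """domain_ranks: dict like {'alexa': 1234, 'quantcast': None, ...}."""
    total, weight_sum = 0.0, 0.0
    for source, weight in weights.items():
        rank = domain_ranks.get(source)
        # A source with no ranking for this domain contributes the neutral default.
        score = default if rank is None else rank_to_score(rank, list_sizes[source])
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else default

# Example usage with made-up weights and list sizes
weights = {'alexa': 3.0, 'quantcast': 2.0, 'bloglines': 1.0}
sizes = {'alexa': 1_000_000, 'quantcast': 1_000_000, 'bloglines': 1_000}
print(prior_score({'alexa': 50_000, 'bloglines': 120}, weights, sizes))
```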

Write text file to pipeline

六眼飞鱼酱① submitted on 2019-12-11 06:51:05
Question: I have multiple spiders in a single Scrapy project. I want to write a separate output text file for each spider, named with the spider name and a timestamp. When I had a single spider I was creating the file in the __init__ method, but now I am trying it like this; upromise will generate two output files while the other will generate only one. class MallCrawlerPipeline(object): def spider_opened(self, spider): self.aWriter = csv.writer(open('../%s_%s.txt' % (spider.name, datetime.now().strftime("%Y%m%d_%H%M%S")), 'wb'), delimiter=
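A minimal sketch of one common way to get a per-spider file: use the open_spider/close_spider hooks, which Scrapy calls automatically for every spider that uses the pipeline (the item fields and delimiter below are illustrative assumptions):

```python
import csv
from datetime import datetime

class MallCrawlerPipeline(object):
    def open_spider(self, spider):
        # Called once per spider: one file per spider name + timestamp.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.file = open('../%s_%s.txt' % (spider.name, stamp), 'w', newline='')
        self.writer = csv.writer(self.file, delimiter='\t')

    def process_item(self, item, spider):
        # Illustrative: write the item's values as one row.
        self.writer.writerow(list(item.values()))
        return item

    def close_spider(self, spider):
        self.file.close()
```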

How to open a new URL in the same open tab?

会有一股神秘感。 submitted on 2019-12-11 06:42:38
Question: I am using this code to open every link from a specific URL. Each link opens in a new tab, which causes huge memory use; how can I open each new link in the existing tab? static void Main(string[] args) { ProcessStartInfo processStartInfo = null; string googleChoromePath = "C:\\Users\\Dandin\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe"; string argument = ""; string url = "http://edition.cnn.com/"; WebClient wClient = new WebClient(); string st = wClient.DownloadString(url)
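The question's code is C#; as a sketch of the underlying idea in Python (the single example language used here), driving one browser through Selenium and calling driver.get() for each link reuses the same tab instead of launching a new chrome.exe process per URL. The link list below is a placeholder for whatever the scraper extracted:

```python
from selenium import webdriver

# Sketch: reuse one tab by navigating a single driver instead of
# spawning a new browser process per link.
driver = webdriver.Chrome()
links = ["http://edition.cnn.com/", "http://edition.cnn.com/world"]  # illustrative
for url in links:
    driver.get(url)   # replaces the page in the existing tab
driver.quit()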

How to get all webpages on a domain

梦想的初衷 submitted on 2019-12-11 06:31:43
Question: I am making a simple web spider and I was wondering if there is a way, triggered from my PHP code, to get all the webpages on a domain... e.g. let's say I wanted to get all the webpages on Stackoverflow.com. That means it would get pages such as https://stackoverflow.com/questions/ask, "pulling webpages from an adult site -- how to get past the site agreement?", and https://stackoverflow.com/questions/1234214/ ("Best Rails HTML Parser"), and all the links. How can I get that? Or is there an API or
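There is no general API that lists every page of an arbitrary domain; a crawler discovers pages by following links (or by reading the site's sitemap.xml, when one exists). A rough sketch of a breadth-first crawl restricted to one domain, in Python rather than the question's PHP, purely to illustrate the idea:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_domain(start_url, max_pages=100):
    """Breadth-first crawl that stays on the start URL's domain."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages.append(url)
        # Resolve every href against the current page and keep same-domain links.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```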

How to get the web page title with cURL in PHP from websites with different charsets?

孤街浪徒 submitted on 2019-12-11 06:28:48
Question: I want to store the title in UTF-8, but the pages come with many different charsets, such as GBK, ISO, Unicode... Could you give me some help? Thanks. Answer 1: Identify or detect the character encoding and convert the data to UTF-8 if necessary. For HTML (i.e. text/html) there are three ways to specify the character encoding: an HTTP "charset" parameter in a "Content-Type" field; a META declaration with "http-equiv" set to "Content-Type" and a value set for "charset"; the charset attribute set on
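A rough sketch of that answer's approach in Python (the question is PHP, where mb_convert_encoding would play the decoding role): read the charset from the Content-Type header if present, fall back to the in-page META/charset declaration, decode, then keep the title as UTF-8:

```python
import re
import requests

def fetch_title_utf8(url):
    resp = requests.get(url, timeout=10)
    raw = resp.content

    # 1) HTTP "charset" parameter in the Content-Type header, if present.
    m = re.search(r"charset=([\w-]+)", resp.headers.get("Content-Type", ""), re.I)
    charset = m.group(1) if m else None

    # 2) Otherwise look for a <meta charset=...> / http-equiv declaration.
    if not charset:
        m = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.I)
        charset = m.group(1).decode("ascii") if m else "utf-8"

    try:
        html = raw.decode(charset, errors="replace")
    except LookupError:                      # unknown encoding name
        html = raw.decode("utf-8", errors="replace")

    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    # A Python str is Unicode; encode it as UTF-8 when writing to storage.
    return m.group(1).strip() if m else None
```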

Web Crawler in Grails to calculate page depth

浪尽此生 submitted on 2019-12-11 06:28:26
Question: I am making a crawler application. I wish to crawl websites and find the depth of the webpages retrieved. I have read about different crawling and parsing tools, but to no avail; none of them seem to provide support for calculating depth. I am also unsure which crawler tool comes closest to the desired functionality. Any help is appreciated. Answer 1: The most important thing is probably the mapping of your domain (and not the parser). Because, if you are using a tree (More information
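A sketch of the depth idea (in Python rather than Grails/Groovy, purely to illustrate): during a breadth-first crawl, a page's depth is the number of link hops from the start page, recorded when the page is first discovered:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def page_depths(start_url, max_depth=3):
    """Return {url: depth}, where depth = link hops from start_url."""
    domain = urlparse(start_url).netloc
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in depths:
                depths[link] = depths[url] + 1   # first discovery fixes the depth
                queue.append(link)
    return depths
```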

Select box presence check error in Selenium

耗尽温柔 submitted on 2019-12-11 06:23:38
Question: Selenium code in Python: from selenium import webdriver from selenium.webdriver.support.select import Select from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By import time from selenium.webdriver.common.action_chains import ActionChains driver = webdriver.Chrome() driver.maximize_window() driver.get("https://motul.lubricantadvisor.com/Default.aspx?data=1&lang=ENG&lang=eng") def
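The pasted code cuts off before the check itself; a minimal sketch of waiting for a select box to be present and reading it through Select, using the same imports as the question (the element id is a hypothetical placeholder, not taken from the page):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def read_selectbox(driver, element_id="ddlCategory", timeout=10):
    # Wait until the <select> is in the DOM, then wrap it in Select.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
    select = Select(element)
    return [option.text for option in select.options]
```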

Scrapy is slow (60 pages/min)

时间秒杀一切 submitted on 2019-12-11 05:57:19
Question: My crawler seems to be working really slowly, and I'm not sure why. I'll try to explain how it works; keep in mind I use inline requests. First, I have 31 different starting URLs; each URL is a category on Amazon. Settings: USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201" ROBOTSTXT_OBEY = False CONCURRENT_REQUESTS = 2048 DOWNLOAD_DELAY = 1 CONCURRENT_REQUESTS_PER_DOMAIN = 2048 For each URL I loop over all the items on that page (16 items). For each item I send a request to book
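With DOWNLOAD_DELAY = 1, Scrapy waits about one second between requests to the same download slot, which by itself caps a single-domain crawl at roughly 60 pages per minute no matter how high the concurrency settings are. A sketch of settings that drop the fixed delay and let AutoThrottle slow down only when the server pushes back; the numbers are illustrative, not a recommendation for crawling Amazon:

```python
# settings.py sketch - illustrative values; tune for your own crawl
# and respect the target site's terms of service and rate limits.
USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"
ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0            # the fixed 1s delay is what capped throughput

AUTOTHROTTLE_ENABLED = True   # back off only when responses get slow
AUTOTHROTTLE_START_DELAY = 0.25
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0
```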

Web crawler links/page logic in PHP

泄露秘密 submitted on 2019-12-11 05:47:28
Question: I'm writing a basic crawler in PHP that simply caches pages. All it does is use file_get_contents to get the contents of a webpage and a regex to pull out all the links, <a href="URL">DESCRIPTION</a>; at the moment it returns: Array { [url] => URL [desc] => DESCRIPTION } The problem I'm having is working out the logic for determining whether a page link is local, or whether it may be in a completely different local directory. It could be any number of combinations: i.e. href="..
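Relative hrefs ("../page", "./x", "/x", "//host/x", bare "page.html") all need to be resolved against the URL of the page they were scraped from; only then can the crawler compare hosts. A sketch in Python (the question is PHP, where the same resolution would have to be written or taken from a library), using the standard library's URL resolution:

```python
from urllib.parse import urljoin, urlparse

def resolve_link(page_url, href):
    """Resolve a scraped href against the page it appeared on and say
    whether it stays on the same host."""
    absolute = urljoin(page_url, href)          # handles ../, ./, /, //, http://
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute, same_host

# Example usage
print(resolve_link("http://example.com/blog/post.html", "../about.html"))
# -> ('http://example.com/about.html', True)
```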

About the appropriate usage of the Google Custom Search API

主宰稳场 submitted on 2019-12-11 05:39:48
Question: I'm in the middle of writing a commercial application that takes a list of URLs as input (in this case from Google Custom Search), processes the pages pointed to by the URLs, and stores the processed information alongside the URLs. I was just wondering if anyone knows whether this breaks the rule in the TOS which states: "You may not in any way frame, cache or modify the Results produced by Google". Source: http://www.google.com/cse/docs/tos.html I would also be interested to know if