web-crawler

Difference between find and filter in jQuery

谁说我不能喝 submitted on 2019-11-26 22:53:15
Question: I'm working on fetching data from wiki pages, using a combination of PHP and jQuery. First, cURL in PHP fetches the page contents and echoes them; the file is content.php:

    $url = $_GET['url'];
    $url = trim($url, " ");
    $url = urldecode($url);
    $url = str_replace(" ", "%20", $url);
    echo "<a class='urlmax'>".$_GET['title']."</a>";
    echo crawl($url);

Then jQuery is used to find the matched elements:

    $.get("content.php", {url: "http://en.wikipedia.org/w/index.php?action

Python: maximum recursion depth exceeded while calling a Python object

旧城冷巷雨未停 submitted on 2019-11-26 22:47:40
Question: I've built a crawler that has to run over about 5M pages (by incrementing the URL ID) and then parse the pages that contain the info I need. After running an algorithm over the URLs (200K) and saving the good and bad results, I found that I was wasting a lot of time. I could see that there are a few recurring subtrahends (differences between consecutive valid IDs) which I can use to check the next valid URL. You can spot the subtrahends quite quickly; a small example of the first few "good IDs":

    510000011 # +8
    510000029 # +18
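The error in the title usually comes from making one recursive call per URL. One way around it, shown here only as a rough sketch with made-up names (is_good, the step list), is to drive the scan with an explicit queue instead of recursion:

    from collections import deque

    def is_good(page_id):
        # stand-in predicate: the real crawler would fetch the page for this ID
        # and check whether it contains the wanted info
        return page_id % 2 == 1

    def scan_ids(start_id, wanted, steps=(8, 18)):
        """Iterative scan: no recursive calls, so no recursion-depth limit."""
        queue = deque([start_id])
        good, seen = [], set()
        while queue and len(good) < wanted:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            if is_good(current):
                good.append(current)
                for step in steps:         # try the observed increments (+8, +18, ...) first
                    queue.append(current + step)
            else:
                queue.append(current + 1)  # fall back to stepping one ID at a time
        return good

    # e.g. scan_ids(510000011, wanted=10)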

What are some good Ruby-based web crawlers? [closed]

狂风中的少年 submitted on 2019-11-26 22:30:35
Question (closed as off-topic; not accepting answers): I am looking at writing my own, but I am wondering whether there are any good web crawlers out there that are written in Ruby. Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a

Crawling the Google Play store

妖精的绣舞 submitted on 2019-11-26 22:24:23
Question: I want to crawl the Google Play store to download the web pages of all the Android applications (all the pages under the base URL https://play.google.com/store/apps/). I checked the Play store's robots.txt file and it disallows crawling these URLs. Also, when I browse the Google Play store I can only see the top applications, up to 3 pages for each category. How can I get the other application pages? If anyone has tried crawling Google Play, please let me know the
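Whatever approach is taken, the robots.txt restriction mentioned in the question can at least be checked programmatically. A minimal sketch using only the Python standard library (the URLs are the ones from the question):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://play.google.com/robots.txt")
    rp.read()

    # can_fetch(user_agent, url) reports whether that agent may crawl the URL
    print(rp.can_fetch("*", "https://play.google.com/store/apps/"))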

How do I save the original HTML file with Apache Nutch

蹲街弑〆低调 submitted on 2019-11-26 21:40:27
Question: I'm new to search engines and web crawlers. I want to store all the original pages of a particular web site as HTML files, but with Apache Nutch I can only get the binary database files. How do I get the original HTML files with Nutch? Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferred.) Answer 1: Well, Nutch writes the crawled data in binary form, so if you want it saved in HTML format, you will

Detecting 'stealth' web-crawlers

感情迁移 submitted on 2019-11-26 21:08:58
What options are there to detect web crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think we will ever be able to block smart stealth crawlers anyway, only the ones that make mistakes.) I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:
- identifies itself as a bot in the user-agent string
- reads robots.txt (and obeys it)
I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and
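None of the answers are reproduced above, but one common heuristic for this kind of detection is a rate check combined with a honeypot link. A purely illustrative Python sketch (the trap path and thresholds are made up):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS_PER_WINDOW = 20
    TRAP_PATH = "/do-not-crawl/"   # hypothetical honeypot path, listed as Disallow in robots.txt

    recent = defaultdict(deque)    # ip -> timestamps of that ip's recent requests
    flagged = set()

    def record_request(ip, path):
        """Return True if this client now looks like a misbehaving crawler."""
        now = time.time()
        window = recent[ip]
        window.append(now)
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if path.startswith(TRAP_PATH) or len(window) > MAX_REQUESTS_PER_WINDOW:
            flagged.add(ip)
        return ip in flagged

A nice crawler never follows the trap link (robots.txt forbids it) and keeps its request rate modest, so only misbehaving clients trip either test.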

Scrapy - how to identify already scraped urls

*爱你&永不变心* submitted on 2019-11-26 20:18:01
Question: I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or are there examples for SgmlLinkExtractor? Answer 1: You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/ To use it, copy the code from the link and put it into some file in your Scrapy project. To reference it, add a line in your settings.py to
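The linked snippet isn't reproduced here; as a rough illustration of the same idea (the class and file names are assumptions, not the snippet's), a Scrapy downloader middleware could persist visited URLs to a file and drop repeats:

    import os
    from scrapy.exceptions import IgnoreRequest

    class SkipVisitedMiddleware:
        """Drop requests whose URL was already fetched on an earlier run."""

        SEEN_FILE = "visited_urls.txt"   # hypothetical file name

        def __init__(self):
            self.seen = set()
            if os.path.exists(self.SEEN_FILE):
                with open(self.SEEN_FILE) as f:
                    self.seen = {line.strip() for line in f}

        def process_request(self, request, spider):
            if request.url in self.seen:
                raise IgnoreRequest("already visited: " + request.url)

        def process_response(self, request, response, spider):
            if request.url not in self.seen:
                self.seen.add(request.url)
                with open(self.SEEN_FILE, "a") as f:
                    f.write(request.url + "\n")
            return response

The line to add in settings.py would register it under DOWNLOADER_MIDDLEWARES.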

getting Forbidden by robots.txt: scrapy

孤人 submitted on 2019-11-26 19:56:53
Question: While crawling a website like https://www.netflix.com, I'm getting:

    Forbidden by robots.txt: https://www.netflix.com/>
    ERROR: No response downloaded for: https://www.netflix.com/

Answer 1: In the new version (Scrapy 1.1, released 2016-05-11), the crawl first downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

    ROBOTSTXT_OBEY = False

Here are the release notes. Answer 2: The first thing you need to ensure is that you change your user agent in the request,
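Answer 2 is cut off before it shows the setting; combining both answers, the relevant settings.py lines might look like this (the user-agent string is only an example):

    # settings.py
    ROBOTSTXT_OBEY = False   # answer 1: stop Scrapy from fetching and honoring robots.txt
    USER_AGENT = "Mozilla/5.0 (compatible; my-crawler/0.1)"   # answer 2: identify the crawl yourself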

how to extract links and titles from a .html page?

≯℡__Kan透↙ submitted on 2019-11-26 19:51:10
For my website, I'd like to add a new feature: I would like users to be able to upload their bookmarks backup file (from any browser, if possible) so I can attach it to their profile and they don't have to insert all of their bookmarks manually. The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue where to start or what to read? I used the search option, and "How to extract data from a raw HTML file?" is the question most closely related to mine, but it doesn't cover this. I really don't mind whether the solution uses jQuery or PHP. Thank you
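The asker prefers jQuery or PHP; purely as an illustration of the extraction step, here is a sketch using only the Python standard library. Browser bookmark exports are HTML files full of <a href="...">Title</a> entries, so it is enough to walk the <a> tags (the file name is hypothetical):

    from html.parser import HTMLParser

    class BookmarkExtractor(HTMLParser):
        """Collect (title, url) pairs from a bookmarks export."""

        def __init__(self):
            super().__init__()
            self.links = []
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href:
                self.links.append(("".join(self._text).strip(), self._href))
                self._href = None

    parser = BookmarkExtractor()
    with open("bookmarks.html", encoding="utf-8") as f:
        parser.feed(f.read())
    print(parser.links)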

Simple web crawler in C#

…衆ロ難τιáo~ submitted on 2019-11-26 19:26:16
Question: I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:

    namespace Crawler
    {
        public partial class Form1 : Form
        {
            String Rstring;

            public Form1()
            {
                InitializeComponent();
            }

            private void button1_Click(object sender, EventArgs e)
            {
                WebRequest myWebRequest;
                WebResponse myWebResponse;
                String URL = textBox1
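Language aside, the structure being asked for (every fetched page feeding new URLs back into the crawl, plus threads) is usually built around a shared work queue rather than literal recursion. A rough Python sketch of that shape, where the link pattern, limits and worker count are all illustrative assumptions:

    import re
    import threading
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue, Empty

    HREF_RE = re.compile(r'href="(https?://[^"]+)"')   # crude link pattern, fine for a sketch

    def crawl(start_url, max_pages=50, workers=4):
        frontier = Queue()            # URLs waiting to be fetched
        frontier.put(start_url)
        seen = {start_url}            # URLs already queued, so pages are not revisited
        lock = threading.Lock()

        def worker():
            while True:
                try:
                    url = frontier.get(timeout=2)   # stop once the frontier stays empty
                except Empty:
                    return
                try:
                    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
                except Exception:
                    continue                        # skip pages that fail to download
                for link in HREF_RE.findall(html):
                    with lock:
                        if link not in seen and len(seen) < max_pages:
                            seen.add(link)
                            frontier.put(link)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            for _ in range(workers):
                pool.submit(worker)
        return seen

The same shape carries over to C#: a ConcurrentQueue of URLs, a HashSet of visited ones, and a few Tasks pulling from the queue.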