web-crawler

Small preview when sharing a link on social media (Ruby on Rails)

本小妞迷上赌 Submitted on 2019-12-22 10:32:12
Question: I'm working on a site whose front end is in AngularJS and whose backend is in Ruby on Rails; the same Rails API is also used by an Android app. Now I have a situation here. I need to share posts from the site on social media such as Facebook, Twitter and Google Plus, and along with the link to the single post there should be a small preview (the preview of the post that the social network crawls before posting, e.g. on Facebook). I did it using Angular plugins, but when it comes to the Android side, what they share and what displays on
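Social-network crawlers such as Facebook's scraper read server-rendered Open Graph meta tags rather than content produced by client-side JavaScript, which is one common reason an Angular-only preview works in the browser but not when the link is shared from elsewhere. As a rough sketch only (assuming the requests and beautifulsoup4 packages and a placeholder post URL), the tags such a crawler would actually see can be checked like this:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the real post URL served by the Rails backend.
url = "https://example.com/posts/1"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Print the Open Graph tags (og:title, og:description, og:image, ...)
# that social-network crawlers look for in the raw server response.
for tag in soup.find_all("meta"):
    prop = tag.get("property", "")
    if prop.startswith("og:"):
        print(prop, "=", tag.get("content"))

If the loop prints nothing, the preview data is only being added on the client, and the fix usually has to happen on the server side.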

BeautifulSoup does not work for some web sites

痴心易碎 Submitted on 2019-12-22 10:29:41
Question: I have this script:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs

For this web site it prints an empty list. What can the problem be? I am running on Ubuntu 12.04. Answer 1: Actually there are quite a couple of bugs in BeautifulSoup which might raise some unknown errors. I had a similar issue when working on Apache using the lxml parser. So, just try a couple of other parsers
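A common next step, in line with that advice, is to pass an explicit parser to BeautifulSoup and compare results, since the optional lxml and html5lib parsers handle broken markup differently from the built-in one. A minimal Python 3 sketch (the original question uses Python 2 and urllib2):

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.shoptop.ru/"
page = urllib.request.urlopen(url).read()

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(page, parser)
    except Exception:
        # lxml and html5lib are optional packages and may not be installed.
        continue
    print(parser, "found", len(soup.find_all("a")), "links")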

How to crawl links on all pages of a web site with Scrapy

霸气de小男生 Submitted on 2019-12-22 09:25:30
Question: I'm learning Scrapy and I'm trying to extract all links of the form "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302. But I don't know which page on the web site contains this information. For example, on this web site: http://www.ppgcc.ufv.br/ the links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697. What could I do? I'm trying to use rules but I don't know how to use regular expressions correctly. Thank you
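One way to express this with Scrapy is to follow the site's internal links and pull out every href that matches the Lattes pattern with a regular expression. A rough sketch, with the domain and pattern taken from the question (not tested against that site):

import re
import scrapy
from scrapy.linkextractors import LinkExtractor

class LattesSpider(scrapy.Spider):
    name = "lattes"
    allowed_domains = ["ppgcc.ufv.br"]
    start_urls = ["http://www.ppgcc.ufv.br/"]

    # Matches http://lattes.cnpq.br/ followed by a sequence of digits.
    lattes_re = re.compile(r"^https?://lattes\.cnpq\.br/\d+$")

    def parse(self, response):
        # Yield every matching link found on the current page.
        for href in response.css("a::attr(href)").getall():
            if self.lattes_re.match(href):
                yield {"lattes_url": href}
        # Keep following internal pages so the whole site gets visited.
        for link in LinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
            yield response.follow(link.url, callback=self.parse)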

HTTP over C# sockets

故事扮演 Submitted on 2019-12-22 08:53:27
Question: I am trying to send an HTTP request and receive a response from the server over C# sockets, and I'm new to this language. I've written the following code (the IP resolves correctly):

IPEndPoint RHost = new IPEndPoint(IP, Port);
Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
socket.Connect(RHost);
String HTTPRequestHeaders_String = "GET ?q=fdgdfg HTTP/1.0 Host: google.com Keep-Alive: 300 Connection: Keep-Alive User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1
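For comparison, a raw request string like that one generally needs an absolute path in the request line and a \r\n after every header, with an empty line terminating the header block. A minimal sketch of a well-formed HTTP/1.0 request over a plain socket, written here in Python rather than C#:

import socket

host = "google.com"
request = (
    "GET /search?q=fdgdfg HTTP/1.0\r\n"  # request line: method, path, version
    "Host: google.com\r\n"
    "Connection: close\r\n"
    "\r\n"                               # blank line ends the headers
)

with socket.create_connection((host, 80)) as s:
    s.sendall(request.encode("ascii"))
    response = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        response += chunk

print(response.decode("latin-1")[:500])

The same structure applies to the C# version: build the string with \r\n separators and a trailing blank line before sending it.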

Make a web crawler/spider

徘徊边缘 Submitted on 2019-12-22 06:43:00
Question: I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started. Basically, my spider is going to search for audio files and index them. I'm just wondering if anyone has any ideas for how I should do it. I've heard having it done in PHP would be extremely slow. I know VB.NET, so could that come in handy? I was thinking about using Google's filetype search to get links to crawl. Would that be OK? Answer 1: In VB.NET you will need to get the HTML first
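Whatever the language, the core loop of such a spider is the same: fetch a page, extract its links, keep the ones that point at audio files, and queue the rest for crawling. A small illustrative sketch in Python (assuming the requests and beautifulsoup4 packages and a placeholder start URL), not a VB.NET answer:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

AUDIO_EXTENSIONS = (".mp3", ".ogg", ".wav", ".flac")

def crawl_for_audio(start_url, max_pages=50):
    seen, queue, audio_links = set(), deque([start_url]), []
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.lower().endswith(AUDIO_EXTENSIONS):
                audio_links.append(link)   # found an audio file to index
            elif link.startswith("http") and urlparse(link).netloc == urlparse(start_url).netloc:
                queue.append(link)         # stay on the same site
    return audio_links

print(crawl_for_audio("https://example.com/"))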

How to extend Nutch for article crawling

狂风中的少年 Submitted on 2019-12-22 03:49:37
Question: I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, and my questions for each step:

1. Add the article list pages into url/seed.txt. Here's one problem: what I actually want to be indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entrance. So, how can I index only the article pages without the list pages?

2. Write a plugin to parse out the 'author', 'date', 'article body',

How to scrape the first paragraph from a Wikipedia page?

邮差的信 Submitted on 2019-12-22 01:31:11
Question: Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the principal text between the title and the contents box using XPath, or DOM & PHP, or something similar? Is there any PHP library for that? I don't want to use the API because it's a bit complex. Note: I just need this to add a widget under my pages that displays related info from Wikipedia. Answer 1: Use the following XPath expression: /*/h:body//h:h1 | /*/h:body//h:h1/following::node() [count(. | //h:table[@id='toc']
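The same idea can be sketched in Python with lxml (the question asks about PHP, so this only illustrates the approach rather than answering it directly); it assumes Wikipedia's current markup keeps the article text inside the mw-content-text container, which can change:

import requests
from lxml import html

url = "https://en.wikipedia.org/wiki/Web_crawler"  # placeholder article
doc = html.fromstring(requests.get(url, timeout=10).content)

# Take the first non-empty paragraph inside the article body; the container
# id is an assumption about Wikipedia's current page structure.
for p in doc.xpath('//div[@id="mw-content-text"]//p'):
    text = p.text_content().strip()
    if text:
        print(text)
        break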

Download images from Google Image Search (Python)

冷暖自知 Submitted on 2019-12-22 01:06:42
Question: I am a web scraping beginner. I first referred to https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images with a specific tag (e.g. cat), and it works! But I ran into a new problem: it can only download about 100 images. The problem seems to be "ajax": only the first page of HTML is loaded, not all of it. Therefore, it seems we must simulate scrolling down to download the next 100 images or more. My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing
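One common way to trigger that kind of infinite-scroll loading is to drive a real browser and scroll it from the script. A rough Selenium sketch (the search URL, the timings and the decision to grab every img tag are assumptions, and Google's result markup changes frequently):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a matching chromedriver is available
driver.get("https://www.google.com/search?q=cat&tbm=isch")

# Scroll to the bottom several times so more results are loaded via AJAX.
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch time to load

# Collect whatever image URLs are now present in the DOM.
urls = [img.get_attribute("src")
        for img in driver.find_elements(By.TAG_NAME, "img")
        if img.get_attribute("src")]
print(len(urls), "image URLs found")
driver.quit()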

Unable to use proxies in Scrapy project

我的梦境 Submitted on 2019-12-22 00:34:33
Question: I have been trying to crawl a website that has apparently identified and blocked my IP and is returning a 429 Too Many Requests response. I installed scrapy-proxies from this link: https://github.com/aivarsk/scrapy-proxies and followed the given instructions. I got a list of proxies from here: http://www.gatherproxy.com/ and here is what my settings.py and proxylist.txt now look like:

settings.py

BOT_NAME = 'project'
SPIDER_MODULES = ['project.spiders']
NEWSPIDER_MODULE = 'project.spiders'
#
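For reference, the proxy settings that the scrapy-proxies README asks for look roughly like the sketch below (paths and values are placeholders), and for a 429 response the ordinary Scrapy throttling settings often matter as much as the proxies themselves:

# scrapy-proxies configuration (per its README): retry aggressively and route
# each request through a random proxy from a local list.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [429, 500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/path/to/proxylist.txt'   # placeholder path
PROXY_MODE = 0                          # 0 = pick a random proxy per request

# Plain Scrapy settings that reduce the request rate behind the 429s.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True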

How to select data from specific tags in Nutch

你说的曾经没有我的故事 Submitted on 2019-12-21 22:47:27
Question: I am a newbie with Apache Nutch and I would like to know whether it's possible to crawl a selected area of a web page; for instance, to select a particular div and crawl the contents of that div only. Any help would be appreciated. Thanks! Answer 1: You will have to write a plugin that extends HtmlParseFilter to achieve your goal. I reckon you will be doing some of the work yourself, like parsing the specific section of the HTML, extracting the URLs that you want and adding them as outlinks. HtmlParseFilter