web-crawler

Python: sqlalchemy.exc.OperationalError: <unprintable OperationalError object>

柔情痞子 submitted on 2019-12-13 04:28:54
Question: I made the code presented below to build a web crawler (written with Scrapy), and I want to put the scraped data into a database, in this case MySQL. For this I used the pipeline file and made the following configuration: pipeline.py: class ScrapySpiderPipeline(object): def __init__(self): engine = db_connect() create_table(engine) self.Session = sessionmaker(bind=engine) def process_item(self, item, spider): session = self.Session() quotedb = QuoteDB() quotedb.Titulo = item[
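A minimal sketch of how such a pipeline is usually wired together, assuming a declarative QuoteDB model and a db_connect() helper that builds the engine from MySQL settings; the connection string, driver, table name, and column names here are placeholders for illustration, not the asker's actual ones:

```python
# Hypothetical pipeline.py sketch: a Scrapy item pipeline writing to MySQL via SQLAlchemy.
# The connection URL, model, and item field are illustrative placeholders.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class QuoteDB(Base):
    __tablename__ = 'quotes'
    id = Column(Integer, primary_key=True)
    titulo = Column(String(255))

def db_connect():
    # mysql+pymysql driver assumed; replace the credentials with your own
    return create_engine('mysql+pymysql://user:password@localhost/mydb')

class ScrapySpiderPipeline:
    def __init__(self):
        engine = db_connect()
        Base.metadata.create_all(engine)          # plays the role of create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        quote = QuoteDB(titulo=item.get('titulo'))
        try:
            session.add(quote)
            session.commit()
        except Exception:
            session.rollback()                    # an OperationalError usually surfaces here
            raise
        finally:
            session.close()
        return item
```

If the OperationalError is raised on commit, it typically points at the connection itself (wrong host, credentials, or a missing MySQL driver) rather than at the pipeline logic.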

How to navigate to other pages when pagination exists in the URL

夙愿已清 submitted on 2019-12-13 03:47:56
Question: I have a URL (http://myURL.com) from which I'm reading the content of the webpage. The issue is that I am only able to read the page 1 content. Using the jsoup API, when I read page 2 by giving it the page 2 URL from the pagination, the printed output still shows the content of page 1 instead of page 2; yet when the page 2 URL is opened in a web browser, it shows the contents of page 2. Any suggestions on how to read the contents of other pages when the
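The question is cut off, but the usual first check in this situation is whether the page-2 URL actually returns different HTML to a plain HTTP client at all, or whether the pagination is rendered by JavaScript (in which case any plain HTML fetcher, jsoup included, keeps seeing page 1). A quick way to test that outside jsoup, sketched in Python with requests; the URL and the page parameter are made up for illustration:

```python
# Sketch only: fetch two pagination URLs directly and compare the raw HTML.
# 'http://myURL.com/list?page=N' is a placeholder, not the real site.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # some sites serve different HTML to bot-like agents

page1 = requests.get('http://myURL.com/list?page=1', headers=headers, timeout=10).text
page2 = requests.get('http://myURL.com/list?page=2', headers=headers, timeout=10).text

# If the two responses are identical, the server ignores the page parameter for
# plain GET requests (often because the pagination is driven by JavaScript or by
# POST parameters), and no HTML parser will ever see page 2 this way.
print(page1 == page2)
```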

How to scrape a specific part of an online English dictionary?

感情迁移 submitted on 2019-12-13 03:37:56
Question: Hello experts and friends from all over the world. I'm a non-native English speaker who has recently been learning English, and since last week I have been building a huge English vocabulary list that contains the phonetic transcription, meaning, etc. of each word. The problem is that the phonetic transcriptions of some words were missing, because they don't exist in the few online dictionaries I checked. But after struggling for a while, I found that the Oxford online dictionary has the phonetic transcriptions I couldn't find before. So, here's what
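As a rough illustration of the usual approach (requests plus an HTML parser), the sketch below looks a word up and pulls out a phonetic element. The URL pattern and the CSS class are assumptions made for the example; the real dictionary markup has to be inspected in the browser, and its robots.txt and terms of use should be checked before scraping:

```python
# Illustrative sketch: fetch a dictionary page and extract a phonetic transcription.
# The URL pattern and the ".phon" selector are hypothetical; inspect the real page
# (and respect its robots.txt / terms of use) before relying on this.
import requests
from bs4 import BeautifulSoup

def get_phonetic(word):
    url = f'https://www.example-dictionary.com/definition/{word}'   # placeholder URL
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    phon = soup.select_one('.phon')          # hypothetical class for the IPA span
    return phon.get_text(strip=True) if phon else None

print(get_phonetic('neoplasm'))
```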

Save data to a CSV file using Python

坚强是说给别人听的谎言 submitted on 2019-12-13 02:59:53
Question: The data I've extracted from a webpage looks like the following when printed: [[u'Neoplasms', u'Medical Subject Headings', u'direct', u'cancer', u'Neoplasms', u'Medical Subject Headings', u'Malignant Neoplasm', u'National Cancer Institute Thesaurus', u'direct', u'cancer', u'Malignant Neoplasm', u'National Cancer Institute Thesaurus']] I'd like to write it to a CSV file like this, with each row containing six elements: Neoplasms, Medical Subject Headings, direct, cancer, Neoplasms,
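Since the printed output shows a flat list of twelve strings nested inside an outer list, one way to get six-element rows is to flatten it and slice it into chunks of six before handing it to the csv module. A small sketch, assuming the data really is shaped like the output above:

```python
# Sketch: flatten the nested list and write it out six fields per row.
import csv

data = [[u'Neoplasms', u'Medical Subject Headings', u'direct', u'cancer',
         u'Neoplasms', u'Medical Subject Headings', u'Malignant Neoplasm',
         u'National Cancer Institute Thesaurus', u'direct', u'cancer',
         u'Malignant Neoplasm', u'National Cancer Institute Thesaurus']]

flat = [field for sublist in data for field in sublist]

with open('output.csv', 'w', newline='') as f:   # on Python 2, use open('output.csv', 'wb')
    writer = csv.writer(f)
    for i in range(0, len(flat), 6):
        writer.writerow(flat[i:i + 6])
```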

Scrapy: getting values from multiple sites

蓝咒 submitted on 2019-12-13 02:59:00
Question: I'm trying to pass a value from one function to another. I looked up the docs and just didn't understand them. Ref: def parse_page1(self, response): item = MyItem() item['main_url'] = response.url request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) request.meta['item'] = item yield request def parse_page2(self, response): item = response.meta['item'] item['other_url'] = response.url yield item Here is pseudocode of what I want to achieve: import scrapy class
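For reference, the documented pattern quoted in the excerpt can be turned into a complete spider along these lines; the site URL and field names below are placeholders, not the asker's actual target (newer Scrapy versions also offer cb_kwargs for the same job):

```python
# Minimal sketch of passing an item between callbacks via request.meta.
# example.com and the field names are placeholders.
import scrapy

class MySpider(scrapy.Spider):
    name = 'meta_example'
    start_urls = ['http://www.example.com/list.html']

    def parse(self, response):
        item = {'main_url': response.url}
        yield scrapy.Request(
            'http://www.example.com/some_page.html',
            callback=self.parse_page2,
            meta={'item': item},
        )

    def parse_page2(self, response):
        item = response.meta['item']        # the same dict built in parse()
        item['other_url'] = response.url
        yield item
```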

Is Scrapy's asynchronicity what is hindering my CSV results file from being created straightforwardly?

◇◆丶佛笑我妖孽 submitted on 2019-12-13 01:59:00
Question: My Scrapy project "drills down" from list pages, retrieving data for the listed items at varying levels, up to several levels deep. There could be many pages of listed items, with a handful of different items/links on each page. I'm collecting details of each item (and storing them in a single CSV file for Excel) from: the page it is listed on, the page linked from that list (the "more details" page), and yet another page, say the item's original listing by its manufacturer. Because I am building a
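The excerpt is cut off, but the usual answer to "my CSV rows come out in a different order than the pages were visited" is that Scrapy's requests complete asynchronously, so ordering has to be imposed when the items are written rather than when they are requested. One way to do that, sketched as a custom item pipeline that buffers items and sorts them on close; the 'listing_page' and 'position' fields are invented for the example and would have to be set by the spider:

```python
# Sketch: buffer all items in a pipeline and write a single ordered CSV at the end.
# 'listing_page' and 'position' are hypothetical fields the spider would populate.
import csv

class OrderedCsvPipeline:
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        if not self.items:
            return
        self.items.sort(key=lambda it: (it.get('listing_page', 0), it.get('position', 0)))
        with open('results.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(self.items[0].keys()))
            writer.writeheader()
            writer.writerows(self.items)
```

This trades memory for ordering: everything is held until the spider closes, which is fine for a few thousand items but not for very large crawls.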

Hide web pages from search engine robots

拥有回忆 submitted on 2019-12-13 01:30:04
Question: I need to hide all of my site's pages from ALL spider robots, except for the home page (www.site.com), which should still be parsed by robots. Does anyone know how I can do that? Answer 1: Add the tag <meta name="robots" content="noindex" /> to every page you do not want indexed, or create a robots.txt in your document root and put something like this in it: User-agent: * Allow: /$ Disallow: /* Source: https://stackoverflow.com/questions/12807657/hide-web-pages-to-the-search-engines-robots

Unable to verify crawled data stored in HBase

故事扮演 submitted on 2019-12-13 01:25:38
Question: I have crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial . The Nutch version is 2.2.1, the HBase version is 0.90.4, and the Solr version is 4.7.1. Here are the steps I used: ./runtime/local/bin/nutch inject urls ./runtime/local/bin/nutch generate -topN 100 -adddays 30 ./runtime/local/bin/nutch fetch -all ./runtime/local/bin/nutch fetch -all ./runtime/local/bin/nutch updatedb ./runtime/local/bin/nutch solrindex http:/

Sogou spider still hitting our website even after blocking it

北慕城南 submitted on 2019-12-13 01:14:00
Question: Our website was getting many hits from "Sogou web spider", so we thought of blocking it using .htaccess rules. We created the rules below: RewriteCond %{HTTP_USER_AGENT} Sogou [NC] RewriteRule ^.*$ - [L] However, we are still getting hits from Sogou. I would like to know what changes I should make in this rule to block Sogou. Thank you. Answer 1: As @faa mentioned, you're not actually blocking anything: RewriteEngine On RewriteCond %{HTTP_USER_AGENT} Sogou [NC] RewriteRule ^.*$ map.txt [R=403] Make

How to get the right source code with Python from the URLs using my web crawler?

瘦欲@ submitted on 2019-12-13 00:57:15
Question: I'm trying to use Python to write a web crawler, using the re and requests modules. I want to get the URLs from the first page (it's a forum) and then get information from every URL. My problem now is that I already store the URLs in a list, but I can't get any further in getting the RIGHT source code for these URLs. Here is my code: import re import requests url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1' sourceCode = getsourse(url) # source
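The excerpt is cut off before the getsourse() helper, but a common cause of "wrong" source with forums like this is a missing User-Agent header and a mis-detected encoding. A sketch of how such a helper is often written, purely as an illustration; the header value, encoding handling, and thread-link regex are assumptions, not the asker's code:

```python
# Illustrative helper: fetch a page with a browser-like User-Agent and let requests
# guess the encoding from the page itself. Header value and regex are assumptions.
import re
import requests

def get_source(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = resp.apparent_encoding   # forums like this are often GBK, not UTF-8
    return resp.text

list_url = ('http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55'
            '&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1')
html = get_source(list_url)

# A deliberately loose pattern; the real thread-link regex depends on the page markup.
thread_urls = re.findall(r'href="(http://bbs\.skykiwi\.com/thread[^"]+)"', html)
for u in thread_urls:
    page_html = get_source(u)                # fetch each thread with the same helper
```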