scraper

Scrapy InIt self.initialized() — not initializing

China☆狼群 submitted on 2019-12-06 07:38:53
I am trying to use Scrapy to log in to a website in the spider's init, and after confirming the login I want to initialize and start the standard crawl through start_urls. I'm not sure what is going wrong: I get through the login and everything confirms, but parse_item never runs. Any help would be appreciated. I can get as far as "================Successfully logged in=================" but I can never reach "==========================PARSE ITEM==========================".

```python
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib
```
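The usual culprit in this pattern is that nothing hands control back to the framework after the login check. Below is a minimal sketch of the InitSpider login flow, using the old scrapy.contrib API the question imports from; the URLs, form field names, and the 'Logout' success marker are hypothetical placeholders. The key line is returning self.initialized() once the login is confirmed, which releases the queued start_urls requests.

```python
# Minimal sketch of the InitSpider login flow (old scrapy.contrib API,
# matching the question's imports). URLs, form fields, and the 'Logout'
# marker are hypothetical placeholders.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = 'login_spider'
    login_page = 'http://example.com/login'        # hypothetical
    start_urls = ['http://example.com/protected']  # hypothetical

    def init_request(self):
        # Runs before any start_urls request is scheduled: fetch the login page.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Logout' in response.body:
            self.log('================Successfully logged in=================')
            # Without this return the crawl stalls exactly as described:
            # initialized() releases the queued start_urls requests.
            return self.initialized()
        self.log('Login failed')

    def parse(self, response):
        # A plain InitSpider sends start_urls responses here; a parse_item
        # callback only runs if you route requests to it yourself.
        self.log('==========================PARSE ITEM==========================')
```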

Python selenium get inside a #document

若如初见. submitted on 2019-12-05 19:18:00
How can I keep looking for elements inside a #document?

```
<div>
  <iframe>
    #document
      <html>
        <body>
          <div> Element I want to find </div>
        </body>
      </html>
  </iframe>
</div>
```

Answer: I think your problem is not with the #document but with the iframe. Switch the driver into the frame before searching inside it:

```python
from selenium import webdriver

driver = webdriver.Firefox()
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(iframe)
driver.find_element_by_xpath("//div")
```

Source: https://stackoverflow.com/questions/38363643/python-selenium-get-inside-a-document
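For completeness, a sketch of the same flow that also returns to the top-level document afterwards. The page URL is a hypothetical placeholder, and switch_to.frame / switch_to.default_content are the newer spellings of the deprecated switch_to_frame call:

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/page-with-iframe')  # hypothetical URL

# The #document node is just how devtools renders the frame's own
# document; the real step is switching the driver into the iframe.
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)

element = driver.find_element_by_xpath('//div')

# Switch back before touching elements outside the frame.
driver.switch_to.default_content()
```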

Scrape data from HTML pages using Java, output to database [closed]

夙愿已清 submitted on 2019-12-04 14:54:46
I need to know how to create a scraper (in Java) to gather data from HTML pages and output it to a database. I don't have a clue where to start, so any information you can give me on this would be great. Also, you can't be too basic or simple here... thanks :)

Answer: First you need to get familiar with an HTML DOM parser in Java, such as JTidy. It will help you extract the content you want from an HTML file. Once you have the essential data, you can use JDBC to put it in the database. It might be tempting to use regular expressions for this job. But don't: HTML is not a regular language, so regexes are not the right tool for parsing it.

Scrapy Python Craigslist Scraper

泪湿孤枕 submitted on 2019-12-04 10:18:49
I am trying to scrape Craigslist classifieds using Scrapy to extract items that are for sale. I am able to extract the date, post title, and post URL, but I'm having trouble extracting the price. For some reason the current code extracts all of the prices on the page, but when I remove the // before the price span lookup, the price field comes back empty. Can someone please review the code below and help me out?

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = [
```
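The symptom described usually means the price XPath is evaluated against the whole document ("//span..." matches every price on the page) or against direct children only (a bare "span..." matches nothing). The standard fix is a relative ".//" path scoped to each listing node. A sketch follows; the Craigslist markup assumed here (p.row listings, span.price) is a guess at the page structure of the time:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]                # assumed
    start_urls = ["http://sfbay.craigslist.org/sss/"]   # assumed listing page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//p[@class="row"]')          # one node per listing (assumed markup)
        for row in rows:
            item = CraigslistSampleItem()
            item['title'] = row.select('.//a/text()').extract()
            item['link'] = row.select('.//a/@href').extract()
            # ".//" searches within this row only; "//" would match every
            # price on the page, and a bare "span[...]" only direct children.
            item['price'] = row.select('.//span[@class="price"]/text()').extract()
            yield item
```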

beautifulsoup and mechanize to get ajax call result

≯℡__Kan透↙ submitted on 2019-12-03 16:59:24
Hi, I'm building a scraper using Python 2.5 and BeautifulSoup, but I've stumbled upon a problem... Part of the web page is generated after the user clicks a button, which starts an ajax request by calling a specific JavaScript function with the proper parameters. Is there a way to simulate the user interaction and get this result? I came across the mechanize module, but it seems to me that it is mostly used to work with forms... I would appreciate any links or code samples. Thanks.

Answer (self-posted): OK, so I have figured it out... it was quite simple once I realised that I could use a combination of urllib, urllib2 and
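The technique the self-answer hints at is to skip the button entirely: issue the ajax request yourself and feed the returned fragment to BeautifulSoup. A sketch in the question's Python 2 idiom follows; the endpoint URL and parameter names are hypothetical, and in practice you would copy them from the JavaScript function the button calls or from the browser's network panel:

```python
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, the Python 2-era package

# Hypothetical endpoint and parameters.
params = urllib.urlencode({'page': '2', 'section': 'news'})
request = urllib2.Request('http://example.com/ajax/load', params)
request.add_header('X-Requested-With', 'XMLHttpRequest')  # many endpoints expect this

response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read())

# The response is just an HTML fragment, so BeautifulSoup parses it as usual.
for div in soup.findAll('div'):
    print ''.join(div.findAll(text=True))
```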

XPath to select between two HTML comments?

Deadly submitted on 2019-12-03 14:31:44
Question: I have a big HTML page, but I want to select only certain nodes using XPath:

```
<html>
  ........
  <!-- begin content -->
  <div>some text</div>
  <div><p>Some more elements</p></div>
  <!-- end content -->
  .......
</html>
```

I can select the HTML after <!-- begin content --> using:

"//comment()[. = ' begin content ']/following::*"

I can also select the HTML before <!-- end content --> using:

"//comment()[. = ' end content ']/preceding::*"

But what XPath selects all the HTML between the two comments?
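One standard answer, when the two comments and the content share the same parent as in the example, is to intersect the sibling axes: take every element following the begin comment that still has the end comment among its following siblings. A small lxml sketch (the surrounding p elements are added only to show they are excluded):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <p>before</p>
  <!-- begin content -->
  <div>some text</div>
  <div><p>Some more elements</p></div>
  <!-- end content -->
  <p>after</p>
</body></html>""")

# Elements after the begin comment whose following siblings still
# include the end comment, i.e. everything strictly between the two.
nodes = doc.xpath(
    "//comment()[. = ' begin content ']"
    "/following-sibling::*[following-sibling::comment()[. = ' end content ']]")

for node in nodes:
    print(html.tostring(node, encoding='unicode'))
```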