scraper

Scrapy InIt self.initialized() — not initializing

China☆狼群 submitted on 2019-12-06 07:38:53
I am trying to use Scrapy to log in to a website in the spider's init, and after confirming the login I want to initialize and start the standard crawl through start_urls. I'm not sure what is going wrong: I get through the login and everything confirms, but parse_item never runs. Any help would be appreciated. I can get as far as "================Successfully logged in=================" but I can never reach "==========================PARSE ITEM==========================".

```python
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib
```
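The usual culprit in this pattern is that nothing hands control back to the framework after the login check. Below is a minimal sketch of the InitSpider login flow, using the old scrapy.contrib API the question imports from; the URLs, form field names, and the 'Logout' success marker are hypothetical placeholders. The key line is returning self.initialized() once the login is confirmed, which releases the queued start_urls requests.

```python
# Minimal sketch of the InitSpider login flow (old scrapy.contrib API,
# matching the question's imports). URLs, form fields, and the 'Logout'
# marker are hypothetical placeholders.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = 'login_spider'
    login_page = 'http://example.com/login'        # hypothetical
    start_urls = ['http://example.com/protected']  # hypothetical

    def init_request(self):
        # Runs before any start_urls request is scheduled: fetch the login page.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Logout' in response.body:
            self.log('================Successfully logged in=================')
            # Without this return the crawl stalls exactly as described:
            # initialized() releases the queued start_urls requests.
            return self.initialized()
        self.log('Login failed')

    def parse(self, response):
        # A plain InitSpider sends start_urls responses here; a parse_item
        # callback only runs if you route requests to it yourself.
        self.log('==========================PARSE ITEM==========================')
```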

Python selenium get inside a #document

若如初见. submitted on 2019-12-05 19:18:00
How can I keep looking for elements inside a #document?

```
<div>
  <iframe>
    #document
      <html>
        <body>
          <div> Element I want to find </div>
        </body>
      </html>
  </iframe>
</div>
```

Answer: I think your problem is not with the #document but with the iframe. Switch the driver into the frame before searching inside it:

```python
from selenium import webdriver

driver = webdriver.Firefox()
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_frame(iframe)
driver.find_element_by_xpath("//div")
```

Source: https://stackoverflow.com/questions/38363643/python-selenium-get-inside-a-document
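For completeness, a sketch of the same flow that also returns to the top-level document afterwards. The page URL is a hypothetical placeholder, and switch_to.frame / switch_to.default_content are the newer spellings of the deprecated switch_to_frame call:

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/page-with-iframe')  # hypothetical URL

# The #document node is just how devtools renders the frame's own
# document; the real step is switching the driver into the iframe.
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)

element = driver.find_element_by_xpath('//div')

# Switch back before touching elements outside the frame.
driver.switch_to.default_content()
```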

Scrape data from HTML pages using Java, output to database [closed]

夙愿已清 submitted on 2019-12-04 14:54:46
I need to know how to create a scraper (in Java) to gather data from HTML pages and output it to a database. I don't have a clue where to start, so any information you can give me on this would be great. Also, you can't be too basic or simple here... thanks :)

Answer: First you need to get familiar with an HTML DOM parser in Java, such as JTidy. It will help you extract the content you want from an HTML file. Once you have the essential data, you can use JDBC to put it in the database. It might be tempting to use regular expressions for this job. But don't: HTML is not a regular language, so regexes are not the right tool for parsing it.

Scrapy Python Craigslist Scraper

泪湿孤枕 submitted on 2019-12-04 10:18:49
I am trying to scrape Craigslist classifieds using Scrapy to extract items that are for sale. I am able to extract the date, post title, and post URL, but I'm having trouble extracting the price. For some reason the current code extracts all of the prices on the page, but when I remove the // before the price span lookup, the price field comes back empty. Can someone please review the code below and help me out?

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = [
```
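The symptom described usually means the price XPath is evaluated against the whole document ("//span..." matches every price on the page) or against direct children only (a bare "span..." matches nothing). The standard fix is a relative ".//" path scoped to each listing node. A sketch follows; the Craigslist markup assumed here (p.row listings, span.price) is a guess at the page structure of the time:

```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]                # assumed
    start_urls = ["http://sfbay.craigslist.org/sss/"]   # assumed listing page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//p[@class="row"]')          # one node per listing (assumed markup)
        for row in rows:
            item = CraigslistSampleItem()
            item['title'] = row.select('.//a/text()').extract()
            item['link'] = row.select('.//a/@href').extract()
            # ".//" searches within this row only; "//" would match every
            # price on the page, and a bare "span[...]" only direct children.
            item['price'] = row.select('.//span[@class="price"]/text()').extract()
            yield item
```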

beautifulsoup and mechanize to get ajax call result

≯℡__Kan透↙ submitted on 2019-12-03 16:59:24
Hi, I'm building a scraper using Python 2.5 and BeautifulSoup, but I've stumbled upon a problem... Part of the web page is generated after the user clicks a button, which starts an ajax request by calling a specific JavaScript function with the proper parameters. Is there a way to simulate the user interaction and get this result? I came across the mechanize module, but it seems to me that it is mostly used to work with forms... I would appreciate any links or code samples. Thanks.

Answer (self-posted): OK, so I have figured it out... it was quite simple once I realised that I could use a combination of urllib, urllib2 and
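The technique the self-answer hints at is to skip the button entirely: issue the ajax request yourself and feed the returned fragment to BeautifulSoup. A sketch in the question's Python 2 idiom follows; the endpoint URL and parameter names are hypothetical, and in practice you would copy them from the JavaScript function the button calls or from the browser's network panel:

```python
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, the Python 2-era package

# Hypothetical endpoint and parameters.
params = urllib.urlencode({'page': '2', 'section': 'news'})
request = urllib2.Request('http://example.com/ajax/load', params)
request.add_header('X-Requested-With', 'XMLHttpRequest')  # many endpoints expect this

response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read())

# The response is just an HTML fragment, so BeautifulSoup parses it as usual.
for div in soup.findAll('div'):
    print ''.join(div.findAll(text=True))
```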

XPath to select between two HTML comments?

Deadly submitted on 2019-12-03 14:31:44
Question: I have a big HTML page, but I want to select only certain nodes using XPath:

```
<html>
  ........
  <!-- begin content -->
  <div>some text</div>
  <div><p>Some more elements</p></div>
  <!-- end content -->
  .......
</html>
```

I can select the HTML after <!-- begin content --> using:

"//comment()[. = ' begin content ']/following::*"

I can also select the HTML before <!-- end content --> using:

"//comment()[. = ' end content ']/preceding::*"

But what XPath selects all the HTML between the two comments?
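One standard answer, when the two comments and the content share the same parent as in the example, is to intersect the sibling axes: take every element following the begin comment that still has the end comment among its following siblings. A small lxml sketch (the surrounding p elements are added only to show they are excluded):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <p>before</p>
  <!-- begin content -->
  <div>some text</div>
  <div><p>Some more elements</p></div>
  <!-- end content -->
  <p>after</p>
</body></html>""")

# Elements after the begin comment whose following siblings still
# include the end comment, i.e. everything strictly between the two.
nodes = doc.xpath(
    "//comment()[. = ' begin content ']"
    "/following-sibling::*[following-sibling::comment()[. = ' end content ']]")

for node in nodes:
    print(html.tostring(node, encoding='unicode'))
```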