web-crawler

How to scrape an XML feed with XMLFeedSpider

ぐ巨炮叔叔 submitted on 2019-12-23 02:51:18
Question: I am trying to scrape an XML file with the following format (file_sample.xml):

    <rss version="2.0">
      <channel>
        <item>
          <title>SENIOR BUDGET ANALYST (new)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All Open Jobs</category>
        </item>
        <item>
          <title>BUDGET ANALYST (healthcare)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All category</category>
        </item>
        <…
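A minimal XMLFeedSpider sketch for a feed shaped like the sample above (the spider name, start URL, and output field names are assumptions for illustration):

    from scrapy.spiders import XMLFeedSpider

    class JobsFeedSpider(XMLFeedSpider):
        name = "jobs_feed"
        # Hypothetical location of the feed; replace with the real URL or file path.
        start_urls = ["https://hr.example.org/file_sample.xml"]
        iterator = "iternodes"   # fast pull parser; fine here, no namespaces in this feed
        itertag = "item"         # parse_node is called once per <item> element

        def parse_node(self, response, node):
            # Each XPath is relative to the current <item> node.
            yield {
                "title": node.xpath("title/text()").get(),
                "link": node.xpath("link/text()").get(),
                "pub_date": node.xpath("pubDate/text()").get(),
                "category": node.xpath("category/text()").get(),
            }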

How to simulate an XHR request with Scrapy when trying to crawl data from an AJAX-based website?

夙愿已清 submitted on 2019-12-23 02:43:18
Question: I am new to crawling web pages with Scrapy and unfortunately chose a dynamic site to start with... I've successfully crawled part of it (120 links), thanks to someone helping me here, but not the links I actually need on the target website. After doing some research, I know that crawling an AJAX site boils down to these simple steps:
• open the browser developer tools, Network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate this XHR request in your spider
The last one…
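That last step usually amounts to reproducing, in the spider, the exact request seen in the Network tab. A minimal sketch using scrapy.FormRequest, where the endpoint, POST payload, and JSON response shape are all assumptions:

    import json
    import scrapy

    class XhrSpider(scrapy.Spider):
        name = "xhr_demo"

        def start_requests(self):
            # Hypothetical endpoint, copied from the browser's Network tab.
            yield scrapy.FormRequest(
                url="https://example.com/ajax/search",
                formdata={"page": "1", "category": "all"},       # the XHR's POST payload
                headers={"X-Requested-With": "XMLHttpRequest"},  # many sites check this
                callback=self.parse_xhr,
            )

        def parse_xhr(self, response):
            # Assuming the server answers with JSON; adjust to the real schema.
            data = json.loads(response.text)
            for row in data.get("results", []):
                yield {"link": row.get("url"), "title": row.get("title")}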

How to send crawler data to PHP via the command line?

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 02:42:29
Question: Can I send the results to PHP rather than storing them in the JSON file? I have these two files. settings.json:

    {
      "outputFile": "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.json",
      "logFile": "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.tsv",
      "pause": 1,
      "local": false,
      "connections": 3,
      "cookiesEnabled": false,
      "robotsDisabled": false,
      "advancedMode": true,
      "crawlTemplate": ["www.mimshoes.com/"],
      "startUrls": [PAGES],
      "maxDepth": 10,
      "dataTemplate": ["www.mimshoes.com…
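One way to hand the results to PHP without the intermediate file is to pipe the JSON to a PHP CLI script over stdin. A sketch of the Python side, where process.php is a hypothetical script that reads and decodes its standard input:

    import json
    import subprocess

    results = [{"name": "shoe", "price": "49.99"}]  # stand-in for real crawl output

    # Run the PHP CLI and feed it the JSON document on stdin.
    proc = subprocess.run(
        ["php", "process.php"],
        input=json.dumps(results),
        capture_output=True,
        text=True,
        check=True,
    )
    print(proc.stdout)  # whatever process.php echoes back

On the PHP side, json_decode(file_get_contents('php://stdin'), true) would recover the same structure.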

Google crawler, cron, and CodeIgniter sessions

≯℡__Kan透↙ submitted on 2019-12-23 02:34:20
Question: I am running a CodeIgniter 2.0 web app, and I am using the sessions library with the DB option on, so for every connection to my website a row is written to a MySQL table called 'ci_sessions' storing: session_id, ip_address, user_agent, last_activity, and user_data. And I have two issues: the Google bot with IP 66.249.72.152, and cron with IP 0.0.0.0. Every time my server runs the cron, or every time the Google bot crawls my page, a new session is created. So I have hundreds of identical sessions with the IP 0.0.0.0 and…
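A common fix is to skip session creation entirely for bots and cron/CLI hits, for example via a MY_Session override that consults CodeIgniter's User Agent class (is_robot()) before writing to the DB. The gate itself is simple; here it is sketched in Python for consistency with the rest of this digest, with an assumed bot pattern mirroring the two offenders above:

    import re

    # User agents that should never get a DB-backed session (assumed list).
    BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|spider|crawler", re.IGNORECASE)

    def should_create_session(user_agent: str, ip: str) -> bool:
        """Return False for crawlers and for cron/CLI hits (IP 0.0.0.0)."""
        if ip == "0.0.0.0":
            return False
        return not BOT_PATTERN.search(user_agent or "")

    # The two offenders from the question are both rejected:
    print(should_create_session("Googlebot/2.1", "66.249.72.152"))  # False
    print(should_create_session("", "0.0.0.0"))                     # False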

How do I make my AJAX content crawlable by Google?

会有一股神秘感。 submitted on 2019-12-23 02:17:17
Question: I've been working on a site that uses jQuery heavily and loads in content via AJAX, like so:

    $('#newPageWrapper').load(newPath + ' .pageWrapper', function() {
        // on-load logic
    });

It has now come to my attention that Google won't index any content loaded dynamically via JavaScript, and so I've been looking for a solution to the problem. I've read through Google's Making AJAX Applications Crawlable document what seems like 100 times and I still don't understand how to implement it (due in the most…
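The scheme that document describes (Google has since deprecated it) works like this: the client uses #! URLs, and when the crawler visits, it rewrites the fragment into a ?_escaped_fragment_= query parameter, which the server must answer with a pre-rendered HTML snapshot. A minimal server-side sketch, assuming Flask and hypothetical template names:

    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route("/")
    def index():
        fragment = request.args.get("_escaped_fragment_")
        if fragment is not None:
            # The crawler translated example.com/#!mypage into
            # example.com/?_escaped_fragment_=mypage: return static HTML
            # containing the content the AJAX call would have loaded.
            return render_template("snapshot.html", page=fragment)
        # Normal visitors get the JavaScript-driven page.
        return render_template("index.html")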

Testing a website using C# [closed]

旧城冷巷雨未停 submitted on 2019-12-23 01:10:52
Question (closed as needing more focus): Folks, I need to accomplish some sophisticated web crawling. The goal, in simple words: log in to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page. What is the best approach? Some unit-testing 3rd-party lib? Manual…
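One common approach is browser automation with Selenium, whose C# bindings mirror the Python API almost one-for-one. A sketch of the login/fill/submit/extract flow, written in Python for consistency with the other snippets here; the URL, field names, and result element are hypothetical:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Log in: fill the form fields and submit.
        driver.get("https://example.com/login")
        driver.find_element(By.NAME, "username").send_keys("user")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()

        # Extract a value from the page returned after submit.
        value = driver.find_element(By.ID, "result").text
        print(value)
    finally:
        driver.quit()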

Is there a better approach to using BeautifulSoup in my Python web crawler code?

陌路散爱 submitted on 2019-12-23 00:42:12
Question: I'm trying to crawl information from URLs in a page and save it in a text file. I received great help in the question How to get the right source code with Python from the URLs using my web crawler?, and I've tried to use what I learned about BeautifulSoup to finish my code based on that question. But when I look at my code, although it satisfies my needs, it looks pretty messy. Can anyone help me optimize it a little, especially the BeautifulSoup part? Such as the…
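Without the asker's code quoted here, only a generic cleanup pattern can be offered: fetch and parse once, lean on find_all with specific arguments rather than nested loops, and write the output in a single pass. A sketch with an assumed URL and selector:

    import requests
    from bs4 import BeautifulSoup

    def crawl_links(url, outfile):
        # One request, one parse; reuse `soup` for every lookup.
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        with open(outfile, "w", encoding="utf-8") as f:
            # href=True skips anchors that have no link at all.
            for a in soup.find_all("a", href=True):
                f.write(f"{a.get_text(strip=True)}\t{a['href']}\n")

    crawl_links("https://example.com", "links.txt")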

How to recursively crawl subpages with Scrapy

徘徊边缘 submitted on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow a sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name…
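A common Scrapy pattern for this: pass the partially built record down through each request's cb_kwargs, yield one flat row at the deepest level, and group the rows into the nested JSON after the crawl. A sketch with hypothetical URLs and CSS selectors:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = "categories"
        start_urls = ["https://example.com/categories"]  # hypothetical

        def parse(self, response):
            # Level 1: each category links to a page of subcategories.
            for cat in response.css("ul.categories li"):
                yield response.follow(
                    cat.css("a::attr(href)").get(),
                    callback=self.parse_category,
                    cb_kwargs={"category": cat.css("a::text").get()},
                )

        def parse_category(self, response, category):
            # Level 2: each subcategory links to the page holding the data.
            for sub in response.css("ul.subcategories li"):
                yield response.follow(
                    sub.css("a::attr(href)").get(),
                    callback=self.parse_subcategory,
                    cb_kwargs={"category": category,
                               "subcategory": sub.css("a::text").get()},
                )

        def parse_subcategory(self, response, category, subcategory):
            # Level 3: one flat record; the nesting can be rebuilt from
            # the (category, subcategory) keys once the crawl finishes.
            yield {
                "category": category,
                "subcategory": subcategory,
                "data": " ".join(response.css("div.content ::text").getall()).strip(),
            }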

Using Scrapy to find specific text across multiple websites

只愿长相守 submitted on 2019-12-22 18:08:32
Question: I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't find how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report which link it was found in. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find this. Thank you.

    class FinalSpider(scrapy.Spider):
        name = "final"
        allowed_domains = […
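To make the spider report which page contains the keyword, check the response in the callback and yield the URL on a match. A sketch completing the idea; the domain, URL list, and keyword are placeholders:

    import scrapy

    class FinalSpider(scrapy.Spider):
        name = "final"
        allowed_domains = ["example.com"]           # placeholder domain
        start_urls = ["https://example.com/page1",  # placeholder pages to check
                      "https://example.com/page2"]
        keyword = "discount"                        # the text to look for

        def parse(self, response):
            # Search the visible text, not the raw HTML, for the keyword.
            page_text = " ".join(response.css("body ::text").getall())
            if self.keyword.lower() in page_text.lower():
                yield {"url": response.url, "keyword": self.keyword}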

How do I scrape HTML between two HTML comments using Nokogiri?

随声附和 submitted on 2019-12-22 18:08:24
Question: I have some HTML pages where the content to be extracted is marked with HTML comments, like below:

    <html>
    .....
    <!-- begin content -->
    <div>some text</div>
    <div><p>Some more elements</p></div>
    <!-- end content -->
    ...
    </html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments. I want to extract the full elements between these two HTML comments:

    <div>some text</div>
    <div><p>Some more elements</p></div>

I can get the text…
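Nokogiri is Ruby, but since the rest of this digest's snippets are Python, here is the same idea sketched with BeautifulSoup (a swapped-in library, not the asker's): find the two Comment nodes, then collect every sibling between them.

    from bs4 import BeautifulSoup, Comment

    html = """<html>
    <!-- begin content -->
    <div>some text</div>
    <div><p>Some more elements</p></div>
    <!-- end content -->
    </html>"""

    soup = BeautifulSoup(html, "html.parser")

    # Locate the two marker comments by their text.
    begin = soup.find(string=lambda s: isinstance(s, Comment) and "begin content" in s)
    end = soup.find(string=lambda s: isinstance(s, Comment) and "end content" in s)

    # Walk the siblings after the begin marker until the end marker.
    content = []
    node = begin.next_sibling
    while node is not None and node is not end:
        content.append(str(node))
        node = node.next_sibling

    print("".join(content).strip())
    # -> <div>some text</div> and <div><p>Some more elements</p></div>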