web-crawler

How to scrape an XML feed with XMLFeedSpider

ぐ巨炮叔叔 submitted on 2019-12-23 02:51:18
Question: I am trying to scrape an XML file with the following format (file_sample.xml):

    <rss version="2.0">
      <channel>
        <item>
          <title>SENIOR BUDGET ANALYST (new)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All Open Jobs</category>
        </item>
        <item>
          <title>BUDGET ANALYST (healthcare)</title>
          <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
          <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
          <category>All category</category>
        </item>
        <…
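A minimal XMLFeedSpider sketch for a feed shaped like the sample above (the spider name, start URL, and output field names are assumptions for illustration):

    from scrapy.spiders import XMLFeedSpider

    class JobsFeedSpider(XMLFeedSpider):
        name = "jobs_feed"
        # Hypothetical location of the feed; replace with the real URL or file path.
        start_urls = ["https://hr.example.org/file_sample.xml"]
        iterator = "iternodes"   # fast pull parser; fine here, no namespaces in this feed
        itertag = "item"         # parse_node is called once per <item> element

        def parse_node(self, response, node):
            # Each XPath is relative to the current <item> node.
            yield {
                "title": node.xpath("title/text()").get(),
                "link": node.xpath("link/text()").get(),
                "pub_date": node.xpath("pubDate/text()").get(),
                "category": node.xpath("category/text()").get(),
            }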

How to simulate an XHR request with Scrapy when trying to crawl data from an AJAX-based website?

夙愿已清 submitted on 2019-12-23 02:43:18
Question: I am new to crawling web pages with Scrapy and unfortunately chose a dynamic site to start with... I've successfully crawled part of it (120 links), thanks to someone helping me here, but not the links I actually need on the target website. After doing some research, I know that crawling an AJAX site boils down to these simple steps:
• open the browser developer tools, Network tab
• go to the target site
• click the submit button and see what XHR request goes to the server
• simulate this XHR request in your spider
The last one…
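That last step usually amounts to reproducing, in the spider, the exact request seen in the Network tab. A minimal sketch using scrapy.FormRequest, where the endpoint, POST payload, and JSON response shape are all assumptions:

    import json
    import scrapy

    class XhrSpider(scrapy.Spider):
        name = "xhr_demo"

        def start_requests(self):
            # Hypothetical endpoint, copied from the browser's Network tab.
            yield scrapy.FormRequest(
                url="https://example.com/ajax/search",
                formdata={"page": "1", "category": "all"},       # the XHR's POST payload
                headers={"X-Requested-With": "XMLHttpRequest"},  # many sites check this
                callback=self.parse_xhr,
            )

        def parse_xhr(self, response):
            # Assuming the server answers with JSON; adjust to the real schema.
            data = json.loads(response.text)
            for row in data.get("results", []):
                yield {"link": row.get("url"), "title": row.get("title")}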

How to send crawler data to PHP via the command line?

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 02:42:29
Question: Can I send the results to PHP rather than storing them in the JSON file? I have these two files. settings.json:

    {
      "outputFile": "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.json",
      "logFile": "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.tsv",
      "pause": 1,
      "local": false,
      "connections": 3,
      "cookiesEnabled": false,
      "robotsDisabled": false,
      "advancedMode": true,
      "crawlTemplate": ["www.mimshoes.com/"],
      "startUrls": [PAGES],
      "maxDepth": 10,
      "dataTemplate": ["www.mimshoes.com…
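One way to hand the results to PHP without the intermediate file is to pipe the JSON to a PHP CLI script over stdin. A sketch of the Python side, where process.php is a hypothetical script that reads and decodes its standard input:

    import json
    import subprocess

    results = [{"name": "shoe", "price": "49.99"}]  # stand-in for real crawl output

    # Run the PHP CLI and feed it the JSON document on stdin.
    proc = subprocess.run(
        ["php", "process.php"],
        input=json.dumps(results),
        capture_output=True,
        text=True,
        check=True,
    )
    print(proc.stdout)  # whatever process.php echoes back

On the PHP side, json_decode(file_get_contents('php://stdin'), true) would recover the same structure.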

Google crawler, cron, and CodeIgniter sessions

≯℡__Kan透↙ submitted on 2019-12-23 02:34:20
Question: I am running a CodeIgniter 2.0 web app, and I am using the sessions library with the DB option on, so for every connection to my website a row is written to a MySQL table called 'ci_sessions' storing: session_id, ip_address, user_agent, last_activity, and user_data. And I have two issues: the Google bot with IP 66.249.72.152, and cron with IP 0.0.0.0. Every time my server runs the cron, or every time the Google bot crawls my page, a new session is created. So I have hundreds of identical sessions with the IP 0.0.0.0 and…
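A common fix is to skip session creation entirely for bots and cron/CLI hits, for example via a MY_Session override that consults CodeIgniter's User Agent class (is_robot()) before writing to the DB. The gate itself is simple; here it is sketched in Python for consistency with the rest of this digest, with an assumed bot pattern mirroring the two offenders above:

    import re

    # User agents that should never get a DB-backed session (assumed list).
    BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|spider|crawler", re.IGNORECASE)

    def should_create_session(user_agent: str, ip: str) -> bool:
        """Return False for crawlers and for cron/CLI hits (IP 0.0.0.0)."""
        if ip == "0.0.0.0":
            return False
        return not BOT_PATTERN.search(user_agent or "")

    # The two offenders from the question are both rejected:
    print(should_create_session("Googlebot/2.1", "66.249.72.152"))  # False
    print(should_create_session("", "0.0.0.0"))                     # False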

How do I make my AJAX content crawlable by Google?

会有一股神秘感。 submitted on 2019-12-23 02:17:17
Question: I've been working on a site that uses jQuery heavily and loads in content via AJAX, like so:

    $('#newPageWrapper').load(newPath + ' .pageWrapper', function() {
        // on-load logic
    });

It has now come to my attention that Google won't index any content loaded dynamically via JavaScript, and so I've been looking for a solution to the problem. I've read through Google's Making AJAX Applications Crawlable document what seems like 100 times and I still don't understand how to implement it (due in the most…
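The scheme that document describes (Google has since deprecated it) works like this: the client uses #! URLs, and when the crawler visits, it rewrites the fragment into a ?_escaped_fragment_= query parameter, which the server must answer with a pre-rendered HTML snapshot. A minimal server-side sketch, assuming Flask and hypothetical template names:

    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route("/")
    def index():
        fragment = request.args.get("_escaped_fragment_")
        if fragment is not None:
            # The crawler translated example.com/#!mypage into
            # example.com/?_escaped_fragment_=mypage: return static HTML
            # containing the content the AJAX call would have loaded.
            return render_template("snapshot.html", page=fragment)
        # Normal visitors get the JavaScript-driven page.
        return render_template("index.html")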

Testing a website using C# [closed]

旧城冷巷雨未停 submitted on 2019-12-23 01:10:52
Question (closed as needing more focus): Folks, I need to accomplish some sophisticated web crawling. The goal, in simple words: log in to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page. What is the best approach? Some unit-testing 3rd-party lib? Manual…
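One common approach is browser automation with Selenium, whose C# bindings mirror the Python API almost one-for-one. A sketch of the login/fill/submit/extract flow, written in Python for consistency with the other snippets here; the URL, field names, and result element are hypothetical:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        # Log in: fill the form fields and submit.
        driver.get("https://example.com/login")
        driver.find_element(By.NAME, "username").send_keys("user")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()

        # Extract a value from the page returned after submit.
        value = driver.find_element(By.ID, "result").text
        print(value)
    finally:
        driver.quit()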

Is there a better approach to using BeautifulSoup in my Python web crawler code?

陌路散爱 submitted on 2019-12-23 00:42:12
Question: I'm trying to crawl information from URLs in a page and save it in a text file. I received great help in the question How to get the right source code with Python from the URLs using my web crawler?, and I've tried to use what I learned about BeautifulSoup to finish my code based on that question. But when I look at my code, although it satisfies my needs, it looks pretty messy. Can anyone help me optimize it a little, especially the BeautifulSoup part? Such as the…
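Without the asker's code quoted here, only a generic cleanup pattern can be offered: fetch and parse once, lean on find_all with specific arguments rather than nested loops, and write the output in a single pass. A sketch with an assumed URL and selector:

    import requests
    from bs4 import BeautifulSoup

    def crawl_links(url, outfile):
        # One request, one parse; reuse `soup` for every lookup.
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        with open(outfile, "w", encoding="utf-8") as f:
            # href=True skips anchors that have no link at all.
            for a in soup.find_all("a", href=True):
                f.write(f"{a.get_text(strip=True)}\t{a['href']}\n")

    crawl_links("https://example.com", "links.txt")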

How to recursively crawl subpages with Scrapy

徘徊边缘 submitted on 2019-12-22 18:36:27
Question: So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow a sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

    Category 1 name
        Subcategory 1 name
            data from this subcategory's page
        Subcategory n name
            data from this page
    Category n name
        Subcategory 1 name…
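A common Scrapy pattern for this: pass the partially built record down through each request's cb_kwargs, yield one flat row at the deepest level, and group the rows into the nested JSON after the crawl. A sketch with hypothetical URLs and CSS selectors:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = "categories"
        start_urls = ["https://example.com/categories"]  # hypothetical

        def parse(self, response):
            # Level 1: each category links to a page of subcategories.
            for cat in response.css("ul.categories li"):
                yield response.follow(
                    cat.css("a::attr(href)").get(),
                    callback=self.parse_category,
                    cb_kwargs={"category": cat.css("a::text").get()},
                )

        def parse_category(self, response, category):
            # Level 2: each subcategory links to the page holding the data.
            for sub in response.css("ul.subcategories li"):
                yield response.follow(
                    sub.css("a::attr(href)").get(),
                    callback=self.parse_subcategory,
                    cb_kwargs={"category": category,
                               "subcategory": sub.css("a::text").get()},
                )

        def parse_subcategory(self, response, category, subcategory):
            # Level 3: one flat record; the nesting can be rebuilt from
            # the (category, subcategory) keys once the crawl finishes.
            yield {
                "category": category,
                "subcategory": subcategory,
                "data": " ".join(response.css("div.content ::text").getall()).strip(),
            }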

Using Scrapy to find specific text across multiple websites

只愿长相守 submitted on 2019-12-22 18:08:32
Question: I would like to crawl/check multiple websites (on the same domain) for a specific keyword. I have found this script, but I can't find how to add the specific keyword to be searched for. What the script needs to do is find the keyword and report which link it was found in. Could anyone point me to where I could read more about this? I have been reading Scrapy's documentation, but I can't seem to find this. Thank you.

    class FinalSpider(scrapy.Spider):
        name = "final"
        allowed_domains = […
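To make the spider report which page contains the keyword, check the response in the callback and yield the URL on a match. A sketch completing the idea; the domain, URL list, and keyword are placeholders:

    import scrapy

    class FinalSpider(scrapy.Spider):
        name = "final"
        allowed_domains = ["example.com"]           # placeholder domain
        start_urls = ["https://example.com/page1",  # placeholder pages to check
                      "https://example.com/page2"]
        keyword = "discount"                        # the text to look for

        def parse(self, response):
            # Search the visible text, not the raw HTML, for the keyword.
            page_text = " ".join(response.css("body ::text").getall())
            if self.keyword.lower() in page_text.lower():
                yield {"url": response.url, "keyword": self.keyword}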

How do I scrape HTML between two HTML comments using Nokogiri?

随声附和 submitted on 2019-12-22 18:08:24
Question: I have some HTML pages where the content to be extracted is marked with HTML comments, like below:

    <html>
    .....
    <!-- begin content -->
    <div>some text</div>
    <div><p>Some more elements</p></div>
    <!-- end content -->
    ...
    </html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments. I want to extract the full elements between these two HTML comments:

    <div>some text</div>
    <div><p>Some more elements</p></div>

I can get the text…
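Nokogiri is Ruby, but since the rest of this digest's snippets are Python, here is the same idea sketched with BeautifulSoup (a swapped-in library, not the asker's): find the two Comment nodes, then collect every sibling between them.

    from bs4 import BeautifulSoup, Comment

    html = """<html>
    <!-- begin content -->
    <div>some text</div>
    <div><p>Some more elements</p></div>
    <!-- end content -->
    </html>"""

    soup = BeautifulSoup(html, "html.parser")

    # Locate the two marker comments by their text.
    begin = soup.find(string=lambda s: isinstance(s, Comment) and "begin content" in s)
    end = soup.find(string=lambda s: isinstance(s, Comment) and "end content" in s)

    # Walk the siblings after the begin marker until the end marker.
    content = []
    node = begin.next_sibling
    while node is not None and node is not end:
        content.append(str(node))
        node = node.next_sibling

    print("".join(content).strip())
    # -> <div>some text</div> and <div><p>Some more elements</p></div>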