web-crawler

Is it safe to use the same CookieContainer across multiple HttpWebRequests?

Submitted by 房东的猫 on 2019-12-07 07:49:42
Question: I am building a kind of web crawler and I need to persist cookie state between requests. I download all pages asynchronously, creating new HttpWebRequest instances but setting the same CookieContainer on each of them. The pages can read and write cookies. Can I do this safely? Is there any alternative that doesn't involve subclassing CookieContainer and putting a lock around every method? MSDN says this class isn't thread safe, but in practice, can I do it?

Answer 1: According to the documentation: "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."

Post Username and Password to login page programmatically

Submitted by 泪湿孤枕 on 2019-12-07 06:28:51
Question: I want to post a username and password to the login page of a remote website from ASP.NET and pass the login, in order to access login-protected pages on that site. In other words, suppose there is a page on a website that I would like to visit and grab something from, but login is required first. How can I call that login page and post the username and password from my ASP.NET application to get past it? Thanks in advance.

Answer 1: The comment about passing them as the query string only works for GET parameters. This…
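A minimal sketch of the same flow in Python with the requests library (the question is about ASP.NET, but the pattern is identical): POST the credentials once, let the session keep the cookies, then reuse that session for the protected pages. The URL and form field names below are placeholders and must be taken from the real login form.

    import requests

    session = requests.Session()  # remembers cookies between requests

    # Field names must match the login form's <input name="..."> attributes.
    payload = {"username": "me@example.com", "password": "secret"}
    session.post("https://example.com/login", data=payload)

    # The session now carries the auth cookie, so protected pages are reachable.
    page = session.get("https://example.com/members/orders")
    print(page.status_code, len(page.text))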

Python multithreading crawler

Submitted by 橙三吉。 on 2019-12-07 06:19:45
Question: Hello! I am trying to write a web crawler in Python and I want to use multithreading. Even after reading the papers and tutorials suggested earlier, I still have a problem. My code is here (whole source code is here):

    class Crawler(threading.Thread):
        global g_URLsDict
        varLock = threading.Lock()
        count = 0

        def __init__(self, queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.url = self.queue.get()

        def run(self):
            while 1:
                print self.getName() + " started"
                self.page = getPage(self.url)
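For reference, a minimal Python 3 sketch of the usual structure (an illustration, not the poster's full program): each worker thread pulls a fresh URL from the queue on every pass of its loop, whereas the snippet above reads the queue only once in __init__ and then loops on that single URL. fetch_page below is a stand-in for the post's getPage().

    import threading
    import queue
    import urllib.request

    def fetch_page(url):
        # Stand-in fetch helper.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    class Crawler(threading.Thread):
        def __init__(self, url_queue):
            super().__init__()
            self.url_queue = url_queue

        def run(self):
            while True:
                url = self.url_queue.get()   # take a new URL each iteration
                if url is None:              # sentinel: no more work
                    break
                try:
                    page = fetch_page(url)
                    print("%s fetched %d bytes from %s" % (self.name, len(page), url))
                finally:
                    self.url_queue.task_done()

    url_queue = queue.Queue()
    for u in ["https://example.com/", "https://example.org/"]:
        url_queue.put(u)

    workers = [Crawler(url_queue) for _ in range(4)]
    for w in workers:
        w.start()
    url_queue.join()
    for _ in workers:
        url_queue.put(None)   # one sentinel per worker
    for w in workers:
        w.join()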

What does it mean to say a web crawler is I/O bound and not CPU bound?

Submitted by 那年仲夏 on 2019-12-07 05:26:50
Question: I've seen this in some answers on Stack Overflow where the point is made that the programming language doesn't matter much for a crawler, so C++ is overkill compared with, say, Python. Can someone please explain this in layman's terms so that there's no ambiguity about what is implied? Clarification of the underlying assumption would also be appreciated. Thanks.

Answer 1: It means that I/O is the bottleneck here. The act of going out to the network to retrieve a page (I/O) is slower than analysing the page (CPU). So, …
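A rough way to see this in practice (a sketch under assumed URLs, not part of the quoted answer): almost all of a crawler's wall-clock time is spent waiting for responses, so running the fetches concurrently with threads gives a large speed-up even though the parsing work itself is unchanged.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URLS = ["https://example.com/"] * 8   # placeholder URLs

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return len(resp.read())

    start = time.time()
    serial_sizes = [fetch(u) for u in URLS]           # waits for each response in turn
    serial = time.time() - start

    start = time.time()
    with ThreadPoolExecutor(max_workers=8) as pool:   # overlaps the network waits
        threaded_sizes = list(pool.map(fetch, URLS))
    threaded = time.time() - start

    print("serial: %.2fs  threaded: %.2fs" % (serial, threaded))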

Formatting Scrapy's output to XML

Submitted by 扶醉桌前 on 2019-12-07 04:07:13
Question: I am attempting to export data scraped from a website with Scrapy into a particular XML format. Here is what I would like my XML to look like:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
      <row>
        <field1><![CDATA[Data Here]]></field1>
        <field2><![CDATA[Data Here]]></field2>
      </row>
    </data>

I am running my scrape with the command:

    $ scrapy crawl my_scrap -o items.xml -t xml

The current output I am getting is along the lines of:

    <?xml version="1.0" encoding="utf-8"?…
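One way the <data>/<row> element names can be obtained (a sketch, not the accepted answer from the thread): Scrapy's built-in XmlItemExporter takes root_element and item_element arguments, so a small subclass registered under FEED_EXPORTERS can rename them; wrapping field values in CDATA would additionally require customising how fields are serialised. The module path myproject.exporters is hypothetical, and in recent Scrapy versions the base class lives in scrapy.exporters.

    # myproject/exporters.py
    from scrapy.exporters import XmlItemExporter

    class RowXmlItemExporter(XmlItemExporter):
        def __init__(self, file, **kwargs):
            # Rename the root and per-item elements to match the desired layout.
            kwargs.setdefault('root_element', 'data')
            kwargs.setdefault('item_element', 'row')
            super().__init__(file, **kwargs)

    # settings.py
    FEED_EXPORTERS = {
        'xml': 'myproject.exporters.RowXmlItemExporter',
    }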

Log in to LinkedIn with JSoup

Submitted by 若如初见. on 2019-12-07 04:05:27
Question: I need to log in to LinkedIn, preferably with Jsoup. This is what I use to log in to another website, but it isn't working for LinkedIn:

    Connection.Response res = Jsoup
        .connect("https://www.linkedin.com/uas/login?goback=&trk=hb_signin")
        .data("session_key", mail, "session_password", password)
        .method(Connection.Method.POST)
        .timeout(60000)
        .execute();
    // Also tried "https://www.linkedin.com/uas/login-submit"

    Map<String, String> loginCookies = res.cookies();
    // Checking a profile to see if it was…

Scrapy Deploy Doesn't Match Debug Result

Submitted by 给你一囗甜甜゛ on 2019-12-07 04:04:27
I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic: go to the homepage, where there is a category list that is used to build the second wave of links. The second round of links is usually the first page of each category. Also, the different pages inside a category follow the same regular-expression patterns, wholesale/something/something/request or wholesale/pagenumber, and I want to follow those patterns to keep crawling while storing the raw HTML in my item object. I tested these two steps separately by using the parse…
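A sketch of how such pattern-following is often expressed with Scrapy's CrawlSpider (an illustration under assumed URL patterns and item fields, not the poster's actual spider): LinkExtractor rules whose allow= regexes match the wholesale/... paths, with a callback that stores the raw HTML.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class WholesaleSpider(CrawlSpider):
        name = 'wholesale'
        allowed_domains = ['myproject.com']
        start_urls = ['http://myproject.com/']

        rules = (
            # Follow pages matching the two URL patterns, parse each one,
            # and keep following links found on them.
            Rule(LinkExtractor(allow=(r'wholesale/[^/]+/[^/]+/request',
                                      r'wholesale/\d+')),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # 'url' and 'raw_html' are hypothetical item fields.
            yield {'url': response.url, 'raw_html': response.text}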

BingPreview invalidates one-time links in email

Submitted by 对着背影说爱祢 on 2019-12-07 01:53:49
Question: It seems that Outlook.com uses the BingPreview crawler to crawl links in emails, but the one-time links are then marked as used/expired after the email is opened and before the user gets a chance to use them. I tried adding rel="nofollow" to the <a>, but without success. How can I block the crawler for the links in the email? Thanks.

Answer 1: I did the same.

    $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    // Deny access for the BingPreview bot, used by outlook.com on…

Scrapy - Select specific link based on text

Submitted by 我只是一个虾纸丫 on 2019-12-06 23:09:05
Question: This should be easy, but I'm stuck.

    <div class="paginationControl">
      <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&powerunit=2">Link Text 2</a> |
      <a href="/en/overview/0-All_manufactures/0-All_models.html?page=3&powerunit=2">Link Text 3</a> |
      <a href="/en/overview/0-All_manufactures/0-All_models.html?page=4&powerunit=2">Link Text 4</a> |
      <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&powerunit=2">Link Text 5</a> |
      <!-- Next page link -->
      <a href="/en…
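For what the title asks, a common way to pick one link by its visible text in Scrapy is an XPath predicate on the anchor text (a sketch using the class name and link text visible in the snippet above; the spider name and start URL are placeholders):

    import scrapy

    class PaginationSpider(scrapy.Spider):
        name = 'pagination_example'
        start_urls = ['http://example.com/en/overview/0-All_manufactures/0-All_models.html']

        def parse(self, response):
            # Select the pagination link whose text is exactly "Link Text 3".
            href = response.xpath(
                '//div[@class="paginationControl"]'
                '/a[normalize-space(text())="Link Text 3"]/@href'
            ).get()
            if href:
                yield response.follow(href, callback=self.parse)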

Is there any way of making JSON data readable by a Google spider?

Submitted by 风格不统一 on 2019-12-06 22:20:29
Question: Is it possible to make JSON data readable by a Google spider? Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e. the translation from JSON data to the page the human sees is done inside the user's browser; not my choice, just what I've been given to work with, since it's an old legacy CGI application and not an actual server-side scripting language.) My concern here is that the…