web-crawler

Selective data extraction from a forum site using a PHP DOM web crawler

Submitted by 二次信任 on 2019-12-13 08:26:55
Question: I have a PHP DOM web crawler which works fine: it extracts the mentioned tag, along with its link, from an (external) forum site onto my page. But recently I ran into a problem. This is the HTML of the forum data:

<tbody>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class=
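The row in the excerpt above can also be parsed in Python. A minimal sketch (not the asker's PHP code, and assuming BeautifulSoup is available) that pulls the thread title, link, and trailing author name out of each `FootNotes2` cell; the sample HTML is trimmed to the complete cells shown above:

```python
# Sketch: parse the forum row shown in the question with BeautifulSoup.
from bs4 import BeautifulSoup

HTML = '''<tbody><tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2">
  <a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a>
  - dreamer1984</td>
</tr></tbody>'''

def extract_threads(html):
    """Return (title, link, author) tuples from forum rows."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for cell in soup.find_all("td", class_="FootNotes2"):
        a = cell.find("a", class_="Links2")
        if a is None:
            continue
        # The author trails the link as plain text, e.g. " - dreamer1984".
        author = cell.get_text().split(" - ")[-1].strip()
        results.append((a.get_text(strip=True), a["href"], author))
    return results

print(extract_threads(HTML))
```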

How to fix Scrapy rules when only one rule is followed

Submitted by 心不动则不痛 on 2019-12-13 07:36:56
Question: This code is not working:

name = "souq_com"
allowed_domains = ['uae.souq.com']
start_urls = ["http://uae.souq.com/ae-en/shop-all-categories/c/"]
rules = (
    # categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'), unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a'), unique=True), callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'), unique=True)),
)

The first rule is getting responses

Itertools within web_crawler giving wrong triples

Submitted by 血红的双手。 on 2019-12-13 07:24:52
Question: I have written some code to parse the name, link and price from Craigslist. When I print the results, they are scraped as lists. I tried a workaround like the code pasted below, but it gives wrong triples: especially when a value is None, it takes the next available value from another triple, and so on. For this reason it is of no use in this case. I hope to get a suggestion on how I can accomplish this, whether with itertools or any other method. import requests from
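The misalignment described above is the usual symptom of collecting names, links and prices into three separate lists and then re-combining them with itertools: as soon as one listing lacks a price, every later triple shifts. A hedged sketch of the common fix (the HTML structure below is invented for illustration, and BeautifulSoup is assumed): extract all three fields from each listing element, so a missing value stays attached to its own row as None:

```python
# Sketch: per-row extraction keeps gaps in place instead of shifting triples.
from bs4 import BeautifulSoup

HTML = '''
<li class="result-row">
  <a class="result-title" href="/post/1">Bike</a>
  <span class="result-price">$120</span>
</li>
<li class="result-row">
  <a class="result-title" href="/post/2">Desk</a>
</li>
'''

def parse_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    triples = []
    for row in soup.find_all("li", class_="result-row"):
        title = row.find("a", class_="result-title")
        price = row.find("span", class_="result-price")
        triples.append((
            title.get_text(strip=True),
            title["href"],
            price.get_text(strip=True) if price else None,  # gap stays in its own row
        ))
    return triples

print(parse_listings(HTML))
```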

Requests: Explanation of the .text format

Submitted by 最后都变了- on 2019-12-13 07:17:15
Question: I'm using the requests module with Python 2.7 to build a basic web crawler.

source_code = requests.get(url)
plain_text = source_code.text

In the above lines of code, I'm storing the source code of the specified URL and other metadata in the source_code variable. In source_code.text, what exactly is the .text attribute? It is not a function. I couldn't find anything in the documentation that explains the origin or purpose of .text either.

Answer 1: requests.get() returns a
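The distinction the answer is heading toward — `.text` is a property of the Response object, not a method — can be illustrated without a network call. `response.content` holds the raw bytes of the body; `response.text` is those same bytes decoded to a string using `response.encoding`. A stdlib-only sketch of what that decoding step does:

```python
# Sketch: what response.text computes from response.content and response.encoding.
raw_bytes = b"caf\xc3\xa9"            # like response.content: raw UTF-8 bytes
encoding = "utf-8"                    # like response.encoding: detected charset
decoded = raw_bytes.decode(encoding)  # like response.text: a decoded string
print(decoded)  # café
```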

How to optimize scroll-down code in Java using Selenium

Submitted by 家住魔仙堡 on 2019-12-13 07:09:20
Question: I am working on a Maven project in Java. I have to open a URL, scroll the page down, and get all the links to the other items on that page. So far I load the page dynamically using Selenium, scroll it down, and fetch the links, but it takes too much time. Please help me optimize it. Example: I am working on a page whose link is here. My questions: Scrolling a web page using Selenium is very slow. How can I optimize this? (Suggest any other method to do the same

Web crawler not working on nested divs

Submitted by 孤街浪徒 on 2019-12-13 06:07:51
Question: I am trying to make a web crawler that picks up people's interests. Here is the code:

import requests
from bs4 import BeautifulSoup

def facebook_spider():
    url = 'https://www.facebook.com/abhas.mittal7'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for div in soup.findAll('div', attrs={'class': 'mediaRowWrapper'}):
        print div.text

facebook_spider()

It is not showing any results. However, if I type in a different class of div
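Nesting is unlikely to be the cause here: BeautifulSoup's `find_all`/`findAll` does descend into nested divs. A small check with made-up HTML (Python 3, BeautifulSoup assumed available) demonstrates that; an empty result from the spider above more likely means the class is simply absent from the HTML that requests received — Facebook builds much of its page with JavaScript, which requests never executes:

```python
# Sketch: find_all reaches divs at any nesting depth in static HTML.
from bs4 import BeautifulSoup

HTML = '''
<div class="outer">
  <div class="wrapper">
    <div class="mediaRowWrapper">Music</div>
    <div class="mediaRowWrapper">Movies</div>
  </div>
</div>
'''

soup = BeautifulSoup(HTML, "html.parser")
interests = [d.get_text(strip=True)
             for d in soup.find_all("div", attrs={"class": "mediaRowWrapper"})]
print(interests)
```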

How does Google crawl content loaded with jQuery's load function?

Submitted by 為{幸葍}努か on 2019-12-13 05:19:33
Question: I have a question regarding SEO when you use the .load functionality in jQuery. You can load a document by referring to the href value of the link you clicked. In this first case, the folder name where the HTML documents are stored (../ajax/) is mentioned in the tag, not in the jQuery:

Code:

<a href="ajax/test.html">test</a>

var thelink = $(this).attr('href');
$('#content').load(thelink);

Or you can load a document by adding the folder name of your HTML documents in your jQuery and not in your

Scrapy returning a null output when extracting an element from a table using XPath

Submitted by 孤人 on 2019-12-13 04:48:53
Question: I have been trying to scrape this website, which has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL

Scrapy scrapes the website and returns the URL, but when I try to extract an element inside a table using its XPath (the county of the oil well), all I get is a null output, i.e. []. This happens for any element I try to access on the page. Here is my spider:

import scrapy
import json

class coloradoSpider(scrapy.Spider):
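A frequent cause of the empty `[]` on table-heavy pages like this is that the browser's "Copy XPath" includes `<tbody>` elements the browser inserted while rendering but the raw HTML Scrapy downloads does not contain. Matching on the label cell is more robust than an absolute path. A hedged sketch of that approach (the sample HTML below is invented to mimic the layout, and the matching is done with BeautifulSoup rather than Scrapy's selectors):

```python
# Sketch: locate a table value by its label cell instead of an absolute XPath.
from bs4 import BeautifulSoup

HTML = '''
<table>
  <tr><td class="label">County:</td><td>WELD</td></tr>
  <tr><td class="label">Status:</td><td>PR</td></tr>
</table>
'''

def field(html, label):
    """Return the text of the cell following the cell whose text is `label`."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td"):
        if td.get_text(strip=True).rstrip(":") == label:
            nxt = td.find_next_sibling("td")
            return nxt.get_text(strip=True) if nxt else None
    return None

print(field(HTML, "County"))
```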

Invalid Cookie Header, and then it asks for Authorization

Submitted by 久未见 on 2019-12-13 04:46:49
Question: I am trying to crawl a page that requires SiteMinder authentication, so I am trying to pass my username and password in the code itself to access that page and keep crawling all the links on it. This is my Controller.java code, from which the MyCrawler class is called:

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://ho.somehost
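If the "Authorization" prompt the question mentions is plain HTTP Basic auth, the header the crawler must send can be built and inspected without any network call; a hedged sketch in Python with the requests library (credentials and URL are invented). Note that SiteMinder deployments often use a form-plus-cookie login instead, in which case the crawler needs a session that POSTs the login form first:

```python
# Sketch: the Authorization header HTTP Basic auth adds to a request.
import requests

prepared = requests.Request(
    "GET", "http://example.com/protected",   # hypothetical protected URL
    auth=("user", "pass"),                   # hypothetical credentials
).prepare()
print(prepared.headers["Authorization"])     # Basic dXNlcjpwYXNz
```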

Create dynamic sitemap from URL with Ruby on Rails

Submitted by 白昼怎懂夜的黑 on 2019-12-13 04:38:59
Question: I am currently working on an application where I scrape information from a number of different sites. To get the deep link for the desired topic on a site, I rely on the sitemap that is provided (e.g. "Forum"). As I am expanding, I came across some sites that don't provide a sitemap themselves, so I was wondering whether there is any way to generate one within Rails from the top-level domain. I am using Nokogiri and Mechanize to retrieve data, so if there is any functionality that could help to
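There is no way to ask a domain for its sitemap if none is published; the usual substitute is to generate one by breadth-first crawling from the top-level URL and keeping only same-domain links. A rough stdlib-only Python sketch of that idea (the real app would back `fetch` with Mechanize-style HTTP; here it is an injectable callable returning HTML for a URL, which also makes the logic testable offline):

```python
# Sketch: build a sitemap by BFS-crawling same-domain links from a root URL.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def build_sitemap(root, fetch, limit=100):
    """Return the set of same-domain URLs reachable from root."""
    domain = urlparse(root).netloc
    seen, queue = {root}, deque([root])
    while queue and len(seen) < limit:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # resolve + drop fragment
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

With a two-page stub site, `build_sitemap("http://example.com/", fetch)` returns the root and its forum page while excluding off-domain links.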