screen-scraping

Submitting queries to, and scraping results from, aspx pages using Python?

心已入冬 submitted on 2019-12-11 05:27:54
Question: I am trying to get results for a batch of queries to this demographics tool page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx The form's POST action calls the same page (_self) and is probably posting some event data. I read in another post here on Stack Overflow that aspx pages typically need some viewstate and validation data. Do I simply save these from a request and re-send them in a POST request? Or is there a cleaner way to do this? One of those aspx viewstate parameters
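Echoing the tokens back is indeed the usual approach: GET the page once, pull the hidden __VIEWSTATE and __EVENTVALIDATION inputs out of the HTML, and include them in the POST body alongside your own form fields. A minimal stdlib sketch of the extraction step (the sample form and token values below are made up; real tokens are much longer):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects <input type="hidden"> name/value pairs from a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if a.get("type") == "hidden" and "name" in a:
                self.fields[a["name"]] = a.get("value", "")

def extract_hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

sample = ('<form><input type="hidden" name="__VIEWSTATE" value="abc"/>'
          '<input type="hidden" name="__EVENTVALIDATION" value="xyz"/></form>')
fields = extract_hidden_fields(sample)
# merge your query parameters into `fields` and send them in the next POST
```

The same extraction must be repeated on every response, since ASP.NET rotates the viewstate between requests.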

Sockets receive hangs

喜欢而已 submitted on 2019-12-11 04:18:34
Question: I am trying to download the search pages of Bing, Google, Yahoo, and Ask using sockets; I have decided to use sockets instead of WebClient. socket.Receive() hangs after a few loops for Bing, Yahoo, and Google, but works for Ask. For Google, the loop will receive 4-5 times, then freeze on the call. I am not able to figure out why.

public string Get(string url)
{
    Uri requestedUri = new Uri(url);
    string fulladdress = requestedUri.Host;
    IPHostEntry entry = Dns.GetHostEntry(fulladdress);
    StringBuilder sb =
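A likely cause, assuming the request is HTTP/1.1 without a Connection header: HTTP/1.1 defaults to keep-alive, so after the body is sent the server leaves the socket open, and a receive loop that waits for the connection to close blocks forever (some servers, apparently Ask here, close it anyway). Sending "Connection: close" makes the server shut the socket when the response is complete, so the final receive returns 0 instead of hanging. A sketch of the request, shown in Python:

```python
def build_get_request(host, path="/"):
    # "Connection: close" tells the server to close the socket after the
    # response, so a blocking receive loop terminates instead of hanging
    # on a kept-alive connection.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

request = build_get_request("www.bing.com", "/search?q=test")
# send `request` over the socket, then recv() in a loop until it returns 0 bytes
```

The alternative is to parse the Content-Length header (or chunked encoding) and stop receiving once the body is complete, which is what WebClient does for you.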

HTML Agility Pack or HTML Screen Scraping libraries for Java, Ruby, Python?

Deadly submitted on 2019-12-11 04:16:42
Question: I found the HTML Agility Pack useful and easy to use for screen scraping web sites. What's the equivalent library for HTML screen scraping in Java, Ruby, or Python?

Answer 1: Found what I was looking for: Options for HTML scraping?

Answer 2: BeautifulSoup is the standard Python screen-scraping tool. Recently, however, I used the (currently incomplete) pyQuery, which is more or less a rewrite of jQuery in Python, and found it very useful.

Source: https://stackoverflow.com/questions/1060484/html
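For reference, the BeautifulSoup idiom that corresponds to the Agility Pack's node selection looks like this (a tiny made-up document; requires the third-party beautifulsoup4 package):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A minimal, invented document; the point is the find_all / get_text idiom.
html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = BeautifulSoup(html, "html.parser")
texts = [li.get_text() for li in soup.find_all("li", class_="item")]
```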

curl not working for getting a web page content, why?

喜你入骨 submitted on 2019-12-11 03:58:22
Question: I am using a curl script to go to a link and get its content for further manipulation. The following are the link and the curl script:

<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543';
// curl script to get content of the given url
$ch = curl_init();
// set the target url
curl_setopt($ch, CURLOPT_URL, $url);
// request as if Firefox
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("User
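Common reasons a fetch like this comes back empty: the server rejects requests without a browser-like User-Agent, it sets cookies or redirects that aren't being followed (CURLOPT_FOLLOWLOCATION), or CURLOPT_RETURNTRANSFER is unset so curl_exec() echoes instead of returning the page. The browser-masquerade part, sketched with Python's stdlib (the actual fetch is commented out since it needs network access):

```python
import urllib.request

url = ("http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi"
       "?serviceName=WebNSOR&templateName=detail.htm"
       "&requestingHandler=WebNSORDetailHandler&ID=368343543")

# Some servers return an empty or error page unless the request carries a
# browser-like User-Agent header.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
# html = urllib.request.urlopen(req, timeout=10).read()  # real fetch, needs network
```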

Cannot seem to scrape a div class tag in Node.js

我的未来我决定 submitted on 2019-12-11 03:34:30
Question: I'm new to Node.js. My experience is in Java and VBA. I'm trying to scrape a website for a friend, and all is going well until I can't get what I'm after:

<div class="gwt-Label ADC2X2-c-q ADC2X2-b-nb ADC2X2-b-Zb">Phone: +4576 102900</div>

That tag just has text, no attributes or anything. Yet I cannot scrape it using cheerio.

if (!err && resp.statusCode == 200) {
    var $ = cheerio.load(body);
    var number = $('//tried everything here!').text();
    console.log(number);

This function I also played
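In cheerio the selector only needs one of the space-separated classes, e.g. $('div.gwt-Label') — the obfuscated ADC2X2-* names can be ignored. If that still returns nothing, note that "gwt-Label" suggests a GWT application, which renders its content with JavaScript after page load, so the div may simply not exist in the raw body cheerio receives; a headless browser would then be required. The same single-class match, shown in Python with BeautifulSoup (assuming the div is present in the fetched HTML):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = ('<div class="gwt-Label ADC2X2-c-q ADC2X2-b-nb ADC2X2-b-Zb">'
        'Phone: +4576 102900</div>')
soup = BeautifulSoup(html, "html.parser")
# class_ matches against each of the element's space-separated classes,
# so one stable class name is enough to find the tag.
text = soup.find("div", class_="gwt-Label").get_text()
phone = text.replace("Phone:", "").strip()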

How to Obtain Lat/Lng of Google Maps Query String?

大憨熊 submitted on 2019-12-11 03:15:06
Question: I have this link: http://www.google.com/maps?cid=0,0,612446611849848549&f=q&source=embed&hl=en&geocode=&q=Универзална+Сала+&sll=,&&ie=UTF8&hq=&hnear=Универзална+Сала+&ll=,&z=15&iwloc=near What I want is to retrieve the lat/lng of the pinpointed place. I have already tried the Geocoding API: http://maps.googleapis.com/maps/api/geocode/xml?q=Универзална+Сала+&sensor=false but I am getting no results because the pinpoint refers to a place and not an address. How do I obtain the lat/lng of
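Note that in this particular link the ll= parameter is empty, so there are no coordinates in the URL itself to extract; the place is identified only by the cid, which would have to be resolved through a places lookup. When a maps URL does carry coordinates, they can be read straight from the query string. A sketch with a hypothetical URL (the coordinates below are invented for illustration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL with ll filled in; the asker's link has "ll=," (empty),
# so this parsing step only helps once the link actually carries coordinates.
url = "http://www.google.com/maps?f=q&q=Some+Place&ll=12.34,56.78&z=15"
params = parse_qs(urlparse(url).query)
lat, lng = params["ll"][0].split(",")
```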

Scraped data not working in pandas

試著忘記壹切 submitted on 2019-12-11 02:35:31
Question: Why is it that when I enter data manually into Excel, pandas works, yet when I scrape the data and put it into a CSV, it gives me:

zz = df1.WE=np.where(df3.AL.isin(df1.EW),df1.WE,np.nan)
ValueError: operands could not be broadcast together with shapes (148,) (537,) ()

It has not occurred for other sites. Am I missing something obvious here? Is the Excel file formatted incorrectly, or is the data different here somehow?

df3
df3 = pd.DataFrame(columns=['DAT', 'G', 'TN', 'O1', 'L1', 'TN2', 'O2', 'L2', 'D', 'AJ',
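The error says the two np.where arguments have different lengths: the condition df3.AL.isin(df1.EW) has df3's length (537) while df1.WE has 148, and NumPy cannot broadcast those shapes against each other. A shape-safe alternative is to build a key-to-value lookup from df1 and map it over df3's key column; unmatched rows become NaN with no shape mismatch. Toy frames below stand in for the real data. (Separately, scraped strings often carry stray whitespace, so an isin that matched with hand-typed Excel data can silently fail until both key columns are passed through .str.strip().)

```python
import numpy as np
import pandas as pd

# Toy stand-ins: df1 has 3 rows, df3 has 5, mimicking the 148-vs-537 mismatch.
df1 = pd.DataFrame({"EW": ["a", "b", "c"], "WE": [1, 2, 3]})
df3 = pd.DataFrame({"AL": ["a", "x", "c", "y", "b"]})

# Build a key -> value lookup, then map it over df3's key column;
# keys absent from df1 become NaN, and the result always has df3's length.
lookup = df1.set_index("EW")["WE"]
df3["WE"] = df3["AL"].map(lookup)
```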

Scraping landing pages of a list of domains [closed]

最后都变了- submitted on 2019-12-11 02:05:25
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 7 years ago. I have a reasonably long list of websites whose landing (index.html or equivalent) pages I want to download. I am currently using Scrapy (much love to the guys behind it; this is a fabulous framework).
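For this kind of job, a Scrapy spider mostly just needs the domain list normalized into start_urls. A sketch of that step (to_start_urls is a made-up helper, not part of Scrapy):

```python
def to_start_urls(domains):
    """Normalize bare domain names into URLs suitable for a Scrapy
    spider's start_urls list (or any other downloader)."""
    urls = []
    for d in domains:
        d = d.strip().rstrip("/")
        if not d.startswith(("http://", "https://")):
            d = "http://" + d  # scheme assumed when the list omits it
        urls.append(d + "/")
    return urls

start_urls = to_start_urls(["example.com", "https://example.org/"])
```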

Beautifulsoup - scrape webpage - dynamically loading page

邮差的信 submitted on 2019-12-11 01:58:11
Question: I want to scrape this webpage: https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687 I need the store name, telephone number, and address of every listing, but I can only get the first 10, because to load the other items you need to scroll the webpage. My code:

import requests
import bs4

crawl_url = requests.get('https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687',
                         headers={'User-Agent': 'Mozilla/5.0'})
crawl_url.raise_for_status()
soup = bs4.BeautifulSoup(crawl
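Infinite-scroll pages usually fetch the later results from a paginated XHR endpoint that the JavaScript calls as you scroll; the browser dev-tools Network tab shows the real URL pattern, which you can then request page by page (the alternative is driving a real browser with Selenium and scrolling programmatically). A sketch, assuming a hypothetical ?page=N parameter that would have to be replaced with the site's actual one:

```python
def page_urls(base_url, pages):
    """Hypothetical pagination helper: generate per-page URLs for a site
    whose scroll-loading endpoint takes a page number. The '?page=N'
    parameter is an assumption, not justdial's real API."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]

urls = page_urls(
    "https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687",
    3,
)
```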

Scraping a plain text file with no HTML?

橙三吉。 submitted on 2019-12-11 01:29:22
Question: I have the following data in a plain text file:

1. Value
Location : Value
Owner: Value
Architect: Value
2. Value
Location : Value
Owner: Value
Architect: Value

... up to 200+ ...

The numbering and the word Value change for each segment. Now I need to insert this data into a MySQL database. Do you have a suggestion on how I can traverse and scrape it, so I can get the value of the text beside the number, and the values of "location", "owner", and "architect"? Seems hard to do with DOM scraping
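Since this is plain text rather than HTML, no DOM scraper is needed: split the file on the leading "N." markers, take the first line of each segment as the name, and read the labelled fields from the remaining lines. A sketch with made-up sample values; the resulting dicts can then be fed to parameterized MySQL INSERTs:

```python
import re

text = """1. Alpha Building
Location : Metro City
Owner: J. Smith
Architect: K. Jones
2. Beta Tower
Location : Harbor Town
Owner: A. Brown
Architect: B. Green
"""

records = []
# Split on the "N." record markers at the start of a line.
for seg in re.split(r"(?m)^\d+\.\s*", text):
    if not seg.strip():
        continue  # skip the empty piece before the first marker
    lines = seg.strip().splitlines()
    rec = {"name": lines[0].strip()}
    for line in lines[1:]:
        key, _, val = line.partition(":")
        rec[key.strip().lower()] = val.strip()
    records.append(rec)
```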