screen-scraping

Submitting queries to, and scraping results from, aspx pages using Python?

心已入冬 submitted on 2019-12-11 05:27:54
Question: I am trying to get results for a batch of queries to this demographics tool page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx The form's POST action calls the same page (_self) and is probably posting some event data. I read in another post here on Stack Overflow that aspx pages typically need some viewstate and validation data. Do I simply save these from a request and re-send them in a POST request? Or is there a cleaner way to do this? One of those aspx viewstate parameters
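Echoing the tokens back is indeed the usual approach: GET the page once, pull the hidden __VIEWSTATE and __EVENTVALIDATION inputs out of the HTML, and include them in the POST body alongside your own form fields. A minimal stdlib sketch of the extraction step (the sample form and token values below are made up; real tokens are much longer):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects <input type="hidden"> name/value pairs from a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if a.get("type") == "hidden" and "name" in a:
                self.fields[a["name"]] = a.get("value", "")

def extract_hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

sample = ('<form><input type="hidden" name="__VIEWSTATE" value="abc"/>'
          '<input type="hidden" name="__EVENTVALIDATION" value="xyz"/></form>')
fields = extract_hidden_fields(sample)
# merge your query parameters into `fields` and send them in the next POST
```

The same extraction must be repeated on every response, since ASP.NET rotates the viewstate between requests.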

Sockets receive hangs

喜欢而已 submitted on 2019-12-11 04:18:34
Question: I am trying to download the search pages of Bing, Google, Yahoo, and Ask using sockets; I have decided to use sockets instead of WebClient. socket.Receive() hangs after a few loops for Bing, Yahoo, and Google, but works for Ask. For Google, the loop will receive 4-5 times, then freeze on the call. I am not able to figure out why.

public string Get(string url)
{
    Uri requestedUri = new Uri(url);
    string fulladdress = requestedUri.Host;
    IPHostEntry entry = Dns.GetHostEntry(fulladdress);
    StringBuilder sb =
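A likely cause, assuming the request is HTTP/1.1 without a Connection header: HTTP/1.1 defaults to keep-alive, so after the body is sent the server leaves the socket open, and a receive loop that waits for the connection to close blocks forever (some servers, apparently Ask here, close it anyway). Sending "Connection: close" makes the server shut the socket when the response is complete, so the final receive returns 0 instead of hanging. A sketch of the request, shown in Python:

```python
def build_get_request(host, path="/"):
    # "Connection: close" tells the server to close the socket after the
    # response, so a blocking receive loop terminates instead of hanging
    # on a kept-alive connection.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

request = build_get_request("www.bing.com", "/search?q=test")
# send `request` over the socket, then recv() in a loop until it returns 0 bytes
```

The alternative is to parse the Content-Length header (or chunked encoding) and stop receiving once the body is complete, which is what WebClient does for you.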

HTML Agility Pack or HTML Screen Scraping libraries for Java, Ruby, Python?

Deadly submitted on 2019-12-11 04:16:42
Question: I found the HTML Agility Pack useful and easy to use for screen scraping web sites. What's the equivalent library for HTML screen scraping in Java, Ruby, or Python?

Answer 1: Found what I was looking for: Options for HTML scraping?

Answer 2: BeautifulSoup is the standard Python screen-scraping tool. Recently, however, I used the (currently incomplete) pyQuery, which is more or less a rewrite of jQuery in Python, and found it very useful.

Source: https://stackoverflow.com/questions/1060484/html
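For reference, the BeautifulSoup idiom that corresponds to the Agility Pack's node selection looks like this (a tiny made-up document; requires the third-party beautifulsoup4 package):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A minimal, invented document; the point is the find_all / get_text idiom.
html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = BeautifulSoup(html, "html.parser")
texts = [li.get_text() for li in soup.find_all("li", class_="item")]
```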

curl not working for getting a web page content, why?

喜你入骨 submitted on 2019-12-11 03:58:22
Question: I am using a curl script to go to a link and get its content for further manipulation. The following are the link and the curl script:

<?php
$url = 'http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi?serviceName=WebNSOR&templateName=detail.htm&requestingHandler=WebNSORDetailHandler&ID=368343543';
// curl script to get content of the given url
$ch = curl_init();
// set the target url
curl_setopt($ch, CURLOPT_URL, $url);
// request as if Firefox
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("User
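Common reasons a fetch like this comes back empty: the server rejects requests without a browser-like User-Agent, it sets cookies or redirects that aren't being followed (CURLOPT_FOLLOWLOCATION), or CURLOPT_RETURNTRANSFER is unset so curl_exec() echoes instead of returning the page. The browser-masquerade part, sketched with Python's stdlib (the actual fetch is commented out since it needs network access):

```python
import urllib.request

url = ("http://criminaljustice.state.ny.us/cgi/internet/nsor/fortecgi"
       "?serviceName=WebNSOR&templateName=detail.htm"
       "&requestingHandler=WebNSORDetailHandler&ID=368343543")

# Some servers return an empty or error page unless the request carries a
# browser-like User-Agent header.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
# html = urllib.request.urlopen(req, timeout=10).read()  # real fetch, needs network
```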

Cannot seem to scrape a div class tag in Node.js

我的未来我决定 submitted on 2019-12-11 03:34:30
Question: I'm new to Node.js. My experience is in Java and VBA. I'm trying to scrape a website for a friend, and all is going well until I can't get what I'm after:

<div class="gwt-Label ADC2X2-c-q ADC2X2-b-nb ADC2X2-b-Zb">Phone: +4576 102900</div>

That tag just has text, no attributes or anything. Yet I cannot scrape it using cheerio.

if (!err && resp.statusCode == 200) {
    var $ = cheerio.load(body);
    var number = $('//tried everything here!').text();
    console.log(number);

This function I also played
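In cheerio the selector only needs one of the space-separated classes, e.g. $('div.gwt-Label') — the obfuscated ADC2X2-* names can be ignored. If that still returns nothing, note that "gwt-Label" suggests a GWT application, which renders its content with JavaScript after page load, so the div may simply not exist in the raw body cheerio receives; a headless browser would then be required. The same single-class match, shown in Python with BeautifulSoup (assuming the div is present in the fetched HTML):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = ('<div class="gwt-Label ADC2X2-c-q ADC2X2-b-nb ADC2X2-b-Zb">'
        'Phone: +4576 102900</div>')
soup = BeautifulSoup(html, "html.parser")
# class_ matches against each of the element's space-separated classes,
# so one stable class name is enough to find the tag.
text = soup.find("div", class_="gwt-Label").get_text()
phone = text.replace("Phone:", "").strip()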

How to Obtain Lat/Lng of Google Maps Query String?

大憨熊 submitted on 2019-12-11 03:15:06
Question: I have this link: http://www.google.com/maps?cid=0,0,612446611849848549&f=q&source=embed&hl=en&geocode=&q=Универзална+Сала+&sll=,&&ie=UTF8&hq=&hnear=Универзална+Сала+&ll=,&z=15&iwloc=near What I want is to retrieve the lat/lng of the pinpointed place. I have already tried the Geocoding API: http://maps.googleapis.com/maps/api/geocode/xml?q=Универзална+Сала+&sensor=false but I am getting no results because the pinpoint refers to a place and not an address. How do I obtain the lat/lng of
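Note that in this particular link the ll= parameter is empty, so there are no coordinates in the URL itself to extract; the place is identified only by the cid, which would have to be resolved through a places lookup. When a maps URL does carry coordinates, they can be read straight from the query string. A sketch with a hypothetical URL (the coordinates below are invented for illustration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL with ll filled in; the asker's link has "ll=," (empty),
# so this parsing step only helps once the link actually carries coordinates.
url = "http://www.google.com/maps?f=q&q=Some+Place&ll=12.34,56.78&z=15"
params = parse_qs(urlparse(url).query)
lat, lng = params["ll"][0].split(",")
```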

Scraped data not working in pandas

試著忘記壹切 submitted on 2019-12-11 02:35:31
Question: Why is it that when I enter data manually into Excel, pandas works, yet when I scrape the data and put it into a CSV, it gives me:

zz = df1.WE=np.where(df3.AL.isin(df1.EW),df1.WE,np.nan)
ValueError: operands could not be broadcast together with shapes (148,) (537,) ()

It has not occurred for other sites. Am I missing something obvious here? Is the Excel file formatted incorrectly, or is the data different here somehow?

df3
df3 = pd.DataFrame(columns=['DAT', 'G', 'TN', 'O1', 'L1', 'TN2', 'O2', 'L2', 'D', 'AJ',
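The error says the two np.where arguments have different lengths: the condition df3.AL.isin(df1.EW) has df3's length (537) while df1.WE has 148, and NumPy cannot broadcast those shapes against each other. A shape-safe alternative is to build a key-to-value lookup from df1 and map it over df3's key column; unmatched rows become NaN with no shape mismatch. Toy frames below stand in for the real data. (Separately, scraped strings often carry stray whitespace, so an isin that matched with hand-typed Excel data can silently fail until both key columns are passed through .str.strip().)

```python
import numpy as np
import pandas as pd

# Toy stand-ins: df1 has 3 rows, df3 has 5, mimicking the 148-vs-537 mismatch.
df1 = pd.DataFrame({"EW": ["a", "b", "c"], "WE": [1, 2, 3]})
df3 = pd.DataFrame({"AL": ["a", "x", "c", "y", "b"]})

# Build a key -> value lookup, then map it over df3's key column;
# keys absent from df1 become NaN, and the result always has df3's length.
lookup = df1.set_index("EW")["WE"]
df3["WE"] = df3["AL"].map(lookup)
```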

Scraping landing pages of a list of domains [closed]

最后都变了- submitted on 2019-12-11 02:05:25
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 7 years ago. I have a reasonably long list of websites whose landing (index.html or equivalent) pages I want to download. I am currently using Scrapy (much love to the guys behind it; this is a fabulous framework).
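For this kind of job, a Scrapy spider mostly just needs the domain list normalized into start_urls. A sketch of that step (to_start_urls is a made-up helper, not part of Scrapy):

```python
def to_start_urls(domains):
    """Normalize bare domain names into URLs suitable for a Scrapy
    spider's start_urls list (or any other downloader)."""
    urls = []
    for d in domains:
        d = d.strip().rstrip("/")
        if not d.startswith(("http://", "https://")):
            d = "http://" + d  # scheme assumed when the list omits it
        urls.append(d + "/")
    return urls

start_urls = to_start_urls(["example.com", "https://example.org/"])
```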

Beautifulsoup - scrape webpage - dynamically loading page

邮差的信 submitted on 2019-12-11 01:58:11
Question: I want to scrape this webpage: https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687 I need the store name, telephone number, and address of every listing, but I can only get the first 10, because to load the other items you need to scroll the webpage. My code:

import requests
import bs4

crawl_url = requests.get('https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687',
                         headers={'User-Agent': 'Mozilla/5.0'})
crawl_url.raise_for_status()
soup = bs4.BeautifulSoup(crawl
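Infinite-scroll pages usually fetch the later results from a paginated XHR endpoint that the JavaScript calls as you scroll; the browser dev-tools Network tab shows the real URL pattern, which you can then request page by page (the alternative is driving a real browser with Selenium and scrolling programmatically). A sketch, assuming a hypothetical ?page=N parameter that would have to be replaced with the site's actual one:

```python
def page_urls(base_url, pages):
    """Hypothetical pagination helper: generate per-page URLs for a site
    whose scroll-loading endpoint takes a page number. The '?page=N'
    parameter is an assumption, not justdial's real API."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]

urls = page_urls(
    "https://www.justdial.com/Mumbai/Dairy-Product-Retailers-in-Thane/nct-10152687",
    3,
)
```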

Scraping a plain text file with no HTML?

橙三吉。 submitted on 2019-12-11 01:29:22
Question: I have the following data in a plain text file:

1. Value
Location : Value
Owner: Value
Architect: Value
2. Value
Location : Value
Owner: Value
Architect: Value

... up to 200+ ...

The numbering and the word Value change for each segment. Now I need to insert this data into a MySQL database. Do you have a suggestion on how I can traverse and scrape it, so I can get the value of the text beside the number, and the values of "location", "owner", and "architect"? Seems hard to do with DOM scraping
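Since this is plain text rather than HTML, no DOM scraper is needed: split the file on the leading "N." markers, take the first line of each segment as the name, and read the labelled fields from the remaining lines. A sketch with made-up sample values; the resulting dicts can then be fed to parameterized MySQL INSERTs:

```python
import re

text = """1. Alpha Building
Location : Metro City
Owner: J. Smith
Architect: K. Jones
2. Beta Tower
Location : Harbor Town
Owner: A. Brown
Architect: B. Green
"""

records = []
# Split on the "N." record markers at the start of a line.
for seg in re.split(r"(?m)^\d+\.\s*", text):
    if not seg.strip():
        continue  # skip the empty piece before the first marker
    lines = seg.strip().splitlines()
    rec = {"name": lines[0].strip()}
    for line in lines[1:]:
        key, _, val = line.partition(":")
        rec[key.strip().lower()] = val.strip()
    records.append(rec)
```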