screen-scraping | 易学教程

web scraping to fill out (and retrieve) search forms?

Question: I was wondering whether it is possible to automate the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers). To do this manually, I would go to the journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, and pick out the DOI and paste that
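
The workflow described in this question can be sketched in Python. This is only an illustration under stated assumptions: the query-string field names (`author`, `title`) are hypothetical and would have to match whatever the real search form expects, the sample HTML and the DOI in it are made up, and the DOI pattern is a common heuristic rather than a complete validator.

```python
import re
from urllib.parse import urlencode

# Heuristic DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def build_search_url(base_url, **fields):
    """Append form fields to base_url as a query string (field names are hypothetical)."""
    return base_url + "?" + urlencode(fields)


def extract_dois(html):
    """Return the unique DOIs found in html, in order of first appearance."""
    seen = []
    for match in DOI_PATTERN.findall(html):
        doi = match.rstrip(".,;)")  # drop trailing punctuation picked up by the pattern
        if doi not in seen:
            seen.append(doi)
    return seen


# Made-up results-page fragment, used here instead of a live request.
sample_html = (
    '<a href="https://doi.org/10.1021/ja00000a001">Article</a> '
    '<span>DOI: 10.1021/ja00000a001</span>'
)
print(extract_dois(sample_html))
```

In practice the HTML would come from fetching the URL returned by `build_search_url`, and many search pages require a POST or JavaScript, in which case this simple approach would not be enough.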

Screen Scraping HTML with C# [closed]

Question: I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full HTML file with header and footer

How do you login to a webpage and retrieve its content in C#?

Question: How do you log in to a webpage and retrieve its content in C#?

Answer 1: That depends on what's required to log in. You could use a WebClient to send the login credentials to the server's login page (via whatever method is required, GET or POST), but by default that wouldn't persist a cookie. There is a way to get a WebClient to handle cookies, so you could POST the login info to the server, request the page you want with the same WebClient, and then do whatever you want with the page.

Answer 2: Look at
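
The cookie-persisting flow from the first answer (POST the credentials, then request the page with the same client) can be sketched as follows. Note that this sketch is in Python, not C#, and uses the standard library's `urllib` with a `CookieJar` in place of `WebClient`. The function names and the credential field names are hypothetical; a real login form may need different fields or extra tokens.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Build an opener that stores cookies from responses and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_and_fetch(opener, login_url, credentials, page_url):
    """POST credentials to login_url, then GET page_url with the same cookies."""
    body = urllib.parse.urlencode(credentials).encode("utf-8")
    # Supplying data makes urllib send a POST request.
    with opener.open(login_url, data=body) as response:
        response.read()
    # The session cookie set by the login response is sent automatically here.
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call might look like `login_and_fetch(opener, "https://example.com/login", {"user": "me", "password": "secret"}, "https://example.com/private")`, where the URLs and field names are placeholders.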

Can I scrape flash?

Question: I'd like to scrape a website to programmatically collect any external links within any Flash elements on the page. I'd also like to collect any other text, if possible, but the links are the important part. Is this possible? A freeware library or service that accomplishes this would be preferable, but if none exists, how can I accomplish the task on my own? Is it possible to get the source code and pull from that?

Answer 1: Decompiling the Flash source would let you see the ActionScript part of the

Indian Railway Train Search API [closed]

Question: Is there any API provided by Indian Railways to search its train network, timetables, etc.? There are many sites out there that show timetables and similar data. I searched Google but couldn't find any info on web services or APIs provided by Indian Railways. Is data scraping the only way?

Answer 1: You need to be a big shot to use their

Screen-scraping a Windows application in C#

Question: I need to scrape data from a Windows application to run a query in another program. Does anyone know of a good starting point for doing this in .NET?

Answer 1: Check out ManagedSpy; source code is provided.

Answer 2: You may want to look into the WM_GETTEXT message. This can be used to read text from other windows. It's an archaic part of the Windows API, and if you're in C#, you'll need to use P/Invoke for it. Check out this page for an example of doing this in C#. Basically, you first

How to run multiple Tor processes at once with different exit IPs?

Question: I am brand new to Tor, and I think I need to run several Tor processes. By that I mean not just multiple instances, but instances that each use a different proxy port (as is done at http://www.howtoforge.com/ultimate-security-proxy-with-tor). I am trying to get started with 4 Tor instances. However, the tutorial applies only to Arch Linux, and I am using a headless 64-bit Ubuntu EC2 instance. Working through the differences between Arch and Ubuntu is really a pain. And here I
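
One way to picture "a different proxy port for each instance" is to generate a separate torrc per instance, each with its own `SocksPort`, `ControlPort`, and `DataDirectory`. The sketch below only builds the configuration text. The port numbers, the directory layout, and the function name are illustrative assumptions, and separate instances build separate circuits but are not guaranteed to end up with distinct exit IPs.

```python
def build_torrc(index, socks_base=9060, control_base=9160,
                data_root="/tmp/tor-instances"):
    """Return torrc text for one Tor instance; all ports and paths are illustrative."""
    lines = [
        f"SocksPort {socks_base + index}",
        f"ControlPort {control_base + index}",
        # Each running instance needs its own data directory.
        f"DataDirectory {data_root}/tor{index}",
    ]
    return "\n".join(lines) + "\n"


# Configuration text for the 4 instances mentioned in the question.
configs = [build_torrc(i) for i in range(4)]
print(configs[0])
```

Each text could be written to its own file and passed to a Tor process with `tor -f <file>`; a client would then point at the matching SOCKS port.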

Parse a .Net Page with Postbacks

Question: I need to read data from an online database that's displayed using an aspx page from the UN. I've done HTML parsing before, but it was always by manipulating query-string values. In this case, the site uses ASP.NET postbacks. So, you click on a value in box one, then box two appears; you click on a value in box two and click a button to get your results. Does anybody know how I could automate that process? Thanks, Mike

Answer 1: You may still only need to send one request, but that one request can be
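
For context, an ASP.NET WebForms postback is an ordinary form POST that carries hidden state fields such as `__VIEWSTATE` and `__EVENTVALIDATION`, plus `__EVENTTARGET` naming the control that triggered it. A minimal Python sketch of assembling such a request body from a fetched page is shown below. The control name `ddlCountry`, the sample HTML, and the field values are hypothetical, and the real page may need additional fields.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(html, event_target, extra_fields=None):
    """Return a URL-encoded POST body that echoes the page's hidden state."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields["__EVENTTARGET"] = event_target  # control that "caused" the postback
    fields.setdefault("__EVENTARGUMENT", "")
    fields.update(extra_fields or {})
    return urlencode(fields)


# Made-up page fragment standing in for the HTML of the first response.
sample_page = """
<form method="post" action="Default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgL" />
  <select name="ddlCountry"><option value="1">A</option></select>
</form>
"""
body = build_postback_body(sample_page, "ddlCountry", {"ddlCountry": "1"})
print(body)
```

The resulting body would be POSTed back to the page's own URL, and the hidden fields would be re-read from each response before the next step.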

Get instagram followers

Question: I want to parse a website's follower count with BeautifulSoup. This is what I have so far:

    import requests
    from bs4 import BeautifulSoup

    username_extract = 'lazada_my'
    url = 'https://www.instagram.com/' + username_extract
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    f = soup.find('head', attrs={'class': 'count'})

This is the part I want to parse: Something within my soup.find() call is wrong, but I can't wrap my head around it. When I return f, it is empty. Any idea what I am doing wrong?

Answer 1: I think you can use re
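
The answer's suggestion to use `re` is cut off, so the following is only a guess at the general idea, not the answerer's actual code. It assumes the follower count appears in the page text in a form like `1.2m Followers` (for example inside a description meta tag); both that assumption and the sample HTML are hypothetical, and the real markup may differ or require a logged-in session.

```python
import re

# Matches text such as "1.2m Followers" or "12,345 Followers".
FOLLOWERS_PATTERN = re.compile(r'([\d.,]+[kKmM]?)\s+Followers')


def extract_followers(html):
    """Return the follower count text (e.g. '1.2m'), or None if not found."""
    match = FOLLOWERS_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up fragment standing in for the downloaded page source.
sample_html = (
    '<meta property="og:description" '
    'content="1.2m Followers, 34 Following, 567 Posts" />'
)
print(extract_followers(sample_html))
```

With the question's code, `r.text` would be passed to `extract_followers` in place of `sample_html`.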

Why does scrapy throw an error for me when trying to spider and parse a site?

Question: The following code

    class SiteSpider(BaseSpider):
        name = "some_site.com"
        allowed_domains = ["some_site.com"]
        start_urls = [
            "some_site.com/something/another/PRODUCT-CATEGORY1_10652_-1__85667",
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-CATEGORY_(.*)', ))),
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-DETAIL(.*)', )), callback="parse_item"),
        )