scrape | 易学教程

PHP Curl following redirects

Question: I'm trying to be a bit sneaky and, as part of a learning process, improve my page-scraping skills. One thing I've come across that I have yet to solve is that certain sites use an internal link which then redirects to an external link. What I want to do is modify some curl code to follow the redirects until they stop and then obtain the final URL. Can anyone recommend some code for me? I have this at the moment, but it's not following the redirects properly at
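The idea of following a redirect chain to its final URL can be sketched as below. This is a minimal illustration in Python rather than the asker's PHP (in PHP cURL, the `CURLOPT_FOLLOWLOCATION` option and `curl_getinfo` with `CURLINFO_EFFECTIVE_URL` cover the same ground). The names `fetch_location` and `resolve_final_url` are hypothetical helpers, and the sketch only handles HTTP redirects, not ones done by JavaScript or meta refresh.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_location(url):
    """Send one HEAD request without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve_final_url(url, fetch=fetch_location, max_hops=10):
    """Follow redirects hop by hop and return the last URL reached."""
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # not a redirect: this is the final URL
        # Location may be relative, so resolve it against the current URL
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

The hop limit guards against redirect loops, and the `fetch` parameter lets the chain-following logic be exercised without network access.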

Python web scraping for javascript generated content

Question: I am trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc., but can't get the text inside the box.

    url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
    import urllib.request
    from bs4 import BeautifulSoup
    text = BeautifulSoup(urllib.request.urlopen(url).read())
    print(text)

Can anyone suggest a way of
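One detail worth seeing in code: the part of that URL after `#` is a fragment, which HTTP clients do not send to the server, so a plain `urlopen` only ever requests the base page and the citation has to be produced by JavaScript in a browser. The sketch below just splits the URL to show this; `doi_from_fragment` is a hypothetical helper, and the `/doi/` prefix is taken from the URL in the question.

```python
from urllib.parse import urldefrag


def split_fragment(url):
    """Return (url_sent_to_server, fragment_kept_by_the_client)."""
    base, fragment = urldefrag(url)
    return base, fragment


def doi_from_fragment(url, prefix="/doi/"):
    """Extract the DOI from a '#/doi/<DOI>' style fragment, or None."""
    _, fragment = urldefrag(url)
    if fragment.startswith(prefix):
        return fragment[len(prefix):]
    return None


url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
base, fragment = split_fragment(url)
print(base)                    # the only part a server would see
print(doi_from_fragment(url))  # the DOI lives entirely in the fragment
```

With the DOI in hand, the usual options are to drive a real browser that executes the page's JavaScript, or to find the HTTP endpoint the page itself calls.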

Parse Web Site HTML with JAVA [duplicate]

Question: This question already has answers here: Which HTML Parser is the best? [closed] (3 answers). Closed 3 years ago. I want to parse a simple web site and scrape information from it. I used to parse XML files with DocumentBuilderFactory; I tried to do the same thing for the HTML file, but it always gets into an infinite loop.

    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();
    InputStreamReader input = new InputStreamReader(uc.getInputStream());

PHP how to set color to certain keywords (text) in scraped data

Question: I'm trying to do something a bit tricky: set a color for given keywords in echoed output that is gathered from web scraping. I was given an answer once, but was unable to get it to actually change any colors. Here's the code I'm working with.

    <?php
    $doc = new DOMDocument;
    // djia/nas/sp current values
    $doc->preserveWhiteSpace = false;
    // Most HTML Developers are chimps and produce invalid markup...
    $doc->strictErrorChecking = false;
    $doc->recover = true;
    $doc->loadHTMLFile('http://www.nbcnews.com
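As a rough illustration of the keyword-coloring step, here is a minimal sketch in Python rather than PHP. It assumes the scraped text is plain text that will be written into an HTML page. The function name `highlight` and the keyword-to-color mapping are hypothetical.

```python
import html
import re


def highlight(text, colors):
    """Wrap each keyword found in `text` in a colored <span>.

    `colors` maps keyword -> CSS color. Matching is case-insensitive and
    also matches inside longer words.
    """
    lookup = {keyword.lower(): color for keyword, color in colors.items()}
    if not lookup:
        return html.escape(text)
    # Try longer keywords first so they win over their own prefixes
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives), re.IGNORECASE)

    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(html.escape(text[pos:match.start()]))  # text before the keyword
        color = lookup[match.group(0).lower()]
        out.append('<span style="color:%s">%s</span>' % (color, html.escape(match.group(0))))
        pos = match.end()
    out.append(html.escape(text[pos:]))  # remaining tail
    return "".join(out)


print(highlight("DJIA up, Nasdaq down", {"DJIA": "green", "Nasdaq": "red"}))
```

Escaping the surrounding text separately from the inserted `<span>` tags keeps the scraped content from being interpreted as markup.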

VBA to change dropdown value in Internet Explorer

Question: I am looking to automate Internet Explorer using Excel VBA to extract football results from a website, and am really struggling with getting the data to update when I change the dropdown value. The website is: http://www.whoscored.com/Regions/250/Tournaments/30/Seasons/3871/Stages/8209/Fixtures/Europe-UEFA-Europa-League-2013-2014 I am looking to change the value of the 'stages' dropdown and scrape the match results. My code works fine for opening IE, changing the value of the 'scrape' dropdown

Use Ruby Mechanize to scrape all successive pages

Question: I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data from each page. For example, I want to go to a specific site (Craigslist in the example below), scrape the data from the first page, go to the next page, scrape all relevant data, and so on until the very last page. In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the
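The "scrape, follow the next link, repeat until there is none" loop described here can be sketched independently of any scraping library. This is a Python illustration rather than Ruby Mechanize. It assumes a caller-supplied `fetch_page(url)` function (hypothetical) that returns the items on one page plus the URL of the next page, or `None` on the last page.

```python
def scrape_all_pages(first_url, fetch_page, max_pages=1000):
    """Collect items from every page, following 'next' links until they run out.

    fetch_page(url) must return (items_on_page, next_url_or_None).
    """
    results = []
    seen = set()
    url = first_url
    # Stop when there is no next page, a page repeats, or the cap is hit
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, url = fetch_page(url)  # `url` now points at the next page
        results.extend(items)
    return results


# Usage with a stand-in for real HTTP fetching (made-up page data):
pages = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": ([], None),
}
print(scrape_all_pages("page1", lambda u: pages[u]))
```

A common cause of the "only the first page is scraped" symptom is that the loop variable holding the current page is never reassigned to the next page; here that reassignment happens on every iteration.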

Using Regex to get multiple data on single line by scraping stocks from yahoo [closed]

Question: Closed. This question is off-topic and is not currently accepting answers. Closed 4 years ago.

    import urllib
    import re
    stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']
    for i in range(len(stocks_symbols)):
        htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
        htmltext = htmlfile.read(htmlfile)
        regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
        pattern = re
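The regex step, and the "multiple values on a single line" part of the title, can be sketched on already-downloaded HTML as follows. This assumes the page still uses the `yfs_l84_<symbol>` span id from the question, which may no longer match the live site. The function names and the sample values are hypothetical.

```python
import re


def extract_price(html_text, symbol):
    """Return the text inside <span id="yfs_l84_<symbol>">...</span>, or None."""
    pattern = re.compile(r'<span id="yfs_l84_' + re.escape(symbol) + r'">(.+?)</span>')
    match = pattern.search(html_text)
    return match.group(1) if match else None


def prices_line(pages):
    """Format 'symbol: price' pairs on one line; `pages` maps symbol -> HTML text."""
    return ", ".join(
        "%s: %s" % (symbol, extract_price(page, symbol))
        for symbol, page in pages.items()
    )


# Made-up sample markup standing in for downloaded pages
sample = {
    "aapl": '<div><span id="yfs_l84_aapl">100.00</span></div>',
    "goog": '<div><span id="yfs_l84_goog">200.50</span></div>',
}
print(prices_line(sample))
```

Building the whole line with `join` avoids the trailing-newline behaviour of printing once per loop iteration.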

I can't seem to be able to scrape data from a website that is constantly changing its prices using VBA in Excel

Question: I can't seem to find the ID when I inspect the source of the website "rofex.primary.ventures". All I want to do is grab all the data below the Ult column and put it into an Excel worksheet. I've used Firefox because it shows the HTML code in a nicer way, but I would like to scrape it from Chrome using an Excel macro. How would I do this?

    Sub Rofex()
        Dim appIE As Object
        Set appIE = CreateObject("internetexplorer.application")
        With appIE
            .Navigate "https://rofex.primary.ventures"
            .Visible = True

How to add a new column to Scrapy output from CSV?

Question: I parse websites and it works fine, but I need to add a new column with IDs to the output. The IDs are saved in a CSV file alongside the URLs:

    https://www.ceneo.pl/48523541, 1362
    https://www.ceneo.pl/46374217, 2457

Code of my spider:

    import scrapy
    from ceneo.items import CeneoItem
    import csv

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            start_urls = []
            f = open('urls.csv', 'r')
            for i in f:
                u = i.split(',')
                start_urls.append(u[0])
            for url in start_urls:
                yield scrapy.Request(url=url,
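The CSV side of this can be sketched on its own: keep the second column instead of discarding it, so each URL stays paired with its ID. This is a minimal sketch using only the standard library. `load_url_ids` is a hypothetical helper, and the two-column "URL, ID" layout is assumed from the sample rows in the question.

```python
import csv


def load_url_ids(lines):
    """Parse 'url, id' rows into a {url: id} dict.

    `lines` can be an open file or any iterable of strings.
    """
    mapping = {}
    # skipinitialspace drops the blank that follows each comma
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) >= 2:  # ignore blank or malformed rows
            mapping[row[0].strip()] = row[1].strip()
    return mapping


rows = [
    "https://www.ceneo.pl/48523541, 1362",
    "https://www.ceneo.pl/46374217, 2457",
]
for url, item_id in load_url_ids(rows).items():
    print(url, item_id)
```

In a Scrapy spider, each ID could then travel with its request, for example through the `meta` or `cb_kwargs` argument of `scrapy.Request`, and be copied into the item inside the callback.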

Check if page contains specific word

Question: How can I check if a page contains a specific word? Example: I want to return true or false depending on whether the page contains the word "candybar". Notice that "candybar" could sometimes be in between tags (candybar) and sometimes not. How do I accomplish this? Here is my code for "grabbing" the site (I just don't know how to check through the site):

    #!/usr/bin/perl -w
    use utf8;
    use RPC::XML;
    use RPC::XML::Client;
    use Data::Dumper;
    use Encode;
    use Time::HiRes qw(usleep);
    print "Content-type:text/html\n\n";
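One way to make the check independent of surrounding tags is to extract the page's text first and search that. The sketch below does this in Python rather than Perl, using only the standard library. It assumes the HTML has already been downloaded into a string, and `page_contains_word` is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and attributes.

    Note: the contents of <script> and <style> are also reported as text.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_contains_word(html_text, word):
    """Return True if `word` appears as a whole word in the page text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


print(page_contains_word("<p>A <b>candybar</b> here</p>", "candybar"))
print(page_contains_word('<a href="candybar.html">link</a>', "candybar"))
```

Searching the extracted text rather than the raw HTML means a match inside an attribute value, such as a file name in a link, does not count as the word being on the page.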