screen-scraping | 易学教程

Scraping Google Front Page Results with php

Question: I can already scrape the title and URL of each Google search result with PHP. How do I get the descriptions as well?

    $url = 'http://www.google.com/search?hl=en&safe=active&tbo=d&site=&source=hp&q=Beautiful+Bangladesh&oq=Beautiful+Bangladesh';
    $html = file_get_html($url);
    $linkObjs = $html->find('h3.r a');
    foreach ($linkObjs as $linkObj) {
        $title = trim($linkObj->plaintext);
        $link = trim($linkObj->href);
        // if it is not a direct link but a URL reference is found inside it, then extract
        if (!preg_match('/^https?/',
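The comment in the excerpt describes unwrapping an href that is not a direct link in order to recover the real result URL. A minimal sketch of that one step, written in Python rather than the question's PHP, and assuming the wrapped target sits in a `q` query parameter of an href such as `/url?q=...` (the parameter name and the helper name `resolve_result_link` are assumptions, not taken from the truncated code):

```python
from urllib.parse import urlparse, parse_qs

def resolve_result_link(href):
    """Return the direct URL for a search-result href, or None if none is found.

    Assumption: redirect-style hrefs carry the target in a `q` query parameter.
    """
    if href.startswith(("http://", "https://")):
        return href  # already a direct link
    params = parse_qs(urlparse(href).query)  # parse_qs also percent-decodes values
    for candidate in params.get("q", []):
        if candidate.startswith(("http://", "https://")):
            return candidate
    return None

print(resolve_result_link("/url?q=http://example.com/page%3Fa%3D1&sa=U"))
```

Returning `None` for hrefs that contain no recognizable target lets the caller skip non-result links instead of storing a relative path.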

Scraping multiple paginated links with BeautifulSoup and Requests

Question: Python beginner here. I'm trying to scrape all products from one category on dabs.com. I've managed to scrape all products on a given page, but I'm having trouble iterating over all the paginated links. So far I've tried to isolate the pagination buttons with the span class="page-list", but even that isn't working. Ideally, I would like the crawler to keep clicking "next" until it has scraped all products on all pages. How can I do this? I'd really appreciate any input. from bs4 import
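One way to approach "keep clicking next" is to follow the next-page link in a loop until a page no longer has one. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup and Requests so that it runs without extra packages. The markup (`h2 class="product"`, `a class="next"`), the `FAKE_SITE` pages, and the injected `fetch` callable are all hypothetical stand-ins, not the real dabs.com structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Collects product names and the 'next' link from one page (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.products = []
        self.next_href = None
        self._in_product = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "next" in classes:
            self.next_href = attrs.get("href")
        elif tag == "h2" and "product" in classes:
            self._in_product = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_product = False

    def handle_data(self, data):
        if self._in_product and data.strip():
            self.products.append(data.strip())

def scrape_all_pages(start_url, fetch, max_pages=100):
    """Follow 'next' links from start_url; fetch(url) must return the page HTML."""
    products, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pagination loops
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        products.extend(parser.products)
        # Resolve relative hrefs against the current page; stop when there is no link.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return products

# In-memory pages standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/cat?page=1":
        '<h2 class="product">Mouse</h2><a class="next" href="/cat?page=2">Next</a>',
    "http://example.test/cat?page=2":
        '<h2 class="product">Keyboard</h2>',
}

print(scrape_all_pages("http://example.test/cat?page=1", FAKE_SITE.__getitem__))
```

Against a live site, `fetch` could be any function that downloads a URL and returns its text; the loop itself stays the same.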

writing and saving CSV file from scraping data using python and Beautifulsoup4

Question: I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. I would then like to geocode this data, place it on a map, and keep a local copy on my computer. I used Python and Beautiful Soup 4 to extract the data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the script to
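Once each course has been extracted into a dictionary, the CSV step can be handled by the standard `csv` module. A minimal sketch, assuming the five columns named in the question; the `write_courses` helper and the sample row are made up for illustration, and the output goes to an in-memory buffer where a real script would open a file with `newline=""`:

```python
import csv
import io

# Column names taken from the fields listed in the question.
FIELDS = ["Name", "Address", "Ownership", "Website", "Phone number"]

def write_courses(rows, out):
    """Write course dictionaries to the file-like object `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({field: row.get(field, "") for field in FIELDS})

# Hypothetical sample data; no Website key, to show the blank-cell behaviour.
sample = [{
    "Name": "Example Golf Club",
    "Address": "1 Fairway Rd, Springfield",
    "Ownership": "Public",
    "Phone number": "555-0100",
}]

buf = io.StringIO(newline="")
write_courses(sample, buf)
print(buf.getvalue())
```

The `csv` module quotes the address automatically because it contains a comma, which is the usual reason hand-built comma-joined lines break when geocoding tools read them back.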

Extracting specific data from a web page using PHP [duplicate]

Question: (Closed 7 years ago as a possible duplicate of "HTML Scraping in Php".) I would like to know if there is any way, using PHP, to get a specific string of text from a webpage that is updated every now and then. I've searched "all over the internet" and have found nothing. I saw that preg_match could do it, but I didn't understand how to use it. Imagine that a webpage contains this: <div name="changeable_text">**GET THIS TEXT**</div> How can I do it using PHP,
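The general technique is to parse the HTML and read the text of the element that carries the known attribute, which tends to be less brittle than a regular expression. The sketch below shows the idea in Python with the standard `html.parser` module rather than in PHP; the class and function names are illustrative, and only the `name="changeable_text"` attribute comes from the question.

```python
from html.parser import HTMLParser

class DivTextExtractor(HTMLParser):
    """Collects the text inside every <div> whose name attribute matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.depth = 0      # > 0 while inside a matching div (counts nested divs)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and (self.depth or dict(attrs).get("name") == self.name):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_div_text(html, name):
    parser = DivTextExtractor(name)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

page = '<p>intro</p><div name="changeable_text">**GET THIS TEXT**</div>'
print(extract_div_text(page, "changeable_text"))
```

In PHP the same approach would go through a DOM parser instead of `preg_match`; that translation is left out here because the excerpt does not show which answer the question received.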

Get data from a facebook page wall or group wall for use on personal website

Question: I want to connect to a public Facebook page or group and list all entries from its wall on a personal website. I will use PHP on my server, so that would be the best solution for me, or else JavaScript. Could anyone explain how to do this, or perhaps give working code? Or just list all the steps necessary? If it's possible to handle information about the person, date, description ... for each post, that would be great, so that my layout could be customized. Thanks for helping me out here! Answer 1: You

Grabbing each frame of an HTML5 canvas

Question: These palette cycle images are breathtaking: http://www.effectgames.com/demos/canvascycle/?sound=0 I'd like to make some (or all) of these into desktop backgrounds. I could use an animated GIF version, but I have no idea how to get that from the canvas "animation". Is there anything available yet that can do something along these lines (specifically for that link, and generally speaking)? Answer 1: I have a solution, but it depends on you being familiar with the JavaScript Console in Firefox

Issue with html tags while scraping data using beautiful soup

Question: Common piece of code:

    # -*- coding: cp1252 -*-
    import csv
    import urllib2
    import sys
    import time
    from bs4 import BeautifulSoup
    from itertools import islice

    page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read()
    soup = BeautifulSoup(page)
    prices = soup.findAll('div', {"class": "price"})

After this I am trying the following snippets to get the data.

Code 1:

    for price in prices:
        print unicode(price.string).encode('utf8')

Output 1: No output, code runs without any

HtmlAgilityPack - Grab data from html table

Question: My program uses HtmlAgilityPack to grab an HTML web page and store it in a variable, and I'm trying to get from the HTML two tables that sit under specific div class tags (boardcontainer). My current code searches the whole web page for every table and displays them, but when a cell is empty it throws an exception: "NullReferenceException was unhandled - Object reference not set to an instance of an object.". A snippet of the HTML (In this case I'm searching 'Microsoft' on the

Screen scraping a mainframe screen in C# without 3rd-party utilities

Question: I'm looking to screen scrape a 3270 mainframe application in C#, but I've got to do so without Attachmate or other 3rd-party plugins. Are there free managed libraries to do so in C#? Answer 1: http://www.elink.ibmlink.ibm.com/publications/servlet/pbi.wss?CTY=US&FNC=SRX&PBL=GA23-0059-07 This is the document you are looking for if you plan on doing all of the heavy lifting yourself. It doesn't print out well, but it is the best source of information on the protocol. I am about to embark on this road

Python urllib2.open Connection reset by peer error

Question: I'm trying to scrape a page using Python. The problem is that I keep getting "Errno 54 Connection reset by peer". The error comes when I run this code: urllib2.urlopen("http://www.bkstr.com/webapp/wcs/stores/servlet/CourseMaterialsResultsView?catalogId=10001&categoryId=9604&storeId=10161&langId=-1&programId=562&termId=100020629&divisionDisplayName=Stanford&departmentDisplayName=ILAC&courseDisplayName=126&sectionDisplayName=01&demoKey=d&purpose=browse") This happens for all the URLs on this page. What
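The excerpt does not show what caused the reset, so the following is only a sketch of two common mitigations: sending a browser-like `User-Agent` header and retrying with a growing delay. It uses Python 3's `urllib.request` (the successor to the `urllib2` module in the question). The `fetch_with_retry` name, the header value, and the retry counts are assumptions, and whether they help with this particular site is untested.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, opener=urllib.request.urlopen, retries=3,
                     delay=1.0, sleep=time.sleep):
    """Fetch url, retrying only when the connection is reset by the peer."""
    request = urllib.request.Request(
        url,
        # Assumed header value; some servers drop requests with the default agent.
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"},
    )
    last_error = None
    for attempt in range(retries):
        try:
            with opener(request) as response:
                return response.read()
        except (ConnectionResetError, urllib.error.URLError) as exc:
            # urlopen may raise the reset directly or wrap it in URLError.reason.
            reason = getattr(exc, "reason", exc)
            if not isinstance(reason, ConnectionResetError):
                raise  # other failures (HTTP 404, DNS errors) are not retried
            last_error = exc
            if attempt < retries - 1:
                sleep(delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_error
```

The `opener` and `sleep` parameters exist so the function can be exercised without network access or real waiting; in normal use both defaults are left alone.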