screen-scraping

PHP equivalent of PyQuery or Nokogiri? [closed]

社会主义新天地 submitted on 2019-12-03 18:05:57
Question: Basically, I want to do some HTML screen scraping, but I'm trying to figure out whether it is possible in PHP. In Python, I would use PyQuery. In Ruby, I would use Nokogiri. Answer 1: In PHP you can use phpQuery. P.S. It's kind of ironic: I came to this page looking for a phpQuery equivalent in Python :) Answer 2: In PHP for screen scraping

How to use Goutte

限于喜欢 submitted on 2019-12-03 17:28:51
Question: Issue: I cannot fully understand the Goutte web scraper. Request: Can someone please help me understand, or provide code to help me better understand, how to use the Goutte web scraper? I have read over the README.md. I am looking for more information than it provides, such as: what options are available in Goutte and how to write those options, and when you are looking at forms, do you search for the name= or the id= of the form? Webpage layout being scraped: Step 1: The webpage
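On the name= versus id= question: Goutte delegates form handling to Symfony's DomCrawler, which fills fields by their name= attribute; the id= is only a CSS/JavaScript hook and is not what gets submitted. A quick, language-neutral way to see which names a form actually exposes is to list its inputs. A minimal Python stdlib sketch (the login form below is invented for illustration):

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collect the name= attribute of every input/select/textarea."""
    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:                 # fields without a name are never submitted
                self.fields.append(name)

html = """
<form id="login-form" action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""
parser = FormFieldLister()
parser.feed(html)
print(parser.fields)  # ['username', 'password']
```

These are the keys you would pass when filling the form in Goutte; the `id="login-form"` attribute plays no part in the submission.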

Using Ruby with Mechanize to log into a website

匆匆过客 submitted on 2019-12-03 17:03:09
I need to scrape data from a site, but it requires my login first. I've been using hpricot to successfully scrape other sites, but I'm new to using mechanize, and I'm truly baffled by how to work it. I see this example commonly quoted:

```ruby
require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw = ARGV[1]
  end.click_button

  my_page.links.each do |link|
```

Scrape HTML tables from a given URL into CSV

不问归期 submitted on 2019-12-03 16:19:08
I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n]. If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified, or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV. Potential additional features: to be really fancy you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill. An option to asciify any Unicode. An option to apply an
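The core of such a tool is small. A minimal Python stdlib sketch of the table-to-CSV step (row extraction via `html.parser`, output via the `csv` module; the sample table is invented, and a real tool would add URL fetching and the table-selection logic described above):

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect every <tr>'s <td>/<th> cells as a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], None
        self.in_cell, self.cell = False, ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell, self.cell = True, ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(self.cell.strip())
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.in_cell:
            self.cell += data

html = ("<table><tr><th>Name</th><th>Pop</th></tr>"
        "<tr><td>Oslo</td><td>700000</td></tr></table>")
p = TableToCSV()
p.feed(html)

out = io.StringIO()           # stand-in for sys.stdout
csv.writer(out).writerows(p.rows)
print(out.getvalue())
```

This deliberately ignores nested tables, matching the "that's overkill" remark; handling them would require tracking table depth in the parser.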

Why is python decode replacing more than the invalid bytes from an encoded string?

北战南征 submitted on 2019-12-03 14:38:34
Question: Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox, and Chrome. The invalid encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

```python
>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
```

UPDATE: This question concluded in a bug report to Python's unicode component. The issue is reported to be fixed in Python 2.7.11 and 3.5.2. What
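The behaviour the UPDATE refers to can be checked directly. On a fixed Python 3 (3.5.2 or later), the decoder follows the "maximal subpart" recommendation that browsers also implement: the truncated two-byte start \xe3\xab is consumed as one invalid unit, so 'replace' emits a single U+FFFD rather than also swallowing the 'S'. A minimal check (note the fragment must be a bytes literal in Python 3):

```python
fragment = b'PREFIX\xe3\xabSUFFIX'

# \xe3 opens a 3-byte UTF-8 sequence and \xab is a valid continuation,
# but 'S' (0x53) is not -- so the sequence is never completed.
try:
    fragment.decode('utf-8', 'strict')
except UnicodeDecodeError as e:
    print(e.reason, e.start)   # the error now starts at byte offset 6

# 'replace' substitutes one U+FFFD for the maximal invalid subpart
# (\xe3\xab) and resumes decoding at 'S'.
print(fragment.decode('utf-8', 'replace') == 'PREFIX\ufffdSUFFIX')  # True
```

On the pre-fix versions quoted in the question, the same call consumed three bytes (positions 6-8, including the 'S'), which is why Python's output disagreed with the browsers'.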

How can I Programmatically perform a search without using an API?

╄→гoц情女王★ submitted on 2019-12-03 13:13:18
I would like to create a program that will enter a string into the text box on a site like Google (without using their public API), then submit the form and grab the results. Is this possible? Grabbing the results will require HTML scraping, I assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters? Thanks

Theory: What I would do is create a little program that can automatically submit any form data to any place and come back
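The "query strings/parameters" part is indeed the key: a search box is usually just a GET or POST form, so "typing into it" means reproducing the request the form would send. A Python stdlib sketch of building such a request (the endpoint and parameter names here are hypothetical -- Google actively blocks automated queries and forbids them in its terms of service, so point this at a site you control or one that permits scraping):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode the form data exactly as the browser would for a GET form.
params = urlencode({"q": "web scraping", "page": 1})
url = "https://example.com/search?" + params

# Some sites reject requests without a User-Agent header.
req = Request(url, headers={"User-Agent": "my-scraper/0.1"})
print(req.full_url)
# https://example.com/search?q=web+scraping&page=1

# To actually fetch and then scrape the results page:
#   from urllib.request import urlopen
#   html = urlopen(req).read().decode()
```

For a POST form you would pass `urlencode(...).encode()` as the request's `data` argument instead of appending it to the URL; which one applies, and the field names to use, come from the form's method= and name= attributes in the page source.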

Scraping sites that require login with Python

北慕城南 submitted on 2019-12-03 13:03:24
Question: I use several ad networks for my sites, and to see how much money I've made I need to log in to each one daily and add up the values. I was thinking of making a Python script that would do this for me to get a quick total. I know I need to do a POST request to log in, then store the cookies that I get back, and then GET the report page while passing in those cookies. What's the most convenient way to replicate in Python what I'm doing when I browse the sites manually? Answer 1: See if this works for
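The POST-then-carry-cookies flow described in the question maps directly onto `http.cookiejar` in the Python standard library: an opener built with an `HTTPCookieProcessor` stores the session cookie from the login response and replays it on every later request automatically. A sketch, not tied to any particular ad network -- the URLs and field names below are placeholders you must read out of the real login form:

```python
import urllib.request
import http.cookiejar
from urllib.parse import urlencode

# The jar holds whatever cookies the server sets (e.g. the session id).
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# Field names must match the name= attributes of the real login form.
login_data = urlencode({"user": "me", "pass": "secret"}).encode()

# POST the credentials, then fetch the report with the same opener;
# the cookie round-trip happens behind the scenes (calls commented
# out here because they need a live server):
#   opener.open("https://example.com/login", login_data)
#   report = opener.open("https://example.com/report").read()
```

One opener per site keeps the sessions separate; summing the daily totals is then ordinary parsing of each `report` page.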

Excluding unwanted results of findAll using BeautifulSoup

风格不统一 submitted on 2019-12-03 12:48:32
Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook: <p class="review_comment">. So, using simple code as follows:

```python
content = page.read()
soup = BeautifulSoup(content)
results = soup.find_all("p", "review_comment")
```

I am happily parsing the text that is living here: <p class="review_comment"> This place is terrible!</p>. The bad news is that every 30 or so times soup.find_all gets a match, it also matches and grabs something that I really don't want, which is a user's old review that they've since updated: <p class="review_comment"> It's 1999, and I will
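The usual fix is to exclude matches by what wraps them: old reviews typically sit inside a container with a distinguishing class, so you keep only the paragraphs with no such ancestor. The container class "previous-review" below is hypothetical -- inspect the real page to find what actually wraps the stale copies. A self-contained stdlib sketch of the ancestor check (with bs4 the equivalent filter is checking each result's parents):

```python
from html.parser import HTMLParser

class ReviewScraper(HTMLParser):
    """Collect text of <p class="review_comment"> elements, skipping any
    whose ancestor carries the (hypothetical) 'previous-review' class."""
    def __init__(self):
        super().__init__()
        self.stack = []          # class lists of currently open elements
        self.capturing = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(classes)
        if (tag == "p" and "review_comment" in classes
                and not any("previous-review" in c for c in self.stack[:-1])):
            self.capturing = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
        if tag == "p":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.reviews[-1] += data

html = ('<div class="previous-review">'
        '<p class="review_comment">old stale review</p></div>'
        '<p class="review_comment">This place is terrible!</p>')
s = ReviewScraper()
s.feed(html)
print(s.reviews)  # ['This place is terrible!']
```

If the old reviews are not wrapped but instead carry an extra class on the <p> itself, the filter moves from the ancestor stack to the element's own class list.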

Websites that are particularly challenging to crawl and scrape? [closed]

萝らか妹 submitted on 2019-12-03 12:35:16
I'm interested in public-facing sites (nothing behind a login / authentication) that have things like:

- High use of internal 301 and 302 redirects
- Anti-scraping measures (but not banning crawlers via robots.txt)
- Non-semantic or invalid mark-up
- Content loaded via AJAX in the form of onclicks or infinite scrolling
- Lots of parameters used in URLs
- Canonical problems
- Convoluted internal link structure

and anything else that generally makes crawling a website a headache! I have built a crawler / spider that performs a range of analysis on a website, and I'm on the lookout for sites that will make it

BeautifulSoup and ASP.NET/C#

回眸只為那壹抹淺笑 submitted on 2019-12-03 12:29:40
Question: Has anyone integrated BeautifulSoup with ASP.NET/C# (possibly using IronPython or otherwise)? Is there a BeautifulSoup alternative or a port that works nicely with ASP.NET/C#? The intent of the library is to extract readable text from any random URL. Thanks Answer 1: Html Agility Pack is a similar project, but for C# and .NET. EDIT: To extract all readable text: document.DocumentNode.InnerText. Note that this will return the text content of <script> tags. To fix that, you can remove
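The caveat in the answer -- that a naive InnerText dump includes <script> contents -- applies to any HTML-to-text extractor. A Python stdlib sketch of the same idea, skipping <script> and <style> subtrees while collecting everything else (a rough analogue of removing script nodes before reading DocumentNode.InnerText in Html Agility Pack):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many skipped elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><script>var x = 1;</script></head>"
        "<body><h1>Hi</h1><p>Readable text.</p></body></html>")
t = TextExtractor()
t.feed(html)
print(" ".join(t.parts))  # Hi Readable text.
```

In the Html Agility Pack version, the equivalent step is deleting the script/style nodes from the document tree first, then reading InnerText from what remains.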