html-parsing | 易学教程

Error 410 (“resource no longer available”) while getting html code of an url in Python

Question: I am trying to get the HTML of the following link: http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html To do so, I proceeded as follows:

    import requests
    try:
        from BeautifulSoup import BeautifulSoup
    except ImportError:
        from bs4 import BeautifulSoup

    url = 'http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
    html = requests.get(url)

And the HTML I get back ( print(html.text) ) is the following: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head>
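The excerpt is cut off before any answer, so the cause of the 410 is not stated here. One thing worth ruling out, purely as an assumption, is that the server treats the default client User-Agent differently from a browser's. The sketch below uses the standard library's urllib rather than requests so that it has no third-party dependency; `BROWSER_USER_AGENT`, `build_request` and `fetch_html` are hypothetical names introduced for illustration.

```python
import urllib.request

# Assumption: an illustrative browser-like User-Agent string, not taken from the post.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

URL = "http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build (but do not send) a GET request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10.0):
    """Send the request and decode the body; raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html(URL)` performs the actual network request. With requests, the equivalent idea would be passing a `headers` dictionary to `requests.get`. Whether this changes the response for this particular site is not confirmed by the excerpt.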

Why Jsoup cannot select td element?

Question: I made a small test (with Jsoup 1.6.1):

    String s = "" + Jsoup.parse("<td></td>").select("td").size();
    System.out.println("Selected elements count : " + s);

It outputs: Selected elements count : 0

But it should return 1, because I parsed HTML containing a td element. What is wrong with my code, or is there a bug in Jsoup?

Answer 1: Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table> . int size = Jsoup.parse("<table><td></td></table>

Temporary removal of HTML from string for Google Translate API to reduce cost

Question: I have to translate some details using a paid Google API. The details contain HTML, and Google charges per character. I don't want to send the complete content, only the English text with the HTML removed. I can remove HTML tags and entities using PHP functions, but after translation I have to put the translated text back inside the original HTML tags so that it displays properly. The content will also include CSS. Example: <strong>This is a test</strong><br /> <custom tag>This is a
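One way to approach this, sketched here in Python rather than the PHP the question mentions, is to split the markup into tag pieces and text pieces, pass only the text pieces to the translator, and join everything back in order. The names `TextSplitter` and `translate_html` are hypothetical, and the `translate` argument stands in for the paid API call; nothing below is taken from an answer in the source.

```python
from html.parser import HTMLParser


class TextSplitter(HTMLParser):
    """Split markup into ordered ('tag', raw) and ('text', raw) pieces."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.pieces.append(("tag", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are kept exactly as written.
        self.pieces.append(("tag", self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Note: the parser lower-cases end-tag names.
        self.pieces.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        self.pieces.append(("text", data))

    def handle_entityref(self, name):
        self.pieces.append(("tag", f"&{name};"))

    def handle_charref(self, name):
        self.pieces.append(("tag", f"&#{name};"))

    def handle_comment(self, data):
        self.pieces.append(("tag", f"<!--{data}-->"))


def translate_html(markup, translate):
    """Apply `translate` to text pieces only; markup is passed through as-is."""
    splitter = TextSplitter()
    splitter.feed(markup)
    splitter.close()
    out = []
    for kind, raw in splitter.pieces:
        if kind == "text" and raw.strip():
            out.append(translate(raw))  # only these characters would be billed
        else:
            out.append(raw)
    return "".join(out)
```

For instance, `translate_html(markup, str.upper)` uses upper-casing as a stand-in translator. A real implementation would probably batch the text pieces into fewer API calls and would need to skip the contents of style and script elements, which this sketch does not do.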

DOM parser: remove certain attributes only

Question: How can I use a DOM parser to remove all attributes from span tags except these two?

    <span style="text-decoration: underline;">cultura</span> accept
    <span style="text-decoration: line-through;">heart</span> accept
    reject this,
    <span style="font-family: " lang="EN-US">May</span> accept

Is it possible? My working code from the other post I made, $content = ' <span style="text-decoration: underline;">cultura</span>l <span style="text-decoration: line-through;">heart</span> <span style
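The filtering rule can be sketched as an allow-list. The sketch below is in Python rather than PHP, and it assumes one reading of the question: a span keeps its style attribute only when the value is exactly one of the two text-decoration declarations, and every other span attribute (including lang and other style values) is dropped. `ALLOWED_STYLES`, `SpanAttributeFilter` and `clean_spans` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only these exact style values are allowed to survive on <span>.
ALLOWED_STYLES = {
    "text-decoration: underline;",
    "text-decoration: line-through;",
}


class SpanAttributeFilter(HTMLParser):
    """Re-emit markup, stripping non-allow-listed attributes from <span> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            # Other tags are copied through exactly as written.
            self.out.append(self.get_starttag_text())
            return
        kept = [(n, v) for n, v in attrs if n == "style" and v in ALLOWED_STYLES]
        rendered = "".join(f' {n}="{escape(v)}"' for n, v in kept)
        self.out.append(f"<span{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean_spans(markup):
    """Return markup with span attributes reduced to the allow-list."""
    parser = SpanAttributeFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The same allow-list idea carries over to a DOM API: iterate over each span's attributes and remove every attribute that fails the check.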

jQuery: Risk of not closing tags in constructors

Question: Is there any reason I would use $('<div></div>') instead of $('<div>') ? Or $('<div><b></b></div>') instead of $('<div><b>') ? I prefer the latter in both cases because it is shorter.

Answer 1: That depends on whether you use a single tag or multiple tags to create the element or elements. If you use a single tag, jQuery will use the document.createElement method to create the element, so it doesn't matter whether you use "<div/>" or "<div></div>" . If you have several elements, jQuery will create the

HTML parsing in Android

Question: I am trying to learn how to parse HTML, but as I don't have much experience in either Java or Android, it's a little complicated. I have read the IBM XML parsing tutorial and have learned to parse an RSS feed. My problem is that I would like to get data from an HTML site. I have read some information on HTML cleaner, JSON, etc., but I can't find a good tutorial to help me. Do you have any tutorials that might be helpful? Thanks.

Answer 1: Check out the following HTML parsers. There are more out

Python Beautifulsoup Find_all except

Question: I'm struggling to find a simple way to solve this problem and hope you might be able to help. I've been using Beautifulsoup's find_all and trying some regex to find all the items except the 'emptyLine' line in the HTML below: <div class="product_item0 ">...</div> <div class="product_item1 ">...</div> <div class="product_item2 ">...</div> <div class="product_item0 ">...</div> <div class="product_item1 ">...</div> <div class="product_item2 ">...</div> <div class="product_item0 ">...</div> <div
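A possible approach, assuming 'emptyLine' is the class name of the unwanted div (the excerpt is cut off before that element appears), is to pass a predicate function as the `class_` filter of `find_all`. Beautiful Soup calls such a function with each individual class of a tag, so the trailing space in `class="product_item0 "` should not matter. `is_product_class` and `find_product_divs` are hypothetical names.

```python
import re

# Matches product_item0, product_item1, ... but not emptyLine.
PRODUCT_CLASS = re.compile(r"product_item\d+")


def is_product_class(css_class):
    """Predicate for find_all(class_=...); css_class is None for tags without a class."""
    return css_class is not None and PRODUCT_CLASS.fullmatch(css_class) is not None


def find_product_divs(html):
    """Return all <div> tags whose class looks like product_item<N>."""
    # Imported here so the predicate above can be used without bs4 installed.
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_=is_product_class)
```

Passing the compiled pattern directly, as in `class_=re.compile(r"^product_item\d+$")`, is an equivalent alternative. The function form just makes the exclusion rule easier to extend.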

JSoup.connect throws 403 error while apache.httpclient is able to fetch the content

Question: I am trying to parse the HTML dump of any given page. I used HTML Parser and also tried JSoup for parsing. I found useful functions in Jsoup, but I am getting a 403 error when calling Document doc = Jsoup.connect(url).get(); I tried HTTPClient to get the HTML dump, and it was successful for the same URL. Why is JSoup giving a 403 for the same URL that returns content through the Commons HTTP client? Am I doing something wrong? Any thoughts?

Answer 1: Working solution is as follows (Thanks to Angelo

Accessing html generated by Javascript with htmlunit -Java

Question: I am trying to test a website that uses JavaScript to render most of its HTML. With the HtmlUnit browser, how can I access the HTML generated by the JavaScript? I was looking through the documentation but wasn't sure what the best approach might be.

    WebClient webClient = new WebClient();
    HtmlPage currentPage = webClient.getPage("some url");
    String Source = currentPage.asXml();
    System.out.println(Source);

This is an easy way to get back the html of the page but would

Web Scraping (in R?)

Question: I want to get the names of the companies in the middle column of this page (written in bold blue), as well as the location indicator of the person registering the complaint (e.g. "India, Delhi", written in green). Basically, I want a table (or data frame) with two columns, one for company and the other for location. Any ideas?

Answer 1: You can easily do this using the XML package in R . Here is the code url = "http://www.consumercomplaints.in/bysubcategory/mobile-service-providers/page