html-parsing | 易学教程

How to extract the content attribute of the meta name=generator tag?

Question: I am using the code below to extract the content of the meta 'generator' tag from a web page using Jsoup: Elements metalinks = doc.select("meta[name=generator]"); boolean metafound=false; if(metalinks.isEmpty()==false) { metatagcontent = metalinks.first().select("content").toString(); metarequired=metatagcontent; metafound=true; } else { metarequired="NOT_FOUND"; metafound=false; } The problem is that for a page that does contain the meta generator tag, no value is shown (when I output the value of
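
One likely cause, judging only from the excerpt: `select("content")` searches for a child element named `content`, whereas the value being asked for is an attribute of the meta tag (in Jsoup, attributes are read with `attr("content")`). The sketch below shows the same idea in Python with the standard-library `html.parser`, not Jsoup; the class name, function name, and sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class GeneratorMetaParser(HTMLParser):
    """Records the content attribute of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta tag is of interest.
        if tag != "meta" or self.generator is not None:
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        if (attr_map.get("name") or "").lower() == "generator":
            # Read the attribute value, not a child element.
            self.generator = attr_map.get("content")


def find_generator(html, default="NOT_FOUND"):
    """Returns the generator string, or the default when no such tag exists."""
    parser = GeneratorMetaParser()
    parser.feed(html)
    parser.close()
    return parser.generator if parser.generator is not None else default


# Hypothetical sample page, not taken from the question.
sample_html = '<html><head><meta name="generator" content="WordPress 3.5"></head></html>'
print(find_generator(sample_html))
```

The fallback value mirrors the "NOT_FOUND" marker used in the question's code.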

How to get text & Other tags between specific tags using Jericho HTML parser?

Question: I have an HTML file which contains a specific tag, e.g. <TABLE cellspacing=0>, and the end tag is </TABLE>. Now I want to get everything between those tags. I am using the Jericho HTML parser in Java to parse the HTML. Is it possible to get the text and other tags between specific tags with the Jericho parser? For example: <TABLE cellspacing=0> <tr><td>HELLO</td> <td>How are you</td></tr> </TABLE> Expected output: <tr><td>HELLO</td> <td>How are you</td></tr> Answer 1: Once you have found the Element of your table, all
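
What the question asks for is the inner markup of an element: everything after its start tag and before its matching end tag. As a rough illustration of that idea (in Python with the standard-library `html.parser`, not Jericho), the sketch below collects the markup inside the first matching element. The names `InnerHtmlParser` and `inner_html` are hypothetical, end tags are re-emitted in lowercase, and comments inside the element are dropped.

```python
from html.parser import HTMLParser


class InnerHtmlParser(HTMLParser):
    """Collects the markup between the first <target> start tag and its end tag."""

    def __init__(self, target):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.target = target.lower()
        self.depth = 0          # nesting depth of target elements
        self.finished = False   # set once the first target element is closed
        self.parts = []

    def _capturing(self):
        return self.depth > 0 and not self.finished

    def handle_starttag(self, tag, attrs):
        if self.finished:
            return
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
        if tag == self.target:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied as written.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        if tag == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.finished = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")


def inner_html(html, target):
    """Returns the markup inside the first <target> element, stripped of outer whitespace."""
    parser = InnerHtmlParser(target)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = "<TABLE cellspacing=0> <tr><td>HELLO</td> <td>How are you</td></tr> </TABLE>"
print(inner_html(sample, "table"))
```

Tracking the nesting depth means a table nested inside the target table is copied whole instead of ending the capture early.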

Wikipedia Data Scraping with Python

Question: I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the columns that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far; it outputs nothing and I'm not entirely sure why. I believe it is due to the a tags, but I do not know what to change. Any help would be greatly appreciated. wiki =

BeautifulSoup -ing a website with login and site search engine

Question: I'm trying to scrape the International Maritime Organization's data (https://gisis.imo.org/Public/PAR/Search.aspx) on shipping vessel attacks between the dates ("is between" in the site's search engine) 2002-01-01 and 2005-12-31. I've previously used the bs4 and requests modules in Python to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a login and password (under the "public" account type). Furthermore, as I said, the data requires a search / filter before

How to convert an HTML content to PDF without losing the formatting using Java?

Question: I have some HTML content (including formatting tags such as strong, images, etc.). In my Java code, I want to convert this HTML content into a PDF document without losing the HTML formatting. Is there any way to do it in Java (using iText or any other library)? Answer 1: I would try DocRaptor.com. It converts HTML to PDF or HTML to XLS in any language, and since it uses Prince XML (without making you pay the expensive license fee), the quality is a lot better than the other options out there. It's

Strange encoding behaviour with jsoup

Question: I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion). The page that contains the error is: http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html I read the needed String with the following piece of code: Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html")

Regex to parse a multiline HTML

Question: I am trying to parse a multi-line HTML file using regex. HTML code: <td>Details</td></tr> <tr class=d1> <td>uss_vod_translator</td> Regex expression: if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/) { print "$1"; } I am using /s* (space) for multi-line, but it is not working. I searched about it and even used /\? for multi-line, but that did not work either. Can anyone please suggest how to parse multi-line HTML? I know regex is a poor solution for parsing HTML. But I have
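
A plausible explanation, assuming the script reads the file one line at a time (the variable name `$line` suggests this, but the excerpt does not show the loop): `\s*` already matches newlines, so the pattern itself can span lines, but it can never match when each string it is tested against holds only a single line. The Python sketch below applies the same pattern once to the whole text and once to each line separately; the sample text is the HTML from the question with assumed line breaks.

```python
import re

# The HTML from the question, with line breaks assumed between the tags.
html = """<td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>"""

# Same pattern as in the question; \s* also matches newline characters.
pattern = re.compile(r"Details</td>\s*</tr>\s*<tr\s*class=d1>\s*<td>(\w*)</td>")


def first_detail(text):
    """Returns the captured cell text, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Matching against the whole document finds the value.
print(first_detail(html))

# Matching each line separately finds nothing, because no single line
# contains the full sequence of tags.
print([first_detail(line) for line in html.splitlines()])
```

The pattern is unchanged between the two calls; only the amount of text handed to the regex engine differs.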

Extracting links from HTML

Question: I am trying to extract links from HTML. I am using the following regular expression: href=\"([^\"]*)\" which extracts unnecessary links. How can I write a regular expression to extract only links with class="l", like <a href="http://users.elite.net/runner/jennifers/hello.htm" class="l"> <a href="http://www.hellodesign.com/" class="l"> <a href="http://www.ipl.org/div/hello/" class="l"> Answer 1: Parsing HTML with regex is unnecessarily overcomplicated. Regex is the wrong tool for the job. Just
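
To illustrate the parser-based approach the answer points toward, here is a minimal sketch using Python's standard-library `html.parser`. It collects the `href` of every `a` tag whose `class` attribute includes `l`. The class name `ClassLinkParser`, the helper `links_with_class`, and the extra `example.com` link in the sample are assumptions added for the example.

```python
from html.parser import HTMLParser


class ClassLinkParser(HTMLParser):
    """Collects href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.wanted_class in classes:
            self.links.append(href)


def links_with_class(html, wanted_class):
    parser = ClassLinkParser(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.links


# The three links from the question plus one hypothetical link to be skipped.
sample_html = (
    '<a href="http://users.elite.net/runner/jennifers/hello.htm" class="l">one</a>'
    '<a href="http://example.com/other" class="nav">skip</a>'
    '<a href="http://www.hellodesign.com/" class="l">two</a>'
    '<a href="http://www.ipl.org/div/hello/" class="l">three</a>'
)
print(links_with_class(sample_html, "l"))
```

Unlike the regex, this does not depend on the order of the attributes or on the quoting style used in the tag.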

PHP Regex to remove last paragraph and contents

Question: I have the following stored in a MySQL table: <p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p><div class="item"><p>Some paragraph here</p><p><strong><u>Specs</u>:</strong><br /><br /><strong>Weight:</strong> 10kg<br /><br /><strong>LxWxH:</strong> 5mx1mx40cm</p><p>This is the paragraph I am trying to remove with regex.</p></div> I'm trying to remove the last paragraph tags and content on every row in the table. I can loop through the table with PHP easily enough, but the
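
One way to approach this, sketched in Python instead of PHP: since `p` elements cannot be nested, the last paragraph starts at the last `<p ...>` opening tag and ends at the first `</p>` after it. The function name `remove_last_paragraph` is hypothetical, and the sketch assumes the stored markup is well formed, as in the sample row.

```python
import re

P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
P_CLOSE = re.compile(r"</p\s*>", re.IGNORECASE)


def remove_last_paragraph(html):
    """Removes the last <p>...</p> element; returns the input unchanged if none is found."""
    opens = list(P_OPEN.finditer(html))
    if not opens:
        return html
    start = opens[-1].start()
    # The first closing tag after the last opening tag ends that paragraph.
    close = P_CLOSE.search(html, start)
    if close is None:
        return html
    return html[:start] + html[close.end():]


# The row from the question.
row = (
    '<p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p>'
    '<div class="item"><p>Some paragraph here</p>'
    '<p><strong><u>Specs</u>:</strong><br /><br /><strong>Weight:</strong> 10kg'
    '<br /><br /><strong>LxWxH:</strong> 5mx1mx40cm</p>'
    '<p>This is the paragraph I am trying to remove with regex.</p></div>'
)
print(remove_last_paragraph(row))
```

Locating the last opening tag first avoids the greedy-match problem where a single pattern such as `<p>.*</p>` would span several paragraphs.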

DOMDocument Parse html

Question: I have an HTML page containing a number of <tr><td> elements like <tr> <td class="notextElementLabel width100">address:</td> <td style="width: 100%" colspan="1" class="formFieldelement"><b>12284,CA</b></td> </tr> Let's say the above <tr> is at the 4th position, meaning there are 3 more <tr> elements before it. Now I want to get the value of address, so I am doing $doc = new DOMDocument(); @$doc->loadHTML($this->siteHtmlData); $tdElements = $doc->getElementsByTagName("td"); $i=0; foreach (
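
Instead of counting rows, one option is to look up the value by its label: collect the text of every `td` cell in document order and return the cell that follows the one reading `address:`. The sketch below shows this in Python with the standard-library `html.parser`, not PHP's DOMDocument; `LabelValueParser` and `value_for_label` are hypothetical names, and the sample is the row from the question.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collects the stripped text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> also belongs to the cell.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


def value_for_label(html, label):
    """Returns the text of the cell that follows the cell containing the label."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    for current, following in zip(parser.cells, parser.cells[1:]):
        if current == label:
            return following
    return None


sample_row = (
    '<tr> <td class="notextElementLabel width100">address:</td> '
    '<td style="width: 100%" colspan="1" class="formFieldelement">'
    '<b>12284,CA</b></td> </tr>'
)
print(value_for_label(sample_row, "address:"))
```

Looking up by label keeps working if rows are added or removed above the address row, which a fixed position index would not.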