html-parsing | 易学教程

Help With PHP and XPath

Question: I need help doing a few things with XPath in PHP. With any given HTML, I need to:
• Remove all tables and their contents
• Remove everything after the first h1 tag
• Keep only paragraphs, including their inner HTML (links, lists, etc.)
With regex, I got everything working perfectly. When I encountered nested tables, however, I decided that it is indeed foolish to parse HTML with regex. Thanks so much!
Answer 1: With any given HTML, I need to: • Remove all tables and their contents • Remove everything
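
As a rough illustration of why a real parser copes with nested tables where regex does not, here is a minimal Python sketch. It uses the standard library's `html.parser` rather than the PHP/XPath approach the question asks about, and it tracks table nesting with a depth counter. The class name `ParagraphExtractor`, the function `extract_paragraphs`, and the sample markup are hypothetical. The "remove everything after the first h1" requirement is left out, since the excerpt does not show how it should interact with the other two.

```python
import html
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Keep <p> elements with their inner markup; drop anything inside tables."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # greater than 0 while inside (possibly nested) tables
        self.current = None    # pieces of the paragraph being collected, if any
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif self.table_depth == 0:
            if tag == "p":
                self.current = [self.get_starttag_text()]
            elif self.current is not None:
                self.current.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once, unchanged.
        if self.table_depth == 0 and self.current is not None:
            self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth = max(0, self.table_depth - 1)
        elif self.table_depth == 0 and self.current is not None:
            self.current.append(f"</{tag}>")
            if tag == "p":
                self.paragraphs.append("".join(self.current))
                self.current = None

    def handle_data(self, data):
        if self.table_depth == 0 and self.current is not None:
            # The parser unescapes character references, so escape them again.
            self.current.append(html.escape(data, quote=False))


def extract_paragraphs(source):
    parser = ParagraphExtractor()
    parser.feed(source)
    parser.close()
    return parser.paragraphs


# Hypothetical sample input with a nested table.
sample = (
    '<h1>Title</h1>'
    '<p>Intro with <a href="/x">a link</a>.</p>'
    '<table><tr><td><p>cell</p>'
    '<table><tr><td>nested</td></tr></table>'
    '</td></tr></table>'
    '<p>After &amp; done</p>'
)

for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

The depth counter is what handles nesting: the inner `</table>` only lowers the count, so content stays suppressed until the outermost table closes.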

html search and replace on server side

Question: I would like to search for something like stack <"sometag"> overflow</"sometag"> and replace it with stack <"sometag">underflow</"sometag">. It is part of a large HTML text, and I would like to do it in Java (there are some limitations on the server-side technologies that I can use). I searched and found this post: How to find/replace text in html while preserving html tags/structure. One of the answers there suggests marking with special markers, producing plain text, and then using regex. Finally unmarking

Search in HTML page using Regex patterns with python

Question: I'm trying to find a string inside an HTML page with known patterns. For example, in the following HTML code: <TABLE WIDTH="100%"> <TR><TD ALIGN="LEFT" width="50%"> </TD> <TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE> <HR> <TABLE WIDTH="100%"> <TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD> <TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I><

Extract an HTML tag name from a string

Question: I want to extract the tag name from an HTML tag with attributes. For example, I have this tag <a href="http://chat.stackoverflow.com" class="js-gps-track" data-gps-track="site_switcher.click({ item_type:6 })" > and I need to extract the tag name a. I have tried the following regex, but it doesn't work. if ( $raw =~ /^<(\S*).*>$/ ) { print "$1 is tag name of string\n"; } What is wrong with my code? Answer 1: Your regex is not matching the newline. You have to use the s flag (single line), but since your
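
The question uses Perl; the same idea can be sketched in Python. Instead of matching the whole tag, which forces the dot to cross newlines, the pattern below anchors only on the name that follows `<`, so line breaks inside the tag do not matter. The names `TAG_NAME` and `tag_name` are hypothetical, and the line breaks in the sample tag are assumed for illustration.

```python
import re

# Match "<", optional whitespace, then the element name itself.
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9:-]*)")


def tag_name(raw):
    """Return the element name of an opening tag, or None if none is found."""
    match = TAG_NAME.match(raw.strip())
    return match.group(1) if match else None


# The tag from the question, spread over several lines.
raw = '''<a href="http://chat.stackoverflow.com" class="js-gps-track"
   data-gps-track="site_switcher.click({ item_type:6 })"
>'''

print(tag_name(raw))  # prints: a
```

If the full-tag match is really needed, the Python counterpart of the `s` flag mentioned in the answer is `re.DOTALL`.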

How to extract source html from webpage?

Question: I am trying to extract the HTML source of this page, http://www.fxstreet.com/rates-charts/currency-rates/ I want what I see when I save the page from Chrome as a .html file. I tried to do this in Java, using BufferedReader, and then using jsoup. I also tried to do it in Python; however, I keep getting the following message: "This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser." The end goal is to extract the values in the main

Unix - parse html file and get all his resources list

Question: I have an HTML file and I need to generate a list of all the resources it uses: *.htm, *.html, *.css, *.js, *.jpg. I tried many options like grep and sed, without much success. I am also not sure how to do it in Java. This is an example file content: -------------------------------- > <link rel="StyleSheet" href="css/webworks.css" type="text/css" > media="all" /> > <script type="text/javascript" language="JavaScript1.2" src="wwhdata/common /context.js"> > /script> > <a class="WebWorks_Breadcrumb
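
One possible way to build such a list, sketched in Python with the standard library's `html.parser` rather than grep, sed, or Java, is to collect every `href` and `src` attribute and keep those whose path ends in one of the wanted extensions. The class name `ResourceCollector` and the sample markup are hypothetical; the sample is a cleaned-up guess at the kind of content shown in the question, not the asker's actual file.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect href/src values whose path ends in a wanted extension."""

    EXTENSIONS = (".htm", ".html", ".css", ".js", ".jpg")

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Ignore any fragment or query string when checking the extension.
                path = value.split("#")[0].split("?")[0]
                if path.lower().endswith(self.EXTENSIONS):
                    self.resources.append(value)


# Hypothetical sample input.
sample = '''
<link rel="StyleSheet" href="css/webworks.css" type="text/css" media="all" />
<script type="text/javascript" src="wwhdata/common/context.js"></script>
<a href="index.html#top">Home</a>
<img src="images/logo.jpg" alt="logo">
<a href="mailto:someone@example.com">Mail</a>
'''

collector = ResourceCollector()
collector.feed(sample)
collector.close()
print(collector.resources)
```

Because the parser reads attributes rather than lines, a tag that wraps across several lines is handled the same way as one on a single line, which is where line-oriented tools tend to struggle.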

Can't figure how to parse using HTML Agility Pack

Question: I have the following chunk of HTML code, but I can't figure out how to get the designated values <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> </head> <body > <form name="form1" method="post" action="" id="form1"> <div> <table class="tableclass" > <tbody> <tr> <tr> <td colspan="5" class="myclass1"><span id="myclass2">value1</span></td> </tr> <tr id="idvalue" aa="1" class

PHP : Extracting string between two tags by childs content [duplicate]

Question: This question already has answers here: How do you parse and process HTML/XML in PHP? (30 answers). Closed 5 years ago. I have the following HTML markup: <ul> <li> <strong>Online:</strong> 2/14/2010 3:40 AM </li> <li> <strong>Hearing Impaired:</strong> No </li> <li> <strong>Downloads:</strong> 3,840 </li> </ul> and I want to capture 3,840 from the last li by "Downloads:". What do you suggest? My attempt: preg_match('/<li><strong>Downloads:<\/strong>(.*?)<\/li>/s', $s, $a); Answer 1: I suggest using an
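
A parser-based approach to this lookup can be sketched as follows, in Python rather than PHP. It assumes the fragment is well-formed, as the one in the question is, so the standard library's `xml.etree.ElementTree` can read it. The text after a `</strong>` is exposed as that element's `tail`. The function name `value_for_label` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The markup from the question.
markup = """<ul>
  <li> <strong>Online:</strong> 2/14/2010 3:40 AM </li>
  <li> <strong>Hearing Impaired:</strong> No </li>
  <li> <strong>Downloads:</strong> 3,840 </li>
</ul>"""


def value_for_label(fragment, label):
    """Return the text that follows the <strong> whose content equals label."""
    root = ET.fromstring(fragment)
    for strong in root.iter("strong"):
        if (strong.text or "").strip() == label:
            return (strong.tail or "").strip()
    return None


print(value_for_label(markup, "Downloads:"))  # prints: 3,840
```

Whitespace between `<li>` and `<strong>` has no effect here. The regex attempt in the question, by contrast, requires `<strong>` to follow `<li>` immediately.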

XPath search through HTML tags

Question: The following HTML shows that the 3rd search (search for "Practice Guidelines Professional") does not work, because the text "Practice Guidelines" is placed inside <strong></strong> tags... Is it possible to write an XPath search that ignores HTML tags within the text? <html> <head> <meta http-equiv="X-UA-Compatible" content="chrome=1"> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="viewport" content="width
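
In XPath 1.0, an element's string value is the concatenation of all its descendant text nodes, so a test such as `contains(., '...')` sees text across inline tags, whereas `text()` selects only the element's direct text nodes. The Python sketch below mimics that behaviour with `itertext()` from the standard library. The sample markup and the function name `find_by_text` are hypothetical, since the page from the question is not shown in full.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
sample = """<div>
  <p>Practice Guidelines Professional</p>
  <p><strong>Practice Guidelines</strong> Professional</p>
  <p>Something else</p>
</div>"""


def find_by_text(fragment, needle, tag="p"):
    """Return elements whose combined descendant text contains needle."""
    root = ET.fromstring(fragment)
    hits = []
    for element in root.iter(tag):
        # Join all descendant text and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if needle in text:
            hits.append(element)
    return hits


matches = find_by_text(sample, "Practice Guidelines Professional")
print(len(matches))  # prints: 2
```

Both the plain paragraph and the one with the `<strong>` wrapper match, because the comparison is made on the flattened text rather than on individual text nodes.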

Having trouble accessing xpath attribute with scrapy

Question: I am currently trying to scrape the following URL: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562 On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693. This is my current XPath: sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span') It seems to return only an empty array. Can someone suggest a correct XPath? Answer 1: There are no reviews on the initial page you are