html-parsing

Help With PHP and XPath

好久不见. posted on 2019-12-11 09:46:41
Question: I need help doing a few things with XPath in PHP. With any given HTML, I need to: • Remove all tables and their contents • Remove everything after the first h1 tag • Keep only paragraphs (including their inner HTML: links, lists, etc.) With regex, I got everything working perfectly. When I encountered nested tables, however, I decided that it is indeed foolish to parse HTML with regex. Thanks so much! Answer 1: With any given HTML, I need to: • Remove all tables and their contents • Remove everything
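
The three requirements in the question can be sketched with a parser instead of regex. The following is a minimal Python illustration (the question itself asks about PHP and XPath), using only the standard library. The class name `ParagraphKeeper` is hypothetical, "remove everything after the first h1" is read here as "stop at the first h1", and paragraphs are assumed to be explicitly closed with `</p>`.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphKeeper(HTMLParser):
    """Hypothetical helper: keep <p> elements found outside tables and before the first <h1>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []     # finished paragraphs, inner HTML included
        self._table_depth = 0    # > 0 while inside (possibly nested) tables
        self._done = False       # set once the first <h1> is seen
        self._current = None     # pieces of the paragraph being collected

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif self._table_depth:
            return               # everything inside a table is dropped
        elif tag == "h1":
            self._done = True
        elif tag == "p":
            self._current = []
        elif self._current is not None:
            # Inner markup (links, lists, ...) is kept verbatim.
            self._current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table":
            self._table_depth = max(0, self._table_depth - 1)
        elif self._table_depth:
            return
        elif tag == "p" and self._current is not None:
            self.paragraphs.append("<p>" + "".join(self._current) + "</p>")
            self._current = None
        elif self._current is not None:
            self._current.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._done and not self._table_depth and self._current is not None:
            # The parser unescapes text, so escape it again on output.
            self._current.append(escape(data, quote=False))


def keep_paragraphs(html):
    parser = ParagraphKeeper()
    parser.feed(html)
    return parser.paragraphs


sample = (
    '<p>Intro <a href="x.html">link</a></p>'
    "<table><tr><td><table><tr><td><p>cell</p></td></tr></table></td></tr></table>"
    "<p>Kept</p><h1>Title</h1><p>Dropped</p>"
)
print(keep_paragraphs(sample))
```

The depth counter is what makes nested tables safe: content is only collected again once every opened table has been closed.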

html search and replace on server side

為{幸葍}努か posted on 2019-12-11 09:40:03
Question: I would like to search for something like stack <"sometag"> overflow</"sometag"> and replace it with stack <"sometag">underflow</"sometag">. It is part of a large HTML text, and I would like to do it in Java (there are some limitations on the server-side technologies that I can use). I searched and found this post: How to find/replace text in html while preserving html tags/structure. One of the answers there suggests marking with special markers, producing plain text, and then using regex. Finally unmarking

Search in HTML page using Regex patterns with python

▼魔方 西西 posted on 2019-12-11 09:36:33
Question: I'm trying to find a string inside an HTML page with known patterns. For example, in the following HTML code: <TABLE WIDTH="100%"> <TR><TD ALIGN="LEFT" width="50%"> </TD> <TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE> <HR> <TABLE WIDTH="100%"> <TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD> <TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I><

Extract an HTML tag name from a string

主宰稳场 posted on 2019-12-11 08:46:29
Question: I want to extract the tag name from an HTML tag with attributes. For example, I have this tag <a href="http://chat.stackoverflow.com" class="js-gps-track" data-gps-track="site_switcher.click({ item_type:6 })" > and I need to extract the tag name a. I have tried the following regex, but it doesn't work. if ( $raw =~ /^<(\S*).*>$/ ) { print "$1 is tag name of string\n"; } What is wrong with my code? Answer 1: Your regex does not match the newline. You have to use the s flag (single line), but since your
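
One way to sidestep the newline issue is to match only the tag name and never consume the rest of the tag. The sketch below shows that idea in Python rather than the question's Perl. The pattern and the helper name `tag_name` are illustrative assumptions, not the truncated answer's code.

```python
import re

# Match "<", an optional "/", then the tag name; the match stops at the
# first character that cannot be part of a name (whitespace, "/", ">").
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9:-]*)")


def tag_name(raw):
    """Return the name of the first tag found in raw, or None if there is none."""
    match = TAG_NAME.search(raw)
    return match.group(1) if match else None


# Attributes spread over several lines do not matter, because the
# pattern never has to cross a newline.
raw = '<a href="http://chat.stackoverflow.com"\n   class="js-gps-track" >'
print(tag_name(raw))
```

Because nothing after the name is matched, no single-line flag is needed.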

How to extract source html from webpage?

喜夏-厌秋 posted on 2019-12-11 08:42:31
Question: I am trying to extract the HTML source of this page: http://www.fxstreet.com/rates-charts/currency-rates/ I want what I see when I save the page from Chrome as a .html file. I tried to do this in Java, using BufferedReader, and then using jsoup. I also tried to do it in Python; however, I keep getting the following message: "This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser." The end goal is to extract the values in the main

Unix - parse html file and get all his resources list

瘦欲@ posted on 2019-12-11 08:25:45
Question: I have an HTML file and I need to generate a list of all the resources it uses: *.htm, *.html, *.css, *.js, *.jpg. I tried many options like grep and sed, without much success. I am also not sure how to do it in Java. This is an example file content: -------------------------------- > <link rel="StyleSheet" href="css/webworks.css" type="text/css" > media="all" /> > <script type="text/javascript" language="JavaScript1.2" src="wwhdata/common /context.js"> > /script> > <a class="WebWorks_Breadcrumb
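
As an alternative to grep and sed, the resource list can be built by collecting `href` and `src` attribute values with a parser. The following is a minimal Python sketch using only the standard library. `ResourceCollector`, `list_resources`, and the sample markup (a tidied version of the fragment above) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Extensions named in the question.
WANTED = (".htm", ".html", ".css", ".js", ".jpg")


class ResourceCollector(HTMLParser):
    """Hypothetical helper: collect href/src values that end in a wanted extension."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Ignore fragments and query strings when checking the extension.
                path = value.split("#")[0].split("?")[0]
                if path.lower().endswith(WANTED) and path not in self.resources:
                    self.resources.append(path)


def list_resources(html):
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


sample = '''<link rel="StyleSheet" href="css/webworks.css" type="text/css" media="all" />
<script type="text/javascript" src="wwhdata/common/context.js"></script>
<a href="index.html#top">Home</a>'''
print(list_resources(sample))
```

The sketch only looks at `href` and `src`; resources referenced from inside CSS or JavaScript are not covered.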

Can't figure how to parse using HTML Agility Pack

让人想犯罪 __ posted on 2019-12-11 08:02:36
Question: I have the following chunk of HTML code, but I can't figure out how to get the designated values: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> </head> <body > <form name="form1" method="post" action="" id="form1"> <div> <table class="tableclass" > <tbody> <tr> <tr> <td colspan="5" class="myclass1"><span id="myclass2">value1</span></td> </tr> <tr id="idvalue" aa="1" class

PHP : Extracting string between two tags by childs content [duplicate]

断了今生、忘了曾经 posted on 2019-12-11 07:55:59
Question: This question already has answers here: How do you parse and process HTML/XML in PHP? (30 answers) Closed 5 years ago. I have the following HTML markup: <ul> <li> <strong>Online:</strong> 2/14/2010 3:40 AM </li> <li> <strong>Hearing Impaired:</strong> No </li> <li> <strong>Downloads:</strong> 3,840 </li> </ul> and I want to catch 3,840 from the last li by "Downloads:". What do you suggest? My attempt: preg_match('/<li><strong>Downloads:<\/strong>(.*?)<\/li>/s', $s, $a); Answer 1: I suggest using an
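
The lookup "value that follows a given `<strong>` label" can be sketched with a small parser. The example below is in Python, not the question's PHP, and uses only the standard library. The class name `LabelValueParser` is hypothetical, and the markup is the list from the question.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Hypothetical helper: map each <strong> label inside an <li> to the text after it."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._in_strong = False
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._label = None       # a new item starts without a label
        elif tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag == "li":
            self._label = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return                   # skip whitespace between tags
        if self._in_strong:
            self._label = text.rstrip(":")
        elif self._label is not None:
            self.values[self._label] = text


markup = (
    "<ul> <li> <strong>Online:</strong> 2/14/2010 3:40 AM </li> "
    "<li> <strong>Hearing Impaired:</strong> No </li> "
    "<li> <strong>Downloads:</strong> 3,840 </li> </ul>"
)
parser = LabelValueParser()
parser.feed(markup)
print(parser.values["Downloads"])
```

Unlike the `preg_match` attempt, this does not depend on the exact whitespace between `<li>` and `<strong>`.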

XPath search through HTML tags

匆匆过客 posted on 2019-12-11 07:55:23
Question: The following HTML shows that the 3rd search (the search for "Practice Guidelines Professional") does not work, because the text "Practice Guidelines" is placed inside a <strong></strong> tag... Is it possible to make an XPath search ignore HTML tags within the text? <html> <head> <meta http-equiv="X-UA-Compatible" content="chrome=1"> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="viewport" content="width
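
In XPath 1.0, the string value of an element is the concatenation of all its descendant text nodes, so a test such as `contains(normalize-space(.), 'Practice Guidelines Professional')` looks across inline tags, whereas a test on `text()` does not. The Python sketch below imitates that string-value comparison with the standard library's `itertext()`. The sample markup and the function `find_by_text` are hypothetical, since the question's HTML is truncated.

```python
import xml.etree.ElementTree as ET


def find_by_text(root, phrase):
    """Return elements whose whitespace-normalised full text contains phrase."""
    hits = []
    for element in root.iter():
        # itertext() yields the text of the element and of all its
        # descendants, which mirrors the XPath string value ".".
        text = " ".join("".join(element.itertext()).split())
        if phrase in text:
            hits.append(element)
    return hits


# Hypothetical well-formed fragment with the phrase split by <strong>.
doc = ET.fromstring(
    "<body><p><strong>Practice Guidelines</strong> Professional</p>"
    "<p>Other text</p></body>"
)
tags = [e.tag for e in find_by_text(doc, "Practice Guidelines Professional")]
print(tags)
```

Ancestors match too, because their string value also contains the phrase; the innermost match is the last one in document order here.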

Having trouble accessing xpath attribute with scrapy

江枫思渺然 posted on 2019-12-11 07:54:29
Question: I am currently trying to scrape the following URL: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562 On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693. This is my current XPath: sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span') It seems to return only an empty array. Can someone suggest a correct XPath? Answer 1: There are no reviews on the initial page you are