html-parsing | 易学教程

Parsing HTML with VB DOTNET

Question: I am trying to parse data from a website to get specific items from its tables. I know that any <tr> tag with the bgcolor attribute set to #ffffff or #f4f4ff is where I want to start, and my actual data sits in the 2nd <td> within that <tr>. Currently I have:

    Private Sub runForm()
        Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("TR")
        For Each curElement As HtmlElement In theElementCollection
            Dim controlValue As String = curElement.GetAttribute("bgcolor")
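The row-and-cell selection described in this question can be sketched outside VB.NET as well. The following is a minimal Python illustration using only the standard library's `html.parser`, not the asker's WebBrowser-based approach. The class name `SecondCellParser`, the helper `second_cells`, and the `SAMPLE` markup are hypothetical, and the sketch assumes the tables are not nested.

```python
from html.parser import HTMLParser

# The two background colors named in the question.
WANTED_COLORS = {"#ffffff", "#f4f4ff"}


class SecondCellParser(HTMLParser):
    """Collect the text of the 2nd <td> in every <tr> with a wanted bgcolor."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_wanted_row = False
        self._cell_index = 0
        self._buffer = None  # text chunks while inside the 2nd <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Attribute values keep their original case, so normalize them.
            bgcolor = (dict(attrs).get("bgcolor") or "").lower()
            self._in_wanted_row = bgcolor in WANTED_COLORS
            self._cell_index = 0
        elif tag == "td" and self._in_wanted_row:
            self._cell_index += 1
            if self._cell_index == 2:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None
        elif tag == "tr":
            self._in_wanted_row = False


def second_cells(html):
    parser = SecondCellParser()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup, invented for illustration only.
SAMPLE = """
<table>
<tr bgcolor="#FFFFFF"><td>1</td><td>Alpha</td></tr>
<tr bgcolor="#cccccc"><td>2</td><td>Skipped</td></tr>
<tr bgcolor="#f4f4ff"><td>3</td><td>Beta</td></tr>
</table>
"""

print(second_cells(SAMPLE))  # ['Alpha', 'Beta']
```

The comparison is done in lowercase because pages often mix `#FFFFFF` and `#ffffff`.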

How can I find the contents of the first h3 tag?

Question: I am looking for a regex to find the contents of the first <h3> tag. What can I use there?

Answer 1: You should use PHP's DOM parser instead of regular expressions. You're looking for something like this (untested code warning):

    $domd = new DOMDocument();
    libxml_use_internal_errors(true);
    $domd->loadHTML($html_content);
    libxml_use_internal_errors(false);
    $domx = new DOMXPath($domd);
    $items = $domx->query("//h3[position() = 1]");
    echo $items->item(0)->textContent;

Answer 2: Well, a simple solution would
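The parser-instead-of-regex advice in Answer 1 carries over to other languages. As a hedged illustration, here is a minimal Python sketch using the standard library's `html.parser`; it is not a translation of the PHP code above, and the names `FirstH3Parser` and `first_h3_text` are hypothetical.

```python
from html.parser import HTMLParser


class FirstH3Parser(HTMLParser):
    """Capture the text content of the first <h3> element only."""

    def __init__(self):
        super().__init__()
        self.text = None     # set once the first </h3> has been seen
        self._chunks = None  # text pieces collected inside the first <h3>

    def handle_starttag(self, tag, attrs):
        # Start collecting only if no <h3> has been captured or opened yet.
        if tag == "h3" and self.text is None and self._chunks is None:
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._chunks is not None:
            self.text = "".join(self._chunks)
            self._chunks = None


def first_h3_text(html):
    """Return the text of the first <h3>, or None if there is none."""
    parser = FirstH3Parser()
    parser.feed(html)
    parser.close()
    return parser.text


print(first_h3_text("<h1>T</h1><h3>First <b>one</b></h3><h3>Second</h3>"))
# First one
```

Text inside nested tags such as `<b>` is included, which roughly mirrors what `textContent` returns in the PHP answer.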

How to ignore empty lines while using .next_sibling in BeautifulSoup4 in python

Question: Because I want to remove duplicated placeholders in an HTML page, I use the .next_sibling attribute of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore it (have a look at data2). This is the code:

    from bs4 import BeautifulSoup, Tag
    data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
    data2 = """<p>method-removed-here</p> <p>method

What regex can I use to extract URLs from a Google search?

Question: I'm using Delphi with JCLRegEx and want to capture all the result URLs from a Google search. I looked at HackingSearch.com, and they have an example regex that looks right, but I cannot get any results when I try it. I'm using it like this:

    Var
      re:JVCLRegEx;
      I:Integer;
    Begin
      re := TJclRegEx.Create;
      With re do
      try
        Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);
        If match(memo1.lines.text) then

Extract HTML Table ( span ) tags using Jsoup in Java

Question: I am trying to extract the td name and the span class. In the sample code, I want to extract the a href within the first td "accessory" and the span tag in the second td. I want to print:

    Mouse, is-present, yes
    KeyBoard, No
    Dual-Monitor, is-present, Yes

When I use the Java code below, I get: Mouse Yes Keyboard No Dual-Monitor Yes. How do I get the span class name?

HTML code:

    <tr> <td class="" width="1%" style="padding:0px;"> </td> <td class=""> <a href="/accessory">Mouse</a> </td> <td class=
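The key point in this question is that the class name lives in the span's attributes, not in its text, so it has to be read separately. The sketch below shows that idea in Python with the standard library's `html.parser`, not with Jsoup. Because the HTML in the question is truncated, the `SAMPLE` markup is a hypothetical reconstruction, and `AccessoryParser` and `format_rows` are made-up names.

```python
from html.parser import HTMLParser


class AccessoryParser(HTMLParser):
    """Per <tr>: link text, the span's class attribute, and the span text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # dict for the row currently being parsed
        self._target = None  # key in self._row that receives text

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"name": "", "span_class": None, "value": ""}
        elif self._row is None:
            return
        elif tag == "a":
            self._target = "name"
        elif tag == "span":
            # The class is an attribute, so read it from attrs.
            self._row["span_class"] = dict(attrs).get("class")
            self._target = "value"

    def handle_data(self, data):
        if self._row is not None and self._target:
            self._row[self._target] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._target = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def format_rows(html):
    parser = AccessoryParser()
    parser.feed(html)
    parser.close()
    return [
        ", ".join(part for part in (r["name"], r["span_class"], r["value"]) if part)
        for r in parser.rows
    ]


# Hypothetical markup: the question's HTML is cut off, so this is assumed.
SAMPLE = """
<table>
<tr><td></td><td><a href="/accessory">Mouse</a></td>
    <td><span class="is-present">Yes</span></td></tr>
<tr><td></td><td><a href="/accessory">KeyBoard</a></td>
    <td><span>No</span></td></tr>
</table>
"""

print(format_rows(SAMPLE))  # ['Mouse, is-present, Yes', 'KeyBoard, No']
```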

Scraping: cannot access information from web

Question: I am scraping some information from this URL: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab Everything was fine until I tried to scrape the description. I have tried repeatedly but have failed so far. It seems that I can't reach that information. Here is my code:

    html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
    tree = BeautifulSoup(html, "lxml")

Is it possible to select HTML comments using QueryPath?

Question: I see this is possible using jQuery (see "Selecting HTML Comments with jQuery"), but how can it be done in QueryPath? If it cannot, can anyone suggest an HTML parser that can select comments?

Answer 1: QueryPath comes with an extension called QPXML that has several add-on methods. One of these is comment(). To use it, simply include it in your script:

    include 'QueryPath/QueryPath.php';
    include 'QueryPath/Extensions/QPXML.php';
    htmlqp($html, $selector)->comment();

This will retrieve the first comment attached
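For the second half of the question (a parser that can select comments), one option outside PHP is Python's standard `html.parser`, which reports each comment through its `handle_comment` callback. This is a minimal sketch and not a QueryPath equivalent; `CommentCollector` and `html_comments` are hypothetical names.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the body of every <!-- ... --> comment in document order."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->", whitespace included.
        self.comments.append(data.strip())


def html_comments(html):
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


print(html_comments("<div><!-- first --><p>x</p><!--second--></div>"))
# ['first', 'second']
```

Unlike the selector-based call in Answer 1, this collects all comments in the document rather than those attached to a selected element.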

PHP DOM traverse HTML nodes and childnode

Question: I am using some code to pick out all the <td> tags from an HTML page:

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('td') as $node) {
        $array_data[] = $node->nodeValue;
    }

This stores the data fine in my array. The HTML being looked at is:

    <tr> <td>DATA 1</td> <td><a href="12345">DATA 2</a></td> <td>DATA 3</td> </tr>

$array_data returns:

    Array([0] => DATA 1 [1] => DATA 2 [2] => DATA 3)

My desired output is to get code out of the <a> tag that is
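The question is cut off, so the exact desired output is unknown. Assuming the goal is to keep each cell's text together with the `href` of any link inside it, the traversal can be sketched as follows. This is a Python illustration using the standard library's `html.parser`, not PHP's DOM extension, and `CellParser` and `cells_with_links` are hypothetical names.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect one (text, href) pair per <td>; href is None without a link."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._text = None  # text chunks of the <td> being parsed
        self._href = None  # href of the first <a> inside that <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self.cells.append(("".join(self._text).strip(), self._href))
            self._text = None


def cells_with_links(html):
    parser = CellParser()
    parser.feed(html)
    parser.close()
    return parser.cells


# The row quoted in the question.
ROW = '<tr> <td>DATA 1</td> <td><a href="12345">DATA 2</a></td> <td>DATA 3</td> </tr>'

print(cells_with_links(ROW))
# [('DATA 1', None), ('DATA 2', '12345'), ('DATA 3', None)]
```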

how do i loop a re.search for the next data

Question: I have two sets of data that I crawled from an HTML table using regular expressions.

Data:

    <div class = "info">
    <div class="name"><td>random</td></div>
    <div class="hp"><td>123456</td></div>
    <div class="email"><td>random@mail.com</td></div>
    </div>
    <div class = "info">
    <div class="name"><td>random123</td></div>
    <div class="hp"><td>654321</td></div>
    <div class="email"><td>random123@mail.com</td></div>
    </div>

Regex:

    matchname = re.search('\<div class="name"><td>(.*?)</td>', match3).group(1)
    matchhp = re
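`re.search` returns only the first match, which is why the second record is never reached. One way to loop over every match is `re.finditer`. The sketch below runs on the data quoted in the question; the pattern `FIELD`, the helper `extract_records`, and the rule that a repeated key starts a new record are assumptions made for illustration.

```python
import re

# The data quoted in the question.
DATA = '''<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>
</div>
<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>
</div>'''

# One pattern for all three fields: group 1 is the field, group 2 the value.
FIELD = re.compile(r'<div class="(name|hp|email)"><td>(.*?)</td>')


def extract_records(text):
    """Group consecutive name/hp/email matches into one dict per record."""
    records = []
    current = {}
    for match in FIELD.finditer(text):
        key, value = match.groups()
        if key in current:  # a repeated key means a new record has started
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


for record in extract_records(DATA):
    print(record["name"], record["hp"], record["email"])
# random 123456 random@mail.com
# random123 654321 random123@mail.com
```

For anything less regular than this snippet, an HTML parser is usually more robust than a regex.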