html-parsing

Parsing HTML with VB DOTNET

家住魔仙堡 submitted on 2020-01-04 06:51:11
Question: I am trying to parse some data from a website to get specific items from their tables. I know that any <tr> tag with the bgcolor attribute set to #ffffff or #f4f4ff is where I want to start, and my actual data sits in the 2nd <td> within that <tr>. Currently I have:

    Private Sub runForm()
        Dim theElementCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("TR")
        For Each curElement As HtmlElement In theElementCollection
            Dim controlValue As String = curElement.GetAttribute("bgcolor")
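
The row filtering described in the question can be sketched outside of VB as well. The following is a minimal Python sketch using only the standard library's `html.parser`; the class name `SecondCellParser` and the helper `second_cells` are hypothetical names chosen for illustration, and the sketch assumes the tables are not nested.

```python
from html.parser import HTMLParser

# Row colours named in the question; compared case-insensitively.
TARGET_COLORS = {"#ffffff", "#f4f4ff"}


class SecondCellParser(HTMLParser):
    """Collect the text of the 2nd <td> in every <tr> with a target bgcolor."""

    def __init__(self):
        super().__init__()
        self.results = []      # one string per matching row
        self._in_row = False   # True while inside a <tr> with a target bgcolor
        self._cell_index = 0   # number of <td> tags seen in the current row
        self._buffer = None    # text pieces of the 2nd cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            bgcolor = (dict(attrs).get("bgcolor") or "").lower()
            self._in_row = bgcolor in TARGET_COLORS
            self._cell_index = 0
        elif tag == "td" and self._in_row:
            self._cell_index += 1
            if self._cell_index == 2:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None
        elif tag == "tr":
            self._in_row = False


def second_cells(html):
    """Return the text of the 2nd cell of each matching row (hypothetical helper)."""
    parser = SecondCellParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling `second_cells(page_source)` returns one string per matching row. Rows with any other bgcolor, or with fewer than two cells, contribute nothing.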

How can I find the contents of the first h3 tag?

二次信任 submitted on 2020-01-04 04:02:51
Question: I am looking for a regex to find the contents of the first <h3> tag. What can I use there?

Answer 1: You should use PHP's DOM parser instead of regular expressions. You're looking for something like this (untested code warning):

    $domd = new DOMDocument();
    libxml_use_internal_errors(true);
    $domd->loadHTML($html_content);
    libxml_use_internal_errors(false);
    $domx = new DOMXPath($domd);
    $items = $domx->query("//h3[position() = 1]");
    echo $items->item(0)->textContent;

Answer 2: Well, a simple solution would
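
The parser-over-regex advice in the first answer carries over to other languages. As a rough illustration, here is a minimal Python sketch built on the standard library's `html.parser`; `FirstH3Parser` and `first_h3_text` are hypothetical names. It returns the text content of the first <h3> and drops any nested markup.

```python
from html.parser import HTMLParser


class FirstH3Parser(HTMLParser):
    """Capture the text content of the first <h3> element only."""

    def __init__(self):
        super().__init__()
        self.text = None    # contents of the first <h3>, once it has closed
        self._parts = None  # text pieces collected while inside that <h3>

    def handle_starttag(self, tag, attrs):
        # Start capturing only if no <h3> has been seen yet.
        if tag == "h3" and self.text is None and self._parts is None:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._parts is not None:
            self.text = "".join(self._parts).strip()
            self._parts = None


def first_h3_text(html):
    """Return the first <h3>'s text, or None if there is no <h3> (hypothetical helper)."""
    parser = FirstH3Parser()
    parser.feed(html)
    parser.close()
    return parser.text
```

Unlike a regular expression, this approach does not depend on attribute order, letter case of the tag name, or line breaks inside the heading.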

How to ignore empty lines while using .next_sibling in BeautifulSoup4 in python

本小妞迷上赌 submitted on 2020-01-03 11:28:19
Question: As I want to remove duplicated placeholders in an HTML website, I use the .next_sibling operator of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore them (have a look at data2). This is the code:

    from bs4 import BeautifulSoup, Tag
    data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
    data2 = """<p>method-removed-here</p> <p>method

What regex can I use to extract URLs from a Google search?

拟墨画扇 submitted on 2020-01-03 05:27:15
Question: I'm using Delphi with the JCLRegEx and want to capture all the result URLs from a Google search. I looked at HackingSearch.com and they have an example RegEx that looks right, but I cannot get any results when I try it. I'm using it similar to:

    Var
      re:JVCLRegEx;
      I:Integer;
    Begin
      re := TJclRegEx.Create;
      With re do
        try
          Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);
          If match(memo1.lines.text) then

Extract HTML Table ( span ) tags using Jsoup in Java

守給你的承諾、 submitted on 2020-01-03 05:24:09
Question: I am trying to extract the td name and the span class. In the sample code, I want to extract the a href within the first td ("accessory") and the span tag in the second td. I want to print:

    Mouse, is-present, yes
    KeyBoard, No
    Dual-Monitor, is-present, Yes

When I use the Java code below, I get: Mouse Yes Keyboard No Dual-Monitor Yes. How do I get the span class name?

HTML code:

    <tr>
      <td class="" width="1%" style="padding:0px;"> </td>
      <td class=""> <a href="/accessory">Mouse</a> </td>
      <td class=

Scraping: cannot access information from web

旧街凉风 submitted on 2020-01-03 02:52:31
Question: I am scraping some information from this URL: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab

Everything was fine until I tried to scrape the description. I have tried repeatedly, but so far I have failed. It seems like I can't reach that information. Here is my code:

    html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
    tree = BeautifulSoup(html, "lxml")

Is it possible to select HTML comments using QueryPath?

依然范特西╮ submitted on 2020-01-03 01:58:20
Question: I see this is possible using jQuery ("Selecting HTML Comments with jQuery"), but how can it be done in QueryPath? If not, can anyone suggest an HTML parser that can select comments?

Answer 1: QueryPath comes with an extension called QPXML that has several add-on methods. One of these is comment(). To use it, simply include it in your script:

    include 'QueryPath/QueryPath.php';
    include 'QueryPath/Extensions/QPXML.php';
    htmlqp($html, $selector)->comment();

This will retrieve the first comment attached
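
On the question of which parsers expose comments at all: as one illustration outside PHP, Python's standard `html.parser` reports every comment through a `handle_comment` callback. The sketch below is a minimal example of that callback, not a QueryPath equivalent; `CommentCollector` and `html_comments` are hypothetical names.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Gather the body of every HTML comment in document order."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the text between "<!--" and "-->".
        self.comments.append(data.strip())


def html_comments(html):
    """Return all comment bodies found in the markup (hypothetical helper)."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments
```

This collects comments from the whole document. Restricting the result to comments under a particular element would need extra state tracking, which is omitted here.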

PHP DOM traverse HTML nodes and childnode

泪湿孤枕 submitted on 2020-01-03 01:40:11
Question: I am using some code to pick out all the <td> tags from an HTML page:

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('td') as $node) {
        $array_data[] = $node->nodeValue;
    }

This stores the data fine in my array. The HTML data being looked at is:

    <tr>
      <td>DATA 1</td>
      <td><a href="12345">DATA 2</a></td>
      <td>DATA 3</td>
    </tr>

The $array_data returns:

    Array([0] => DATA 1 [1] => DATA 2 [2] => DATA 3)

My desired output is to get code out of the <a> tag that is
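
The excerpt is cut off before the desired output is stated, so the following is only a sketch under the assumption that the wanted value is the `href` of an <a> nested inside a <td>. It uses Python's standard `html.parser` rather than PHP's DOM extension; `CellLinkParser` and `cells_with_links` are hypothetical names. If a cell holds several links, only the last `href` is kept.

```python
from html.parser import HTMLParser


class CellLinkParser(HTMLParser):
    """Record each <td>'s text together with the href of an <a> inside it."""

    def __init__(self):
        super().__init__()
        self.cells = []     # list of (text, href or None) tuples
        self._text = None   # text pieces of the current cell, or None
        self._href = None   # href seen inside the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self.cells.append(("".join(self._text).strip(), self._href))
            self._text = None


def cells_with_links(html):
    """Return (text, href) pairs for every <td> (hypothetical helper)."""
    parser = CellLinkParser()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Keeping the text and the attribute in one tuple avoids a second pass over the document when both values are needed.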

How do I loop a re.search for the next data?

让人想犯罪 __ submitted on 2020-01-02 23:14:46
Question: I have two sets of data that I crawled from an HTML table using a regular expression.

data:

    <div class = "info">
      <div class="name"><td>random</td></div>
      <div class="hp"><td>123456</td></div>
      <div class="email"><td>random@mail.com</td></div>
    </div>
    <div class = "info">
      <div class="name"><td>random123</td></div>
      <div class="hp"><td>654321</td></div>
      <div class="email"><td>random123@mail.com</td></div>
    </div>

regex:

    matchname = re.search('\<div class="name"><td>(.*?)</td>' , match3).group(1)
    matchhp = re
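
`re.search` stops at the first match, which is why only the first record comes back. One possible way to visit every record is `re.finditer`, sketched below under the assumption that each record keeps the name, hp, email order shown in the data. The pattern name `RECORD` and the helper `extract_records` are hypothetical.

```python
import re

# One pattern per record: name, hp and email in the order shown in the data.
RECORD = re.compile(
    r'<div class="name"><td>(?P<name>.*?)</td></div>\s*'
    r'<div class="hp"><td>(?P<hp>.*?)</td></div>\s*'
    r'<div class="email"><td>(?P<email>.*?)</td></div>',
    re.DOTALL,
)


def extract_records(html):
    """Return one dict per record found in the markup (hypothetical helper)."""
    # finditer yields every non-overlapping match, not just the first one.
    return [match.groupdict() for match in RECORD.finditer(html)]
```

Each returned dict has the keys `name`, `hp` and `email`. A pattern like this is tied to the exact markup, so an HTML parser is the more robust choice if the page layout can vary.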