html-parsing | 易学教程

BeautifulSoup html missing

Question: I'm trying to get the URL of the link that downloads historical data from Yahoo Finance for an asset over a specific timeframe: January 1, 1999 to the present day. For example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d I would want to acquire this (from the "Download Data" link above the table of data): "https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200
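The relationship between the two URLs quoted in the question can be sketched in Python. This is only an illustration under stated assumptions: the download URL is cut off after `period2` in the excerpt, so the sketch rebuilds just the visible part, and any further query parameters the real link carries are unknown. The function name `download_url_prefix` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# History page URL exactly as quoted in the question.
HISTORY_URL = (
    "https://finance.yahoo.com/quote/XLB/history"
    "?period1=915177600&period2=1498633200"
    "&interval=1d&filter=history&frequency=1d"
)

def download_url_prefix(history_url):
    """Rebuild the visible part of the download URL (hypothetical helper)."""
    parts = urlsplit(history_url)
    # The path has the form /quote/<SYMBOL>/history.
    symbol = parts.path.strip("/").split("/")[1]
    query = parse_qs(parts.query)
    # Carry over only the two Unix-timestamp parameters shown in the excerpt.
    params = {"period1": query["period1"][0], "period2": query["period2"][0]}
    return (
        "https://query1.finance.yahoo.com/v7/finance/download/"
        + symbol + "?" + urlencode(params)
    )

result = download_url_prefix(HISTORY_URL)
print(result)
```

The sketch only reuses the symbol and the two timestamps from the history URL; it makes no request and does not claim the resulting address is sufficient to download the data.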

Replace characters in an HTML document that match a regex, except those inside tags

Question: I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How do you do this with a regex using Perl or sed? Example: replace all "a" with "b", but not if the "a" is inside an HTML tag such as <a href="aaa">.

Answer 1: As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if you do for whatever reason want to use a regex, the following will work: a(?![^<]*>) Working example on RegExr and the same for input. And in Perl: $var
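To show what the answer's pattern does, here is a minimal sketch in Python rather than the Perl or sed the question asks about (Python's `re` module supports the same negative lookahead). The sample string is an assumption made up for illustration. The lookahead rejects any "a" that is followed by a ">" with no "<" in between, which is how the pattern guesses that the character sits inside a tag.

```python
import re

# Pattern quoted in the answer: an "a" that is NOT followed by
# a run of non-"<" characters ending in ">".
OUTSIDE_TAGS = re.compile(r'a(?![^<]*>)')

# Hypothetical sample input with "a" both inside and outside tags.
sample = '<a href="aaa">banana</a>'

replaced = OUTSIDE_TAGS.sub('b', sample)
print(replaced)  # <a href="aaa">bbnbnb</a>
```

The heuristic can misfire when the text itself contains a literal ">" (for example "a > b"), which is one reason the answer recommends a real HTML parser.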

How to turn off automatic generation of close tags </tagName> in Jsoup?

Question: I was trying to parse an HTML document when I encountered the following scenario. I have put the content into a string in the code below. In it, there is a P tag inside an anchor tag. When parsed with Jsoup, an extra </a> tag and <a> tag are added in between, near #Item1, changing the HTML structure. public class Test{ public static void main(String[] args) { String html="<A HREF=\"#Item1\">\n" + "\n" + "<FONT SIZE=2

extracting paragraph in python using lxml

Question: I would like to extract paragraphs from HTML using Python. I used the lxml module, but it doesn't do exactly what I am looking for. print html.parse(url).xpath('//p')[1].text_content() prints: "Here is the First Paragraph.Here is the second Paragraph.Paragraph Three." I should add that different pages have different numbers of paragraphs, so I would like to make a list and put each paragraph into it
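The goal stated in the question, one list entry per paragraph, can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library's `html.parser` as a stand-in for lxml; with lxml the same idea would be a list comprehension over the elements returned by `xpath('//p')`, calling `text_content()` on each. The sample markup and the class name `ParagraphCollector` are assumptions, since the excerpt shows only the printed text.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element into a list (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per <p> element
        self._in_p = False     # True while inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

# Assumed markup that would produce the text quoted in the question.
sample = (
    "<html><body>"
    "<p>Here is the First Paragraph.</p>"
    "<p>Here is the second Paragraph.</p>"
    "<p>Paragraph Three.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample)
paragraphs = collector.paragraphs
print(paragraphs)
```

Because the list grows with each `<p>` encountered, the same code handles pages with any number of paragraphs.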

Extracting HTML table into R

Question: I've been trying to extract a table from a webpage. The data is flight track data from a live flight tracking website (https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog). I've tried the XML, RCurl and Curl packages, but they didn't work, most likely because I couldn't figure out how to get around SSL or how to skip the columns that contain notes on the flight status (i.e., the first two from the top and the third from the bottom of the table). Does anyone know how

Library to generate .NET XmlDocument from HTML tag soup

Question: I'm looking for a .NET library that can generate a clean XML tree, ideally a System.Xml.XmlDocument, from invalid HTML code. That is, it should make the kind of best-effort guesses, repairs, and substitutions that browsers make when confronted with this situation, and generate a pretend XmlDocument. The library should also be well-maintained. :) I realize this is a lot (too much?) to ask, and I would appreciate any useful leads. There seem to be a fair number of implementations of this for Java, but I

Which wiki markup parser does Wikipedia use?

Question: None of these parsers are used by Wikipedia; none of them handle the wiki code correctly. Does anyone know what parser Wikipedia uses?

Answer 1: Wikipedia uses MediaWiki, which has its own parser.

Answer 2: Wikipedia runs on the MediaWiki engine, which was originally written specifically for Wikipedia. It implements its own parser. A more thorough description of the parser is available in the manual.

Source: https://stackoverflow.com/questions/5956883/which-wiki-markup-parser-does-wikipedia-use

Parsing input element using JSoup

Question: JSoup is used to parse the following HTML: <input type="checkbox" id="id12" name="renewalCheckboxGroup" value="check1" class="wicket-id11" /> Here is the JSoup code: Document document = Jsoup.parse("<input type=\"checkbox\" id=\"id12\" name=\"renewalCheckboxGroup\" value=\"check1\" class=\"wicket-id11\" />"); System.out.println(document.id()); The expected result is id12; however, the returned id is an empty string. I also tried calling the attribute("id") function, but still in vain.