html-parsing

BeautifulSoup html missing

我只是一个虾纸丫 submitted on 2019-12-23 05:43:11
Question: I'm trying to get the URL for the link to download historical data from Yahoo Finance for an asset during a specific timeframe: January 1, 1999 to the present day. So for example if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d I would want to acquire this (from the "Download Data" link above the table of data): "https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200
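The timeframe in the URLs above is carried by the period1 and period2 query parameters. The sketch below assumes those values are Unix timestamps in seconds (the question does not say so explicitly) and decodes them with the standard library only. The names HISTORY_URL and decode_periods are hypothetical and chosen for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# History page URL quoted in the question.
HISTORY_URL = (
    "https://finance.yahoo.com/quote/XLB/history"
    "?period1=915177600&period2=1498633200"
    "&interval=1d&filter=history&frequency=1d"
)


def decode_periods(url):
    """Return (start, end) as timezone-aware UTC datetimes.

    Assumption: period1 and period2 are Unix timestamps in seconds.
    """
    query = parse_qs(urlsplit(url).query)
    start = datetime.fromtimestamp(int(query["period1"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(query["period2"][0]), tz=timezone.utc)
    return start, end


start, end = decode_periods(HISTORY_URL)
print(start.isoformat())
print(end.isoformat())
```

Under that assumption, the two values decode to the start of January 1, 1999 and a day in late June 2017 in a US Pacific time zone, which matches the "January 1, 1999 to present day" range described in the question.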

Replace characters in an HTML document that match a regex, except those inside tags

两盒软妹~` submitted on 2019-12-23 04:25:29
Question: I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How do you do this with a regex using Perl or sed? Example: replace all "a" with "b", but not if the "a" is in an HTML tag like <a href="aaa">.
Answer 1: As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if for whatever reason you do want to use a regex, the following will work: a(?![^<]*>) Working example on RegExr and the same for input. And in Perl : $var
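The lookahead pattern from the answer can be tried out in Python as well. This is a minimal sketch, not the answer's own Perl code (which is cut off above); the function name replace_outside_tags is hypothetical. The pattern skips any "a" that is followed by a ">" with no "<" in between, which is how it treats a character as being inside a tag.

```python
import re

# Pattern quoted in the answer: match "a" unless a ">" follows
# before any "<", i.e. unless the "a" sits inside a tag.
OUTSIDE_TAGS = re.compile(r"a(?![^<]*>)")


def replace_outside_tags(markup, replacement="b"):
    """Replace every "a" outside HTML tags with `replacement`."""
    return OUTSIDE_TAGS.sub(replacement, markup)


sample = '<a href="aaa">banana</a>'
result = replace_outside_tags(sample)
print(result)  # <a href="aaa">bbnbnb</a>
```

The heuristic relies on ">" appearing only as a tag delimiter. Text that contains a literal ">" would likely be misclassified, which is one reason the answer recommends an HTML parser first.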

How to turn off automatic generation of close tags </tagName> in Jsoup?

筅森魡賤 submitted on 2019-12-23 04:23:45
Question: I was trying to parse an HTML document when I encountered the following scenario. I have put the content in the form of a string in the following code. In it there is a P tag inside an anchor tag. When parsed with Jsoup, it adds an extra </a> tag and <a> tags in between near #item1, changing the HTML structure. public class Test{ public static void main(String[] args) { String html="<A HREF=\"#Item1\">\n" + "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n" + "<FONT SIZE=2

extracting paragraph in python using lxml

岁酱吖の submitted on 2019-12-23 02:45:11
Question: I would like to extract paragraphs from HTML with Python. I used the lxml module, but it doesn't do exactly what I am looking for. print html.parse(url).xpath('//p')[1].text_content() <span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p> I should add that different pages have different numbers of paragraphs, so I would like to make a list and put the paragraphs into it
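One way to collect every paragraph into a list, however many a page contains, is sketched below. It deliberately swaps lxml for the standard library's html.parser so that it runs without third-party packages; the class ParagraphCollector and the function extract_paragraphs are hypothetical names, and the sample markup is the snippet from the question.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element into a list."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


def extract_paragraphs(markup):
    """Return a list with the text of every <p> found in `markup`."""
    collector = ParagraphCollector()
    collector.feed(markup)
    collector.close()
    return collector.paragraphs


SAMPLE = (
    '<span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
    '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p>'
    '<span id="midArticle_3"></span><p>Paragraph Three."</p>'
)

paragraphs = extract_paragraphs(SAMPLE)
print(len(paragraphs))
print(paragraphs)
```

With lxml itself, the analogous approach would presumably be a list comprehension over the result of xpath('//p') that calls text_content() on each element, instead of indexing a single element with [1].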

Extracting HTML table into R

陌路散爱 submitted on 2019-12-23 01:14:06
Question: I've been trying to extract a table from a webpage. The data is flight track data from a live flight tracking website (https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog). I've tried the XML, RCurl and Curl packages, but it didn't work. I believe this is most likely because I couldn't figure out how to get around the SSL, as well as the rows that contain notes on the flight status (i.e., the first two from the top and the third from the bottom of the table). Does anyone know how

Library to generate .NET XmlDocument from HTML tag soup

无人久伴 submitted on 2019-12-22 19:54:05
Question: I'm looking for a .NET library that can generate a clean XML tree, ideally System.Xml.XmlDocument, from invalid HTML code. I.e., it should make the kind of best-effort guesses, repairs, and substitutions browsers do when confronted with this situation, and generate a pretend XmlDocument. The library should also be well-maintained. :) I realize this is a lot (too much?) to ask, and I would appreciate any useful leads. There seem to be a fair number of implementations of this for Java, but I

Which wiki markup parser does Wikipedia use?

ε祈祈猫儿з submitted on 2019-12-22 18:35:01
Question: None of these parsers are used by Wikipedia; none of them handle the wiki code correctly. Does anyone know what parser Wikipedia uses?
Answer 1: Wikipedia uses MediaWiki, which has its own parser.
Answer 2: Wikipedia runs on the MediaWiki engine, originally written precisely for use on Wikipedia. They implement their own parser. A more thorough description of the parser is available in the manual.
Source: https://stackoverflow.com/questions/5956883/which-wiki-markup-parser-does-wikipedia-use

Parsing input element using JSoup

狂风中的少年 submitted on 2019-12-22 10:53:42
Question: JSoup is used to parse the following HTML: <input type="checkbox" id="id12" name="renewalCheckboxGroup" value="check1" class="wicket-id11" /> Here is the JSoup code: Document document = Jsoup.parse("<input type=\"checkbox\" id=\"id12\" name=\"renewalCheckboxGroup\" value=\"check1\" class=\"wicket-id11\" />"); System.out.println(document.id()); The expected result is id12; however, the returned id is an empty string. I also tried calling the attribute("id") function, but that did not work either.