html-parsing | 易学教程

BeautifulSoup HTML table parsing


Question: I am trying to parse information (HTML tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1 Currently I am using BeautifulSoup, and my code looks like this:
    from mechanize import Browser
    from BeautifulSoup import BeautifulSoup
    mech = Browser()
    url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    rows = table.findAll('tr')[3]
    cols = rows.findAll('td')
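The row-and-cell walk the question is attempting can be sketched with Python's standard-library `html.parser` alone. This is a minimal illustration, not the asker's code and not BeautifulSoup's API: the class `TableRowParser`, the helper `parse_table_rows`, and the `SAMPLE_HTML` string are all hypothetical, and the sample markup is an assumption rather than the real page's structure.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return a list of rows; each row is a list of cell texts."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup, standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Road</th><th>Condition</th></tr>
  <tr><td>I-81</td><td>Minor</td></tr>
  <tr><td>I-95</td><td>Clear</td></tr>
</table>
"""

rows = parse_table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Iterating over every row, instead of indexing a single one as `findAll('tr')[3]` does, avoids an `IndexError` when the table has fewer rows than expected.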

Reading from a URL Connection Java


Question: I'm trying to read HTML from a URL connection. In one case, the HTML file I'm trying to read includes 5 line breaks before the actual doctype declaration. In this case the input reader throws an EOF exception.
    URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
    URLConnection getConn = pageUrl.openConnection();
    getConn.connect();
    DataInputStream dis = new DataInputStream(getConn.getInputStream());
    // some read method here
Has anyone run into

How should I parse background images and other images of a webpage with PHP (Simple HTML DOM Parser)?


Question: How should I parse background images and other images of a webpage with PHP (Simple HTML DOM, etc.)?
Case 1: inline CSS
    <div id="id100" style="background:url(/mycar1.jpg)"></div>
Case 2: CSS inside the HTML page
    <div id="id100"></div>
    <style type="text/css"> #id100{ background:url(/mycar1.jpg); } </style>
Case 3: separate CSS file
    <div id="id100" style="background:url(/mycar1.jpg);"></div>
    external.css
    #id100{ background:url(/mycar1.jpg); }
Case 4: image inside an img tag
Solution to case 4 as it appears in
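The cases listed in the question can be illustrated with a short Python sketch (the question itself asks about PHP, so this shows the idea and is not a Simple HTML DOM solution). It assumes that a regular expression over `url(...)` is good enough for simple stylesheets; the names `CSS_URL`, `css_urls`, `ImageUrlParser`, and `SAMPLE_HTML` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches url(...) with optional single or double quotes around the path.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_urls(css_text):
    """Return every path referenced through url(...) in a piece of CSS."""
    return [match.group(2) for match in CSS_URL.finditer(css_text)]


class ImageUrlParser(HTMLParser):
    """Collect image URLs from style attributes, <style> blocks and <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "style":
            self._in_style = True
        # Case 1: inline CSS in a style attribute.
        if attributes.get("style"):
            self.urls.extend(css_urls(attributes["style"]))
        # Case 4: ordinary <img src="...">.
        if tag == "img" and attributes.get("src"):
            self.urls.append(attributes["src"])

    def handle_data(self, data):
        # Case 2: CSS embedded in a <style> element.
        if self._in_style:
            self.urls.extend(css_urls(data))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False


# Hypothetical sample page covering cases 1, 2 and 4.
SAMPLE_HTML = """
<div id="a" style="background:url(/mycar1.jpg)"></div>
<style type="text/css"> #b { background: url("/mycar2.jpg"); } </style>
<img src="/mycar3.jpg">
"""

parser = ImageUrlParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.urls)
```

For case 3, the same `css_urls` helper could be applied to the text of the external stylesheet once it has been downloaded; fetching that file is left out of this sketch.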

How can I simply parse a CSS like (!) file in my Qt application?


Question: I have a document in a format similar to *.css (Cascading Style Sheets), but with its own keywords. It is effectively a personalized CSS (I call it *.pss), with its own tags and properties. Here is an excerpt:
    /* CSS like style sheet file *.pss */
    @include "otherStyleSheet.pss";
    /* comment */
    [propertyID="1230000"] {
        fillColor : #f3f1ed;
        minSize : 5;
        lineWidth : 3;
    }
    /* sphere */
    [propertyID="124???|123000"] {
        lineType : dotted;
    }
    /* square */
    [propertyID="125???"] {
        lineType : thinline;
    }
    /* ring *

Grabbing meta-tags and comments using HTML Agility Pack


Question: I've looked for tutorials on using HTML Agility Pack, as it seems to do everything I want, but for such a powerful tool there is surprisingly little written about it on the Internet. I am writing a simple method that will retrieve any given tag based on its name:
    public string[] GetTagsByName(string TagName, string Source) { ... }
This can easily be done using a regular expression, but we all know that using regex to parse HTML isn't right. So far I have the following code:
    ... //

How to parse a webpage that includes Javascript? [duplicate]


Question: I've got a webpage that creates a table using JavaScript. Right now I'm using jsoup in my Java project to parse the webpage. However, jsoup isn't able to run JavaScript, so the table isn't generated and the source of the webpage is incomplete. How can I include the HTML code created by that script in order to parse its content using jsoup? Can you provide a simple example? Thank you!

How can I use iText to convert HTML with images and hyperlinks to PDF?


Question: I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC and Web Forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements are base64. Typical answers here at SO and Google search results use generic HTML to PDF code with XMLWorkerHelper that looks something like this:
    using (var stringReader = new StringReader(xHtml))
    {
        using (Document document = new Document())
        {
            PdfWriter writer = PdfWriter.GetInstance

PHP parsing invalid html


Question: I'm trying to parse some HTML that is not on my server:
    $dom = new DOMDocument();
    $dom->loadHTMLfile("http://www.some-site.org/page.aspx");
    echo $dom->getElementById('his_id')->item(0);
but PHP returns an error, something like "ID his_id already defined in http://www.some-site.org/page.aspx, line: 33". I think that is because DOMDocument is dealing with invalid HTML. So how can I parse it even though it is invalid?
Answer 1: You should run HTML Tidy on it to clean it up before parsing it.
    $html = file

Web scraping a website with dynamic javascript content


Question: I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
Answer 1: There are basically two main options to proceed with: using browser developer tools, see what ajax requests are going to load the page and simulate them in your script, you will probably need to use json module to load the response json
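The first option in the answer (replaying the ajax request and loading the JSON response) might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `fetch_rows` and `parse_rows`, the `"rows"` key, the request headers, and the sample payload are all hypothetical, and a real endpoint would have to be identified in the browser's developer tools first.

```python
import json
from urllib.request import Request, urlopen


def parse_rows(body, key="rows"):
    """Decode a JSON response body and return the list stored under `key`.

    The "rows" key is an assumption; real endpoints name their fields differently.
    """
    payload = json.loads(body)
    return payload.get(key, [])


def fetch_rows(url, timeout=10):
    """Request the ajax endpoint directly, as the browser's script would."""
    request = Request(
        url,
        headers={
            "Accept": "application/json",
            # Some endpoints check this header; whether yours does is an assumption.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return parse_rows(response.read().decode(charset))


# Hypothetical response body, standing in for what the endpoint would return.
SAMPLE_BODY = '{"rows": [{"name": "first", "value": 1}, {"name": "second", "value": 2}]}'

names = [row["name"] for row in parse_rows(SAMPLE_BODY)]
print(names)
```

`fetch_rows` is defined but not called here, since the endpoint URL depends on the site being scraped.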

How to collect all script tags of HTML page in a variable


Question: I would like to collect all the <script> ... </script> code sections present in the HTML page in a variable. What is the simplest way to do this? Any idea how they can be retrieved using JavaScript? Any help will be greatly appreciated.
Answer 1: To get a list of scripts you can use:
    document.getElementsByTagName("script");  (by tag)
    document.scripts;  (built-in collection)
    document.querySelectorAll("script");  (by selector)
    $("script")  (jQuery, by selector)
    var scripts = document.getElementsByTagName(