html-parsing

Is there an HTML parser for D?

Submitted by 柔情痞子 on 2019-12-23 13:08:55

Question: I'm looking for an HTML parser for the D language (one that supports XPath, if possible). I did some googling, but no luck (it is hard to find results with the "D" keyword; much like searching for "C" and being shown C#). There is nothing on http://www.dsource.org or https://stackoverflow.com/questions/tagged/html-parsing+d either. Note: I do not want to mix C++ and D code; I am open to solutions in C or based on libxml2.

Answer 1: Check out Adam Ruppe's dom.d: https://github.com/adamdruppe/misc-stuff-including-D-programming

List files on HTTP/FTP server in R

Submitted by 爷,独闯天下 on 2019-12-23 12:24:03

Question: I'm trying to get a list of files on an HTTP/FTP server from R, so that in the next step I will be able to download them (or select only the files that meet my criteria). I know it is possible to use an external download-manager program in a web browser that would let me select files to download from the current web page/FTP. However, I want everything scripted, so that it is easier to reproduce. I thought about calling Python from R (since it seems much easier),
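Since the question mentions falling back to Python, here is a minimal stdlib-only sketch of the listing half of the task: collecting the links from an (assumed Apache-style) directory-index page. The index HTML is hardcoded here so the sketch runs without network access; with a real server you would fetch the page first.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in a directory-listing page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hardcoded sample of an Apache-style index page; with a real server you
# would fetch it first, e.g. urllib.request.urlopen(url).read().decode().
sample_index = """
<html><body>
<a href="../">Parent Directory</a>
<a href="data_2019.csv">data_2019.csv</a>
<a href="data_2020.csv">data_2020.csv</a>
<a href="readme.txt">readme.txt</a>
</body></html>
"""

parser = LinkCollector()
parser.feed(sample_index)

# Keep only the files matching some criterion, e.g. CSV files.
csv_files = [f for f in parser.links if f.endswith(".csv")]
print(csv_files)  # ['data_2019.csv', 'data_2020.csv']
```

The filtered list can then be fed to a download loop, or the same idea can be reproduced in R with `readLines()` plus a regular expression.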

Html Agility Pack not loading URL

Submitted by ε祈祈猫儿з on 2019-12-23 11:13:13

Question: I have something like this:

    class MyTask {
        public MyTask(int id) {
            Id = id;
            IsBusy = false;
            Document = new HtmlDocument();
        }
        public HtmlDocument Document { get; set; }
        public int Id { get; set; }
        public bool IsBusy { get; set; }
    }

    class Program {
        public static void Main() {
            var task = new MyTask(1);
            task.Document.LoadHtml("http://urltomysite");
            if (task.Document.DocumentNode.SelectNodes("//span[@class='some-class']").Count == 0) {
                task.IsBusy = false;
                return;
            }
        }
    }

Now when I start my program

Bs4 select_one vs find

Submitted by 偶尔善良 on 2019-12-23 10:15:33

Question: I was wondering what the difference is between performing bs.find('div') and bs.select_one('div'). The same goes for find_all and select. Is there any difference performance-wise, or is either better to use in specific cases?

Answer 1: select() and select_one() give you a different way of navigating an HTML tree, using CSS selectors, which have a rich and convenient syntax. The CSS selector syntax support in BeautifulSoup is limited, but it covers the most common cases.
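As a quick illustration (assuming BeautifulSoup 4 with its CSS-selector support is installed), the two APIs can locate exactly the same elements; the markup here is a made-up example:

```python
from bs4 import BeautifulSoup

html = "<div id='a'><p class='x'>one</p><p class='x'>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find/find_all use BeautifulSoup's own matching arguments...
first_p = soup.find("p", class_="x")
all_p = soup.find_all("p", class_="x")

# ...while select_one/select take a CSS selector string.
first_p_css = soup.select_one("p.x")
all_p_css = soup.select("p.x")

print(first_p.text, first_p_css.text)  # one one
```

For simple tag/class lookups the two are interchangeable; CSS selectors pay off for compound queries like `div#a > p.x:nth-of-type(2)` that would take several chained find calls otherwise.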

How do I get this text using Jsoup?

Submitted by 无人久伴 on 2019-12-23 09:53:48

Question: How do I get "this text" from the following HTML code using Jsoup?

    <h2 class="link title"><a href="myhref.html">this text<img width=10 height=10 src="img.jpg" /><span class="blah"> <span>Other texts</span><span class="sometime">00:00</span></span> </a></h2>

When I try String s = document.select("h2.title").select("a[href]").first().text(); it returns this textOther texts00:00. I tried to read the API for Selector in Jsoup but could not figure out much. Also, how do I get an element of class
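The distinction at play: Jsoup's text() concatenates the text of the element and all its descendants, while ownText() returns only the text held directly by the element itself, which is what "this text" is here. A sketch of the same distinction using Python's BeautifulSoup (assuming it is installed; the markup mirrors the question):

```python
from bs4 import BeautifulSoup

html = ('<h2 class="link title"><a href="myhref.html">this text'
        '<img width="10" height="10" src="img.jpg"/>'
        '<span class="blah"><span>Other texts</span>'
        '<span class="sometime">00:00</span></span></a></h2>')
soup = BeautifulSoup(html, "html.parser")
a = soup.select_one("h2.title a[href]")

# .text gathers all descendant text, like Jsoup's text()
full = a.text

# direct string children only, like Jsoup's ownText()
own = "".join(t for t in a.find_all(string=True, recursive=False)).strip()
print(own)  # this text
```

In Jsoup itself the equivalent one-liner would use .ownText() instead of .text() on the selected element.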

Scrape using Beautiful Soup preserving &nbsp; entities

Submitted by 人盡茶涼 on 2019-12-23 07:48:53

Question: I would like to scrape a table from the web and keep the &nbsp; entities intact so that I can republish it as HTML later. BeautifulSoup seems to be converting these to spaces, though. Example:

    from bs4 import BeautifulSoup
    html = "<html><body><table><tr>"
    html += "<td>&nbsp;hello&nbsp;</td>"
    html += "</tr></table></body></html>"
    soup = BeautifulSoup(html)
    table = soup.find_all('table')[0]
    row = table.find_all('tr')[0]
    cell = row.find_all('td')[0]
    print cell

observed result: <td> hello </td>
required result: <td
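What actually happens is that the parser resolves &nbsp; to the non-breaking-space character U+00A0 in the tree, which prints like a plain space. A minimal sketch (assuming BeautifulSoup 4) that restores the entity on serialization:

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>&nbsp;hello&nbsp;</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# The parse tree stores &nbsp; as the character "\xa0", not as an entity,
# so str(cell) appears to have plain spaces.  Turn it back on output:
restored = str(cell).replace("\xa0", "&nbsp;")
print(restored)  # <td>&nbsp;hello&nbsp;</td>
```

BeautifulSoup's serializers also accept a formatter argument (formatter="html") that converts characters back to named entities where possible, which avoids the manual replace.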

Extract URLs and their names from an HTML file stored on disk and print them respectively - Python

Submitted by 流过昼夜 on 2019-12-23 06:12:52

Question: I am trying to extract and print URLs and their names (the NAME between <a href='url' title='smth'>NAME</a>) from an HTML file saved on disk, without using BeautifulSoup or another library -- just beginner Python code. The desired print format is:

    http://..filepath/filename.pdf
    File's Name

and so on... I was able to extract and print all the URLs, or all the names, on their own, but I fail to append all the names that follow after a while in the code included just before the tag and print them below each
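A minimal sketch using only the standard library's html.parser module (no third-party install), with hypothetical markup standing in for the file on disk:

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Pairs each <a href=...> with the text found between its tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []            # (url, name) tuples
        self._current_href = None  # href of the <a> we are inside, if any
        self._chunks = []          # text pieces collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.pairs.append((self._current_href, "".join(self._chunks).strip()))
            self._current_href = None

# In the question the HTML comes from a file on disk, e.g.:
#   with open("page.html") as f: html = f.read()
html = "<p><a href='a.pdf' title='smth'>First File</a><a href='b.pdf'>Second File</a></p>"
parser = AnchorTextParser()
parser.feed(html)
for url, name in parser.pairs:
    print(url)
    print(name)
```

Collecting the URL and the name in the same pass, as one tuple, sidesteps the problem of trying to line up two separately extracted lists afterwards.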

Convert formatted email (HTML) to plain text?

Submitted by 故事扮演 on 2019-12-23 06:08:19

Question: I have code that implements ParserCallback and converts HTML emails to plain text. The code works fine when I parse an email body like this:

    = "DO NOT REPLY TO THIS EMAIL MESSAGE. <br>---------------------------------------<br>\n"
    + "nix<br>---------------------------------------<br> Esfghjdfkj\n"
    + "</blockquote></div><br><br clear=\"all\"><div><br></div>-- <br><div dir=\"ltr\"><b>Regards <br>Nisj<br>Software Engineer<br></b><div><b>Bingo</b></div></div>\n"
    + "</div>"

but when I parse this
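The snippet above is Java (Swing's HTMLEditorKit.ParserCallback). As a rough sketch of the same strip-tags idea in Python's standard library (the tag list and sample input below are illustrative assumptions, not the asker's code):

```python
from html.parser import HTMLParser

class PlainTextExtractor(HTMLParser):
    """Rough analogue of a ParserCallback: keep text, map block tags to newlines."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tags treated as line breaks; chosen for email-style HTML.
        if tag in ("br", "p", "div", "blockquote"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

extractor = PlainTextExtractor()
extractor.feed("DO NOT REPLY TO THIS EMAIL MESSAGE.<br>nix<br><b>Regards</b>")
print(extractor.text())
```

The callback approach (one handler for tags, one for character data) is the same in both languages; robustness problems usually come from input the parser considers malformed, so it is worth testing with real, messy email bodies.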

Parsing XHTML with XPath using Microsoft.XMLHTTP in VBScript

Submitted by 丶灬走出姿态 on 2019-12-23 05:59:26

Question: I'm looking to parse an XHTML document with Microsoft.XMLHTTP and XPath in VBScript. I have the following XHTML document structure. How would I get an array of the URLs?

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    <title>Local index</title>
    </head>
    <body>
    <table>
    <tr>
    <td>
    <a href="url1.html
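The usual stumbling block with XHTML is the default namespace: every element lives in http://www.w3.org/1999/xhtml, so an unqualified XPath like //a matches nothing. In MSXML this is typically handled by setting the SelectionNamespaces property on the DOM document before calling selectNodes. An analogous namespace-aware sketch using Python's standard library (sample document trimmed and completed from the truncated one above):

```python
import xml.etree.ElementTree as ET

xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><table>
<tr><td><a href="url1.html">one</a></td></tr>
<tr><td><a href="url2.html">two</a></td></tr>
</table></body></html>"""

# Bind a prefix to the XHTML namespace and qualify the query with it;
# a bare ".//a" would find nothing because of the default namespace.
ns = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(xhtml)
urls = [a.get("href") for a in root.findall(".//x:a", ns)]
print(urls)  # ['url1.html', 'url2.html']
```

The VBScript version follows the same shape: load the document, bind a prefix (e.g. xmlns:x='http://www.w3.org/1999/xhtml') via SelectionNamespaces, then query //x:a/@href.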