html-parsing | 易学教程

How to print the empty data present in a table from its HTML code?

Question: I am using Python's HTMLParser module to print the data in a table by parsing the HTML page. I am unable to print the empty field in the table. Here is the code I'm using:

    class MyParser(HTMLParser):
        def __init__(self, data):
            HTMLParser.__init__(self)
            self.feed(data)

        def handle_data(self, data):
            print "result -->", data

    m = MyParser("""<p>105</p><p></p>""")

Output:

    result --> 105

I am able to print the data between the first tag <p>105</p> . I want to print the
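One possible approach, sketched here in Python 3 (`html.parser`) rather than the Python 2 module used in the question: `handle_data` is never called for an element with no text, so the empty field has to be detected from the start and end tag callbacks instead. The class name `EmptyAwareParser` and its `results` attribute are made up for this illustration.

```python
from html.parser import HTMLParser


class EmptyAwareParser(HTMLParser):
    """Records the text of every closed element, including empty ones."""

    def __init__(self):
        super().__init__()
        self.results = []  # one string per closed element, "" if it had no text
        self._stack = []   # text fragments for each currently open element

    def handle_starttag(self, tag, attrs):
        # Open a fresh text buffer for this element.
        self._stack.append([])

    def handle_data(self, data):
        # Only called when there is text; append it to the innermost element.
        if self._stack:
            self._stack[-1].append(data)

    def handle_endtag(self, tag):
        # Closing the element: an empty buffer means the element was empty.
        if self._stack:
            self.results.append("".join(self._stack.pop()))


parser = EmptyAwareParser()
parser.feed("<p>105</p><p></p>")
parser.close()
for text in parser.results:
    print("result -->", repr(text))
# result --> '105'
# result --> ''
```

This assumes reasonably well-nested markup; void elements such as `<br>` never trigger an end tag and would need special handling.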

Problem parsing children of a node with HtmlAgilityPack

Question: I'm having a problem parsing the input tag children of a form in HTML. I can parse them from the root using //input[@type] but not as children of a specific node. Here's some code that illustrates the problem:

    private const string HTML_CONTENT =
        "<html>" +
        "<head>" +
        "<title>Test Page</title>" +
        "<link href='site.css' rel='stylesheet' type='text/css' />" +
        "</head>" +
        "<body>" +
        "<form id='form1' method='post' action='http://www.someplace.com/input'>" +
        "<input type='hidden' name='id' value=

Scrape original links and headlines from Facebook posts

Question: I need to gather some information that is not provided by Facebook Analytics, for example the original URL and headline of an article promoted on Facebook as a link post. This info is buried in the HTML code of a Facebook post, but I am struggling to dig it out. I would appreciate your help. Let's take this example: https://www.facebook.com/bbcnews/posts/10156428513547217 I identified classes for a link (bbc.in...): "_6ks" and headline: 'mbs _6m6 _2cnj _5s6c' The code below doesn't return anything:

How To Match Content Between Tags Without Regex

Question: I have read the following post: "How to match content between HTML specific tags with attribute using grep?". However, when I use the code derived from that page, I'm unable to match the content. I keep getting blank output. Here's the code I'm using:

    grep -oP '(?<=<div class="tag"> ).*?(?= </tag>)' file1.txt

I've ensured that all the line endings are Linux style (LF). Here's file1.txt:

    <div class="tag"> <p>hello world!</p> </tag>

I would want the output: <div class=
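For the "without regex" part of the title, a rough Python sketch using the standard `html.parser` module is shown here. It only collects the text inside the target `<div>`; since the desired output in the question is cut off, that goal is an assumption. The class name `DivTextExtractor` and its attributes are invented for this example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects non-blank text found inside <div class="..."> elements."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0    # greater than 0 while inside the target <div>
        self.chunks = []  # stripped text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested <div>s so the matching close is found correctly.
            if tag == "div":
                self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# Sample copied from the question, including its mismatched </tag> close.
sample = '<div class="tag"> <p>hello world!</p> </tag>'
extractor = DivTextExtractor("tag")
extractor.feed(sample)
extractor.close()
print(extractor.chunks)  # ['hello world!']
```

Note that the sample closes with `</tag>` rather than `</div>`, so the `<div>` is never actually closed; the sketch tolerates that and simply keeps collecting until the input ends.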

htmlDocPtr/gethtml issue

Question: I have this kind of code in an iOS app of mine:

    NSString *docNameString;
    docNameString=@"https://www.mysite.php";
    documentHTML=gethtml((char*)[docNameString UTF8String], (char*)[@"UTF-8" UTF8String]);

It was working until now (that is, the last time I touched the app in Xcode), but it no longer works. I get this message in the Xcode console:

    I/O warning : failed to load external entity "https://www.mysite.php"
    Document not parsed successfully.

I am currently using Xcode Version 10.0.

Android: Fill Form Data and Extract HTML

Question: I have a very simple problem: using Java (on Android), I am trying to go to a certain website (URL given), fill in form fields, log in (i.e. click a button) and then extract the HTML source of the resulting page. I have already tried headless browsers and HTML parsers like HtmlUnit and Selenium, but the jars always conflict and they don't seem to work with Android. How else would I go about doing this? (Also, I need this to happen without the user needing to see a WebView or page, so

Beautiful soup failing to parse this HTML

Question: We're using Beautiful Soup to parse many websites successfully, but a few are giving us problems. An example is this page: http://www.designsponge.com/2013/04/biz-ladies-how-to-use-networking-to-improve-your-search-engine-rankings.html We're feeding the exact source to Beautiful Soup, but it returns a stunted HTML string, with no errors. Code:

    soup = BeautifulSoup(site_html)
    print str(soup.html)

Result:

    <html class="no-js" lang="en">  </html>

I'm trying to determine what's

Html Agility Pack creating irrelevant characters on save html file in c#

Question: I am working on a project using ASP.NET MVC 3 with C#. I want to change some HTML element attributes, such as width and height, from C#. I have a simple (_Layout.cshtml) file:

    <html>
    <head>
    <link href="@Url.Content("file.css")" rel="stylesheet" type="text/css" />
    <body>
    <a href="#" id="link1" title="@Function.ConfigElement("FacebookLink")" ></a>
    </body>
    </head>
    </html>

So I am using Html Agility Pack to load and save this file:

    HtmlDocument doc = new HtmlDocument();
    doc.Load("_Layout.cshtml");
    doc

Parsing content which contains html tags using XMLPullParser

Question: I am building an Android app using XmlPullParser. How can I get the content from HTML formatted like this?

    <div class="content">
    "Some text is here."
    <br>
    "some more text "<a class="link" href="adress">continues here</a>
    <br>
    </div>

I want to parse all the content like this: "Some text is here. some more text continues here". The "continues here" part should also be hyperlinked. ADDITION after some comments: the HTML is first put into Yahoo YQL, and YQL generates XML. I use the generated XML
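The general idea of flattening mixed content (text interleaved with child elements) can be sketched in Python with `xml.etree.ElementTree`; this is not XmlPullParser code, only an illustration of the traversal. The sample is a hand-made, well-formed XML version of the snippet above (an assumption, since the actual YQL output is not shown), and the function name `flatten` is hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return (plain_text, links) for an element with mixed content.

    links is a list of (link_text, href) pairs for each <a> descendant.
    """
    parts, links = [], []

    def walk(node):
        if node.text:
            parts.append(node.text)          # text before the first child
        for child in node:
            if child.tag == "a":
                links.append(("".join(child.itertext()), child.get("href")))
            walk(child)
            if child.tail:
                parts.append(child.tail)     # text following the child

    walk(elem)
    # Collapse runs of whitespace left over from the markup layout.
    return " ".join("".join(parts).split()), links


# Assumed well-formed XML equivalent of the HTML in the question.
xml_source = (
    '<div class="content">Some text is here.<br/> some more text '
    '<a class="link" href="adress">continues here</a><br/></div>'
)
text, links = flatten(ET.fromstring(xml_source))
print(text)   # Some text is here. some more text continues here
print(links)  # [('continues here', 'adress')]
```

With a pull parser the same result would presumably come from appending every text event to one buffer and noting the buffer offsets when an `a` start and end tag are seen, so the link span can be applied afterwards.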

javax.swing.text.ElementIterator weird behavior

Question: I'm getting weird behavior with javax.swing.text.ElementIterator. It never shows all elements, and it shows a different number of elements depending on what type of ParserCallback I use. The test below is done with the website that is in my profile, but can be done with any other big HTML file.

    // some imports shown in case it's an import mixup
    import javax.swing.text.AttributeSet;
    import javax.swing.text.BadLocationException;
    import javax.swing.text.ChangedCharSetException;
    import javax