html-parsing | 易学教程

How to collect all script tags of HTML page in a variable

Question: I would like to collect all the <script>...</script> sections present in an HTML page in a variable. What is the simplest way to do this with JavaScript? Any help will be greatly appreciated. Answer 1: To get a list of scripts you can use document.getElementsByTagName("script") (by tag), document.scripts (built-in collection), document.querySelectorAll("script") (by selector), or $("script") (jQuery, by selector). var scripts = document.getElementsByTagName(
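
For readers who need the same result outside the browser, here is a minimal sketch in Python rather than JavaScript, using only the standard library's html.parser. It is a swapped-in technique, not the method from the answer above. The class name ScriptCollector and the sample page are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the attributes and inline source of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []     # one dict per <script> element, in document order
        self._current = None  # the <script> being read, or None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._current = {"attrs": dict(attrs), "source": ""}

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current is not None:
            self._current["source"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.scripts.append(self._current)
            self._current = None


# Hypothetical sample page, used only for illustration.
SAMPLE_PAGE = """
<html><head>
<script src="app.js"></script>
<script>var greeting = "hello";</script>
</head><body><p>text</p></body></html>
"""

collector = ScriptCollector()
collector.feed(SAMPLE_PAGE)
collector.close()
scripts = collector.scripts  # all <script> elements collected in one variable
```

Each entry keeps the tag's attributes and its inline source separately, so external scripts (with a src attribute) and inline scripts can be told apart.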

How to change tag name with BeautifulSoup?

Question: I am using Python + BeautifulSoup to parse an HTML document. Now I need to replace all <h2 class="someclass"> elements in the document with <h1 class="someclass">. How can I change the tag name without changing anything else in the document? Answer 1: I don't know how you're accessing the tag, but the following works for me: import BeautifulSoup if __name__ == "__main__": data = """ <html> <h2 class='someclass'>some title</h2> <ul> <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<
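
The answer's code is cut off above, so the following is only a sketch of one way to do the rename, not a reconstruction of that answer. It assumes the newer bs4 package (the excerpt imports the older BeautifulSoup module) and relies on the fact that assigning to a tag's name attribute renames the element in place. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><h2 class='someclass'>some title</h2><h2 class='other'>keep</h2></html>"

soup = BeautifulSoup(html, "html.parser")

# Rename only the <h2> elements carrying class "someclass";
# attributes and children are left untouched.
for tag in soup.find_all("h2", class_="someclass"):
    tag.name = "h1"

result = str(soup)
```

The second <h2> has a different class, so it is not matched and keeps its tag name.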

HTML parsing in perl

Question: I'm trying to parse the following HTML structure in Perl. I need to select all of the dd elements that have the class message and also an id. All I want the script to do is loop through the dd elements and print the id of each one, ignoring the first dd element, which is static and will not change. Any Perl module is fine as long as it can be installed from CPAN. I don't have much experience with Perl and parsing html
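
The question asks for Perl and its HTML is not shown in this excerpt, so the sketch below only illustrates the selection logic, in Python with the standard library. The class name MessageIdCollector and the sample markup are hypothetical. It also assumes the static first dd element matches the same class-and-id filter, so dropping the first match is the same as skipping the first dd.

```python
from html.parser import HTMLParser


class MessageIdCollector(HTMLParser):
    """Records the id of every <dd> that has class "message" and an id."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "dd":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attributes.get("class") or "").split()
        if "message" in classes and attributes.get("id"):
            self.ids.append(attributes["id"])


# Hypothetical markup standing in for the structure the question refers to.
SAMPLE = """
<dl>
  <dd class="message" id="static">pinned</dd>
  <dd class="message" id="msg-1">first</dd>
  <dd class="other" id="skip-me">not a message</dd>
  <dd class="message">no id</dd>
  <dd class="message unread" id="msg-2">second</dd>
</dl>
"""

collector = MessageIdCollector()
collector.feed(SAMPLE)
collector.close()

# Assumption: the first match is the static element, so skip it.
message_ids = collector.ids[1:]
for message_id in message_ids:
    print(message_id)
```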

Difference between “findAll” and “find_all” in BeautifulSoup

Question: I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different: import urllib, urllib2, cookielib from BeautifulSoup import * site = "http://share.dmhy.org/topics/list?keyword=TARI+TARI+team_id%3A407" rqstr = urllib2.Request(site) rq = urllib2.urlopen(rqstr) fchData = rq.read() soup = BeautifulSoup(fchData) t = soup.findAll('tr') Can anyone tell
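
As a hedged illustration: in the newer bs4 package, findAll is kept as a legacy alias of find_all, so both calls should return the same elements (recent releases may emit a deprecation warning for the old name). The excerpt imports the older BeautifulSoup module, where, as far as I know, only findAll is defined, which would explain the difference the asker saw. The sample markup below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table, used only for illustration.
SAMPLE = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"

soup = BeautifulSoup(SAMPLE, "html.parser")

rows_new = soup.find_all("tr")  # current method name in bs4
rows_old = soup.findAll("tr")   # legacy alias kept for older code

# Both result sets contain the same <tr> elements in the same order.
same_rows = rows_new == rows_old
```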

Fatal error: Out of memory [closed]

Question (closed 8 years ago as unclear): error: Fatal error: Out of memory (allocated 123469824) (tried to allocate 71 bytes) in /home/test/tset/tset.net/public_html/test/simple_html_dom.php on line 1236 Yes, I know there have been many questions about this issue. But I

Get Inner HTML - PHP

Question: I have the following code: $data = file_get_contents('http://www.robotevents.com/robot-competitions/vex-robotics-competition?limit=all'); echo "Downloaded"; $dom = new domDocument; @$dom->loadHTML($data); $dom->preserveWhiteSpace = false; $tables = $dom->getElementsByTagName('table'); $rows = $tables->item(2)->getElementsByTagName('tr'); foreach ($rows as $row) { $cols = $row->getElementsByTagName('td'); for ($i = 0; $i < $cols->length; $i++) { echo $cols->item($i)->nodeValue . "\n"; } } The

JSoup Login and Cookie

Question: I'm trying to log in to a site using JSoup, but I'm having trouble getting a good cookie back. I'm not sure whether the URL or the login data is incorrect. Any help would be much appreciated. The login page is here. I'm currently trying with the following code: Connection.Response res = Jsoup.connect("https://go.sfu.ca/psp/goprd/?cmd=login&languageCd=ENG") .data("user", "myUserID", "pwd", "myPassword") .method(Connection.Method.POST) .execute(); I do not get the same amount of cookie information if I

How to Retrieve data from the following HTML document structure in R

Question: I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck on what to do after parsing, i.e. how to retrieve the specific nodes where the data is stored. <thead> <tr> <th></th> <th data-field="position"><a>Rank</a></th> <th data-field="name"><a>Brand</a></th> <th data-field="brandValue"><a>Brand Value</a></th> <th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th> <th data-field="revenue"><a>Brand Revenue</a></th> <th data-field="advertising">

Extracting href from a class within other div/id classes with jsoup

Question: Hello, I am trying to extract the first href from within the "title" class from the following source (the source shown is only part of the page; however, I am using the entire page): <div id="atfResults" class="list results "> <div id="result_0" class="result firstRow product" name="0006754023"> <div id="srNum_0" class="number">1.</div> <div class="image"> <a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1"> <img src="http:/