html-parsing

How to collect all script tags of HTML page in a variable

佐手、 submitted on 2019-12-28 04:21:07
Question: I would like to collect all the <script>...</script> code sections present in an HTML page in a variable. What is the simplest way to do this, and how can they be retrieved using JavaScript? Any help will be greatly appreciated. Answer 1: To get a list of scripts you can use: document.getElementsByTagName("script") (by tag); document.scripts (built-in collection); document.querySelectorAll("script") (by selector); $("script") (jQuery, by selector). var scripts = document.getElementsByTagName(
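Outside the browser, the same collection can be sketched with Python's standard-library html.parser instead of the DOM calls listed in the answer. This is a swapped-in technique, not the answerer's method; the class name ScriptCollector, the helper collect_scripts, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the attributes and inline source of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []      # one dict per <script> element, in document order
        self._current = None   # the script currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._current = {"attrs": dict(attrs), "code": ""}

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current is not None:
            self._current["code"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.scripts.append(self._current)
            self._current = None


def collect_scripts(html):
    """Return a list of {"attrs": ..., "code": ...} dicts, one per script."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Hypothetical sample page: one external script and one inline script.
sample = """<html><head><script src="app.js"></script></head>
<body><script>var x = 1;</script></body></html>"""
scripts = collect_scripts(sample)
```

Here scripts plays the role of the variable the question asks for: external scripts carry their src in "attrs", and inline scripts carry their source text in "code".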

How to change tag name with BeautifulSoup?

柔情痞子 submitted on 2019-12-28 04:18:27
Question: I am using Python + BeautifulSoup to parse an HTML document. Now I need to replace all <h2 class="someclass"> elements in the document with <h1 class="someclass">. How can I change the tag name without changing anything else in the document? Answer 1: I don't know how you're accessing the tag, but the following works for me: import BeautifulSoup if __name__ == "__main__": data = """ <html> <h2 class='someclass'>some title</h2> <ul> <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<
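A minimal sketch of the rename, assuming the newer bs4 package (Beautiful Soup 4) rather than the older BeautifulSoup module imported in the excerpt. Assigning to a tag's name attribute changes the element name and leaves its attributes and children in place. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: only the first heading has the class we want to rename.
html = """<html>
<h2 class="someclass">some title</h2>
<h2 class="other">left alone</h2>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Rename every <h2 class="someclass"> to <h1>, keeping attributes and content.
for tag in soup.find_all("h2", class_="someclass"):
    tag.name = "h1"

result = str(soup)
```

After the loop, result contains <h1 class="someclass">some title</h1>, while the h2 with a different class is unchanged.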

HTML parsing in Perl

∥☆過路亽.° submitted on 2019-12-28 02:04:56
Question: I'm trying to parse the following HTML structure in Perl. I need to select all of the dd elements that have the class message and also an id. All I want the script to do is loop through the dd elements and print the id of each one, except the first dd element, which is static and will not change. Any Perl module is fine as long as it can be installed from CPAN, to keep things easy for me. I don't have much experience with Perl and parsing html

Difference between “findAll” and “find_all” in BeautifulSoup

两盒软妹~` submitted on 2019-12-27 12:07:53
Question: I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different: import urllib, urllib2, cookielib from BeautifulSoup import * site = "http://share.dmhy.org/topics/list?keyword=TARI+TARI+team_id%3A407" rqstr = urllib2.Request(site) rq = urllib2.urlopen(rqstr) fchData = rq.read() soup = BeautifulSoup(fchData) t = soup.findAll('tr') Can anyone tell

Difference between “findAll” and “find_all” in BeautifulSoup

北城余情 submitted on 2019-12-27 12:06:58
Question: I would like to parse an HTML file with Python, and the module I am using is BeautifulSoup. It is said that the function find_all is the same as findAll. I've tried both of them, but I believe they are different: import urllib, urllib2, cookielib from BeautifulSoup import * site = "http://share.dmhy.org/topics/list?keyword=TARI+TARI+team_id%3A407" rqstr = urllib2.Request(site) rq = urllib2.urlopen(rqstr) fchData = rq.read() soup = BeautifulSoup(fchData) t = soup.findAll('tr') Can anyone tell

Fatal error: Out of memory [closed]

ε祈祈猫儿з submitted on 2019-12-25 18:00:34
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 8 years ago. Error: Fatal error: Out of memory (allocated 123469824) (tried to allocate 71 bytes) in /home/test/tset/tset.net/public_html/test/simple_html_dom.php on line 1236 Yes, I know that many of these issues were. But I

Get Inner HTML - PHP

房东的猫 submitted on 2019-12-25 17:47:13
Question: I have the following code: $data = file_get_contents('http://www.robotevents.com/robot-competitions/vex-robotics-competition?limit=all'); echo "Downloaded"; $dom = new DOMDocument; @$dom->loadHTML($data); $dom->preserveWhiteSpace = false; $tables = $dom->getElementsByTagName('table'); $rows = $tables->item(2)->getElementsByTagName('tr'); foreach ($rows as $row) { $cols = $row->getElementsByTagName('td'); for ($i = 0; $i < $cols->length; $i++) { echo $cols->item($i)->nodeValue . "\n"; } } The

JSoup Login and Cookie

独自空忆成欢 submitted on 2019-12-25 16:39:48
Question: I'm trying to log in to a site using JSoup, but I'm having trouble getting a good cookie back. I'm not sure whether the URL or the login data is incorrect. Any help would be much appreciated. The login page is here. I'm currently trying the following code: Connection.Response res = Jsoup.connect("https://go.sfu.ca/psp/goprd/?cmd=login&languageCd=ENG") .data("user", "myUserID", "pwd", "myPassword") .method(Connection.Method.POST) .execute(); I do not get the same amount of cookie information if I

How to Retrieve data from the following HTML document structure in R

女生的网名这么多〃 submitted on 2019-12-25 08:59:06
Question: I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck on what to do after parsing, i.e. how to retrieve the specific nodes where the data is stored. <thead> <tr> <th></th> <th data-field="position"><a>Rank</a></th> <th data-field="name"><a>Brand</a></th> <th data-field="brandValue"><a>Brand Value</a></th> <th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th> <th data-field="revenue"><a>Brand Revenue</a></th> <th data-field="advertising">

Extracting href from a class within other div/id classes with jsoup

主宰稳场 submitted on 2019-12-25 08:40:02
Question: Hello, I am trying to extract the first href from within the "title" class in the following source (the source shown is only part of the page, but I am using the entire page): <div id="atfResults" class="list results "> <div id="result_0" class="result firstRow product" name="0006754023"> <div id="srNum_0" class="number">1.</div> <div class="image"> <a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1"> <img src="http:/