html-parsing

Parse HTML and Get All h3's After an h2 Before the Next h2 Using PHP

不问归期 submitted on 2019-12-22 10:29:38
Question: I am looking to find the first h2 in the article. Once found, look for all h3's until the next h2 is found. Rinse and repeat until all headings and subheadings have been located. Before you immediately flag or close this as a duplicate parsing question, please take note of the question title, as this isn't about basic node retrieval. I've got that part down. I am using DOMDocument to parse HTML using DOMDocument::loadHTML(), DOMDocument::getElementsByTagName() and DOMDocument:
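
The grouping described in this question (each h2 followed by the h3's that appear before the next h2) can be sketched as a single pass over the headings in document order. The sketch below is in Python with the standard-library html.parser rather than PHP's DOMDocument, so it illustrates the idea and is not the asker's code; the names HeadingOutline and outline_headings are made up for this example.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect each h2 together with the h3 headings that follow it."""

    def __init__(self):
        super().__init__()
        self.outline = []     # list of (h2_text, [h3_text, ...]) in document order
        self._current = None  # "h2" or "h3" while inside such a heading
        self._buffer = []     # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of something else, e.g. <em> inside a heading
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self.outline.append((text, []))
        elif self.outline:
            # an h3 that appears before the first h2 has no parent and is skipped
            self.outline[-1][1].append(text)
        self._current = None


def outline_headings(html):
    """Return [(h2_text, [h3_text, ...]), ...] for the given HTML string."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline
```

Because the pass is linear, an h3 is always attached to the most recent h2, which is the "until the next h2 is found" rule from the question.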

Parsing html data into python list for manipulation

荒凉一梦 submitted on 2019-12-22 08:34:59
Question: I am trying to read in HTML websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search that text (I have been using re.search) but can't seem to get it to work properly. Here is the line I am trying to access: EPS (Basic)\n13.4620.6226.6930.1732.81\n\n So I would like to create a list called
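
One way to split the run-together figures in that line is to first locate the digits after the "EPS (Basic)" label and then cut them into values. The sketch below assumes every value is printed with exactly two decimals (true of the sample line, not guaranteed in general) and that the `\n` in the excerpt stands for a real newline; the function name parse_eps is made up for this example.

```python
import re


def parse_eps(text):
    """Extract the EPS (Basic) values from a flattened text block.

    Assumes each value has exactly two decimal places, so that
    '13.4620.62' can be split unambiguously into 13.46 and 20.62.
    """
    # Find the digit run that follows the label, skipping any whitespace.
    match = re.search(r"EPS \(Basic\)\s*([\d.]+)", text)
    if match is None:
        return []
    # Cut the run into "digits, dot, two digits" chunks.
    return [float(value) for value in re.findall(r"\d+\.\d{2}", match.group(1))]


sample_line = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"
eps_values = parse_eps(sample_line)
```

If the source table can be parsed cell by cell before it is flattened to text, that is more reliable than splitting afterwards, since the two-decimal assumption breaks as soon as a value is printed differently.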

Looking for a CSS parser in Ruby [closed]

耗尽温柔 submitted on 2019-12-22 06:56:47
Question: I'm looking for a CSS parser, similar to the one in "Looking for a CSS Parser in java", but in Ruby. Input: an element of an HTML document. Output: all styles associated with that specific element. I've googled for it, and I've also searched here at Stack Overflow, but all I could find was this Java parser. Answer 1: You

beautifulsoup and invalid html document

只愿长相守 submitted on 2019-12-22 06:03:49
Question: I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get the countries and names at the beginning of the document. Here is my code:
    import urllib
    import re
    from bs4 import BeautifulSoup
    url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
    soup=BeautifulSoup(urllib.urlopen(url))
    attendances_table=soup.find("table", {"width":850})
    print attendances_table #this works, I see the whole table

antisamy parser force closing tag

╄→尐↘猪︶ㄣ submitted on 2019-12-22 05:31:12
Question: I use AntiSamy for validating HTML. My policy allows iframes, such as YouTube videos. The problem is that if the tag is empty, like this: <iframe src="//www.youtube.com/embed/uswzriFIf_k?feature=player_detailpage" allowfullscreen></iframe> then after cleaning it looks like this: <iframe src="//www.youtube.com/embed/uswzriFIf_k?feature=player_detailpage" allowfullscreen/> But it should have a normal closing tag, and this breaks all the content that follows on the page. I already set my directives to use most of HTML but not

How to get the option text using BeautifulSoup

北战南征 submitted on 2019-12-22 05:24:40
Question: I want to use BeautifulSoup to get the option text in the following HTML. For example, I'd like to get 2002/12, 2003/12, etc.
    <select id="start_dateid">
    <option value="0">2002/12</option>
    <option value="1">2003/12</option>
    <option value="2">2004/12</option>
    <option value="3">2005/12</option>
    <option value="4">2006/12</option>
    <option value="5" selected="">2007/12</option>
    <option value="6">2008/12</option>
    <option value="7">2009/12</option>
    <option value="8">2010/12</option>
    <option value=
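
The extraction asked for here amounts to: find the select with id "start_dateid", then read the text of each option inside it. The sketch below shows that logic with Python's standard-library html.parser as a stand-in, so it is not BeautifulSoup code; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class OptionText(HTMLParser):
    """Collect the text of every <option> inside one <select>, chosen by id."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.texts = []
        self._in_select = False
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "select" and dict(attrs).get("id") == self.select_id:
            self._in_select = True
        elif tag == "option" and self._in_select:
            self._in_option = True
            self.texts.append("")  # start a new, empty entry for this option

    def handle_data(self, data):
        if self._in_option:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False
        elif tag == "select":
            self._in_select = False


def option_texts(html, select_id):
    """Return the stripped option labels of the select with the given id."""
    parser = OptionText(select_id)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]
```

With BeautifulSoup the same two steps would be a lookup of the select by id followed by a loop over its option tags, reading each tag's text.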

How to get an XPath from selenium webelement or from lxml?

不打扰是莪最后的温柔 submitted on 2019-12-22 04:36:26
Question: I am using Selenium and I need to find the XPaths of some Selenium web elements. For example:
    import selenium.webdriver
    driver = selenium.webdriver.Firefox()
    element = driver.find_element_by_xpath(<some_xpath>)
    elements = element.find_elements_by_xpath(<some_relative_xpath>)
    for e in elements:
        print e.get_xpath()
I know I can't get the XPath from the element itself, but is there a nice way to get it anyway? I tried using lxml to parse the HTML, but it doesn't recognize the XPath, <some_xpath>

PHP DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity

和自甴很熟 submitted on 2019-12-22 03:21:47
Question: I am trying to get the "link" elements from certain webpages. I can't figure out what I'm doing wrong, though. I'm getting the following error:
    Severity: Warning
    Message: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 536
    Filename: controllers/test.php
    Line Number: 34
Line 34 is the following in the code: $dom->loadHTML($html); My code:
    $url = "http://www.amazon.com/";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT

Extracting email addresses in an html block in ruby/rails

谁说胖子不能爱 submitted on 2019-12-22 01:04:32
Question: I am creating a parser that guards against spamming and harvesting of emails from a block of text that comes from TinyMCE (so it may or may not have HTML tags in it). I've tried regexes, and so far this one has been successful: /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i The problem is that I need to ignore all email addresses in mailto hrefs. For example: <a href="mailto:test@mail.com">test@mail.com</a> should only return the second email address. To give some background on what I'm doing, I'm reversing the
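
A simple way to skip the mailto addresses is to blank out every `mailto:...` target first and only then apply the email pattern from the question. The sketch below does this in Python rather than Ruby; the email pattern is the asker's, while the mailto pattern and the function name visible_emails are assumptions made for this example.

```python
import re

# The email pattern from the question, case-insensitive.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Assumed shape of a mailto target: everything up to a quote, space or '>'.
MAILTO = re.compile(r"mailto:[^\"'\s>]+", re.IGNORECASE)


def visible_emails(html):
    """Return email addresses in the text, ignoring those inside mailto: links."""
    without_mailto = MAILTO.sub("", html)
    return EMAIL.findall(without_mailto)


sample = '<a href="mailto:test@mail.com">test@mail.com</a>'
found = visible_emails(sample)
```

Removing the whole mailto target before matching avoids a pitfall of a plain negative lookbehind, which can still match the tail of an address such as `first.last@mail.com` starting at `last`.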

How to open url having Arabic text using php file-get-contents function

懵懂的女人 submitted on 2019-12-21 23:06:23
Question: I want to get the HTML from a URL containing some Arabic, like http://www.example.com/2013/07/31/الاختبار.html, using PHP. I tried file_get_html("http://www.example.com/2013/07/31/الاختبار.html") but it gives the following error: Warning: file_get_contents(http://www.example.com/2013/07/31/الاختبار.html) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 404 Not Found in filename.php Please help. http://www.example.com/2013/07/31/الاختبار.html is for reference
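
A likely cause of a 404 like this is that the non-ASCII path is sent without percent-encoding, although the excerpt does not confirm that. The sketch below shows the encoding step in Python with urllib.parse, not in PHP; the function name encode_url is made up for this example, and the host name is left untouched (an internationalized domain would need separate handling).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the non-ASCII parts of a URL's path and query as UTF-8."""
    parts = urlsplit(url)
    # Keep '/' as a separator and '%' so already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


arabic_url = "http://www.example.com/2013/07/31/الاختبار.html"
encoded_url = encode_url(arabic_url)
```

In PHP the corresponding step would be to apply rawurlencode() to each path segment before passing the URL to file_get_contents().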