html-parsing

Detecting BeautifulSoup Freeze

狂风中的少年 submitted on 2020-01-05 12:08:13
Question: Parsing the following site using BeautifulSoup with the html5lib parser hangs, but the same works well with html.parser.

Site: http://www.webmasterworld.com/google/3584866.htm

    # contents is fetched via urllib2
    soup = BeautifulSoup(contents, 'html5lib')
    soup('a')

hangs, but the following works just fine:

    soup = BeautifulSoup(contents, 'html.parser')
    soup('a')
    > [.....]

Is there something wrong with what I am doing, or is there a way to detect if soup will fail before making the call to soup('a') from
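
One way to guard against a parse that never returns is to run it in a child process with a time limit. The sketch below is only an illustration of that idea, not a fix for the html5lib hang itself: `finishes_in_time` and `PARSE_SCRIPT` are hypothetical names, and the child script assumes bs4 and html5lib are installed in the same interpreter.

```python
import subprocess
import sys


def finishes_in_time(script: str, stdin_data: bytes, timeout_s: float) -> bool:
    """Run `script` in a child Python process; return False if it exceeds timeout_s."""
    try:
        subprocess.run(
            [sys.executable, "-c", script],
            input=stdin_data,
            timeout=timeout_s,
            check=True,  # a crash in the child raises CalledProcessError
            stdout=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps running.
        return False
    return True


# Hypothetical child script: reads the page from stdin and repeats the call that hung.
PARSE_SCRIPT = (
    "import sys\n"
    "from bs4 import BeautifulSoup\n"
    "soup = BeautifulSoup(sys.stdin.buffer.read(), 'html5lib')\n"
    "soup('a')\n"
)
```

A caller could try `finishes_in_time(PARSE_SCRIPT, contents, 10.0)` first and fall back to html.parser when it returns False. The 10-second limit is an arbitrary assumption.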

rvest package read_html() function stops reading at “<” symbol

北城余情 submitted on 2020-01-05 08:57:52
Question: I was wondering if this behavior is intentional in the rvest package. When rvest sees the < character, it stops reading the HTML.

    library(rvest)
    read_html("<html><title>under 30 years = < 30 years <title></html>")

Prints:

    [1] <head>\n <title>under 30 = </title>\n</head>

If this is intentional, is there a workaround?

Answer 1: Yes, it is normal for rvest because it's normal for HTML. See the w3schools HTML Entities page. < and > are reserved characters in HTML and their literal values have to be
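
The point about reserved characters can be shown with a small escaping example. It is written in Python rather than R, purely as an illustration: the standard-library `html.escape` turns a literal < into the entity `&lt;` before the text is placed inside markup, which is what an HTML parser expects.

```python
import html

raw_title = "under 30 years = < 30 years"

# Replace the reserved characters &, < and > (and quotes) with entities.
escaped_title = html.escape(raw_title)

# The escaped text can now sit safely inside an element.
document = f"<html><title>{escaped_title}</title></html>"
print(document)
```

Unescaping with `html.unescape` recovers the original string, so no information is lost by escaping.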

How to Keep HTML Formatting Intact When Parsing with DOM - (No Tag Stripping)

喜你入骨 submitted on 2020-01-05 08:17:52
Question: Employing DOMDocument, I'm trying to read a portion of an HTML file and display it on a different HTML page using the code below. The DIV portion that I'm trying to access has several <p> tags. The problem is that when DOM parses the file, it only fetches the text content between the <p> tags (it strips the tags), and the paragraph formatting is lost. It merges the texts and displays them all as one paragraph. How can I keep the HTML formatting so that the paragraphs are displayed as they were in the
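
The difference between reading a node's text and serializing its children can be sketched as follows. This is a Python illustration using the standard-library `xml.dom.minidom` as a stand-in for PHP's DOMDocument. The sample markup, the `content` id, and the helper names are assumptions, and minidom only accepts well-formed input.

```python
from xml.dom import minidom

SOURCE = (
    '<html><body><div id="content">'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></html>"
)


def text_only(node):
    """Concatenate descendant text nodes; the tags are lost."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_only(child) for child in node.childNodes)


def inner_markup(node):
    """Serialize each child node, so the <p> tags survive."""
    return "".join(child.toxml() for child in node.childNodes)


doc = minidom.parseString(SOURCE)
div = next(
    el for el in doc.getElementsByTagName("div")
    if el.getAttribute("id") == "content"
)

print(text_only(div))     # paragraphs merged into one run of text
print(inner_markup(div))  # paragraphs kept as separate <p> elements
```

In PHP the analogous move would presumably be to serialize each child node (for example by passing the node to `saveHTML`) instead of reading its text value.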

How to Keep HTML Formatting Intact When Parsing with DOM - (No Tag Stripping)

六眼飞鱼酱① submitted on 2020-01-05 08:17:48
Question: Employing DOMDocument, I'm trying to read a portion of an HTML file and display it on a different HTML page using the code below. The DIV portion that I'm trying to access has several <p> tags. The problem is that when DOM parses the file, it only fetches the text content between the <p> tags (it strips the tags), and the paragraph formatting is lost. It merges the texts and displays them all as one paragraph. How can I keep the HTML formatting so that the paragraphs are displayed as they were in the

Get the specific word in text in HTML page

不打扰是莪最后的温柔 submitted on 2020-01-05 08:11:37
Question: If I have the following HTML page

    <div>
      <p> Hello world! </p>
      <p> <a href="example.com"> Hello and Hello again this is an example</a></p>
    </div>

I want to find a specific word, for example 'hello', and change it to 'welcome' wherever it appears in the document. Do you have any suggestions? I will be happy to get your answers whatever type of parser you use.

Answer 1: This is easy to do with XSLT. XSLT 1.0 solution:

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
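
Since the question accepts any parser, here is a minimal Python sketch of the same task using the standard-library `html.parser`. It is not the XSLT answer above. It re-emits every tag as written and rewrites the word only inside text nodes, so attribute values are left alone. The class and function names are hypothetical, comments and doctype declarations are dropped for brevity, and text inside script or style elements would be rewritten too.

```python
import re
from html.parser import HTMLParser


class WordReplacer(HTMLParser):
    """Re-emit the markup unchanged while rewriting a word in text nodes only."""

    def __init__(self, old: str, new: str):
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(rf"\b{re.escape(old)}\b", re.IGNORECASE)
        self.new = new
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Original tag text, so attributes such as href are untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Whole-word, case-insensitive replacement in text only.
        self.out.append(self.pattern.sub(lambda m: self.new, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def replace_word(markup: str, old: str, new: str) -> str:
    parser = WordReplacer(old, new)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


page = (
    '<div> <p> Hello world! </p> <p> <a href="example.com"> '
    "Hello and Hello again this is an example</a></p> </div>"
)
print(replace_word(page, "hello", "welcome"))
```

The word boundary in the pattern keeps longer words that merely contain 'hello' from being changed. Whether that is the desired behavior depends on the use case.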

Get the specific word in text in HTML page

前提是你 submitted on 2020-01-05 08:11:20
Question: If I have the following HTML page

    <div>
      <p> Hello world! </p>
      <p> <a href="example.com"> Hello and Hello again this is an example</a></p>
    </div>

I want to find a specific word, for example 'hello', and change it to 'welcome' wherever it appears in the document. Do you have any suggestions? I will be happy to get your answers whatever type of parser you use.

Answer 1: This is easy to do with XSLT. XSLT 1.0 solution:

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

Apply wordwrap to html content, excluding html attributes

好久不见. submitted on 2020-01-05 07:08:40
Question: I'm not used to regular expressions, so this may be easy for others but it is tricky for me. Basically, I'm applying wordwrap to content that contains classic HTML tags:

    $text = wordwrap($text, $cutLength, " ", $wordCut);
    $text = nl2br(bbcode_parser($text));
    return $text;

As you can see, my problem is pretty simple: all I want is to apply wordwrap() to my content while excluding whatever is in HTML attributes (href, src, ...). Could someone help me out? Thanks a lot!

Answer 1: You shouldn't use regex
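
A parser-based approach can be sketched in Python with the standard-library `html.parser` and `textwrap`: tags are copied through verbatim and only text nodes are wrapped, so long values in href or src are never broken. This is an illustration under assumptions, not the PHP solution. The names are hypothetical, the line break is "\n" rather than the break string passed to wordwrap(), and text split by entities is wrapped piece by piece.

```python
import textwrap
from html.parser import HTMLParser


class TextNodeWrapper(HTMLParser):
    """Copy tags through verbatim and wrap only the text between them."""

    def __init__(self, width: int):
        super().__init__(convert_charrefs=False)
        self.width = width
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # attributes stay intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_data(self, data):
        core = data.strip()
        if not core:
            self.out.append(data)  # whitespace-only text is kept as is
            return
        # Preserve leading/trailing whitespace so words next to tags do not merge.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.out.append(lead + textwrap.fill(core, self.width) + trail)


def wrap_text_nodes(markup: str, width: int) -> str:
    parser = TextNodeWrapper(width)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


link = '<a href="http://example.com/a/very/long/path/that/must/stay/intact">' + "a" * 20 + "</a>"
print(wrap_text_nodes(link, 10))
```

`textwrap.fill` breaks overlong words by default, which roughly corresponds to passing a true cut flag to PHP's wordwrap().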

scrape data from into dataframe with BeautifulSoup

醉酒当歌 submitted on 2020-01-05 07:01:26
Question: I'm working on a project to scrape and parse data from the California lottery into a dataframe. Here's my code so far; it produces no error but also no output:

    import requests
    from bs4 import BeautifulSoup as bs4

    draw = 'http://www.calottery.com/play/draw-games/superlotto-plus/winning-numbers/?page=1'
    page = requests.get(draw)
    soup = bs4(page.text)
    drawing_list = []
    for table_row in soup.select("table.tag_even_numbers tr"):
        cells = table_row.findAll('td')
        if len(cells) > 0:
            draw_date = cells[0]

parsing HTML table with BeautifulSoup4

依然范特西╮ submitted on 2020-01-05 04:06:12
Question: I am new to BeautifulSoup and am trying to extract a table. I have followed the documentation and written a nested for loop to extract the cell data, but it only returns the first three rows. Here is my code:

    from six.moves import urllib
    from bs4 import BeautifulSoup
    import pandas as pd

    def get_url_content(url):
        try:
            html = urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            return None
        try:
            soup = BeautifulSoup(html.read(), 'html.parser')
        except AttributeError as e:
            return None
        return soup

    URL=
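
For comparison, the nested row/cell walk can be sketched with only the standard library. This does not diagnose why the code above stops after three rows. It is a minimal illustration of collecting every row of a table into a list of lists, with hypothetical names and sample data, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by broken markup


SAMPLE = (
    "<table>"
    "<tr><th>Date</th><th>Number</th></tr>"
    "<tr><td>2020-01-01</td><td> 7 </td></tr>"
    "<tr><td>2020-01-04</td><td>12</td></tr>"
    "<tr><td>2020-01-08</td><td>3</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
print(rows)
```

With pandas available, `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a dataframe, assuming the first row holds the headers.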

How to read xpath values from many HTML files in .Net?

血红的双手。 submitted on 2020-01-04 13:25:28
Question: I have about 5000 HTML files in a folder. I need to loop through them and, for each one, open it, grab about 10 values using XPath, close it, and store the values in a (SQL Server) DB. What is the easiest way to read the XPath values using .NET? The XPaths should be pretty stable. Please provide example code to read one value, say /html/head/title/text(). Thanks.

Answer 1: I think you should look into the HTML Agility Pack. It is an HTML parser rather than an XML parser, and is better for this task. If there is anything that doesn't