html-parsing | 易学教程

Detecting BeautifulSoup Freeze

Question: Parsing the following site using BeautifulSoup with the html5lib parser hangs, but the same page works well with html.parser.

Site: http://www.webmasterworld.com/google/3584866.htm

    # contents is fetched via urllib2
    soup = BeautifulSoup(contents, 'html5lib')
    soup('a')

The call to soup('a') hangs, but the following works just fine:

    soup = BeautifulSoup(contents1, 'html.parser')
    soup('a')
    > [.....]

Is there something wrong with what I am doing, or is there a way to detect whether soup will fail before making the call to soup('a') from
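One possible way to detect a freeze like the one described is to run the parse in a separate interpreter and give up after a timeout. The sketch below is only an illustration: the helper name `run_with_timeout` and the contents of `CHILD_SCRIPT` are assumptions, not part of the original question, and `CHILD_SCRIPT` presumes that bs4 and html5lib are installed.

```python
import subprocess
import sys

# Hypothetical child script: reads HTML from stdin, parses it with html5lib,
# and prints the number of <a> tags. Assumes bs4 and html5lib are installed.
CHILD_SCRIPT = (
    "import sys\n"
    "from bs4 import BeautifulSoup\n"
    "soup = BeautifulSoup(sys.stdin.buffer.read(), 'html5lib')\n"
    "print(len(soup('a')))\n"
)


def run_with_timeout(script, data, timeout):
    """Run `script` in a fresh Python process, feeding `data` (bytes) on stdin.

    Returns the child's stdout as text, or None if the child timed out
    or exited with a non-zero status.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", script],
            input=data,
            capture_output=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if done.returncode != 0:
        return None
    return done.stdout.decode().strip()
```

A caller could try `run_with_timeout(CHILD_SCRIPT, contents, 10)` first and fall back to `'html.parser'` when the result is `None`. A separate process is used here because a stuck thread cannot be killed from within Python.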

rvest package read_html() function stops reading at “<” symbol

Question: I was wondering whether this behavior is intentional in the rvest package. When rvest sees the < character, it stops reading the HTML.

    library(rvest)
    read_html("<html><title>under 30 years = < 30 years <title></html>")

Prints:

    [1] <head>\n <title>under 30 = </title>\n</head>

If this is intentional, is there a workaround?

Answer 1: Yes, this is normal for rvest because it is normal for HTML. See the w3schools HTML Entities page. < and > are reserved characters in HTML, and their literal values have to be
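The entity idea in the answer is independent of R. As a rough illustration in Python (not rvest), the standard library can escape a literal < before the text is placed inside markup; the variable names here are made up for the example.

```python
import html

raw_title = "under 30 years = < 30 years"

# Escape reserved characters so a parser does not treat "<" as the start of a tag.
escaped_title = html.escape(raw_title)

# The escaped text can now be embedded safely inside an element.
document = "<html><title>" + escaped_title + "</title></html>"

# Unescaping recovers the original string.
round_trip = html.unescape(escaped_title)
```

Here `escaped_title` contains `&lt;` in place of the bare `<`, which is the form an HTML parser expects for a literal less-than sign.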

How to Keep HTML Formatting Intact When Parsing with DOM - (No Tag Stripping)

Question: Using DOMDocument, I'm trying to read a portion of an HTML file and display it on a different HTML page using the code below. The DIV that I'm trying to access contains several <p> tags. The problem is that when DOM parses the file, it fetches only the text content between the <p> tags and strips the tags, so the paragraph formatting is lost: the text is merged and displayed as one paragraph. How can I keep the HTML formatting so that the paragraphs are displayed as they were in the
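The difference between reading a node's text content and serializing its child nodes can be sketched in Python with the standard library (this is an analogy, not the asker's PHP code). The snippet and variable names are invented for illustration, and the input is assumed to be well-formed. In PHP, the comparable approach is, as far as I know, to call saveHTML() with a node argument rather than reading the node's text value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
snippet = '<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
div = ET.fromstring(snippet)

# Reading only the text merges the paragraphs and loses the tags.
text_only = "".join(div.itertext())

# Serializing each child element keeps the <p> tags intact.
# (Text placed directly inside the div, before the first child, is not included.)
inner_markup = "".join(ET.tostring(child, encoding="unicode") for child in div)
```

`text_only` runs the two paragraphs together, while `inner_markup` keeps each `<p>...</p>` pair, which is what is needed to redisplay the paragraphs as they were.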

Get the specific word in text in HTML page

Question: If I have the following HTML page

    <div> <p> Hello world! </p> <p> <a href="example.com"> Hello and Hello again this is an example</a></p> </div>

I want to find a specific word, for example 'hello', and change it to 'welcome' wherever it appears in the document. Do you have any suggestions? I would be happy with an answer using any type of parser.

Answer 1: This is easy to do with XSLT. XSLT 1.0 solution:

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
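As an alternative to the XSLT route, here is a minimal Python sketch of the same idea: walk the tree and rewrite only text nodes, so tag names and attributes are left alone. It assumes the markup is well-formed enough for an XML parser, matches the word case-insensitively (an assumption, since the sample text uses "Hello"), and the function name `replace_word` is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Whole-word, case-insensitive match for "hello".
WORD = re.compile(r"\bhello\b", re.IGNORECASE)


def replace_word(root, replacement):
    """Replace the word in every text node under `root`, in place."""
    for element in root.iter():
        if element.text:  # text before the element's first child
            element.text = WORD.sub(replacement, element.text)
        if element.tail:  # text after the element's closing tag
            element.tail = WORD.sub(replacement, element.tail)


page = (
    '<div><p> Hello world! </p><p><a href="example.com"> '
    "Hello and Hello again this is an example</a></p></div>"
)
doc = ET.fromstring(page)
replace_word(doc, "welcome")
result = ET.tostring(doc, encoding="unicode")
```

Because only `.text` and `.tail` are touched, a value such as `href="example.com"` would stay unchanged even if it contained the word.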

Apply wordwrap to html content, excluding html attributes

Question: I'm not used to regular expressions, so this might be easy for others but is tricky for me. Basically, I'm applying wordwrap to content that contains classic HTML tags : , ...

    $text = wordwrap($text, $cutLength, " ", $wordCut);
    $text = nl2br(bbcode_parser($text));
    return $text;

As you can see, my problem is pretty simple: all I want is to apply wordwrap() to my content while excluding anything inside HTML attributes: href, src ... Could someone help me out? Thanks a lot!

Answer 1: You shouldn't use regex

scrape data from into dataframe with BeautifulSoup

Question: I'm working on a project to scrape and parse data from the California lottery into a dataframe. Here's my code so far; it produces no error but also no output:

    import requests
    from bs4 import BeautifulSoup as bs4

    draw = 'http://www.calottery.com/play/draw-games/superlotto-plus/winning-numbers/?page=1'
    page = requests.get(draw)
    soup = bs4(page.text)
    drawing_list = []
    for table_row in soup.select("table.tag_even_numbers tr"):
        cells = table_row.findAll('td')
        if len(cells) > 0:
            draw_date = cells[0]
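One way to narrow down a "no error, no output" result is to test the row-extraction logic against a small inline table first; if that works offline, the selector or the fetched page is the more likely culprit. The sketch below uses only the standard library instead of BeautifulSoup, and the sample table (including its dates and numbers) is invented purely for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, such as header rows
                self.rows.append(self._row)
            self._row = None


# Made-up sample data, not taken from the lottery site.
SAMPLE = """<table class="tag_even_numbers">
<tr><th>Draw date</th><th>Numbers</th></tr>
<tr><td>2016-01-02</td><td>1 2 3 4 5</td></tr>
<tr><td>2016-01-06</td><td>6 7 8 9 10</td></tr>
</table>"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
```

A list of rows like this can be passed to `pandas.DataFrame(rows, columns=[...])` with whatever column names fit the real table.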

parsing HTML table with BeautifulSoup4

Question: I am new to BeautifulSoup and am trying to extract a table. I followed the documentation and wrote a nested for loop to extract the cell data, but it only returns the first three rows. Here is my code:

    from six.moves import urllib
    from bs4 import BeautifulSoup
    import pandas as pd

    def get_url_content(url):
        try:
            html=urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            return None
        try:
            soup=BeautifulSoup(html.read(),'html.parser')
        except AttributeError as e:
            return None
        return soup

    URL=

How to read xpath values from many HTML files in .Net?

Question: I have about 5000 HTML files in a folder. I need to loop through them, open each one, grab, say, 10 values using XPath, close it, and store the values in a (SQL Server) DB. What is the easiest way to read the XPath values using .NET? The XPaths should be pretty stable. Please provide example code to read one value, say /html/head/title/text(). Thanks

Answer 1: I think you should look into the HTML Agility Pack. It is an HTML parser rather than an XML parser, and is better for this task. If there is anything that doesn't
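The loop itself (open each file, evaluate a path, collect the value) can be sketched in Python rather than .NET. This is only an illustration under strong assumptions: the files must be well-formed XML without a namespace, which real-world HTML often is not (the reason the answer suggests a dedicated HTML parser). The function name `read_titles` and the demo files are made up.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def read_titles(folder):
    """Map each *.html file name in `folder` to its <title> text, or None."""
    titles = {}
    for path in sorted(Path(folder).glob("*.html")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            titles[path.name] = None  # not well-formed; needs a real HTML parser
            continue
        # The root element is <html>, so this path corresponds to /html/head/title/text()
        titles[path.name] = root.findtext("head/title")
    return titles


# Small demo with two throwaway files: one well-formed, one not.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.html").write_text(
        "<html><head><title>First page</title></head><body/></html>", encoding="utf-8"
    )
    Path(folder, "b.html").write_text(
        "<html><head><title>Broken</title><br></head></html>", encoding="utf-8"
    )
    demo_titles = read_titles(folder)
```

The resulting dictionary could then be written to a database in a single batch once all files have been read.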