beautifulsoup

how to find an xml tag with special character in Python BeautifulSoup

巧了我就是萌 Submitted on 2019-12-25 14:14:52
Question: I am using Python BeautifulSoup version 3. My XML looks something like this (it's from the docx format):

    <w:r w:rsidRPr="00541D75">
      <w:rPr>
        <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
        <w:b/>
        <w:color w:val="1F497D" w:themeColor="text2"/>
        <w:sz w:val="24"/>
        <w:szCs w:val="24"/>
      </w:rPr>
      <w:t>Mandatory / Optional</w:t>
    </w:r>
    </w:p>
    </w:tc>
    </w:tr>

I wanted to extract the content from the tag 'w:t', so this is what I did:

    print soup.findAll('w:t')
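For reference, a minimal BeautifulSoup 4 sketch of the same lookup (the question uses version 3, where findAll is the older spelling of find_all). With html.parser the prefixed name "w:t" is kept verbatim as the tag name; the lxml "xml" parser instead strips the "w:" prefix, in which case the tag has to be searched as plain "t":

    from bs4 import BeautifulSoup

    xml = '<w:r><w:rPr><w:b/></w:rPr><w:t>Mandatory / Optional</w:t></w:r>'

    # html.parser keeps "w:t" as a literal (lowercased) tag name,
    # so a plain name match finds it.
    soup = BeautifulSoup(xml, "html.parser")
    for node in soup.find_all("w:t"):
        print(node.get_text())   # -> Mandatory / Optional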

findAll returning empty for html

我怕爱的太早我们不能终老 Submitted on 2019-12-25 11:54:07
Question: I'm using the BeautifulSoup module to parse an HTML file from which I want to extract certain information, specifically game scores and team names. However, when I use the findAll function, it continually returns empty for a string that is certainly within the HTML. If someone can explain what I am doing wrong it will be greatly appreciated. See code below.

    import urllib
    import bs4
    import re
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url = 'http://www
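The question is cut off above (the URL and the actual findAll call are missing), so the following is only a sketch of the usual pattern, with example.com and the "score" class as placeholders. An empty find_all result most often means the tag/class pair is not present in the downloaded HTML, for example because the scores are filled in by JavaScript after the page loads:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url = "http://www.example.com/scores"   # placeholder: the real URL is truncated above
    page_soup = soup(uReq(my_url).read(), "html.parser")

    # find_all matches on tag name plus attributes; printing the raw HTML
    # first confirms whether the element is actually in the server response.
    scores = page_soup.find_all("div", {"class": "score"})   # "score" is a hypothetical class
    print(len(scores))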

How to get the opening and closing tag in beautiful soup from HTML string?

北城余情 Submitted on 2019-12-25 09:17:10
Question: I am writing a Python script using Beautiful Soup, where I have to get an opening tag from a string containing some HTML code. Here is my string:

    string = <p>...</p>

I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but don't seem to find the solution. Can anyone advise me on that?

Answer 1: There is no direct way to get the opening and closing parts of a tag in BeautifulSoup, but, at least, you can get the name of it:
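The answer is cut off at this point; a minimal sketch of the idea it describes (building both parts from the tag's name, ignoring any attributes) might look like this:

    from bs4 import BeautifulSoup

    string = "<p>some text</p>"
    tag = BeautifulSoup(string, "html.parser").find()   # the first (and only) tag

    opening_tag = "<%s>" % tag.name    # '<p>'
    closing_tag = "</%s>" % tag.name   # '</p>'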

Need to extract all the font sizes and the text using beautifulsoup

泪湿孤枕 Submitted on 2019-12-25 09:12:54
Question: I have the following HTML file stored on my local system:

    <span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:792px;"></span>
    <div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:71px; width:322px; height:38px;"><span style="font-family: BAAAAA+DejaVuSans-Bold; font-size:30px">One <br></span></div><div style="position:absolute; border:
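The HTML above is cut off mid-attribute, but each span carries its font size in the inline style attribute, so a sketch of one way to pair sizes with text (assuming BeautifulSoup 4 and a hypothetical local file name, page.html) is:

    import re
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as fh:      # hypothetical file name
        soup = BeautifulSoup(fh, "html.parser")

    # The font size lives inside the inline style attribute, so extract it
    # with a regular expression and pair it with the span's text.
    for span in soup.find_all("span", style=re.compile(r"font-size")):
        size = re.search(r"font-size\s*:\s*(\d+)px", span["style"]).group(1)
        print(size, span.get_text(strip=True))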

Python, BeautifulSoup - Parsing out a Tweet

喜欢而已 Submitted on 2019-12-25 08:20:41
Question: I have a piece of HTML I took from the source of my Twitter timeline, shown here: http://pastebin.com/deefvbYw That's one Tweet I'll use for an example. I can't for the life of me get it to co-operate. I want it to show:

    Dmitri @TheFPShow "I do this all the time... youtube.com/watch?v=DF9WP8…"

If anyone could offer some suggestions that'd be great.

Answer 1:

    soup = BeautifulSoup(twit)
    name_tag = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
    user = name_tag[0]
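The answer's code is cut off after user = name_tag[0]; a hedged continuation along the same lines (the handle and tweet-text selectors below are assumptions based on Twitter markup of that era, not taken from the pastebin):

    from bs4 import BeautifulSoup

    twit = open("tweet.html", encoding="utf-8").read()   # hypothetical file holding the pastebin HTML
    soup = BeautifulSoup(twit, "html.parser")

    name_tag = soup.find("strong", {"class": "fullname js-action-profile-name show-popup-with-id"})
    handle_tag = soup.find("span", {"class": "username"})   # assumed selector
    text_tag = soup.find("p", {"class": "tweet-text"})      # assumed selector

    print(name_tag.get_text(strip=True),
          handle_tag.get_text(strip=True) if handle_tag else "",
          text_tag.get_text(strip=True) if text_tag else "")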

'str' object has no attribute 'p' using beautifulsoup

霸气de小男生 Submitted on 2019-12-25 08:14:41
Question: I have been following a tutorial on using BeautifulSoup, however when I try to read the title or even paragraphs (using soup.p) I get an error saying:

    Traceback (most recent call last):
      File "*****/Tutorial1.py", line 9, in
        pTag = soup.p
    AttributeError: 'str' object has no attribute 'p'

I am still very new to Python, sorry to bother you if this is too easy an issue, but I will greatly appreciate any help. Code given below:

    import urllib.request
    from bs4 import BeautifulSoup

    with urllib
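The error itself says that soup is a plain string rather than a BeautifulSoup object; the question's code is cut off at "with urllib", so the following is only a sketch of the usual shape of the fix (the URL is a placeholder):

    import urllib.request
    from bs4 import BeautifulSoup

    # Constructing a BeautifulSoup object from the response bytes is what
    # makes attribute access such as soup.p work; binding the raw HTML
    # string to `soup` instead is what raises the AttributeError.
    with urllib.request.urlopen("http://www.example.com") as resp:   # placeholder URL
        soup = BeautifulSoup(resp.read(), "html.parser")

    pTag = soup.p   # first <p> tag, or None if the page has none
    print(pTag)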

Bypassing script response when scraping website with Requests/BeautifulSoup

走远了吗. Submitted on 2019-12-25 07:58:22
Question: I am scraping www.marriot.com for information on their hotels and prices. I used the Chrome inspect tool to monitor network traffic and figure out which API endpoint Marriott is using. This is the request I am trying to emulate:

    http://www.marriott.com/reservation/availabilitySearch.mi?propertyCode=TYSMC&isSearch=true&fromDate=02/23/17&toDate=02/24/17&numberOfRooms=1&numberOfGuests=1&numberOfChildren=0&numberOfAdults=1

With my Python code:

    import requests
    from bs4 import BeautifulSoup

    base_uri =
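The question's code stops at base_uri =; a sketch that rebuilds the same request from the query string above (whether the endpoint returns plain HTML or a script interstitial is exactly what the question is about, and is not guaranteed here):

    import requests
    from bs4 import BeautifulSoup

    base_uri = "http://www.marriott.com/reservation/availabilitySearch.mi"
    params = {
        "propertyCode": "TYSMC",
        "isSearch": "true",
        "fromDate": "02/23/17",
        "toDate": "02/24/17",
        "numberOfRooms": 1,
        "numberOfGuests": 1,
        "numberOfChildren": 0,
        "numberOfAdults": 1,
    }

    # A browser-like User-Agent sometimes avoids the interstitial script
    # response; pages built client-side still need a real browser driver.
    resp = requests.get(base_uri, params=params, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")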

Bypassing intrusive cookie statement with requests library

寵の児 Submitted on 2019-12-25 07:29:19
Question: I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement. I am trying to access the website as follows:

    from bs4 import BeautifulSoup as soup
    import requests

    website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
    html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
    htmlsoup = soup(html.text, "html.parser")

This returns a web page that
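One common workaround, as a sketch: keep a requests.Session so that any consent cookie the site sets is sent back on later requests, or set the consent cookie up front. The cookie name and value below are purely illustrative and would have to be read from the browser's developer tools for this particular site:

    import requests
    from bs4 import BeautifulSoup as soup

    website = r"http://www.vi.nl/matchcenter/vandaag.shtml"

    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    s.cookies.set("cookieconsent", "true")   # hypothetical cookie name/value

    html = s.get(website)                    # cookies from earlier responses are reused
    htmlsoup = soup(html.text, "html.parser")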

lxml incorrectly parsing the Doctype while looking for links

孤者浪人 Submitted on 2019-12-25 07:09:23
Question: I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has been just perfect. But with lxml installed, one of our guys is now getting a

    TypeError: string indices must be integers

I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser. The problem function is:

    def collecttemplateurls(templatedir, urlslist):
        """ Uses BeautifulSoup to
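The body of collecttemplateurls is cut off above, so only the usual cause of this error can be sketched: with lxml, the document's Doctype is yielded as a string-like node when iterating over children, and indexing it like element['href'] raises exactly TypeError: string indices must be integers. Guarding on the node type avoids it:

    from bs4 import BeautifulSoup
    from bs4.element import Tag

    html_text = "<!DOCTYPE html><html><body><a href='/home'>Home</a></body></html>"
    soup = BeautifulSoup(html_text, "lxml")

    hrefs = []
    for node in soup.contents:            # the Doctype shows up here as a string-like node
        if not isinstance(node, Tag):     # skip Doctype / NavigableString nodes
            continue
        for a in node.find_all("a", href=True):
            hrefs.append(a["href"])
    print(hrefs)                          # ['/home']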