html-parsing | 易学教程

How to extract values with BeautifulSoup with no class

Question: Given this HTML: <td class="_480u"> <div class="clearfix"> <div> Female </div> </div> </td> I want the value "Female" as output. I tried bs.findAll('div',{'class':'clearfix'}) and bs.findAll('tag',{'class':'_480u'}), but these classes appear all over my HTML, so the output is a big list. I want to combine both conditions in one search (a td with class="..." containing a div with class="...") so that I get only "Female". How can I do this? Thanks.

Answer 1: Use the stripped_strings property: >>> from bs4 import
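
The answer above is cut off, so what follows is not its code. As a rough sketch of the idea the question asks for (match text only when a div with class clearfix sits inside a td with class _480u), here is a stand-in that uses Python's standard-library html.parser instead of BeautifulSoup. The class name NestedClassText and the extra "noise" element are made up for illustration, and the stack handling assumes well-formed markup like the sample.

```python
from html.parser import HTMLParser


class NestedClassText(HTMLParser):
    """Collect text inside <div class="clearfix"> nested in <td class="_480u">.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, set of classes) for every currently open element
        self.values = []  # matched text, whitespace-stripped

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (assumes well-formed input).
        while self.stack:
            if self.stack.pop()[0] == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        inside_td = False
        for tag, classes in self.stack:
            if tag == "td" and "_480u" in classes:
                inside_td = True
            elif inside_td and tag == "div" and "clearfix" in classes:
                self.values.append(text)
                return


# The first div is an invented decoy: it has the class but is not inside the td.
html = (
    '<div class="clearfix">noise</div>'
    '<td class="_480u"> <div class="clearfix"> <div> Female </div> </div> </td>'
)
parser = NestedClassText()
parser.feed(html)
parser.close()
print(parser.values)
```

The decoy div is skipped because no enclosing td with class _480u is open when its text is seen, which is the "td and div together" condition the question describes.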

how to pass search key and get result through bs4

Question: def get_main_page_url("https://malwr.com/analysis/search/", strDestPath, strMD5): base_url = 'https://malwr.com/' url = 'https://malwr.com/account/login/' username = 'myname' password = 'pswd' session = requests.Session() # getting csrf value response = session.get(url) soup = bs4.BeautifulSoup(response.content) form = soup.form csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value') ## csrf1 = form.find('input', attrs ={'name': 'search'}).get('value') # logging in data

I want to get all article content from all links inside a website

Question: I want to extract all article content from a website using any web crawling/scraping method. The problem is that I can get content from a single page but not from its redirecting links. Can anyone please suggest a proper solution? import java.io.FileOutputStream; import java.io.InputStreamReader; import java.io.OutputStreamWriter; import java.io.Reader; import java.net.URI; import java.net.URL; import java.net.URLConnection; import javax.swing.text.EditorKit; import javax.swing.text.html.HTMLDocument;

How to ignore http link in string and return everything else?

Question: I'm trying to parse some HTML content. Here's the HTML: <font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p> <font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p> <font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p> <font color="green"> *TITLE* </font> Event
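
One possible way to drop the link and keep the rest, sketched with Python's standard re module. It assumes the text has already been pulled out of the font tag, so no markup follows the URL directly; the function name strip_links is invented for this example.

```python
import re

# Matches "http://" or "https://" followed by everything up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def strip_links(text):
    """Remove URL-like tokens, then collapse leftover whitespace."""
    without_links = URL_RE.sub("", text)
    return " ".join(without_links.split())


# Text of the third gold <font> element from the question.
sample = "CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html"
print(strip_links(sample))
```

Applied to the raw HTML instead of extracted text, the same pattern would also swallow a closing tag that touches the URL, because \S+ runs until the next whitespace.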

Using JSoup to get data-code value of a table

Question: How can I use JSoup to get the data-code value from a table row? Here is what I have tried, but it prints nothing: Document doc = Jsoup.connect("http://www.example.com").get(); Elements dataCodes = doc.select("table[class=team-list]"); for (Element dataCode : dataCodes) { System.out.println(dataCode.attr("data-code")); } The HTML code looks like this: <body> <div id-=""main"> <div id="inner"> <div id="table" class="scores-table"> <table class ="team-list"> <tbody> <tr data

Jsoup - extracting data from an <a> tag, inside a <td> tag

Question: I want to extract data from a website using Jsoup. The data is in a table. HTML code: <table><tr><td><a href="......">Pop.Density</a></td> <td>123</td></tr></table> I want to print: zip code...(taken from a text file): 123 I get the following exception: Exception in thread "main" java.lang.NullPointerException Any help would be appreciated. Thank you! This is my code: String s = br.readLine(); String str="http://www.bestplaces.net/people/zip-code/illinois/"+s; org.jsoup.Connection conn =

How to easily parse HTML for consumption as a service using Java?

Question: I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top and extract only the text of the elements that have <a class="title" The options I have looked at so far all look like overkill (SAX, DOM traversal).

Answer 1: Use Jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example: String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top"; Document document = Jsoup.connect(url).get(); for (Element link : document.select("a.title"

JSOUP Finding Groups of Words

Question: For a homework assignment I have to write a program that scrapes HTML from a website and then somehow finds phrases within it. By phrases I mean some arbitrary way of organizing text so that words in close proximity to each other are put in the same group. I know this sounds unclear, but the assignment states that how we do this is up to our own interpretation of how to find "phrases". Currently I have code that looks like: Document doc = Jsoup.connect("http:

Python how to search and correct html tags and attributes?

Question: I have to fix the closing of every <img> tag in the text below. Instead of closing the <img> with a > , it should close with /> . Is there an easy way to find every <img> in this text and fix the > ? (If it is already closed with /> , no action is required.) Another question: if no "width" or "height" is specified for the <img>, what is the best way to solve the issue? Download all the images and get the corresponding attributes of width and height, then
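
A minimal regex-based sketch of the first part (closing each <img> with /> ), using only Python's standard re module. It assumes no attribute value contains a literal > character; the function name fix_img_tags and the file names in the sample are placeholders, since the question's own text is not shown here. A full HTML parser would be more robust on messy markup.

```python
import re

# <img, then attributes (lazily), optional whitespace, optional "/", then ">".
IMG_RE = re.compile(r"<img\b([^>]*?)\s*/?>", re.IGNORECASE)


def fix_img_tags(html):
    """Rewrite every <img ...> or <img .../> as <img ... />."""
    return IMG_RE.sub(r"<img\1 />", html)


# Placeholder markup: one unclosed tag and one that is already self-closed.
sample = '<p><img src="a.png"><img src="b.png" /></p>'
print(fix_img_tags(sample))
```

Tags that already end in /> are matched too, but they come out unchanged apart from a normalized single space before the slash.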

BeautifulSoup Specify table column by number?

Question: Using Python 2.7 and BeautifulSoup 4, I'm scraping song names from a table. Right now the script finds links in each row of a table; how can I specify that I want the first column? Ideally I'd be able to change a number to switch which column gets selected. Right now the code looks like this: from bs4 import BeautifulSoup import requests r = requests.get("http://evamsharma.finosus.com/beatles/index.html") data = r.text soup = BeautifulSoup(data) for table in soup.find_all('table'): for row in
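
The general idea of picking a column by number is to gather each row's cells into a list and index into it. The sketch below shows that with Python 3's standard-library html.parser rather than BeautifulSoup, so it runs without extra packages. The TableRows class, the column helper, and the sample table contents are all invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def column(html, index):
    """Return the text of cell number `index` (0-based) from every row that has one."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [row[index] for row in parser.rows if len(row) > index]


# Placeholder table; not the content of the page in the question.
sample = """
<table>
  <tr><td><a href="#">Song A</a></td><td>1963</td></tr>
  <tr><td><a href="#">Song B</a></td><td>1965</td></tr>
</table>
"""
print(column(sample, 0))
print(column(sample, 1))
```

With BeautifulSoup the same approach would amount to indexing the list returned by row.find_all('td') inside the row loop, so changing the index changes the selected column.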