html-parsing

How to extract values with BeautifulSoup with no class

大兔子大兔子 submitted on 2019-12-11 12:42:04
Question: HTML code: <td class="_480u"> <div class="clearfix"> <div> Female </div> </div> </td> I want the value "Female" as output. I tried bs.findAll('div',{'class':'clearfix'}) and bs.findAll('tag',{'class':'_480u'}), but these classes appear all over my HTML, so the output is a big list. I want to combine both conditions in one search (td --> class = ".." and div --> class = "..") so that the output is just Female. How can I do this? Thanks.
Answer 1: Use the stripped_strings property: >>> from bs4 import
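One way to combine the two class conditions from the question above is a CSS descendant selector. The sketch below is only an illustration, not the truncated answer's code: it assumes the bs4 package is installed, wraps the question's snippet in a table, and adds a decoy div to show the filtering. SAMPLE_HTML and extract_cell_values are hypothetical names.

```python
from bs4 import BeautifulSoup

# Sample markup: the snippet from the question, wrapped in a table (assumption),
# plus a decoy "clearfix" div that sits outside any td._480u cell.
SAMPLE_HTML = """
<div class="clearfix"><div>Unrelated</div></div>
<table><tr>
  <td class="_480u"> <div class="clearfix"> <div> Female </div> </div> </td>
</tr></table>
"""


def extract_cell_values(markup):
    """Return the text found in div.clearfix blocks nested inside td._480u cells."""
    soup = BeautifulSoup(markup, "html.parser")
    values = []
    # "td._480u div.clearfix" matches only clearfix divs that are descendants
    # of a td carrying the _480u class, so the decoy div is skipped.
    for block in soup.select("td._480u div.clearfix"):
        # stripped_strings yields each text node with surrounding whitespace removed.
        values.extend(block.stripped_strings)
    return values


if __name__ == "__main__":
    print(extract_cell_values(SAMPLE_HTML))
```

The selector does the narrowing, and stripped_strings only cleans up whitespace around the remaining text.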

How to pass a search key and get results through bs4

限于喜欢 submitted on 2019-12-11 12:15:30
Question:
def get_main_page_url("https://malwr.com/analysis/search/", strDestPath, strMD5):
    base_url = 'https://malwr.com/'
    url = 'https://malwr.com/account/login/'
    username = 'myname'
    password = 'pswd'
    session = requests.Session()
    # getting csrf value
    response = session.get(url)
    soup = bs4.BeautifulSoup(response.content)
    form = soup.form
    csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value')
    ## csrf1 = form.find('input', attrs ={'name': 'search'}).get('value')
    # logging in data

I want to get all article content from all links inside a website

╄→尐↘猪︶ㄣ submitted on 2019-12-11 11:53:13
Question: I want to extract all article content from a website using any web crawling/scraping method. The problem is that I can get content from a single page, but not from the pages it links to. Can anyone please suggest a proper solution?
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;

How to ignore http link in string and return everything else?

旧城冷巷雨未停 submitted on 2019-12-11 11:44:36
Question: I'm trying to parse some HTML content. Here is the HTML:
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event
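The question above is cut off, so the exact output the asker wants is unknown. As a minimal sketch, assuming the goal is to drop anything that looks like an http(s) link and keep the rest of the text, a regular expression from the standard library is enough. URL_PATTERN and strip_links are hypothetical names.

```python
import re

# Matches "http://" or "https://" followed by everything up to the next
# whitespace character or "<", so a closing tag right after the URL survives.
URL_PATTERN = re.compile(r"https?://[^\s<]+")


def strip_links(text):
    """Remove http(s) links from text and collapse the leftover whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(without_links.split())


if __name__ == "__main__":
    sample = "CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html"
    print(strip_links(sample))
```

The doubled "http://http://" in the sample is removed in one match, because the pattern keeps consuming characters until it reaches whitespace or a tag.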

Using JSoup to get data-code value of a table

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-11 11:43:02
Question: How can I use JSoup to get the data-code value from a table row? Here is what I have tried, but it prints nothing:
Document doc = Jsoup.connect("http://www.example.com").get();
Elements dataCodes = doc.select("table[class=team-list]");
for (Element dataCode : dataCodes) {
    System.out.println(dataCode.attr("data-code"));
}
The HTML code looks like this: <body> <div id="main"> <div id="inner"> <div id="table" class="scores-table"> <table class ="team-list"> <tbody> <tr data

Jsoup - extracting data from an <a> tag, inside a <td> tag

两盒软妹~` submitted on 2019-12-11 11:13:04
Question: I want to extract data from a website using Jsoup. The data is in a table. HTML code: <table><tr><td><a href="......">Pop.Density</a></td> <td>123</td></tr></table> I want to print: zip code...(taken from a text file): 123 I get the following exception: Exception in thread "main" java.lang.NullPointerException Any help would be appreciated. Thank you! This is my code:
String s = br.readLine();
String str="http://www.bestplaces.net/people/zip-code/illinois/"+s;
org.jsoup.Connection conn =

How to easily parse HTML for consumption as a service using Java?

本秂侑毒 submitted on 2019-12-11 10:49:53
Question: I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top and extract only the text of the elements that have <a class="title" The options I have looked at so far (SAX, DOM traversal) all look like overkill.
Answer 1: Use Jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example: String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top"; Document document = Jsoup.connect(url).get(); for (Element link : document.select("a.title"

JSOUP Finding Groups of Words

会有一股神秘感。 submitted on 2019-12-11 10:48:47
Question: For a homework assignment I have to write a program that scrapes HTML from a website and then somehow finds phrases within it. By phrases I mean some arbitrary way of organizing text so that words in close proximity to each other are put in the same group. I know this sounds unclear, but the assignment states that how we do this is up to our own interpretation of how to find "phrases". Currently I have code that looks like: Document doc = Jsoup.connect("http:

Python: how to search and correct HTML tags and attributes?

拜拜、爱过 submitted on 2019-12-11 10:14:45
Question: I have to fix the closing of every <img> tag in the text below. Instead of closing the <img> with a > , it should close with /> . Is there an easy way to find every <img> in this text and fix the > ? (If a tag is already closed with /> , no action is required.) Another question: if no "width" or "height" is specified for an <img> , what is the best way to solve that? Download all the images and get the corresponding width and height attributes, then
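For the first part of the question above (closing each <img> with /> ), a regular-expression substitution is one possible approach. This is a rough sketch rather than a full HTML rewrite: it assumes no attribute value contains a literal ">" character, and IMG_TAG and close_img_tags are hypothetical names.

```python
import re

# Group 1 captures "<img" plus its attributes (lazily); the rest of the
# pattern consumes optional whitespace, an optional "/", and the final ">".
IMG_TAG = re.compile(r"(<img\b[^>]*?)\s*/?>", re.IGNORECASE)


def close_img_tags(markup):
    """Rewrite every <img ...> or <img .../> so that it ends with ' />'."""
    return IMG_TAG.sub(r"\1 />", markup)


if __name__ == "__main__":
    print(close_img_tags('<p><img src="a.png"><img src="b.png" /></p>'))
```

Tags that already end in " />" come out unchanged, so running the function twice gives the same result as running it once.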

BeautifulSoup Specify table column by number?

时光怂恿深爱的人放手 submitted on 2019-12-11 09:59:50
Question: Using Python 2.7 and BeautifulSoup 4, I'm scraping song names from a table. Right now the script finds links in each row of a table; how can I specify that I want only the first column? Ideally I'd be able to change a number to choose which column gets selected. Right now the code looks like this:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://evamsharma.finosus.com/beatles/index.html")
data = r.text
soup = BeautifulSoup(data)
for table in soup.find_all('table'):
    for row in
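One way to pick a column by number, sketched under assumptions: the bs4 package is installed, each row's cells are plain td elements, and the table content below is made-up sample data rather than the page from the question. SAMPLE_TABLE and links_in_column are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up sample data standing in for the real page (assumption).
SAMPLE_TABLE = """
<table>
  <tr><td><a href="/s/1">Help!</a></td><td><a href="/a/1">Album A</a></td></tr>
  <tr><td><a href="/s/2">Yesterday</a></td><td><a href="/a/1">Album A</a></td></tr>
</table>
"""


def links_in_column(markup, column_index):
    """Return the link texts found in the given zero-based column of every row."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for row in soup.find_all("tr"):
        # recursive=False keeps only the row's own cells, not cells of nested tables.
        cells = row.find_all("td", recursive=False)
        if column_index < len(cells):
            for link in cells[column_index].find_all("a"):
                texts.append(link.get_text(strip=True))
    return texts


if __name__ == "__main__":
    print(links_in_column(SAMPLE_TABLE, 0))
```

Changing the second argument switches columns, and rows with too few cells are skipped instead of raising an IndexError.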