web-scraping

How to filter out nodes with rvest?

谁说胖子不能爱 submitted on 2021-01-27 11:29:25

Question: I am using the R rvest library to read an HTML page containing tables. Unfortunately the tables have an inconsistent number of columns. Here is an example of the table I read:

    <table>
      <tr class="alt">
        <td>1</td>
        <td>2</td>
        <td class="hidden">3</td>
      </tr>
      <tr class="tr0 close notule">
        <td colspan="9">4</td>
      </tr>
    </table>

and here is my code to read the table in R:

    require(rvest)
    url <- "table.html"
    x <- read_html(url)
    (x %>% html_nodes("table")) %>% html_table(fill = TRUE)
    # [[1]]
    # X1 X2 X3 X4 X5 X6 X7 X8 X9
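No answer is included in this snippet, but the usual fix is to drop the offending nodes before parsing the table (in R that would mean removing them with xml2 before calling html_table). The same "filter nodes first, parse second" idea can be sketched with the Python standard library on the snippet from the question; this is an illustration of the idea, not the rvest API:

```python
# Sketch: remove the colspan filler row and the class="hidden" cell so every
# remaining row has the same number of columns, then read the cells.
import xml.etree.ElementTree as ET

html = """
<table>
  <tr class="alt">
    <td>1</td>
    <td>2</td>
    <td class="hidden">3</td>
  </tr>
  <tr class="tr0 close notule">
    <td colspan="9">4</td>
  </tr>
</table>
"""

table = ET.fromstring(html)

# Drop whole rows that hold a colspan filler cell, and individual cells
# marked class="hidden".
for tr in list(table.findall("tr")):
    if any("colspan" in td.attrib for td in tr.findall("td")):
        table.remove(tr)
        continue
    for td in list(tr.findall("td")):
        if td.get("class") == "hidden":
            tr.remove(td)

rows = [[td.text for td in tr.findall("td")] for tr in table.findall("tr")]
print(rows)  # [['1', '2']]
```

With the irregular nodes gone, a table parser sees a consistent two-column table instead of padding everything out to nine columns.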

Getting form “action” from BeautifulSoup result

左心房为你撑大大i submitted on 2021-01-27 07:20:22

Question: I'm writing a Python parser for a website to automate a job, but I'm not familiar with Python's "re" (regex) module and can't make it work.

    req = urllib2.Request(tl2)
    req.add_unredirected_header('User-Agent', ua)
    response = urllib2.urlopen(req)
    try:
        html = response.read()
    except urllib2.URLError, e:
        print "Error while reading data. Are you connected to the interwebz?!", e
    soup = BeautifulSoup.BeautifulSoup(html)
    form = soup.find('form', id='form_product_page')
    pret = form.prettify()
    print
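The question is cut off, but the title asks how to read the form's "action" attribute, and no regex is needed for that: on a BeautifulSoup Tag it is simply form.get('action') (or form['action']). The same extraction is sketched below with the standard-library HTML parser so the example is self-contained; the form markup and its action URL are made up for illustration:

```python
# Extracting a form's "action" attribute without regex. BeautifulSoup would
# do this with form.get('action'); here the stdlib html.parser is used so
# the snippet runs on its own. The form below is a made-up stand-in.
from html.parser import HTMLParser

class FormActionFinder(HTMLParser):
    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.action = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.action = attrs.get("action")

html = '<form id="form_product_page" action="/cart/add" method="post"></form>'
parser = FormActionFinder("form_product_page")
parser.feed(html)
print(parser.action)  # /cart/add
```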

Python Xpath: lxml.etree.XPathEvalError: Invalid predicate

主宰稳场 submitted on 2021-01-26 09:19:07

Question: I'm trying to learn how to scrape web pages, and the code below from the tutorial I'm following throws this error:

    lxml.etree.XPathEvalError: Invalid predicate

The website I'm querying is (don't judge me, it was the one used in the training video :/ ): https://itunes.apple.com/us/app/candy-crush-saga/id553834731

The XPath string that causes the error is here:

    links = tree.xpath('//div[@class="center-stack"//*/a[@class="name"]/@href')

I'm using the lxml and requests libraries. If you need any
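The error is visible in the XPath string itself: the predicate opened by `[@class="center-stack"` is never closed, so lxml reports an invalid predicate. Adding the missing `]` gives `//div[@class="center-stack"]//*/a[@class="name"]/@href`. The sketch below exercises the corrected predicate with the stdlib ElementTree on a made-up snippet (the div/a structure is assumed from the question; note ElementTree has no `/@href` step, so `.get('href')` is used instead):

```python
# The original expression fails because the first predicate is never closed:
#   '//div[@class="center-stack"//*/a[@class="name"]/@href'
#                                ^ missing ']' here
# Corrected for lxml:
#   '//div[@class="center-stack"]//*/a[@class="name"]/@href'
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<body><div class="center-stack">'
    '<ul><li><a class="name" href="/app/1">one</a></li>'
    '<li><a class="other" href="/app/2">two</a></li></ul>'
    '</div></body>'
)

links = []
for div in doc.findall('.//div[@class="center-stack"]'):
    for a in div.findall('.//a[@class="name"]'):
        links.append(a.get('href'))
print(links)  # ['/app/1']
```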

Need help to scrape “Show more” button

我是研究僧i submitted on 2021-01-25 22:12:24

Question: I have the following code:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import datetime
    import time

    url_list = [
        'https://www.coolmod.com/componentes-pc-procesadores?f=375::No',
        # 'https://www.coolmod.com/componentes-pc-placas-base?f=55::ATX||prices::3-300',
    ]
    df_list = []
    for url in url_list:
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
                   'Accept-Language': 'es-ES, es;q=0.5'}
        print
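The snippet ends before the scraping itself, but the general problem with a "Show more" button is that it runs client-side JavaScript, so requests only ever sees the initial HTML. The two usual options are (a) driving a real browser (Selenium/Playwright) or (b) finding the XHR the button fires in the browser's network tab and replaying it with an incrementing page parameter. A sketch of building the URLs for option (b); the `p` parameter name and the page count are assumptions for illustration, not coolmod.com's real API:

```python
# Replaying a "Show more" XHR usually means re-requesting the listing URL
# with an incrementing page parameter. This helper builds those URLs while
# preserving the existing query string; no network calls are made here.
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def paged_urls(base_url, pages, page_param="p"):
    """Yield base_url with page_param set to 1..pages, keeping existing query args."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))
    for page in range(1, pages + 1):
        query[page_param] = str(page)
        yield urlunparse(parts._replace(query=urlencode(query)))

urls = list(paged_urls(
    "https://www.coolmod.com/componentes-pc-procesadores?f=375::No", 3))
for u in urls:
    print(u)
```

Each generated URL would then be fetched in the existing `for url in url_list:` loop; the loop stops when a page comes back with no new products.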
