web-scraping

(Beautiful Soup) Get data inside a button tag

早过忘川 submitted on 2021-01-28 14:11:45
Question: I am trying to scrape an ImageId out of a button tag; the result I want is "25511e1fd64e99acd991a22d6c2d6b6c". When I try:

    drawing_url = drawing_url.find_all('button', class_='inspectBut')['onclick']

it doesn't work, giving the error: TypeError: list indices must be integers or slices, not str

Input:

    for article in soup.find_all('div', class_='dojoxGridRow'):
        drawing_url = article.find('td', class_='dojoxGridCell', idx='3')
        drawing_url = drawing_url.find_all('button', class_='inspectBut')
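
The TypeError comes from find_all() returning a list of tags rather than a single tag, so indexing it with a string key fails. A minimal sketch of the usual fix, assuming the class names from the question; the page URL is a placeholder, and the regex assumes the ImageId is a 32-character hex token inside the onclick handler:

    import re
    import requests
    from bs4 import BeautifulSoup

    # Placeholder for the page being scraped in the question.
    page_url = "https://example.com/grid-page"
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")

    for article in soup.find_all('div', class_='dojoxGridRow'):
        cell = article.find('td', class_='dojoxGridCell', idx='3')
        # find() returns one tag (or None), so its attributes can be
        # read with ['onclick']; find_all() would return a list.
        button = cell.find('button', class_='inspectBut') if cell else None
        if button and button.has_attr('onclick'):
            # Assumption: the ImageId is a 32-char hex string in the handler.
            match = re.search(r'[0-9a-f]{32}', button['onclick'])
            if match:
                print(match.group(0))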

How to web scrape a chart by using Python?

爱⌒轻易说出口 submitted on 2021-01-28 13:42:48
Question: I am trying to scrape a chart from this website into a .csv file using Python 3: 2016 NBA National TV Schedule. The chart starts out like:

    Tuesday, October 25
    8:00 PM Knicks/Cavaliers TNT
    10:30 PM Spurs/Warriors TNT
    Wednesday, October 26
    8:00 PM Thunder/Sixers ESPN
    10:30 PM Rockets/Lakers ESPN

I am using these packages:

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    import numpy as np

The output I want in a .csv file looks like this: These are the first six lines
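
Since pandas is already imported, one common route is pandas.read_html, which pulls every HTML table on a page into DataFrames. A minimal sketch, assuming the schedule is rendered as a <table> element and is the first table on the page; the URL is a placeholder for the schedule page linked in the question:

    import pandas as pd

    # Placeholder: substitute the actual 2016 NBA National TV Schedule URL.
    url = "https://example.com/2016-nba-national-tv-schedule"

    tables = pd.read_html(url)   # one DataFrame per <table> on the page
    schedule = tables[0]         # assumption: the chart is the first table
    schedule.to_csv("nba_schedule.csv", index=False)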

Get response 200 instead of <418 I'm a Teapot>, using DDG

為{幸葍}努か submitted on 2021-01-28 13:33:56
Question: I was trying to scrape search results from DDG the other day, but I keep getting response 418. How can I make it return 200, or otherwise get results from it? This is my code:

    import requests
    from bs4 import BeautifulSoup
    import urllib

    while True:
        query = input("Enter Search Text: ")
        a = query.replace(' ', '+')
        url = 'https://duckduckgo.com/?q=random' + a
        headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0.1; SHIELD Tablet K1 Build/MRA58K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0
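
A 418 here is DuckDuckGo refusing an automated client on its main endpoint. A sketch of one commonly suggested workaround, assuming the JavaScript-free endpoint html.duckduckgo.com/html/ and the result__a link class still behave this way (and respecting the site's terms of use):

    import requests
    from bs4 import BeautifulSoup

    query = input("Enter Search Text: ")
    resp = requests.get(
        "https://html.duckduckgo.com/html/",
        params={"q": query},                  # requests handles URL encoding
        headers={"User-Agent": "Mozilla/5.0"},
    )
    print(resp.status_code)                   # expect 200 rather than 418

    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.select("a.result__a"):   # assumption: result link class
        print(link.get_text(strip=True), link.get("href"))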

Excel VBA Translate IE.Document empty

不羁岁月 submitted on 2021-01-28 13:00:44
Question: This is a VBA script I am using to translate fields in an Excel sheet. The script worked for me about two or three months ago, but now IE.Document is empty after translating. The page comes up with the correct translation, but I can't get the result into my Excel sheet.

    inputstring = "en"
    outputstring = "da"
    text_to_convert = str
    'open website
    IE.Visible = True
    IE.navigate "https://translate.google.com/#" & inputstring & "/" & outputstring & "/" & text_to_convert
    Do Until IE
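
Google Translate's page markup changes regularly, which is the usual reason an IE-automation scrape that once worked starts returning an empty document. One commonly suggested alternative, sketched here in Python with Selenium rather than VBA/IE, is to drive a real browser and wait for the result node to render; the URL query format matches Google Translate's current scheme, but the CSS selector is an assumption that may need updating:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://translate.google.com/?sl=en&tl=da&text=hello&op=translate")

    # Wait until the translated text is rendered; the selector below is a
    # guess and will break whenever Google changes its markup.
    result = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "span[jsname]"))
    )
    print(result.text)
    driver.quit()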

How to log on to my wsj account from linux terminal (using curl, oauth2.0)

孤人 submitted on 2021-01-28 12:52:29
Question: I'm a paid member of WSJ and I want to log onto my account from a Linux terminal so I can write code to scrape some articles for my NLP research. I won't release the data whatsoever. My approach is based on a previous answer, "Scrap articles form wsj by requests, CURL and BeautifulSoup". The main issue is that code which worked back then no longer works: apparently WSJ has adopted a different OAuth 2.0 approach. First, I can no longer obtain connection by running login_url. I
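
For context, the general shape of a session-based OAuth-style login with requests looks like the sketch below. Every URL and form field here is a hypothetical placeholder, not WSJ's actual endpoints, since those are not shown in the question; a real flow would require capturing the current parameters (client_id, state, nonce, and so on) from the browser's network tab:

    import requests

    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"

    # Step 1: GET the login page so the session picks up cookies and any
    # CSRF/state tokens embedded in the page (placeholder URL).
    login_page = session.get("https://example.com/oauth/login")

    # Step 2: POST credentials plus whatever tokens step 1 provided
    # (placeholder URL and field names).
    resp = session.post(
        "https://example.com/oauth/authenticate",
        data={"username": "me@example.com", "password": "secret"},
    )
    print(resp.status_code)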

Get all spider class names in Scrapy

拟墨画扇 submitted on 2021-01-28 12:45:49
Question: In an older version we could get the list of spiders (spider names) with the following code, but in the current version (1.4) I get:

    [py.warnings] WARNING: run-all-spiders.py:17: ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader.
      for spider_name in process.spiders.list(): # list all the available spiders in my project

Use crawler.spiders.list():

    >>> for spider_name in crawler.spiders.list():
    ...     print(spider_name)

How can I get spiders
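
The deprecation warning itself names the replacement: the spiders attribute became spider_loader. A minimal sketch following that rename:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    # CrawlerRunner.spiders was renamed to CrawlerRunner.spider_loader,
    # so list the project's spider names through the loader instead.
    for spider_name in process.spider_loader.list():
        print(spider_name)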

Scrapy - NameError: name 'items' is not defined

浪子不回头ぞ submitted on 2021-01-28 12:21:00
Question: I'm trying to fill my Items with parsed data, and when I run scrapy crawl usa_florida_scrapper I'm getting the error:

    item = items()
    NameError: name 'items' is not defined

Here's my spider's code:

    import scrapy
    import re

    class UsaFloridaScrapperSpider(scrapy.Spider):
        name = 'usa_florida_scrapper'
        start_urls = ['https://www.txlottery.org/export/sites/lottery/Games/index.html']

        def parse(self, response):
            item = items()
            print('++++++ Latest Results for Powerball ++++++++++')
            power_ball_html =
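
The NameError occurs because nothing named items is ever imported or defined in the spider module. The usual fix is to import the Item class declared in the project's items.py and instantiate that; the class name below is a hypothetical placeholder for whatever the project actually defines:

    import scrapy
    # Hypothetical class name: replace with the Item subclass declared in
    # your project's items.py (an absolute import such as
    # "from myproject.items import ..." also works).
    from ..items import UsaFloridaScrapperItem

    class UsaFloridaScrapperSpider(scrapy.Spider):
        name = 'usa_florida_scrapper'
        start_urls = ['https://www.txlottery.org/export/sites/lottery/Games/index.html']

        def parse(self, response):
            # Instantiate the imported Item class, not the module name.
            item = UsaFloridaScrapperItem()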