web-scraping

How to scrape Instagram with BeautifulSoup

Submitted by 回眸只為那壹抹淺笑 on 2021-01-16 08:11:55
Question: I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted that the pictures are in an unordered list whose li elements have class 'photo', so I figured, what the hell, it can't be that hard to scrape with findAll, right? Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the HTML I retrieved with requests were not the same, i.e. there was no unordered list in the
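The root cause is that requests returns the server's initial HTML, before any JavaScript runs, while Chrome's element inspector shows the DOM after Instagram's scripts have built the gallery. Below is a minimal sketch of the mismatch, plus one common workaround that lets a real browser render the page first; the Selenium step, the example account URL, and the img-based extraction are assumptions, not from the original post, and Instagram may also demand a login:

import requests
from bs4 import BeautifulSoup

# Raw HTML as the server sends it, before any JavaScript executes.
html = requests.get("https://www.instagram.com/instagram/",
                    headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

# Reproduces what the asker saw: the selector taken from Chrome's
# inspector matches nothing, because the gallery is built client-side.
print(soup.find_all("li", class_="photo"))  # -> []

# Workaround sketch: let a real browser run the JavaScript, then parse
# the rendered DOM. Needs selenium and a chromedriver on PATH.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.instagram.com/instagram/")
rendered = BeautifulSoup(driver.page_source, "html.parser")
picture_urls = [img["src"] for img in rendered.find_all("img") if img.get("src")]
driver.quit()
print(picture_urls[:5])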

Unable to access the remaining elements by XPath in a loop after accessing the first element - Web scraping, Selenium, Python

Submitted by 本小妞迷上赌 on 2021-01-14 23:42:08
Question: I'm trying to scrape data from the ScienceDirect website. I'm trying to automate the scraping process by accessing the journal issues one after the other, creating a list of XPaths and looping over them. When I run the loop, I am unable to access the rest of the elements after accessing the first journal. This process worked for me on another website, but not on this one. I would also like to know whether there is a better way to access these elements.

# Importing libraries
import requests
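A likely explanation for "works once, then fails" is that the stored element references go stale as soon as the driver navigates away from the listing page. A hedged sketch of the usual fix, collecting plain href strings up front and navigating by URL; the XPath and the ScienceDirect landing URL here are placeholders, not taken from the post:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.sciencedirect.com/")  # placeholder: the journal's issue listing page

# Extract plain strings immediately: unlike WebElements, strings cannot
# go stale when the browser navigates to another page.
issue_urls = [a.get_attribute("href")
              for a in driver.find_elements(By.XPATH, "//a[contains(@class, 'issue-link')]")]

for url in issue_urls:
    driver.get(url)  # navigate with the stored URL, not a stale element
    # ... scrape the issue page here ...

driver.quit()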

Unable to make my script stop when some urls are scraped

Submitted by 依然范特西╮ on 2021-01-14 17:23:28
Question: I've created a script in Scrapy to parse the titles of the different sites listed in start_urls. The script is doing its job flawlessly. What I wish to do now is make my script stop after two of the URLs are parsed, no matter how many URLs there are. What I've tried so far:

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]

    def parse(self, response):
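One approach that fits this setup is Scrapy's CloseSpider exception: count the parsed responses and raise it once the limit is reached. A sketch built on the spider from the question; the title-yielding line is an assumption about what parse should emit, and requests already in flight when the exception is raised may still complete:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    parsed_count = 0

    def parse(self, response):
        self.parsed_count += 1
        # Assumed payload: the question says the spider parses titles.
        yield {"title": response.css("title::text").get()}
        if self.parsed_count >= 2:
            # Ask the engine to shut the spider down gracefully.
            raise CloseSpider("two urls parsed")

if __name__ == "__main__":
    process = CrawlerProcess({"USER_AGENT": "Mozilla/5.0"})
    process.crawl(TitleSpider)
    process.start()

Scrapy also ships a CLOSESPIDER_PAGECOUNT setting that closes the spider after a fixed number of responses have been crawled, which avoids the manual counter.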

Limited number of scraped items?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-07 02:51:06
Question: I am scraping a website, and everything seems to work fine from today's news back to news published in 2015/2016. For anything older than that, I am not able to scrape news. Could you please tell me if anything has changed? Starting from this page, https://catania.liveuniversity.it/attualita/, I should get 672 pages of titles and snippets, but I get approximately 158. The code that I am using is:

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11
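Without the full code it is hard to say what changed, but a common cause of this symptom is a pagination loop that stops early when one page times out or returns an unexpected layout. A hedged sketch of a loop that walks the archive page by page and stops only on a 404 or an empty page; the /page/N/ URL scheme and the article/h2 selectors are assumptions about this site's WordPress-style layout:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
base_url = "https://catania.liveuniversity.it/attualita/page/{}/"

titles = []
page = 1
while True:
    resp = requests.get(base_url.format(page), headers=headers, timeout=30)
    if resp.status_code == 404:  # archives typically return 404 past the last page
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    articles = soup.find_all("article")
    if not articles:             # stop on an empty page, not a hard-coded count
        break
    for art in articles:
        heading = art.find("h2")
        if heading:
            titles.append(heading.get_text(strip=True))
    page += 1

print(len(titles), "titles collected over", page - 1, "pages")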

Scrapy does not find text in XPath or CSS

Submitted by 我怕爱的太早我们不能终老 on 2021-01-07 02:19:03
Question: I've been at this one for a few days, and no matter what I try, I cannot get Scrapy to extract text that is in one element. To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.

from scrapy.selector import Selector

start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"

#BASIC ITEM AND SPIDER YADA, SPARE
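When a spider grabs everything on a page except one piece of text, that text is often not in the served HTML at all: it is injected by JavaScript from JSON embedded in a script tag, so neither XPath nor CSS selectors over the rendered markup will find it. A hedged sketch of checking the script payloads instead; the "description" JSON key is a guess at where this text might live, not something confirmed by the post:

import re
import requests
from scrapy.selector import Selector

start_url = ("https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-"
             "On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-"
             "Manasota_Key_F.html")

html = requests.get(start_url, headers={"User-Agent": "Mozilla/5.0"}).text
sel = Selector(text=html)

# Search the inline <script> payloads rather than the rendered DOM.
for script in sel.xpath("//script/text()").getall():
    match = re.search(r'"description"\s*:\s*"([^"]*)"', script)
    if match:
        print(match.group(1))
        break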