web-scraping

How to scrape Instagram with BeautifulSoup

Submitted by 回眸只為那壹抹淺笑 on 2021-01-16 08:11:55
Question: I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted that the pictures are in an unordered list whose li elements have class 'photo', so I figured, what the hell, it can't be that hard to scrape with findAll, right? Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the HTML I retrieved with requests were not the same, i.e. there was no unordered list in the
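The root cause is that requests returns the server's initial HTML, before any JavaScript runs, while Chrome's element inspector shows the DOM after Instagram's scripts have built the gallery. Below is a minimal sketch of the mismatch, plus one common workaround that lets a real browser render the page first; the Selenium step, the example account URL, and the img-based extraction are assumptions, not from the original post, and Instagram may also demand a login:

import requests
from bs4 import BeautifulSoup

# Raw HTML as the server sends it, before any JavaScript executes.
html = requests.get("https://www.instagram.com/instagram/",
                    headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

# Reproduces what the asker saw: the selector taken from Chrome's
# inspector matches nothing, because the gallery is built client-side.
print(soup.find_all("li", class_="photo"))  # -> []

# Workaround sketch: let a real browser run the JavaScript, then parse
# the rendered DOM. Needs selenium and a chromedriver on PATH.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.instagram.com/instagram/")
rendered = BeautifulSoup(driver.page_source, "html.parser")
picture_urls = [img["src"] for img in rendered.find_all("img") if img.get("src")]
driver.quit()
print(picture_urls[:5])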

Unable to access the remaining elements by XPath in a loop after accessing the first element - Web scraping, Selenium, Python

Submitted by 本小妞迷上赌 on 2021-01-14 23:42:08
Question: I'm trying to scrape data from the ScienceDirect website. I'm trying to automate the scraping process by accessing the journal issues one after the other, creating a list of XPaths and looping over them. When I run the loop, I am unable to access the rest of the elements after accessing the first journal. This process worked for me on another website, but not on this one. I would also like to know whether there is a better way to access these elements.

# Importing libraries
import requests
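A likely explanation for "works once, then fails" is that the stored element references go stale as soon as the driver navigates away from the listing page. A hedged sketch of the usual fix, collecting plain href strings up front and navigating by URL; the XPath and the ScienceDirect landing URL here are placeholders, not taken from the post:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.sciencedirect.com/")  # placeholder: the journal's issue listing page

# Extract plain strings immediately: unlike WebElements, strings cannot
# go stale when the browser navigates to another page.
issue_urls = [a.get_attribute("href")
              for a in driver.find_elements(By.XPATH, "//a[contains(@class, 'issue-link')]")]

for url in issue_urls:
    driver.get(url)  # navigate with the stored URL, not a stale element
    # ... scrape the issue page here ...

driver.quit()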

Unable to make my script stop when some urls are scraped

Submitted by 依然范特西╮ on 2021-01-14 17:23:28
Question: I've created a script in Scrapy to parse the titles of the different sites listed in start_urls. The script is doing its job flawlessly. What I wish to do now is make my script stop after two of the URLs are parsed, no matter how many URLs there are. What I've tried so far:

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]

    def parse(self, response):
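One approach that fits this setup is Scrapy's CloseSpider exception: count the parsed responses and raise it once the limit is reached. A sketch built on the spider from the question; the title-yielding line is an assumption about what parse should emit, and requests already in flight when the exception is raised may still complete:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    parsed_count = 0

    def parse(self, response):
        self.parsed_count += 1
        # Assumed payload: the question says the spider parses titles.
        yield {"title": response.css("title::text").get()}
        if self.parsed_count >= 2:
            # Ask the engine to shut the spider down gracefully.
            raise CloseSpider("two urls parsed")

if __name__ == "__main__":
    process = CrawlerProcess({"USER_AGENT": "Mozilla/5.0"})
    process.crawl(TitleSpider)
    process.start()

Scrapy also ships a CLOSESPIDER_PAGECOUNT setting that closes the spider after a fixed number of responses have been crawled, which avoids the manual counter.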

Limited number of scraped items?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-07 02:51:06
Question: I am scraping a website, and everything seems to work fine from today's news back to news published in 2015/2016. For anything older than that, I am not able to scrape news. Could you please tell me if anything has changed? Starting from this page, https://catania.liveuniversity.it/attualita/, I should get 672 pages of titles and snippets, but I get approximately 158. The code that I am using is:

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11
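Without the full code it is hard to say what changed, but a common cause of this symptom is a pagination loop that stops early when one page times out or returns an unexpected layout. A hedged sketch of a loop that walks the archive page by page and stops only on a 404 or an empty page; the /page/N/ URL scheme and the article/h2 selectors are assumptions about this site's WordPress-style layout:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
base_url = "https://catania.liveuniversity.it/attualita/page/{}/"

titles = []
page = 1
while True:
    resp = requests.get(base_url.format(page), headers=headers, timeout=30)
    if resp.status_code == 404:  # archives typically return 404 past the last page
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    articles = soup.find_all("article")
    if not articles:             # stop on an empty page, not a hard-coded count
        break
    for art in articles:
        heading = art.find("h2")
        if heading:
            titles.append(heading.get_text(strip=True))
    page += 1

print(len(titles), "titles collected over", page - 1, "pages")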

Scrapy does not find text in XPath or CSS

Submitted by 我怕爱的太早我们不能终老 on 2021-01-07 02:19:03
Question: I've been at this one for a few days, and no matter what I try, I cannot get Scrapy to extract text that is in one element. To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.

from scrapy.selector import Selector

start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"

#BASIC ITEM AND SPIDER YADA, SPARE
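When a spider grabs everything on a page except one piece of text, that text is often not in the served HTML at all: it is injected by JavaScript from JSON embedded in a script tag, so neither XPath nor CSS selectors over the rendered markup will find it. A hedged sketch of checking the script payloads instead; the "description" JSON key is a guess at where this text might live, not something confirmed by the post:

import re
import requests
from scrapy.selector import Selector

start_url = ("https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-"
             "On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-"
             "Manasota_Key_F.html")

html = requests.get(start_url, headers={"User-Agent": "Mozilla/5.0"}).text
sel = Selector(text=html)

# Search the inline <script> payloads rather than the rendered DOM.
for script in sel.xpath("//script/text()").getall():
    match = re.search(r'"description"\s*:\s*"([^"]*)"', script)
    if match:
        print(match.group(1))
        break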