Question
I am trying to scrape the LinkedIn profiles of some well-known people. The code takes a list of LinkedIn profile URLs and then uses Selenium and scrape_linkedin to collect the information and save it into a folder as a .json file.
The problem I am running into is that LinkedIn naturally blocks the scraper from collecting some profiles. I am always able to get the first profile in the list of URLs, which I put down to the fact that it opens a new Google Chrome window and then goes to the LinkedIn page. (I could be wrong on this point, however.)
What I would like to do is add a line to the for loop that opens a new Google Chrome session and, once the scraper has collected the data, closes that session, so that the next iteration of the loop opens up a fresh Google Chrome session.
From the package website, the documentation states:
driver {selenium.webdriver}: driver type to use
default: selenium.webdriver.Chrome
Looking at the Selenium package website I see:
driver = webdriver.Firefox()
...
driver.close()
So Selenium does have a close() option.
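For reference, a minimal sketch of that lifecycle with Chrome (note that close() only closes the current window, while quit() ends the whole session and shuts down the driver process):
from selenium import webdriver

driver = webdriver.Chrome()             # opens a fresh Chrome session
driver.get('https://www.linkedin.com')  # navigate to a page
driver.close()                          # closes the current window only
# driver.quit()                         # ends the session and the driver process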
How can I add opening and closing the Google Chrome browser to the for loop?
I have tried alternative methods to collect the data, such as increasing the time.sleep() to 10 minutes and changing the scroll_increment and scroll_pause, but it still does not download the whole profile after the first one has been collected.
Code:
from datetime import datetime
from scrape_linkedin import ProfileScraper
import pandas as pd
import json
import os
import re
import time
my_profile_list = ['https://www.linkedin.com/in/williamhgates/', 'https://www.linkedin.com/in/christinelagarde/', 'https://www.linkedin.com/in/ursula-von-der-leyen/']
# To get LI_AT key
# Navigate to www.linkedin.com and log in
# Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
# Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
# Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
# Find and copy the li_at value
myLI_AT_Key = 'INSERT LI_AT Key'
with ProfileScraper(cookie=myLI_AT_Key, scroll_increment=50, scroll_pause=0.8) as scraper:
    for link in my_profile_list:
        print('Currently scraping: ', link, 'Time: ', datetime.now())
        profile = scraper.scrape(url=link)
        dataJSON = profile.to_dict()
        profileName = re.sub('https://www.linkedin.com/in/', '', link)
        profileName = profileName.replace("?originalSubdomain=es", "")
        profileName = profileName.replace("?originalSubdomain=pe", "")
        profileName = profileName.replace("?locale=en_US", "")
        profileName = profileName.replace("?locale=es_ES", "")
        profileName = profileName.replace("?originalSubdomain=uk", "")
        profileName = profileName.replace("/", "")
        with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
            json.dump(dataJSON, json_file)
        time.sleep(10)
print('The first observation scraped was:', my_profile_list[0])
print('The last observation scraped was:', my_profile_list[-1])
print('END')
Answer 1:
Here is a way to open and close the browser on each iteration.
from datetime import datetime
from scrape_linkedin import ProfileScraper
import random #new import made
from selenium import webdriver #new import made
import pandas as pd
import json
import os
import re
import time
my_profile_list = ['https://www.linkedin.com/in/williamhgates/', 'https://www.linkedin.com/in/christinelagarde/',
                   'https://www.linkedin.com/in/ursula-von-der-leyen/']
myLI_AT_Key = 'INSERT LI_AT Key'
for link in my_profile_list:
    my_driver = webdriver.Chrome()  # if chromedriver is not on your PATH, use the next line instead
    # my_driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")
    # Pass our driver as the driver to be used by scrape_linkedin.
    # You can also create driver options and pass them as an argument.
    # scroll_increment and scroll_pause are made a little random.
    ps = ProfileScraper(cookie=myLI_AT_Key, scroll_increment=random.randint(10, 50),
                        scroll_pause=0.8 + random.uniform(0.8, 1), driver=my_driver)
    print('Currently scraping: ', link, 'Time: ', datetime.now())
    profile = ps.scrape(url=link)
    dataJSON = profile.to_dict()
    profileName = re.sub('https://www.linkedin.com/in/', '', link)
    profileName = profileName.replace("?originalSubdomain=es", "")
    profileName = profileName.replace("?originalSubdomain=pe", "")
    profileName = profileName.replace("?locale=en_US", "")
    profileName = profileName.replace("?locale=es_ES", "")
    profileName = profileName.replace("?originalSubdomain=uk", "")
    profileName = profileName.replace("/", "")
    with open(os.path.join(os.getcwd(), 'ScrapedLinkedInprofiles', profileName + '.json'), 'w') as json_file:
        json.dump(dataJSON, json_file)
    time.sleep(10 + random.randint(0, 5))  # added randomness to the sleep time
    # This closes your browser at the end of every iteration.
    my_driver.quit()
print('The first observation scraped was:', my_profile_list[0])
print('The last observation scraped was:', my_profile_list[-1])
print('END')
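As a side note on the file names: instead of stripping each ?originalSubdomain=... / ?locale=... suffix with a separate replace(), you could drop the query string generically with the standard library. A minimal sketch, where profile_name is just a hypothetical helper (not part of scrape_linkedin):
from urllib.parse import urlsplit

def profile_name(link):
    # urlsplit().path keeps only the path, dropping any query string
    # such as ?originalSubdomain=es or ?locale=en_US.
    path = urlsplit(link).path              # e.g. '/in/williamhgates/'
    return path.rstrip('/').split('/')[-1]  # e.g. 'williamhgates'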
This scraper uses Chrome as the browser by default, but it also gives you the freedom to choose which browser to use everywhere it is possible, such as in CompanyScraper, ProfileScraper, etc.
I have just changed the default arguments passed when initializing the ProfileScraper() class, made your own driver open and close the browser rather than the default one, and added some random time to the wait/sleep intervals as you had requested. You can tweak the random noise I have added to your comfort.
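If you want to pass your own options to the browser (as the comment in the loop above mentions), you can build them with Selenium's ChromeOptions and hand the resulting driver to ProfileScraper; a minimal sketch, with the arguments shown here only as examples:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')        # example: open the window maximized
options.add_argument('--disable-notifications')  # example: suppress notification prompts
my_driver = webdriver.Chrome(options=options)    # use this driver in the loop above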
There is no need to use scrape_in_parallel() as I had suggested in my comments, but if you want to, you can define the number of browser instances (num_instances) you want to run, along with your own dictionary of drivers, each with its own options (in another dictionary):
from scrape_linkedin import scrape_in_parallel, CompanyScraper
from selenium import webdriver
driver1 = webdriver.Chrome()
driver2 = webdriver.Chrome()
driver3 = webdriver.Chrome()
driver4 = webdriver.Chrome()
my_drivers = [driver1,driver2,driver3,driver4]
companies = ['facebook', 'google', 'amazon', 'microsoft', ...]
driver_dict = {}
for i in range(1, len(my_drivers) + 1):
    driver_dict[i] = my_drivers[i - 1]
# Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4,
    driver=driver_dict
)
It's open source, and since it's written solely in Python you can understand the source code very easily. It's quite an interesting scraper, thank you for letting me know about it too!
NOTE:
There are some concerning unresolved issues with this module, as noted in its GitHub Issues tab. If this doesn't work properly, I would wait for a few more forks and updates if I were you.
Source: https://stackoverflow.com/questions/65531036/adding-an-open-close-google-chrome-browser-to-selenium-linkedin-scraper-code