How can I scrape the title of different jobs from a website using requests?

与世无争的帅哥 submitted on 2020-03-22 04:54:47

Question


I'm trying to write a Python script that uses the requests module to scrape the titles of different jobs from a website. To parse the titles, I first need to get the relevant response from the site so that I can process the content with BeautifulSoup. However, when I run the following script, the response is gibberish that does not contain the titles I'm looking for.

website link (In case you don't see any data, make sure to refresh the page)

I've tried with:

import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    s.headers.update({"Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="})
    res = s.get(link,params=query_string)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        print(item.text)

I also tried it like this:

import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urlencode

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="  
}

def get_content(url,params):
    req = urllib.request.Request(f"{url}{params}",headers=headers)
    res = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(res,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        yield item.text

if __name__ == '__main__':
    params = urlencode(query_string)
    for item in get_content(link,params):
        print(item)

How can I fetch the titles of the different jobs using requests?

PS: Using a browser simulator is not an option for this task.


Answer 1:


To get the expected response, you have to send cookies. For this URL, the rbzid cookie alone is enough. You can obtain it manually; if it expires, you can implement a solution using Selenium and a proxy server to refresh it, and then continue scraping with requests.

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.122 Safari/537.36'
cookies = {
    'rbzid': 'DGF6ckG9dPQkJ0RhPIIqCu2toGvky84UY2z7QpJln31JVdw/YU4wJ7WXe5Tom9VhEvsZT6PikTaeZjJfsKwp'
             'M1TaCZr6tOHaOtE8jX3eWsFX5Zm8TJLeO8+O2fFfTHBf++lRgo/NaYq/sXh+QobO59zQRmZQd0XMjTSpVMDu'
             'YZS8C3GMsIR8cBt9gyuDCYD2XL8pVz68fD4OqBep3G/LnKR4bQsMiLHwKjglQ4fBrq8=',
}
headers = {'User-Agent': user_agent, }
params = (
    ('page', '1'),
    ('position', '235'),
    ('type', ''),
    ('city', ''),
    ('region', ''),
)

response = requests.get('https://www.alljobs.co.il/SearchResultsGuest.aspx',
                        headers=headers, params=params, cookies=cookies)

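soup = BeautifulSoup(response.text, "lxml")
# Collect every link that carries a title attribute, then print its text
titles = soup.select("a[title]")
for title in titles:
    print(title.text)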
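
If the cookie is obtained manually, one convenient route is to copy the whole Cookie request header from the browser's developer tools and convert it into a dict. The following is only a minimal sketch of that conversion: the helper name `cookie_header_to_dict` and the sample header value are hypothetical and not part of the original answer.

```python
def cookie_header_to_dict(header):
    """Turn a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in header.split(";"):
        # Split on the first '=' only, so values that contain '=' stay intact
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder header for illustration; a real one would be copied from the browser
raw_header = "rbzid=abc/def+g=; other=1"
print(cookie_header_to_dict(raw_header))
```

The resulting dict can be passed as the `cookies` argument of `requests.get`, in the same way as the hard-coded `cookies` dict above.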



Answer 2:


I'd like to see what your gibberish looks like. When I ran your code, I got a bunch of Hebrew characters (unsurprising, since the website is in Hebrew) and job titles:

לחברת הייטק מובילה, IT project manager דרושים AllStars-IT Group (MT) אלעד מערכות מגייסת מפתח /ת JAVA לגוף רפואי גדול היושב בתל אביב! דרושים אלעד מערכות מנתח /ת מערכות ומאפיין /ת דרושים מרטנס הופמן שירותי מחשוב אנשי /נשות תפעול ותמיכה טכנית למוצר אינטרנטי דרושים המימד השלישי DBA SQL /ORACLE דרושים CPS Jobs דרושים /ות אנשי /נשות תמיכה על מערכת פריוריטי, שכר מתגמל למתאימים /ות דרושים חבר הון אנושי מפתח /ת SAP ABAP דרושים טאואר סמיקונדקטור דרוש /ה Director of Data analytics דרושים אופיסופט Fullstack Developer דרושים SQLink מפתח /ת תשתיות דאטה ותומך תשתית BI דרושים המימד השביעי בע"מ מפתח /ת תשתיות דאטה ותומך /ת תשתית BI דרושים יוניטסק לארגון בעל משמעות גבוהה דרוש /ה תוכניתן /ית ABAP דרושים יוניטסק לחברת טלדור דרוש /ה ארכיטקט /ית למערכת פיקוד ובקרה עבור ארגון גדול בתל אביב דרושים טלדור Taldor מערכות מחשבים דרוש /ה מפתח /ת אינטגרציה דרושים SQLink דרוש /ה ראש צוות Full stack מתכנת /ת Senior Software Engineer Manager Senior Software Engineer Senior Embedded Software Engineer Embedded Software Engineer Senior Software Engineer Subsidiary PMM Manager תוכניתן /ית BackEnd Full Stack /Frontend Software Engineer Software Validation Engineer Principal Product Manager Quantum Algorithms Research intern Principal/Senior Detection Team Lead Support Engineer Software Engineer

Is your problem that you want to filter out the Hebrew characters? If so, a simple regular expression is enough. Import the re module, and then replace your print statement with this:

print(re.sub('[^A-Za-z0-9]+', ' ', item.text))
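
As a small offline illustration of that substitution, the sketch below runs it on one title taken from the output above. The helper name `keep_ascii_alnum` is hypothetical. Note that a range written as `A-z` also covers the punctuation that sits between `Z` and `a` in ASCII (such as `[`, `]` and `_`), so the sketch spells the class as `A-Za-z0-9`.

```python
import re


def keep_ascii_alnum(text):
    """Replace each run of characters outside A-Z, a-z, 0-9 with one space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


sample = "דרוש /ה Director of Data analytics דרושים"
print(keep_ascii_alnum(sample))  # Director of Data analytics

# The wider range 'A-z' leaves brackets and underscores untouched
print(re.sub("[^A-z0-9]+", " ", "a_b[c]"))  # a_b[c]
print(keep_ascii_alnum("a_b[c]"))           # a b c
```

This also removes hyphens and other punctuation inside English titles, so it only suits cases where plain ASCII words are all that is needed.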

Hope this helps!



Source: https://stackoverflow.com/questions/60486625/how-can-i-scrape-the-title-of-different-jobs-from-a-website-using-requests
