Web scraping using Beautiful Soup

Submitted by 左心房为你撑大大i on 2019-12-24 20:48:58

Question


How can I get all the categories mentioned on each listing page of the website "https://www.sfma.org.sg/member/category"? For example, when I choose the Alcoholic Beverage category on the page above, the listings on that page carry category information like this:

Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier

How can I extract the categories mentioned here into the same variable?

The code I have written for this is:

  category = soup_2.find_all('a', attrs={'class': 'plink'})
  links = [link['href'] for link in category]

but it is producing the output below, which is all the links on the page and not the text within the href:

['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
 'http://www.sfma.org.sg/about/council-members',
 'http://www.sfma.org.sg/about/history-and-milestones',
 'http://www.sfma.org.sg/membership/',
 'http://www.sfma.org.sg/member/',
 'http://www.sfma.org.sg/member/alphabet/',
 'http://www.sfma.org.sg/member/category/',
 'http://www.sfma.org.sg/resources/sme-portal',
 'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
 'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
 'http://www.sfma.org.sg/resources/labelling-guidelines',
 'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
 'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
 'http://www.sfma.org.sg/resources/p-max',
 'http://www.sfma.org.sg/event/',
  .....]

Please excuse me if the question seems novice; I am just very new to Python.

Thanks!


Answer 1:


If you just want the links out of the results you already posted, you can get that like this:

import requests 
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
links = soup.find_all('a', attrs={'class': 'plink'})
for link in links:
    print(link['href'])

Output:

../info/{{permalink}}
http://www.sfma.org.sg/about/singapore-food-manufacturers-association
http://www.sfma.org.sg/about/council-members
http://www.sfma.org.sg/about/history-and-milestones
http://www.sfma.org.sg/membership/
http://www.sfma.org.sg/member/
http://www.sfma.org.sg/member/alphabet/
http://www.sfma.org.sg/member/category/
http://www.sfma.org.sg/resources/sme-portal
http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore
http://www.sfma.org.sg/resources/import-export-requirements-and-procedures
http://www.sfma.org.sg/resources/labelling-guidelines
http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes
http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard
http://www.sfma.org.sg/resources/p-max
http://www.sfma.org.sg/event/
http://www.sfma.org.sg/news/
http://www.fipa.com.sg/
http://www.sfma.org.sg/stp
http://www.sgfoodgifts.sg/

However, if you want the links to each of the entries on the website, you need to join the permalink values with the base URL. I've extended the answer from nag to get the data you want from the site you are looking at. Some permalink values appear in a second list and don't resolve (they are food/beverage types rather than companies), so I'm removing them.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re


page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')

url_list = []
pattern = re.compile(r"permalink:'(.*?)'")

script_sections = soup.find_all('script')
for script in script_sections:
    if len(script.contents) >= 1:
        txt = script.contents[0]
        permalinks = re.findall(pattern, txt)
        for permalink in permalinks:
            # the page's own links use the relative form ../info/<permalink>
            full_url = urljoin(page, "../info/" + permalink)
            if full_url in url_list:
                # drop the repeated extras (the second, non-company list)
                url_list.remove(full_url)
            else:
                url_list.append(full_url)

for url in url_list:
    print(url)

Output (truncated):

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd
https://www.sfma.org.sg/member/info/acez-instruments-pte-ltd
https://www.sfma.org.sg/member/info/acorn-investments-holding-pte-ltd
https://www.sfma.org.sg/member/info/ad-wright-communications-pte-ltd
https://www.sfma.org.sg/member/info/added-international-s-pte-ltd
https://www.sfma.org.sg/member/info/advance-carton-pte-ltd
https://www.sfma.org.sg/member/info/agroegg-pte-ltd
https://www.sfma.org.sg/member/info/airverclean-pte-ltd
...



Answer 2:


You need to get the permalink values from the script using a regex and join them with the base URL. Here is a sample:

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.sfma.org.sg/member/category/manufacturer'

script_txt = """<script>
        var tmObject = {'tmember':[{id:'1',begin_with:'0-9',name:'1A Catering Pte Ltd',category:'22,99',mem_type:'1',permalink:'1a-catering-pte-ltd'},{id:'330',begin_with:'A',name:'A-Linkz Marketing Pte Ltd',category:'3,4,10,14,104,28,40,43,45,49,51,52,63,66,73,83,95,96',mem_type:'1',permalink:'a-linkz-marketing-pte-ltd'},{id:'318',begin_with:'A',name:'Aalst Chocolate Pte Ltd',category:'30,82,83,84,95,97',mem_type:'1',permalink:'aalst-chocolate-pte-ltd'},{id:'421',begin_with:'A',name:'ABB Pte Ltd',category:'86,127,90,92,97,100',mem_type:'3',permalink:'abb-pte-ltd'},{id:'2',begin_with:'A',name:'Ace Synergy International Pte Ltd',category:'104,27,31,59,83,86,95',mem_type:'1',permalink:'ace-synergy-international-pte-ltd'}
        </script>"""

soup = BeautifulSoup(script_txt, 'html.parser')

txt = soup.script.get_text()
pattern = re.compile(r'permalink:\'(.*?)\'}')

permlinks = re.findall(pattern, txt)
for i in permlinks:
    href = "../info/{{permalink}}"
    href = href.split('{')[0]+i
    print(urljoin(base, href))

Output:

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd



Answer 3:


To get the correct total of 240 links for manufacturer (and to be able to count all categories, or any given category):

If you want just the manufacturer listings, first look at the page and check how many links there should be. By anchoring the css selector on the class of the parent ul, i.e. .w3-ul, and adding the child class selector .plink, we limit the match to just the appropriate links. So, we expect 240 links on the page.


If we simply used that selector on the HTML returned by requests, we would find far fewer, as many links are added dynamically by javascript, which does not run with requests.
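The effect of anchoring on the parent class can be checked offline; a minimal sketch over an invented HTML fragment (the real page's markup may differ in detail):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the page's structure: a ul.w3-ul parent
# with a.plink children, plus an unrelated plink link elsewhere on the page.
html = """
<ul class="w3-ul">
  <li><a class="plink" href="../info/a">Company A</a></li>
  <li><a class="plink" href="../info/b">Company B</a></li>
</ul>
<a class="plink" href="/about/">nav link</a>
"""
soup = BeautifulSoup(html, 'html.parser')
listing_links = soup.select('.w3-ul .plink')  # parent class limits to listing links
print(len(listing_links))  # 2 -- the standalone nav link is excluded
```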

However, all links (for all dropdown selections, not just manufacturing) are present in a javascript object within a script tag.


We can regex out this object using the following expression:

var tmObject = (.*?);


Now, when we inspect the returned string, we can see that it has unquoted keys (and single-quoted values), which will pose a problem if we try to read it with the standard json library. We can use the hjson library for parsing instead, as it allows unquoted keys (pip install hjson).
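To see why the standard json module cannot read this object, a minimal sketch with an invented fragment in the same unquoted-key style:

```python
import json

# Invented fragment in the style of the page's tmObject:
# unquoted keys and single-quoted values, which are not valid JSON.
raw = "{tmember: [{id:'1', permalink:'1a-catering-pte-ltd'}]}"

try:
    json.loads(raw)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

print(is_valid_json)  # False -- hjson.loads(raw) accepts this form instead
```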


Finally, we know we have all listings and not just manufacturers; inspecting the tags in the original html we can determine that the manufacturers tag is associated with group code 97.


So, I extract both the links and the groups from the json object as a list of tuples. I split the groups on the "," so I can use in to filter for the appropriate manufacturing code:

all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]
manufacturers = [item[0] for item in all_results if '97' in item[1]]

Checking the final len of the list, we get our target of 240.

So, we have all_results (all categories), a way to split by category, as well as a worked example for manufacturer.


import requests
import re
import hjson

base = 'https://www.sfma.org.sg/member/info/'
p = re.compile(r'var tmObject = (.*?);')
r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
data = hjson.loads(p.findall(r.text)[0])
all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]  # manufacturer is category 97
manufacturers = [item[0] for item in all_results if '97' in item[1]]
print(manufacturers)



Answer 4:


The links you are looking for are populated by a script (look at the response for https://www.sfma.org.sg/member/category/manufacturer in Chrome -> Inspect -> Network). If you look at the page, you'll see the script that loads them. Instead of scraping links, scrape the script tags; they contain the list. Then, since the link format is known, plug in the values from the json. Et voila! Here is starter code to work with; you may infer the rest.

import requests 
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
scripts = soup.find_all('script')
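One way to infer the rest, mirroring the regex approach in the other answers, is to pull the permalink values out of the script text and join them onto the page URL; a sketch over an invented script fragment (the real page embeds many such records):

```python
import re
from urllib.parse import urljoin

# Invented sample of the inline script content found on the page.
script_text = ("var tmObject = {'tmember':["
               "{id:'1',permalink:'1a-catering-pte-ltd'},"
               "{id:'330',permalink:'a-linkz-marketing-pte-ltd'}]};")

page = "https://www.sfma.org.sg/member/category/manufacturer"
permalinks = re.findall(r"permalink:'(.*?)'", script_text)
# The page's own links use the relative form ../info/<permalink>
links = [urljoin(page, "../info/" + p) for p in permalinks]
print(links)
```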



Source: https://stackoverflow.com/questions/57212776/web-scrapping-using-beautiful-soup
