Extract specific columns from a given webpage

Question

I am trying to read web pages using Python and save the data in CSV format so that it can be imported as a pandas DataFrame.

I have the following code, which extracts the links from all the pages; instead, I want to read only certain column fields.

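from urllib.request import urlopen

from bs4 import BeautifulSoup

for i in range(10):
    # Workshop pages are numbered 000, 001, ...
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urlopen(url).read()
    except OSError:
        # Skip pages that cannot be fetched (HTTP errors, timeouts)
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # Print the text of the first 9 'col-xs-8' divs on each page
    for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
        print(i, anchor.text)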

Can I save these 9 columns as a pandas DataFrame with the following column names?

df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

Answer 1:

This returns the correct results for the first 10 pages, but it takes a long time for 100 pages. Any suggestions to make it faster?

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

finallist = []
for i in range(10):
    # Workshop pages are numbered 000, 001, ...
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urlopen(url).read()
    except OSError:
        # Skip pages that cannot be fetched (HTTP errors, timeouts)
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # The first 9 'col-xs-8' divs hold the field values for one workshop
    fields = soup.find_all('div', {'class': 'col-xs-8'})[:9]
    if len(fields) == 9:
        finallist.append([field.text.strip() for field in fields])

df = pd.DataFrame(finallist)

df.columns = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

# Convert the text columns that hold dates and counts to proper dtypes
df['Date'] = pd.to_datetime(df['Date'])
df['participants'] = df['participants'].astype(int)
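
On the speed question above: most of the time in a loop like this is probably spent waiting on the network, so one option is to download the pages concurrently and parse them afterwards. The following is only a rough sketch using the standard-library thread pool. The helper names (fetch_html, fetch_all, workshop_urls), the 10-second timeout and the pool size of 8 are assumptions made for illustration and are not part of the original answer.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page; return None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        return None


def fetch_all(urls, fetch=fetch_html, max_workers=8):
    """Fetch every URL in a thread pool, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`
        return list(pool.map(fetch, urls))


def workshop_urls(count):
    """Build the zero-padded workshop URLs used in the answer."""
    base = 'https://pythonexpress.in/workshop/'
    return [base + str(i).zfill(3) for i in range(count)]
```

Calling fetch_all(workshop_urls(100)) would return a list of page bodies (or None for failed requests), which can then go through the same BeautifulSoup loop as before. Whether this actually helps depends on how the site responds to parallel requests, so a small pool size is the more polite choice.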

Source: https://stackoverflow.com/questions/41137778/extract-specific-columns-from-a-given-webpage

Tags

pandas

beautifulsoup

bs4