Question
I am trying to read a web page using Python and save the data in CSV format so that it can be imported as a pandas DataFrame.
I have the following code, which was written to extract the links from all the pages; instead, I am trying to read certain column fields.
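import urllib2
from bs4 import BeautifulSoup

for i in range(10):
    # Workshop pages are numbered 000, 001, 002, ...
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        # Print the text of the first nine 'col-xs-8' divs on the page
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            print i, anchor.text
    except Exception:
        # Skip pages that fail to download or parse
        pass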
Can I save these 9 fields as the columns of a pandas DataFrame, named as follows?
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
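Since the stated goal is a CSV file that pandas can load, one possible route is to write the collected fields with the standard `csv` module. The following is only a minimal sketch under assumptions: it is written for Python 3, it assumes each page has already been reduced to a list of nine strings, and the `write_rows` helper and the sample row are hypothetical placeholders, not data from the site.

```python
import csv
import io

COLUMNS = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level',
           'participants', 'Section', 'Status', 'Description']


def write_rows(rows, out):
    """Write a header line followed by one CSV line per scraped page."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for row in rows:
        # Strip surrounding whitespace that page markup tends to leave behind
        writer.writerow([field.strip() for field in row])


# Placeholder row, invented for illustration; not real data from the site
sample = [['Org A', 'Instructor B', '2016-01-01', 'Venue C', 'Beginner',
           '30', 'Section D', 'Completed', ' A short description\n']]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

To write a real file, the buffer could be swapped for something like `open('workshops.csv', 'w', newline='')` (the filename is an assumption), and the result could then be loaded with `pandas.read_csv`.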
Answer 1:
This returns the correct results for the first 10 pages - but it takes a lot of time for 100 pages. Any suggestions to make it faster?
import urllib2
from bs4 import BeautifulSoup
import pandas as pd

finallist = []
for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        # Collect the text of the first nine fields on this page
        mylist = []
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            mylist.append(anchor.text)
        # One row per page
        finallist.append(mylist)
    except Exception:
        # Skip pages that fail to download or parse
        pass

df = pd.DataFrame(finallist)
df.columns = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
# Convert the text columns to proper types
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
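On the speed question: if most of the time is spent waiting on the network (an assumption, not something measured here), one option is to download the pages concurrently and parse them afterwards. The sketch below is written for Python 3, where `urllib.request` takes the place of `urllib2`, and uses a thread pool from the standard library. The helper names `fetch`, `fetch_all` and `workshop_urls`, the 10-second timeout and the pool size of 8 are all illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE_URL = 'https://pythonexpress.in/workshop/'


def workshop_urls(count):
    """Build the zero-padded page URLs: .../000, .../001, and so on."""
    return [BASE_URL + str(i).zfill(3) for i in range(count)]


def fetch(url):
    """Download one page; return None on URL or socket errors."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except OSError:
        return None


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL on a thread pool, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))
```

Calling `fetch_all(workshop_urls(100))` would return a list of page bodies (or `None` for failed requests) in page order; each non-`None` entry could then be passed to `BeautifulSoup` exactly as in the loop above. Keeping the pool small limits the load placed on the server.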
Source: https://stackoverflow.com/questions/41137778/extract-specific-columns-from-a-given-webpage