BeautifulSoup - scraping a forum page


Question


I'm trying to scrape a forum discussion and export it as a csv file, with fields such as "thread title", "user", and "post", where the latter is the actual forum post from each individual.

I'm a complete beginner with Python and BeautifulSoup so I'm having a really hard time with this!

My current problem is that all the text is split into one character per row in the csv file. Is there anyone out there who can help me out?

Here's the code I've been using:

from bs4 import BeautifulSoup
import csv
import urllib2

f = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")

soup = BeautifulSoup(f)

b = soup.get_text().encode("utf-8").strip() #the posts contain non-ascii words, so I had to do this

writer = csv.writer(open('silkroad.csv', 'w'))
writer.writerows(b)

Answer 1:


Ok here we go. Not quite sure what I'm helping you do here, but hopefully you have a good reason to be analyzing silk road posts.

You have a few issues here; the big one is that you aren't parsing the data at all. What you're essentially doing with .get_text() is going to the page, highlighting the whole thing, and then copying and pasting it all into a csv file.
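
To see why the csv ends up with one character per row: csv.writer.writerows() expects an iterable of rows, and b here is one long string, so iterating it yields individual characters, each of which gets written as its own row. A minimal sketch of the behaviour (hypothetical file name, Python 2 as in your code):

import csv

b = "hello"                                  # stand-in for the big string get_text() returns
writer = csv.writer(open('demo.csv', 'wb'))  # 'demo.csv' is just a throwaway example file
writer.writerows(b)                          # iterates the string character by character
# demo.csv now contains five rows: h, e, l, l, o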

So here is what you should be trying to do:

  1. Read the page source
  2. Use soup to break it into sections you want
  3. Save the sections in parallel arrays for author, date, time, post, etc.
  4. Write data to csv file row by row

I wrote some code to show you what that looks like; it should do the job:

from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
print "Reading page..."
page = urllib2.urlopen("https://silkroad5v7dywlc.onion.to/index.php?action=printpage;topic=28536.0")
soup = BeautifulSoup(page)

# if you look at the HTML all the titles, dates, 
# and authors are stored inside of <dt ...> tags
metaData = soup.find_all("dt")

# likewise the post data is stored
# under <dd ...>
postData = soup.find_all("dd")

# define where we will store info
titles = []
authors = []
times = []
posts = []

# now we iterate through the metaData and parse it
# into titles, authors, and dates
print "Parsing data..."
for html in metaData:
    text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
    titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
    times.append(text.split(" on ")[1].strip()) # get date

# now we go through the actual post data and extract it
for post in postData:
    posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

# now we write data to csv file
# ***in Python 2, csv files must be opened with the 'b' flag (otherwise Windows inserts blank rows)***
csvfile = open('silkroad.csv', 'wb')
writer = csv.writer(csvfile)

# create template
writer.writerow(["Time", "Author", "Title", "Post"])

# iterate through and write all the data
for time, author, title, post in zip(times, authors, titles, posts):
    writer.writerow([time, author, title, post])


# close file
csvfile.close()

# done
print "Operation completed successfully."

EDIT: Added a solution that reads HTML files from a directory and appends their data to the csv file

Okay, so you have your HTML files in a directory. You need to get a list of the files, iterate through them, and append each file's data to your csv file.

This is the basic logic of our new program.

If we had a function called processData() that took a file path as an argument and appended the data from that file to your csv file, here is what it would look like:

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter

As it happens, our processData() function is more or less what we did before.

So the full program is very similar to our last one, with a few small changes:

  1. We write the column headers first thing
  2. Following that, we open the csv with the 'ab' flag so we append instead of overwrite
  3. We import os to get a list of files

Here's what that looks like:

from bs4 import BeautifulSoup
import csv
import urllib2
import os # added this import to process files/dirs

# ** define our data processing function
def processData( pageFile ):
    ''' take the data from an html file and append to our csv file '''
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    # if you look at the HTML all the titles, dates, 
    # and authors are stored inside of <dt ...> tags
    metaData = soup.find_all("dt")

    # likewise the post data is stored
    # under <dd ...>
    postData = soup.find_all("dd")

    # define where we will store info
    titles = []
    authors = []
    times = []
    posts = []

    # now we iterate through the metaData and parse it
    # into titles, authors, and dates
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "") # convert the html to text
        titles.append(text.split("Title:")[1].split("Post by:")[0].strip()) # get Title:
        authors.append(text.split("Post by:")[1].split(" on ")[0].strip()) # get Post by:
        times.append(text.split(" on ")[1].strip()) # get date

    # now we go through the actual post data and extract it
    for post in postData:
        posts.append(BeautifulSoup(str(post)).get_text().encode("utf-8").strip())

    # now we write data to csv file
    # ***in Python 2, csv files must be opened with the 'b' flag (otherwise Windows inserts blank rows)***
    csvfile = open('silkroad.csv', 'ab')
    writer = csv.writer(csvfile)

    # iterate through and write all the data
    for time, author, title, post in zip(times, authors, titles, posts):
        writer.writerow([time, author, title, post])

    # close file
    csvfile.close()
# ** start our process of going through files

# the directory where we have all our HTML files
dir = "myDir"

# our csv file
csvFile = "silkroad.csv"

# insert the column titles to csv
csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Time", "Author", "Title", "Post"])
csvfile.close()

# get a list of files in the directory
fileList = os.listdir(dir)

# define variables we need for status text
totalLen = len(fileList)
count = 1

# iterate through files and read all of them into the csv file
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." # display status
    count = count + 1 # increment counter


Source: https://stackoverflow.com/questions/21972690/beautifulsoup-scraping-a-forum-page
