Split Twitter RSS string using Python


Question


I am trying to parse Twitter RSS feeds and put the information in a sqlite database, using Python. Here's an example:

MiamiPete: today's "Last Call" is now up http://bit.ly/MGDzu #stocks #stockmarket #finance #money

What I want to do is create one column for the main content (Miami Pete…now up), one column for the URL (http://bit.ly/MGDzu), and four separate columns for the hashtags (stocks, stockmarket, finance, money). I've been playing around with how to do this.

Any advice would be greatly appreciated!

P.S. Some code I've been playing around with is below; you can see I initially tried creating a variable called "tinyurl" and splitting it, which does seem to work, but this feeble attempt is nowhere near solving the problem noted above. :)

def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    # c is the sqlite cursor created elsewhere; strftime comes from the time module
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        tinyurl = entry.summary    ### I added this in
        print tinyurl.split('http') ### I added this in
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, tinyurl, date, tiny) VALUES (?,?,?,?,?,?,?)', (id, entry.link, entry.title, entry.summary, tinyurl, strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed), tinyurl ))

Answer 1:


It seems like your database design is rather flawed. Unless all your entries have a text part, a URL and up to 4 tags, it's not going to work.

You also need to separate saving to the DB from parsing. Parsing can easily be done with a regexp (or even string methods):

>>> s = your_string
>>> s.split()
['MiamiPete:', "today's", '"Last', 'Call"', 'is', 'now', 'up', 'http://bit.ly/MGDzu', '#stocks', '#stockmarket', '#finance', '#money']
>>> url = [i for i in s.split() if i.startswith('http://')]
>>> url
['http://bit.ly/MGDzu']
>>> tags = [i for i in s.split() if i.startswith('#')]
>>> tags
['#stocks', '#stockmarket', '#finance', '#money']
>>> ' '.join(i for i in s.split() if i not in url+tags)
'MiamiPete: today\'s "Last Call" is now up'

Single-table db design would probably have to go, though.
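If it helps, here is a minimal sketch of a two-table layout using the standard sqlite3 module; the table and column names are my own, not the asker's existing RSSEntries schema, so treat them as placeholders:

import sqlite3

conn = sqlite3.connect('tweets.db')
c = conn.cursor()

# One row per tweet: the plain text part and the (optional) URL.
c.execute('''CREATE TABLE IF NOT EXISTS entries
             (entry_id INTEGER PRIMARY KEY, content TEXT, url TEXT)''')

# Any number of tags per entry, instead of four fixed tag columns.
c.execute('''CREATE TABLE IF NOT EXISTS tags
             (entry_id INTEGER, tag TEXT)''')

conn.commit()

That way a tweet with one tag or ten tags fits without changing the schema.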




Answer 2:


Also, you can parse your strings using regexps:

>>> import re
>>> s = (u'MiamiPete: today\'s "Last Call" is now up http://bit.ly/MGDzu '
         '#stocks #stockmarket #finance #money')
>>> re.match(r'(.*) (http://[^ ]+)', s).groups()
(u'MiamiPete: today\'s "Last Call" is now up', u'http://bit.ly/MGDzu')
>>> re.findall(r'(#\w+)', s)
[u'#stocks', u'#stockmarket', u'#finance', u'#money']
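Wrapped into a small helper (the function name is mine, not from the question), the same two patterns give you all three pieces in one call:

import re

def parse_tweet(s):
    """Split a tweet summary into (text, url, tags) using the patterns above."""
    m = re.match(r'(.*) (http://[^ ]+)', s)
    # If there is no URL, fall back to the whole string as the text part.
    text, url = m.groups() if m else (s, None)
    tags = re.findall(r'#\w+', s)
    return text, url, tags

Calling parse_tweet(s) on the example above returns the text, the bit.ly URL, and the four tags.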



Answer 3:


Twitter has an API that may be easier for you to use: http://apiwiki.twitter.com/Twitter-API-Documentation.

You can get the results as JSON or XML and use one of the many Python libraries to parse the results.

Or if you must use the RSS, there are Python feed parsers such as http://www.feedparser.org/.
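For what it's worth, pulling the entries out of a feed with feedparser looks roughly like this (the feed URL is just a placeholder, and this is a sketch rather than the asker's setup):

import feedparser

# Placeholder URL: substitute the Twitter RSS feed you are polling.
feed = feedparser.parse('http://example.com/twitter.rss')

for entry in feed.entries:
    # feedparser exposes the usual RSS fields as attributes.
    print entry.title
    print entry.link
    print entry.summary

Each entry.summary is then a string you can split with the techniques from the other answers.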




Answer 4:


I would highly recommend using the Twitter API. There are actually two APIs, one for the main twitter server and one for the search server. They are used for different things.

You can find sample code in the pytwitter project on SVN. Add simplejson and you can be doing very powerful things in a matter of minutes.
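As a rough sketch of what that looks like (the method names here are from the old python-twitter library as I remember it, so treat them as an assumption and double-check them against the library's docs):

import twitter   # python-twitter; needs simplejson installed

api = twitter.Api()   # unauthenticated access was enough for public timelines at the time
for status in api.GetUserTimeline('MiamiPete'):
    print status.text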

Good luck



Source: https://stackoverflow.com/questions/1354415/split-twitter-rss-string-using-python
