Tracking keywords in a live stream of tweets

妖精的绣舞 提交于 2019-12-02 10:11:57

The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:

def retrieve_tweets(numtweets=10, *args):
"""
This function optionally takes one or more arguments as keywords to filter tweets.
It iterates through tweets from the stream that meet the given criteria and sends them 
to the database population function on a per-instance basis, so as to avoid disaster 
if the stream is disconnected.

Both SampleStream and FilterStream methods access Twitter's stream of status elements.
For status element documentation, (including proper arguments for tweet['arg'] as seen
below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
"""   
filters = []
for key in args:
    filters.append(str(key))
if len(filters) == 0:
    stream = tweetstream.SampleStream(username, password)  
else:
    stream = tweetstream.FilterStream(username, password, track=filters)
try:
    count = 0
    while count < numtweets:       
        for tweet in stream:
            # a check is needed on text as some "tweets" are actually just API operations
            # the language selection doesn't really work but it's better than nothing(?)
            if tweet.get('text') and tweet['user']['lang'] == 'en':   
                if tweet['retweet_count'] == 0:
                    # bundle up the features I want and send them to the db population function
                    bundle = (tweet['id'], tweet['user']['screen_name'], tweet['retweet_count'], tweet['text'])
                    db_initpop(bundle)
                    break
                else:
                    # a RT has a different structure.  This bundles the original tweet.  Getting  the
                    # retweets comes later, after the stream is de-accessed.
                    bundle = (tweet['retweeted_status']['id'], tweet['retweeted_status']['user']['screen_name'], \
                              tweet['retweet_count'], tweet['retweeted_status']['text'])
                    db_initpop(bundle)
                    break
        count += 1
except tweetstream.ConnectionError, e:
    print 'Disconnected from Twitter at '+time.strftime("%d %b %Y %H:%M:%S", time.localtime()) \
    +'.  Reason: ', e.reason

I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.

Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.

Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.

The streaming API rate limiting works differently: you get 1 connection per IP, and a maximum number of events per second. If more events occur than that, then you only get the maximum anyways, with a notification regarding how many events you missed because of rate limiting.

My understanding is that the streaming API is most suitable for servers that will redistribute the content to your users as needed, instead of being accessed directly by your users - the standing connections are expensive and Twitter starts blacklisting IPs after too many failed connections and re-connections, and possibly your API key afterwards.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!