Improve Speed of Python Script: Multithreading or Multiple Instances?


Question:

I have a Python script that I'd like to run every day, and I'd prefer that it only take 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through, and the script ends up taking a really long time -- some preliminary tests suggest it would take over 36 hours to get through every URL in its current form. So my question boils down to: should I optimize my script to run multiple threads at the same time, or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.
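To put rough numbers on it (a back-of-the-envelope estimate, assuming my 36-hour test run is representative):

    urls_total = 160000
    current_secs = 36 * 3600.0
    per_url = current_secs / urls_total          # roughly 0.8 s per URL (4 sequential API calls)
    target_secs = 2 * 3600.0
    speedup_needed = current_secs / target_secs  # roughly 18x, i.e. on the order of 20 concurrent workers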

I'm using a library I created (SocialAnalytics), which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:

    import psycopg2
    from socialanalytics import pinterest
    from socialanalytics import facebook
    from socialanalytics import twitter
    from socialanalytics import google_plus
    from time import strftime, sleep

    conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
    cur = conn.cursor()

    # Select all URLs
    cur.execute("SELECT * FROM urls;")
    urls = cur.fetchall()

    for url in urls:

        # Pinterest
        try:
            p = pinterest.getPins(url[2])
        except:
            p = {'pin_count': 0}

        # Facebook
        try:
            f = facebook.getObject(url[2])
        except:
            f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}

        # Twitter
        try:
            t = twitter.getShares(url[2])
        except:
            t = {'share_count': 0}

        # Google
        try:
            g = google_plus.getPlusOnes(url[2])
        except:
            g = {'plus_count': 0}

        # Save results
        try:
            now = strftime("%Y-%m-%d %H:%M:%S")
            cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES (%s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
            conn.commit()
        except:
            conn.rollback()

Each call to the API uses the Requests library, which is synchronous and blocking. After some preliminary research I discovered Treq, an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure whether (and exactly how) it will help me reach my goal.
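For reference, here is my rough, untested understanding of the pattern treq enables (placeholder URLs only, not my actual SocialAnalytics calls):

    from twisted.internet import defer, task
    import treq

    def fetch(url):
        # treq.get returns a Deferred immediately; nothing blocks while the request is in flight
        d = treq.get(url)
        d.addCallback(treq.content)  # read the response body once it arrives
        return d

    def main(reactor):
        urls = [b"http://example.com/a", b"http://example.com/b"]  # placeholders
        # fire all requests concurrently and wait for them all to complete
        return defer.gatherResults([fetch(u) for u in urls], consumeErrors=True)

    task.react(main)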

Any guidance is much appreciated!

Answer 1:

First, you should measure how much time your script spends on each step. You may discover something interesting :)

Second, you can split your URLs into chunks:

    chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
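For example, a minimal sketch of the chunking (assuming cpu_core_count comes from multiprocessing.cpu_count()); slicing past the end of the list is safe, so the remainder just ends up in a smaller final chunk:

    import multiprocessing as mp

    cpu_core_count = mp.cpu_count()
    chunk_size = max(1, len(urls) // cpu_core_count)

    # the last chunk is shorter when the division leaves a remainder
    urls_chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]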

After these steps you can use multiprocessing to process every chunk in parallel. Here is an example:

    import multiprocessing as mp

    p = mp.Pool(5)

    # first solution
    for urls_chunk in urls:  # urls = [(url1...url6),(url7...url12)...]
        res = p.map(get_social_stat, urls_chunk)
        for record in res:
            save_to_db(record)

    # or, more simply
    res = p.map(get_social_stat, urls)

    for record in res:
        save_to_db(record)
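get_social_stat and save_to_db are placeholders here; a hypothetical get_social_stat could wrap the per-URL logic from your script and return a tuple that save_to_db passes straight to the INSERT:

    from socialanalytics import pinterest, facebook, twitter, google_plus
    from time import strftime

    def get_social_stat(url):
        # url is one row from the urls table; url[2] holds the address, as in your script
        try:
            p = pinterest.getPins(url[2])
        except:
            p = {'pin_count': 0}
        try:
            f = facebook.getObject(url[2])
        except:
            f = {'comment_count': 0, 'like_count': 0, 'share_count': 0}
        try:
            t = twitter.getShares(url[2])
        except:
            t = {'share_count': 0}
        try:
            g = google_plus.getPlusOnes(url[2])
        except:
            g = {'plus_count': 0}
        return (strftime("%Y-%m-%d %H:%M:%S"), p['pin_count'], f['like_count'],
                f['share_count'], f['comment_count'], t['share_count'], g['plus_count'])

Note that the worker has to live at module level so multiprocessing can pickle it, and the database connection should stay in the parent process rather than being shared with the workers.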

gevent can also help, because it can cut down the time spent working through a sequence of synchronous, blocking requests.
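A minimal sketch of that approach, reusing the same hypothetical get_social_stat worker (monkey-patching makes the blocking Requests calls cooperative):

    from gevent import monkey
    monkey.patch_all()  # patch before other imports so blocking socket calls yield to other greenlets

    from gevent.pool import Pool

    pool = Pool(20)  # up to 20 requests in flight at once

    for record in pool.imap_unordered(get_social_stat, urls):
        save_to_db(record)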


