Python socket.gethostbyname_ex() multithread fails

Submitted by 爷,独闯天下 on 2019-12-02 13:51:48

Question


I wrote a script that should resolve multiple hostnames into IP addresses using multithreading.

However, it freezes at some random point. How can this be solved?

num_threads = 100
conn = pymysql.connect(host='xx.xx.xx.xx', unix_socket='/tmp/mysql.sock', user='user', passwd='pw', db='database')
cur = conn.cursor()
def mexec(befehl):
    cur = conn.cursor()
    cur.execute(befehl)

websites = ['facebook.com','facebook.org' ... ... ... ...] # 10,000 websites in array
queue = Queue()
def getips(i, q):
    while True:
        #--resolve IP--
        try:
            result = socket.gethostbyname_ex(site)
            print(result)
            mexec("UPDATE sites2block SET ip='"+result+"', updated='yes' ") #puts site in mysqldb
        except (socket.gaierror):
            print("no ip")
            mexec("UPDATE sites2block SET ip='no ip', updated='yes',")
        q.task_done()
#Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=getips, args=(i, queue))
    worker.setDaemon(True)
    worker.start()
#Place work in queue
for site in websites:
    queue.put(site)
#Wait until worker threads are done to exit
queue.join()

Answer 1:


You could use a sentinel value to signal to the threads that there is no more work, and join the threads instead of using queue.task_done() and queue.join():

#!/usr/bin/env python3
import socket
from queue import Queue
from threading import Thread

def getips(queue):
    # iter() keeps calling queue.get() until it returns the None sentinel
    for site in iter(queue.get, None):
        try: # resolve hostname
            result = socket.gethostbyname_ex(site)
        except OSError as e:
            print("error %s reason: %s" % (site, e))
        else:
            print("done %s %s" % (site, result))

def main():
    websites = "youtube google non-existent.example facebook yahoo live".split()
    websites = [name+'.com' for name in websites]

    # Spawn thread pool
    queue = Queue()
    threads = [Thread(target=getips, args=(queue,)) for _ in range(20)]
    for t in threads:
        t.daemon = True
        t.start()

    # Place work in queue
    for site in websites: queue.put(site)
    # Put sentinel to signal the end
    for _ in threads: queue.put(None)
    # Wait for completion
    for t in threads: t.join()

main()

The gethostbyname_ex() function does not support IPv6 and is considered obsolete. To support both IPv4 and IPv6 addresses, you could use socket.getaddrinfo() instead.
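
As a minimal sketch of that suggestion, the helper below collects the unique addresses that socket.getaddrinfo() returns for a host. The name resolve_all is hypothetical and not part of the original answer. Passing family=socket.AF_INET would limit the lookup to IPv4; the default socket.AF_UNSPEC returns whatever address families the resolver offers.

```python
import socket

def resolve_all(host, family=socket.AF_UNSPEC):
    """Return the sorted unique IP addresses for host (hypothetical helper)."""
    # proto=IPPROTO_TCP avoids getting one duplicate entry per socket type.
    infos = socket.getaddrinfo(host, None, family=family, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    try:
        print(resolve_all("localhost"))  # swap in any hostname
    except socket.gaierror as e:
        print("no ip: %s" % e)
```

Because the function takes a single hostname and returns a plain list, it could be dropped into the worker above in place of socket.gethostbyname_ex.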




Answer 2:


My first idea was that you get errors because you overload the DNS resolver: maybe it doesn't allow more than a certain number of queries per unit of time.


Besides, I spotted some issues:

  1. You forgot to assign site in the while loop, which would probably be better replaced by a for loop iterating over the queue. In your version, the workers read the site variable from the module-level namespace, which can lead to some hostnames being queried twice and others being skipped.

    At this point, you can check whether the queue still has entries or is still waiting for some. If neither is the case, the thread can exit.

  2. For security reasons, you should pass the values as query parameters instead of concatenating them into the SQL string:

    def mexec(befehl, args=None):
        cur = conn.cursor()
        cur.execute(befehl, args)
    

    so that you can afterwards call

    mexec("UPDATE sites2block SET ip=%s, updated='yes'", (result[2][0],)) # result[2] is the IP list; store the first IP in the MySQL DB
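
To illustrate why the parameterized form is safer, here is a small runnable sketch. It uses sqlite3 from the standard library as a stand-in so that it runs without a MySQL server; note that sqlite3 uses ? as its placeholder where pymysql uses %s. The name column and the WHERE clause are assumptions for illustration, since the original table layout is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the original post does not show the real table layout.
conn.execute("CREATE TABLE sites2block (name TEXT, ip TEXT, updated TEXT)")
conn.executemany("INSERT INTO sites2block (name) VALUES (?)",
                 [("facebook.com",), ("it's-odd.example",)])

def mexec(befehl, args=()):
    # Values are bound by the driver, not pasted into the SQL string,
    # so quotes inside them cannot break the statement.
    cur = conn.cursor()
    cur.execute(befehl, args)
    conn.commit()

# 192.0.2.10 is a documentation address used as a placeholder result.
mexec("UPDATE sites2block SET ip=?, updated='yes' WHERE name=?",
      ("192.0.2.10", "it's-odd.example"))

rows = conn.execute(
    "SELECT name, ip, updated FROM sites2block ORDER BY name").fetchall()
print(rows)
```

Without a WHERE clause the UPDATE would overwrite every row in the table. With pymysql, changes also need conn.commit() unless autocommit is enabled on the connection.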
    

To stay compatible with future protocols, you should use socket.getaddrinfo() instead of socket.gethostbyname_ex(). It returns all the addresses you want (you can limit it to IPv4 at first, and switching to IPv6 later is then easier), and you could store them all in the DB.


For your queue, the code could look like this:

from queue import Empty  # on Python 2: from Queue import Empty

def queue_iterator(q):
    """Iterate over the contents of a queue. Waits for new elements as long as the queue is still filling."""
    while True:
        try:
            item = q.get(block=q.is_filling, timeout=.1)
            yield item
            q.task_done() # indicate that task is done.
        except Empty:
            # If q is still filling, continue.
            # If q is empty and not filling any longer, return.
            if not q.is_filling: return

def getips(i, q):
    for site in queue_iterator(q):
        #--resolve IP--
        try:
            result = socket.gethostbyname_ex(site)
            print(result)
            mexec("UPDATE sites2block SET ip=%s, updated='yes'", (result[2][0],)) # result[2] is the IP list; store the first IP in the MySQL DB
        except (socket.gaierror):
            print("no ip")
            mexec("UPDATE sites2block SET ip='no ip', updated='yes'")
# Indicate it is filling.
queue.is_filling = True
#Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=getips, args=(i, queue))
    worker.daemon = True
    worker.start()
#Place work in queue
for site in websites:
    queue.put(site)
queue.is_filling = False # we are done filling, if q becomes empty, we are done.
#Wait until worker threads are done to exit
queue.join()

should do the trick.


Another issue is that you write to MySQL from several threads in parallel. A single connection can only run one query at a time, so you could either protect the access with a threading.Lock() or RLock(), or put the results into another queue that is processed by a separate thread, which could even bundle them.
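
The second option could be sketched as follows. Resolver threads put their results on a second queue, and a single writer thread is the only one that touches the database. This is a minimal sketch under assumptions: db_writer and write_row are hypothetical names, the list rows stands in for the MySQL table, and the resolver uses a fixed placeholder address instead of a real DNS lookup so that the example runs offline.

```python
import queue
import threading

def db_writer(results, write_row):
    """Consume (site, ip) pairs until the None sentinel arrives."""
    for site, ip in iter(results.get, None):
        write_row(site, ip)  # only this thread ever calls the DB function

def resolver(site, results):
    # Placeholder for socket.gethostbyname_ex(site);
    # 192.0.2.1 is a documentation address, not a real lookup result.
    results.put((site, "192.0.2.1"))

results = queue.Queue()
rows = []  # stand-in for the database table

writer = threading.Thread(
    target=db_writer,
    args=(results, lambda site, ip: rows.append((site, ip))))
writer.start()

workers = [threading.Thread(target=resolver, args=(site, results))
           for site in ["a.example", "b.example", "c.example"]]
for t in workers:
    t.start()
for t in workers:
    t.join()

results.put(None)  # all resolvers are done: tell the writer to stop
writer.join()
print(sorted(rows))
```

Since only the writer thread uses the connection, no lock is needed, and the writer is free to collect several rows before sending them to the database in one batch.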




Answer 3:


You might find it simpler to use concurrent.futures than to use threading, multiprocessing, or Queue directly:

#!/usr/bin/env python3
import socket
# pip install futures on Python 2.x
from concurrent.futures import ThreadPoolExecutor as Executor

hosts = "youtube.com google.com facebook.com yahoo.com live.com".split()*100
with Executor(max_workers=20) as pool:
    for results in pool.map(socket.gethostbyname_ex, hosts, timeout=60):
        print(results)

Note: you could easily switch from using threads to processes:

from concurrent.futures import ProcessPoolExecutor as Executor

You need this if gethostbyname_ex() is not thread-safe on your OS; for example, this might be the case on OS X.

If you'd like to process exceptions that might arise in gethostbyname_ex():

import concurrent.futures

with Executor(max_workers=20) as pool:
    future2host = dict((pool.submit(socket.gethostbyname_ex, h), h)
                       for h in hosts)
    for f in concurrent.futures.as_completed(future2host, timeout=60):
        e = f.exception()
        print(f.result() if e is None else "{0}: {1}".format(future2host[f], e))

It is similar to the example in the concurrent.futures documentation.



Source: https://stackoverflow.com/questions/9194472/python-socket-gethostbyname-ex-multithread-fails
