Increasing throughput in a Python script

傲寒 2021-01-22 02:23

I'm processing a list of thousands of domain names from a DNSBL through dig, creating a CSV of URLs and IPs. This is a very time-consuming process that can take several hours.

4 Answers
  •  半阙折子戏
    2021-01-22 02:52

    Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if somehow dig returned very quickly), Python should be able to deal with thousands of entries easily.

    That said, you should try a threaded approach. That would (theoretically) resolve several addresses at the same time instead of sequentially. You could just as well continue to use dig for that, and it should be trivial to modify my example code below accordingly, but, to make things interesting (and hopefully more Pythonic), let's use an existing module for that: dnspython
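    If you do want to stick with dig, a rough sketch of what each worker thread could run, shelling out to `dig +short` (these are hypothetical helpers, and they assume the dig binary is on PATH):

    ```python
    import subprocess

    def dig_first_a_record(domain):
        """Return the first A record dig reports for domain, or None.
        `+short` prints one record per line; raises TimeoutExpired on a hang."""
        out = subprocess.run(
            ["dig", "+short", domain, "A"],
            capture_output=True, text=True, timeout=10,
        ).stdout
        return first_address(out)

    def first_address(dig_output):
        """Pick the first line that looks like an IPv4 address
        (+short may also print CNAME targets, which we skip)."""
        for line in dig_output.splitlines():
            parts = line.strip().split(".")
            if len(parts) == 4 and all(p.isdigit() for p in parts):
                return line.strip()
        return None
    ```

    Each worker would call `dig_first_a_record(domain)` and stash the result, in the same spot where the dnspython version below calls the resolver.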

    So, install it with:

    pip install dnspython
    

    And then try something like the following:

    import threading
    from dns import resolver
    
    class Resolver(threading.Thread):
        def __init__(self, address, result_dict):
            threading.Thread.__init__(self)
            self.address = address
            self.result_dict = result_dict
    
        def run(self):
            try:
                # resolve() defaults to A records; take the first answer
                result = resolver.resolve(self.address)[0].to_text()
                self.result_dict[self.address] = result
            except (resolver.NXDOMAIN, resolver.NoAnswer):
                pass
    
    
    def main():
        with open("domainlist") as infile:
            addresses = [line.strip() for line in infile if line.strip()]
    
        threads = []
        results = {}
        for address in addresses:
            resolver_thread = Resolver(address, results)
            threads.append(resolver_thread)
            resolver_thread.start()
    
        for thread in threads:
            thread.join()
    
        with open("final.csv", "w") as outfile:
            outfile.write("\n".join("%s,%s" % (address, ip)
                                    for address, ip in results.items()))
    
    if __name__ == '__main__':
        main()
    
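    One small caveat about the output step: the manual string join assumes no field ever contains a comma or newline. The stdlib csv module handles quoting for you; a minimal sketch of that same final step (`write_results` is a hypothetical helper):

    ```python
    import csv

    def write_results(results, path="final.csv"):
        """Write {domain: ip} pairs as CSV rows, letting csv handle quoting."""
        with open(path, "w", newline="") as outfile:
            writer = csv.writer(outfile)
            for address, ip in results.items():
                writer.writerow([address, ip])
    ```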

    If that proves to start too many threads at the same time, you could try doing it in batches, or use a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example).
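    The batch/queue idea is also available straight from the standard library: `concurrent.futures.ThreadPoolExecutor` caps the number of live threads for you. A minimal sketch, using the stdlib `socket.gethostbyname` as a stand-in resolver (IPv4 only; swap in dnspython if you need more than A records):

    ```python
    import socket
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def resolve_all(domains, max_workers=20):
        """Resolve domains concurrently, never running more than
        max_workers lookups at once."""
        results = {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(socket.gethostbyname, d): d for d in domains}
            for future in as_completed(futures):
                domain = futures[future]
                try:
                    results[domain] = future.result()
                except socket.gaierror:
                    pass  # unresolvable name; skip it, like the NXDOMAIN case above
        return results
    ```

    With this shape you tune throughput by changing a single number instead of restructuring the loop.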
