问题
The class BrokenLinkTest in the code below does the following.
- takes a web page url
- finds all the links in the web page
- get the headers of the links concurrently (this is done to check if the link is broken or not)
- print 'completed' when all the headers are received.
from bs4 import BeautifulSoup
import requests
class BrokenLinkTest(object):
def __init__(self, url):
self.url = url
self.thread_count = 0
self.lock = threading.Lock()
def execute(self):
soup = BeautifulSoup(requests.get(self.url).text)
self.lock.acquire()
for link in soup.find_all('a'):
url = link.get('href')
threading.Thread(target=self._check_url(url))
self.lock.acquire()
def _on_complete(self):
self.thread_count -= 1
if self.thread_count == 0: #check if all the threads are completed
self.lock.release()
print "completed"
def _check_url(self, url):
self.thread_count += 1
print url
result = requests.head(url)
print result
self._on_complete()
BrokenLinkTest("http://www.example.com").execute()
Can the concurrency/synchronization part be done in a better way. I did it using threading.Lock. This is my first experiment with python threading.
回答1:
def execute(self):
soup = BeautifulSoup(requests.get(self.url).text)
threads = []
for link in soup.find_all('a'):
url = link.get('href')
t = threading.Thread(target=self._check_url, args=(url,))
t.start()
threads.append(t)
for thread in threads:
thread.join()
You could use the join
method to wait for all the threads to finish.
Note I also added a start call, and passed the bound method object to the target param. In your original example you were calling _check_url
in the main thread and passing the return value to the target param.
回答2:
All threads in Python run on the same core, so you won't be gaining any performance by doing it this way. Also - it's very unclear what is actually happening?
- You are never actually starting a threads, you are just initializing it
- The threads themselves do absolutely nothing other than decrementing the thread count
You may only gain performance in a thread-based scenario if your program is delivering work to the IO (sending requests, writing to file and so on), where other threads can work in the meanwhile.
来源:https://stackoverflow.com/questions/26234301/synchronise-muti-threads-in-python