Question
I've created a scraper in Python that uses Thread to speed up execution. The scraper is supposed to parse all the links available within the webpages ending with different alphabets, and it does parse them all. However, I also wish to parse the names and phone numbers from those individual links, again using Thread. The first portion I could manage to run using Thread, but I can't figure out how to create another Thread to execute the latter portion of the script. I could have wrapped them within a single Thread, but my intention is to learn how to use two Threads to execute two functions.

For the first portion, I tried the following and it worked:
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
My question is: how can I use the sub_links() function within another Thread?
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)
    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception:
            name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception:
            phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
Answer 1:
Try updating alphabetical_links so that it starts its own Threads:
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    links_on_page = [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

    # second level of threads: one per profile link found on this page
    threads = []
    for link in links_on_page:
        thread = threading.Thread(target=sub_links, args=(link,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)
    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception:
            name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception:
            phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
Note that this is just an example of how to manage "inner" Threads. Because so many threads start at the same time, your system might fail to start some of them due to lack of resources, and you will get a RuntimeError: can't start new thread exception. In that case you should implement a ThreadPool.
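If you do hit that limit, one option (not part of the original answer) is the standard library's concurrent.futures.ThreadPoolExecutor, which reuses a fixed number of worker threads instead of spawning one thread per URL. A minimal sketch, assuming the same page structure and CSS selectors as above; max_workers=10 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    tree = html.fromstring(requests.get(mainurl).text)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    root = html.fromstring(requests.get(process_links).text)
    for container in root.cssselect(".proListing"):
        name = container.cssselect("h2 a")[0].text if container.cssselect("h2 a") else ""
        phone = container.cssselect(".proListingPhone")[0].text if container.cssselect(".proListingPhone") else ""
        print(name, phone)

if __name__ == '__main__':
    letter_pages = [main_url.format(chr(page)) for page in range(97, 123)]  # letters a-z

    # stage one: a bounded pool fetches every alphabetical page and collects the profile links
    with ThreadPoolExecutor(max_workers=10) as executor:
        all_links = [link for page_links in executor.map(alphabetical_links, letter_pages)
                     for link in page_links]

    # stage two: a pool of the same size scrapes name and phone from each profile link
    with ThreadPoolExecutor(max_workers=10) as executor:
        list(executor.map(sub_links, all_links))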
Answer 2:
You can start more threads the same way you started the first one:
from threading import Thread

t1 = Thread(target=alphabetical_links, kwargs={
    'mainurl': link,
})
t1.start()

t2 = Thread(target=sub_links, kwargs={
    'process_links': link,
})
t2.start()
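One thing to keep in mind (not stated in the original answer): a Thread target's return value is discarded, so if sub_links is meant to consume the links that alphabetical_links produces, the data has to be handed over explicitly and the first thread has to finish before the second starts. A minimal sketch, reusing alphabetical_links, sub_links and link from the question, with hypothetical collect/scrape_all wrappers:

from threading import Thread

collected = []   # shared list that the first thread fills in

def collect(mainurl):
    # hypothetical wrapper: stores the result, since a Thread cannot return it
    collected.extend(alphabetical_links(mainurl))

def scrape_all(urls):
    # hypothetical wrapper: runs sub_links over every collected link
    for url in urls:
        sub_links(url)

t1 = Thread(target=collect, args=(link,))
t1.start()
t1.join()        # wait until the link list is complete

t2 = Thread(target=scrape_all, args=(collected,))
t2.start()
t2.join()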
Source: https://stackoverflow.com/questions/52245035/unable-to-use-two-threads-to-execute-two-functions-within-a-script