Question
I've created a scraper in Python that uses Thread to speed up execution. The scraper is supposed to parse all the links available within the webpages ending with different alphabets, and it does parse them all. However, I also wish to parse the names and phone numbers from those individual links, again using Thread. The first portion I could manage to run using Thread, but I can't figure out how to create another Thread to execute the latter portion of the script. I could have wrapped them within a single Thread, but my intention is to learn how to use two Threads to execute two functions.

For the first portion, I tried the following and it worked:
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
My question is: how can I use the sub_links() function within another Thread?
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)
    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception:
            name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception:
            phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
Answer 1:
Try updating alphabetical_links so that it starts its own Threads:
import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    links_on_page = [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

    # second level of threads: one per profile link found on this page
    threads = []
    for link in links_on_page:
        thread = threading.Thread(target=sub_links, args=(link,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)
    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception:
            name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception:
            phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97, 123)]:  # letters a-z
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist += [thread]

    for thread in linklist:
        thread.join()
Note that this is just an example of how to manage "inner" Threads. Because so many threads start at the same time, your system might fail to start some of them due to lack of resources, and you will get a RuntimeError: can't start new thread exception. In that case you should implement a ThreadPool.
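If you do hit that limit, one option (not part of the original answer) is the standard library's concurrent.futures.ThreadPoolExecutor, which reuses a fixed number of worker threads instead of spawning one thread per URL. A minimal sketch, assuming the same page structure and CSS selectors as above; max_workers=10 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    tree = html.fromstring(requests.get(mainurl).text)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    root = html.fromstring(requests.get(process_links).text)
    for container in root.cssselect(".proListing"):
        name = container.cssselect("h2 a")[0].text if container.cssselect("h2 a") else ""
        phone = container.cssselect(".proListingPhone")[0].text if container.cssselect(".proListingPhone") else ""
        print(name, phone)

if __name__ == '__main__':
    letter_pages = [main_url.format(chr(page)) for page in range(97, 123)]  # letters a-z

    # stage one: a bounded pool fetches every alphabetical page and collects the profile links
    with ThreadPoolExecutor(max_workers=10) as executor:
        all_links = [link for page_links in executor.map(alphabetical_links, letter_pages)
                     for link in page_links]

    # stage two: a pool of the same size scrapes name and phone from each profile link
    with ThreadPoolExecutor(max_workers=10) as executor:
        list(executor.map(sub_links, all_links))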
Answer 2:
You can start more threads the same way you started the first one:
from threading import Thread

t1 = Thread(target=alphabetical_links, kwargs={
    'mainurl': link,
})
t1.start()

t2 = Thread(target=sub_links, kwargs={
    'process_links': link,
})
t2.start()
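One thing to keep in mind (not stated in the original answer): a Thread target's return value is discarded, so if sub_links is meant to consume the links that alphabetical_links produces, the data has to be handed over explicitly and the first thread has to finish before the second starts. A minimal sketch, reusing alphabetical_links, sub_links and link from the question, with hypothetical collect/scrape_all wrappers:

from threading import Thread

collected = []   # shared list that the first thread fills in

def collect(mainurl):
    # hypothetical wrapper: stores the result, since a Thread cannot return it
    collected.extend(alphabetical_links(mainurl))

def scrape_all(urls):
    # hypothetical wrapper: runs sub_links over every collected link
    for url in urls:
        sub_links(url)

t1 = Thread(target=collect, args=(link,))
t1.start()
t1.join()        # wait until the link list is complete

t2 = Thread(target=scrape_all, args=(collected,))
t2.start()
t2.join()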
Source: https://stackoverflow.com/questions/52245035/unable-to-use-two-threads-to-execute-two-functions-within-a-script