Unable to use two Threads to execute two functions within a script

女生的网名这么多〃 submitted on 2020-02-07 03:40:08

Question


I've created a scraper in Python that uses Thread to speed up execution. The scraper is supposed to parse all the links on the web pages whose URLs end with different letters of the alphabet. It does parse them all.

However, I also want to parse all the names and phone numbers from those individual links, again using Thread. I managed to run the first part with Thread, but I can't work out how to create another Thread to execute the latter part of the script.

I could have wrapped both in a single Thread, but I want to know how to use two Threads to execute two functions.

For the first part, I tried the code below and it worked:

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]

    for thread in linklist:
        thread.join()

My question is: how can I use the sub_links() function within another Thread?

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"

def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    return [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]

def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)

    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception: name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception: phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]

    for thread in linklist:
        thread.join()

Answer 1:


Try updating alphabetical_links so that it starts its own Threads:

import requests
import threading
from lxml import html

main_url = "https://www.houzz.com/proListings/letter/{}"


def alphabetical_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    links_on_page = [container.attrib['href'] for container in tree.cssselect(".proSitemapLink a")]
    threads = []
    for link in links_on_page:
        thread = threading.Thread(target=sub_links, args=(link,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()


def sub_links(process_links):
    response = requests.get(process_links).text
    root = html.fromstring(response)

    for container in root.cssselect(".proListing"):
        try:
            name = container.cssselect("h2 a")[0].text
        except Exception: name = ""
        try:
            phone = container.cssselect(".proListingPhone")[0].text
        except Exception: phone = ""
        print(name, phone)

if __name__ == '__main__':
    linklist = []
    for link in [main_url.format(chr(page)) for page in range(97,123)]:
        thread = threading.Thread(target=alphabetical_links, args=(link,))
        thread.start()
        linklist+=[thread]


    for thread in linklist:
        thread.join()

Note that this is just an example of how to manage "inner threads". Because many threads start at the same time, your system may fail to start some of them for lack of resources, and you will get a RuntimeError: can't start new thread exception. In that case, try a thread pool instead.
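
As a rough illustration of the thread pool idea, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library. It is not the code from the question: run_two_stages, fake_list_links and fake_parse_listing are hypothetical names, and the two fake functions stand in for alphabetical_links and sub_links so the example runs without network access. It also assumes the second function returns its result instead of printing it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_two_stages(letter_urls, list_links, parse_listing, max_workers=8):
    """Run both stages on one bounded pool (hypothetical helper).

    list_links(url) must return a list of sub-links;
    parse_listing(sub_url) must return the parsed result for one sub-link.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Stage 1: fetch every letter page; map() keeps the input order.
        link_lists = list(pool.map(list_links, letter_urls))
        sub_urls = [url for links in link_lists for url in links]
        # Stage 2: parse every sub-link on the same pool, so at most
        # max_workers threads exist at any time.
        return list(pool.map(parse_listing, sub_urls))


# Stand-ins for the real network functions (assumptions, not from the question).
def fake_list_links(letter_url):
    return [f"{letter_url}/p{i}" for i in range(2)]


def fake_parse_listing(sub_url):
    return (sub_url, "555-0100")


if __name__ == "__main__":
    for row in run_two_stages(["a", "b"], fake_list_links, fake_parse_listing):
        print(row)
```

With the real functions, alphabetical_links would need to return its list of links (as it does in the question) and sub_links would need to return the names and phone numbers rather than print them. The pool size of 8 is an arbitrary choice.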




Answer 2:


You can start more threads the same way you started the first one:

from threading import Thread

# `link` is assumed to be a URL defined elsewhere, as in the question's loop.
t1 = Thread(target=alphabetical_links, kwargs={'mainurl': link})
t1.start()

t2 = Thread(target=sub_links, kwargs={'process_links': link})
t2.start()
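
If the second function needs the links that the first one finds, one possible way to connect the two threads is a queue.Queue. The sketch below is an assumption about how that could look, not part of the original answer: producer, consumer, run_pipeline and the _DONE sentinel are hypothetical names, and list_links / parse_listing stand in for alphabetical_links and sub_links (each taking a URL and returning a value).

```python
import queue
import threading

_DONE = object()  # sentinel that tells the consumer no more items are coming


def producer(letter_urls, list_links, out_q):
    """Thread 1: put every sub-link found on each letter page on the queue."""
    try:
        for url in letter_urls:
            for sub_url in list_links(url):
                out_q.put(sub_url)
    finally:
        # Always send the sentinel, even on error, so the consumer can stop.
        out_q.put(_DONE)


def consumer(in_q, parse_listing, results):
    """Thread 2: parse sub-links as they arrive until the sentinel shows up."""
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        results.append(parse_listing(item))


def run_pipeline(letter_urls, list_links, parse_listing):
    q = queue.Queue()
    results = []
    t1 = threading.Thread(target=producer, args=(letter_urls, list_links, q))
    t2 = threading.Thread(target=consumer, args=(q, parse_listing, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without network access.
    print(run_pipeline(["a", "b"], lambda u: [u + "/1", u + "/2"], str.upper))
```

Here the second thread starts parsing as soon as the first link is available, instead of waiting for the first function to finish.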


Source: https://stackoverflow.com/questions/52245035/unable-to-use-two-threads-to-execute-two-functions-within-a-script
