Submit data via web form and extract the results

离开以前 2020-12-08 03:05

My Python level is novice. I have never written a web scraper or crawler. I have written Python code to connect to an API and extract the data that I want. But for some th

3 Answers
  •  独厮守ぢ
    2020-12-08 03:57

    You can use mechanize to submit the form and retrieve the response, and the re module to extract the value you want. For example, the script below does this for the text of your own question:

    import re
    from mechanize import Browser
    
    text = """
    My python level is Novice. I have never written a web scraper 
    or crawler. I have written a python code to connect to an api and 
    extract the data that I want. But for some the extracted data I want to 
    get the gender of the author. I found this web site 
    http://bookblog.net/gender/genie.php but downside is there isn't an api 
    available. I was wondering how to write a python to submit data to the 
    form in the page and extract the return data. It would be a great help 
    if I could get some guidance on this."""
    
    browser = Browser()
    browser.open("http://bookblog.net/gender/genie.php")
    
    browser.select_form(nr=0)
    browser['text'] = text
    browser['genre'] = ['nonfiction']
    
    response = browser.submit()
    
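    # read() returns bytes on Python 3; decode (assuming UTF-8) before matching
    content = response.read().decode('utf-8', 'replace')
    
    result = re.findall(
        r'The Gender Genie thinks the author of this passage is: (\w*)!', content)
    
    print(result[0])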
    

    What does it do? It creates a mechanize.Browser and goes to the given URL:

    browser = Browser()
    browser.open("http://bookblog.net/gender/genie.php")
    

    Then it selects the form (since there is only one form to be filled, it will be the first):

    browser.select_form(nr=0)
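
If you are not sure which index to pass as nr, it helps to list the forms on the page first. mechanize can do this itself through browser.forms(); the sketch below shows the same idea with only the standard library, run against a hypothetical snippet of markup rather than the real page. The field names text and genre come from the script above; the rest of SAMPLE_HTML (the action, the fiction option, the submit button) is an assumption for illustration.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the named fields of each <form> in an HTML document."""

    FIELD_TAGS = {"input", "textarea", "select"}

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of field names per form, in page order
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms.append([])
            self._in_form = True
        elif self._in_form and tag in self.FIELD_TAGS:
            name = dict(attrs).get("name")
            # Skip unnamed controls and repeated names (e.g. radio groups).
            if name and name not in self.forms[-1]:
                self.forms[-1].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<form action="genie.php" method="post">
  <textarea name="text"></textarea>
  <input type="radio" name="genre" value="fiction">
  <input type="radio" name="genre" value="nonfiction">
  <input type="submit" value="Submit">
</form>
"""

lister = FormLister()
lister.feed(SAMPLE_HTML)
for index, fields in enumerate(lister.forms):
    print(index, fields)
```

The printed index is the value you would pass to select_form(nr=...), and the field names are the keys you would set on the browser.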
    

    Next, it fills in the form fields...

    browser['text'] = text
    browser['genre'] = ['nonfiction']
    

    ... and submits it:

    response = browser.submit()
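
To see roughly what submit() puts on the wire, here is a minimal sketch that builds an equivalent request with the standard library only. It assumes the form is sent as a URL-encoded POST to the same URL as the page, which the original answer does not confirm; build_form_request and FORM_URL are names made up for this example. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the form posts back to the page's own URL.
FORM_URL = "http://bookblog.net/gender/genie.php"


def build_form_request(text, genre="nonfiction", url=FORM_URL):
    """Build (but do not send) a POST request carrying the form fields."""
    fields = {"text": text, "genre": genre}
    # urlencode percent-encodes the values and joins them with '&'.
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_form_request("hello world")
print(request.get_method())
print(request.data)
```

Passing the request to urllib.request.urlopen() would actually send it. mechanize remains the more convenient choice, because it reads the real method, target and hidden fields from the page instead of relying on these assumptions.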
    

    Now we read the response body (bytes on Python 3, so we decode it, assuming UTF-8):

    content = response.read().decode('utf-8', 'replace')
    

    We know that the result page contains a sentence of the form:

    The Gender Genie thinks the author of this passage is: male!
    

    So we write a regex that captures the final word and apply it with re.findall():

    result = re.findall(
        r'The Gender Genie thinks the author of this passage is: (\w*)!',
        content)
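
One caveat: if the page layout changes or the request fails, re.findall() returns an empty list and indexing it raises IndexError. A slightly more defensive sketch, using re.search() and returning None when the sentence is missing, could look like this. The helper name extract_gender and the sample fragments are made up for illustration; the pattern is the one above, with \w+ instead of \w* so that an empty word does not count as a match.

```python
import re

# Pattern from the answer; \w+ rejects an empty capture.
RESULT_RE = re.compile(
    r"The Gender Genie thinks the author of this passage is: (\w+)!"
)


def extract_gender(page):
    """Return the reported gender, or None if the sentence is not on the page."""
    if isinstance(page, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        page = page.decode("utf-8", errors="replace")
    match = RESULT_RE.search(page)
    return match.group(1) if match else None


# Hypothetical response fragments, used only to exercise the function.
print(extract_gender("<p>The Gender Genie thinks the author of this passage is: male!</p>"))
print(extract_gender(b"<p>No verdict here</p>"))
```

Because the function accepts both str and bytes, it can be called directly on response.read().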
    

    Now the result is available for your use:

    print(result[0])
    
