Python: Only writes last line of output

Question

I'm trying to write a program that extracts URLs from a website. The printed output is correct, but when I write the output to a file, only the last record is written. Here is the code:

import re
import urllib.request

# Retrieves URLs from the HTML source code of a website
def extractUrls(url, unique=True, sort=True, restrictToTld=None):
    # Prepend "www." if not present
    if url[0:4] != "www.":
        url = "".join(["www.",url])
    # Open a connection
    with urllib.request.urlopen("http://" + url) as h:
        # Grab the headers
        headers = h.info()
        # Default charset
        charset = "ISO-8859-1"
        # If a charset is in the headers then override the default
        for i in headers:
            match = re.search(r"charset=([\w\-]+)", headers[i], re.I)
            if match != None:
                charset = match.group(1).lower()
                break
        # Grab and decode the source code
        source = h.read().decode(charset)
        # Find all URLs in the source code
        matches = re.findall(r"http\:\/\/(www.)?([a-z0-9\-\.]+\.[a-z]{2,6})\b", source, re.I)
        # Abort if no URLs were found
        if matches == None:
            return None
        # Collect URLs
        collection = []
        # Go over URLs one by one
        for url in matches:
            url = url[1].lower()
            # If there are more than one dot then the URL contains
            # subdomain(s), which we remove
            if url.count(".") > 1:
                temp = url.split(".")
                tld = temp.pop()
                url = "".join([temp.pop(),".",tld])
            # Restrict to TLD if one is set
            if restrictToTld:
                tld = url.split(".").pop()
                if tld != restrictToTld:
                    continue
            # If only unique URLs should be returned
            if unique:
                if url not in collection:
                    collection.append(url)
            # Otherwise just add the URL to the collection
            else:
                collection.append(url)
        # Done
        return sorted(collection) if sort else collection

# Test
url = "msn.com"
print("Parent:", url)
for x in extractUrls(url):
    print("-", x)

f = open("f2.txt", "w+", 1)
f.write( x ) 
f.close()

The output is:

Parent: msn.com
- 2o7.net
- atdmt.com
- bing.com
- careerbuilder.com
- delish.com
- discoverbing.com
- discovermsn.com
- facebook.com
- foxsports.com
- foxsportsarizona.com
- foxsportssouthwest.com
- icra.org
- live.com
- microsoft.com
- msads.net
- msn.com
- msnrewards.com
- myhomemsn.com
- nbcnews.com
- northjersey.com
- outlook.com
- revsci.net
- rsac.org
- s-msn.com
- scorecardresearch.com
- skype.com
- twitter.com
- w3.org
- yardbarker.com
[Finished in 0.8s]

Only "yardbarker.com" is written to the file. I appreciate the help, thank you.

Answer 1:

url = "msn.com"
print("Parent:", url)
f = open("f2.txt", "w")
for x in extractUrls(url):
    print("-", x)
    f.write( x )
f.close()

Answer 2:

As the other answers say, the file write needs to be inside the loop. Also write a newline character \n after x, otherwise all the URLs end up on a single line:

f = open("f2.txt", "w+")
for x in extractUrls(url):
    print("-", x)
    f.write( x +'\n' ) 
f.close()

Also, the line return sorted(collection) if sort else collection is indented two levels where one would do. It still runs inside the with block, but it reads more clearly one level out.

Also, your subdomain code might not give what you expect for hosts like www.something.com.au, which it reduces to just com.au.
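To make the subdomain point concrete, here is a minimal sketch that copies just the stripping step from the question into a standalone function. The name strip_subdomains is made up for illustration and is not part of the original code; the logic keeps only the last two dot-separated labels, as the question does.

```python
def strip_subdomains(host):
    # Hypothetical helper mirroring the question's logic:
    # when there is more than one dot, keep only the last two labels.
    parts = host.split(".")
    if len(parts) > 2:
        tld = parts.pop()
        host = "".join([parts.pop(), ".", tld])
    return host

print(strip_subdomains("www.msn.com"))           # msn.com
print(strip_subdomains("www.something.com.au"))  # com.au
```

Handling two-part suffixes such as com.au correctly would need a list of known public suffixes, which is outside what the question's code attempts.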

Answer 3:

f = open("f2.txt", "w+", 1)

for x in extractUrls(url):
    print("-", x)
    f.write( x )

f.close()

Answer 4:

You need to open your file first, then write each x inside the for loop.

At the end you can close the file.

f = open("f2.txt", "w+",1)

for x in extractUrls(url):
    print("-", x)
    f.write( x ) 

f.close()
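The same open, write in the loop, close pattern can also be sketched with a with block, which closes the file automatically even if an error occurs partway through. This is only an illustration: the urls list is a hypothetical stand-in for the result of extractUrls(url), and the file goes to a temporary directory rather than the question's working directory.

```python
import os
import tempfile

# Hypothetical stand-in for the list returned by extractUrls(url)
urls = ["bing.com", "live.com", "msn.com"]

# Temporary location for this sketch; the question writes to "f2.txt"
path = os.path.join(tempfile.mkdtemp(), "f2.txt")

# The file stays open for the whole loop and is closed when the block ends
with open(path, "w") as f:
    for x in urls:
        print("-", x)
        f.write(x + "\n")  # one URL per line

# Read the file back to confirm every URL was written, not just the last
with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file once before the loop is what matters here. In the question, the write happens after the loop has finished, so x only holds the final value.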

Source: https://stackoverflow.com/questions/19419751/python-only-writes-last-line-of-output

Tags

python

file

url

file-io

output