How to sort a text file line-by-line

Question

I need to sort a text file in ascending order. Each line of the text file starts with an index, as seen below:

2       0       4         0d 07:00:38.0400009155273
3       0       4         0d 07:00:38.0400009155273
1       0       4         0d 07:00:38.0400009155273

The ideal result would be as follows:

1       0       4         0d 07:00:38.0400009155273
2       0       4         0d 07:00:38.0400009155273
3       0       4         0d 07:00:38.0400009155273

Please note that this text file has more than 3 million rows, and each element is read as a string by default.

I've been messing around with this for some time now without any luck, so I figured it was time to consult the experts. Thank you for your time!

EDIT:

I'm using Windows with Python 3.7 in the Spyder IDE. The file is not a CSV; it's a tab-delimited text file. It is possible that not all indices are present. Forgive the noob-ness, I don't have a lot of coding experience.

Answer 1:

fn = 'filename.txt'
sorted_fn = 'sorted_filename.txt'

with open(fn,'r') as first_file:
    rows = first_file.readlines()
    sorted_rows = sorted(rows, key=lambda x: int(x.split()[0]), reverse=False)
    with open(sorted_fn,'w') as second_file:
        for row in sorted_rows:
            second_file.write(row)

This should work for a text file of 3+ million rows. Using int(x.split()[0]) as the key sorts the rows by the first item in each row, compared as an integer rather than as a string.
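
To see why the integer conversion matters, here is a small illustration. The sample rows and the two helper function names are made up for the example; the point is only that a plain string comparison puts "10" before "2", while the int key gives numeric order.

```python
# Hypothetical sample rows, tab-delimited like the file in the question.
rows = ["10\t0\t4\n", "2\t0\t4\n", "1\t0\t4\n"]


def first_field_as_text(row):
    """Key that compares the leading index as a string."""
    return row.split()[0]


def first_field_as_int(row):
    """Key that compares the leading index as an integer."""
    return int(row.split()[0])


by_text = sorted(rows, key=first_field_as_text)
by_number = sorted(rows, key=first_field_as_int)

print([r.split()[0] for r in by_text])    # ['1', '10', '2']
print([r.split()[0] for r in by_number])  # ['1', '2', '10']
```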

Edited to remove close() statements

Answer 2:

I would go about this by reading the file into lines, splitting them on whitespace and then sorting them according to a custom key; i.e., if your file were called "foo.txt":

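with open("foo.txt") as file:
    lines = file.readlines()

# sorted() returns a new list and leaves its argument untouched,
# so sort in place to make sure lines itself ends up sorted
lines.sort(key=lambda line: int(line.split()[0]))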

After that, lines should contain all lines sorted by the first column.

However, I don't know how well this would work given your file size. If the file does not fit in memory, you may have to split its contents into chunks, sort each chunk on its own, and then merge the sorted chunks.
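
The chunked approach could be sketched roughly as follows. This is only one possible way to do it, not something from the answer above: the names sort_large_file, line_key, and chunk_size are hypothetical, and the sketch assumes every line starts with an integer index (a blank line would raise an error). Each chunk is sorted in memory and written to a temporary file, then heapq.merge combines the sorted chunk files lazily.

```python
import heapq
import os
import tempfile
from itertools import islice


def line_key(line):
    """Sort key: the leading index of a line, as an integer."""
    return int(line.split(None, 1)[0])


def sort_large_file(in_path, out_path, chunk_size=500_000):
    """Sort in_path by its first column, holding about chunk_size lines at a time."""
    chunk_paths = []
    try:
        # Pass 1: sort each chunk in memory and write it to a temporary file.
        with open(in_path) as src:
            while True:
                chunk = list(islice(src, chunk_size))
                if not chunk:
                    break
                # Make sure the last line ends with a newline so that
                # merged output keeps one row per line.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort(key=line_key)
                with tempfile.NamedTemporaryFile(
                    "w", delete=False, suffix=".chunk"
                ) as tmp:
                    tmp.writelines(chunk)
                    chunk_paths.append(tmp.name)

        # Pass 2: lazily merge the already-sorted chunk files.
        chunk_files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*chunk_files, key=line_key))
        finally:
            for f in chunk_files:
                f.close()
    finally:
        # Clean up the temporary chunk files.
        for p in chunk_paths:
            os.remove(p)
```

For a file of about 3 million short rows, reading everything at once is likely to fit in memory on most machines, so this is mainly worth the extra complexity when the simple version runs out of memory.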

Answer 3:

Use pandas; it will help you immensely. Assuming the file is a CSV, do the following:

import pandas as pd
# index_col names the column to use as the index
df = pd.read_csv('to/file', sep='\t', index_col='Name of column with index')  # Guessing that your file is tab separated
df.sort_index(inplace=True)

Now you have a DataFrame with all of the information you need, sorted by its index. I'd suggest digging into pandas, since it will really help you out. Here is a link to get started: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Answer 4:

I would split each line on whitespace and format the data into a dictionary that looks like:

my_data = {
 2: ['0', '4', '0d', '07:00:38.0400009155273'],
 3: ['0', '4', '0d', '07:00:38.0400009155273'],
 1: ['0', '4', '0d', '07:00:38.0400009155273']
}

Which you could then iterate through (assuming all keys exist) like:

for i in range(1, max(list(my_data.keys())) + 1):
    pass # do some computation

Additionally, you could single out a specific value with an expression like my_data[1].

To be able to put your data in this form I would use the script:

with open("foo.txt", "r") as file:
    in_data = file.readlines()

my_data = {}
for data in in_data:
    # split() with no argument handles both spaces and tabs
    split_info = data.split()
    # convert the index to int so the keys sort numerically, not as strings
    my_data[int(split_info[0])] = split_info[1:]

for key in sorted(my_data):
    print("{}: {}".format(key, my_data[key]))

Which prints:

1: ['0', '4', '0d', '07:00:38.0400009155273']
2: ['0', '4', '0d', '07:00:38.0400009155273']
3: ['0', '4', '0d', '07:00:38.0400009155273']
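
Since the question notes that some indices may be missing, looking up my_data[i] inside a loop over range(1, max + 1) would raise a KeyError at the gaps. A minimal sketch of two ways around that, using made-up data with integer keys (both the data and the variable names present and missing are assumptions for illustration):

```python
# Hypothetical data with index 2 missing; keys are assumed to be integers.
my_data = {
    4: ['0', '4', '0d', '07:00:38.0400009155273'],
    1: ['0', '4', '0d', '07:00:38.0400009155273'],
    3: ['0', '4', '0d', '07:00:38.0400009155273'],
}

# Option 1: visit only the indices that exist, in ascending order.
present = sorted(my_data)
for index in present:
    values = my_data[index]  # do some computation with values

# Option 2: walk the full range and record the gaps explicitly.
missing = [i for i in range(1, max(my_data) + 1) if i not in my_data]

print(present)  # [1, 3, 4]
print(missing)  # [2]
```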

Answer 5:

Here's an edited version of a perfectly good answer you already have. The edits might be useful as you learn more about coding. The key points:

When writing a program, it's often best to do your coding with a small sample of the input data (for example, a file with 30 rows rather than 3 million): your program will run quicker, and debugging output will be smaller and more readable. Thus, rather than hard-coding the path to the input file (or other files), take those file paths as command-line parameters, using sys.argv.
```
import sys

in_path = sys.argv[1]
out_path = sys.argv[2]
```
If you are holding a lot of data in memory (enough to make you think you are close to your machine's limits), don't create unneeded copies of the data. For example, to ignore the first few lines, don't store the original lines in rows and then get the desired values using rows[2:]: that creates a new list. Instead add the conditional logic to your initial creation of rows (the example uses a list comprehension, but you can do the same thing in a regular for loop). And if you need to sort that data, don't use sorted(), which creates a new list; instead, sort the list in place, with rows.sort().
```
with open(in_path, 'r') as fh:
    rows = [line for i, line in enumerate(fh) if i > 1]
    rows.sort(key = lambda x: int(x.split(None, 1)[0]))
```
There's no reason to nest the writing with-block inside the reading with-block. If you don't have a good reason to connect two different tasks within a program, explicitly separate them. This is among the most important keys to writing better software.
```
with open(out_path, 'w') as fh:
    for r in rows:
        fh.write(r)
```
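
Put together, the three pieces might look like the sketch below. The function name sort_file and the skip_lines parameter are illustrative choices, not part of the snippets above. The default of 0 keeps every line, since the sample file in the question has no header, whereas the list comprehension above drops the first two lines.

```python
import sys


def sort_file(in_path, out_path, skip_lines=0):
    """Sort the lines of in_path by their leading integer; write them to out_path."""
    # Read: build the list once, dropping any leading lines while reading.
    with open(in_path, 'r') as fh:
        rows = [line for i, line in enumerate(fh) if i >= skip_lines]

    # Sort in place to avoid holding a second copy of the list.
    rows.sort(key=lambda x: int(x.split(None, 1)[0]))

    # Write: a separate step with its own file handle.
    with open(out_path, 'w') as fh:
        fh.writelines(rows)


# Only run as a script when both paths were given on the command line.
if __name__ == '__main__' and len(sys.argv) == 3:
    sort_file(sys.argv[1], sys.argv[2])
```

Saved under a name such as sort_file.py (again hypothetical), it could be run as `python sort_file.py small_sample.txt sorted_sample.txt` while testing, and pointed at the full file afterwards.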

Answer 6:

A one-stop solution would be to do the reading, sorting, and writing all with one file handle, which the 'r+' mode makes possible:

with open('your_file.txt', 'r+') as f:
    # split() with no argument also handles tab-delimited lines
    sorted_contents = ''.join(sorted(f.readlines(), key=lambda x: int(x.split()[0])))
    f.seek(0)      # go back to the start of the file
    f.truncate()   # discard the old contents
    f.write(sorted_contents)
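
One thing to keep in mind with overwriting the same file is that an interruption between truncate() and the end of write() would leave the file incomplete. If that is a concern, a possible variation (an assumption on my part, not part of the answer above) is to write the sorted rows to a temporary file in the same directory and then swap it in with os.replace. The function name sort_in_place is hypothetical.

```python
import os
import tempfile


def sort_in_place(path):
    """Replace path with a copy of itself sorted by each line's leading integer."""
    with open(path, 'r') as f:
        rows = f.readlines()
    rows.sort(key=lambda x: int(x.split()[0]))

    # Write to a temporary file next to the original, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile('w', dir=directory, delete=False) as tmp:
        tmp.writelines(rows)
    os.replace(tmp.name, path)
```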

Source: https://stackoverflow.com/questions/56120633/how-to-sort-a-text-file-line-by-line

Tags

python

python-3.x

file

sorting