Question
I'm trying to compress a .csv file without using any 3rd-party or framework-provided compression libraries.
I have tried, I'd like to think, everything. I looked at Huffman coding, but since I'm not allowed to use that solution I tried to roll my own.
An example:
6NH8,F,A,0,60541567,60541567,78.78,20
6NH8,F,A,0,60541569,60541569,78.78,25
6AH8,F,B,0,60541765,60541765,90.52,1
QMH8,F,B,0,60437395,60437395,950.5,1
I wrote an algorithm that counts every character, giving me the number of times each one has been used, and then assigns each character a number depending on how often it occurs:
',' --- 28
'5' --- 18
'6' --- 17
'0' --- 15
'7' --- 10
'8' --- 8
'4' --- 8
'1' --- 8
'9' --- 6
'.' --- 4
'3' --- 4
'\n'--- 4
'H' --- 4
'F' --- 4
'2' --- 3
'A' --- 3
'N' --- 2
'B' --- 2
'M' --- 1
'Q' --- 1
[(',', 0), ('5', 1), ('6', 2), ('0', 3), ('7', 4), ('8', 5),
('4', 6), ('1', 7), ('9', 8), ('.', 9), ('3', 10), ('\n', 11),
('H', 12), ('F', 13), ('2', 14), ('A', 15), ('N', 16), ('B', 17),
('M', 18), ('Q', 19)]
So instead of storing, for example, ord('H') = 72, I give H the value 12, and so on.
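Roughly, that counting-and-ranking step is equivalent to something like this (a simplified sketch, not my exact code, using collections.Counter from the standard library; 'input.csv' is just a placeholder name):

from collections import Counter

with open('input.csv') as f:          # placeholder file name
    text = f.read()

counts = Counter(text)                # how often each character occurs
# The most frequent character gets code 0, the next gets 1, and so on.
codes = {ch: rank for rank, (ch, _) in enumerate(counts.most_common())}
print(codes.get('H'))                 # e.g. 12 for the sample above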
But when I change all the chars to my values, the generated .csv (>40 MB) is still larger than the original (19 MB).
I even tried the alternative of splitting each list in two, i.e. turning one row into two rows:
[6NH8,F,A,0,]
[60541567,60541567,78.78,20]
But the result is still larger, even larger than my "Huffman" version.
QUESTION: Does anybody have any suggestions on how to 1. read a .csv file, 2. compress it without using a library or 3rd-party code, and 3. generate and write a smaller .csv file?
For step 2 I'm not asking for a fully worked-out solution, just suggestions on how to minimize the file, e.g. by writing each value as one list, etc.
Thank you
Answer 1:
It is unclear whether you need to create a generic compression algorithm or a custom one that works reasonably well for this kind of data.
It is also unclear whether the output should be another CSV, a string made of printable ASCII characters or plain binary data.
I'm going to assume that we're talking about a custom algorithm and a CSV output. (The same principles would apply to another output format anyway.)
It appears that your input is well formatted and always repeats the same kinds of fields:
0 '6NH8' : 4-character code
1 'F' : character
2 'A' : character
3 '0' : integer
4 '60541567' : integer \_ some kind of
5 '60541567' : integer / timestamps?
6 '78.78' : float
7 '20' : integer
Building dictionaries
See how many distinct codes are used in column #0 and how many distinct combinations of 'column #1' + 'column #2' you have.
If the same values are used frequently, then it's definitely worth building dictionaries that will be stored only once and then referenced in the compressed rows.
For instance:
column0_dictionary = [ '6NH8', '6AH8', 'QMH8' ]
column12_dictionary = [ 'FA', 'FB' ]
So, 6NH8 would be referenced as 0, 6AH8 as 1, etc. In the same way, F,A would be referenced as 0 and F,B as 1.
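As a rough sketch of this idea (assuming the rows have already been split into lists of fields, e.g. with str.split(',')):

rows = [
    ['6NH8', 'F', 'A', '0', '60541567', '60541567', '78.78', '20'],
    ['6NH8', 'F', 'A', '0', '60541569', '60541569', '78.78', '25'],
    ['6AH8', 'F', 'B', '0', '60541765', '60541765', '90.52', '1'],
    ['QMH8', 'F', 'B', '0', '60437395', '60437395', '950.5', '1'],
]

column0_dictionary = []    # distinct codes from column #0, in order of first use
column12_dictionary = []   # distinct 'column #1' + 'column #2' combinations

def ref(dictionary, value):
    # Return the index of value in the dictionary, adding it if necessary.
    if value not in dictionary:
        dictionary.append(value)
    return dictionary.index(value)

encoded = []
for row in rows:
    c0 = ref(column0_dictionary, row[0])
    c12 = ref(column12_dictionary, row[1] + row[2])
    encoded.append([str(c0), str(c12)] + row[3:])

print(column0_dictionary)   # ['6NH8', '6AH8', 'QMH8']
print(column12_dictionary)  # ['FA', 'FB']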
Encoding timestamps in a shorter format
Assuming that columns #4 and #5 are indeed timestamps, a quick win would be to store the minimum value and subtract it from the actual value in each compressed row.
minimum_timestamp = 60437395
Therefore, 60541569 becomes 60541569 - 60437395 = 104174.
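A rough sketch of this step, reusing the rows list from the dictionary sketch above and taking the minimum from column #4:

minimum_timestamp = min(int(row[4]) for row in rows)    # 60437395 for the sample
for row in rows:
    row[4] = str(int(row[4]) - minimum_timestamp)       # 60541567 -> 104172
    row[5] = str(int(row[5]) - minimum_timestamp)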
Example output
Here is what we get when applying these two simple methods to your example input:
# header
6NH8,6AH8,QMH8
FA,FB
60437395
# payload data
0,0,0,104172,104172,78.78,20
0,0,0,104174,104174,78.78,25
1,1,0,104370,104370,90.52,1
2,1,0,0,0,950.5,1
You could also store in column #5 the difference between column #5 and column #4, if it turns out that they correspond to the 'start of something' and 'end of something'.
As is, the size of the compressed payload is about 70% of the size of the original input. (Keep in mind that the size of the header should become negligible once you have many more rows.)
Your example is too short to detect any other obvious patterns for the remaining fields, but hopefully these examples will give you some ideas.
UPDATE
It turns out that the timestamps are expressed as the number of milliseconds elapsed since midnight. So they are probably evenly distributed in 0-86399999, and subtracting a minimum won't help much.
These numbers can however be encoded in a more compact manner than the ASCII representation of their decimal value.
The easiest way is to convert them to hexadecimal:
60541567 = 39BCA7F
A slightly more complicated way is to encode them in Base64:
Convert the timestamp to its 4-byte representation (all values from 0 to 86399999 fit in 4 bytes), then build a string made of the 4 corresponding characters and encode it in Base64.
For example:
60541567 = 03 9B CA 7F # in hexadecimal and big-endian order
BASE64(CHR(0x03) + CHR(0x9B) + CHR(0xCA) + CHR(0x7F)) = A5vKfw
# here without the padding characters
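For illustration, both encodings can be produced with the standard struct and base64 modules (encoding helpers, not compression libraries):

import base64
import struct

timestamp = 60541567

hex_form = format(timestamp, 'X')                  # '39BCA7F'

packed = struct.pack('>I', timestamp)              # b'\x03\x9b\xca\x7f', big-endian, 4 bytes
b64_form = base64.b64encode(packed).rstrip(b'=')   # b'A5vKfw', padding stripped
print(hex_form, b64_form.decode())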
Answer 2:
Try running your algorithm on the contents of each cell instead of on individual characters, and then create a new CSV file with the compressed cell values.
If the data you have provided is representative of the larger file, you may want to run the compression on each column separately. For example, it may only help to compress columns 0, 4 and 5 (see the sketch after the example below).
For reading and writing CSV files, check out the csv module, where you can do things like:
import csv
with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print ', '.join(row)
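Combining this with the per-column suggestion above, a rough sketch in Python 3 might look like the following (the file names and the choice of columns 0, 4 and 5 are only illustrative):

import csv

dictionaries = {0: [], 4: [], 5: []}   # one dictionary per compressed column

with open('eggs.csv', newline='') as src, open('compressed.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        for col, dictionary in dictionaries.items():
            if row[col] not in dictionary:
                dictionary.append(row[col])
            row[col] = str(dictionary.index(row[col]))
        writer.writerow(row)
# The dictionaries themselves would still need to be written out once, e.g. as a header.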
Answer 3:
For each line, search for matching substrings in the previous line or lines. For each matching substring (e.g. 6NH8,F,A,0,6054156 or ,78.78,2), send the length of the match and the distance back to copy from instead. This is called LZ77 compression.
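As a toy illustration of the idea (not an efficient or complete implementation), an LZ77-style tokenizer and its decoder could look like this:

def lz77_tokens(text, window=4096, min_match=4):
    # Greedy LZ77-style pass: emit ('copy', distance, length) when a long
    # enough match is found in the window, otherwise a ('literal', char).
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Matches may run past position i (overlap), which LZ77 allows.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(('copy', best_dist, best_len))
            i += best_len
        else:
            tokens.append(('literal', text[i]))
            i += 1
    return tokens

def lz77_decode(tokens):
    out = []
    for token in tokens:
        if token[0] == 'literal':
            out.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):
                out.append(out[-dist])   # copy one char at a time so overlaps work
    return ''.join(out)

sample = ("6NH8,F,A,0,60541567,60541567,78.78,20\n"
          "6NH8,F,A,0,60541569,60541569,78.78,25\n")
assert lz77_decode(lz77_tokens(sample)) == sample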
Source: https://stackoverflow.com/questions/38647250/csv-file-compression-without-using-existing-libraries-in-python