How to merge continuous lines of a csv file

北战南征 提交于 2019-12-23 04:19:13

问题


I have a csv file that carries outputs of some processes over video frames. In the file, each line is either fire or none. Each line has startTime and endTime. Now I need to cluster and print only one instance out of continuous fires with their start and end time. The point is that a few none in the middle can also be tolerated if their time is within 1 second. So to be clear, the whole point is to cluster detections of closer frames together...somehow smooth out the results. Instead of multiple 31-32, 32-33, ..., have a single line with 31-35 seconds.

How to do that?

For instance, the whole following continuous items are considered a single one since the none gaps is within 1s. So we would have something like 1,file1,name1,30.6,32.2,fire,0.83 with that score being the mean of all fire lines.

frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
...
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344

This is my attempts so far:

with open(filename) as fin:
    lastWasFire=False
    for line in fin:
        if "fire" in line:
             if lastWasFire==False and line !="" and line.split(",")[5] != lastline.split(",")[5]:
                  fout.write(line)
             else:
                lastWasFire=False
             lastline=line

回答1:


I assume you don't want to use external libraries for data processing like numpy or pandas. The following code should be quite similar to your attempt:

threshold = 1.0

# We will chain a "none" object at the end which triggers the threshold to make sure no "fire" objects are left unprinted
from itertools import chain
trigger = (",,,0,{},,none,".format(threshold + 1),)

# Keys for columns of input data
keys = (
    "frame_num",
    "uniqueId",
    "title",
    "startTime",
    "endTime",
    "startTime_fmt",
    "object",
    "score",
)

# Store last "fire" or "none" objects
last = {
    "fire": [],
    "none": [],
}

with open(filename) as f:
    # Skip first line of input file
    next(f)
    for line in chain(f, trigger):
        line = dict(zip(keys, line.split(",")))
        last[line["object"]].append(line)
        # Check threshold for "none" objects if there are previous unprinted "fire" objects
        if line["object"] == "none" and last["fire"]:
            if float(last["none"][-1]["endTime"]) - float(last["none"][0]["startTime"]) > threshold:
                print("{},{},{},{},{},{},{},{}".format(
                    last["fire"][0]["frame_num"],
                    last["fire"][0]["uniqueId"],
                    last["fire"][0]["title"],
                    last["fire"][0]["startTime"],
                    last["fire"][-1]["endTime"],
                    last["fire"][0]["startTime_fmt"],
                    last["fire"][0]["object"],
                    sum([float(x["score"]) for x in last["fire"]]) / len(last["fire"]),
                ))
                last["fire"] = []
        # Previous "none" objects don't matter anymore as soon as a "fire" object is being encountered
        if line["object"] == "fire":
            last["none"] = []

The input file is being processed line by line and "fire" objects are being accumulated in last["fire"]. They will be merged and printed if either

  • the "none" objects in last["none"] reach the threshold defined in threshold

  • or when the end of the input file is reached due to the manually chained trigger object, which is a "none" object of length threshold + 1, therefore triggering the threshold and subsequent merge and print.

You could replace print with a call to write into an output file, of course.




回答2:


This is close to what you are looking for and may be an acceptable alternative.

If your sample rate is quite stable (looks to be about 0.12s or 50 Hz) then you can find the equivalent number of samples you can tolerate to be 'none'. Let's say that's 8.

This code will read in the data and fill the 'none' values with up to 8 of the last valid value.

import numpy as np
import pandas as pd

def groups_of_true_values(x):
    """Returns array of integers where each True value in x
    is replaced by the count of the group of consecutive
    True values that it was found in.
    """
    return (np.diff(np.concatenate(([0], np.array(x, dtype=int)))) == 1).cumsum()*x 

df = pd.read_csv('test.csv', index_col=0)
# Forward-fill the 'none' values to a limit
df['filled'] = df['object'].replace('none', None).fillna(method='ffill', limit=8)

# Find the groups of consecutive fire values
df['group'] = groups_of_true_values(df['filled'] == 'fire')

# Produce sum of scores by group
group_scores = df[['group', 'score']].groupby('group').sum()  
print(group_scores)

# Find firing start and stop times
df['start'] = ((df['filled'] == 'fire') & (df['filled'].shift(1) == 'none'))
df['stop'] = ((df['filled'] == 'none') & (df['filled'].shift(1) == 'fire'))
start_times = df.loc[df['start'], 'startTime'].to_list()  
stop_times = df.loc[df['stop'], 'startTime'].to_list()
print(start_times, stop_times)

Output:

           score
group           
1      10.347362
[] []

Hopefully, the output would be more interesting if there were longer sequences of no firing...




回答3:


My approach, using pandas and groupby:

  1. Combine continuous lines of the same object (fire or none) into a spell
  2. Drop none-fire spells with duration less than 1 second
  3. Combine continuous series of spells of the same object (fire or none) into a superspell, and calculate the corresponding score

I assume the data is sorted by time (otherwise we need to add a sort after reading the data). The trick to combining continuous lines of the same object into spells/superspells is: first, identify where the new spell/superspell starts (i.e. when the object type changes), and second, assign a unique id to each spell (= the number of new spell before it)

import pandas as pd

# preparing the test data
data = '''frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344'''

with open("a.txt", 'w') as f:
    print(data, file=f)
df1 = pd.read_csv("a.txt")



# mark new spell (the start of a series of continuous lines of the same object)
# new spell if the current object is different from the previous object
df1['newspell'] = df1.object != df1.object.shift(1)

# give each spell a unique spell number (equal to the total number of new spell before it)
df1['spellnum'] = df1.newspell.cumsum()

# group lines from the same spell together
spells = df1.groupby(by=["uniqueId", "title", "spellnum", "object"]).agg(
        first_frame = ('frame_num', 'min'),
        last_frame = ('frame_num', 'max'),
        startTime = ('startTime', 'min'),
        endTime = ('endTime', 'max'),
        totalScore = ('score', 'sum'),
        cnt = ('score', 'count')).reset_index()

# remove none-fire spells with duration less than 1
spells = spells[(spells.object == 'fire') | (spells.endTime > spells.startTime + 1)]


# Now group conitnous fire spells into superspells
# mark new superspell
spells['newsuperspell'] = spells.object != spells.object.shift(1)

# give each superspell a unique number
spells['superspellnum'] = spells.newsuperspell.cumsum()

superspells = spells.groupby(by=["uniqueId", "title", "superspellnum", "object"]).agg(
        first_frame = ('first_frame', 'min'),
        last_frame = ('last_frame', 'max'),
        startTime = ('startTime', 'min'),
        endTime = ('endTime', 'max'),
        totalScore = ('totalScore', 'sum'),
        cnt = ('cnt', 'sum')).reset_index()

superspells['score'] = superspells.totalScore/superspells.cnt
superspells.drop(columns=['totalScore', 'cnt'], inplace=True)

print(superspells.to_csv(index=False))

# output
#uniqueId,title,superspellnum,object,first_frame,last_frame,startTime,endTime,score
#file1,name1,1,fire,10,23,30.6,32.2,0.8304779999999999


来源:https://stackoverflow.com/questions/59402149/how-to-merge-continuous-lines-of-a-csv-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!