问题
I've got a corrupt data frame with random header duplicates inside the data frame. How to ignore or delete these rows while loading the data frame?
Since this random header is in the data frame, pandas raise an error while loading. I would like to ignore this row while loading it with pandas. Or delete it somehow, before loading it with pandas.
The file looks like this:
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3 <- this is the random copy of the header inside the dataframe
0, 1, 1
0, 0, 0
1, 1, 1
I want:
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1
回答1:
Throw in na_filter = False
to typecast your columns into strings. Then locate all rows with bad data then filter them out your dataframe.
>>> df = pd.read_csv('sample.csv', header = 0, na_filter = False)
>>> df
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
3 col1 col2 col3
4 0 1 1
5 0 0 0
6 1 1 1
>>> type(df.iloc[0,0])
<class 'str'>
Now that you parsed your data in each column as strings, locate all col1, col2, and col3
values in your df, create a new column if you find them each column using np.where()
as such:
>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), ['Remove'], ['Don\'t remove'])
>>> df
col1 col2 col3 Tag
0 0 1 1 Don't remove
1 0 0 0 Don't remove
2 1 1 1 Don't remove
3 col1 col2 col3 Remove
4 0 1 1 Don't remove
5 0 0 0 Don't remove
6 1 1 1 Don't remove
Now, filter out the one tagged as Removed
in the Tag
column using isin()
.
>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
col1 col2 col3 Tag
0 0 1 1 Don't remove
1 0 0 0 Don't remove
2 1 1 1 Don't remove
4 0 1 1 Don't remove
5 0 0 0 Don't remove
6 1 1 1 Don't remove
Drop the Tag
column:
>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
4 0 1 1
5 0 0 0
6 1 1 1
Finally typecast your dataframe into int, if you need it to be an integer:
>>> df2 = df2.astype(int)
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
4 0 1 1
5 0 0 0
6 1 1 1
>>> type(df2['col1'][0])
<class 'numpy.int32'>
Note: If you want standard index use:
>>> df2.reset_index(inplace = True, drop = True)
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
3 0 1 1
4 0 0 0
5 1 1 1
回答2:
You just need to do the following:
Assuming df_raw
is your original dataframe with the column headers present both as the column names and repeating in several other rows, your corrected dataframe is df
.
# Filter out only the rows without the headers in them.
headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
Assumption:
- We assume that the appearance of the first column header means that row has to be dropped.
In Detail
Now a detailed code block for anyone to
- create the data,
- write it into a csv file,
- load it as a dataframe, and then
- remove rows that are headers.
import numpy as np
import pandas as pd
# make a csv file to load as dataframe
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''
# Write the data to a csv file
with open('data.csv', 'w') as f:
f.write(data)
# Load your data with header=None
df_raw = pd.read_csv('data.csv', header=None)
# Declare which row to find the header data:
# assuming the top one, we set this to zero.
header_row_number = 0
# Read in columns headers
headers = df_raw.iloc[header_row_number].tolist()
# Set new column headers
df_raw.columns = headers
# Filter out only the rows without the headers in them
# We assume that the appearance of the
# first column header means that row has to be dropped
# And reset index (and drop the old index column)
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
df
来源:https://stackoverflow.com/questions/57808795/clean-wrong-header-inside-dataframe-with-python-pandas