Question
I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting the following memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
450 infer_datetime_format=infer_datetime_format)
451
--> 452 return _read(filepath_or_buffer, kwds)
453
454 parser_f.__name__ = name
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
242 return parser
243
--> 244 return parser.read()
245
246 _parser_defaults = {
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
693 raise ValueError('skip_footer not supported for iteration')
694
--> 695 ret = self._engine.read(nrows)
696
697 if self.options.get('as_recarray'):
C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
1137
1138 try:
-> 1139 data = self._reader.read(nrows)
1140 except StopIteration:
1141 if nrows is None:
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()
C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()
MemoryError:
Any help on this??
Answer 1:
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at once. Assuming you do not need the whole dataset in memory at the same time, one way to avoid the problem is to process the CSV in chunks by specifying the chunksize parameter:
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # process() is a placeholder for whatever per-chunk work you need
The chunksize parameter specifies the number of rows per chunk. (The last chunk may, of course, contain fewer than chunksize rows.)
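If you also need a single combined result at the end, one common pattern is to reduce each chunk first and concatenate only the much smaller pieces. A minimal sketch, reusing the aphro.csv file from the question; the 'rf' column and the filter condition are only assumptions for illustration:

import pandas as pd

chunksize = 10 ** 6
pieces = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    # reduce each chunk before keeping it, so the full file never sits in memory at once
    pieces.append(chunk[chunk['rf'] > 0])   # 'rf' is an assumed column name
df = pd.concat(pieces, ignore_index=True)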
Answer 2:
Chunking shouldn't always be the first port of call for this problem.

1. Is the file large due to repeated non-numeric data or unwanted columns? If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the usecols parameter of pd.read_csv (see the sketch after this list).

2. Does your workflow require slicing, manipulating, exporting? If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

3. If all else fails, read line by line via chunks, chunking via pandas or via the csv library as a last resort.
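A rough sketch of the first two points, applied to the question's aphro.csv; the column names here are assumptions, not something the question specifies:

import pandas as pd
import dask.dataframe as dd

# 1) read only the columns you need, and store repeated text as categories
df = pd.read_csv('aphro.csv', sep=';',
                 usecols=['lat', 'long', 'rf'],   # assumed column subset
                 dtype={'lat': 'category'})        # only pays off for low-cardinality columns

# 2) or let dask chunk silently and compute lazily
ddf = dd.read_csv('aphro.csv', sep=';')
result = ddf.groupby('lat')['rf'].sum().compute()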
Answer 3:
I proceeded like this:

import pandas as pd

chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])
df = pd.DataFrame()
# aggregate each chunk and concatenate the per-chunk results
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
Answer 4:
For large data I recommend that you use the library "dask", e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
You can read more from the documentation here.
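Keep in mind that dask dataframes are lazy: operations only build a task graph until you call .compute(). A small sketch of actually materializing a result; the file path and column names are placeholders, not from the original answer:

import dask.dataframe as dd

df = dd.read_csv('large.csv')   # placeholder path
# nothing is read yet; .compute() triggers the chunked work and returns a pandas object
totals = df.groupby('user_id')['amount'].sum().compute()   # assumed column names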
Answer 5:
The answers above already cover the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had a really good experience with it, but it is missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
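A minimal sketch of that approach; treat the exact calls as an assumption about bcolz's ctable API and check its documentation before relying on them:

import bcolz
import pandas as pd

df = pd.read_csv('aphro.csv', sep=';', nrows=10 ** 6)  # an example slice of the data
ct = bcolz.ctable.fromdataframe(df)   # assumed API: compressed, in-memory column store
print(ct)                             # repr shows uncompressed vs compressed size
back = ct.todataframe()               # assumed API: back to pandas when needed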
Answer 6:
You can read the data in as chunks and save each chunk as a pickle.

import pandas as pd
import pickle

in_path = ""         # path where the large file is
out_path = ""        # path to save the pickle files to
chunk_size = 400000  # size of chunks depends on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read the pickles back in and append each one to your desired dataframe.

import glob
import pandas as pd

pickle_path = ""  # same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]), ignore_index=True)
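Appending to a DataFrame inside a loop copies the growing frame on every iteration; a slightly cheaper variant of the same idea is to collect the pickles in a list and concatenate once (same assumptions about the paths as above):

import glob
import pandas as pd

pickle_path = ""  # same path as out_path, i.e. where the pickle files are
pieces = [pd.read_pickle(name) for name in sorted(glob.glob(pickle_path + "/data_*.pkl"))]
df = pd.concat(pieces, ignore_index=True)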
Answer 7:
The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use the function read_table in your program.
import pandas as pd

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
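A hypothetical call, assuming the file actually contains the "user_id" and "type" columns used above (the file name is just a placeholder):

df = get_from_action_data("actions.csv", chunk_size=100000)
print(df.shape)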
Answer 8:
Solution 1:
Using pandas with large data
Solution 2:
import pandas as pd

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Answer 9:
You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
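A minimal sketch, assuming the sframe package exposes an SFrame.read_csv similar to pandas (treat the exact signature as an assumption and check the sframe docs):

import sframe

sf = sframe.SFrame.read_csv('aphro.csv', delimiter=';')   # out-of-core, disk-backed frame
print(sf.head())
df = sf.to_dataframe()   # convert back to pandas once the data (or a slice of it) fits in RAM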
Answer 10:
If you use pandas to read the large file in chunks and then yield them one by one, here is what I have done:

import pandas as pd

def chunk_generator(filename, header=False, chunk_size=10 ** 5):
    # yield one DataFrame chunk at a time instead of loading the whole file
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    chunk = chunk_generator(filename, header=False, chunk_size=chunk_size)
    for row in chunk:
        yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    while True:
        print(next(generator))
Answer 11:
Here follows an example:
import pandas as pd

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER:
    #1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    #2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
Answer 12:
In addition to the answers above, for those who want to process a CSV and then export to CSV, Parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')   # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')    # slow but flexible
Answer 13:
In case someone is still looking for something like this, I found that the new library called modin can help. It uses distributed computing, which can help with the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
Source: https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas