I am building a memory-based, real-time "big data" calculation module using the pandas module in a Python environment, so response time is the key quality measure of this module.
Inspired by this question and @unutbu's answer, I wrote a parallel version of map at github. The function is suitable for embarrassingly parallel processing of a large read-only data structure on a single machine with multiple cores. The basic idea is similar to what @unutbu suggested: use a temporary global variable to hold the big data structure (e.g., a data frame) and pass its "name" rather than the variable itself to the workers. But all of this is encapsulated in a map function so that, with the help of the pathos package, it is almost a drop-in replacement for the standard map function. Example usage:
import numpy as np
import pandas as pd

# Suppose we process a big dataframe with tens of millions of rows.
size = 10**7
df = pd.DataFrame(np.random.randn(size, 4),
                  columns=['column_01', 'column_02',
                           'column_03', 'column_04'])

# Divide df into sections of 10000 rows; each section will be
# processed by one worker at a time.
section_size = 10000
sections = [range(start, start + section_size)
            for start in range(0, size, section_size)]

# The worker function that processes one section of the df.
# The key assumption is that a child process does NOT modify
# the dataframe, but does some analysis or aggregation on it
# and returns some result.
def func(section, df):
    return some_processing(df.iloc[section])

num_cores = 4

# `map` below is the parallel map described above, not the builtin.
# sections (local_args) specifies which parts of the big object to process;
# global_arg holds the big object itself, to avoid unnecessary copies;
# results is a list of objects, each of which is the processing result
# of one part of the big object (i.e., of one element of the iterable
# sections), in order.
results = map(func, sections, global_arg=df,
              chunksize=10,
              processes=num_cores)

# Reduce the results (assuming they form a list of data frames).
result = pd.concat(results)
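The global-variable trick the map function relies on can also be sketched with the standard library alone. This is not the pathos-based API above, just an illustration of the same idea (all names here are hypothetical): ship the data frame to each worker once, via a pool initializer, and send only small index ranges with each task.

```python
# A minimal standard-library sketch of the shared-big-object idea
# (illustrative names; the actual map on github uses pathos).
import multiprocessing as mp

import numpy as np
import pandas as pd

_big_df = None  # worker-local slot for the read-only DataFrame

def _init_worker(df):
    # Runs once per worker process: the DataFrame is shipped to each
    # worker a single time here, instead of once per task.
    global _big_df
    _big_df = df

def _work(section):
    # Tasks receive only a small index range and look the data up in the
    # worker-local global, so no big object travels with each task.
    return _big_df.iloc[section].sum().sum()

def parallel_sum(df, section_size=1000, processes=2):
    sections = [range(start, min(start + section_size, len(df)))
                for start in range(0, len(df), section_size)]
    with mp.Pool(processes, initializer=_init_worker,
                 initargs=(df,)) as pool:
        return sum(pool.map(_work, sections))

if __name__ == '__main__':
    small = pd.DataFrame(np.ones((5000, 4)))
    print(parallel_sum(small))  # 20000.0
```

On fork-based platforms the initializer can even be skipped by setting the global before creating the pool, so children inherit the frame copy-on-write without any pickling.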
In some of my text-mining tasks, a naive parallel implementation that passes df directly to the worker function is even slower than the single-threaded version, because of the expensive copying of the large data frame. With 4 cores, however, the implementation above gives a more-than-3x speedup on those tasks, which comes pretty close to true light-weight multi-threading.
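For contrast, the naive approach mentioned above looks roughly like this sketch (names are illustrative): each task carries its own slice of the data frame, so every slice is pickled and shipped through the task queue.

```python
# Sketch of the naive approach: the data travels with every task.
import multiprocessing as mp

import numpy as np
import pandas as pd

def _work_naive(df_slice):
    # The whole slice is serialized into the task queue; for a large
    # frame this copying can dominate and make the parallel version
    # slower than the single-threaded one.
    return df_slice.sum().sum()

def naive_parallel_sum(df, section_size=1000, processes=2):
    slices = [df.iloc[start:start + section_size]
              for start in range(0, len(df), section_size)]
    with mp.Pool(processes) as pool:
        return sum(pool.map(_work_naive, slices))

if __name__ == '__main__':
    small = pd.DataFrame(np.ones((5000, 4)))
    print(naive_parallel_sum(small))  # 20000.0
```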