joblib

Load and predict new data sklearn

Submitted by 一世执手 on 2019-12-03 21:15:23
I trained a logistic regression model, cross-validated it, and saved it to a file using the joblib module. Now I want to load this model and predict new data with it. Is this the correct way to do it? Especially the standardization: should I call scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here. Here is my code:

# Loading the saved model with joblib
model = joblib.load('model.pkl')

# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]

# Standardize new data
scaler = StandardScaler()
X …
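For reference, the usual answer is that the scaler must never be refit on the prediction data: the StandardScaler fitted on the training set has to be persisted along with the model, and only its transform step applied at prediction time. A minimal sketch, using a Pipeline so both are saved in one file (the synthetic data and the file name model.pkl are illustrative, not taken from the question):

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training time: bundle scaling and the classifier so the fitted scaler is saved too.
X_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, 100)
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression())])
pipeline.fit(X_train, y_train)          # the scaler is fit on the training data only
joblib.dump(pipeline, 'model.pkl')

# Prediction time: load and predict; the pipeline applies scaler.transform()
# internally, so the new data never refits the scaler.
model = joblib.load('model.pkl')
new_data = np.random.rand(10, 3)
predictions = model.predict(new_data)
print(predictions)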

Reusing model fitted by cross_val_score in sklearn using joblib

Submitted by 一世执手 on 2019-12-03 15:48:51
I created the following function in Python:

def cross_validate(algorithms, data, labels, cv=4, n_jobs=-1):
    print "Cross validation using: "
    for alg, predictors in algorithms:
        print alg
        print
        # Compute the accuracy score for all the cross validation folds.
        scores = cross_val_score(alg, data, labels, cv=cv, n_jobs=n_jobs)
        # Take the mean of the scores (because we have one for each fold)
        print scores
        print("Cross validation mean score = " + str(scores.mean()))
        name = re.split('\(', str(alg))
        filename = str('%0.5f' % scores.mean()) + "_" + name[0] + ".pkl"
        # We might use this another time
        joblib …
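As background, cross_val_score only returns the per-fold scores; the estimators it fits internally are discarded, so there is nothing from that call to dump with joblib. The usual options are to refit the estimator on the full data before saving, or (in newer scikit-learn) to use cross_validate(..., return_estimator=True). A minimal sketch of the refit-and-save route, with synthetic data as a stand-in:

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression()

# Scores only: the models fitted inside cross_val_score are thrown away.
scores = cross_val_score(clf, X, y, cv=4, n_jobs=-1)
print("Cross validation mean score =", scores.mean())

# To keep a reusable model, refit on the full data set and persist that object.
clf.fit(X, y)
joblib.dump(clf, "%0.5f_LogisticRegression.pkl" % scores.mean())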

Memoizing SQL queries

Submitted by 帅比萌擦擦* on 2019-12-03 13:36:02
Say I have a function that runs a SQL query and returns a dataframe:

import pandas.io.sql as psql
import sqlalchemy

query_string = "select a from table;"

def run_my_query(my_query):
    # username, host, port and database are hard-coded here
    engine = sqlalchemy.create_engine(
        'postgresql://{username}@{host}:{port}/{database}'.format(
            username=username, host=host, port=port, database=database))
    df = psql.read_sql(my_query, engine)
    return df

# Run the query (this is what I want to memoize)
df = run_my_query(my_query)

I would like to be able to memoize my query above, with one cache entry per value of …
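One way to get this behaviour, sketched below, is joblib.Memory, which persists results to disk keyed on the function's arguments, so each distinct query string gets its own cache entry. The cache directory and connection URL are placeholders, and the actual database call is left commented out:

import joblib
import pandas as pd
import sqlalchemy

# Persistent on-disk cache; the directory name is an arbitrary choice.
memory = joblib.Memory('./sql_cache', verbose=0)

@memory.cache
def run_my_query(my_query, connection_url):
    # Both arguments are hashed to form the cache key, so each distinct
    # query string (and connection) gets its own cache entry.
    engine = sqlalchemy.create_engine(connection_url)
    return pd.read_sql(my_query, engine)

# First call hits the database; later calls with the same arguments return
# the cached DataFrame from disk.
# df = run_my_query("select a from table;", "postgresql://user@host:5432/db")

# The cache can be invalidated explicitly when the underlying table changes:
# run_my_query.clear()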

Writing a parallel loop

Submitted by 二次信任 on 2019-12-03 10:37:42
I am trying to run a parallel loop on a simple example. What am I doing wrong?

from joblib import Parallel, delayed
import multiprocessing

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs)
    print(results)

The problem with the code is that, when executed under a Windows environment in Python 3, it opens num_cores instances of Python to execute the parallel …
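For context, seeing several Python processes start is expected: those are the worker processes joblib spawns. On Windows, where workers re-import the main module under the spawn start method, the script only behaves correctly if the Parallel call sits behind an __main__ guard and the worker function is defined at module level. A minimal sketch along those lines, with a small input range for illustration:

# parallel_example.py
from joblib import Parallel, delayed
import multiprocessing

def process_input(i):
    # Defined at module level so Windows worker processes can find it
    # after they re-import this file.
    return i * i

if __name__ == '__main__':
    # The guard prevents each spawned worker from re-running the Parallel
    # call itself when it re-imports the module on Windows.
    inputs = range(10)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(process_input)(i) for i in inputs)
    print(results)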

Multiple processes sharing a single Joblib cache

Submitted by 旧巷老猫 on 2019-12-03 08:09:14
I'm using Joblib to cache the results of a computationally expensive function in my Python script. The function's input arguments and return values are numpy arrays. The cache works fine for a single run of the script. Now I want to spawn multiple runs of the script in parallel to sweep a parameter in an experiment (the definition of the function remains the same across all runs). Is there a way to share the joblib cache among multiple Python scripts running in parallel? This would save a lot of function evaluations that are repeated across different runs but do not repeat …
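One approach worth considering, sketched below under the assumption that joblib's disk-based Memory store behaves well when several processes point at the same directory, is simply to give every run the same cache location; results written by one run are then found on disk by the others. The directory path and the toy function are illustrative:

import joblib
import numpy as np

# All runs of the script point at the same on-disk location, so entries
# computed by one run are visible to the others.
memory = joblib.Memory('/tmp/shared_joblib_cache', verbose=0)

@memory.cache
def expensive_function(x):
    # Stand-in for the real computation; inputs and outputs are numpy arrays.
    return np.linalg.matrix_power(x, 10)

if __name__ == '__main__':
    result = expensive_function(np.eye(100))
    print(result.shape)

Whether two runs that compute the same entry at the same moment interact safely is worth checking against the joblib documentation for the version in use.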

Large Pandas Dataframe parallel processing

Submitted by 亡梦爱人 on 2019-12-03 06:59:39
I am accessing a very large Pandas DataFrame as a global variable. This variable is accessed in parallel via joblib, e.g.:

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, …
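The copying is consistent with how joblib works: workers are separate processes, so the DataFrame the task touches has to be serialized to each of them even though nothing is modified. A common mitigation, sketched here with made-up data, is to stop relying on the global and instead pass each worker only the chunk it needs, so the full frame is not shipped once per task:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Illustrative stand-in for the large query result.
df = pd.DataFrame({'id': range(1000),
                   'a_lot_of_data': np.random.rand(1000)})

def process(rows):
    # Works on a chunk passed in explicitly, so each worker only receives
    # (and unpickles) its own slice of the data.
    return rows['a_lot_of_data'].sum()

n_jobs = 8
chunk_size = int(np.ceil(len(df) / n_jobs))
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
results = Parallel(n_jobs=n_jobs)(delayed(process)(chunk) for chunk in chunks)
print(sum(results))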

How can we use tqdm in a parallel execution with joblib?

Submitted by 老子叫甜甜 on 2019-12-03 04:41:53
I want to run a function in parallel and wait until all parallel workers are done, using joblib, as in this example:

from math import sqrt
from joblib import Parallel, delayed

Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

But I want the execution to be shown in a single progress bar, like with tqdm, indicating how many jobs have been completed. How would you do that?

If your problem consists of many parts, you could split the parts into k subgroups, run each subgroup in parallel, and update the progress bar in between, resulting in k updates of the progress. This is demonstrated …
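A minimal sketch of the batched idea described above: split the inputs into contiguous batches, run each batch through Parallel, and advance a single tqdm bar between batches (the batch size here is an arbitrary choice):

from math import sqrt
from joblib import Parallel, delayed
from tqdm import tqdm

inputs = list(range(10))
batch_size = 2
batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]

results = []
with tqdm(total=len(inputs)) as pbar:
    for batch in batches:
        # Run one subgroup in parallel, then advance the bar by its size.
        results += Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in batch)
        pbar.update(len(batch))

print(results)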

How to save a scikit-learn pipeline with a Keras regressor inside to disk?

Submitted by こ雲淡風輕ζ on 2019-12-03 02:39:37
I have a scikit-learn pipeline with a KerasRegressor in it:

estimators = [
    ('standardize', StandardScaler()),
    ('mlp', KerasRegressor(build_fn=baseline_model, nb_epoch=5, batch_size=1000, verbose=1))
]
pipeline = Pipeline(estimators)

After training the pipeline, I am trying to save it to disk using joblib:

joblib.dump(pipeline, filename, compress=9)

But I am getting an error:

RuntimeError: maximum recursion depth exceeded

How would you save the pipeline to disk?

I struggled with the same problem, as there is no direct way to do this. Here is a hack which worked for me: I saved my pipeline into …
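The commonly suggested workaround, sketched below, is to serialize the Keras model with Keras' own saver and pickle the rest of the pipeline separately. This assumes the wrapper keeps the fitted network on its .model attribute (true for the classic keras.wrappers.scikit_learn.KerasRegressor) and that pipeline is the fitted object from the snippet above; file names are illustrative:

import joblib
from keras.models import load_model  # adjust the import to your Keras installation

# 1. Save the underlying Keras network with Keras' own serializer.
pipeline.named_steps['mlp'].model.save('keras_model.h5')

# 2. Detach the Keras network so the remaining pipeline pickles without recursion issues.
pipeline.named_steps['mlp'].model = None
joblib.dump(pipeline, 'pipeline.pkl')

# 3. To restore, load both pieces and re-attach the network.
restored = joblib.load('pipeline.pkl')
restored.named_steps['mlp'].model = load_model('keras_model.h5')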

Parallelizing four nested loops in Python

Submitted by 与世无争的帅哥 on 2019-12-02 21:16:53
I have a fairly straightforward nested for loop that iterates over four arrays:

for a in a_grid:
    for b in b_grid:
        for c in c_grid:
            for d in d_grid:
                do_some_stuff(a, b, c, d)  # perform calculations and write to file

Maybe this isn't the most efficient way to perform calculations over a 4D grid to begin with. I know joblib is capable of parallelizing two nested for loops like this, but I'm having trouble generalizing it to four nested loops. Any ideas?

I usually use code of this form:

#!/usr/bin/env python3
import itertools
import multiprocessing

# Generate values for each parameter
a = range(10)
b …
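One way to generalize, sketched below with toy grids and a placeholder function, is to flatten the 4D grid into a single iterable with itertools.product and hand that one flat loop to joblib, rather than trying to nest Parallel calls:

import itertools
from joblib import Parallel, delayed

def do_some_stuff(a, b, c, d):
    # Stand-in for the real per-grid-point calculation.
    return a * b + c * d

# Flatten the 4D grid into one sequence of (a, b, c, d) tuples, then
# parallelize that single flat loop.
a_grid = range(4)
b_grid = range(4)
c_grid = range(4)
d_grid = range(4)

results = Parallel(n_jobs=-1)(
    delayed(do_some_stuff)(a, b, c, d)
    for a, b, c, d in itertools.product(a_grid, b_grid, c_grid, d_grid)
)
print(len(results))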