joblib

Load and predict new data sklearn

Submitted by 一世执手 on 2019-12-03 21:15:23
I trained a logistic regression model, cross-validated it, and saved it to a file using the joblib module. Now I want to load this model and predict new data with it. Is this the correct way to do it? Especially the standardization: should I call scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here. Here is my code:

# Loading the saved model with joblib
model = joblib.load('model.pkl')

# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]

# Standardize new data
scaler = StandardScaler()
X …
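For reference, the usual answer is that the scaler must never be refit on the prediction data: the StandardScaler fitted on the training set has to be persisted along with the model, and only its transform step applied at prediction time. A minimal sketch, using a Pipeline so both are saved in one file (the synthetic data and the file name model.pkl are illustrative, not taken from the question):

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training time: bundle scaling and the classifier so the fitted scaler is saved too.
X_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, 100)
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression())])
pipeline.fit(X_train, y_train)          # the scaler is fit on the training data only
joblib.dump(pipeline, 'model.pkl')

# Prediction time: load and predict; the pipeline applies scaler.transform()
# internally, so the new data never refits the scaler.
model = joblib.load('model.pkl')
new_data = np.random.rand(10, 3)
predictions = model.predict(new_data)
print(predictions)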

Reusing model fitted by cross_val_score in sklearn using joblib

Submitted by 一世执手 on 2019-12-03 15:48:51
I created the following function in Python:

def cross_validate(algorithms, data, labels, cv=4, n_jobs=-1):
    print "Cross validation using: "
    for alg, predictors in algorithms:
        print alg
        print
        # Compute the accuracy score for all the cross validation folds.
        scores = cross_val_score(alg, data, labels, cv=cv, n_jobs=n_jobs)
        # Take the mean of the scores (because we have one for each fold)
        print scores
        print("Cross validation mean score = " + str(scores.mean()))
        name = re.split('\(', str(alg))
        filename = str('%0.5f' % scores.mean()) + "_" + name[0] + ".pkl"
        # We might use this another time
        joblib …
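As background, cross_val_score only returns the per-fold scores; the estimators it fits internally are discarded, so there is nothing from that call to dump with joblib. The usual options are to refit the estimator on the full data before saving, or (in newer scikit-learn) to use cross_validate(..., return_estimator=True). A minimal sketch of the refit-and-save route, with synthetic data as a stand-in:

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression()

# Scores only: the models fitted inside cross_val_score are thrown away.
scores = cross_val_score(clf, X, y, cv=4, n_jobs=-1)
print("Cross validation mean score =", scores.mean())

# To keep a reusable model, refit on the full data set and persist that object.
clf.fit(X, y)
joblib.dump(clf, "%0.5f_LogisticRegression.pkl" % scores.mean())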

Memoizing SQL queries

Submitted by 帅比萌擦擦* on 2019-12-03 13:36:02
Say I have a function that runs a SQL query and returns a dataframe:

import pandas.io.sql as psql
import sqlalchemy

query_string = "select a from table;"

def run_my_query(my_query):
    # username, host, port and database are hard-coded here
    engine = sqlalchemy.create_engine(
        'postgresql://{username}@{host}:{port}/{database}'.format(
            username=username, host=host, port=port, database=database))
    df = psql.read_sql(my_query, engine)
    return df

# Run the query (this is what I want to memoize)
df = run_my_query(my_query)

I would like to be able to memoize my query above, with one cache entry per value of …
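One way to get this behaviour, sketched below, is joblib.Memory, which persists results to disk keyed on the function's arguments, so each distinct query string gets its own cache entry. The cache directory and connection URL are placeholders, and the actual database call is left commented out:

import joblib
import pandas as pd
import sqlalchemy

# Persistent on-disk cache; the directory name is an arbitrary choice.
memory = joblib.Memory('./sql_cache', verbose=0)

@memory.cache
def run_my_query(my_query, connection_url):
    # Both arguments are hashed to form the cache key, so each distinct
    # query string (and connection) gets its own cache entry.
    engine = sqlalchemy.create_engine(connection_url)
    return pd.read_sql(my_query, engine)

# First call hits the database; later calls with the same arguments return
# the cached DataFrame from disk.
# df = run_my_query("select a from table;", "postgresql://user@host:5432/db")

# The cache can be invalidated explicitly when the underlying table changes:
# run_my_query.clear()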

Writing a parallel loop

Submitted by 二次信任 on 2019-12-03 10:37:42
I am trying to run a parallel loop on a simple example. What am I doing wrong?

from joblib import Parallel, delayed
import multiprocessing

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs)
    print(results)

The problem with the code is that, when executed under a Windows environment in Python 3, it opens num_cores instances of Python to execute the parallel …
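For context, seeing several Python processes start is expected: those are the worker processes joblib spawns. On Windows, where workers re-import the main module under the spawn start method, the script only behaves correctly if the Parallel call sits behind an __main__ guard and the worker function is defined at module level. A minimal sketch along those lines, with a small input range for illustration:

# parallel_example.py
from joblib import Parallel, delayed
import multiprocessing

def process_input(i):
    # Defined at module level so Windows worker processes can find it
    # after they re-import this file.
    return i * i

if __name__ == '__main__':
    # The guard prevents each spawned worker from re-running the Parallel
    # call itself when it re-imports the module on Windows.
    inputs = range(10)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(process_input)(i) for i in inputs)
    print(results)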

Multiple processes sharing a single Joblib cache

Submitted by 旧巷老猫 on 2019-12-03 08:09:14
I'm using Joblib to cache the results of a computationally expensive function in my Python script. The function's input arguments and return values are numpy arrays. The cache works fine for a single run of the script. Now I want to spawn multiple runs of the script in parallel to sweep a parameter in an experiment (the definition of the function remains the same across all runs). Is there a way to share the joblib cache among multiple Python scripts running in parallel? This would save a lot of function evaluations that are repeated across different runs but do not repeat …
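One approach worth considering, sketched below under the assumption that joblib's disk-based Memory store behaves well when several processes point at the same directory, is simply to give every run the same cache location; results written by one run are then found on disk by the others. The directory path and the toy function are illustrative:

import joblib
import numpy as np

# All runs of the script point at the same on-disk location, so entries
# computed by one run are visible to the others.
memory = joblib.Memory('/tmp/shared_joblib_cache', verbose=0)

@memory.cache
def expensive_function(x):
    # Stand-in for the real computation; inputs and outputs are numpy arrays.
    return np.linalg.matrix_power(x, 10)

if __name__ == '__main__':
    result = expensive_function(np.eye(100))
    print(result.shape)

Whether two runs that compute the same entry at the same moment interact safely is worth checking against the joblib documentation for the version in use.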

Large Pandas Dataframe parallel processing

Submitted by 亡梦爱人 on 2019-12-03 06:59:39
I am accessing a very large Pandas DataFrame as a global variable. This variable is accessed in parallel via joblib, e.g.:

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, …
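The copying is consistent with how joblib works: workers are separate processes, so the DataFrame the task touches has to be serialized to each of them even though nothing is modified. A common mitigation, sketched here with made-up data, is to stop relying on the global and instead pass each worker only the chunk it needs, so the full frame is not shipped once per task:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Illustrative stand-in for the large query result.
df = pd.DataFrame({'id': range(1000),
                   'a_lot_of_data': np.random.rand(1000)})

def process(rows):
    # Works on a chunk passed in explicitly, so each worker only receives
    # (and unpickles) its own slice of the data.
    return rows['a_lot_of_data'].sum()

n_jobs = 8
chunk_size = int(np.ceil(len(df) / n_jobs))
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
results = Parallel(n_jobs=n_jobs)(delayed(process)(chunk) for chunk in chunks)
print(sum(results))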

How can we use tqdm in a parallel execution with joblib?

Submitted by 老子叫甜甜 on 2019-12-03 04:41:53
I want to run a function in parallel and wait until all parallel workers are done, using joblib, as in this example:

from math import sqrt
from joblib import Parallel, delayed

Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

But I want the execution to be shown in a single progress bar, like with tqdm, indicating how many jobs have been completed. How would you do that?

If your problem consists of many parts, you could split the parts into k subgroups, run each subgroup in parallel, and update the progress bar in between, resulting in k updates of the progress. This is demonstrated …
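A minimal sketch of the batched idea described above: split the inputs into contiguous batches, run each batch through Parallel, and advance a single tqdm bar between batches (the batch size here is an arbitrary choice):

from math import sqrt
from joblib import Parallel, delayed
from tqdm import tqdm

inputs = list(range(10))
batch_size = 2
batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]

results = []
with tqdm(total=len(inputs)) as pbar:
    for batch in batches:
        # Run one subgroup in parallel, then advance the bar by its size.
        results += Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in batch)
        pbar.update(len(batch))

print(results)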

How to save a scikit-learn pipeline with a Keras regressor inside to disk?

Submitted by こ雲淡風輕ζ on 2019-12-03 02:39:37
I have a scikit-learn pipeline with a KerasRegressor in it:

estimators = [
    ('standardize', StandardScaler()),
    ('mlp', KerasRegressor(build_fn=baseline_model, nb_epoch=5, batch_size=1000, verbose=1))
]
pipeline = Pipeline(estimators)

After training the pipeline, I am trying to save it to disk using joblib:

joblib.dump(pipeline, filename, compress=9)

But I am getting an error:

RuntimeError: maximum recursion depth exceeded

How would you save the pipeline to disk?

I struggled with the same problem, as there is no direct way to do this. Here is a hack which worked for me: I saved my pipeline into …
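The commonly suggested workaround, sketched below, is to serialize the Keras model with Keras' own saver and pickle the rest of the pipeline separately. This assumes the wrapper keeps the fitted network on its .model attribute (true for the classic keras.wrappers.scikit_learn.KerasRegressor) and that pipeline is the fitted object from the snippet above; file names are illustrative:

import joblib
from keras.models import load_model  # adjust the import to your Keras installation

# 1. Save the underlying Keras network with Keras' own serializer.
pipeline.named_steps['mlp'].model.save('keras_model.h5')

# 2. Detach the Keras network so the remaining pipeline pickles without recursion issues.
pipeline.named_steps['mlp'].model = None
joblib.dump(pipeline, 'pipeline.pkl')

# 3. To restore, load both pieces and re-attach the network.
restored = joblib.load('pipeline.pkl')
restored.named_steps['mlp'].model = load_model('keras_model.h5')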

Parallelizing four nested loops in Python

Submitted by 与世无争的帅哥 on 2019-12-02 21:16:53
I have a fairly straightforward nested for loop that iterates over four arrays:

for a in a_grid:
    for b in b_grid:
        for c in c_grid:
            for d in d_grid:
                do_some_stuff(a, b, c, d)  # perform calculations and write to file

Maybe this isn't the most efficient way to perform calculations over a 4D grid to begin with. I know joblib is capable of parallelizing two nested for loops like this, but I'm having trouble generalizing it to four nested loops. Any ideas?

I usually use code of this form:

#!/usr/bin/env python3
import itertools
import multiprocessing

# Generate values for each parameter
a = range(10)
b …
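One way to generalize, sketched below with toy grids and a placeholder function, is to flatten the 4D grid into a single iterable with itertools.product and hand that one flat loop to joblib, rather than trying to nest Parallel calls:

import itertools
from joblib import Parallel, delayed

def do_some_stuff(a, b, c, d):
    # Stand-in for the real per-grid-point calculation.
    return a * b + c * d

# Flatten the 4D grid into one sequence of (a, b, c, d) tuples, then
# parallelize that single flat loop.
a_grid = range(4)
b_grid = range(4)
c_grid = range(4)
d_grid = range(4)

results = Parallel(n_jobs=-1)(
    delayed(do_some_stuff)(a, b, c, d)
    for a, b, c, d in itertools.product(a_grid, b_grid, c_grid, d_grid)
)
print(len(results))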