joblib

Parallelizing four nested loops in Python

夙愿已清 submitted on 2019-12-20 10:25:32
Question: I have a fairly straightforward nested for loop that iterates over four arrays:

    for a in a_grid:
        for b in b_grid:
            for c in c_grid:
                for d in d_grid:
                    do_some_stuff(a, b, c, d)  # perform calculations and write to file

Maybe this isn't the most efficient way to perform calculations over a 4D grid to begin with. I know joblib is capable of parallelizing two nested for loops like this, but I'm having trouble generalizing it to four nested loops. Any ideas?

Answer 1: I usually use code of this form: #!/usr
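For reference, a minimal sketch of one way to parallelize such a grid (not necessarily the answerer's code, which is cut off above): flatten the four loops into a single stream of tuples with itertools.product and hand that to joblib.Parallel. The grid contents and do_some_stuff below are placeholders.

    from itertools import product
    from joblib import Parallel, delayed

    def do_some_stuff(a, b, c, d):
        # stand-in for the real calculation / file write
        return a + b + c + d

    a_grid = b_grid = c_grid = d_grid = range(10)  # hypothetical grids

    # product() turns the four nested loops into one flat iterable of (a, b, c, d) tuples,
    # so a single Parallel call can distribute every combination across workers.
    results = Parallel(n_jobs=-1)(
        delayed(do_some_stuff)(a, b, c, d)
        for a, b, c, d in product(a_grid, b_grid, c_grid, d_grid)
    )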

How to write to a shared variable in python joblib

柔情痞子 submitted on 2019-12-17 20:34:32
Question: The following code parallelizes a for-loop.

    import networkx as nx
    import numpy as np
    from joblib import Parallel, delayed
    import multiprocessing

    def core_func(repeat_index, G, numpy_arrary_2D):
        for u in G.nodes():
            numpy_arrary_2D[repeat_index][u] = 2
        return

    if __name__ == "__main__":
        G = nx.erdos_renyi_graph(100000, 0.99)
        nRepeat = 5000
        numpy_array = np.zeros([nRepeat, G.number_of_nodes()])
        Parallel(n_jobs=4)(delayed(core_func)(repeat_index, G, numpy_array) for repeat_index in range
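With the default process-based backend, the workers operate on copies (or read-only memmaps) of numpy_array, so the parent never sees the writes. A minimal sketch of one workaround (assuming joblib >= 0.12): request a shared-memory backend with require='sharedmem', so the workers are threads mutating the very same array. The sizes below are deliberately small placeholders.

    import numpy as np
    from joblib import Parallel, delayed

    def fill_row(repeat_index, arr_2d):
        # in-place write; with the shared-memory (threading) backend this hits the shared array
        arr_2d[repeat_index, :] = 2

    n_repeat, n_nodes = 50, 100          # hypothetical small sizes
    shared = np.zeros((n_repeat, n_nodes))

    # require='sharedmem' forces a backend whose workers share the parent's memory
    Parallel(n_jobs=4, require='sharedmem')(
        delayed(fill_row)(i, shared) for i in range(n_repeat)
    )
    assert shared.sum() == 2 * n_repeat * n_nodes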

Bokeh and Joblib don't play together

佐手、 submitted on 2019-12-13 17:30:22
Question: I have a Bokeh script that loads its data through a function wrapped with joblib's @memory.cache decorator. When I run it as a plain Python script, the get_data function is fast (cached). When I run it with bokeh server --show code.py, the cache seems to be lost and the function is re-evaluated, making data retrieval slow. How can I make Bokeh work nicely with joblib?

Answer 1: It's hard to say for certain without being able to run an example that reproduces what you are seeing. But my guess is
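For context, a minimal sketch of how a joblib Memory cache is usually wired up (the function name and cache directory here are illustrative, not taken from the question's script). Whether bokeh server reuses the cache depends on the cache location staying fixed and the decorated function's source staying identical across the server's re-executions of the script.

    import time
    from joblib import Memory

    # a fixed on-disk cache directory; a temporary or per-process location would defeat caching
    memory = Memory("./joblib_cache", verbose=1)

    @memory.cache
    def get_data(n):
        time.sleep(2)              # stand-in for the slow data retrieval
        return list(range(n))

    get_data(10)   # slow on the first call
    get_data(10)   # served from the on-disk cache on subsequent calls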

Memoizing SQL queries

谁说胖子不能爱 submitted on 2019-12-12 07:59:04
Question: Say I have a function that runs a SQL query and returns a dataframe:

    import pandas.io.sql as psql
    import sqlalchemy

    query_string = "select a from table;"

    def run_my_query(my_query):
        # username, host, port and database are hard-coded here
        engine = sqlalchemy.create_engine(
            'postgresql://{username}@{host}:{port}/{database}'.format(
                username=username, host=host, port=port, database=database))
        df = psql.read_sql(my_query, engine)
        return df

    # Run the query (this is what I want to memoize)
    df = run
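A sketch of one way to memoize such a function with joblib.Memory (not necessarily the accepted answer's approach; the connection string is a placeholder). The cache key is the query string, so bear in mind that cached results will not reflect changes in the underlying table until the cache is cleared.

    import pandas as pd
    import sqlalchemy
    from joblib import Memory

    memory = Memory("./sql_cache", verbose=0)

    @memory.cache
    def run_my_query(my_query):
        # placeholder connection string; the engine is created inside the function so
        # that only the (picklable) query string participates in the cache key
        engine = sqlalchemy.create_engine("postgresql://user@localhost:5432/mydb")
        return pd.read_sql(my_query, engine)

    df = run_my_query("select a from table;")   # hits the database once
    df = run_my_query("select a from table;")   # later calls read the cached frame from disk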

How to reuse a selenium driver instance during parallel processing?

你说的曾经没有我的故事 submitted on 2019-12-11 14:43:08
Question: To scrape a pool of URLs, I am parallel processing selenium with joblib. In this context, I am facing two challenges. Challenge 1 is to speed up this process: at the moment, my code opens and closes a driver instance for every URL (ideally there would be one per process). Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong). Pseudocode:

    URL_list = [URL1, URL2, URL3, ..., URL100000]  # List of URLs to be
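A sketch of one common pattern for the first challenge (an assumption on my part, not the question's accepted answer): split the URL list into one chunk per worker, let each worker start a single driver, reuse it for its whole chunk, and quit it at the end. The scrape_one extraction and the headless-Chrome setup are illustrative placeholders.

    from joblib import Parallel, delayed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def scrape_one(driver, url):
        driver.get(url)
        return url, len(driver.page_source)      # placeholder for the real extraction logic

    def scrape_chunk(urls):
        # one driver per worker, reused for every URL in the chunk
        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        try:
            return [scrape_one(driver, url) for url in urls]
        finally:
            driver.quit()

    def chunks(seq, n):
        size = max(1, len(seq) // n)
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    url_list = ["https://example.com/page/%d" % i for i in range(100)]   # hypothetical URLs
    n_jobs = 4
    results = Parallel(n_jobs=n_jobs)(delayed(scrape_chunk)(c) for c in chunks(url_list, n_jobs))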

Updating batch image array in-place when using joblib

ⅰ亾dé卋堺 submitted on 2019-12-11 07:56:27
Question: This is a follow-up question to my solution to the question below: How to apply a function in parallel to multiple images in a numpy array? My suggested solution works fine if the function process_image() returns its result, which we can then collect in a list for later processing. Since I want to do this type of preprocessing for more than 100K images (with array shape (100000, 32, 32, 3)), I want my solution to be very efficient. But my list-based approach will hog up lot of
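A sketch of one in-place alternative (my assumption, not the original answer): back the output batch with a numpy.memmap and let every worker write its slice directly into that shared file, so nothing is collected in a list. This mirrors the shared-memory memmap pattern from joblib's documentation; the toy preprocessing and the sizes are placeholders.

    import numpy as np
    from joblib import Parallel, delayed

    def process_image(images, out, i):
        # toy "preprocessing": write the result straight into the shared memmap slice
        out[i] = images[i] / 255.0

    n = 1000                                    # stand-in for the 100K images
    images = np.random.randint(0, 256, size=(n, 32, 32, 3), dtype=np.uint8)

    # writable disk-backed output; joblib passes memmaps to workers by reference
    out = np.memmap("processed.dat", dtype=np.float32, mode="w+", shape=(n, 32, 32, 3))
    Parallel(n_jobs=4)(delayed(process_image)(images, out, i) for i in range(n))
    out.flush()                                 # results live in the memmap, no list is built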

How to load a model saved in joblib file from Google Cloud Storage bucket

坚强是说给别人听的谎言 submitted on 2019-12-11 06:36:51
Question: I want to load a model which is saved as a joblib file from a Google Cloud Storage bucket. When it is on a local path, we can load it as follows (where model_file is the full path on the system):

    loaded_model = joblib.load(model_file)

How can we do the same with Google Cloud Storage?

Answer 1: I don't think that's possible, at least not in a direct way. I thought about a workaround, but it might not be as efficient as you want. By using the Google Cloud Storage client libraries [1] you can download
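A sketch of that download-then-load workaround (the bucket and blob names are placeholders, and it assumes the google-cloud-storage client library and credentials are set up):

    import joblib
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")              # hypothetical bucket name
    blob = bucket.blob("models/model.joblib")        # hypothetical object path

    # download the object to a local file, then load it the usual way
    blob.download_to_filename("/tmp/model.joblib")
    loaded_model = joblib.load("/tmp/model.joblib")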

How to save sklearn model on s3 using joblib.dump?

本秂侑毒 submitted on 2019-12-11 06:14:40
Question: I have a sklearn model and I want to save the pickle file to my s3 bucket using joblib.dump. I used joblib.dump(model, 'model.pkl') to save the model locally, but I do not know how to save it to an s3 bucket.

    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('my-bucket').Object("model.pkl").put(Body=joblib.dump(model, 'model.pkl'))

I expect the pickled file to be on my s3 bucket.

Answer 1: Here's a way that worked for me. Pretty straightforward and easy. I'm using joblib (it's better for storing
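A sketch of one way to do this (bucket and key names are placeholders): dump the model into an in-memory buffer and hand that buffer to boto3. The snippet in the question fails because joblib.dump(model, 'model.pkl') returns the list of filenames it wrote, not the serialized bytes.

    import io

    import boto3
    import joblib
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()                     # stand-in for the trained model

    buffer = io.BytesIO()
    joblib.dump(model, buffer)                       # joblib.dump accepts a file-like object
    buffer.seek(0)

    s3_resource = boto3.resource("s3")
    s3_resource.Bucket("my-bucket").upload_fileobj(buffer, "model.pkl")

    # to load it back later:
    # buf = io.BytesIO()
    # s3_resource.Bucket("my-bucket").download_fileobj("model.pkl", buf)
    # buf.seek(0)
    # model = joblib.load(buf)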

Python parallel: no space left, can't pickle

孤者浪人 submitted on 2019-12-11 01:14:03
Question: I am using Parallel from joblib in my Python code to train a CNN. The code structure is like:

    crf = CRF()
    with Parallel(n_jobs=num_cores) as pal_worker:
        for epoch in range(n):
            temp = pal_worker(delayed(crf.runCRF)(x[i], y[i]) for i in range(m))

The code can run successfully for 1 or 2 epochs, but then an error occurs that says (I list the main point I think matters):

    ......
    File "/data_shared/Docker/tsun/software/anaconda3/envs/pytorch04/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 104, in
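A "no space left" failure inside joblib's numpy pickler usually means the temporary folder joblib uses to memmap large arrays (often /dev/shm or /tmp) has filled up. A sketch of the usual workaround, pointing joblib at a directory with more free space; the path below is a placeholder, and the square function merely stands in for the real work.

    import os
    from joblib import Parallel, delayed

    # Option 1: environment variable that joblib consults for its memmapping temp folder
    os.environ["JOBLIB_TEMP_FOLDER"] = "/path/with/space"        # placeholder path

    def square(v):
        return v * v

    # Option 2: per-call arguments; temp_folder relocates the memmaps and a larger
    # max_nbytes threshold avoids memmapping smaller arrays altogether
    results = Parallel(n_jobs=4, temp_folder="/path/with/space", max_nbytes="100M")(
        delayed(square)(i) for i in range(10)
    )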

Load np.memmap without knowing shape

情到浓时终转凉″ submitted on 2019-12-10 17:14:14
Question: Is it possible to load a numpy.memmap without knowing the shape and still recover the shape of the data?

    data = np.arange(12, dtype='float32')
    data.resize((3,4))
    fp = np.memmap(filename, dtype='float32', mode='w+', shape=(3,4))
    fp[:] = data[:]
    del fp
    newfp = np.memmap(filename, dtype='float32', mode='r', shape=(3,4))

In the last line, I want to be able not to specify the shape and still get the variable newfp to have the shape (3,4), just like it would happen with joblib.load. Is this
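A raw memmap file is just flat bytes with no header, so the shape has to be recorded somewhere else. A sketch of two formats that do store it (filenames are illustrative): the .npy format written by np.save, and joblib's own dump format, both of which can hand back a memory-mapped array with the original shape.

    import numpy as np
    import joblib

    data = np.arange(12, dtype="float32").reshape(3, 4)

    # Option 1: the .npy header records dtype and shape
    np.save("data.npy", data)
    newfp = np.load("data.npy", mmap_mode="r")        # memmap with shape (3, 4)

    # Option 2: joblib stores the array metadata alongside the buffer (uncompressed dump)
    joblib.dump(data, "data.joblib")
    newfp2 = joblib.load("data.joblib", mmap_mode="r")
    assert newfp.shape == newfp2.shape == (3, 4)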