joblib

Sklearn joblib load function IO error from AWS S3

做~自己de王妃 submitted on 2019-12-10 15:27:44
Question: I am trying to load a pkl dump of my classifier from scikit-learn. The joblib dump gives much better compression than the cPickle dump for my object, so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3. Cases: pkl object hosted locally: pickle.load works, joblib.load works. Pkl object pushed to Heroku with the app (loaded from the static folder): pickle.load works, joblib.load works. Pkl object pushed to S3: pickle.load works, joblib.load returns
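
A minimal sketch of one common workaround (not the asker's final code): read the S3 object into an in-memory buffer with boto3 and hand the file-like object to joblib.load. The bucket and key names are placeholders, and this assumes a joblib version that accepts file objects; with older versions the bytes may need to be written to a temporary file first.

```python
import io

import boto3
import joblib

# Placeholder bucket and key names, for illustration only.
s3 = boto3.client("s3")
buffer = io.BytesIO()
s3.download_fileobj("my-bucket", "models/classifier.pkl", buffer)
buffer.seek(0)

# Recent joblib versions accept a file-like object; older versions may
# require dumping the buffer to a temporary file and loading from its path.
clf = joblib.load(buffer)
```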

Writing a parallel loop

穿精又带淫゛_ submitted on 2019-12-09 08:55:46
Question: I am trying to run a parallel loop on a simple example. What am I doing wrong? from joblib import Parallel, delayed import multiprocessing def processInput(i): return i * i if __name__ == '__main__': # what are your inputs, and what operation do you want to # perform on each input. For example... inputs = range(1000000) num_cores = multiprocessing.cpu_count() results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs) print(results) The problem with the code is that when executed
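
For readability, the flattened snippet in the excerpt corresponds to the following layout (a reconstruction; the excerpt is cut off before the symptom is described):

```python
import multiprocessing

from joblib import Parallel, delayed

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    num_cores = multiprocessing.cpu_count()

    results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs)
    print(results)
```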

cannot cast array data when a saved classifier is called

ε祈祈猫儿з submitted on 2019-12-09 05:47:29
Question: I have created a classifier using the https://gist.github.com/zacstewart/5978000 example. To train the classifier I am using the following code: import os import numpy NEWLINE = '\n' SKIP_FILES = set(['cmds']) def read_files(path): for root, dir_names, file_names in os.walk(path): for path in dir_names: read_files(os.path.join(root, path)) for file_name in file_names: if file_name not in SKIP_FILES: file_path = os.path.join(root, file_name) if os.path.isfile(file_path): past_header, lines = False, [] f = open(file_path) for line in f: if past_header: lines.append(line) elif line == NEWLINE: past_header
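
For readability, the flattened snippet in the excerpt corresponds to the following layout (a reconstruction; the excerpt is truncated mid-statement, and the final assignment is inferred from the False/True flag logic rather than quoted from the source):

```python
import os
import numpy

NEWLINE = '\n'
SKIP_FILES = set(['cmds'])

def read_files(path):
    for root, dir_names, file_names in os.walk(path):
        for path in dir_names:
            read_files(os.path.join(root, path))
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    f = open(file_path)
                    for line in f:
                        if past_header:
                            lines.append(line)
                        elif line == NEWLINE:
                            past_header = True  # inferred; the excerpt cuts off here
```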

What batch_size and pre_dispatch in joblib exactly mean

半城伤御伤魂 submitted on 2019-12-08 19:26:39
Question: From the documentation here, https://pythonhosted.org/joblib/parallel.html#parallel-reference-documentation, it's not clear to me what exactly batch_size and pre_dispatch mean. Let's consider the case where we are using the 'multiprocessing' backend, 2 jobs (2 processes), and we have 10 tasks to compute. As I understand it: batch_size controls the number of tasks pickled at one time, so if you set batch_size = 5, joblib will pickle and send 5 tasks immediately to each process, and after arriving there they
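
To make the two parameters concrete, here is a minimal sketch of where they are passed; the values are illustrative only, and the summary in the comments is my paraphrase of the scenario in the excerpt, not the documentation's wording.

```python
from joblib import Parallel, delayed

def task(i):
    return i * i

# Scenario from the excerpt: 2 worker processes, 10 tasks.
# batch_size groups tasks into chunks that are pickled and dispatched
# together; pre_dispatch bounds how many tasks are queued up ahead of
# the workers.
results = Parallel(n_jobs=2,
                   backend='multiprocessing',
                   batch_size=5,
                   pre_dispatch='2*n_jobs')(
    delayed(task)(i) for i in range(10))
print(results)
```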

Saving an sklearn `FunctionTransformer` with the function it wraps

柔情痞子 submitted on 2019-12-08 19:13:46
Question: I am using sklearn's Pipeline and FunctionTransformer with a custom function. from sklearn.externals import joblib from sklearn.preprocessing import FunctionTransformer from sklearn.pipeline import Pipeline This is my code: def f(x): return x*2 pipe = Pipeline([("times_2", FunctionTransformer(f))]) joblib.dump(pipe, "pipe.joblib") del pipe del f pipe = joblib.load("pipe.joblib") # Causes an exception And I get this error: AttributeError: module '__main__' has no attribute 'f' How can this be
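
A common workaround, sketched under the assumption that moving the function out of the main script is acceptable: define f in an importable module (a hypothetical my_transforms.py here) so that pickle can resolve it by reference when the pipeline is loaded.

```python
# my_transforms.py (hypothetical module name)
def f(x):
    return x * 2
```

```python
# main script
from sklearn.externals import joblib  # on recent scikit-learn, "import joblib" instead
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

from my_transforms import f  # importable, so pickle can locate it on load

pipe = Pipeline([("times_2", FunctionTransformer(f))])
joblib.dump(pipe, "pipe.joblib")

pipe = joblib.load("pipe.joblib")  # no AttributeError: f resolves to my_transforms.f
```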

Combine tornado gen.coroutine and joblib mem.cache decorators

流过昼夜 submitted on 2019-12-08 12:37:13
Question: Imagine having a function that handles a heavy computational job and that we wish to execute asynchronously in a Tornado application context. Moreover, we would like to evaluate the function lazily, storing its results to disk and not rerunning the function twice for the same arguments. Without caching the result (memoization), one would do the following: def complex_computation(arguments): ... return result @gen.coroutine def complex_computation_caller(arguments): ... result = complex
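
One way to combine the two decorators, sketched under the assumption of Tornado 4+ (whose coroutines can yield concurrent.futures futures) and a hypothetical ./cachedir cache location: memoize with joblib.Memory and push the blocking call onto an executor.

```python
from concurrent.futures import ThreadPoolExecutor

from joblib import Memory
from tornado import gen

memory = Memory('./cachedir', verbose=0)   # hypothetical cache directory
executor = ThreadPoolExecutor(max_workers=4)

@memory.cache
def complex_computation(arguments):
    # heavy, deterministic work; repeated calls with the same arguments
    # are served from the disk cache
    result = arguments
    return result

@gen.coroutine
def complex_computation_caller(arguments):
    # Run the memoized function off the IOLoop thread so the app stays responsive.
    result = yield executor.submit(complex_computation, arguments)
    raise gen.Return(result)
```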

Multiprocessing with JDBC connection and pooling

点点圈 submitted on 2019-12-08 11:12:50
Question: I would like to create a parallel process which gets data from a database. I am using a JDBC connector which works fine if I do not run my program in parallel: conn = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP", ["LIB_NAME", "PWD"], "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", ) curs = conn.cursor() However, I want to speed that process up, so I am using: from joblib import Parallel, delayed, parallel
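
A sketch of the usual pattern (not the asker's final code): since a live JDBC connection cannot be pickled and shipped to a worker process, each parallel task opens its own connection. The connection arguments mirror the excerpt; the query and the number of partitions are placeholders.

```python
import jaydebeapi
from joblib import Parallel, delayed

def fetch_partition(i):
    # Each worker opens (and closes) its own connection.
    conn = jaydebeapi.connect(
        "com.teradata.jdbc.TeraDriver",
        "jdbc:teradata://DBNAME" + str(i) + "/LOGMECH=LDAP",
        ["LIB_NAME", "PWD"],
        "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar",
    )
    curs = conn.cursor()
    curs.execute("SELECT 1")  # placeholder query
    rows = curs.fetchall()
    conn.close()
    return rows

results = Parallel(n_jobs=4)(delayed(fetch_partition)(i) for i in range(4))
```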

In a PyQt5 application, is it possible to run sklearn with parallel jobs without freezing

心已入冬 submitted on 2019-12-08 10:25:00
Question: Is it possible to run, in a Qt application, without freezing the GUI, say an sklearn grid search that uses several parallel jobs (n_jobs > 1)? The problem is that joblib, which is used to parallelize sklearn code, cannot run multiprocessing inside a thread. For example, I'm using GridSearch to find the best parameters for an SVR, which is quite computationally intensive. This question has been asked several times, but no solution has been found: pyqt5-run-sklearn-calculations-on-a-separate-qthread,
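
One commonly suggested workaround, sketched here without the question's actual data or GUI code: run the grid search in a separate process rather than a QThread, so joblib's multiprocessing is not nested inside a thread, and have the GUI poll for the result.

```python
import multiprocessing as mp

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def run_search(X, y, queue):
    # Runs in its own process, so n_jobs > 1 does not block the GUI thread.
    search = GridSearchCV(SVR(), {"C": [0.1, 1.0, 10.0]}, n_jobs=2)
    search.fit(X, y)
    queue.put(search.best_params_)

# In the Qt code one would start the process and poll the queue with a QTimer
# instead of blocking the event loop, e.g.:
#   queue = mp.Queue()
#   proc = mp.Process(target=run_search, args=(X, y, queue))
#   proc.start()
```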

How to use nested loops in joblib library in python

こ雲淡風輕ζ submitted on 2019-12-08 01:32:26
Question: The actual code looks like: def compute_score(row_list,column_list): for i in range(len(row_list)): for j in range(len(column_list)): tf_score = self.compute_tf(column_list[j],row_list[i]) I am trying to achieve multi-processing, i.e. at every iteration of j I want to pool column_list. Since the compute_tf function is slow, I want to multi-process it. I've found how to do it using joblib in Python, but I am unable to work it out with nested loops. Parallel(n_jobs=2)(delayed(self.compute_tf)<some_way_to
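
A sketch of the usual way to express such nested loops with joblib (my reconstruction, not a quoted answer): flatten the two loops into a single generator expression so every (row, column) pair becomes one task. compute_tf and the two lists are assumed to behave as in the excerpt, and passing a bound method to delayed relies on it being picklable.

```python
from joblib import Parallel, delayed

def compute_score(self, row_list, column_list):
    # One task per (i, j) pair; results come back in the same flattened order.
    tf_scores = Parallel(n_jobs=2)(
        delayed(self.compute_tf)(column_list[j], row_list[i])
        for i in range(len(row_list))
        for j in range(len(column_list))
    )
    return tf_scores
```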