joblib

Sklearn joblib load function IO error from AWS S3

做~自己de王妃 submitted on 2019-12-10 15:27:44
Question: I am trying to load a pkl dump of my classifier from scikit-learn. The joblib dump gives much better compression than the cPickle dump for my object, so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3. Cases: pkl object hosted locally: pickle.load works, joblib.load works. Pkl object pushed to Heroku with the app (loaded from the static folder): pickle.load works, joblib.load works. Pkl object pushed to S3: pickle.load works, joblib.load returns
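
A minimal sketch of one common workaround (not the asker's final code): read the S3 object into an in-memory buffer with boto3 and hand the file-like object to joblib.load. The bucket and key names are placeholders, and this assumes a joblib version that accepts file objects; with older versions the bytes may need to be written to a temporary file first.

```python
import io

import boto3
import joblib

# Placeholder bucket and key names, for illustration only.
s3 = boto3.client("s3")
buffer = io.BytesIO()
s3.download_fileobj("my-bucket", "models/classifier.pkl", buffer)
buffer.seek(0)

# Recent joblib versions accept a file-like object; older versions may
# require dumping the buffer to a temporary file and loading from its path.
clf = joblib.load(buffer)
```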

Writing a parallel loop

穿精又带淫゛_ submitted on 2019-12-09 08:55:46
Question: I am trying to run a parallel loop on a simple example. What am I doing wrong? from joblib import Parallel, delayed import multiprocessing def processInput(i): return i * i if __name__ == '__main__': # what are your inputs, and what operation do you want to # perform on each input. For example... inputs = range(1000000) num_cores = multiprocessing.cpu_count() results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs) print(results) The problem with the code is that when executed
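
For readability, the flattened snippet in the excerpt corresponds to the following layout (a reconstruction; the excerpt is cut off before the symptom is described):

```python
import multiprocessing

from joblib import Parallel, delayed

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    num_cores = multiprocessing.cpu_count()

    results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs)
    print(results)
```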

cannot cast array data when a saved classifier is called

ε祈祈猫儿з submitted on 2019-12-09 05:47:29
Question: I have created a classifier using the https://gist.github.com/zacstewart/5978000 example. To train the classifier I am using the following code: import os import numpy NEWLINE = '\n' SKIP_FILES = set(['cmds']) def read_files(path): for root, dir_names, file_names in os.walk(path): for path in dir_names: read_files(os.path.join(root, path)) for file_name in file_names: if file_name not in SKIP_FILES: file_path = os.path.join(root, file_name) if os.path.isfile(file_path): past_header, lines = False, [] f = open(file_path) for line in f: if past_header: lines.append(line) elif line == NEWLINE: past_header
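
For readability, the flattened snippet in the excerpt corresponds to the following layout (a reconstruction; the excerpt is truncated mid-statement, and the final assignment is inferred from the False/True flag logic rather than quoted from the source):

```python
import os
import numpy

NEWLINE = '\n'
SKIP_FILES = set(['cmds'])

def read_files(path):
    for root, dir_names, file_names in os.walk(path):
        for path in dir_names:
            read_files(os.path.join(root, path))
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    f = open(file_path)
                    for line in f:
                        if past_header:
                            lines.append(line)
                        elif line == NEWLINE:
                            past_header = True  # inferred; the excerpt cuts off here
```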

What batch_size and pre_dispatch in joblib exactly mean

半城伤御伤魂 submitted on 2019-12-08 19:26:39
Question: From the documentation here, https://pythonhosted.org/joblib/parallel.html#parallel-reference-documentation, it's not clear to me what exactly batch_size and pre_dispatch mean. Let's consider the case where we are using the 'multiprocessing' backend, 2 jobs (2 processes), and we have 10 tasks to compute. As I understand it: batch_size controls the number of tasks pickled at one time, so if you set batch_size = 5, joblib will pickle and send 5 tasks immediately to each process, and after arriving there they
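
To make the two parameters concrete, here is a minimal sketch of where they are passed; the values are illustrative only, and the summary in the comments is my paraphrase of the scenario in the excerpt, not the documentation's wording.

```python
from joblib import Parallel, delayed

def task(i):
    return i * i

# Scenario from the excerpt: 2 worker processes, 10 tasks.
# batch_size groups tasks into chunks that are pickled and dispatched
# together; pre_dispatch bounds how many tasks are queued up ahead of
# the workers.
results = Parallel(n_jobs=2,
                   backend='multiprocessing',
                   batch_size=5,
                   pre_dispatch='2*n_jobs')(
    delayed(task)(i) for i in range(10))
print(results)
```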

Saving an sklearn `FunctionTransformer` with the function it wraps

柔情痞子 submitted on 2019-12-08 19:13:46
Question: I am using sklearn's Pipeline and FunctionTransformer with a custom function. from sklearn.externals import joblib from sklearn.preprocessing import FunctionTransformer from sklearn.pipeline import Pipeline This is my code: def f(x): return x*2 pipe = Pipeline([("times_2", FunctionTransformer(f))]) joblib.dump(pipe, "pipe.joblib") del pipe del f pipe = joblib.load("pipe.joblib") # Causes an exception And I get this error: AttributeError: module '__main__' has no attribute 'f' How can this be
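
A common workaround, sketched under the assumption that moving the function out of the main script is acceptable: define f in an importable module (a hypothetical my_transforms.py here) so that pickle can resolve it by reference when the pipeline is loaded.

```python
# my_transforms.py (hypothetical module name)
def f(x):
    return x * 2
```

```python
# main script
from sklearn.externals import joblib  # on recent scikit-learn, "import joblib" instead
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

from my_transforms import f  # importable, so pickle can locate it on load

pipe = Pipeline([("times_2", FunctionTransformer(f))])
joblib.dump(pipe, "pipe.joblib")

pipe = joblib.load("pipe.joblib")  # no AttributeError: f resolves to my_transforms.f
```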

Combine tornado gen.coroutine and joblib mem.cache decorators

流过昼夜 submitted on 2019-12-08 12:37:13
Question: Imagine having a function that handles a heavy computational job and that we wish to execute asynchronously in a Tornado application context. Moreover, we would like to evaluate the function lazily, storing its results to disk and not rerunning the function twice for the same arguments. Without caching the result (memoization), one would do the following: def complex_computation(arguments): ... return result @gen.coroutine def complex_computation_caller(arguments): ... result = complex
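
One way to combine the two decorators, sketched under the assumption of Tornado 4+ (whose coroutines can yield concurrent.futures futures) and a hypothetical ./cachedir cache location: memoize with joblib.Memory and push the blocking call onto an executor.

```python
from concurrent.futures import ThreadPoolExecutor

from joblib import Memory
from tornado import gen

memory = Memory('./cachedir', verbose=0)   # hypothetical cache directory
executor = ThreadPoolExecutor(max_workers=4)

@memory.cache
def complex_computation(arguments):
    # heavy, deterministic work; repeated calls with the same arguments
    # are served from the disk cache
    result = arguments
    return result

@gen.coroutine
def complex_computation_caller(arguments):
    # Run the memoized function off the IOLoop thread so the app stays responsive.
    result = yield executor.submit(complex_computation, arguments)
    raise gen.Return(result)
```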

Multiprocessing with JDBC connection and pooling

点点圈 submitted on 2019-12-08 11:12:50
Question: I would like to create a parallel process which gets data from a database. I am using a JDBC connector which works fine if I do not run my program in parallel: conn = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP", ["LIB_NAME", "PWD"], "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", ) curs = conn.cursor() However, I want to speed that process up, so I am using: from joblib import Parallel, delayed, parallel
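
A sketch of the usual pattern (not the asker's final code): since a live JDBC connection cannot be pickled and shipped to a worker process, each parallel task opens its own connection. The connection arguments mirror the excerpt; the query and the number of partitions are placeholders.

```python
import jaydebeapi
from joblib import Parallel, delayed

def fetch_partition(i):
    # Each worker opens (and closes) its own connection.
    conn = jaydebeapi.connect(
        "com.teradata.jdbc.TeraDriver",
        "jdbc:teradata://DBNAME" + str(i) + "/LOGMECH=LDAP",
        ["LIB_NAME", "PWD"],
        "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar",
    )
    curs = conn.cursor()
    curs.execute("SELECT 1")  # placeholder query
    rows = curs.fetchall()
    conn.close()
    return rows

results = Parallel(n_jobs=4)(delayed(fetch_partition)(i) for i in range(4))
```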

In a PyQt5 application, is it possible to run sklearn with parallel jobs without freezing

心已入冬 submitted on 2019-12-08 10:25:00
Question: Is it possible to run, in a Qt application, without freezing the GUI, say an sklearn grid search that uses several parallel jobs (n_jobs > 1)? The problem is that joblib, which is used to parallelize sklearn code, cannot run multiprocessing inside a thread. For example, I'm using GridSearch to find the best parameters for an SVR, which is quite computationally intensive. This question has been asked several times, but no solution has been found: pyqt5-run-sklearn-calculations-on-a-separate-qthread,
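
One commonly suggested workaround, sketched here without the question's actual data or GUI code: run the grid search in a separate process rather than a QThread, so joblib's multiprocessing is not nested inside a thread, and have the GUI poll for the result.

```python
import multiprocessing as mp

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def run_search(X, y, queue):
    # Runs in its own process, so n_jobs > 1 does not block the GUI thread.
    search = GridSearchCV(SVR(), {"C": [0.1, 1.0, 10.0]}, n_jobs=2)
    search.fit(X, y)
    queue.put(search.best_params_)

# In the Qt code one would start the process and poll the queue with a QTimer
# instead of blocking the event loop, e.g.:
#   queue = mp.Queue()
#   proc = mp.Process(target=run_search, args=(X, y, queue))
#   proc.start()
```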

How to use nested loops in joblib library in python

こ雲淡風輕ζ submitted on 2019-12-08 01:32:26
Question: The actual code looks like: def compute_score(row_list,column_list): for i in range(len(row_list)): for j in range(len(column_list)): tf_score = self.compute_tf(column_list[j],row_list[i]) I am trying to achieve multi-processing, i.e. at every iteration of j I want to pool column_list. Since the compute_tf function is slow, I want to multi-process it. I've found how to do it using joblib in Python, but I am unable to work it out with nested loops. Parallel(n_jobs=2)(delayed(self.compute_tf)<some_way_to
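
A sketch of the usual way to express such nested loops with joblib (my reconstruction, not a quoted answer): flatten the two loops into a single generator expression so every (row, column) pair becomes one task. compute_tf and the two lists are assumed to behave as in the excerpt, and passing a bound method to delayed relies on it being picklable.

```python
from joblib import Parallel, delayed

def compute_score(self, row_list, column_list):
    # One task per (i, j) pair; results come back in the same flattened order.
    tf_scores = Parallel(n_jobs=2)(
        delayed(self.compute_tf)(column_list[j], row_list[i])
        for i in range(len(row_list))
        for j in range(len(column_list))
    )
    return tf_scores
```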