joblib

Bringing a classifier to production

I've saved my classifier pipeline using joblib:

    vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    pac_clf = PassiveAggressiveClassifier(C=1)
    vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
    vec_clf.fit(X_train, y_train)
    joblib.dump(vec_clf, 'class.pkl', compress=9)

Now I'm trying to use it in a production environment:

    def classify(title):
        # load classifier and predict
        classifier = joblib.load('class.pkl')
        # vectorize/transform the new title, then predict
        vectorizer =
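
Since the dumped object is the whole Pipeline, the TfidfVectorizer travels with the classifier, so there is no need to vectorize manually in production. A minimal sketch of the loading side (assuming class.pkl was produced by the training code above):

    import joblib

    # Load once at startup rather than on every call, so the pickle
    # is not re-read for each prediction.
    classifier = joblib.load('class.pkl')

    def classify(title):
        # The pipeline's 'vectorizer' step transforms the raw string,
        # then the 'pac' step predicts a label.
        return classifier.predict([title])[0]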

Where is the memory leak? How to timeout threads during multiprocessing in python?

It is unclear how to properly time out workers of joblib's Parallel in Python. Others have had similar questions here, here, here and here. In my example I am utilizing a pool of 50 joblib workers with the threading backend.

Parallel call (threading):

    output = Parallel(n_jobs=50, backend='threading')(
        delayed(get_output)(INPUT) for INPUT in list)

Here, Parallel hangs without errors as soon as len(list) <= n_jobs, but only when n_jobs = -1. In order to circumvent this issue, people give instructions on how to create a timeout decorator for the parallelized function (get_output(INPUT)) in the
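
Since the threading backend keeps everything in one process, one workaround that is often suggested (a sketch, not the asker's code; get_output here is a hypothetical stand-in) is to give every call its own time budget by running it in a short-lived executor and abandoning it on expiry:

    import concurrent.futures
    from joblib import Parallel, delayed

    def get_output(x):            # hypothetical stand-in for the real worker
        return x * x

    def with_timeout(x, seconds=5):
        # Run the real call in a one-off thread; stop waiting after `seconds`.
        ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = ex.submit(get_output, x)
        try:
            return future.result(timeout=seconds)
        except concurrent.futures.TimeoutError:
            return None           # sentinel for "timed out"
        finally:
            ex.shutdown(wait=False)

    inputs = list(range(100))
    output = Parallel(n_jobs=50, backend='threading')(
        delayed(with_timeout)(i) for i in inputs)

Caveat: Python threads cannot be forcibly killed, so an abandoned call keeps running to completion in the background, which is exactly how the memory leak in the title can arise.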

What does the delayed() function do (when used with joblib in Python)

I've read through the documentation, but I don't understand what is meant by:

    The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.

I'm using it to iterate over the list I want to operate on (allImages) as follows:

    def joblib_loop():
        Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages)

This returns my HOG features, like I want (and with the speed gain from using all 8 of my cores), but I'm just not sure what it is actually doing. My Python knowledge is alright at best, and it's very possible that I'm missing something basic. Any
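
The key point is that delayed(getHog)(i) never calls getHog. It returns the tuple (getHog, (i,), {}), so the generator expression produces a stream of recorded calls that Parallel can ship to its workers and execute there. Conceptually (a sketch of the idea, not joblib's actual source), delayed is just:

    def delayed(function):
        # Return a wrapper that records a call instead of performing it.
        def delayed_function(*args, **kwargs):
            return function, args, kwargs
        return delayed_function

    # delayed(sum)([1, 2, 3]) evaluates to (sum, ([1, 2, 3],), {})

Without delayed, writing getHog(i) inside the generator would run every call in the current process before Parallel ever saw it.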

Joblib Parallel uses only one core if started from QThread

I'm developing a GUI which carries out some heavy number crunching. To speed things up I use joblib's Parallel execution together with PyQt's QThreads, to keep the GUI from becoming unresponsive. The Parallel execution works fine on its own, but when it is embedded in the GUI and run in its own thread it utilizes only one of my 4 cores. Is there anything fundamental I missed in the threading/multiprocessing world? Here is a rough sketch of my setup:

    class ThreadRunner(QtCore.QObject):
        start = QtCore.pyqtSignal()
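
A frequently cited explanation (an assumption worth verifying against your joblib version, not something stated in the question) is that joblib's multiprocessing backend silently falls back to sequential execution when Parallel is invoked from any thread other than the main one, because forking worker processes from a background thread is unsafe. One way around this is to launch the crunching in a dedicated process from the QThread, so the Parallel call once again runs on a main thread (a sketch; crunch and the squaring work are hypothetical stand-ins):

    import multiprocessing as mp
    from joblib import Parallel, delayed

    def crunch(data, queue):
        # Runs in a fresh process whose main thread owns the work,
        # so joblib is free to use all cores.
        result = Parallel(n_jobs=4)(delayed(pow)(x, 2) for x in data)
        queue.put(result)

    def run_from_qthread(data):
        queue = mp.Queue()
        proc = mp.Process(target=crunch, args=(data, queue))
        proc.start()
        result = queue.get()   # blocks this QThread, not the GUI thread
        proc.join()
        return result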

How to use nested loops in joblib library in python

The actual code looks like:

    def compute_score(self, row_list, column_list):
        for i in range(len(row_list)):
            for j in range(len(column_list)):
                tf_score = self.compute_tf(column_list[j], row_list[i])

I am trying to achieve multiprocessing, i.e. at every iteration of j I want to pool over column_list. Since the compute_tf function is slow, I want to multiprocess it. I've found that this can be done using joblib in Python, but I am unable to work out how to apply it to nested loops:

    Parallel(n_jobs=2)(delayed(self.compute_tf)<some_way_to_use_nested_loops>)

This is what is to be achieved. It would be a great help if any solution on this is
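
delayed records one call at a time, so the usual pattern is to flatten both loops into a single nested generator expression; results come back as one flat list in row-major order (a sketch: compute_tf here is a module-level stand-in for the asker's self.compute_tf):

    from joblib import Parallel, delayed

    def compute_tf(column, row):          # stand-in for self.compute_tf
        return len(column) * len(row)

    def compute_score(row_list, column_list):
        flat = Parallel(n_jobs=2)(
            delayed(compute_tf)(col, row)
            for row in row_list
            for col in column_list)
        # Regroup the flat result into one list of scores per row.
        n = len(column_list)
        return [flat[i * n:(i + 1) * n] for i in range(len(row_list))]

Hoisting the worker to module level (or using backend='threading') also sidesteps pickling problems with bound methods on older Python versions.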

Graceful Python joblib kill

Is it possible to gracefully kill a joblib run (threading backend) and still return the results computed so far?

    parallel = Parallel(n_jobs=4, backend="threading")
    result = parallel(delayed(dummy_f)(x) for x in range(100))

For the moment I have come up with two solutions:

- parallel._aborted = True, which waits for the started jobs to finish (in my case this can take very long)
- parallel._terminate_backend(), which hangs if jobs are still in the pipe (parallel._jobs not empty)

Is there a way to
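
Because the threading backend shares memory with the caller, a workaround that avoids joblib's private attributes is cooperative cancellation: workers poll a shared event and return early, so Parallel completes normally and hands back whatever was finished (a sketch, not joblib API; dummy_f's body is a hypothetical stand-in):

    import threading
    from joblib import Parallel, delayed

    stop = threading.Event()

    def dummy_f(x):
        if stop.is_set():
            return None            # sentinel for "skipped"
        return x * x               # the real (slow) work goes here

    # From another thread (e.g. a GUI button or signal handler): stop.set()
    parallel = Parallel(n_jobs=4, backend="threading")
    result = parallel(delayed(dummy_f)(x) for x in range(100))
    partial = [r for r in result if r is not None]

The unstarted jobs still pass through the pool, but each returns immediately, so the run winds down in roughly the time of the tasks already in flight.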

Workaround for 32-/64-bit serialization exception on sklearn RandomForest model

If we serialize a random forest model using joblib on a 64-bit machine and then unpickle it on a 32-bit machine, we get an exception:

    ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'

This question has been asked before: Scikits-Learn RandomForrest trained on 64bit python wont open on 32bit python. But it has gone unanswered since 2014. Sample code to train the model (on a 64-bit machine):

    modelPath = "../"
    featureVec = ...
    labelVec = ...
    forest = RandomForestClassifier()
    randomSearch = RandomizedSearchCV(forest, param_distributions=param_dict, cv=10, scoring=
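
The mismatch arises because scikit-learn's tree buffers are typed as SIZE_t, which maps to the platform's pointer-sized integer (numpy.intp): int64 on the 64-bit training machine but int32 on the 32-bit target, so the pickled arrays no longer match. A quick check of what a given interpreter uses (a diagnosis sketch, not a fix):

    import numpy as np
    print(np.dtype(np.intp))   # int64 on a 64-bit build, int32 on 32-bit

Practical workarounds therefore avoid moving the pickle across architectures at all: retrain on the 32-bit machine, serve predictions from a 64-bit process, or export the model to a platform-neutral format such as PMML or ONNX.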

Load and predict new data sklearn

I trained a logistic regression model, cross-validated it, and saved it to file using the joblib module. Now I want to load this model and predict new data with it. Is this the correct way to do it? Especially the standardization: should I use scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here. Here is my code:

    # Loading the saved model with joblib
    model = joblib.load('model.pkl')

    # New data to predict
    pr = pd.read_csv('set_to
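
The general rule: fit the scaler once, on the training data only, and at prediction time call only transform(); refitting on new data would scale it differently from what the model saw during training. A sketch assuming the training-time scaler was also persisted, here under the hypothetical name scaler.pkl:

    import joblib
    import pandas as pd

    model = joblib.load('model.pkl')
    scaler = joblib.load('scaler.pkl')       # hypothetical: dumped at training time

    new_data = pd.read_csv('new_data.csv')   # hypothetical input file
    X_new = scaler.transform(new_data)       # transform only; never fit here
    predictions = model.predict(X_new)

An alternative that removes the question entirely is to save a Pipeline([('scaler', ...), ('clf', ...)]) so scaling and prediction always travel together.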
