joblib

Joblib UserWarning while trying to cache results

百般思念 submitted on 2020-01-13 08:40:48
Question: I get the following UserWarning when trying to cache results using joblib:

    from tempfile import mkdtemp
    cachedir = mkdtemp()

    from joblib import Memory
    memory = Memory(cachedir=cachedir, verbose=0)

    @memory.cache
    def get_nc_var3d(path_nc, var, year):
        """
        Get value from netcdf for variable var for year
        :param path_nc:
        :param var:
        :param year:
        :return:
        """
        try:
            hndl_nc = open_or_die(path_nc)
            val = hndl_nc.variables[var][int(year), :, :]
        except:
            val = numpy.nan
            logger.info('Error in getting var ' +
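A common trigger for warnings around this kind of setup is the ageing Memory API combined with broad exception handling in the cached function. Below is a minimal sketch of the same caching pattern, assuming a newer joblib (where the cachedir keyword was renamed to location) and assuming netCDF4 as the reader behind the question's open_or_die helper; neither assumption comes from the question itself.

    from tempfile import mkdtemp

    import numpy
    import netCDF4  # assumption: the question's open_or_die wraps a netCDF reader
    from joblib import Memory

    # joblib >= 0.12 uses `location`; `cachedir` is the older, deprecated name.
    memory = Memory(location=mkdtemp(), verbose=0)

    @memory.cache
    def get_nc_var3d(path_nc, var, year):
        """Read the slice for `year` of variable `var` from a netCDF file."""
        try:
            with netCDF4.Dataset(path_nc) as hndl_nc:
                return hndl_nc.variables[var][int(year), :, :]
        except (IOError, KeyError, IndexError):
            # Catching specific exceptions instead of a bare `except:` keeps
            # real failures (typos, wrong paths) from being cached as NaN.
            return numpy.nan

Only plain, hashable arguments (strings, ints) reach the cache key here, which keeps Memory from having to persist anything exotic.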

How to serialize a CountVectorizer with a custom tokenize function with joblib

拈花ヽ惹草 submitted on 2020-01-04 11:04:23
Question: I use a CountVectorizer with a custom tokenize method. When I serialize it and then unserialize it, I get the following error message:

    AttributeError: module '__main__' has no attribute 'tokenize'

How can I "serialize" the tokenize method? Here is a small example:

    import nltk
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()

    def stem_tokens(tokens, stemmer):
        stemmed = []
        for item in tokens:
            stemmed.append(stemmer.stem(item))
        return stemmed

    def tokenize(text):
        tokens =
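The usual fix is to move the tokenizer out of __main__ and into an importable module, because pickle stores functions by qualified name and must re-import that module on load. A minimal sketch under that assumption, with my_tokenizer.py as a hypothetical module name:

    # my_tokenizer.py -- lives on the import path, not in __main__
    from nltk import word_tokenize
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()

    def tokenize(text):
        # Tokenize, then stem each token, mirroring the question's intent.
        return [stemmer.stem(tok) for tok in word_tokenize(text)]

    # main.py
    import joblib
    from sklearn.feature_extraction.text import CountVectorizer
    from my_tokenizer import tokenize  # imported by name, so pickle can find it

    vect = CountVectorizer(tokenizer=tokenize)
    vect.fit(["le petit chat", "les petits chats"])
    joblib.dump(vect, "vectorizer.joblib")
    vect2 = joblib.load("vectorizer.joblib")  # also works in a fresh process

The load succeeds in any process where my_tokenizer is importable; the error in the question appears precisely when the name only exists in __main__.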

Similar errors in multiprocessing: mismatched number of arguments to function

ε祈祈猫儿з submitted on 2020-01-04 05:49:44
Question: I couldn't find a better way to describe the error I'm facing, but this error seems to come up every time I try to add multiprocessing to a loop call. I've used both sklearn.externals.joblib and multiprocessing.Process, and the errors are similar, though different. This is the original loop I want to parallelize, where one iteration is executed in a single thread/process:

    for dd in final_col_dates:
        idx1 = final_col_dates.tolist().index(dd)
        dataObj = GetPrevDataByDate(d1, a, dd, self
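A pattern that tends to avoid argument-count errors with joblib is to lift the loop body into a top-level function and make the delayed(...) call mirror its signature exactly. The sketch below uses stand-ins for the question's GetPrevDataByDate and surrounding variables, since their definitions are not shown:

    import numpy as np
    from joblib import Parallel, delayed

    def get_prev_data_by_date(d1, a, dd):
        """Stand-in for the question's GetPrevDataByDate (definition not shown)."""
        return (d1, a, dd)

    def process_date(dd, final_col_dates, d1, a):
        # One loop iteration as a top-level function, so it pickles cleanly.
        idx1 = final_col_dates.tolist().index(dd)
        return idx1, get_prev_data_by_date(d1, a, dd)

    final_col_dates = np.array(['2019-01-01', '2019-01-02', '2019-01-03'])
    d1, a = 'data', 42

    # delayed(f)(...) must supply exactly the arguments f expects; a
    # mismatch here is what raises "takes N positional arguments" errors.
    results = Parallel(n_jobs=2)(
        delayed(process_date)(dd, final_col_dates, d1, a)
        for dd in final_col_dates
    )

Note that the question's loop references self: bound methods are another classic source of pickling trouble with process-based backends, which is one more reason to prefer a free function here.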

Printed output not displayed when using joblib in Jupyter notebook

落花浮王杯 submitted on 2020-01-03 10:56:26
Question: I am using joblib to parallelize some code, and I noticed that I couldn't print things when using it inside a Jupyter notebook. I tried the same example in IPython and it worked perfectly. Here is a minimal (not) working example to write in a Jupyter notebook cell:

    from joblib import Parallel, delayed
    Parallel(n_jobs=8)(delayed(print)(i) for i in range(10))

I am getting the output as [None, None, None, None, None, None, None, None, None, None], but nothing is printed. Actually,
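The likely cause is that joblib's worker processes have their own stdout, which is not connected to the notebook frontend; their prints go to the terminal that launched the kernel, if anywhere. A workaround that is independent of backend details is to return values from the workers and print in the parent, as in this sketch:

    from joblib import Parallel, delayed

    def work(i):
        # Compute in the worker; leave the printing to the parent process.
        return i * i

    results = Parallel(n_jobs=8)(delayed(work)(i) for i in range(10))
    for r in results:
        print(r)  # executes in the notebook kernel, so the output is shown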

Multiprocessing backed parallel loops cannot be nested below threads

僤鯓⒐⒋嵵緔 submitted on 2019-12-28 16:51:54
Question: What is the reason for this issue in joblib?

    'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1'

What should I do to avoid this issue? I need to implement an XMLRPC server which runs heavy computation in a background thread and reports the current progress through polling from a UI client. It uses scikit-learn, which is based on joblib.

P.S.: I've simply changed the name of the thread to "MainThread" to avoid the warning, and everything looks to be working well (run in
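Rather than renaming the thread, joblib lets the caller pin a backend for a region of code; forcing the threading (or sequential) backend inside the background thread avoids nesting process-based workers below it. A minimal sketch, assuming the heavy computation routes its parallelism through joblib as scikit-learn does:

    import threading
    from math import sqrt
    from joblib import Parallel, delayed, parallel_backend

    def heavy_computation():
        # Inside a non-main thread, force a thread-based backend so joblib
        # does not try to start multiprocessing workers below this thread.
        with parallel_backend('threading', n_jobs=2):
            return Parallel()(delayed(sqrt)(i) for i in range(100))

    worker = threading.Thread(target=heavy_computation)
    worker.start()
    worker.join()

Scikit-learn estimators called inside the with block inherit the backend choice, since they route their parallelism through joblib.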

Joblib: simple parallel example slower than the non-parallel version

淺唱寂寞╮ submitted on 2019-12-25 16:54:06
Question:

    from math import sqrt
    from joblib import Parallel, delayed
    import time

    if __name__ == '__main__':
        st = time.time()
        # [sqrt(i ** 2) for i in range(100000)]  # this part is the non-parallel version
        Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(100000))
        print time.time() - st

The non-parallel part runs in 0.4 s, while the parallel part runs for 18 s. I am confused why this would happen.

Answer 1: Parallel processes (which joblib creates) require copying data. Imagine it this way: you have two people who
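The gap comes from per-call overhead: each tiny sqrt is pickled, shipped to a worker, and its result shipped back, which dwarfs the arithmetic itself. Making each dispatched task cover a whole chunk of the range amortizes that cost; the sketch below shows manual chunking (joblib's Parallel also accepts a batch_size parameter, 'auto' by default, which batches small tasks for the same reason):

    import time
    from math import sqrt
    from joblib import Parallel, delayed

    def sqrt_chunk(lo, hi):
        # One task now covers a whole range, so worker startup and IPC
        # are paid a handful of times instead of 100000 times.
        return [sqrt(i ** 2) for i in range(lo, hi)]

    if __name__ == '__main__':
        n, n_jobs = 100000, 2
        step = n // n_jobs
        st = time.time()
        chunks = Parallel(n_jobs=n_jobs)(
            delayed(sqrt_chunk)(lo, min(lo + step, n))
            for lo in range(0, n, step)
        )
        print(time.time() - st)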

Printing Parallel Function Outputs in True Order with Python

断了今生、忘了曾经 submitted on 2019-12-24 19:29:29
Question: I am looking to print everything in order for a parallelized Python script. Note that c3 is printed prior to b2, i.e. out of order. Is there any way to give the function below a wait feature? If you rerun, sometimes the print order is correct for shorter batches; however, I am looking for a reproducible solution to this issue.

    from joblib import Parallel, delayed, parallel_backend
    import multiprocessing

    testFrame = [['a', 1], ['b', 2], ['c', 3]]

    def testPrint(letr, numbr):
        print(letr + str(numbr))
        return
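Parallel already returns its results in input order, regardless of which worker finishes first; only the prints emitted inside the workers interleave. A reproducible fix is therefore to build the strings in the workers and print them afterwards in the parent, as in this sketch of the question's example:

    from joblib import Parallel, delayed

    testFrame = [['a', 1], ['b', 2], ['c', 3]]

    def testPrint(letr, numbr):
        # Return the line instead of printing it inside the worker.
        return letr + str(numbr)

    lines = Parallel(n_jobs=2)(delayed(testPrint)(l, n) for l, n in testFrame)
    for line in lines:  # results arrive in input order: a1, b2, c3
        print(line)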

Scoring returning a numpy.core.memmap instead of a numpy.Number in grid search

穿精又带淫゛_ submitted on 2019-12-24 18:02:45
Question: We are able (only within the context of our application at the moment) to reproduce the following problem on Ubuntu 15.04 and OS X with scikit-learn 0.17, when using GridSearchCV with a LogisticRegression on larger data sets:

    ...........................................................................
    /Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/pipeline.py in fit(self=Pipeline(steps=[('cpencoder', <cpml.whitebox.Lin...s', refit=True, scoring=u'roc_auc', verbose=1))]), X= Unnamed:
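The memmap comes from joblib memory-mapping large arrays when handing them to workers, so a scorer can end up computing on (and returning) a numpy.core.memmap instead of a scalar. One workaround for this class of problem, not taken from the question, is to wrap the scorer so it always returns a plain Python float:

    from sklearn.metrics import make_scorer, roc_auc_score

    def roc_auc_as_float(y_true, y_score):
        # Cast explicitly so grid search always receives a plain number,
        # even when the inputs arrived as memory-mapped arrays.
        return float(roc_auc_score(y_true, y_score))

    # Hypothetical drop-in replacement for scoring=u'roc_auc':
    scoring = make_scorer(roc_auc_as_float, needs_threshold=True)

Passing this scoring object to GridSearchCV behaves as before, minus the memmap leaking into the scores.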
