data-science

pyspark.ml pipelines: are custom transformers necessary for basic preprocessing tasks?

梦想的初衷 submitted on 2019-11-29 05:13:38
Getting started with pyspark.ml and the pipelines API, I find myself writing custom transformers for typical preprocessing tasks in order to use them in a pipeline. Examples:

from pyspark.ml import Pipeline, Transformer

class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()

class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data
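For plain column selection, a custom Transformer is not strictly required: the built-in SQLTransformer stage can express it directly. A minimal sketch, with a made-up DataFrame and column names used only for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0, "a"), (2, 3.0, "b")], ["id", "value", "label"])

# __THIS__ is replaced by the DataFrame flowing through the pipeline stage
selector = SQLTransformer(statement="SELECT id, value FROM __THIS__")
pipeline = Pipeline(stages=[selector])
pipeline.fit(df).transform(df).show()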

quantile normalization on pandas dataframe

ぐ巨炮叔叔 submitted on 2019-11-28 21:39:35
Simply speaking, how do I apply quantile normalization to a large pandas DataFrame (probably 2,000,000 rows) in Python? PS. I know there is a package named rpy2 which can run R in a subprocess, using quantile normalization in R. But the truth is that R cannot compute the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6
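A pure pandas/NumPy implementation avoids the round trip through R entirely. One common sketch, assuming rows are observations and columns are the samples to normalize:

import numpy as np
import pandas as pd

def quantile_normalize(df):
    # Reference distribution: row-wise mean of each column sorted independently
    ref = pd.DataFrame(np.sort(df.values, axis=0)).mean(axis=1).values
    # Rank every value within its column (ties take the lowest rank), then
    # replace it with the reference value at that rank
    ranks = df.rank(method="min").astype(int).values - 1
    return pd.DataFrame(ref[ranks], index=df.index, columns=df.columns)

Because everything stays vectorized in NumPy, this scales to a few million rows far better than shelling out to R via rpy2.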

fit_transform() takes 2 positional arguments but 3 were given with LabelBinarizer

北战南征 submitted on 2019-11-28 18:10:22
Question: I am totally new to Machine Learning and I have been working with an unsupervised learning technique. The image shows my sample data (after all cleaning). Screenshot: Sample Data. I have these two pipelines built to clean the data:

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
print(type(num_attribs))

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler',
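This error typically appears when LabelBinarizer is used as a pipeline step: a Pipeline calls fit_transform(X, y), but LabelBinarizer.fit_transform accepts only one data argument. One common workaround is a thin wrapper with the expected signature; a minimal sketch (the class name is made up):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """LabelBinarizer wrapper whose fit/transform accept the (X, y) signature."""
    def __init__(self):
        self.encoder = LabelBinarizer()

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None):
        return self.encoder.transform(X)

In newer scikit-learn versions, applying OneHotEncoder (or OrdinalEncoder) to the categorical column is usually the cleaner replacement.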

How to load a model from an HDF5 file in Keras?

我怕爱的太早我们不能终老 submitted on 2019-11-28 15:17:35
How to load a model from an HDF5 file in Keras? What I tried:

model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(LeakyReLU(alpha=0.3))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(LeakyReLU(alpha=0.3))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary
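How the file can be loaded depends on what was saved. If the whole model was written with model.save(), keras.models.load_model restores architecture, weights, and optimizer state in one call; if only the weights were saved, the architecture must be rebuilt in code first. A short sketch, with placeholder file names:

from keras.models import load_model

# Case 1: the full model was saved with model.save('my_model.h5')
model = load_model('my_model.h5')

# Case 2: only weights were saved with model.save_weights('weights.h5');
# rebuild the Sequential architecture exactly as above, then:
# model.load_weights('weights.h5')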

Filter pandas dataframe by list

£可爱£侵袭症+ submitted on 2019-11-28 14:50:14
I have a dataframe that has a row called "Hybridization REF". I would like to filter it so that I only get the data for the items whose label matches one of the items in my list. Basically, I'd like to do the following:

dataframe[dataframe["Hybridization REF'].apply(lambda: x in list)]

but that syntax is not correct. You can use .loc or column filtering:

df = pd.DataFrame(data=np.random.rand(5,5), columns=list('ABCDE'), index=list('abcde'))
df
          A         B         C         D         E
a  0.460537  0.174788  0.167554  0.298469  0.630961
b  0.728094  0.275326  0.405864  0.302588  0.624046
c  0.953253  0.682038  0.802147  0.105888  0
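The idiomatic way to express this filter is Series.isin. A small self-contained sketch with made-up data and a made-up list of labels:

import pandas as pd

df = pd.DataFrame({"Hybridization REF": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
wanted = ["a", "c"]

# Keep only the rows whose label appears in the list
filtered = df[df["Hybridization REF"].isin(wanted)]
print(filtered)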

Update pandas dataframe based on matching columns of a second dataframe

大兔子大兔子 submitted on 2019-11-28 14:10:17
I have two pandas dataframes (df_1, df_2) with the same columns, but in one dataframe (df_1) some values of one column are missing. I want to fill in those missing values from df_2, but only when the values of two columns match. Here is a little example of what my data looks like (tables for df_1 and df_2). I tried to add the missing values with:

df_1.update(df_2, overwrite=False)

But the problem is that it fills in the values even when just one column matches. I want to fill in the value only when the columns "housenumber" AND "street" both match. I think you need set_index for a MultiIndex in both
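A sketch of that approach, assuming "street" and "housenumber" together form the matching key (the data here are made up):

import pandas as pd

df_1 = pd.DataFrame({"street": ["A", "B"], "housenumber": [1, 2], "value": [None, 20]})
df_2 = pd.DataFrame({"street": ["A", "B"], "housenumber": [1, 3], "value": [10, 30]})

# Align on the composite key so update() only fills rows where BOTH columns match
df_1 = df_1.set_index(["street", "housenumber"])
df_1.update(df_2.set_index(["street", "housenumber"]), overwrite=False)
df_1 = df_1.reset_index()
print(df_1)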

reshape multi id repeated variable readings from long to wide

ⅰ亾dé卋堺 submitted on 2019-11-28 11:46:21
This is what I have:

id<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
measure<-c("speed","weight","time","speed","weight","time","speed","weight","time",
           "speed","weight","time","speed","weight","time","speed","weight","time")
value<-c(1.23,10.3,33,1.44,10.4,31,1.21,10.1,33,4.25,12.5,38,1.74,10.8,31,3.21,10.3,33)
testdf<-data.frame(id,measure,value)

This is what I want:

id<-c(1,1,1,2,2,2)
speed<-c(1.23,1.44,1.21,4.25,1.74,3.21)
weight<-c(10.3,10.4,10.1,12.5,10.8,10.3)
time<-c(33,31,33,37,31,33)
res<-data.frame(id,speed,weight,time)

The issue lies in that my variables speed, weight and time are
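Because each id has several readings of every measure, a plain long-to-wide reshape needs an observation index within each id/measure group before spreading. A sketch using dplyr/tidyr (the package choice is an assumption; data.table::dcast with the same added index works equally well):

library(dplyr)
library(tidyr)

res <- testdf %>%
  group_by(id, measure) %>%
  mutate(obs = row_number()) %>%   # number the repeated readings within each id/measure
  ungroup() %>%
  pivot_wider(names_from = measure, values_from = value) %>%
  select(-obs)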

ValueError: Wrong number of items passed - Meaning and suggestions?

微笑、不失礼 submitted on 2019-11-28 08:54:43
I am receiving the error ValueError: Wrong number of items passed 3, placement implies 1, and I am struggling to figure out where and how to begin addressing the problem. I don't really understand the meaning of the error, which makes it difficult for me to troubleshoot. I have also included the block of code that triggers the error in my Jupyter Notebook. The data is tough to attach, so I am not looking for anyone to try to re-create this error for me. I am just looking for some feedback on how I could address it.

KeyError Traceback (most recent call last)
C:\Users
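The message usually means a multi-column block of values is being placed where pandas expects a single column. A minimal reproduction of that shape mismatch (made-up data; the exact wording of the error varies between pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

try:
    # 3 columns of values aimed at a single target column -> shape mismatch
    df["b"] = np.random.rand(3, 3)
except ValueError as exc:
    print(exc)

# Fix: make the shapes agree, e.g. assign a 1-D array (or pick one column)
df["b"] = np.random.rand(3)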

How to perform feature selection with gridsearchcv in sklearn in python

喜夏-厌秋 submitted on 2019-11-28 01:16:20
I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForest classifier, as follows.

X = df[[my_features]]  # all my features
y = df['gold_standard']  # labels
clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])

I am also performing GridSearchCV as follows to tune the hyperparameters of the RandomForestClassifier.

X = df[[my
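One common pattern is to wrap the feature selection and the classifier in a single Pipeline, so that RFECV is refit inside every grid-search split instead of once on all the data. A sketch under that assumption (step names and parameter values are placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier(random_state=42, class_weight="balanced")

pipe = Pipeline([
    ("feature_select", RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring="roc_auc")),
    ("model", RandomForestClassifier(random_state=42, class_weight="balanced")),
])

# Hyperparameters of a pipeline step are addressed through its step name
param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 10],
}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(10), scoring="roc_auc")
# search.fit(X, y)  # X and y as defined in the question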
