data-science

How to get the unique pairs from the given data frame column with file handling?

与世无争的帅哥 submitted on 2020-02-08 02:31:13
Question: Sample data from the dataframe, column Pairs:

(8, 8), (8, 8), (8, 8), (8, 8), (8, 8)
(6, 7), (7, 7), (7, 7), (7, 6), (6, 7)
(2, 12), (12, 3), (3, 4), (4, 12), (12, 12)

```
new_col = []
for e in content.Pairs:
    new_col.append(list(dict.fromkeys(e)))
content['Unique'] = new_col
```

The expected output is the unique pairs from the Pairs column, like this: (8, 8), (6, 7), (7, 6), (7, 7), (2, 12) and so on. What I actually get when trying the above code is:

Unique
['8', '']
['6', '7', '']
['2', '12', '3', '4', '']

what is
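The code above iterates over each cell as a string, so `dict.fromkeys(e)` deduplicates individual characters, which is where results like `['8', '']` come from. A sketch of one possible fix (not from the original thread), assuming each cell of the Pairs column holds a string of tuples:

```python
# Sketch: deduplicate tuple pairs per row, assuming each cell in the
# 'Pairs' column is a string such as "(8, 8), (8, 8), (6, 7)".
import ast
import pandas as pd

content = pd.DataFrame({
    "Pairs": [
        "(8, 8), (8, 8), (8, 8), (8, 8), (8, 8)",
        "(6, 7), (7, 7), (7, 7), (7, 6), (6, 7)",
        "(2, 12), (12, 3), (3, 4), (4, 12), (12, 12)",
    ]
})

def unique_pairs(cell):
    # literal_eval turns the whole cell into a tuple of (int, int) tuples;
    # dict.fromkeys then deduplicates while preserving first-seen order.
    pairs = ast.literal_eval(cell)
    return list(dict.fromkeys(pairs))

content["Unique"] = content["Pairs"].map(unique_pairs)
print(content["Unique"].tolist())
```

Parsing the cell with `ast.literal_eval` yields real `(int, int)` tuples, which are hashable, so `dict.fromkeys` can deduplicate whole pairs rather than characters.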

How to fix OverflowError: Overflow in int64 addition

拈花ヽ惹草 submitted on 2020-02-02 13:13:27
Question: I'm trying to subtract the column df['date_of_admission'] from the column df['DOB'] to find the difference between them and store the age value in the df['age'] column; however, I'm getting this error:

OverflowError: Overflow in int64 addition

DOB         date_of_admission    age
2000-05-07  2019-01-19 12:26:00
1965-01-30  2019-03-21 02:23:12
NaT         2018-11-02 18:30:10
1981-05-01  2019-05-08 12:26:00
1957-01-10  2018-12-31 04:01:15
1968-07-14  2019-01-28 15:05:09
NaT         2018-04-13 06:20:01
NaT         2019-02-15 01:01:57
2001-02
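One commonly suggested workaround (an assumption, not necessarily the thread's accepted answer) is to compute the age from year/month/day components instead of going through a nanosecond Timedelta, which sidesteps the int64 arithmetic entirely and tolerates NaT. Column names follow the question; the values are illustrative:

```python
# Sketch: age in whole years from date components, avoiding the
# Timedelta (int64 nanosecond) representation that can overflow.
import pandas as pd

df = pd.DataFrame({
    "DOB": pd.to_datetime(["2000-05-07", "1965-01-30", None]),
    "date_of_admission": pd.to_datetime(
        ["2019-01-19 12:26:00", "2019-03-21 02:23:12", "2018-11-02 18:30:10"]),
})

adm, dob = df["date_of_admission"], df["DOB"]
# Year difference, minus 1 when admission falls before that year's birthday;
# NaT rows in DOB propagate as NaN ages.
df["age"] = adm.dt.year - dob.dt.year - (
    (adm.dt.month * 100 + adm.dt.day) < (dob.dt.month * 100 + dob.dt.day)
).astype(int)
print(df["age"].tolist())
```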

How to remove extra commas from data in Python

眉间皱痕 submitted on 2020-01-26 04:40:09
Question: I have a CSV file from which I am trying to load data into my SQL table, which has 2 columns. The data is separated by commas, which mark the next field, but the second column contains text with commas inside it. Because of these extra commas I am not able to load the data into my SQL table, as the rows look like they have extra columns. I have millions of rows of data. How can I remove these extra commas?

Data:

Number Address
"12345" , "123 abc street, Unit 345"
"67893" ,
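Since the troublesome commas sit inside quoted fields, a CSV parser that honours the quotes splits the rows correctly; only the stray spaces around the delimiter need care. A sketch under that assumption (the second address and the in-memory file are illustrative stand-ins):

```python
# Sketch: parse quote-delimited fields so commas inside quotes survive.
import csv
import io

raw = '''"12345" , "123 abc street, Unit 345"
"67893" , "456 def avenue, Apt 7"
'''

rows = []
with io.StringIO(raw) as f:  # stand-in for open("addresses.csv")
    # skipinitialspace lets the opening quote after ", " be recognised;
    # strip() removes the space left before the delimiter.
    for number, address in csv.reader(f, skipinitialspace=True):
        rows.append((number.strip(), address.strip()))

print(rows)
```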

Fill timestamp gaps in large dataset

99封情书 submitted on 2020-01-24 13:55:46
Question: I have a dataset with 100K+ rows; one column of this dataset is a datetime column, let's name it A. My dataset is sorted by column A. I want to "fill the gaps" in my dataset, i.e. if I have these two rows following each other:

0 2019-03-13 08:12:20
1 2019-03-13 08:12:25

I want to add the missing seconds between them, so that as a result I'll have this:

0 2019-03-13 08:12:20
1 2019-03-13 08:12:21
2 2019-03-13 08:12:22
3 2019-03-13 08:12:23
4 2019-03-13 08:12:24
5 2019-03-13 08:12:25

I don't
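A sketch of one way to do this in pandas (an assumption, not necessarily the thread's answer): set A as the index and upsample to one-second frequency, letting asfreq() insert the missing rows. The value column is illustrative:

```python
# Sketch: upsample a sorted datetime column to one-second frequency.
import pandas as pd

df = pd.DataFrame({
    "A": pd.to_datetime(["2019-03-13 08:12:20", "2019-03-13 08:12:25"]),
    "value": [1.0, 2.0],
})

filled = (
    df.set_index("A")
      .resample("1s")    # one row per second between min(A) and max(A)
      .asfreq()          # keep the new rows as NaN instead of aggregating
      .reset_index()
)
print(filled)
```

The other columns come back as NaN for the inserted seconds; `.ffill()` instead of `.asfreq()` would carry the last value forward.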

Melt columns by a substring of the column names in pandas (Python)

 ̄綄美尐妖づ submitted on 2020-01-24 12:52:47
Question: I have this dataframe:

subject  A_target_word_gd  A_target_word_fd  B_target_word_gd  B_target_word_fd  subject_type
1        1                 2                 3                 4                 mild
2        11                12                13                14                moderate

And I want to melt it to a dataframe that will look like this:

cond  subject  subject_type  value_type  value
A     1        mild          gd          1
A     1        mild          fd          2
B     1        mild          gd          3
B     1        mild          fd          4
A     2        moderate      gd          11
A     2        moderate      fd          12
B     2        moderate      gd          13
B     2        moderate      fd          14
...

That is, to melt based on the delimiter in the column names. What is the best way to do that?

Answer 1: One more approach
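A sketch of one possible approach (names taken from the question): melt everything, then split the former column name on its first and last underscore-delimited pieces to recover cond and value_type:

```python
# Sketch: melt wide columns, then parse "A_target_word_gd" into
# cond "A" and value_type "gd" with a regex on the old column name.
import pandas as pd

df = pd.DataFrame({
    "subject": [1, 2],
    "A_target_word_gd": [1, 11],
    "A_target_word_fd": [2, 12],
    "B_target_word_gd": [3, 13],
    "B_target_word_fd": [4, 14],
    "subject_type": ["mild", "moderate"],
})

long = df.melt(id_vars=["subject", "subject_type"], var_name="col")
# First group: text before the first "_"; second group: text after the last "_".
extracted = long["col"].str.extract(r"^([^_]+)_.*_([^_]+)$")
long["cond"] = extracted[0]
long["value_type"] = extracted[1]
long = long[["cond", "subject", "subject_type", "value_type", "value"]]
print(long.sort_values(["subject", "cond"]).to_string(index=False))
```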

How to prepare training data for image classification

China☆狼群 submitted on 2020-01-24 12:19:09
Question: I'm new to machine learning and have some problems with image classification. Using a simple classifier technique, K Nearest Neighbours, I'm trying to distinguish cats from dogs. My code so far:

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
CATEGORIES = ['Dog', 'Cat']
IMG_SIZE = 30

data = []
categories = []
for category in CATEGORIES:
    path
```
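Once each image has been read and resized to IMG_SIZE x IMG_SIZE (e.g. with PIL or OpenCV, omitted here), it becomes a flat feature vector. A minimal 1-nearest-neighbour sketch on synthetic grayscale "images" (all data below is made up) shows the shape of that pipeline:

```python
# Sketch: flatten images into feature vectors and classify a query image
# by its single nearest training neighbour (Euclidean distance).
import numpy as np

rng = np.random.default_rng(0)
IMG_SIZE = 30

# Stand-ins for loaded, resized images: class 0 is dark, class 1 is bright.
train_imgs = np.concatenate([
    rng.uniform(0.0, 0.4, size=(20, IMG_SIZE, IMG_SIZE)),
    rng.uniform(0.6, 1.0, size=(20, IMG_SIZE, IMG_SIZE)),
])
train_labels = np.array([0] * 20 + [1] * 20)

X = train_imgs.reshape(len(train_imgs), -1)   # flatten to (n, 900)

def predict_1nn(img):
    # Distance from the query to every training vector; vote of the closest.
    dists = np.linalg.norm(X - img.reshape(1, -1), axis=1)
    return train_labels[dists.argmin()]

query = rng.uniform(0.6, 1.0, size=(IMG_SIZE, IMG_SIZE))  # a "bright" image
print(predict_1nn(query))
```

For real photos, `sklearn.neighbors.KNeighborsClassifier` on the same flattened arrays does this with k > 1 and faster search structures.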

Random Forest Classifier: to which class do the probabilities correspond?

我只是一个虾纸丫 submitted on 2020-01-24 11:17:11
Question: I am using the RandomForestClassifier from pyspark.ml.classification. I run the model on a binary-class dataset and display the probabilities. I have the following in the probability column:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+-----------------------------------
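For context: the probability vector in Spark ML is positional, so entry i is the estimated probability of class i (here, entry 0 for label 0.0 and entry 1 for label 1.0). A plain-Python sketch over the two rows from the question, no Spark required:

```python
# Sketch: map each positional probability back to its class value.
rows = [
    {"label": 0.0, "prediction": 0.0,
     "probability": [0.9005918461098429, 0.0994081538901571]},
    {"label": 1.0, "prediction": 1.0,
     "probability": [0.6051335859900139, 0.3948664140099861]},
]

for row in rows:
    p = row["probability"]
    by_class = {0.0: p[0], 1.0: p[1]}   # class value -> probability
    print(row["prediction"], by_class[row["prediction"]])
```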

SVC classifier taking too much time for training

僤鯓⒐⒋嵵緔 submitted on 2020-01-24 01:05:12
Question: I am using the SVC classifier with a linear kernel to train my model. Training data: 42,000 records.

```
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(self.features_test)
train_accuracy = model.score(self.features_train, self.labels_train)
test_accuracy = model.score(self.features_test, self.labels_test)
```

It takes more than 2 hours to train my model. Am I doing something wrong? Also, what can be done to improve the time? Thanks in advance.

Answer 1: There are
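A commonly suggested alternative for a linear kernel at this scale (an assumption about what the truncated answer covers) is LinearSVC, whose liblinear solver scales much better with the number of rows than SVC's libsvm; probability=True also adds costly cross-validated calibration that LinearSVC skips. A sketch on synthetic data standing in for the 42,000-row training set:

```python
# Sketch: LinearSVC as a faster drop-in for SVC(kernel='linear').
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable labels

model = LinearSVC()                        # liblinear: near-linear in n
model.fit(X, y)
print(round(model.score(X, y), 2))
```

If calibrated probabilities are still needed, wrapping the model in `sklearn.calibration.CalibratedClassifierCV` restores them at a fraction of SVC's cost.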