data-science

How to get the unique pairs from the given data frame column with file handling?

与世无争的帅哥 submitted on 2020-02-08 02:31:13
Question: Sample data from the dataframe, column Pairs:

(8, 8), (8, 8), (8, 8), (8, 8), (8, 8)
(6, 7), (7, 7), (7, 7), (7, 6), (6, 7)
(2, 12), (12, 3), (3, 4), (4, 12), (12, 12)

```
new_col = []
for e in content.Pairs:
    new_col.append(list(dict.fromkeys(e)))
content['Unique'] = new_col
```

The expected output is the unique pairs from the Pairs column, like this: (8, 8), (6, 7), (7, 6), (7, 7), (2, 12) and so on. What I actually get when trying the above code is:

Unique
['8', '']
['6', '7', '']
['2', '12', '3', '4', '']

what is
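The code above iterates over each cell as a string, so `dict.fromkeys(e)` deduplicates individual characters, which is where results like `['8', '']` come from. A sketch of one possible fix (not from the original thread), assuming each cell of the Pairs column holds a string of tuples:

```python
# Sketch: deduplicate tuple pairs per row, assuming each cell in the
# 'Pairs' column is a string such as "(8, 8), (8, 8), (6, 7)".
import ast
import pandas as pd

content = pd.DataFrame({
    "Pairs": [
        "(8, 8), (8, 8), (8, 8), (8, 8), (8, 8)",
        "(6, 7), (7, 7), (7, 7), (7, 6), (6, 7)",
        "(2, 12), (12, 3), (3, 4), (4, 12), (12, 12)",
    ]
})

def unique_pairs(cell):
    # literal_eval turns the whole cell into a tuple of (int, int) tuples;
    # dict.fromkeys then deduplicates while preserving first-seen order.
    pairs = ast.literal_eval(cell)
    return list(dict.fromkeys(pairs))

content["Unique"] = content["Pairs"].map(unique_pairs)
print(content["Unique"].tolist())
```

Parsing the cell with `ast.literal_eval` yields real `(int, int)` tuples, which are hashable, so `dict.fromkeys` can deduplicate whole pairs rather than characters.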

How to fix OverflowError: Overflow in int64 addition

拈花ヽ惹草 submitted on 2020-02-02 13:13:27
Question: I'm trying to subtract the column df['date_of_admission'] from the column df['DOB'] to find the difference between them and store the age value in the df['age'] column; however, I'm getting this error:

OverflowError: Overflow in int64 addition

DOB         date_of_admission    age
2000-05-07  2019-01-19 12:26:00
1965-01-30  2019-03-21 02:23:12
NaT         2018-11-02 18:30:10
1981-05-01  2019-05-08 12:26:00
1957-01-10  2018-12-31 04:01:15
1968-07-14  2019-01-28 15:05:09
NaT         2018-04-13 06:20:01
NaT         2019-02-15 01:01:57
2001-02
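One commonly suggested workaround (an assumption, not necessarily the thread's accepted answer) is to compute the age from year/month/day components instead of going through a nanosecond Timedelta, which sidesteps the int64 arithmetic entirely and tolerates NaT. Column names follow the question; the values are illustrative:

```python
# Sketch: age in whole years from date components, avoiding the
# Timedelta (int64 nanosecond) representation that can overflow.
import pandas as pd

df = pd.DataFrame({
    "DOB": pd.to_datetime(["2000-05-07", "1965-01-30", None]),
    "date_of_admission": pd.to_datetime(
        ["2019-01-19 12:26:00", "2019-03-21 02:23:12", "2018-11-02 18:30:10"]),
})

adm, dob = df["date_of_admission"], df["DOB"]
# Year difference, minus 1 when admission falls before that year's birthday;
# NaT rows in DOB propagate as NaN ages.
df["age"] = adm.dt.year - dob.dt.year - (
    (adm.dt.month * 100 + adm.dt.day) < (dob.dt.month * 100 + dob.dt.day)
).astype(int)
print(df["age"].tolist())
```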

How to remove extra commas from data in Python

眉间皱痕 submitted on 2020-01-26 04:40:09
Question: I have a CSV file from which I am trying to load data into my SQL table, which has 2 columns. The data is separated by commas, which mark the next field, but the second column contains text with commas inside it. Because of these extra commas I am not able to load the data into my SQL table, as the rows look like they have extra columns. I have millions of rows of data. How can I remove these extra commas?

Data:

Number Address
"12345" , "123 abc street, Unit 345"
"67893" ,
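Since the troublesome commas sit inside quoted fields, a CSV parser that honours the quotes splits the rows correctly; only the stray spaces around the delimiter need care. A sketch under that assumption (the second address and the in-memory file are illustrative stand-ins):

```python
# Sketch: parse quote-delimited fields so commas inside quotes survive.
import csv
import io

raw = '''"12345" , "123 abc street, Unit 345"
"67893" , "456 def avenue, Apt 7"
'''

rows = []
with io.StringIO(raw) as f:  # stand-in for open("addresses.csv")
    # skipinitialspace lets the opening quote after ", " be recognised;
    # strip() removes the space left before the delimiter.
    for number, address in csv.reader(f, skipinitialspace=True):
        rows.append((number.strip(), address.strip()))

print(rows)
```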

Fill timestamp gaps in large dataset

99封情书 submitted on 2020-01-24 13:55:46
Question: I have a dataset with 100K+ rows; one column of this dataset is a datetime column, let's name it A. My dataset is sorted by column A. I want to "fill the gaps" in my dataset, i.e. if I have these two rows following each other:

0 2019-03-13 08:12:20
1 2019-03-13 08:12:25

I want to add the missing seconds between them, so that as a result I'll have this:

0 2019-03-13 08:12:20
1 2019-03-13 08:12:21
2 2019-03-13 08:12:22
3 2019-03-13 08:12:23
4 2019-03-13 08:12:24
5 2019-03-13 08:12:25

I don't
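A sketch of one way to do this in pandas (an assumption, not necessarily the thread's answer): set A as the index and upsample to one-second frequency, letting asfreq() insert the missing rows. The value column is illustrative:

```python
# Sketch: upsample a sorted datetime column to one-second frequency.
import pandas as pd

df = pd.DataFrame({
    "A": pd.to_datetime(["2019-03-13 08:12:20", "2019-03-13 08:12:25"]),
    "value": [1.0, 2.0],
})

filled = (
    df.set_index("A")
      .resample("1s")    # one row per second between min(A) and max(A)
      .asfreq()          # keep the new rows as NaN instead of aggregating
      .reset_index()
)
print(filled)
```

The other columns come back as NaN for the inserted seconds; `.ffill()` instead of `.asfreq()` would carry the last value forward.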

Melt columns by a substring of the column names in pandas (Python)

 ̄綄美尐妖づ submitted on 2020-01-24 12:52:47
Question: I have this dataframe:

subject  A_target_word_gd  A_target_word_fd  B_target_word_gd  B_target_word_fd  subject_type
1        1                 2                 3                 4                 mild
2        11                12                13                14                moderate

And I want to melt it to a dataframe that will look like this:

cond  subject  subject_type  value_type  value
A     1        mild          gd          1
A     1        mild          fd          2
B     1        mild          gd          3
B     1        mild          fd          4
A     2        moderate      gd          11
A     2        moderate      fd          12
B     2        moderate      gd          13
B     2        moderate      fd          14
...

That is, to melt based on the delimiter in the column names. What is the best way to do that?

Answer 1: One more approach
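A sketch of one possible approach (names taken from the question): melt everything, then split the former column name on its first and last underscore-delimited pieces to recover cond and value_type:

```python
# Sketch: melt wide columns, then parse "A_target_word_gd" into
# cond "A" and value_type "gd" with a regex on the old column name.
import pandas as pd

df = pd.DataFrame({
    "subject": [1, 2],
    "A_target_word_gd": [1, 11],
    "A_target_word_fd": [2, 12],
    "B_target_word_gd": [3, 13],
    "B_target_word_fd": [4, 14],
    "subject_type": ["mild", "moderate"],
})

long = df.melt(id_vars=["subject", "subject_type"], var_name="col")
# First group: text before the first "_"; second group: text after the last "_".
extracted = long["col"].str.extract(r"^([^_]+)_.*_([^_]+)$")
long["cond"] = extracted[0]
long["value_type"] = extracted[1]
long = long[["cond", "subject", "subject_type", "value_type", "value"]]
print(long.sort_values(["subject", "cond"]).to_string(index=False))
```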

How to prepare training data for image classification

China☆狼群 submitted on 2020-01-24 12:19:09
Question: I'm new to machine learning and have some problems with image classification. Using a simple classifier technique, K Nearest Neighbours, I'm trying to distinguish cats from dogs. My code so far:

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
CATEGORIES = ['Dog', 'Cat']
IMG_SIZE = 30

data = []
categories = []
for category in CATEGORIES:
    path
```
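Once each image has been read and resized to IMG_SIZE x IMG_SIZE (e.g. with PIL or OpenCV, omitted here), it becomes a flat feature vector. A minimal 1-nearest-neighbour sketch on synthetic grayscale "images" (all data below is made up) shows the shape of that pipeline:

```python
# Sketch: flatten images into feature vectors and classify a query image
# by its single nearest training neighbour (Euclidean distance).
import numpy as np

rng = np.random.default_rng(0)
IMG_SIZE = 30

# Stand-ins for loaded, resized images: class 0 is dark, class 1 is bright.
train_imgs = np.concatenate([
    rng.uniform(0.0, 0.4, size=(20, IMG_SIZE, IMG_SIZE)),
    rng.uniform(0.6, 1.0, size=(20, IMG_SIZE, IMG_SIZE)),
])
train_labels = np.array([0] * 20 + [1] * 20)

X = train_imgs.reshape(len(train_imgs), -1)   # flatten to (n, 900)

def predict_1nn(img):
    # Distance from the query to every training vector; vote of the closest.
    dists = np.linalg.norm(X - img.reshape(1, -1), axis=1)
    return train_labels[dists.argmin()]

query = rng.uniform(0.6, 1.0, size=(IMG_SIZE, IMG_SIZE))  # a "bright" image
print(predict_1nn(query))
```

For real photos, `sklearn.neighbors.KNeighborsClassifier` on the same flattened arrays does this with k > 1 and faster search structures.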

Random Forest Classifier: to which class do the probabilities correspond?

我只是一个虾纸丫 submitted on 2020-01-24 11:17:11
Question: I am using the RandomForestClassifier from pyspark.ml.classification. I run the model on a binary-class dataset and display the probabilities. I have the following in the probability column:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+-----------------------------------
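For context: the probability vector in Spark ML is positional, so entry i is the estimated probability of class i (here, entry 0 for label 0.0 and entry 1 for label 1.0). A plain-Python sketch over the two rows from the question, no Spark required:

```python
# Sketch: map each positional probability back to its class value.
rows = [
    {"label": 0.0, "prediction": 0.0,
     "probability": [0.9005918461098429, 0.0994081538901571]},
    {"label": 1.0, "prediction": 1.0,
     "probability": [0.6051335859900139, 0.3948664140099861]},
]

for row in rows:
    p = row["probability"]
    by_class = {0.0: p[0], 1.0: p[1]}   # class value -> probability
    print(row["prediction"], by_class[row["prediction"]])
```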

SVC classifier taking too much time for training

僤鯓⒐⒋嵵緔 submitted on 2020-01-24 01:05:12
Question: I am using the SVC classifier with a linear kernel to train my model. Training data: 42,000 records.

```
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(self.features_test)
train_accuracy = model.score(self.features_train, self.labels_train)
test_accuracy = model.score(self.features_test, self.labels_test)
```

It takes more than 2 hours to train my model. Am I doing something wrong? Also, what can be done to improve the time? Thanks in advance.

Answer 1: There are
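A commonly suggested alternative for a linear kernel at this scale (an assumption about what the truncated answer covers) is LinearSVC, whose liblinear solver scales much better with the number of rows than SVC's libsvm; probability=True also adds costly cross-validated calibration that LinearSVC skips. A sketch on synthetic data standing in for the 42,000-row training set:

```python
# Sketch: LinearSVC as a faster drop-in for SVC(kernel='linear').
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable labels

model = LinearSVC()                        # liblinear: near-linear in n
model.fit(X, y)
print(round(model.score(X, y), 2))
```

If calibrated probabilities are still needed, wrapping the model in `sklearn.calibration.CalibratedClassifierCV` restores them at a fraction of SVC's cost.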