data-analysis

What to do with missing values when plotting with seaborn?

耗尽温柔 submitted on 2019-12-09 14:45:42
Question: I replaced the missing values with NaN using the following lambda function: data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x), where data is the DataFrame I am working on. Afterwards, I tried to plot one of its attributes, alcconsumption, using seaborn.distplot as follows: seaborn.distplot(data['alcconsumption'], hist=True, bins=100) plt.xlabel('AlcoholConsumption') plt.ylabel('Frequency(normalized 0->1)') It's giving me the following error:
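The error text is cut off in this excerpt, so its cause cannot be confirmed. As a minimal sketch, assuming the problem is that whitespace strings and NaN values reach the plotting call, the column can be converted to numbers and the missing entries dropped first. The sample values below are hypothetical stand-ins for the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: whitespace-only strings mark missing values.
data = pd.DataFrame({"alcconsumption": ["0.5", " ", "7.25", "10.0", "   ", "5.5"]})

# Turn whitespace-only strings into NaN, then coerce everything else to floats.
alc = data["alcconsumption"].replace(r"^\s*$", np.nan, regex=True)
alc = pd.to_numeric(alc, errors="coerce")

# Drop the NaN entries so only real observations remain.
alc_clean = alc.dropna()

print(alc_clean.tolist())
```

The cleaned Series (alc_clean) could then be passed to the plotting call from the question in place of data['alcconsumption'].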

When should I use C++ instead of SQL?

自作多情 submitted on 2019-12-09 14:23:31
Question: I am a C++ programmer who occasionally uses MySQL to work with databases, but my SQL knowledge is rather limited. However, I am certainly willing to change that. At the moment I am trying to do analysis(!) on the data I have in my database solely with SQL queries. But I am about to give up and instead import the data into C++ and do the analysis with C++ code. I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly

How to plot two DataFrames on the same graph for comparison

醉酒当歌 submitted on 2019-12-09 06:30:25
Question: I have two DataFrames (trail1 and trail2) with the following columns: Genre, City, and Number Sold. Now I want to create a bar graph of both data sets for a side-by-side comparison of Genre vs. total Number Sold. For each genre, I want two bars: one representing trail 1 and the other representing trail 2. How can I achieve this using pandas? I tried the following approach, which did NOT work: gf1 = df1.groupby(['Genre']) gf2 = df2.groupby(['Genre']) gf1Plot = gf1.sum().unstack().plot(kind=
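One possible way to get two bars per genre, sketched here with hypothetical sample data, is to sum Number Sold per Genre in each DataFrame and place the two totals side by side as columns of a single DataFrame. The names totals1, totals2, and combined are illustrative and not taken from the question.

```python
import pandas as pd

# Hypothetical sample data using the column names given in the question.
df1 = pd.DataFrame({"Genre": ["Rock", "Jazz", "Rock", "Pop"],
                    "City": ["A", "A", "B", "B"],
                    "Number Sold": [10, 5, 7, 3]})
df2 = pd.DataFrame({"Genre": ["Rock", "Jazz", "Pop", "Pop"],
                    "City": ["A", "B", "A", "B"],
                    "Number Sold": [4, 6, 2, 8]})

# Total Number Sold per genre for each data set.
totals1 = df1.groupby("Genre")["Number Sold"].sum()
totals2 = df2.groupby("Genre")["Number Sold"].sum()

# One row per genre, one column per data set; genres missing from one side become 0.
combined = pd.concat({"trail1": totals1, "trail2": totals2}, axis=1).fillna(0)

print(combined)
```

With matplotlib installed, calling combined.plot(kind="bar") on this frame draws one group of bars per genre with one bar per column.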

Fourier transform with python

旧城冷巷雨未停 submitted on 2019-12-08 18:14:30
I have a set of data. It obviously has some periodic nature. I want to find out what frequency it has by using the Fourier transform and plot it. Here is my attempt, but the result does not look good. This is the corresponding code; I don't know why it fails: import numpy from pylab import * from scipy.fftpack import fft,fftfreq import matplotlib.pyplot as plt dataset = numpy.genfromtxt(fname='data.txt',skip_header=1) t = dataset[:,0] signal = dataset[:,1] npts=len(t) FFT = abs(fft(signal)) freqs = fftfreq(npts, t[1]-t[0]) subplot(211) plot(t[:npts], signal[:npts]) subplot(212) plot
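The excerpt does not show the plot or the data, so the reason the attempt "fails" is unknown. A frequent cause of an unreadable spectrum is a large zero-frequency (mean) component that dwarfs the real peak. The sketch below, on a synthetic signal standing in for data.txt, subtracts the mean and uses NumPy's real-input FFT so that only non-negative frequencies are returned. The sampling step and the 5 Hz test tone are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for data.txt: a 5 Hz sine with a constant offset,
# sampled every 0.01 s (200 samples, i.e. 2 seconds).
dt = 0.01
t = np.arange(200) * dt
signal = 3.0 + np.sin(2 * np.pi * 5.0 * t)

# Subtract the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))

# Frequencies (in Hz) matching each bin of the real-input FFT.
freqs = np.fft.rfftfreq(len(signal), d=dt)

# The strongest bin gives the dominant frequency of the signal.
dominant = freqs[np.argmax(spectrum)]
print(dominant)
```

Plotting freqs against spectrum then shows a single peak at the dominant frequency instead of a spike at zero.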

Pandas: convert datetime timestamp to whether it's day or night?

白昼怎懂夜的黑 submitted on 2019-12-08 09:39:30
Question: I am trying to determine whether it is day or night based on a list of timestamps. Would it be correct to just check whether the hour falls between 7:00 AM and 6:00 PM to classify it as "day", and otherwise as "night", as I have done in the code below? I am not sure about this, because sometimes it is still day after 6 PM, so what is an accurate way to differentiate between day and night using Python? Sample data (timezone = UTC/Zulu time): timestamps = ['2015-03-25 21:15:00', '2015-06-27 18:24:00', '2015-06-27 18:22:00', '2015-06-27 18
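The fixed-hour rule described in the question can be sketched as follows. It uses the three complete timestamps from the excerpt plus one hypothetical midday value added for illustration. This is only the simple heuristic: an accurate answer depends on sunrise and sunset at a specific location and date, and the excerpt gives no location.

```python
import pandas as pd

# The first three values come from the question; the last one is a hypothetical
# midday timestamp added so that both labels appear in the output.
timestamps = ["2015-03-25 21:15:00", "2015-06-27 18:24:00",
              "2015-06-27 18:22:00", "2015-06-27 12:00:00"]

# Parse the strings as UTC timestamps and extract the hour of day.
ts = pd.to_datetime(pd.Series(timestamps), utc=True)
hours = ts.dt.hour

# Hours 7 through 17 (07:00 up to, but not including, 18:00) count as "day".
labels = hours.between(7, 17).map({True: "day", False: "night"})

print(labels.tolist())
```

Because the timestamps are in UTC, they would need converting to local time (for example with ts.dt.tz_convert) before the hour check means anything for a particular place.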

Converting lists in a pandas DataFrame into columns

☆樱花仙子☆ submitted on 2019-12-08 09:36:52
Question: city state neighborhoods categories Dravosburg PA [asas,dfd] ['Nightlife'] Dravosburg PA [adad] ['Auto_Repair','Automotive'] I have the above DataFrame and want to convert each element of a list into a column, e.g.: city state asas dfd adad Nightlife Auto_Repair Automotive Dravosburg PA 1 1 0 1 1 0 I am using the following code to do this: def list2columns(df): """ to convert list in the columns of a dataframe """ columns=['categories','neighborhoods'] for col in columns: for i in range(len(df)): for
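As an alternative to nested loops, a vectorised sketch is shown below. It assumes the list columns hold Python lists of strings and produces one indicator row per original row (it does not merge rows that share a city). The variable names parts and result are illustrative.

```python
import pandas as pd

# The two rows shown in the question, with the list columns as Python lists.
df = pd.DataFrame({
    "city": ["Dravosburg", "Dravosburg"],
    "state": ["PA", "PA"],
    "neighborhoods": [["asas", "dfd"], ["adad"]],
    "categories": [["Nightlife"], ["Auto_Repair", "Automotive"]],
})

parts = [df[["city", "state"]]]
for col in ["neighborhoods", "categories"]:
    # Join each list into a "a|b" string, then let pandas build
    # one 0/1 indicator column per distinct item.
    parts.append(df[col].str.join("|").str.get_dummies(sep="|"))

# Put the plain columns and the indicator columns side by side.
result = pd.concat(parts, axis=1)
print(result)
```

This relies on the items never containing the "|" separator; a different separator would be needed otherwise.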

plotting a timeseries graph in python using matplotlib from a csv file

情到浓时终转凉″ submitted on 2019-12-08 07:50:44
Question: I have some CSV data in the following format. Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09 L0 St vT 4R 0 0 0 0 0 0 0 0 0 L2 Tx st 4R 8 8 8 8 8 8 8 8 8 L2 Tx ss 4R 1 1 9 6 1 0 0 6 7 I want to plot a time series graph using the columns ( Ln , Dr , Tg , Lab ) as the keys and the 0:0n fields as the values. I have the following code. #!/usr/bin/env python import matplotlib.pyplot as plt import datetime import numpy as np import csv import sys with open("test.csv", 'r'

ECG Data Analysis on a real-time signal in Python

浪子不回头ぞ submitted on 2019-12-08 06:23:25
I am using Python to produce an electrocardiogram (ECG) from signals obtained by an Arduino. I want to perform some analysis on it; what type of analysis, I have yet to decide. However, my question is: is it possible to do this analysis on a real-time flow of data coming through the serial port, or is it easier/better to first save the data to, say, a text file and then perform the analysis on it? Right now I can't wrap my head around how to do it. An extra note: I would at the very minimum like to detect the peaks of the signal (R wave) and the R-R interval (so
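Whether the samples arrive live or from a file, peak detection can run on a buffer of recent samples. The sketch below is a deliberately simple illustration, not a clinical-grade detector: it works on a synthetic signal, and the 250 Hz sampling rate, the spike shape, and the 0.5 threshold are all assumptions chosen for the example.

```python
import numpy as np

fs = 250  # assumed sampling rate in Hz
t = np.arange(5 * fs) / fs  # 5 seconds of sample times

# Synthetic stand-in for an ECG buffer: one narrow spike every 0.8 s.
beat_times = np.arange(0.4, 5.0, 0.8)
signal = sum(np.exp(-((t - b) ** 2) / (2 * 0.01 ** 2)) for b in beat_times)

# A sample is a peak if it exceeds the threshold and is a local maximum.
threshold = 0.5
inner = signal[1:-1]
is_peak = (inner > threshold) & (inner > signal[:-2]) & (inner >= signal[2:])
peak_idx = np.where(is_peak)[0] + 1  # +1 because the first sample was skipped

# R-R intervals in seconds are the gaps between consecutive peaks.
rr = np.diff(peak_idx) / fs
print(peak_idx, rr)
```

For a live stream, the same steps could be applied to a sliding window of the most recent samples read from the serial port; real ECG data would normally also need filtering before thresholding.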

ECG Data Analysis on a real-time signal in Python

人走茶凉 submitted on 2019-12-08 04:54:58
Question: I am using Python to produce an electrocardiogram (ECG) from signals obtained by an Arduino. I want to perform some analysis on it; what type of analysis, I have yet to decide. However, my question is: is it possible to do this analysis on a real-time flow of data coming through the serial port, or is it easier/better to first save the data to, say, a text file and then perform the analysis on it? Right now I can't wrap my head around how to do it. An extra note

Pandas DataFrame find the max after Groupby two columns and get counts

一世执手 submitted on 2019-12-08 02:15:16
Question: I have a DataFrame df as follows: userId pageId tag 0 3122471 e852 18 1 3122471 f3e2 18 2 3122471 7e93 18 3 3122471 2768 6 4 3122471 53d9 6 5 3122471 06d7 15 6 3122471 e31c 15 7 3122471 c6f3 2 8 1234123 fjwe 1 9 1234123 eiae 4 10 1234123 ieha 4 After using df.groupby(['userId', 'tag'])['pageId'].count() to group the data by userId and tag, I get: userId tag 3122471 2 1 6 2 15 2 18 3 1234123 1 1 4 2 Now I want to find the tag that each user has the most, as follows: userId tag
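The expected output is cut off in the excerpt, so the exact target shape is unknown. One way to pick the most frequent tag per user, sketched with the data from the question, is to turn the grouped counts back into a regular DataFrame and select the row with the highest count within each userId. The column name count and the variable top are illustrative.

```python
import pandas as pd

# The data shown in the question.
df = pd.DataFrame({
    "userId": [3122471] * 8 + [1234123] * 3,
    "pageId": ["e852", "f3e2", "7e93", "2768", "53d9", "06d7",
               "e31c", "c6f3", "fjwe", "eiae", "ieha"],
    "tag": [18, 18, 18, 6, 6, 15, 15, 2, 1, 4, 4],
})

# Count pages per (userId, tag) and flatten the result into ordinary columns.
counts = df.groupby(["userId", "tag"])["pageId"].count().reset_index(name="count")

# For each user, idxmax gives the row label of the largest count;
# .loc then selects those rows.
top = counts.loc[counts.groupby("userId")["count"].idxmax()]

print(top)
```

If a user has two tags tied for the highest count, idxmax keeps only the first one it encounters.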