dataframe | 易学教程

Identify start and end time of a value per id in a data frame

Question: This relates to my previous question on identifying the occurrence of a value in a data frame per id. This time I am trying to identify consecutive measurements per id with a length of 4 or more. For example, below is a consecutive occurrence of w with a length of 4:

id t1 t2 t3 t4 t5 t6
1  s  s  w  w  w  w

For the same id, here is a consecutive occurrence of w with a length of 4, followed by 4 non-w occurrences after the last w:

id t3 t4 t5 t6 t7 t8 t9 t10
1  w  w  w  w  r  s  s  s

I would like to
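The excerpt stops before the desired output is stated, so the following is only a sketch of one way to locate such runs in pandas, assuming a wide table with an `id` column and one column per time point. The helper name `find_runs` and the output columns `start`, `end`, and `length` are hypothetical.

```python
from itertools import groupby

import pandas as pd


def find_runs(df, value="w", min_len=4, id_col="id"):
    """Return one row per run of `value` spanning at least `min_len` columns."""
    time_cols = [c for c in df.columns if c != id_col]
    records = []
    for _, row in df.iterrows():
        pos = 0  # position of the current run's first column
        # groupby() collapses consecutive equal values into runs
        for key, group in groupby(row[time_cols].tolist()):
            length = len(list(group))
            if key == value and length >= min_len:
                records.append({
                    id_col: row[id_col],
                    "start": time_cols[pos],
                    "end": time_cols[pos + length - 1],
                    "length": length,
                })
            pos += length
    return pd.DataFrame(records, columns=[id_col, "start", "end", "length"])


# First example table from the question
wide = pd.DataFrame(
    [[1, "s", "s", "w", "w", "w", "w"]],
    columns=["id", "t1", "t2", "t3", "t4", "t5", "t6"],
)
runs = find_runs(wide)
print(runs)
```

Each qualifying run becomes one row holding the first and last column in which the value appears.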

Get average by months of a time series (all Januaries, all Februaries, etc)

Question: I have a time series of daily data from 1992 to 2018. So far I have converted it to monthly data, but I also need to obtain anomalies per month, so I need the average of each month over all years, ending up with 12 averages: one for each month, computed from the individual monthly averages of each year. I have done the following using Pandas:

df = pd.read_excel(filename, "Daily", index_col=0)
df = df.resample("M").mean()

I have been trying to find out how to obtain now the average of each month every the
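A minimal sketch of the 12 calendar-month averages and the resulting anomalies, assuming the frame has a `DatetimeIndex`. The synthetic data and the column name `value` are placeholders for the Excel sheet in the question. The monthly step groups by monthly period rather than calling `resample("M")`, so it does not depend on the resample alias.

```python
import pandas as pd

# Synthetic daily data standing in for the Excel sheet (placeholder values)
idx = pd.date_range("1992-01-01", "2018-12-31", freq="D")
values = (idx.year - 1992 + 100 * idx.month).to_numpy(dtype=float)
daily = pd.DataFrame({"value": values}, index=idx)

# Step 1: one mean per (year, month)
monthly = daily.groupby(daily.index.to_period("M")).mean()

# Step 2: average the monthly means over all years -> 12 rows
climatology = monthly.groupby(monthly.index.month).mean()
climatology.index.name = "month"

# Step 3: anomaly = monthly mean minus the average for that calendar month
month_mean = monthly.groupby(monthly.index.month)["value"].transform("mean")
anomalies = monthly["value"] - month_mean

print(climatology)
```

Grouping on `monthly.index.month` is what pools all Januaries, all Februaries, and so on.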

How to create (correctly) a NumPy array from Pandas DF

Question: I'm trying to create a NumPy array for the "label" column from a pandas data-frame. My df:

       label  vector
0          0  1:0.044509422 2:-0.03092437 3:0.054365806 4:-...
1          0  1:-0.007471546 2:-0.062329583 3:0.012314787 4...
2          0  1:-0.009525825 2:0.0028720177 3:0.0029517233 ...
3          1  1:-0.0040618754 2:-0.03754585 3:0.008025528 4...
4          0  1:0.039150625 2:-0.08689039 3:0.09603256 4:0....
...      ...  ...
59996      1  1:0.01846487 2:-0.012882819 3:0.035375785 4:-...
59997      1  1:0.01435293 2:-0.00683616 3:0.009475072 4:-0...
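For reference, a single column can be pulled out as a one-dimensional NumPy array with `Series.to_numpy()`. The small frame below is a stand-in for the one in the question, and the `vector` strings are shortened placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame; vector strings are placeholders
df = pd.DataFrame({
    "label": [0, 0, 0, 1, 0],
    "vector": ["1:0.04 2:-0.03", "1:-0.01 2:-0.06", "1:-0.01 2:0.00",
               "1:-0.00 2:-0.04", "1:0.04 2:-0.09"],
})

# Select the column first, then convert: the result is 1-D
labels = df["label"].to_numpy()

print(labels.shape, labels.dtype)
```

Selecting with `df[["label"]]` (double brackets) would instead give a two-dimensional array of shape `(n, 1)`.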

problem with pandas efficiency when working with dates

Question: I have a piece of code that runs but does not scale well with bigger datasets AT ALL. We are talking about minutes with big datasets. Here is a toy dataset to illustrate the issue:

     Id      Supplier  Avg_NetAmountSpent  Date        Quantity  NetAmount
0    185781  SAXON     2953.500000         2020-05-10  401       9294
1    185781  SAXON     2953.500000         2020-05-09  3502      8890
2    185781  SAXON     2953.500000         2020-05-08  7380      8381
3    185781  SAXON     2953.500000         2020-05-08  3384      1734
4    185781  SAXON     2953.500000         2020-05-08  4826      4910
612  467809  SAXONIS

Secondary axis in plotly for R and Shiny

Question: EDIT: Regarding my question 2, it seems it is a bug that hasn't been fixed yet, as it is not their top priority at the moment. Someone suggested trying katex instead of latex, but I am not sure how that works: https://github.com/plotly/plotly.js/issues/559 I have attached the output of my code: https://i.stack.imgur.com/u65if.jpg. I am trying to plot two y axes and a common x axis using plotly. The issues I am facing are: I would like the primary and the secondary y axis ticks to share the same gridline.

Summing rows based on keyword within index

Question: I am trying to sum multiple rows together based on a keyword that is part of the index, but is not the entire index. For example, the index could look like:

                    Count
1234_Banana_Green   43
4321_Banana_Yellow  34
2244_Banana_Brown   23
12345_Apple_Red     45

I would like to sum all of the rows that have the same "keyword" within them and create a total "banana" row. Is there a way to do this without searching for the keyword "banana"? For my purposes, this keyword changes every time and I would like to
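One possible approach, sketched under the assumption that the keyword is always the second underscore-separated token of the index label (the excerpt does not confirm this): pass a function to `groupby`, which pandas calls on every index value, so no keyword has to be named in advance.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    {"Count": [43, 34, 23, 45]},
    index=["1234_Banana_Green", "4321_Banana_Yellow",
           "2244_Banana_Brown", "12345_Apple_Red"],
)

# The function receives each index label and returns its group key.
# Assumption: the keyword is the middle token, e.g. "Banana".
totals = df.groupby(lambda label: label.split("_")[1])["Count"].sum()

print(totals)
```

Every distinct keyword found in the index gets its own total, so the code does not change when the keywords do.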

Pandas DataFrame merge, ends up with more rows

Question: I am doing a_df = a_df.merge(b_df, how='left', on=['col1', 'col2']) After this, a_df actually has more rows than before the operation. How is this possible? They both have millions of rows, so it's hard for me to narrow down the problem. Probably I am missing something about how left merge works.

Answer 1: The problem is duplicates: when a key is repeated, a left join returns every combination of the matching rows from both DataFrames. Check the sample below: a_df = pd.DataFrame({'A':list('abcdef'), 'B':[4,5,4,5,5
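The answer's own sample is cut off, so here is a separate minimal illustration of the same effect with made-up data. The value columns `a_val` and `b_val` are hypothetical. It also shows two ways to detect the situation on large frames: counting duplicated keys, and the `validate` argument of `merge`.

```python
import pandas as pd

keys = ["col1", "col2"]
a_df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["x", "x", "y"],
                     "a_val": [10, 20, 30]})
b_df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["x", "x", "y"],
                     "b_val": ["p", "q", "r"]})

# Key (1, "x") occurs twice on each side: 2 x 2 = 4 rows, plus 1 for (2, "y")
merged = a_df.merge(b_df, how="left", on=keys)

# Count right-hand rows whose key was already seen
n_dup_keys = int(b_df.duplicated(subset=keys).sum())

# validate="many_to_one" makes pandas raise if right-hand keys are not unique
raised = False
try:
    a_df.merge(b_df, how="left", on=keys, validate="many_to_one")
except pd.errors.MergeError:
    raised = True

# Keeping one right-hand row per key restores the original row count
deduped = a_df.merge(b_df.drop_duplicates(subset=keys), how="left", on=keys)

print(len(a_df), len(merged), len(deduped), n_dup_keys, raised)
```

Whether dropping duplicates is the right fix depends on the data; aggregating the right-hand frame per key first is another option.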

Removing Empty Dataframes with pandas

Question: I have written the following code to use regex to request pages and look for strings that resemble interest rates. The overall code works; however, it is creating multiple empty dataframes and I can't get the code to drop the empty frames to clean up my output. I have been trying to use .dropna, .drop, and .empty to discard the dataframes, but the output remains unchanged and keeps printing the empty dataframes along with the information I already have. Is there a method I am not aware
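The question's code is not included in the excerpt, so the following only sketches the general idea, assuming the scraped results are collected in a list (the list name `frames` and the `rate` column are hypothetical). `.dropna` and `.drop` act on rows or columns inside a frame; to skip whole frames, test each frame's `.empty` attribute.

```python
import pandas as pd

# Hypothetical scrape results: some pages yield no matches
frames = [
    pd.DataFrame(),                        # no rows, no columns
    pd.DataFrame({"rate": [3.25]}),
    pd.DataFrame(columns=["rate"]),        # columns but no rows
    pd.DataFrame({"rate": [4.10, 4.50]}),
]

# .empty is True when a frame has no rows or no columns
non_empty = [frame for frame in frames if not frame.empty]

for frame in non_empty:
    print(frame)
```

The same `if not frame.empty` check can guard a `print` call inside the scraping loop itself.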

Getting descriptive statistics with (analytic) weighting using describe() in python

Question: I was trying to translate code from Stata to Python. The original code in Stata:

by year, sort : summarize age [aweight = wt]

Normally a simple describe() call will do:

dataframe.groupby("year")["age"].describe()

But I could not find a way to translate the aweight option into Python, i.e. to get descriptive statistics of a dataset under analytic/variance weighting. Code to generate the dataset in Python: dataframe = {'year': [2016,2016,2020, 2020], 'age': [41,65, 35,28],
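A rough sketch of a weighted per-group summary, not a verified replica of Stata's output. The `wt` values are invented because the excerpt is cut off before them, and `weighted_describe` is a hypothetical helper. The standard deviation rescales the weights to sum to the group size and divides by n - 1, which is an assumption about how analytic weights are treated; compare against Stata before relying on it.

```python
import numpy as np
import pandas as pd

# 'year' and 'age' come from the question; the 'wt' values are made up
dataframe = pd.DataFrame({
    "year": [2016, 2016, 2020, 2020],
    "age": [41, 65, 35, 28],
    "wt": [1.0, 3.0, 2.0, 2.0],
})


def weighted_describe(group, value="age", weight="wt"):
    """Weighted count/mean/std/min/max for one group (hypothetical helper)."""
    x = group[value].to_numpy(dtype=float)
    w = group[weight].to_numpy(dtype=float)
    n = len(x)
    mean = np.average(x, weights=w)
    # Assumption: weights rescaled to sum to n, variance divided by n - 1
    w_norm = w * n / w.sum()
    var = (w_norm * (x - mean) ** 2).sum() / (n - 1) if n > 1 else np.nan
    return pd.Series({"count": n, "mean": mean, "std": np.sqrt(var),
                      "min": x.min(), "max": x.max()})


# One summary row per year
summary = pd.DataFrame(
    {year: weighted_describe(g) for year, g in dataframe.groupby("year")}
).T
summary.index.name = "year"

print(summary)
```

With equal weights the mean and standard deviation reduce to the unweighted values that `describe()` reports.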

(in R) Add metadata from a vector to a set of columns of a dataframe?

Question: I would like to use values from a character vector that I created as label attributes for a set of variables in a dataframe. I thought this simple solution should work, yet it does not:

x <- rep("text", time=19) %>% paste(1:19, sep = " ") #character vector with names of label attributes I want
attr(mydataframe[var_names], "label") <- x #var_names and x have the same length

Thanks for your help!

Answer 1: Hmisc supports column labels. Using the built-in data frame anscombe, which has 8 columns: library