dataframe

Pandas get second minimum value from datetime column [duplicate]

a 夏天 submitted on 2021-01-28 08:08:00
Question: This question already has answers here: How to extract the n-th maximum/minimum value in a column of a DataFrame in pandas? (2 answers). Closed 1 year ago.

I have a data frame with a DateTime column. I can get the minimum value by using df['Date'].min(). How can I get the second, third... smallest values?

Answer 1: Use nlargest or nsmallest. For the second largest: series.nlargest(2).iloc[-1]

Answer 2: Make sure your dates are in datetime first: df['Sampled_Date'] = pd.to_datetime(df['Sampled_Date']) Then drop
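A minimal sketch of the nsmallest approach from Answer 1, applied to a hypothetical datetime column (the column name and example data are assumptions, not from the thread):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"Date": pd.to_datetime(
    ["2021-03-01", "2021-01-15", "2021-02-20", "2021-01-02"])})

# nsmallest(n) returns the n smallest values in ascending order,
# so the last element is the n-th smallest.
second_smallest = df["Date"].nsmallest(2).iloc[-1]
print(second_smallest)  # 2021-01-15 00:00:00
```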

pyspark dataframe with json column: aggregate the json elements into a new column and remove duplicates

血红的双手。 submitted on 2021-01-28 08:02:36
Question: I am trying to read a pyspark dataframe with a JSON column on Databricks. The dataframe:

    year  month  json_col
    2010  09     [{"p_id":"vfdvtbe"}, {"p_id":"cdscs"}, {"p_id":"usdvwq"}]
    2010  09     [{"p_id":"ujhbe"}, {"p_id":"cdscs"}, {"p_id":"yjev"}]
    2007  10     [{"p_id":"ukerge"}, {"p_id":"ikrtw"}, {"p_id":"ikwca"}]
    2007  10     [{"p_id":"unvwq"}, {"p_id":"cqwcq"}, {"p_id":"ikwca"}]

I need a new dataframe in which all duplicated "p_id" are removed and the values are aggregated by year and month:

    year  month  p_id (string)
    2010  09     [
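The thread's answer is cut off; a sketch of one way to do this with from_json, explode, and collect_set (the schema and variable names are assumptions, not the thread's solution):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

data = [
    (2010, "09", '[{"p_id":"vfdvtbe"}, {"p_id":"cdscs"}, {"p_id":"usdvwq"}]'),
    (2010, "09", '[{"p_id":"ujhbe"}, {"p_id":"cdscs"}, {"p_id":"yjev"}]'),
    (2007, "10", '[{"p_id":"ukerge"}, {"p_id":"ikrtw"}, {"p_id":"ikwca"}]'),
    (2007, "10", '[{"p_id":"unvwq"}, {"p_id":"cqwcq"}, {"p_id":"ikwca"}]'),
]
df = spark.createDataFrame(data, ["year", "month", "json_col"])

# Parse the JSON string into an array of structs, explode it into one
# row per element, then collect the distinct p_id values per (year, month).
schema = ArrayType(StructType([StructField("p_id", StringType())]))
result = (df
    .withColumn("item", F.explode(F.from_json("json_col", schema)))
    .groupBy("year", "month")
    .agg(F.collect_set("item.p_id").alias("p_id")))
result.show(truncate=False)
```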

how to convert every row into a column, with the value before the colon as the column name

▼魔方 西西 submitted on 2021-01-28 07:50:42
Question: I am reading a file called kids_cvc with the header=None option. Every row starts with a short alphabetic prefix followed by a colon, such as ab: or ad:, and I want each entire row to become a column, with the prefix that starts the line (e.g. ab:) used as the column name. Below is my dataframe:

    >>> df = pd.read_csv("kids_cvc", error_bad_lines=False, header=None)
    b'Skipping line 2: expected 13 fields, saw 14\nSkipping line 5: expected 13 fields, saw 14\nSkipping line 6: expected 13 fields, saw
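The file's actual contents aren't shown in the excerpt; a sketch under the assumption that each line looks like "ab: cab,dab,jab" — split on the colon, then transpose so the prefixes become column names (the file contents here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical file contents: each row starts with a "name:" prefix.
raw = io.StringIO("ab: cab,dab,jab\nad: bad,dad,had\n")

df = pd.read_csv(raw, header=None, sep=":", names=["name", "values"])

# Split the comma-separated values, then transpose so each
# row's prefix becomes a column header (columns "ab" and "ad").
wide = (df.set_index("name")["values"]
          .str.strip()
          .str.split(",", expand=True)
          .T)
print(wide)
```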

Is it possible to use Pandas Overlap in a Dataframe?

谁说胖子不能爱 submitted on 2021-01-28 07:38:07
Question: Python 3.7, Pandas 0.25. I have a Pandas DataFrame with columns for startdate and enddate. I am looking for rows whose ranges overlap the range of my variable(s). Rather than composing a verbose series of greater-than/less-than comparisons joined with ands and ors to filter out the rows I need, I would like to use some sort of interval "overlap". It appears Pandas has this functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.overlaps.html

The following test works: range1
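The excerpt cuts off before the working test; a sketch of one way to vectorize the overlap check with pd.IntervalIndex (the column names and dates are assumptions):

```python
import pandas as pd

# Hypothetical data with start/end date columns
df = pd.DataFrame({
    "startdate": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-06-01"]),
    "enddate":   pd.to_datetime(["2020-02-15", "2020-04-30", "2020-07-31"]),
})

# The query range to test against
query = pd.Interval(pd.Timestamp("2020-02-01"), pd.Timestamp("2020-03-15"),
                    closed="both")

# Build an IntervalIndex from the two columns, then test every row
# for overlap with the query in one vectorized call.
intervals = pd.IntervalIndex.from_arrays(df["startdate"], df["enddate"],
                                         closed="both")
overlapping = df[intervals.overlaps(query)]
print(overlapping)  # the first two rows overlap the query range
```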

Selecting by both rows and columns in a symmetrical matrix in R

落花浮王杯 submitted on 2021-01-28 07:31:10
Question: I have a symmetric dataframe and would like to select a subset of the data to use for analysis. This means selecting both the desired rows and columns, and maintaining the right order, so that the new dataframe is still symmetric. With example data:

    # Example data
    Sample <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
    Sample_A <- c(0, 3.16, 1, 1.41, 3)
    Sample_B <- c(3.16, 0, 3, 2.83, 1)
    Sample_C <- c(1, 3, 0, 1, 2.83)
    Sample_D <- c(1.41, 2.83, 1, 0, 2.65)
    Sample_E <- c(3,
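The same row-and-column selection expressed in pandas rather than the thread's R (a cross-language sketch; Sample_E's truncated row is completed by symmetry from the other rows):

```python
import pandas as pd

samples = ["Sample_A", "Sample_B", "Sample_C", "Sample_D", "Sample_E"]
df = pd.DataFrame(
    [[0, 3.16, 1, 1.41, 3],
     [3.16, 0, 3, 2.83, 1],
     [1, 3, 0, 1, 2.83],
     [1.41, 2.83, 1, 0, 2.65],
     [3, 1, 2.83, 2.65, 0]],   # Sample_E row filled in by symmetry
    index=samples, columns=samples)

# Using the same label list for both rows and columns keeps the
# subset symmetric and in the same order.
keep = ["Sample_A", "Sample_C", "Sample_E"]
subset = df.loc[keep, keep]
print(subset)
```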

Subtraction of pandas dataframes

烂漫一生 submitted on 2021-01-28 07:22:34
Question: I am trying to subtract two pandas dataframes from each other, but I get only NaN results.

Dataframe 1:

       alpha  beta
    0      1     4
    1      2     5
    2      3     6

Dataframe 2:

       gamma
    0      7
    1      8
    2      9

Dataframe operation: df3 = df1 - df2

Result:

       alpha  beta  gamma
    0    NaN   NaN   NaN
    1    NaN   NaN   NaN
    2    NaN   NaN   NaN

However, if I convert everything to numpy matrices, it works:

    matrix3 = df1.as_matrix(['alpha','beta']) - df2.as_matrix(['gamma'])

Result:

    [[-6 -3]
     [-6 -3]
     [-6 -3]]

How can I make this work with pandas? Answer 1: Either of
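Answer 1 is cut off. The NaNs come from label alignment: df1 - df2 matches columns by name, and alpha/beta never match gamma. A sketch of one common fix, subtracting the gamma column as a Series along the index (note that as_matrix was removed in modern pandas; to_numpy is the current spelling):

```python
import pandas as pd

df1 = pd.DataFrame({"alpha": [1, 2, 3], "beta": [4, 5, 6]})
df2 = pd.DataFrame({"gamma": [7, 8, 9]})

# axis=0 subtracts the Series row-by-row from every column of df1,
# so no column-label alignment is attempted.
df3 = df1.sub(df2["gamma"], axis=0)
print(df3)
#    alpha  beta
# 0     -6    -3
# 1     -6    -3
# 2     -6    -3
```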

Pandas Timedelta to add decimal hours to existing timestamp

和自甴很熟 submitted on 2021-01-28 07:09:49
Question: Aim: I would like to use Timedelta to add hours, in decimal format, to an existing timestamp. My current code is giving me an issue, probably because I don't know how to avoid creating a list (I've been struggling for a while with how to address things). Heh. I have a dataframe named 'df' that looks roughly like the following:

    +---------------------+----------+
    | Time                | AddHours |
    +---------------------+----------+
    | 2019-11-13 09:30:00 | 3.177481 |
    | 2019-11-13 09:30:00 | 2.752435 |
    | 2019-11-13 09:30
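A minimal sketch of the vectorized approach with pd.to_timedelta, which converts the whole decimal-hours column at once so no list or loop is needed (the result column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["2019-11-13 09:30:00", "2019-11-13 09:30:00"]),
    "AddHours": [3.177481, 2.752435],
})

# unit="h" interprets the floats as fractional hours.
df["EndTime"] = df["Time"] + pd.to_timedelta(df["AddHours"], unit="h")
print(df)
```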

R - find first, second and third largest values by row

这一生的挚爱 submitted on 2021-01-28 07:00:29
Question: I have some data containing numeric columns:

    df <- data.frame(v1 = c(0,1,2,3,4,5,6,7,8),
                     v2 = c(5,6,3,21,24,7,8,9,6),
                     v3 = c(23,5,24,87,6,32,5,48,6),
                     v4 = c(2,32,6,58,5,21,4,5,87),
                     v5 = c(5,23,65,86,4,12,115,5,24))

I need to create three new columns containing the first, second and third largest value per row. So the desired output would be this:

      v1 v2 v3 v4 v5 first second third
    1  0  5 23  2  4    23      5     4
    2  1  6  5 32 23    32     23     6
    3  2  3 24  6 65    65     24     6
    4  3 21 87 58 86    87     86    58
    5  4 24  6  5  4    24      6     5
    6  5  7
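The thread's answer is truncated; here is the same per-row top-3 computation sketched in pandas rather than R (a cross-language sketch, not the thread's solution):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "v1": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "v2": [5, 6, 3, 21, 24, 7, 8, 9, 6],
    "v3": [23, 5, 24, 87, 6, 32, 5, 48, 6],
    "v4": [2, 32, 6, 58, 5, 21, 4, 5, 87],
    "v5": [5, 23, 65, 86, 4, 12, 115, 5, 24],
})

# Sort each row in descending order with numpy, keep the first three.
top3 = np.sort(df.to_numpy(), axis=1)[:, ::-1][:, :3]
df[["first", "second", "third"]] = top3
print(df)
```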

Spark: create a nested schema

隐身守侯 submitted on 2021-01-28 06:50:41
Question: With Spark:

    import spark.implicits._
    val data = Seq(
      (1, ("value11", "value12")),
      (2, ("value21", "value22")),
      (3, ("value31", "value32"))
    )
    val df = data.toDF("id", "v1")
    df.printSchema()

The result is the following:

    root
     |-- id: integer (nullable = false)
     |-- v1: struct (nullable = true)
     |    |-- _1: string (nullable = true)
     |    |-- _2: string (nullable = true)

Now, if I want to create the schema myself, how should I proceed?

    val schema = StructType(Array(
      StructField("id", IntegerType),
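The Scala snippet is cut off at the nested part; a sketch of the equivalent nested schema in pyspark (the thread is Scala, so this is a cross-language illustration):

```python
from pyspark.sql.types import (IntegerType, StringType, StructField,
                               StructType)

# v1 is itself a struct with two string fields, mirroring printSchema() above.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("v1", StructType([
        StructField("_1", StringType(), nullable=True),
        StructField("_2", StringType(), nullable=True),
    ]), nullable=True),
])
```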

Pandas replace values with NaN at random

蓝咒 submitted on 2021-01-28 06:24:59
Question: I am testing the performance of a machine learning algorithm, specifically how it handles missing data and what kind of performance degradation occurs when variables are missing. For example, when 20% of variable x is missing, the accuracy of the model goes down by a certain percentage. To do this, I would like to simulate the missing data by replacing values in 20% of the rows of a dataframe column. Is there an existing way to do this?

Starting df:

    d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8]}
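The excerpt stops at the starting dict; a minimal sketch of one way to blank out a random 20% of a column with DataFrame.sample (a larger hypothetical frame is used so the fraction is visible):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": range(100), "var2": range(100, 200)})

# sample(frac=0.2) picks an exact 20% of the rows at random;
# setting those positions to NaN simulates the missing data.
missing_idx = df.sample(frac=0.2, random_state=0).index
df.loc[missing_idx, "var1"] = np.nan
print(df["var1"].isna().mean())  # 0.2
```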