dataframe

How can I iterate through two dataframes and assign new values based on compared columns?

Submitted by 筅森魡賤 on 2021-01-29 19:27:15
Question: I have two different dataframes, A and B. The Event column has similar data that I'm using to compare the two dataframes. I want to give dataframe A a new column, dfA.newContext#. In order to do this, I'll need to use the Event column: I want to iterate through dataframe A to find a match for Event and assign dfB.context# to dfA.newContext#. I think a loop would be the best way since I have a few conditions that I need to check. This might be asking a bit much, but I'm really stuck. I want to
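One possible approach, sketched below (not taken from the question): if both frames are pandas DataFrames, a keyed map usually replaces the explicit loop. Only the column names Event, context#, and newContext# come from the excerpt; the data and the assumption that Event values are unique in dataframe B are mine.

    import pandas as pd

    # Hypothetical data; only the column roles come from the question excerpt.
    dfA = pd.DataFrame({"Event": ["login", "logout", "click"]})
    dfB = pd.DataFrame({"Event": ["click", "login"], "context#": [101, 102]})

    # Look up each Event of dfA in dfB; rows with no match get NaN.
    dfA["newContext#"] = dfA["Event"].map(dfB.set_index("Event")["context#"])
    print(dfA)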

How to write a function in R that will implement the “best subsets” approach to model selection?

Submitted by 耗尽温柔 on 2021-01-29 19:03:47
Question: I need to write a function that takes a data frame as input. The columns are my explanatory variables, except for the last (right-most) column, which is the response variable. I'm trying to fit a linear model and track each model's adjusted R-squared as the criterion used to pick the best model. Each model will use the columns as explanatory variables, with the right-most column as the response. The function is supposed to create a tibble with a single
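The question asks for R (lm plus a tibble of results); the sketch below shows the same best-subsets loop in Python with statsmodels, purely to illustrate the structure: enumerate every subset of predictors, fit a model, and record the adjusted R-squared. The function name and data handling details are assumptions.

    import itertools
    import pandas as pd
    import statsmodels.api as sm

    def best_subsets(df):
        # Last (right-most) column is the response; all others are candidate predictors.
        y = df.iloc[:, -1]
        predictors = list(df.columns[:-1])
        rows = []
        for k in range(1, len(predictors) + 1):
            for subset in itertools.combinations(predictors, k):
                fit = sm.OLS(y, sm.add_constant(df[list(subset)])).fit()
                rows.append({"predictors": subset, "adj_r2": fit.rsquared_adj})
        # One row per fitted model, best adjusted R-squared first.
        return pd.DataFrame(rows).sort_values("adj_r2", ascending=False)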

Grouping rows with aggregate and a function in R

Submitted by [亡魂溺海] on 2021-01-29 19:01:31
Question: I am new to R and I wanted to aggregate the following matrix:

      k  n    m    s
    1 g 10 11.8  2.4
    2 g 20 15.3  3.2
    3 g 15  8.4  4.1
    4 r 14  3.0  5.0
    5 r 16  6.0  7.0
    6 r  5  8.0 15.0

Expected result:

      k        n        s       m
    1 g       15 3.233333 7.31667
    2 r 11.66667        9 4.16667

This was my attempt:

    k <- c("g", "g", "g", "r", "r", "r")
    n <- c(10, 20, 15, 14, 16, 5)
    m <- c(11.8, 15.3, 8.4, 3, 6, 8)
    s <- c(2.4, 3.2, 4.1, 5, 7, 15)
    data1 <- data.frame(k, n, m, s)
    data2 <- aggregate(m ~ k, FUN = function(t) ********* , data = data1)

I am more interested in m; here is
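A small pandas illustration of the grouping pattern (the question itself is about R's aggregate): group by k and apply a summary function to each remaining column. Note that the n and s values in the expected output are plain group means, while the expected m values are not, and the truncated excerpt doesn't say which function is wanted for m, so this sketch only shows plain means.

    import pandas as pd

    # Same toy data as in the question, rebuilt in pandas purely to show the grouping pattern.
    data1 = pd.DataFrame({
        "k": ["g", "g", "g", "r", "r", "r"],
        "n": [10, 20, 15, 14, 16, 5],
        "m": [11.8, 15.3, 8.4, 3, 6, 8],
        "s": [2.4, 3.2, 4.1, 5, 7, 15],
    })

    # Group by k and take the mean of every other column.
    print(data1.groupby("k", as_index=False)[["n", "m", "s"]].mean())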

Replace column values based on a column in another dataframe

Submitted by こ雲淡風輕ζ on 2021-01-29 18:47:38
Question: I would like to replace some column values in a df based on a column in another data frame. This is the head of the first df:

    df1
    # A tibble: 253 x 2
           id sum_correct
        <int>       <dbl>
     1 866093          77
     2 866097          95
     3 866101          37
     4 866102          65
     5 866103          16
     6 866104          72
     7 866105          99
     8 866106          90
     9 866108          74
    10 866109          92

Some sum_correct values need to be replaced by the correct values in another df, using the id to trigger the replacement:

    df2
    # A tibble: 14 x 2
          id sum_correct
       <int>       <dbl>
     1 866103          61
     2 866124          79
     3 866152          85
     4
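A hedged pandas sketch of the keyed replacement (the question itself uses R tibbles): look each id up in the second table and keep the original value where no replacement exists. The ids and values below are a trimmed-down subset of the excerpt's data.

    import pandas as pd

    # Miniature stand-ins for df1 and df2, using a few ids from the excerpt.
    df1 = pd.DataFrame({"id": [866093, 866097, 866103], "sum_correct": [77, 95, 16]})
    df2 = pd.DataFrame({"id": [866103, 866124], "sum_correct": [61, 79]})

    # Where an id appears in df2, take its sum_correct; otherwise keep df1's existing value.
    replacement = df1["id"].map(df2.set_index("id")["sum_correct"])
    df1["sum_correct"] = replacement.fillna(df1["sum_correct"])
    print(df1)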

Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks

Submitted by 三世轮回 on 2021-01-29 18:10:15
Question: My question is related to my previous one, "How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks". I have worked out part of it and am now stuck on another problem. I have a small pyspark dataframe like:

    df1:
    +-----+------------+------------+------+
    |topic| termIndices| termWeights| terms|
    +-----+------------+------------+------+
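The excerpt cuts off before the actual requirement, so the sketch below only shows a generic pattern that often fits this situation on Databricks: broadcast-join the small dataframe onto the large one, then apply a user-defined function that reads columns from both. The data, the large dataframe, and the tag_text function are assumptions, not the questioner's code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical small topic table (loosely shaped like the excerpt's df1) and a made-up large table.
    small = spark.createDataFrame([(0, ["ai", "ml"]), (1, ["nlp", "text"])], ["topic", "terms"])
    large = spark.createDataFrame([(0, "doc a"), (1, "doc b"), (0, "doc c")], ["topic", "text"])

    # A UDF that combines a column from the large frame with a column joined in from the small one.
    @F.udf(returnType=StringType())
    def tag_text(text, terms):
        if terms is None:
            return text
        return text + " | " + ",".join(terms)

    # Broadcast the small dataframe so the join does not shuffle the large one.
    joined = large.join(F.broadcast(small), on="topic", how="left")
    joined.withColumn("tagged", tag_text("text", "terms")).show(truncate=False)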

How to create a “dynamic” column in R?

Submitted by 情到浓时终转凉″ on 2021-01-29 17:43:10
Question: I'm coding a portfolio analysis tool based on back-tests. In short, I want to add a column that starts at some value X (the initial capital plus the result of the first trade) and have the rest of the values updated from the % change of each trade, but I haven't sorted out a way to put that logic into code. The following code is a simplified example.

    profit <- c(10, 15, -5, -6, 20)
    change <- profit / 1000
    balance <- c(1010, 1025, 1020, 1014, 1036)
    data <- data.frame(profit, change,
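One way to express that logic, sketched in pandas rather than R and with the starting capital assumed to be 1000 (the excerpt divides profit by 1000): compound the running balance from the per-trade % changes. The balance numbers in the excerpt look hand-entered, so this only illustrates the cumulative-product pattern.

    import pandas as pd

    initial_capital = 1000  # assumed from the excerpt's profit / 1000
    profit = pd.Series([10, 15, -5, -6, 20])
    change = profit / initial_capital

    # Running balance driven by the % change of each trade, compounded trade by trade.
    balance = initial_capital * (1 + change).cumprod()
    print(balance.round(2))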

Extract p values and r values for all pairwise variables

Submitted by 孤街浪徒 on 2021-01-29 17:22:12
Question: I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a p value for each pair of variables. I'm somewhat close: I have a minimal working example and an idea of what the end product should look like, but I'm having some difficulty actually implementing it. If anyone could help, that would be much appreciated. Please note, I would like to do this more manually than by using packages like Hmisc, as that has created
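A small "manual" sketch of the pairwise loop, in Python with SciPy rather than R: iterate over every pair of variable columns, compute Pearson's r and its p value, and collect them into a dataframe. The variable names and numbers are invented, and the per-country/per-year grouping from the question is omitted.

    import itertools
    import pandas as pd
    from scipy import stats

    # Made-up numeric variables; the real data also has country and year columns.
    df = pd.DataFrame({"gdp": [1.0, 2.1, 2.9, 4.2],
                       "co2": [0.5, 1.1, 1.4, 2.3],
                       "pop": [3.0, 3.1, 3.3, 3.2]})

    rows = []
    for a, b in itertools.combinations(df.columns, 2):
        r, p = stats.pearsonr(df[a], df[b])  # correlation coefficient and its p value
        rows.append({"var1": a, "var2": b, "r_squared": r ** 2, "p_value": p})

    print(pd.DataFrame(rows))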

Calculating price change & cumulative percentage change in price based on conditions on another column

Submitted by 守給你的承諾、 on 2021-01-29 17:00:34
Question: The background of the problem is that I am trying to backtest a trading strategy and evaluate my portfolio performance over time. I am using a Pandas DataFrame to manipulate the data. I've generated dummy data using

    data = {'position': [1, 0, 0, 0, -1, 0, 0, 1, 0, 0],
            'close': [10, 25, 30, 25, 22, 20, 21, 16, 11, 20],
            'close_position': [10, 25, 30, 25, 22, 22, 22, 16, 11, 20]}
    df = pd.DataFrame(data = data)

The output df would be

    +-------+----------+------------+----------------+
    | index | position | close | close_position
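The excerpt stops before the exact rule, so the following is only one common pattern for this kind of backtest, using the dummy data from the question: the per-row % change of the held price series (close_position), then the cumulative compounded change. The column names pct_change and cum_pct_change are mine.

    import pandas as pd

    # Dummy data copied from the question excerpt.
    data = {'position': [1, 0, 0, 0, -1, 0, 0, 1, 0, 0],
            'close': [10, 25, 30, 25, 22, 20, 21, 16, 11, 20],
            'close_position': [10, 25, 30, 25, 22, 22, 22, 16, 11, 20]}
    df = pd.DataFrame(data)

    # % change of the held price from one row to the next, then its compounded running total.
    df['pct_change'] = df['close_position'].pct_change().fillna(0)
    df['cum_pct_change'] = (1 + df['pct_change']).cumprod() - 1
    print(df)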

How to update dataframe cells using function return values

Submitted by 南笙酒味 on 2021-01-29 16:01:56
Question: I have the following dataframe called df1:

       country  ticker   price
    0  US       MSFT    105.32
    1  US       AAPL
    2  GERMANY  NSU.DE   10.42
    3  SG       D05.SI
    4  AUS      WOW.AX

I have a function called price_get that looks like this:

    def price_get(ticker):
        price = somefunction
        return price

The function has to go online to look up the value, so it takes a few seconds to run each time. I want to use this function only on the cells which don't have a price in them (the price cells are empty). So the function would only be used on rows 1
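A hedged sketch of the "only fill the blanks" step: mask the rows whose price is missing and apply the lookup just to those tickers. price_get is stubbed out with a constant here because the real implementation goes online; the rest mirrors the excerpt's data.

    import numpy as np
    import pandas as pd

    def price_get(ticker):
        # Stand-in for the real online lookup in the question.
        return 42.0

    df1 = pd.DataFrame({'country': ['US', 'US', 'GERMANY', 'SG', 'AUS'],
                        'ticker': ['MSFT', 'AAPL', 'NSU.DE', 'D05.SI', 'WOW.AX'],
                        'price': [105.32, np.nan, 10.42, np.nan, np.nan]})

    # Call the slow function only where price is missing; filled rows are left untouched.
    missing = df1['price'].isna()
    df1.loc[missing, 'price'] = df1.loc[missing, 'ticker'].apply(price_get)
    print(df1)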

Grouping people in a pandas dataframe with a customized function

Submitted by 最后都变了- on 2021-01-29 15:57:15
Question: Introduction: I have a pandas dataframe with people who live in different locations (latitude, longitude, floor number). I want to cluster the people into groups of 3, which means that at the end of this process every person is assigned to one particular group. My dataframe's length is a multiple of 9 (e.g. 18 people). The tricky part is that people in the same group are not allowed to have the same location in terms of latitude and longitude. What is going wrong? After I apply my function to the
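The excerpt ends before the questioner's function, so there is nothing to debug here; the sketch below is just one heuristic for the stated constraint, not their code: sort people by location and deal them round-robin into groups, so identical (lat, lon) pairs end up in different groups as long as no location repeats more often than there are groups. All data is invented.

    import pandas as pd

    # Invented data: 6 people to be split into 2 groups of 3; two pairs share a location.
    df = pd.DataFrame({'person': list('ABCDEF'),
                       'lat':   [1.0, 1.0, 2.0, 2.0, 3.0, 4.0],
                       'lon':   [5.0, 5.0, 6.0, 6.0, 7.0, 8.0],
                       'floor': [1, 2, 1, 3, 2, 1]})

    n_groups = len(df) // 3

    # Sort by location, then deal round-robin so duplicate locations land in different groups.
    df = df.sort_values(['lat', 'lon']).reset_index(drop=True)
    df['group'] = df.index % n_groups
    print(df.sort_values('group'))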