subset

Subsetting data frame by factor level

百般思念 submitted on 2019-12-22 08:34:29
Question: I have a big data frame with state names in one column and different indexes in the other columns. I want to subset by state and create an object suitable for minimization of the index, or a data frame with the calculation already done. Here's one simple (short) example of what I have:

    m
      x   y
    1 A 1.0
    2 A 2.0
    3 A 1.5
    4 B 3.0
    5 B 3.5
    6 C 7.0

I want to get this:

    m
      x   y
    1 A 1.0
    2 B 3.0
    3 C 7.0

I don't know if a function with a for loop is necessary. Like minimize<-function(x,...) for (i in m$x){ do
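
The grouping asked for here (smallest y per level of x) can be sketched without an explicit per-level loop. The block below is a minimal illustration in Python rather than R, and the helper name `min_by_group` is made up for this example. In R itself, `aggregate(y ~ x, data = m, FUN = min)` is one common way to get the same table.

```python
# Rows mirror the example data frame from the question as (x, y) pairs.
rows = [("A", 1.0), ("A", 2.0), ("A", 1.5), ("B", 3.0), ("B", 3.5), ("C", 7.0)]


def min_by_group(pairs):
    """Return {group: smallest value}, keeping groups in first-seen order."""
    result = {}
    for group, value in pairs:
        # Keep the value if the group is new or the value beats the current minimum.
        if group not in result or value < result[group]:
            result[group] = value
    return result


minima = min_by_group(rows)
print(minima)  # {'A': 1.0, 'B': 3.0, 'C': 7.0}
```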

How to slice a dataframe by selecting a range of columns and rows based on names and not indexes?

守給你的承諾、 submitted on 2019-12-22 05:29:21
Question: This is a follow-up to the question I asked here. There I learned a) how to do this for columns (see below) and b) that the selection of rows and columns seems to be handled quite differently in R, which means that I cannot use the same approach for rows. So suppose I have a pandas dataframe like this:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randint(10, size=(6, 6)), columns=['c' + str(i) for i in range(6)], index=["r" + str(i) for i in range(6)])

    c0 c1 c2
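
Assuming the goal is a contiguous block of named rows and columns, one likely approach is label-based slicing with `.loc`. The sketch below rebuilds the data frame from the question; the particular labels chosen (`r1` to `r3`, `c2` to `c4`) are only an illustration.

```python
import numpy as np
import pandas as pd

# Same construction as in the question (values are random, so they vary per run).
df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    columns=["c" + str(i) for i in range(6)],
    index=["r" + str(i) for i in range(6)],
)

# Label slices in .loc include BOTH endpoints, unlike positional slices.
block = df.loc["r1":"r3", "c2":"c4"]
print(block)
```

Because the slice is by label, the same expression keeps working if rows or columns are reordered, as long as the start label still comes before the end label.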

How to pick the T1 and T2 threshold values for Canopy Clustering?

谁都会走 submitted on 2019-12-22 04:27:12
Question: I am trying to implement the Canopy clustering algorithm along with K-Means. From searching online, the advice is to use Canopy clustering to get the initial starting points to feed into K-means. The problem is that Canopy clustering requires two threshold values for the canopy, T1 and T2, where points within the inner threshold are strongly tied to that canopy and points within the wider threshold are less tied to it. How are these thresholds, or distances from the canopy
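
To make the role of the two thresholds concrete, here is a minimal sketch of canopy formation. It does not say how to choose the values. It assumes the common convention that T1 is the loose (outer) distance and T2 the tight (inner) one, so T1 > T2, and it assumes Euclidean distance. The function name and the sample points are invented for the example.

```python
import math


def canopy(points, t1, t2):
    """Group points into canopies. Assumes t1 (loose) > t2 (tight)."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # any remaining point can serve as a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)  # within the loose threshold: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # outside the tight threshold: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (5.0, 5.0), (5.2, 5.1)]
result = canopy(points, t1=2.0, t2=1.0)
centers = [center for center, _ in result]
print(centers)  # [(0.0, 0.0), (1.5, 0.0), (5.0, 5.0)]
```

A larger T2 removes more points per round and so yields fewer canopies (fewer K-means seeds); a larger T1 makes each canopy wider, which increases overlap between neighboring canopies.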

Number of distinct sums from non-empty groupings of (possibly very large) lists

旧街凉风 submitted on 2019-12-22 01:23:33
Question: Assume that you are given a set of coin types (at most 20 distinct types), and of each type you have at most 10^5 instances, such that the total number of coins in your list is at most 10^6. What is the number of distinct sums you can make from non-empty groupings of these coins? For example, you are given the following lists:

    coins = [10, 50, 100]
    quantity = [1, 2, 1]

which means you have 1 coin of 10, 2 coins of 50, and 1 coin of 100. Now the output should be possibleSums(coins,
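
For small inputs such as the example, the count can be obtained by growing a set of reachable sums one coin type at a time. This is a straightforward sketch, not an approach tuned for the stated upper limits, and it assumes all coin values are positive so that the sum 0 only arises from the empty grouping. The snake_case name `possible_sums` is a stand-in for the `possibleSums` mentioned in the question.

```python
def possible_sums(coins, quantity):
    """Count distinct sums over non-empty groupings (simple set-based sketch)."""
    sums = {0}  # start with the empty grouping
    for value, count in zip(coins, quantity):
        new_sums = set(sums)
        for s in sums:
            # Add 1..count coins of this type to every sum reachable so far.
            for k in range(1, count + 1):
                new_sums.add(s + k * value)
        sums = new_sums
    return len(sums) - 1  # drop the empty grouping (sum 0)


print(possible_sums([10, 50, 100], [1, 2, 1]))  # 9
```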

R: How to apply moving averages to subset of columns in a data frame?

一曲冷凌霜 submitted on 2019-12-22 00:44:19
Question: I have a data frame (training.set) with 150 observations of 83 variables. I want to transform 82 of those columns with moving averages. The problem is that the result ends up being only 150 numeric values (i.e. 1 column). How would I apply the moving average function to each column individually and keep the 83rd column unchanged? I feel like this is super simple, but I can't find a solution. My current code:

    # apply moving average on training.set data to 82 of 83 columns
    library
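
The key idea is to apply the smoothing function to each column separately and pass the excluded column through untouched. The sketch below shows that pattern in plain Python on a tiny made-up table. The trailing window, the `None` padding, and the names `moving_average` and `smooth_columns` are assumptions for illustration, not the asker's code.

```python
def moving_average(values, window):
    """Trailing moving average; the first window-1 positions are None."""
    out = [None] * len(values)
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            out[i] = running / window
    return out


def smooth_columns(table, window, keep):
    """Smooth every column of a {name: list} table except those named in keep."""
    return {
        name: list(col) if name in keep else moving_average(col, window)
        for name, col in table.items()
    }


table = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "label": ["x", "y", "x", "y"]}
smoothed = smooth_columns(table, window=2, keep={"label"})
print(smoothed["a"])  # [None, 1.5, 2.5, 3.5]
```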

PySpark: Search For substrings in text and subset dataframe

落花浮王杯 submitted on 2019-12-21 22:00:30
Question: I am brand new to PySpark and want to translate my existing pandas / Python code to PySpark. I want to subset my dataframe so that only rows that contain the specific keywords I'm looking for in the 'original_problem' field are returned. Below is the Python code I tried in PySpark:

    def pilot_discrep(input_file):
        df = input_file
        searchfor = ['cat', 'dog', 'frog', 'fleece']
        df = df[df['original_problem'].str.contains('|'.join(searchfor))]
        return df

When I try to run the above, I get the following
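
The filter in this code rests on one regular expression built by joining the keywords with `|`. The plain-Python sketch below shows that matching logic in isolation, on made-up rows. It is not PySpark code. A PySpark `Column` has no pandas-style `.str` accessor, but its `rlike` method accepts a regular expression, so a pattern built this way is a plausible starting point there.

```python
import re

searchfor = ["cat", "dog", "frog", "fleece"]

# re.escape guards against keywords that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(word) for word in searchfor))

# Hypothetical rows standing in for the data frame.
rows = [
    {"original_problem": "the dog chewed a cable"},
    {"original_problem": "no issues found"},
    {"original_problem": "fleece lining torn"},
]

# Keep only the rows whose text contains at least one keyword.
matches = [row for row in rows if pattern.search(row["original_problem"])]
print([row["original_problem"] for row in matches])
```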

R: Deleting elements from a vector based on element length

我只是一个虾纸丫 submitted on 2019-12-21 21:29:39
Question: How can I delete elements from a vector of strings depending on the number of characters or length of the strings?

    df <- c("asdf","fweafewwf","af","","","aewfawefwef","awefWEfawefawef")
    > df
    [1] "asdf" "fweafewwf" "af" "" "" "aewfawefwef" "awefWEfawefawef"

For example, I may want to delete all elements of df with a length smaller than 5, so the output would be:

    > df
    [1] "fweafewwf" "aewfawefwef" "awefWEfawefawef"

Thanks!

Answer 1: Just use nchar:

    > df[nchar(df) > 5]
    [1] "fweafewwf" "aewfawefwef"
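
The same length filter can be sketched in Python with a list comprehension. One detail: the question asks to drop strings shorter than 5, which corresponds to keeping length >= 5, while the answer's `> 5` would also drop strings of exactly 5 characters. With this data there are none, so both give the same result. The variable names are chosen for the example.

```python
strings = ["asdf", "fweafewwf", "af", "", "", "aewfawefwef", "awefWEfawefawef"]
min_length = 5

# Keep only the strings that have at least min_length characters.
kept = [s for s in strings if len(s) >= min_length]
print(kept)  # ['fweafewwf', 'aewfawefwef', 'awefWEfawefawef']
```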

Subsref with cells

那年仲夏 submitted on 2019-12-21 17:53:41
Question: This issue appeared when I was answering this question. It is probably some silly mistake on my part, but I can't figure out what it is…

    myMatrix = [22 33; 44 55]

Returns:

    >> subsref(myMatrix, struct('type','()','subs',{{[1 2]}} ) );
    ans =
        22 44

While using it with cells:

    myCell = {2 3; 4 5}

Returns:

    >> subsref(myCell,struct('type','{}','subs',{{[1 2]}} ) );
    ans =
        2   % WHATTT?? Shouldn't this be 2 and 4 Matlab??

Checking the subsref documentation, we see: See how MATLAB calls subsref for the

R Subset data.frame from max value of one vector and grouped by another [duplicate]

这一生的挚爱 submitted on 2019-12-21 14:59:06
Question: This question already has answers here: How to select the row with the maximum value in each group (10 answers). Closed 2 years ago.

    > ID <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C')
    > WK <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5)
    > NumSuccess <- c(0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 3)
    > Data <- data.frame(ID, WK, NumSuccess)

I am trying to create a subset data.frame "Data2" based on the value in "NumSuccess" that corresponds to the max value in "WK", grouped by "ID". Resulting data.frame should
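
The rule described here (for each ID, keep the row with the largest WK) can be sketched as a single pass over the rows. The block is in Python rather than R and reuses the question's data. The function name `last_week_rows` is invented, and ties on WK are resolved in favor of the first row seen, which is an assumption the question does not address.

```python
ids = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"]
wk = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5]
num_success = [0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 3]


def last_week_rows(ids, weeks, values):
    """For each id, keep the (id, week, value) row with the largest week."""
    best = {}
    for i, w, v in zip(ids, weeks, values):
        # Strict comparison: on a tie, the first row seen is kept.
        if i not in best or w > best[i][0]:
            best[i] = (w, v)
    return [(i, w, v) for i, (w, v) in best.items()]


data2 = last_week_rows(ids, wk, num_success)
print(data2)  # [('A', 3, 2), ('B', 3, 1), ('C', 5, 3)]
```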

Remove the rows of data frame whose cells match a given vector

一笑奈何 submitted on 2019-12-21 06:16:47
Question: I have a big data frame with varying numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove the rows in which any cell matches one of those values. I'd like to have this as a function, because I have to run it on multiple data frames with different numbers of rows and columns, and I would like to avoid for loops. For example:

    ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data
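
One way to read the requirement is: drop every row in which at least one cell equals a value from the given vector. The Python sketch below follows that reading on data shaped like `ff`. The vector itself is cut off in the excerpt, so the values in `unwanted` are hypothetical, as is the function name.

```python
# Rows shaped like ff: j.1 = 1..13, j.2 = 2..14, j.3 = 3..15.
rows = [(i, i + 1, i + 2) for i in range(1, 14)]

unwanted = {3, 14}  # hypothetical values; the question's vector is not shown


def drop_matching_rows(rows, values):
    """Keep only the rows that share no cell value with `values`."""
    values = set(values)
    return [row for row in rows if values.isdisjoint(row)]


kept = drop_matching_rows(rows, unwanted)
print(len(kept), kept[0], kept[-1])  # 8 (4, 5, 6) (11, 12, 13)
```

The function works for any number of columns, since each row is tested as a whole against the set of unwanted values.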