subset | 易学教程

Subsetting data frame by factor level

Question: I have a big data frame with state names in one column and different indexes in the other columns. I want to subset by state and create an object suitable for minimization of the index, or a data frame with the calculation already given. Here's one simple (short) example of what I have:

    m
      x   y
    1 A 1.0
    2 A 2.0
    3 A 1.5
    4 B 3.0
    5 B 3.5
    6 C 7.0

I want to get this:

    m
      x   y
    1 A 1.0
    2 B 3.0
    3 C 7.0

I don't know if a function with a for loop is necessary. Like minimize<-function(x,...) for (i in m$x){ do

How to slice a dataframe by selecting a range of columns and rows based on names and not indexes?

Question: This is a follow-up to the question I asked here. There I learned a) how to do this for columns (see below) and b) that the selection of rows and columns seems to be handled quite differently in R, which means that I cannot use the same approach for rows. So suppose I have a pandas dataframe like this:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randint(10, size=(6, 6)), columns=['c' + str(i) for i in range(6)], index=["r" + str(i) for i in range(6)])

    c0 c1 c2
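As a hedged illustration of name-based slicing: the excerpt is cut off before it states which rows and columns are wanted, so the labels `r1`..`r3` and `c2`..`c4` below are assumptions, and a deterministic frame stands in for the random one. The sketch relies on `DataFrame.loc`, whose label slices include both endpoints.

```python
import pandas as pd

# Deterministic stand-in for the random 6x6 frame in the question:
# the cell at row r<i>, column c<j> holds i * 6 + j.
df = pd.DataFrame(
    [[r * 6 + c for c in range(6)] for r in range(6)],
    columns=["c" + str(i) for i in range(6)],
    index=["r" + str(i) for i in range(6)],
)

# Label-based slicing: unlike positional slices, BOTH endpoints are included.
# The chosen labels are hypothetical; the question does not name them.
sub = df.loc["r1":"r3", "c2":"c4"]
print(sub)
```

Because `.loc` works on labels, the same expression handles rows and columns symmetrically, which is the point the question is after.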

How to pick the T1 and T2 threshold values for Canopy Clustering?

Question: I am trying to implement the Canopy clustering algorithm along with K-Means. I've done some searching online that says to use Canopy clustering to get the initial starting points to feed into K-Means. The problem is that Canopy clustering needs two threshold values for the canopy, T1 and T2, where points in the inner threshold are strongly tied to that canopy and points in the wider threshold are less tied to it. How are these thresholds, or distances from the canopy
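To show where the two thresholds enter the procedure, here is a minimal sketch of canopy clustering in its commonly described form. It assumes T1 is the looser (outer) distance and T2 the tighter (inner) one, so `t1 > t2`; the excerpt is truncated and does not confirm that convention. The function name `canopy_centers` and the choice of the first remaining point as the next center are illustrative choices, and the sketch does not address how to pick good threshold values.

```python
import math


def canopy_centers(points, t1, t2):
    """Group points into canopies; assumes 0 < t2 < t1 (t1 loose, t2 tight)."""
    if not 0 < t2 < t1:
        raise ValueError("expected 0 < t2 < t1")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the canopy center.
        center = remaining[0]
        # Every point within the loose distance t1 joins this canopy.
        members = [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight distance t2 can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies


points = [(0, 0), (1, 0), (0.5, 0), (10, 10), (10.5, 10)]
canopies = canopy_centers(points, t1=3.0, t2=1.5)
centers = [center for center, _ in canopies]
print(centers)
```

The resulting centers could then serve as the initial points for K-Means, as the question describes.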

Number of distinct sums from non-empty groupings of (possibly very large) lists

Question: Assume that you are given a set of coin types (at most 20 distinct types) and that you have at most 10^5 instances of each type, such that the total number of coins in your list is at most 10^6. What is the number of distinct sums you can make from non-empty groupings of these coins? For example, you are given the following lists:

    coins=[10, 50, 100]
    quantity=[1, 2, 1]

which means you have 1 coin of 10, 2 coins of 50, and 1 coin of 100. Now the output should be possibleSums(coins,
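One straightforward way to count the sums is to grow a set of reachable totals one coin type at a time. The sketch below assumes positive coin values and uses `possible_sums` as a hypothetical Python spelling of the `possibleSums` function named in the question. It is a naive approach and is not tuned for the upper limits stated above.

```python
def possible_sums(coins, quantity):
    """Count distinct sums of non-empty groupings; assumes positive coin values."""
    sums = {0}  # 0 stands for the empty grouping
    for value, count in zip(coins, quantity):
        # Extend every reachable sum with 0..count copies of this coin type.
        sums = {s + value * k for s in sums for k in range(count + 1)}
    # With positive values, 0 is reachable only by the empty grouping; drop it.
    return len(sums) - 1


result = possible_sums([10, 50, 100], [1, 2, 1])
print(result)  # 9
```

For the example lists, the reachable sums are 10, 50, 60, 100, 110, 150, 160, 200 and 210.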

R: How to apply moving averages to subset of columns in a data frame?

Question: I have a dataframe (training.set) that is 150 observations of 83 variables. I want to transform 82 of those columns with some moving averages. The problem is that the result ends up being only 150 numeric values (i.e. 1 column). How would I apply the moving average function to each column individually and keep the 83rd column unchanged? I feel like this is super simple, but I can't find a solution. My current code:

    # apply moving average on training.set data to 82 of 83 rows
    library

PySpark: Search For substrings in text and subset dataframe

Question: I am brand new to PySpark and want to translate my existing pandas / Python code to PySpark. I want to subset my dataframe so that only rows that contain the specific keywords I'm looking for in the 'original_problem' field are returned. Below is the Python code I tried in PySpark:

    def pilot_discrep(input_file):
        df = input_file
        searchfor = ['cat', 'dog', 'frog', 'fleece']
        df = df[df['original_problem'].str.contains('|'.join(searchfor))]
        return df

When I try to run the above, I get the following

R: Deleting elements from a vector based on element length

Question: How can I delete elements from a vector of strings depending on the number of characters or length of the strings?

    df <- c("asdf","fweafewwf","af","","","aewfawefwef","awefWEfawefawef")
    > df
    [1] "asdf" "fweafewwf" "af" "" "" "aewfawefwef" "awefWEfawefawef"

For example, I may want to delete all elements of df with a length smaller than 5, so the output would be:

    > df
    [1] "fweafewwf" "aewfawefwef" "awefWEfawefawef"

Thanks!

Answer 1: Just use nchar:

    > df[nchar(df) > 5]
    [1] "fweafewwf" "aewfawefwef"

Subsref with cells

Question: This issue appeared when I was answering this question. I am probably making some stupid mistake, but I can't see what it is…

    myMatrix = [22 33; 44 55]

Returns:

    >> subsref(myMatrix, struct('type','()','subs',{{[1 2]}} ) );
    ans = 22 44

While using it with cells:

    myCell = {2 3; 4 5}

Returns:

    >> subsref(myCell,struct('type','{}','subs',{{[1 2]}} ) );
    ans = 2 % WHATTT?? Shouldn't this be 2 and 4 Matlab??

Checking the subsref documentation, we see: See how MATLAB calls subsref for the

R Subset data.frame from max value of one vector and grouped by another [duplicate]

Question: This question already has answers here: How to select the row with the maximum value in each group (10 answers). Closed 2 years ago.

    >ID<-c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C')
    >WK<-c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5)
    >NumSuccess<-c(0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 3)
    >Data<-data.frame(ID, WK, NumSuccess)

I am trying to create a subset data.frame "Data2" based on the value in "NumSuccesses" that corresponds to the max value in "WK", grouped by "ID". Resulting data.frame should

Remove the rows of data frame whose cells match a given vector

Question: I have a big data frame with varying numbers of columns and rows. I would like to search the data frame for the values of a given vector and remove the rows whose cells match the values of that vector. I'd like to have this as a function because I have to run it on multiple data frames with varying rows and columns, and I would like to avoid for loops. For example:

    ff<-structure(list(j.1 = 1:13, j.2 = 2:14, j.3 = 3:15), .Names = c("j.1","j.2", "j.3"), row.names = c(NA, -13L), class = "data