outliers | 易学教程

Replace outliers by quantiles in R

Question: I have been trying to replace outliers (values more than 1.5*IQR below the lower quantile or above the upper quantile) with the lower and upper quantile, using the following code:
    lower.quantile <- as.numeric(summary(loans$dINC_A)[2])
    lower.quantile  # [1] 9000
    upper.quantile <- as.numeric(summary(loans$dINC_A)[5])
    upper.quantile  # [1] 21240
    IQR <- upper.quantile - lower.quantile
    # I replace outliers by the lower/upper bound values
    loans$INC_A[ loans$dINC_A < (lower.quantile-1.5*IQR) ] <- lower.quantile
    loans$INC_A[ loans$dINC_A > (upper
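
The replacement rule described in this question can be sketched in Python (the question itself uses R). This is only an illustration under stated assumptions: the function name `replace_outliers_with_quartiles` and the sample incomes are made up, and quartiles are computed with NumPy's default linear interpolation, which may differ slightly from R's `summary()`.

```python
import numpy as np

def replace_outliers_with_quartiles(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with Q1 or Q3."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    out = x.copy()
    # The masks are built from the same data that is being overwritten.
    out[x < q1 - k * iqr] = q1
    out[x > q3 + k * iqr] = q3
    return out

incomes = [9000, 12000, 15000, 18000, 21000, 500000]  # hypothetical sample data
cleaned = replace_outliers_with_quartiles(incomes)
print(cleaned)
```

Building the mask and assigning the replacement on the same array avoids a mismatch between the column being tested and the column being modified.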

python pandas How to remove outliers from a dataframe and replace with an average value of preceding records

Question: I have a dataframe of 16k records with multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules. For example, in the data below, how could I identify the skewed points (any value greater than 1) and replace each with the average of the next two records, or with the previous record if there are no later records in that group?
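
One possible reading of that rule, sketched with pandas. The column names `country` and `value`, the sample data, and the helper `replace_skewed` are assumptions for illustration. The sketch also assumes that when only one later record exists, that single record is used, and that the neighbouring records are taken as they are, even if they are themselves above the threshold.

```python
import pandas as pd

def replace_skewed(s, threshold=1.0):
    """Replace values above threshold within one group's Series."""
    s = s.astype(float)
    # Mean of the next two records (or of the single next record if only one is left).
    next_mean = pd.concat([s.shift(-1), s.shift(-2)], axis=1).mean(axis=1)
    # Fall back to the previous record when there are no later records.
    replacement = next_mean.fillna(s.shift(1))
    return s.where(s <= threshold, replacement)

df = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B"],   # hypothetical groups
    "value":   [0.2, 5.0, 0.4, 0.6, 0.3, 0.5, 9.0],   # hypothetical values
})
df["cleaned"] = df.groupby("country")["value"].transform(replace_skewed)
print(df)
```

Here the 5.0 in group A becomes the mean of the next two records (0.5), and the 9.0 at the end of group B falls back to the previous record (0.5).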

Remove remains in a letter image with Python

Question: I have a set of images that represent letters extracted from an image of a word. Some images contain remains of the adjacent letters, and I want to eliminate them but I do not know how. [Sample images] I'm working with OpenCV and I've tried two approaches, and neither works. With findContours:
    def is_contour_bad(c):
        return len(c) < 50
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edged = cv2.Canny(gray, 50, 100)
    contours = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

OpenCV Surf and Outliers detection

Question: I know there are already several questions on the same subject here, but I couldn't find any that helped. I want to compare two images to see how similar they are. I'm using the well-known find_obj.cpp demo to extract SURF descriptors, and for the matching I use flannFindPairs. But as you know, this method doesn't discard the outliers, and I'd like to know the number of true positive matches so I can figure out how similar those two images are. I have already seen this

R Language - Sorting data into ranges; averaging; ignore outliers

Question: I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something more heavy-duty. I have never used R before, so I am just looking for some pointers. The data consists of two columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two columns against each other. What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x
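
The binning-and-averaging step can be sketched in Python with pandas (the question asks about R, so this is a translation of the idea, not the asker's code). The sample values and the bin edges are made up. Reporting the median next to the mean is one simple way to damp outliers, chosen here as an assumption since the question's own rule is cut off.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
data = pd.DataFrame({
    "WindSpeed": [0.5, 1.2, 1.8, 2.3, 2.6, 2.9, 3.1, 3.7],
    "Power":     [0.0, 10.0, 14.0, 30.0, 35.0, 500.0, 60.0, 70.0],
})

# Assign each row to a wind-speed range: [0, 1), [1, 2), [2, 3), [3, 4).
edges = [0, 1, 2, 3, 4]
data["SpeedBin"] = pd.cut(data["WindSpeed"], bins=edges, right=False)

# Per-range statistics; the median is less sensitive to the 500.0 reading than the mean.
summary = data.groupby("SpeedBin", observed=True)["Power"].agg(["count", "mean", "median"])
print(summary)
```

In the `[2, 3)` range the single extreme reading pulls the mean far above the median, which is the kind of effect the question wants to ignore.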

How can I use the index-structures in ELKI?

Question: These are quotes from http://elki.dbs.ifi.lmu.de/ : "Essentially, we bind the abstract distance query to a database, and then get a nearest neighbor search for this distance. At this point, ELKI will automatically choose the most appropriate kNN query class. If there exist an appropriate index for our distance function (not every index can accelerate every distance!), it will automatically be used here." "The getKNNForDBID method may boil down to a slow linear scan, but when the database has

Boxplot : Outliers Labels Python

Question: I'm making a time series boxplot using the seaborn package, but I can't put labels on my outliers. My data is a DataFrame of 3 columns, [Month, Id, Value], that we can fake like this:
    ### Sample Data ###
    Month = numpy.repeat(numpy.arange(1,11),10)
    Id = numpy.arange(1,101)
    Value = numpy.random.randn(100)
    ### As a pandas DataFrame ###
    Ts = pandas.DataFrame({'Value' : Value,'Month':Month, 'Id': Id})
    ### Time series boxplot ###
    ax = seaborn.boxplot(x="Month",y="Value",data=Ts)
I have one boxplot
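
Before any label can be drawn, the outlying rows have to be identified. A minimal sketch of that step with pandas, assuming the usual 1.5*IQR whisker rule (seaborn's default `whis=1.5`). The helper `flag_outliers`, the fixed seed, and the planted outlier are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.DataFrame({
    "Month": np.repeat(np.arange(1, 11), 10),
    "Id": np.arange(1, 101),
    "Value": rng.standard_normal(100),
})
ts.loc[4, "Value"] = 10.0  # plant one obvious outlier (Id 5, Month 1)

def flag_outliers(values, whis=1.5):
    """True where a value lies beyond the whiskers of its own group."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

ts["is_outlier"] = ts.groupby("Month")["Value"].transform(flag_outliers).astype(bool)
outliers = ts[ts["is_outlier"]]
print(outliers[["Month", "Id", "Value"]])
```

Each row of `outliers` carries the `Id` to use as label text, together with the `Month` and `Value` that locate the point on the plot.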

Remove unsorted/outlier elements in nearly-sorted array

Question: Given an array like [15, 14, 12, 3, 10, 4, 2, 1], how can I determine which elements are out of order and remove them (the number 3 in this case)? I don't want to sort the list, but detect outliers and remove them. Another example: [13, 12, 4, 9, 8, 6, 7, 3, 2]. I want to be able to remove the numbers 4 and 7 so that I end up with: [13, 12, 9, 8, 6, 3, 2]. There's also a problem that arises when you have this scenario: [15, 13, 12, 7, 10, 5, 4, 3]. You could either remove 7 or 10 to make this array
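
One way to frame this is as keeping the longest strictly decreasing subsequence and dropping everything else. The sketch below uses a simple O(n^2) dynamic program. The function name `remove_out_of_order` is hypothetical, and choosing "longest subsequence" as the criterion is an assumption. When several subsequences tie in length, which element gets removed depends on the tie-breaking (here the earliest candidate wins).

```python
def remove_out_of_order(values):
    """Return the longest strictly decreasing subsequence of values."""
    n = len(values)
    if n == 0:
        return []
    length = [1] * n  # length[i]: longest decreasing subsequence ending at i
    prev = [-1] * n   # prev[i]: index of the element kept just before i
    for i in range(n):
        for j in range(i):
            if values[j] > values[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # Start from the first index that reaches the maximum length and walk back.
    i = max(range(n), key=lambda k: length[k])
    kept = []
    while i != -1:
        kept.append(values[i])
        i = prev[i]
    return kept[::-1]

print(remove_out_of_order([15, 14, 12, 3, 10, 4, 2, 1]))
print(remove_out_of_order([13, 12, 4, 9, 8, 6, 7, 3, 2]))
```

For the ambiguous third example in the question, this sketch keeps seven elements and drops one of 7 or 10. It makes no claim about which of the two is the "right" one to drop.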

deleting outlier in r with account of nominal var

Say I have three columns:
    x <- c(-10, 1:6, 50)
    x1 <- c(-20, 1:6, 60)
    z <- c(1,2,3,4,5,6,7,8)
Check outliers for x:
    bx <- boxplot(x)
    bx$out
Check outliers for x1:
    bx1 <- boxplot(x1)
    bx1$out
Now we must delete the outliers:
    x <- x[!(x %in% bx$out)]
    x
    x1 <- x1[!(x1 %in% bx1$out)]
    x1
But we also have the nominal variable z, and we must remove the observations that correspond to the outliers of x and x1; in our case these are observations 1 and 8 of z. How can this be done? The output should be:
    x   x1  z
    NA  NA  NA
    1   1   2
    2   2   3
    3   3   4
    4   4   5
    5   5   6
    6   6   7
    NA  NA  NA
Try this solution:
    x_to_remove<-which(x %in% bx$out)
    x <- x[!(x %in% bx
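
The same idea in Python with pandas, as a hedged sketch: keep the three columns together in one table, flag a row if either numeric column is an outlier, and filter whole rows so `z` stays aligned. The helper `is_outlier` is hypothetical and uses the 1.5*IQR rule with pandas' default quantile interpolation, which can differ slightly from the hinges used by R's `boxplot` on other data. The sketch drops the flagged rows instead of blanking them.

```python
import pandas as pd

df = pd.DataFrame({
    "x":  [-10, 1, 2, 3, 4, 5, 6, 50],
    "x1": [-20, 1, 2, 3, 4, 5, 6, 60],
    "z":  [1, 2, 3, 4, 5, 6, 7, 8],
})

def is_outlier(col, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

# A row is flagged if either numeric column marks it as an outlier.
mask = df[["x", "x1"]].apply(is_outlier).any(axis=1)
cleaned = df[~mask]
print(cleaned)
```

Filtering by a row mask, not by value, is what keeps the nominal column in step with the numeric ones.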

R: outlier cleaning for each column in a dataframe by using quantiles 0.05 and 0.95

Question: I am an R novice. I want to do some outlier cleaning and overall scaling from 0 to 1 before putting the sample into a random forest.
    g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
If I do a simple scaling from 0 to 1, the result would be:
    > round((g - min(g))/abs(max(g) - min(g)),1)
    [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0
So my idea is to replace the values of each column that are greater than the 0.95-quantile with the next value smaller than the 0
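
A Python sketch of one interpretation of this idea (the question uses R and is cut off, so the details are assumptions): values above the 0.95-quantile are pulled down to the largest observed value inside the quantile range, values below the 0.05-quantile are pulled up to the smallest one, and the result is scaled to 0 to 1. The function name `clamp_and_scale` is hypothetical, and the sketch assumes the clamped data is not constant.

```python
import numpy as np

def clamp_and_scale(values, low_q=0.05, high_q=0.95):
    """Clamp to the nearest observed values inside the quantile range, then min-max scale."""
    x = np.asarray(values, dtype=float)
    lo, hi = np.quantile(x, [low_q, high_q])
    inside = x[(x >= lo) & (x <= hi)]
    floor, ceil = inside.min(), inside.max()
    clamped = np.clip(x, floor, ceil)
    # Assumes ceil > floor; a constant column would need separate handling.
    return (clamped - floor) / (ceil - floor)

g = [1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10]
scaled = clamp_and_scale(g)
print(np.round(scaled, 2))
```

With this data the 1000 is clamped to 70 and the 10 to 40, so the remaining values spread over the full 0 to 1 range and no longer collapse near 0.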