Implementation of skyline query or efficient frontier


慢半拍i

I know there must be an easy answer to this but somehow I can't seem to find it...

I have a data frame with 2 numeric columns. I would like to remove from it, the r

6 Answers

隐瞒了意图╮

2020-12-06 02:47

Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation" (also linked to in his answer below). To show that it finds the same solution as my code, I've appended an example using it to my original answer.

Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:

set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))

D   <- d[order(d$x, d$y, decreasing=TRUE), ]
res <- D[which(!duplicated(cummax(D$y))), ]
#             x         y
# 64  2.5819589 0.7946803
# 20  2.3102968 1.6151907
# 95 -0.5302965 1.8952759
# 80 -2.0744048 2.1686003


# And then, if you would prefer the rows to be in 
# their original order, just do:
d[sort(as.numeric(rownames(res))), ]
#            x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759

Or, using the rPref package:

library(rPref)
psel(d, high(x) | high(y))
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
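
The sort-and-`cummax` idea above can be wrapped in a small function and compared against a brute-force pairwise test. This is only a sketch: `skyline_cummax` and `skyline_brute` are hypothetical helper names, not part of any package, and the comparison assumes there are no tied values (which holds for continuous random data such as `rnorm`).

```r
# Hypothetical helpers for illustration; the names are not from any package.
skyline_cummax <- function(d) {
  ord  <- order(d$x, d$y, decreasing = TRUE)  # largest x first, ties broken by y
  keep <- !duplicated(cummax(d$y[ord]))       # TRUE where y sets a new running maximum
  sort(ord[keep])                             # row positions, back in original order
}

skyline_brute <- function(d) {
  # Row i is dominated if some row is strictly larger in both columns.
  dominated <- vapply(seq_len(nrow(d)), function(i) {
    any(d$x > d$x[i] & d$y > d$y[i])
  }, logical(1))
  which(!dominated)
}

set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))
identical(skyline_cummax(d), skyline_brute(d))
```

The brute-force version compares every pair of rows, so it is only meant as a check on small inputs.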

庸人自扰

2020-12-06 02:57

Here is an sqldf solution, where DF is the data frame holding the data:

library(sqldf)
sqldf("select * from DF a
 where not exists (
   select * from DF b
   where (b.Col1 >= a.Col1 and b.Col2 >  a.Col2)
      or (b.Col1 >  a.Col1 and b.Col2 >= a.Col2)
 )"
)

借酒劲吻你

2020-12-06 03:01

In one line:

d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
d[!apply(d,1,max)<max(apply(d,1,min)),]

     [,1] [,2]
[1,]    4    7
[2,]    5    6

Edit: In light of the clarification you gave under jbaums' response, here's how to check both columns separately.

d <- matrix(c(2, 3, 3, 7, 5, 6, 4, 8), nrow=4, byrow=TRUE)
d[apply(d,1,min)>min(apply(d,1,max)) ,]

     [,1] [,2]
[1,]    5    6
[2,]    4    8

耶瑟儿～

2020-12-06 03:03

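d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
# Column j of each sapply() result flags the rows that beat row j in that column
d2 <- sapply(d[, 1], function(x) x < d[, 1]) &
      sapply(d[, 2], function(x) x < d[, 2])
# Row j is dominated if any other row beats it in both columns
d2 <- apply(d2, 2, any)
# drop=FALSE keeps the matrix shape even if only one row survives
result <- d[!d2, , drop=FALSE]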

梦如初夏

2020-12-06 03:11

That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.

n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  polygon( c(-10,d$x[i],d$x[i],-10), c(-10,-10,d$y[i],d$y[i]), 
  col=rgb(.9,.9,.9,.2))
}

The algorithm is as follows: sort the points in decreasing order of the first coordinate, then walk through them and keep each observation whose second coordinate is larger than that of the last retained one.

d <- d[ order(d$x, decreasing=TRUE), ]
result <- d[1,]
for(i in seq_len(nrow(d))[-1] ) {
  if( d$y[i] > result$y[nrow(result)] ) {
    result <- rbind(result, d[i,])  # inefficient
  } 
}
points(result, cex=3, pch=15)
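
The `rbind` call inside the loop copies `result` on every hit, as the `# inefficient` comment notes. One way around that, sketched here under the assumption that the columns are named `x` and `y` as above, is to record the decision for each row in a preallocated logical vector and subset once at the end. `keep` and `best_y` are names introduced for this sketch, and the seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, only to make the sketch reproducible
n <- 40
d <- data.frame(x = rnorm(n), y = rnorm(n))

d <- d[order(d$x, decreasing = TRUE), ]
keep <- logical(nrow(d))         # preallocated: FALSE for every row
keep[1] <- TRUE                  # the row with the largest x is always retained
best_y <- d$y[1]                 # y of the last retained row
for (i in seq_len(nrow(d))[-1]) {
  if (d$y[i] > best_y) {         # same test as in the loop above
    keep[i] <- TRUE
    best_y <- d$y[i]
  }
}
result <- d[keep, ]              # single subset instead of repeated rbind()
```

The scan itself is unchanged; only the bookkeeping differs, so `result` holds the same rows the `rbind` version would collect for the same data.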

[Figure: Skyline]

我在风中等你

2020-12-06 03:11
This question is pretty old, but in the meantime a new solution has appeared. I hope it is OK to do some self-promotion here: I developed the package rPref, which computes Skylines efficiently thanks to algorithms implemented in C++. With rPref installed, the query from the question can be written as follows (assuming that df is the name of the data set):
```
library(rPref)
psel(df, high(Col1) | high(Col2))
```
This removes only those tuples for which some other tuple is strictly better in both dimensions.

If it should be enough for the other tuple to be strictly better in just one dimension (and better or equal in the other), use `high(Col1) * high(Col2)` instead.
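
The two definitions only differ when values tie. The base R sketch below does not use rPref; it simply spells out the two dominance tests as described above on a small made-up data set. The data and the names `strictly_dominated` and `pareto_dominated` are invented for this illustration.

```r
# Invented data: rows 1 and 2 tie on Col1, rows 2 and 3 tie on Col2.
df <- data.frame(Col1 = c(3, 3, 1), Col2 = c(1, 2, 2))

# gt*[i, j] is TRUE when row j is strictly larger than row i in that column;
# ge*[i, j] is the "larger or equal" version.
gt1 <- outer(df$Col1, df$Col1, "<"); ge1 <- outer(df$Col1, df$Col1, "<=")
gt2 <- outer(df$Col2, df$Col2, "<"); ge2 <- outer(df$Col2, df$Col2, "<=")

# Some other row is strictly better in both dimensions.
strictly_dominated <- rowSums(gt1 & gt2) > 0

# Some other row is at least as good in both and strictly better in one.
pareto_dominated <- rowSums((ge1 & gt2) | (gt1 & ge2)) > 0

df[!strictly_dominated, ]   # keeps all three rows
df[!pareto_dominated, ]     # keeps only row 2
```

With no ties in the data, both tests select the same rows.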