Implementation of skyline query or efficient frontier

前端 未结 6 1777
慢半拍i
慢半拍i 2020-12-06 02:44

I know there must be an easy answer to this but somehow I can\'t seem to find it...

I have a data frame with 2 numeric columns. I would like to remove from it, the r

相关标签:
6条回答
  • 2020-12-06 02:47

    Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation", (also linked to in his answer below). To show that it finds the same solution as my code here, I've appended an example using it to my original answer here.


    Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:

    set.seed(100)
    d <- data.frame(x = rnorm(100), y = rnorm(100))
    
    D   <- d[order(d$x, d$y, decreasing=TRUE), ]
    res <- D[which(!duplicated(cummax(D$y))), ]
    #             x         y
    # 64  2.5819589 0.7946803
    # 20  2.3102968 1.6151907
    # 95 -0.5302965 1.8952759
    # 80 -2.0744048 2.1686003
    
    
    # And then, if you would prefer the rows to be in 
    # their original order, just do:
    d[sort(as.numeric(rownames(res))), ]
    #            x         y
    # 20  2.3102968 1.6151907
    # 64  2.5819589 0.7946803
    # 80 -2.0744048 2.1686003
    # 95 -0.5302965 1.8952759
    

    Or, using the rPref package:

    library(rPref)
    psel(d, high(x) | high(y))
    #             x         y
    # 20  2.3102968 1.6151907
    # 64  2.5819589 0.7946803
    # 80 -2.0744048 2.1686003
    # 95 -0.5302965 1.8952759
    
    0 讨论(0)
  • 2020-12-06 02:57

    Here is an sqldf solution where DF is the data frame of data:

    library(sqldf)
    sqldf("select * from DF a
     where not exists (
       select * from DF b
       where b.Col1 >= a.Col1 and b.Col2 >  a.Col2  
          or b.Col1 >  a.Col1 and b.Col2 >= a.Col2
     )"
    )
    
    0 讨论(0)
  • 2020-12-06 03:01

    In one line:

    d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
    d[!apply(d,1,max)<max(apply(d,1,min)),]
    
         [,1] [,2]
    [1,]    4    7
    [2,]    5    6
    

    Edit: In light of your precision in jbaums' response, here's how to check for both columns separately.

    d <- matrix(c(2, 3, 3, 7, 5, 6, 4, 8), nrow=4, byrow=TRUE)
    d[apply(d,1,min)>min(apply(d,1,max)) ,]
    
         [,1] [,2]
    [1,]    5    6
    [2,]    4    8
    
    0 讨论(0)
  • 2020-12-06 03:03
    d <- matrix(c(2, 3, 4, 7, 5, 6), nrow=3, byrow=TRUE)
    d2 <- sapply(d[, 1], function(x) x < d[, 1]) & 
          sapply(d[, 2], function(x) x < d[, 2])
    d2 <- apply(d2, 2, any)
    result <- d[!d2, ]
    
    0 讨论(0)
  • 2020-12-06 03:11

    That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.

    n <- 40
    d <- data.frame(
      x = rnorm(n),
      y = rnorm(n)
    )
    # We want the "extreme" points in the following plot
    par(mar=c(1,1,1,1))
    plot(d, axes=FALSE, xlab="", ylab="")
    for(i in 1:n) {
      polygon( c(-10,d$x[i],d$x[i],-10), c(-10,-10,d$y[i],d$y[i]), 
      col=rgb(.9,.9,.9,.2))
    }
    

    The algorithm is as follows: sort the points along the first coordinate, keep each observation unless it is worse than the last retained one.

    d <- d[ order(d$x, decreasing=TRUE), ]
    result <- d[1,]
    for(i in seq_len(nrow(d))[-1] ) {
      if( d$y[i] > result$y[nrow(result)] ) {
        result <- rbind(result, d[i,])  # inefficient
      } 
    }
    points(result, cex=3, pch=15)
    

    Skyline

    0 讨论(0)
  • 2020-12-06 03:11

    This question is pretty old, but meanwhile there is a new solution. I hope it is ok to do some self-promotion here: I developed a package rPref which does an efficient Skyline computation due to C++ algorithms. With installed rPref package the query from the question can be done via (assuming that df is the name of data set):

    library(rPref)
    psel(df, high(Col1) | high(Col2))
    

    This removes only those tuples, where some other tuple is better in both dimensions.

    If one requires the other tuple to be strictly better in just one dimension (and better or equal in the other dimension), use high(Col1) * high(Col2) instead.

    0 讨论(0)
提交回复
热议问题