Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date

后端 未结 4 866
难免孤独
难免孤独 2020-11-30 06:28

I am facing the common task of calculating the age (in years, months, or weeks) given the date of birth and an arbitrary date. The thing is that quite often I have to do thi

4条回答
  •  悲哀的现实
    2020-11-30 07:31

    I have been hammering away at this and finally have something which is a) perfectly accurate* (in contrast to all of the other options presented thus far) and b) reasonably fast (see my benchmarks in the other answer). It relies on a bunch of arithmetic I did by hand and the wonderful foverlaps function from the data.table package.

    The essence of the approach is to work from the integer representation of Dates, as well as to recognize that all birth dates fall in one of four 1461 (= 365 * 4 + 1)-day cycles, depending on when the next year is when it will take 366 days for your birthday to come.

    Here's the function:

    library(data.table)
    get_age <- function(birthdays, ref_dates){
      x <- data.table(bday <- unclass(birthdays),
                      #rem: how many days has it been since the lapse of the
                      #  most recent quadrennium since your birth?
                      rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
      #cycle_type: which of the four years following your birthday
      #  was the one that had 366 days? 
      x[ , cycle_type := 
           foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                     #these intervals were calculated by hand;
                     #  e.g., 59 is Feb. 28, 1970. I made the judgment
                     #  call to say that those born on Feb. 29 don't
                     #  have their "birthday" until the following March 1st.
                     data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                                end = c(58L, 423L, 789L, 1154L, 1460L), 
                                val = c(3L, 2L, 1L, 4L, 3L),
                                key = "start,end"))$val]
      I4 <- diag(4L)[ , -4L] #for conciseness below
      #The `by` approach might seem a little abstruse for those
      #  not familiar with `data.table`; see the edit history
      #  for a more palatable version (which is also slightly slower)
      x[ , extra := 
           foverlaps(data.table(start = rem, end = rem),
                     data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                         I4[.BY[[1L]],])),
                                end = c(st[-1L] - 1L, 1461L),
                                int_yrs = 0:3, key = "start,end")
           )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
      #grand finale -- 4 years for every quadrennium, plus the fraction:
      4L * ((ref - bday) %/% 1461L) + x$extra
    }
    

    Comparing on your main example:

    toy_df <- data.frame(
      birthdate = birthdate,
      givendate = givendate,
      arithmetic = as.numeric((givendate - birthdate) / 365.25),
      lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
      eeptools = age_calc(dob = birthdate, enddate = givendate,
                          units = "years"),
      mine = get_age(birthdate, givendate)
    )
    
    toy_df
    #     birthdate  givendate arithmetic lubridate   eeptools       mine
    # 1  1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
    # 2  1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
    # 3  1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
    # 4  1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
    # 5  1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
    # 6  1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
    # 7  2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
    # 8  2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
    # 9  2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
    # 10 2007-03-18 2008-03-19  1.0047912  1.005479  1.0027322  1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
    # 11 2007-03-19 2008-03-19  1.0020534  1.002740  1.0000000  1.0000000
    # 12 2007-03-20 2008-03-19  0.9993155  1.000000  0.9966839  0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
    # 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
    # 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
    # 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322
    

    This style of approach could be extended to handle months/weeks pretty easily. Months will be a bit long-winded (have to specify 4 years' worth of month lengths), so I didn't bother; weeks is easy (weeks are unaffected by leap year considerations, so we can just divide by 7).

    I also made a lot of progress on doing this with base functionalities, but a) it was quite ugly (needs a non-linear transformation of 0-1460 to avoid doing nested ifelse statements, etc.) and b) in the end a for loop (in the form of apply over the whole list of dates) was unavoidable, so I decided that would slow things down too much. (the transformation is x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1, for posterity)

    I've added this function to my package.

    *(for dates ranges when non-leap centuries are not a concern; I believe the extension to handle such dates shouldn't be too burdensome, however)

提交回复
热议问题