replace missing value based on linear prediction of nearby cells

Submitted by 情到浓时终转凉″ on 2020-01-24 19:39:12

Question


I have a dataset (tsset) that has observations in some years but not others:

year x
1990 600
1991 .
1992 .
1993 .
1994 .
1995 1100
1996 .
1997 .
1998 1700

Suppose I am willing to assume that every missing observation between two non-missing years (say 1990 and 1995) can be imputed by a linear prediction between those two years, so that the data look like this:

year  x
1990  600
1991 [700]
1992 [800]
1993 [900]
1994 [1000]
1995  1100
1996 [1300]
1997 [1500]
1998  1700

Is there any way to do this efficiently? I am currently using something like cond(year>1990 & year<1995, [Value if True], [Value if False]), but I do not know a good way for all years from 1991 to 1994 to find 1990 as their lower bound and 1995 as their upper bound.

Stata's documentation demonstrates the technique of using x[_n-1] to fill missing values from the previous cell, but I am not sure how this can be extended to solve the problem described above.


Answer 1:


What you ask for is linear interpolation. The ipolate command, which does exactly this, has been part of Stata for most of its history. No loops are needed.

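* Enter the example data; "." marks a missing value
clear
input year x
1990 600
1991 .
1992 .
1993 .
1994 .
1995 1100
1996 .
1997 .
1998 1700
end
* Interpolate x linearly on year into a new variable, leaving x untouched
ipolate x year, gen(xint)
list, sep(0)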

     +--------------------+
     | year      x   xint |
     |--------------------|
  1. | 1990    600    600 |
  2. | 1991      .    700 |
  3. | 1992      .    800 |
  4. | 1993      .    900 |
  5. | 1994      .   1000 |
  6. | 1995   1100   1100 |
  7. | 1996      .   1300 |
  8. | 1997      .   1500 |
  9. | 1998   1700   1700 |
     +--------------------+

Note that the original variable remains intact, which is prudent as a matter of an analysis audit trail.
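For readers wondering how the x[_n-1] idea from the question could be extended, here is a minimal hand-rolled sketch of the same calculation. It is not part of the original answer, and ipolate remains the simpler route. The names lo_year, lo_x, hi_year, hi_x and xmanual are hypothetical, and the example data are assumed to be in memory already.

```stata
* Sketch only: lo_*, hi_* and xmanual are hypothetical names
sort year
* Carry the last observed year and value forward
gen lo_year = year if !missing(x)
gen lo_x = x
replace lo_year = lo_year[_n-1] if missing(lo_year)
replace lo_x = lo_x[_n-1] if missing(lo_x)
* Reverse the order and carry the next observed year and value backward
gsort -year
gen hi_year = year if !missing(x)
gen hi_x = x
replace hi_year = hi_year[_n-1] if missing(hi_year)
replace hi_x = hi_x[_n-1] if missing(hi_x)
sort year
* Keep observed values; otherwise interpolate linearly between the two bounds
gen xmanual = cond(!missing(x), x, lo_x + (hi_x - lo_x) * (year - lo_year) / (hi_year - lo_year))
list year x xmanual, sep(0)
```

On the example data, xmanual should agree with xint. Years before the first or after the last observed value would stay missing, because one of the two bounds is missing there.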

ipolate also supports interpolation done separately within distinct groups. In practice this most commonly means panel or longitudinal data, in which different panels (people, firms, countries, stations, sites, whatever) with distinct identifiers are followed over time.
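As an illustration of the groupwise case, here is a minimal sketch that uses the by prefix with ipolate. The two-panel dataset and the variable name id are made up for this example and do not come from the question.

```stata
* Hypothetical panel data: two ids, each with its own gaps
clear
input id year x
1 1990 600
1 1991 .
1 1992 800
2 1990 10
2 1991 .
2 1992 .
2 1993 40
end
* Interpolate separately within each id, so that values from one
* panel are never used to fill gaps in another
bysort id: ipolate x year, gen(xint)
list, sepby(id)
```

With these made-up numbers, the gap in panel 1 should be filled with 700 and the gaps in panel 2 with 20 and 30.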

There are naturally many other kinds of interpolation.

mipolate (SSC) is a user-written program that generalizes ipolate. See here for a discussion or just install it with ssc install mipolate and read its help.
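A short sketch of how mipolate might be tried on the example data follows. The method options shown (linear, pchip, nearest) are given as I recall them from its help file, so treat them as assumptions and confirm with help mipolate. The new variable names are hypothetical.

```stata
* Assumes the example data (year, x) are in memory and SSC is reachable
ssc install mipolate
* Option names as recalled from the help file; confirm with: help mipolate
mipolate x year, gen(x_lin) linear
mipolate x year, gen(x_pchip) pchip
mipolate x year, gen(x_near) nearest
list year x x_lin x_pchip x_near, sep(0)
```

Comparing the new variables side by side is a quick way to see how much the choice of interpolation method matters for a given series.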



Source: https://stackoverflow.com/questions/40192443/replace-missing-value-based-on-linear-prediction-of-nearby-cells
