问题
I have a data frame that has the data from the Gini Index of countries. Plenty of the values are NA
, so i want to create a new data frame that has, for each country, the most recent Gini Index measured for it. For example, if Brazil has a value for 2012, 2013 and 2015, the new data frame will have only the value of 2015. This is how the data looks like:
Country.Name Country.Code X2014 X2015 X2016 X2017
8 Argentina ARG 41.4 NA 42.4 NA
9 Armenia ARM 31.5 32.4 32.5 NA
13 Austria AUT 30.5 30.5 NA NA
16 Belgium BEL 28.1 27.7 NA NA
17 Benin BEN NA 47.8 NA NA
18 Burkina Faso BFA 35.3 NA NA NA
19 Bangladesh BGD NA NA 32.4 NA
20 Bulgaria BGR 37.4 NA NA NA
23 Bosnia and Herzegovina BIH NA 32.7 NA NA
24 Belarus BLR 27.2 26.7 27.0 NA
27 Bolivia BOL 47.8 46.7 44.6 NA
28 Brazil BRA 51.5 51.3 NA NA
31 Bhutan BTN NA NA NA 37.4
36 Switzerland CHE 32.5 32.3 NA NA
38 Chile CHL NA 47.7 NA NA
40 Cote d'Ivoire CIV NA 41.5 NA NA
41 Cameroon CMR 46.6 NA NA NA
44 Colombia COL 52.8 51.1 50.8 NA
47 Costa Rica CRI 48.6 48.4 48.7 NA
52 Cyprus CYP 35.6 34.0 NA NA
53 Czech Republic CZE 25.9 25.9 NA NA
54 Germany DEU NA 31.7 NA NA
57 Denmark DNK 28.4 28.2 NA NA
58 Dominican Republic DOM 44.1 44.7 45.3 NA
65 Ecuador ECU 45.0 46.0 45.0 NA
66 Egypt, Arab Rep. EGY NA 31.8 NA NA
69 Spain ESP 36.1 36.2 NA NA
70 Estonia EST 34.6 32.7 NA NA
71 Ethiopia ETH NA 39.1 NA NA
74 Finland FIN 26.8 27.1 NA NA
76 France FRA 32.3 32.7 NA NA
79 Gabon GAB NA NA NA 38.0
80 United Kingdom GBR 34.0 33.2 NA NA
81 Georgia GEO 37.3 36.4 36.5 NA
85 Gambia, The GMB NA 35.9 NA NA
88 Greece GRC 35.8 36.0 NA NA
91 Guatemala GTM 48.3 NA NA NA
96 Honduras HND 50.4 49.6 50.0 NA
98 Croatia HRV 32.1 31.1 NA NA
100 Hungary HUN 30.9 30.4 NA NA
110 Ireland IRL 31.9 31.8 NA NA
111 Iran, Islamic Rep. IRN 38.8 NA NA NA
113 Iceland ISL 27.8 NA NA NA
115 Italy ITA 34.7 35.4 NA NA
119 Kazakhstan KAZ 27.0 26.9 NA NA
120 Kenya KEN NA 40.8 NA NA
121 Kyrgyz Republic KGZ 26.8 29.0 26.8 NA
130 Liberia LBR 33.2 NA NA NA
137 Sri Lanka LKA NA NA 39.8 NA
142 Lithuania LTU 37.7 37.4 NA NA
143 Luxembourg LUX 31.2 33.8 NA NA
144 Latvia LVA 35.1 34.2 NA NA
149 Moldova MDA 26.8 27.0 26.3 NA
153 Mexico MEX 45.8 NA 43.4 NA
156 Macedonia, FYR MKD 35.6 NA NA NA
158 Malta MLT 29.0 29.4 NA NA
159 Myanmar MMR NA 38.1 NA NA
161 Montenegro MNE 31.9 NA NA NA
162 Mongolia MNG 32.0 NA 32.3 NA
164 Mozambique MOZ 54.0 NA NA NA
165 Mauritania MRT 32.6 NA NA NA
168 Malaysia MYS NA 41.0 NA NA
170 Namibia NAM NA 59.1 NA NA
172 Niger NER 34.3 NA NA NA
174 Nicaragua NIC 46.2 NA NA NA
175 Netherlands NLD 28.6 28.2 NA NA
176 Norway NOR 26.8 27.5 NA NA
183 Pakistan PAK NA 33.5 NA NA
184 Panama PAN 50.6 50.8 50.4 NA
185 Peru PER 43.4 43.5 43.8 NA
193 Portugal PRT 35.6 35.5 NA NA
194 Paraguay PRY 50.7 47.6 47.9 NA
195 West Bank and Gaza PSE NA NA 33.7 NA
200 Romania ROU 36.0 35.9 NA NA
201 Russian Federation RUS 39.9 37.7 NA NA
210 El Salvador SLV 41.6 40.6 40.0 NA
220 Slovak Republic SVK 26.1 26.5 NA NA
221 Slovenia SVN 25.7 25.4 NA NA
222 Sweden SWE 28.4 29.2 NA NA
231 Togo TGO NA 43.1 NA NA
232 Thailand THA 37.0 36.0 NA NA
233 Tajikistan TJK NA 34.0 NA NA
236 Timor-Leste TLS 28.7 NA NA NA
243 Turkey TUR 41.2 42.9 41.9 NA
246 Uganda UGA NA NA 42.8 NA
247 Ukraine UKR 24.0 25.5 25.0 NA
249 Uruguay URY 40.1 40.2 39.7 NA
250 United States USA NA NA 41.5 NA
256 Vietnam VNM 34.8 NA 35.3 NA
260 Kosovo XKX 27.3 26.4 26.5 NA
261 Yemen, Rep. YEM 36.7 NA NA NA
262 South Africa ZAF 63.0 NA NA NA
263 Zambia ZMB NA 57.1 NA NA
This is already a subset I made since I considered values older than 2014 not useful. I want to get the most recent value for each country in order to make an inequality rank. Any thoughts?
回答1:
You could use coalesce
:
library(tidyverse)
df %>% mutate(last = invoke(coalesce,df[6:3])) %>% head
# edit, more simply :
# df %>% mutate(last = coalesce(!!!df[6:3])) %>% head
# Country.Name Country.Code X2014 X2015 X2016 X2017 last
# 1 Argentina ARG 41.4 NA 42.4 NA 42.4
# 2 Armenia ARM 31.5 32.4 32.5 NA 32.5
# 3 Austria AUT 30.5 30.5 NA NA 30.5
# 4 Belgium BEL 28.1 27.7 NA NA 27.7
# 5 Benin BEN NA 47.8 NA NA 47.8
# 6 Burkina Faso BFA 35.3 NA NA NA 35.3
In base R (less efficient, same output) :
df$last <- apply(df[6:3],1,function(x) na.omit(x)[1])
回答2:
A tidyverse
option
library(tidyverse)
df %>%
gather(Year, Index, starts_with("X")) %>%
mutate(Year = as.numeric(str_replace(Year, "X", ""))) %>%
group_by(Country.Code) %>%
arrange(Country.Code, desc(Year)) %>%
filter(!is.na(Index)) %>%
slice(1)
ungroup()
## A tibble: 93 x 4
# Country.Name Country.Code Year Index
# <fct> <fct> <dbl> <dbl>
# 1 Argentina ARG 2016 42.4
# 2 Armenia ARM 2016 32.5
# 3 Austria AUT 2015 30.5
# 4 Belgium BEL 2015 27.7
# 5 Benin BEN 2015 47.8
# 6 Burkina Faso BFA 2014 35.3
# 7 Bangladesh BGD 2016 32.4
# 8 Bulgaria BGR 2014 37.4
# 9 Bosnia and Herzegovina BIH 2015 32.7
#10 Belarus BLR 2016 27
Explanation: Reshape yearly Gini data from wide to long, group by Country.Code
, sort entries by descending Year
, remove NA
rows and keep only the latest entry per Country.Code
.
Sample data
df <- read.table(text =
"Country.Name Country.Code X2014 X2015 X2016 X2017
8 Argentina ARG 41.4 NA 42.4 NA
9 Armenia ARM 31.5 32.4 32.5 NA
13 Austria AUT 30.5 30.5 NA NA
16 Belgium BEL 28.1 27.7 NA NA
17 Benin BEN NA 47.8 NA NA
18 'Burkina Faso' BFA 35.3 NA NA NA
19 Bangladesh BGD NA NA 32.4 NA
20 Bulgaria BGR 37.4 NA NA NA
23 'Bosnia and Herzegovina' BIH NA 32.7 NA NA
24 Belarus BLR 27.2 26.7 27.0 NA
27 Bolivia BOL 47.8 46.7 44.6 NA
28 Brazil BRA 51.5 51.3 NA NA
31 Bhutan BTN NA NA NA 37.4
36 Switzerland CHE 32.5 32.3 NA NA
38 Chile CHL NA 47.7 NA NA
40 'Cote d Ivoire' CIV NA 41.5 NA NA
41 Cameroon CMR 46.6 NA NA NA
44 Colombia COL 52.8 51.1 50.8 NA
47 'Costa Rica' CRI 48.6 48.4 48.7 NA
52 Cyprus CYP 35.6 34.0 NA NA
53 'Czech Republic' CZE 25.9 25.9 NA NA
54 Germany DEU NA 31.7 NA NA
57 Denmark DNK 28.4 28.2 NA NA
58 'Dominican Republic' DOM 44.1 44.7 45.3 NA
65 Ecuador ECU 45.0 46.0 45.0 NA
66 'Egypt, Arab Rep.' EGY NA 31.8 NA NA
69 Spain ESP 36.1 36.2 NA NA
70 Estonia EST 34.6 32.7 NA NA
71 Ethiopia ETH NA 39.1 NA NA
74 Finland FIN 26.8 27.1 NA NA
76 France FRA 32.3 32.7 NA NA
79 Gabon GAB NA NA NA 38.0
80 'United Kingdom' GBR 34.0 33.2 NA NA
81 Georgia GEO 37.3 36.4 36.5 NA
85 'Gambia, The' GMB NA 35.9 NA NA
88 Greece GRC 35.8 36.0 NA NA
91 Guatemala GTM 48.3 NA NA NA
96 Honduras HND 50.4 49.6 50.0 NA
98 Croatia HRV 32.1 31.1 NA NA
100 Hungary HUN 30.9 30.4 NA NA
110 Ireland IRL 31.9 31.8 NA NA
111 'Iran, Islamic Rep.' IRN 38.8 NA NA NA
113 Iceland ISL 27.8 NA NA NA
115 Italy ITA 34.7 35.4 NA NA
119 Kazakhstan KAZ 27.0 26.9 NA NA
120 Kenya KEN NA 40.8 NA NA
121 'Kyrgyz Republic' KGZ 26.8 29.0 26.8 NA
130 Liberia LBR 33.2 NA NA NA
137 'Sri Lanka' LKA NA NA 39.8 NA
142 Lithuania LTU 37.7 37.4 NA NA
143 Luxembourg LUX 31.2 33.8 NA NA
144 Latvia LVA 35.1 34.2 NA NA
149 Moldova MDA 26.8 27.0 26.3 NA
153 Mexico MEX 45.8 NA 43.4 NA
156 'Macedonia, FYR' MKD 35.6 NA NA NA
158 Malta MLT 29.0 29.4 NA NA
159 Myanmar MMR NA 38.1 NA NA
161 Montenegro MNE 31.9 NA NA NA
162 Mongolia MNG 32.0 NA 32.3 NA
164 Mozambique MOZ 54.0 NA NA NA
165 Mauritania MRT 32.6 NA NA NA
168 Malaysia MYS NA 41.0 NA NA
170 Namibia NAM NA 59.1 NA NA
172 Niger NER 34.3 NA NA NA
174 Nicaragua NIC 46.2 NA NA NA
175 Netherlands NLD 28.6 28.2 NA NA
176 Norway NOR 26.8 27.5 NA NA
183 Pakistan PAK NA 33.5 NA NA
184 Panama PAN 50.6 50.8 50.4 NA
185 Peru PER 43.4 43.5 43.8 NA
193 Portugal PRT 35.6 35.5 NA NA
194 Paraguay PRY 50.7 47.6 47.9 NA
195 'West Bank and Gaza' PSE NA NA 33.7 NA
200 Romania ROU 36.0 35.9 NA NA
201 'Russian Federation' RUS 39.9 37.7 NA NA
210 'El Salvador' SLV 41.6 40.6 40.0 NA
220 'Slovak Republic' SVK 26.1 26.5 NA NA
221 Slovenia SVN 25.7 25.4 NA NA
222 Sweden SWE 28.4 29.2 NA NA
231 Togo TGO NA 43.1 NA NA
232 Thailand THA 37.0 36.0 NA NA
233 Tajikistan TJK NA 34.0 NA NA
236 Timor-Leste TLS 28.7 NA NA NA
243 Turkey TUR 41.2 42.9 41.9 NA
246 Uganda UGA NA NA 42.8 NA
247 Ukraine UKR 24.0 25.5 25.0 NA
249 Uruguay URY 40.1 40.2 39.7 NA
250 'United States' USA NA NA 41.5 NA
256 Vietnam VNM 34.8 NA 35.3 NA
260 Kosovo XKX 27.3 26.4 26.5 NA
261 'Yemen, Rep.' YEM 36.7 NA NA NA
262 'South Africa' ZAF 63.0 NA NA NA
263 Zambia ZMB NA 57.1 NA NA", header = T)
回答3:
Another option with dplyr and tidyr is the following. I used df
by Maurits Evers. You reformat your data from a wide format to a long format with gather()
. Then, you define a group variable with Country.Name
. For each country, you get indices for non-NA values and choose the max index number. You use it to subset your data with slice()
.
gather(df, key = year, value = value, -Country.Name, -Country.Code) %>%
group_by(Country.Name) %>%
slice(max(which(!is.na(value))))
Country.Name Country.Code year value
<fct> <fct> <chr> <dbl>
1 Argentina ARG X2016 42.4
2 Armenia ARM X2016 32.5
3 Austria AUT X2015 30.5
4 Bangladesh BGD X2016 32.4
5 Belarus BLR X2016 27
6 Belgium BEL X2015 27.7
7 Benin BEN X2015 47.8
8 Bhutan BTN X2017 37.4
9 Bolivia BOL X2016 44.6
10 Bosnia and Herzegovina BIH X2015 32.7
... with 83 more rows
回答4:
Since you only care about those four years, a simple way to do it could be to check for an NA value for each year looking backwards and map to a separate column
df$mostRecent = NA
#Moving backwards, if most recent value is NA then check the previous year
df$mostRecent[is.na(df$mostRecent)] <- df$X2017[is.na(df$mostRecent)]
df$mostRecent[is.na(df$mostRecent)] <- df$X2016[is.na(df$mostRecent)]
df$mostRecent[is.na(df$mostRecent)] <- df$X2015[is.na(df$mostRecent)]
df$mostRecent[is.na(df$mostRecent)] <- df$X2014[is.na(df$mostRecent)]
来源:https://stackoverflow.com/questions/52684613/how-can-i-create-a-new-dataframe-comparing-values-and-getting-only-most-recent-d