Stata: Extracting values and save them as scalars (and more)

匿名 (未验证) 提交于 2019-12-03 09:14:57

问题:

This question is a follow-up question from Stata: replace, if, forvalues. Consider this data:

set seed 123456 set obs 5000 g firmid = "firm" + string(_n)    /* Observation (firm) id */ g nw = floor(100*runiform())      /* Number of workers in a firm */ g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */ g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */ 

The first 10 observations are:

     +--------------------------------------+      | firmid   nw         lat          lon |      |--------------------------------------|   1. |  firm1   81   39.915526   -75.505018 |   2. |  firm2   35   39.548523   -75.201567 |   3. |  firm3   10   39.657866    -75.17988 |   4. |  firm4   83   39.957938   -75.898837 |   5. |  firm5   56   39.575881   -75.169157 |   6. |  firm6   73   39.886184   -75.857255 |   7. |  firm7   27    39.33288   -75.724665 |   8. |  firm8   75   39.165549    -75.96502 |   9. |  firm9   64   39.688819   -75.232764 |  10. | firm10   76   39.012228   -75.166272 |      +--------------------------------------+ 

I need to calculate the distances between firm 1 and all other firms. So, the vincenty command looks like:

. scalar theLat = 39.915526 . scalar theLon = -75.505018 . vincenty lat lon theLat theLon, hav(distance_km) inkm 

The vincenty command creates the distance_km variable that has distances between each observation and firm 1. Here, I manually copy and paste the two numbers that are 39.915526 and -75.505018.

Question 1: What's the syntax that extracts those numbers?

Now, I can keep observations where distances_km <= 2. And,

. egen near_nw_sum = sum(nw) 

will create the sum of workers within 2 kilometers of the firm 1. (Or, the collapse command may do the job.)

Question 2: I have to do this for all firms, and the final data should look like:

     +-----------------------------------------------------------------+      | firmid   nw         lat          lon            near_nw_sum     |      |-----------------------------------------------------------------|   1. |  firm1   81   39.915526   -75.505018  (# workers near firm1)    |   2. |  firm2   35   39.548523   -75.201567  (# workers near firm2)    |   3. |  firm3   10   39.657866    -75.17988  (# workers near firm3)    |   4. |  firm4   83   39.957938   -75.898837  (# workers near firm4)    |   5. |  firm5   56   39.575881   -75.169157  (# workers near firm5)    |   6. |  firm6   73   39.886184   -75.857255  (# workers near firm6)    |   7. |  firm7   27    39.33288   -75.724665  (# workers near firm7)    |   8. |  firm8   75   39.165549    -75.96502  (# workers near firm8)    |   9. |  firm9   64   39.688819   -75.232764  (# workers near firm9)    |  10. | firm10   76   39.012228   -75.166272  (# workers near firm10)   |      +-----------------------------------------------------------------+ 

Creating the near_nw_sum variable is my final goal. I need your help here for my weak data management skill.

回答1:

The following is basically the same strategy found here and is based on your "final goal". Again, it can be useful depending on the size of your original dataset.joinby creates observations so you may exceed the Stata limit. However, I believe it does what you want.

clear all set more off  set seed 123456 set obs 10 g firmid = _n   /* Observation (firm) id */ g nw = floor(100*runiform())      /* Number of workers in a firm */ g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */ g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */ gen dum = 1 list  * joinby procedure tempfile main save "`main'"  rename (firmid lat lon nw) =0 joinby dum using "`main'" drop dum  * Pretty print sort firmid0 firmid order firmid0 firmid list, sepby(firmid0)  * Uncomment if you do not want to include workers in the "base" firm. *drop if firmid0 == firmid  * Compute distance vincenty lat0 lon0 lat lon, hav(distance_km) inkm keep if distance_km <= 40 // an arbitrary distance list, sepby(firmid0)  * Compute workers of nearby-firms collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0) list 

What it does is form pairwise combinations of firms to compute distances and sum workers of nearby-firms. No need here to extract scalars as asked in Question 1. Also, no need to complicate the variable firmid converting to string.

The following overcomes the problem of the Stata limit on number of observations.

clear all set more off  * Create empty database gen x = . tempfile results save "`results'", replace  * Create input for exercise set seed 123456 set obs 500 g firmid = _n   /* Observation (firm) id */ g nw = floor(100*runiform())      /* Number of workers in a firm */ g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */ g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */ gen dum = 1 *list  * Save number of firms local size = _N display "`size'"  * joinby procedure tempfile main save "`main'"  timer clear 1 timer clear 2 timer clear 3 timer clear 4  quietly {     timer on 1     forvalues i=1/`size'{         timer on 2         use "`main'" in `i', clear // assumed sorted on firmid         rename (firmid lat lon nw) =0          joinby dum using "`main'", unmatched(using)         drop _merge dum         order firmid0 firmid         timer off 2          timer on 3         vincenty lat0 lon0 lat lon, hav(dist) inkm         timer off 3         keep if dist <= 40 // an arbitrary distance          timer on 4         collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)          append using "`results'"         save "`results'", replace         timer off 4     }     timer off 1 }  use "`results'", clear sort firmid0 drop x list  timer list 

However inefficicent, some testing using timer shows that most of the computation time goes into the vincenty command which you won't be able to escape. The following is the time (in seconds) for 10,000 observations with an Intel Core i5 processor and a conventional hard drive (not SSD). Timer 1 is the total while 2, 3, 4 are the components (approx.). Timer 3 corresponds to vincenty:

. timer list    1:   1953.99 /        1 =    1953.9940    2:    169.19 /    10000 =       0.0169    3:   1669.95 /    10000 =       0.1670    4:     94.47 /    10000 =       0.0094 

Of course, note that in both codes duplicate computations of distances are made (e.g. both the distances between firm1-firm2 and firm2-firm1 are computed) and this you can probably avoid. As it stands, for 110,000 observations it will take a long time. On the positive side, I noticed this second setup demands very little RAM as compared to the same amount of observations in the first setup. In fact, my 4GB machine freezes with the latter.

Also note that even though I use the same seed as you do, data is different because I create different numbers of observations (not 5000), which makes a difference in the variable creation process.

(By the way, if you wanted to save the value as a scalar you could use subscripting: scalar latitude = lat[1]).



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!