Count how many observations in the rest of the data fit multiple conditions? (R)

流过昼夜 submitted on 2019-12-23 04:26:39

Question


Friends,

I am new to R programming. I have been trying to write a user-defined function for days but have not nailed it yet. I have a dataset called event containing thousands of events (observations); I selected several rows to show you the data structure. It contains the "STATEid", the "date" of occurrence, and geographical coordinates in two variables, "LON" and "LAT".

I want to calculate a new variable (column) for each row. This new variable should be: "Given any specific incident, go through the rest of the dataset and count the number of events that happened in the same state, within a circle of 50/100 km radius, in the next 30/60 days."

tail(event[,c("STATEid", "date", "LON", "LAT")])
         STATEid       date        LON      LAT
23611       ohio 1968-04-08  -80.64952 41.09978
23612    arizona       <NA> -112.00000 33.00000
23613   michigan 1970-05-12  -83.61299 42.24115
23614   michigan 1969-02-20  -83.61299 42.24115
23615 california 1984-11-04 -121.61691 39.14045
23616   illinois 1979-09-29  -87.83285 42.44613

I have been writing functions like the ones below,

PostVio30 <- function(x) sum(event$viold[event$date <= x + 30 & event$date > x], na.rm = TRUE)
PostAct60 <- function(x) sum(event$CASE[event$date <= x + 60 & event$date > x], na.rm = TRUE)
PostVio60 <- function(x) sum(event$viold[event$date <= x + 60 & event$date > x], na.rm = TRUE)

but they do not calculate the value dynamically for each row.

The result is correct when I enter a specific date and state. For example, when I enter "Alabama" and "1966-1-1", it correctly tells me that 22 incidents occurred in the next 60 days. But how can I lapply/sapply/mapply it to each row? And how can I avoid manually entering the date/state information?

> POSTCOUNTING = function(ANYDATE, DATASET, N) {
+   {sum(DATASET$CASE[DATASET$date <= ANYDATE + N & DATASET$date>ANYDATE], na.rm=T)}
+ }
> PRECOUNTING = function(ANYDATE, DATASET, N) {
+   {sum(DATASET$CASE[DATASET$date < ANYDATE & DATASET$date>= ANYDATE - N], na.rm=T)}
+ }
> POSTCOUNTING(as.Date("1966-1-1"), X$alabama, 60)
[1] 22
> PRECOUNTING(as.Date("1966-1-1"), X$alabama, 60)
[1] 9
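One way to run a function like POSTCOUNTING for every row, rather than for a hand-typed date and state, is to look up each row's own state subset and pass in that row's own date. The following is only a minimal sketch: it assumes `event` has a `CASE` column as in the functions above, the toy data frame is invented, and the wrapper name `count_after` is hypothetical.

```r
# Toy data standing in for `event` (values are invented for illustration).
event <- data.frame(
  STATEid = c("alabama", "alabama", "alabama", "ohio", "ohio"),
  date    = as.Date(c("1966-01-01", "1966-01-20", "1966-04-01",
                      "1966-01-05", NA)),
  CASE    = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Same logic as POSTCOUNTING above: sum CASE over (ANYDATE, ANYDATE + N].
POSTCOUNTING <- function(ANYDATE, DATASET, N) {
  sum(DATASET$CASE[DATASET$date <= ANYDATE + N & DATASET$date > ANYDATE],
      na.rm = TRUE)
}

# Hypothetical wrapper: for row i, use that row's date and its state's rows.
count_after <- function(i, N) {
  same_state <- event[event$STATEid == event$STATEid[i], ]
  POSTCOUNTING(event$date[i], same_state, N)
}

# One value per row; no date or state is typed in by hand.
event$Post60 <- sapply(seq_len(nrow(event)), count_after, N = 60)
event$Post60
```

Rows with a missing date get 0 here, because `na.rm = TRUE` drops the NA comparisons; assign NA explicitly if that distinction matters. The approach compares every row against its whole state, so it may be slow on very large tables.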

Alternatively, I have tried to make the function easier to write, with fewer conditions. For example, I tried to avoid writing conditions on "STATEid" by splitting the data first:

X <- split(event, event$STATEid)
PostVio30 <- function(x) sum(event$viold[event$date <= x + 30 & event$date > x], na.rm = TRUE)
X2 <- lapply(X, function(i) {i$PostVio30 = sapply(i$date, PostVio30)})
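Two things appear to keep the split attempt above from working as intended: PostVio30 always reads the full `event` table rather than the per-state piece `i`, and the anonymous function returns the result of the assignment (just the new column) instead of the modified data frame. A minimal sketch of one possible fix follows; the toy data and the helper name `post_vio` are invented for illustration.

```r
# Toy data standing in for `event` (values are invented for illustration).
event <- data.frame(
  STATEid = c("alabama", "alabama", "ohio", "ohio"),
  date    = as.Date(c("1966-01-01", "1966-01-15", "1966-01-10", "1966-03-01")),
  viold   = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical variant of PostVio30 that takes the data to search as an
# argument instead of always reading the full `event` table.
post_vio <- function(x, d, n) {
  sum(d$viold[d$date <= x + n & d$date > x], na.rm = TRUE)
}

X <- split(event, event$STATEid)
X2 <- lapply(X, function(i) {
  i$PostVio30 <- sapply(i$date, post_vio, d = i, n = 30)
  i  # return the whole data frame, not just the new column
})

# Stack the per-state pieces back into one data frame.
result <- do.call(rbind, X2)
result$PostVio30
```

On this toy data the first Alabama row gets 1. The version that reads the global `event` would give 2 for that row, because it also counts the Ohio event on 1966-01-10.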

So I am here trying to learn from your wisdom. If you want I can share the data to give you a reproducible file.

Also, the geographical distance calculation is somewhat tricky for me as well. This page mentions a function called gdist; might that be suitable?

(Loop over a data.table rows with condition)

locations[, if (gdist(-159.58, 21.901, location_lon, location_lat, units="m") <= 50) .SD, id]
##    id location_lon location_lat
## 1: 11      -159.58       21.901
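For reference, the three conditions (same state, within a radius, within a date window) can be combined in one logical vector per row. The sketch below is not the gdist approach quoted above: to stay free of package dependencies it swaps in a hand-written haversine formula on a spherical Earth, which is an approximation. The toy data and the names `haversine_km` and `count_nearby_after` are invented for illustration.

```r
# Great-circle distance in km via the haversine formula (spherical Earth,
# mean radius 6371 km; an approximation, not an ellipsoidal calculation).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

# Toy data standing in for `event` (values are invented for illustration).
event <- data.frame(
  STATEid = c("michigan", "michigan", "michigan", "ohio"),
  date    = as.Date(c("1969-02-20", "1969-03-01", "1969-03-05", "1969-02-25")),
  LON     = c(-83.61, -83.75, -85.67, -83.56),
  LAT     = c(42.24, 42.28, 42.96, 41.66),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of OTHER events in the same state, within
# `km` kilometres, dated in (date_i, date_i + days].
count_nearby_after <- function(i, d, km, days) {
  dist <- haversine_km(d$LON[i], d$LAT[i], d$LON, d$LAT)
  hit <- d$STATEid == d$STATEid[i] &
    dist <= km &
    d$date > d$date[i] &
    d$date <= d$date[i] + days
  sum(hit, na.rm = TRUE)
}

event$Post30_50km <- sapply(seq_len(nrow(event)), count_nearby_after,
                            d = event, km = 50, days = 30)
event$Post30_50km
```

Because the date condition is strictly "after", a row never counts itself. Changing `km` and `days` gives the 100 km and 60 day variants without writing a new function for each.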

Thanks so much.

[Replying to another thread: Yes - the coordinates can vary within a state. The incidents could happen in different towns.]

My dput output looks like this:

> dput(tail(event[,c("STATEid", "date", "LON", "LAT")]))
structure(list(STATEid = structure(c(36L, 3L, 23L, 23L, 5L, 14L
), .Label = c("alabama", "alaska", "arizona", "arkansas", "california", 
"colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(-633, NA, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -112, -83.6129939, -83.6129939, -121.6169108, 
    -87.8328505), LAT = c(41.0997803, 33, 42.2411499, 42.2411499, 
    39.1404477, 42.4461322)), .Names = c("STATEid", "date", "LON", 
"LAT"), row.names = 23611:23616, class = "data.frame")

Best,

Tom

(A quick update: problem solved. Please see this question, and thanks to all community members: R - How to vectorize with apply family function and avoid while/for loops in this case?)

Source: https://stackoverflow.com/questions/48332680/count-how-many-observations-in-the-rest-of-the-dat-fits-multiple-conditions-r
