In the NYPD shooting incident database, a variable is called STATISTICAL_MURDER_FLAG which indicates if this shooting incident is classified as a murder. It has a binary outcome “TRUE” or “FALSE”. We will use logistic regression model to identify if a shooting can be classified as a murder.
At first, import data from the NYPD shooting incident database
historic_NYPD_df = read_csv("./data/NYPD_Shooting_Incident_Data__Historic_.csv")
year_to_date_df = read_csv("./data/NYPD_Shooting_Incident_Data__Year_To_Date_.csv")
df = bind_rows(historic_NYPD_df,year_to_date_df)
df = df %>%
separate(OCCUR_TIME,into = c("hour","minute","second"),sep = ":")%>%
separate(OCCUR_DATE,into = c("month","date","year"),sep = "/")%>%
mutate(hour=as.numeric(hour),
time_period = case_when(
hour>=0 & hour<6 ~ "MIDNIGHT",
hour>=6 & hour<12 ~ "MORNING",
hour>=12 & hour<18 ~ "AFTERNOON",
hour>=18 & hour<=23 ~ "EVENING"),
month = as.factor(month))%>%
select(BORO,STATISTICAL_MURDER_FLAG,PERP_AGE_GROUP,PERP_SEX,PERP_RACE,VIC_AGE_GROUP,VIC_SEX,VIC_RACE,time_period,month)%>%
mutate(STATISTICAL_MURDER_FLAG = ifelse(STATISTICAL_MURDER_FLAG=="TRUE", 1, 0),
BORO = as.factor(BORO),
PERP_AGE_GROUP = as.factor(PERP_AGE_GROUP),
PERP_SEX = as.factor(PERP_SEX),
PERP_RACE = as.factor(PERP_RACE),
VIC_AGE_GROUP = as.factor(VIC_AGE_GROUP),
VIC_SEX = as.factor(VIC_SEX),
VIC_RACE = as.factor(VIC_RACE))%>%
na.omit()
there are some irregular value and we need to deal with them.
df=df[df$PERP_AGE_GROUP != "UNKNOWN", ]
df=df[df$PERP_AGE_GROUP != "1020", ]
df=df[df$PERP_AGE_GROUP != "940", ]
df=df[df$PERP_AGE_GROUP != "224", ]
df=df[df$PERP_SEX != "U", ]
df=df[df$PERP_RACE != "UNKNOWN", ]
df=df[df$VIC_SEX != "U", ]
df=df[df$VIC_AGE_GROUP != "UNKNOWN", ]
First, build the initial logistic regression model with all variables
fit = glm(STATISTICAL_MURDER_FLAG ~ BORO+PERP_AGE_GROUP+PERP_SEX+PERP_RACE+VIC_AGE_GROUP+VIC_SEX+VIC_RACE+time_period+month,data=df,family = binomial())
summary(fit)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
## PERP_SEX + PERP_RACE + VIC_AGE_GROUP + VIC_SEX + VIC_RACE +
## time_period + month, family = binomial(), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5206 -0.7615 -0.6757 -0.5437 2.0936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.53114 294.98715 -0.083 0.93372
## BOROBROOKLYN -0.14968 0.05431 -2.756 0.00585 **
## BOROMANHATTAN -0.15997 0.06977 -2.293 0.02187 *
## BOROQUEENS -0.12844 0.06798 -1.890 0.05882 .
## BOROSTATEN ISLAND -0.22599 0.11305 -1.999 0.04560 *
## PERP_AGE_GROUP18-24 0.09834 0.07788 1.263 0.20668
## PERP_AGE_GROUP25-44 0.37407 0.07962 4.698 2.63e-06 ***
## PERP_AGE_GROUP45-64 0.67019 0.12182 5.501 3.77e-08 ***
## PERP_AGE_GROUP65+ 0.85839 0.29406 2.919 0.00351 **
## PERP_SEXM -0.17752 0.12486 -1.422 0.15507
## PERP_RACEASIAN / PACIFIC ISLANDER 12.12249 229.62597 0.053 0.95790
## PERP_RACEBLACK 11.76403 229.62588 0.051 0.95914
## PERP_RACEBLACK HISPANIC 11.60375 229.62589 0.051 0.95970
## PERP_RACEWHITE 12.35674 229.62593 0.054 0.95708
## PERP_RACEWHITE HISPANIC 11.84585 229.62589 0.052 0.95886
## VIC_AGE_GROUP18-24 0.21315 0.08094 2.634 0.00845 **
## VIC_AGE_GROUP25-44 0.32565 0.08050 4.045 5.23e-05 ***
## VIC_AGE_GROUP45-64 0.32568 0.10666 3.053 0.00226 **
## VIC_AGE_GROUP65+ 0.53173 0.22826 2.329 0.01983 *
## VIC_SEXM -0.09107 0.06689 -1.361 0.17337
## VIC_RACEASIAN / PACIFIC ISLANDER 11.61377 185.17392 0.063 0.94999
## VIC_RACEBLACK 11.34832 185.17385 0.061 0.95113
## VIC_RACEBLACK HISPANIC 11.14533 185.17386 0.060 0.95201
## VIC_RACEUNKNOWN 10.98051 185.17433 0.059 0.95271
## VIC_RACEWHITE 11.41763 185.17389 0.062 0.95083
## VIC_RACEWHITE HISPANIC 11.47296 185.17386 0.062 0.95060
## time_periodEVENING 0.11915 0.05919 2.013 0.04411 *
## time_periodMIDNIGHT 0.14721 0.06183 2.381 0.01726 *
## time_periodMORNING 0.27665 0.08934 3.097 0.00196 **
## month02 -0.07987 0.12270 -0.651 0.51512
## month03 -0.10884 0.11545 -0.943 0.34582
## month04 0.07014 0.10866 0.645 0.51861
## month05 0.04562 0.10351 0.441 0.65938
## month06 -0.14814 0.10462 -1.416 0.15677
## month07 -0.07027 0.10264 -0.685 0.49360
## month08 -0.11833 0.10302 -1.149 0.25071
## month09 0.07111 0.10602 0.671 0.50239
## month10 -0.05511 0.11087 -0.497 0.61910
## month11 -0.02810 0.11497 -0.244 0.80692
## month12 0.15643 0.11312 1.383 0.16672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13593 on 12422 degrees of freedom
## Residual deviance: 13345 on 12383 degrees of freedom
## AIC: 13425
##
## Number of Fisher Scoring iterations: 11
we can see from the above result that some variables do not pass the significance test. Now, we can use the stepwise method to choose variables.
step(fit)
## Start: AIC=13425.06
## STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP + PERP_SEX +
## PERP_RACE + VIC_AGE_GROUP + VIC_SEX + VIC_RACE + time_period +
## month
##
## Df Deviance AIC
## - month 11 13362 13420
## - VIC_SEX 1 13347 13425
## - PERP_SEX 1 13347 13425
## <none> 13345 13425
## - BORO 4 13356 13428
## - time_period 3 13356 13430
## - VIC_RACE 6 13365 13433
## - VIC_AGE_GROUP 4 13365 13437
## - PERP_RACE 5 13370 13440
## - PERP_AGE_GROUP 4 13408 13480
##
## Step: AIC=13420.04
## STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP + PERP_SEX +
## PERP_RACE + VIC_AGE_GROUP + VIC_SEX + VIC_RACE + time_period
##
## Df Deviance AIC
## - VIC_SEX 1 13364 13420
## - PERP_SEX 1 13364 13420
## <none> 13362 13420
## - BORO 4 13373 13423
## - time_period 3 13372 13424
## - VIC_RACE 6 13382 13428
## - VIC_AGE_GROUP 4 13383 13433
## - PERP_RACE 5 13387 13435
## - PERP_AGE_GROUP 4 13424 13474
##
## Step: AIC=13419.75
## STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP + PERP_SEX +
## PERP_RACE + VIC_AGE_GROUP + VIC_RACE + time_period
##
## Df Deviance AIC
## - PERP_SEX 1 13366 13420
## <none> 13364 13420
## - BORO 4 13374 13422
## - time_period 3 13374 13424
## - VIC_RACE 6 13384 13428
## - VIC_AGE_GROUP 4 13384 13432
## - PERP_RACE 5 13389 13435
## - PERP_AGE_GROUP 4 13427 13475
##
## Step: AIC=13419.64
## STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP + PERP_RACE +
## VIC_AGE_GROUP + VIC_RACE + time_period
##
## Df Deviance AIC
## <none> 13366 13420
## - BORO 4 13376 13422
## - time_period 3 13376 13424
## - VIC_RACE 6 13386 13428
## - VIC_AGE_GROUP 4 13387 13433
## - PERP_RACE 5 13391 13435
## - PERP_AGE_GROUP 4 13429 13475
##
## Call: glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
## PERP_RACE + VIC_AGE_GROUP + VIC_RACE + time_period, family = binomial(),
## data = df)
##
## Coefficients:
## (Intercept) BOROBROOKLYN
## -24.5963 -0.1516
## BOROMANHATTAN BOROQUEENS
## -0.1575 -0.1258
## BOROSTATEN ISLAND PERP_AGE_GROUP18-24
## -0.2108 0.0914
## PERP_AGE_GROUP25-44 PERP_AGE_GROUP45-64
## 0.3705 0.6618
## PERP_AGE_GROUP65+ PERP_RACEASIAN / PACIFIC ISLANDER
## 0.8322 11.9581
## PERP_RACEBLACK PERP_RACEBLACK HISPANIC
## 11.5806 11.4243
## PERP_RACEWHITE PERP_RACEWHITE HISPANIC
## 12.1770 11.6648
## VIC_AGE_GROUP18-24 VIC_AGE_GROUP25-44
## 0.2117 0.3249
## VIC_AGE_GROUP45-64 VIC_AGE_GROUP65+
## 0.3387 0.5701
## VIC_RACEASIAN / PACIFIC ISLANDER VIC_RACEBLACK
## 11.6048 11.3289
## VIC_RACEBLACK HISPANIC VIC_RACEUNKNOWN
## 11.1229 10.9427
## VIC_RACEWHITE VIC_RACEWHITE HISPANIC
## 11.4126 11.4548
## time_periodEVENING time_periodMIDNIGHT
## 0.1103 0.1295
## time_periodMORNING
## 0.2818
##
## Degrees of Freedom: 12422 Total (i.e. Null); 12396 Residual
## Null Deviance: 13590
## Residual Deviance: 13370 AIC: 13420
Now, we test the reduced model from the above result
fit_2 = glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
PERP_RACE + VIC_AGE_GROUP + VIC_RACE + time_period, family = binomial(),
data = df)
summary(fit_2)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
## PERP_RACE + VIC_AGE_GROUP + VIC_RACE + time_period, family = binomial(),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4753 -0.7608 -0.6741 -0.5528 2.0646
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.59626 294.62999 -0.083 0.93347
## BOROBROOKLYN -0.15162 0.05418 -2.799 0.00513 **
## BOROMANHATTAN -0.15754 0.06963 -2.263 0.02366 *
## BOROQUEENS -0.12585 0.06781 -1.856 0.06345 .
## BOROSTATEN ISLAND -0.21083 0.11273 -1.870 0.06145 .
## PERP_AGE_GROUP18-24 0.09140 0.07777 1.175 0.23990
## PERP_AGE_GROUP25-44 0.37055 0.07952 4.660 3.17e-06 ***
## PERP_AGE_GROUP45-64 0.66181 0.12152 5.446 5.15e-08 ***
## PERP_AGE_GROUP65+ 0.83216 0.29153 2.854 0.00431 **
## PERP_RACEASIAN / PACIFIC ISLANDER 11.95815 228.86346 0.052 0.95833
## PERP_RACEBLACK 11.58060 228.86337 0.051 0.95964
## PERP_RACEBLACK HISPANIC 11.42430 228.86338 0.050 0.96019
## PERP_RACEWHITE 12.17704 228.86342 0.053 0.95757
## PERP_RACEWHITE HISPANIC 11.66480 228.86337 0.051 0.95935
## VIC_AGE_GROUP18-24 0.21169 0.08055 2.628 0.00859 **
## VIC_AGE_GROUP25-44 0.32493 0.08016 4.054 5.04e-05 ***
## VIC_AGE_GROUP45-64 0.33869 0.10647 3.181 0.00147 **
## VIC_AGE_GROUP65+ 0.57005 0.22743 2.507 0.01219 *
## VIC_RACEASIAN / PACIFIC ISLANDER 11.60479 185.54895 0.063 0.95013
## VIC_RACEBLACK 11.32889 185.54888 0.061 0.95131
## VIC_RACEBLACK HISPANIC 11.12294 185.54889 0.060 0.95220
## VIC_RACEUNKNOWN 10.94271 185.54936 0.059 0.95297
## VIC_RACEWHITE 11.41265 185.54892 0.062 0.95096
## VIC_RACEWHITE HISPANIC 11.45482 185.54889 0.062 0.95077
## time_periodEVENING 0.11031 0.05901 1.869 0.06156 .
## time_periodMIDNIGHT 0.12948 0.06141 2.108 0.03501 *
## time_periodMORNING 0.28184 0.08915 3.161 0.00157 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13593 on 12422 degrees of freedom
## Residual deviance: 13366 on 12396 degrees of freedom
## AIC: 13420
##
## Number of Fisher Scoring iterations: 11
We can see that PERP_RACE and VIC_RACE still don’t pass the significance test, now we drop these two and run again
fit_3 = glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
VIC_AGE_GROUP + time_period, family = binomial(),
data = df)
summary(fit_3)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_AGE_GROUP +
## VIC_AGE_GROUP + time_period, family = binomial(), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2752 -0.7583 -0.6824 -0.5674 1.9931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.69171 0.10140 -16.683 < 2e-16 ***
## BOROBROOKLYN -0.14598 0.05199 -2.808 0.004990 **
## BOROMANHATTAN -0.15839 0.06935 -2.284 0.022369 *
## BOROQUEENS -0.08578 0.06639 -1.292 0.196360
## BOROSTATEN ISLAND -0.14688 0.11058 -1.328 0.184080
## PERP_AGE_GROUP18-24 0.09275 0.07765 1.194 0.232306
## PERP_AGE_GROUP25-44 0.38709 0.07932 4.880 1.06e-06 ***
## PERP_AGE_GROUP45-64 0.73046 0.12020 6.077 1.23e-09 ***
## PERP_AGE_GROUP65+ 0.99871 0.28496 3.505 0.000457 ***
## VIC_AGE_GROUP18-24 0.21832 0.08043 2.714 0.006640 **
## VIC_AGE_GROUP25-44 0.33706 0.07999 4.214 2.51e-05 ***
## VIC_AGE_GROUP45-64 0.37219 0.10571 3.521 0.000430 ***
## VIC_AGE_GROUP65+ 0.63876 0.22417 2.849 0.004380 **
## time_periodEVENING 0.11094 0.05880 1.887 0.059185 .
## time_periodMIDNIGHT 0.12776 0.06117 2.088 0.036759 *
## time_periodMORNING 0.28131 0.08879 3.168 0.001534 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13593 on 12422 degrees of freedom
## Residual deviance: 13424 on 12407 degrees of freedom
## AIC: 13456
##
## Number of Fisher Scoring iterations: 4
From the above result, we use dummy variables for BOROQUEENS,BOROSTATEN ISLAND,PERP_AGE_GROUP18-24
df$QUEENS = ifelse(df$BORO=="QUEENS",1,0)
df$STATEN = ifelse(df$BORO=="STATEN ISLAND",1,0)
df$BROOKLYN = ifelse(df$BORO=="BROOKLYN",1,0)
df$MANHATTAN = ifelse(df$BORO=="MANHATTAN",1,0)
df$PERP_18 =ifelse(df$PERP_AGE_GROUP=="18-24",1,0)
df$PERP_25 =ifelse(df$PERP_AGE_GROUP=="25-44",1,0)
df$PERP_45 =ifelse(df$PERP_AGE_GROUP=="45-64",1,0)
df$PERP_65 =ifelse(df$PERP_AGE_GROUP=="65+",1,0)
Run logistic regression with these dummy variables again and use stepwise method to filter.
fit_4 = glm(formula = STATISTICAL_MURDER_FLAG ~ QUEENS + STATEN + BROOKLYN + MANHATTAN + PERP_18 + PERP_25 + PERP_45 + PERP_65 +
VIC_AGE_GROUP + time_period, family = binomial(),
data = df)
summary(fit_4)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ QUEENS + STATEN + BROOKLYN +
## MANHATTAN + PERP_18 + PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP +
## time_period, family = binomial(), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2752 -0.7583 -0.6824 -0.5674 1.9931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.69171 0.10140 -16.683 < 2e-16 ***
## QUEENS -0.08578 0.06639 -1.292 0.196360
## STATEN -0.14688 0.11058 -1.328 0.184080
## BROOKLYN -0.14598 0.05199 -2.808 0.004990 **
## MANHATTAN -0.15839 0.06935 -2.284 0.022369 *
## PERP_18 0.09275 0.07765 1.194 0.232306
## PERP_25 0.38709 0.07932 4.880 1.06e-06 ***
## PERP_45 0.73046 0.12020 6.077 1.23e-09 ***
## PERP_65 0.99871 0.28496 3.505 0.000457 ***
## VIC_AGE_GROUP18-24 0.21832 0.08043 2.714 0.006640 **
## VIC_AGE_GROUP25-44 0.33706 0.07999 4.214 2.51e-05 ***
## VIC_AGE_GROUP45-64 0.37219 0.10571 3.521 0.000430 ***
## VIC_AGE_GROUP65+ 0.63876 0.22417 2.849 0.004380 **
## time_periodEVENING 0.11094 0.05880 1.887 0.059185 .
## time_periodMIDNIGHT 0.12776 0.06117 2.088 0.036759 *
## time_periodMORNING 0.28131 0.08879 3.168 0.001534 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13593 on 12422 degrees of freedom
## Residual deviance: 13424 on 12407 degrees of freedom
## AIC: 13456
##
## Number of Fisher Scoring iterations: 4
step(fit_4)
## Start: AIC=13455.9
## STATISTICAL_MURDER_FLAG ~ QUEENS + STATEN + BROOKLYN + MANHATTAN +
## PERP_18 + PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period
##
## Df Deviance AIC
## - PERP_18 1 13425 13455
## - QUEENS 1 13426 13456
## - STATEN 1 13426 13456
## <none> 13424 13456
## - MANHATTAN 1 13429 13459
## - time_period 3 13435 13461
## - BROOKLYN 1 13432 13462
## - PERP_65 1 13435 13465
## - VIC_AGE_GROUP 4 13448 13472
## - PERP_25 1 13449 13479
## - PERP_45 1 13460 13490
##
## Step: AIC=13455.34
## STATISTICAL_MURDER_FLAG ~ QUEENS + STATEN + BROOKLYN + MANHATTAN +
## PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period
##
## Df Deviance AIC
## - QUEENS 1 13427 13455
## - STATEN 1 13427 13455
## <none> 13425 13455
## - MANHATTAN 1 13431 13459
## - time_period 3 13436 13460
## - BROOKLYN 1 13433 13461
## - PERP_65 1 13436 13464
## - VIC_AGE_GROUP 4 13452 13474
## - PERP_45 1 13465 13493
## - PERP_25 1 13471 13499
##
## Step: AIC=13454.95
## STATISTICAL_MURDER_FLAG ~ STATEN + BROOKLYN + MANHATTAN + PERP_25 +
## PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period
##
## Df Deviance AIC
## - STATEN 1 13428 13454
## <none> 13427 13455
## - MANHATTAN 1 13431 13457
## - BROOKLYN 1 13433 13459
## - time_period 3 13438 13460
## - PERP_65 1 13437 13463
## - VIC_AGE_GROUP 4 13453 13473
## - PERP_45 1 13466 13492
## - PERP_25 1 13472 13498
##
## Step: AIC=13454.15
## STATISTICAL_MURDER_FLAG ~ BROOKLYN + MANHATTAN + PERP_25 + PERP_45 +
## PERP_65 + VIC_AGE_GROUP + time_period
##
## Df Deviance AIC
## <none> 13428 13454
## - MANHATTAN 1 13432 13456
## - BROOKLYN 1 13434 13458
## - time_period 3 13439 13459
## - PERP_65 1 13438 13462
## - VIC_AGE_GROUP 4 13454 13472
## - PERP_45 1 13467 13491
## - PERP_25 1 13473 13497
##
## Call: glm(formula = STATISTICAL_MURDER_FLAG ~ BROOKLYN + MANHATTAN +
## PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period,
## family = binomial(), data = df)
##
## Coefficients:
## (Intercept) BROOKLYN MANHATTAN
## -1.6644 -0.1089 -0.1221
## PERP_25 PERP_45 PERP_65
## 0.3088 0.6528 0.9053
## VIC_AGE_GROUP18-24 VIC_AGE_GROUP25-44 VIC_AGE_GROUP45-64
## 0.2287 0.3490 0.3814
## VIC_AGE_GROUP65+ time_periodEVENING time_periodMIDNIGHT
## 0.6483 0.1145 0.1293
## time_periodMORNING
## 0.2822
##
## Degrees of Freedom: 12422 Total (i.e. Null); 12410 Residual
## Null Deviance: 13590
## Residual Deviance: 13430 AIC: 13450
now,we can run the model for the last time
fit_5 = glm(formula = STATISTICAL_MURDER_FLAG ~ BROOKLYN + MANHATTAN +
PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period,
family = binomial(), data = df)
summary(fit_5)
##
## Call:
## glm(formula = STATISTICAL_MURDER_FLAG ~ BROOKLYN + MANHATTAN +
## PERP_25 + PERP_45 + PERP_65 + VIC_AGE_GROUP + time_period,
## family = binomial(), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2510 -0.7560 -0.6878 -0.5868 1.9705
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.66438 0.08477 -19.634 < 2e-16 ***
## BROOKLYN -0.10888 0.04665 -2.334 0.019581 *
## MANHATTAN -0.12209 0.06540 -1.867 0.061934 .
## PERP_25 0.30883 0.04583 6.739 1.60e-11 ***
## PERP_45 0.65281 0.10141 6.437 1.22e-10 ***
## PERP_65 0.90532 0.27732 3.265 0.001096 **
## VIC_AGE_GROUP18-24 0.22868 0.07965 2.871 0.004093 **
## VIC_AGE_GROUP25-44 0.34900 0.07904 4.416 1.01e-05 ***
## VIC_AGE_GROUP45-64 0.38141 0.10510 3.629 0.000285 ***
## VIC_AGE_GROUP65+ 0.64829 0.22390 2.895 0.003786 **
## time_periodEVENING 0.11452 0.05874 1.950 0.051220 .
## time_periodMIDNIGHT 0.12930 0.06107 2.117 0.034243 *
## time_periodMORNING 0.28217 0.08871 3.181 0.001469 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13593 on 12422 degrees of freedom
## Residual deviance: 13428 on 12410 degrees of freedom
## AIC: 13454
##
## Number of Fisher Scoring iterations: 4
This model seems good with all variables having significance. Now we test for the accuracy
ifMudrer = function(glm.model, threshold = 0.5) {
pre.pro = predict(glm.model, type = "response")
pre = if_else(pre.pro >= threshold, 1, 0)
}
recall_accuracy(df$STATISTICAL_MURDER_FLAG,ifMudrer(fit_5))
## [1] 0.7634227
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
roc(df$STATISTICAL_MURDER_FLAG, predict(fit_5, type = "response"), algorithm = 3) %>%
plot(auc.polygon = T, print.thres = T, print.auc = T)
The area under the curve can be used to quantitatively measure the effect of the model. The area under the ROC curve is denoted as AUC, so the larger AUC is, the better the model is. AUC for our model is 0.575 which is acceptable.
At first, we would like to see the significance for each variable
exp(fit_5$coefficients)
## (Intercept) BROOKLYN MANHATTAN PERP_25
## 0.1893076 0.8968336 0.8850719 1.3618291
## PERP_45 PERP_65 VIC_AGE_GROUP18-24 VIC_AGE_GROUP25-44
## 1.9209265 2.4727271 1.2569399 1.4176470
## VIC_AGE_GROUP45-64 VIC_AGE_GROUP65+ time_periodEVENING time_periodMIDNIGHT
## 1.4643475 1.9122647 1.1213333 1.1380352
## time_periodMORNING
## 1.3259992
With the above results, we can have the following conclusion:
-Shooting incidents happen in BROOKLYN and MANHATTAN are less likely to be a murder.
-When perpetrators’ age increases, the probability that this shooting incident to become a murder also increases. This conclusion holds for victims.
-Shooting incidents happen in the morning are more like to be a murder.
For example, we have two shooting incidents now. We will use this model to predict which one has a higher probability to be a murder.
–A 36-year-old white female was killed by a 20-year-old black male at 10pm in Brooklyn.
–A 48-year-old Asian male was killed by a 54-year-old white male at 10am in Queens.
test_data = data_frame(BROOKLYN = c(1,0) , MANHATTAN = c(0,0) ,PERP_25 = c(1,0), PERP_45 = c(0,1), PERP_65 = c(0,0) , VIC_AGE_GROUP = c("25-44","45-64"),VIC_RACE = c("WHITE","ASIAN"),VIC_SEX = c("F","M"),time_period=c("EVENING","MORNING"))
test_data
## # A tibble: 2 × 9
## BROOKLYN MANHATTAN PERP_25 PERP_45 PERP_65 VIC_AGE_GROUP VIC_RACE VIC_SEX
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 0 1 0 0 25-44 WHITE F
## 2 0 0 0 1 0 45-64 ASIAN M
## # … with 1 more variable: time_period <chr>
test_data$prob = predict(fit_5, newdata=test_data, type="response")
test_data$prob
## 1 2
## 0.2687603 0.4138679
From the prediction, we can see that the second one has a higher possibility to be a murder.