1 Introduction

As new predictive policing tools come out, it’s critical to provide sufficient contexts for geospatial risk models and take generalizability into consideration. In this project, I’ll look into a generalizable geospatial risk prediction model which foucus on crime. Specifically, I select the city of Chicago as the study area to further analyze the current gun violence in the city, and try to predict gun violence based on some variables which might affect the emergence of gun violence through the geospatial risk model.

All data in this project come from Chicago Open Data site (https://data.cityofchicago.org/).

2 Data Wrangling

2.1 Loading Chicago Data

In first section, I load the gun violence data as well as some boundaries of Chicago, including police districts, police beat boundaries, and the city boundary.

2.2 Visualizing Gun Violence in Chicago

By plotting point data and density, we can identify the basic pattern of gun violence in Chicago, which is concentrated in the west and middle-south part, the areas where the majority of non-white population live.

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = GunViolence, colour="red", size=0.01, show.legend = "point") +
  labs(title= "Gun Violence, Chicago - 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(GunViolence)), 
                 aes(X, Y, fill =after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Gun Violence") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

2.3 Creating a fishnet

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(GunViolence) %>% 
  mutate(countGunViolence = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countGunViolence = replace_na(countGunViolence, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countGunViolence), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of GunViolence for the fishnet") +
  mapTheme()

# For demo. requires updated mapview package
# xx <- mapview::mapview(crime_net, zcol = "countBurglaries")
# yy <- mapview::mapview(mutate(burglaries, ID = seq(1:n())))
# xx + yy

3 Feature Engineering

3.1 Modeling Spatial Features

Crime incidents often tend to occur in areas that go unnoticed, typically characterized by dereliction, poor sanitation, disrepair, inadequate lighting, and lax governance. Gun violence, in most cases, is concentrated on streets and in open spaces. To understand and model these patterns, I have selected eight risk factors related to the physical environment. These factors include 311 reports of abandoned cars, malfunctioning street lights, graffiti remediation, sanitation complaints, abandoned buildings, potholes, and garbage carts. Additionally, I’ve incorporated a neighborhood polygon layer and the locations of retail stores selling liquor to-go for a comprehensive analysis.

However, the quality and completeness of some of the data above can vary significantly and also change rapidly over time. Therefore, we should also draw attention to the reliability of the data source.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

# selected new features from open data Chicago.
PotHoles <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported-No-Duplica/bp3e-nw4d") %>%
  mutate(year = substr(CREATION.DATE,1,4)) %>% filter(year == "2018") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Pot_Holes")

GarbageCarts <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
    mutate(year = substr(creation_date,1,4)) %>% 
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Garbage_Carts")

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Then, those features are joined in fishnet and visualized spatially.

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation,PotHoles,GarbageCarts) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  left_join(fishnet, ., by = "uniqueID") %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol=4, top="Risk Factors by Fishnet"))

3.2 Nearest Neighbor Feature

In this part, I calculate the average nearest neighbor distance to hypothesize a smoother exposure relationship across space. By calculating the average distance from each grid (e.g., abandoned buildings, abandoned cars, graffiti, etc.) to its three nearest neighbors was calculated in order to understand, in a grid format, the distribution patterns of different elements over urban or geographic space. Variations in these distances may help to hypothesize whether risks or events are spatially evenly distributed or whether there is some smooth pattern of association. These results are important predictors in the geospatial risk regression model.

In this section, I compute the average nearest neighbor distance to explore the possibility of a more gradual exposure relationship across spatial areas. This involves calculating the average distance from each grid cell (representing elements like abandoned buildings, abandoned cars, graffiti, etc.) to its three closest neighboring grid cells, which further allows us to examine the distribution patterns of various elements across geographic space in a gridded format. These calculated distances will serve as crucial predictors in the geospatial risk regression model.

# convinience to reduce length of function names.
st_c    <- st_coordinates
st_coid <- st_centroid

## create NN from abandoned cars
  vars_net <-  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      PotHoles.nn=
        nn_function(st_c(st_coid(vars_net)), st_c(PotHoles),3),
      GarbageCarts.nn=
        nn_function(st_c(st_coid(vars_net)), st_c(PotHoles),3))

## Visualize the NN feature
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

ggplot() +
      geom_sf(data = vars_net.long.nn , aes(fill=value), colour=NA) +
      facet_wrap(~Variable, nrow=2)+
      scale_fill_viridis(name="NN Distance") +
      labs(title="Nearest Neighbors Distance") +
      mapTheme()

Then, I join the nearest neighbor features to the fishnet, after which I further join the fishnet to neighborhoods and districts.

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
    st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

# for live demo
# mapview::mapview(final_net, zcol = "District")

4 Exploreing the Spatial Process of Gun Violence

4.1 Local Moran’s I Statistics

I further analyze the local spatial process of gun violence in Chicago. The figure shows that relatively high values of I represent strong and statistically significant evidence of local clustering. Further evidence of this can be seen in the p-value (<= 0.05) and significant hotspot maps. This test provides insight into the scale, location and intensity of gun violence hotspots.

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# print(final_net.weights, zero.policy=TRUE)

## see ?localmoran
local_morans <- localmoran(final_net$countGunViolence, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Gun_Violence_Count = countGunViolence, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Gun Violence"))

4.2 Distance to Hot spot

Then I measure average nearest neighbor distance from each cell centroid to its nearest significant cluster(hot spot location).

# generates warning from NN
final_net <- final_net %>% 
  mutate(GunViolence.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(GunViolence.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           GunViolence.isSig == 1))), 
                       k = 1))

## What does k = 1 represent?

ggplot() +
      geom_sf(data = final_net, aes(fill=GunViolence.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Gun Violence NN Distance") +
      mapTheme()

4.3 Correlation Tests

Based on the Pearson R correlation analysis, most variables exhibit a positive correlation with gun violence. However, there are exceptions such as graffiti (r=0.03), liquor retail (r=0.1), potholes (r=0.13), and abandoned cars (r=0.15), which have a near-zero or weak positive correlation. In contrast, when evaluating the nearest neighbor distance of these variables, all demonstrate a negative relationship with gun violence.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District) %>%
    gather(Variable, Value, -countGunViolence)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countGunViolence, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countGunViolence)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Gun Violence count as a function of risk factors") +
  plotTheme()

ggplot(final_net, aes(x = countGunViolence)) +
  geom_histogram(fill = "gray40",color ="white") + 
  labs(title = "Distribution of Gun Violence Counts", 
       x = "Count of Gun Violence",  
       y = "Frequency")

5 Poisson Regression

Prepare for the regression: The first variable list contains only basic spatial features, and the second take significant hotspots of gun violence into consideration.

# View(crossValidate)

## define the variables we want

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "PotHoles.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "PotHoles.nn", "GarbageCarts.nn","GunViolence.isSig.dist")

5.1 Cross Validated Poisson Regression

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countGunViolence",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countGunViolence, Prediction, geometry)

## This hold out fold is 27 
## This hold out fold is 97 
## This hold out fold is 33 
## This hold out fold is 64 
## This hold out fold is 6 
## This hold out fold is 84 
## This hold out fold is 32 
## This hold out fold is 5 
## This hold out fold is 96 
## This hold out fold is 102 
## This hold out fold is 56 
## This hold out fold is 2 
## This hold out fold is 43 
## This hold out fold is 40 
## This hold out fold is 42 
## This hold out fold is 9 
## This hold out fold is 14 
## This hold out fold is 71 
## This hold out fold is 79 
## This hold out fold is 24 
## This hold out fold is 46 
## This hold out fold is 88 
## This hold out fold is 65 
## This hold out fold is 70 
## This hold out fold is 93 
## This hold out fold is 31 
## This hold out fold is 61 
## This hold out fold is 89 
## This hold out fold is 66 
## This hold out fold is 81 
## This hold out fold is 77 
## This hold out fold is 10 
## This hold out fold is 22 
## This hold out fold is 51 
## This hold out fold is 100 
## This hold out fold is 12 
## This hold out fold is 18 
## This hold out fold is 41 
## This hold out fold is 26 
## This hold out fold is 101 
## This hold out fold is 39 
## This hold out fold is 55 
## This hold out fold is 54 
## This hold out fold is 86 
## This hold out fold is 37 
## This hold out fold is 74 
## This hold out fold is 98 
## This hold out fold is 57 
## This hold out fold is 58 
## This hold out fold is 68 
## This hold out fold is 53 
## This hold out fold is 16 
## This hold out fold is 3 
## This hold out fold is 28 
## This hold out fold is 21 
## This hold out fold is 94 
## This hold out fold is 83 
## This hold out fold is 13 
## This hold out fold is 67 
## This hold out fold is 52 
## This hold out fold is 8 
## This hold out fold is 11 
## This hold out fold is 99 
## This hold out fold is 73 
## This hold out fold is 30 
## This hold out fold is 29 
## This hold out fold is 38 
## This hold out fold is 36 
## This hold out fold is 20 
## This hold out fold is 92 
## This hold out fold is 47 
## This hold out fold is 78 
## This hold out fold is 45 
## This hold out fold is 95 
## This hold out fold is 82 
## This hold out fold is 50 
## This hold out fold is 44 
## This hold out fold is 90 
## This hold out fold is 63 
## This hold out fold is 59 
## This hold out fold is 75 
## This hold out fold is 1 
## This hold out fold is 17 
## This hold out fold is 7 
## This hold out fold is 85 
## This hold out fold is 87 
## This hold out fold is 23 
## This hold out fold is 49 
## This hold out fold is 48 
## This hold out fold is 72 
## This hold out fold is 62 
## This hold out fold is 35 
## This hold out fold is 4 
## This hold out fold is 34 
## This hold out fold is 60 
## This hold out fold is 25 
## This hold out fold is 19 
## This hold out fold is 15 
## This hold out fold is 80 
## This hold out fold is 76 
## This hold out fold is 91 
## This hold out fold is 69

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countGunViolence",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countGunViolence, Prediction, geometry)

## This hold out fold is 27 
## This hold out fold is 97 
## This hold out fold is 33 
## This hold out fold is 64 
## This hold out fold is 6 
## This hold out fold is 84 
## This hold out fold is 32 
## This hold out fold is 5 
## This hold out fold is 96 
## This hold out fold is 102 
## This hold out fold is 56 
## This hold out fold is 2 
## This hold out fold is 43 
## This hold out fold is 40 
## This hold out fold is 42 
## This hold out fold is 9 
## This hold out fold is 14 
## This hold out fold is 71 
## This hold out fold is 79 
## This hold out fold is 24 
## This hold out fold is 46 
## This hold out fold is 88 
## This hold out fold is 65 
## This hold out fold is 70 
## This hold out fold is 93 
## This hold out fold is 31 
## This hold out fold is 61 
## This hold out fold is 89 
## This hold out fold is 66 
## This hold out fold is 81 
## This hold out fold is 77 
## This hold out fold is 10 
## This hold out fold is 22 
## This hold out fold is 51 
## This hold out fold is 100 
## This hold out fold is 12 
## This hold out fold is 18 
## This hold out fold is 41 
## This hold out fold is 26 
## This hold out fold is 101 
## This hold out fold is 39 
## This hold out fold is 55 
## This hold out fold is 54 
## This hold out fold is 86 
## This hold out fold is 37 
## This hold out fold is 74 
## This hold out fold is 98 
## This hold out fold is 57 
## This hold out fold is 58 
## This hold out fold is 68 
## This hold out fold is 53 
## This hold out fold is 16 
## This hold out fold is 3 
## This hold out fold is 28 
## This hold out fold is 21 
## This hold out fold is 94 
## This hold out fold is 83 
## This hold out fold is 13 
## This hold out fold is 67 
## This hold out fold is 52 
## This hold out fold is 8 
## This hold out fold is 11 
## This hold out fold is 99 
## This hold out fold is 73 
## This hold out fold is 30 
## This hold out fold is 29 
## This hold out fold is 38 
## This hold out fold is 36 
## This hold out fold is 20 
## This hold out fold is 92 
## This hold out fold is 47 
## This hold out fold is 78 
## This hold out fold is 45 
## This hold out fold is 95 
## This hold out fold is 82 
## This hold out fold is 50 
## This hold out fold is 44 
## This hold out fold is 90 
## This hold out fold is 63 
## This hold out fold is 59 
## This hold out fold is 75 
## This hold out fold is 1 
## This hold out fold is 17 
## This hold out fold is 7 
## This hold out fold is 85 
## This hold out fold is 87 
## This hold out fold is 23 
## This hold out fold is 49 
## This hold out fold is 48 
## This hold out fold is 72 
## This hold out fold is 62 
## This hold out fold is 35 
## This hold out fold is 4 
## This hold out fold is 34 
## This hold out fold is 60 
## This hold out fold is 25 
## This hold out fold is 19 
## This hold out fold is 15 
## This hold out fold is 80 
## This hold out fold is 76 
## This hold out fold is 91 
## This hold out fold is 69

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countGunViolence",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countGunViolence, Prediction, geometry)

## This hold out fold is Riverdale 
## This hold out fold is Hegewisch 
## This hold out fold is West Pullman 
## This hold out fold is South Deering 
## This hold out fold is Morgan Park 
## This hold out fold is Mount Greenwood 
## This hold out fold is Roseland 
## This hold out fold is Pullman 
## This hold out fold is East Side 
## This hold out fold is Beverly 
## This hold out fold is Washington Heights 
## This hold out fold is Burnside 
## This hold out fold is Calumet Heights 
## This hold out fold is Chatham 
## This hold out fold is South Chicago 
## This hold out fold is Auburn Gresham 
## This hold out fold is Ashburn 
## This hold out fold is Avalon Park 
## This hold out fold is West Lawn 
## This hold out fold is Grand Crossing 
## This hold out fold is South Shore 
## This hold out fold is Chicago Lawn 
## This hold out fold is Englewood 
## This hold out fold is Woodlawn 
## This hold out fold is Clearing 
## This hold out fold is Jackson Park 
## This hold out fold is Washington Park 
## This hold out fold is Garfield Ridge 
## This hold out fold is West Elsdon 
## This hold out fold is Gage Park 
## This hold out fold is Hyde Park 
## This hold out fold is New City 
## This hold out fold is Fuller Park 
## This hold out fold is Archer Heights 
## This hold out fold is Brighton Park 
## This hold out fold is Grand Boulevard 
## This hold out fold is Kenwood 
## This hold out fold is Oakland 
## This hold out fold is Little Village 
## This hold out fold is Mckinley Park 
## This hold out fold is Bridgeport 
## This hold out fold is Armour Square 
## This hold out fold is Douglas 
## This hold out fold is Lower West Side 
## This hold out fold is North Lawndale 
## This hold out fold is Chinatown 
## This hold out fold is Near South Side 
## This hold out fold is Museum Campus 
## This hold out fold is Little Italy, UIC 
## This hold out fold is West Loop 
## This hold out fold is Austin 
## This hold out fold is Printers Row 
## This hold out fold is Garfield Park 
## This hold out fold is Grant Park 
## This hold out fold is United Center 
## This hold out fold is Greektown 
## This hold out fold is Loop 
## This hold out fold is Millenium Park 
## This hold out fold is Humboldt Park 
## This hold out fold is West Town 
## This hold out fold is River North 
## This hold out fold is Streeterville 
## This hold out fold is Ukrainian Village 
## This hold out fold is East Village 
## This hold out fold is Rush & Division 
## This hold out fold is Wicker Park 
## This hold out fold is Gold Coast 
## This hold out fold is Galewood 
## This hold out fold is Old Town 
## This hold out fold is Lincoln Park 
## This hold out fold is Belmont Cragin 
## This hold out fold is Hermosa 
## This hold out fold is Logan Square 
## This hold out fold is Bucktown 
## This hold out fold is Montclare 
## This hold out fold is Sheffield & DePaul 
## This hold out fold is Dunning 
## This hold out fold is Avondale 
## This hold out fold is North Center 
## This hold out fold is Lake View 
## This hold out fold is Portage Park 
## This hold out fold is Irving Park 
## This hold out fold is Boystown 
## This hold out fold is Wrigleyville 
## This hold out fold is Uptown 
## This hold out fold is Albany Park 
## This hold out fold is Lincoln Square 
## This hold out fold is Norwood Park 
## This hold out fold is Jefferson Park 
## This hold out fold is Sauganash,Forest Glen 
## This hold out fold is North Park 
## This hold out fold is Andersonville 
## This hold out fold is Edgewater 
## This hold out fold is West Ridge 
## This hold out fold is Edison Park 
## This hold out fold is Rogers Park

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",                           
  dependentVariable = "countGunViolence",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countGunViolence, Prediction, geometry)

## This hold out fold is Riverdale 
## This hold out fold is Hegewisch 
## This hold out fold is West Pullman 
## This hold out fold is South Deering 
## This hold out fold is Morgan Park 
## This hold out fold is Mount Greenwood 
## This hold out fold is Roseland 
## This hold out fold is Pullman 
## This hold out fold is East Side 
## This hold out fold is Beverly 
## This hold out fold is Washington Heights 
## This hold out fold is Burnside 
## This hold out fold is Calumet Heights 
## This hold out fold is Chatham 
## This hold out fold is South Chicago 
## This hold out fold is Auburn Gresham 
## This hold out fold is Ashburn 
## This hold out fold is Avalon Park 
## This hold out fold is West Lawn 
## This hold out fold is Grand Crossing 
## This hold out fold is South Shore 
## This hold out fold is Chicago Lawn 
## This hold out fold is Englewood 
## This hold out fold is Woodlawn 
## This hold out fold is Clearing 
## This hold out fold is Jackson Park 
## This hold out fold is Washington Park 
## This hold out fold is Garfield Ridge 
## This hold out fold is West Elsdon 
## This hold out fold is Gage Park 
## This hold out fold is Hyde Park 
## This hold out fold is New City 
## This hold out fold is Fuller Park 
## This hold out fold is Archer Heights 
## This hold out fold is Brighton Park 
## This hold out fold is Grand Boulevard 
## This hold out fold is Kenwood 
## This hold out fold is Oakland 
## This hold out fold is Little Village 
## This hold out fold is Mckinley Park 
## This hold out fold is Bridgeport 
## This hold out fold is Armour Square 
## This hold out fold is Douglas 
## This hold out fold is Lower West Side 
## This hold out fold is North Lawndale 
## This hold out fold is Chinatown 
## This hold out fold is Near South Side 
## This hold out fold is Museum Campus 
## This hold out fold is Little Italy, UIC 
## This hold out fold is West Loop 
## This hold out fold is Austin 
## This hold out fold is Printers Row 
## This hold out fold is Garfield Park 
## This hold out fold is Grant Park 
## This hold out fold is United Center 
## This hold out fold is Greektown 
## This hold out fold is Loop 
## This hold out fold is Millenium Park 
## This hold out fold is Humboldt Park 
## This hold out fold is West Town 
## This hold out fold is River North 
## This hold out fold is Streeterville 
## This hold out fold is Ukrainian Village 
## This hold out fold is East Village 
## This hold out fold is Rush & Division 
## This hold out fold is Wicker Park 
## This hold out fold is Gold Coast 
## This hold out fold is Galewood 
## This hold out fold is Old Town 
## This hold out fold is Lincoln Park 
## This hold out fold is Belmont Cragin 
## This hold out fold is Hermosa 
## This hold out fold is Logan Square 
## This hold out fold is Bucktown 
## This hold out fold is Montclare 
## This hold out fold is Sheffield & DePaul 
## This hold out fold is Dunning 
## This hold out fold is Avondale 
## This hold out fold is North Center 
## This hold out fold is Lake View 
## This hold out fold is Portage Park 
## This hold out fold is Irving Park 
## This hold out fold is Boystown 
## This hold out fold is Wrigleyville 
## This hold out fold is Uptown 
## This hold out fold is Albany Park 
## This hold out fold is Lincoln Square 
## This hold out fold is Norwood Park 
## This hold out fold is Jefferson Park 
## This hold out fold is Sauganash,Forest Glen 
## This hold out fold is North Park 
## This hold out fold is Andersonville 
## This hold out fold is Edgewater 
## This hold out fold is West Ridge 
## This hold out fold is Edison Park 
## This hold out fold is Rogers Park

5.2 Accuracy & Generalizability

In this section, I Calculate Errors across space by random k-fold and spatial cross validation. Among them, the LOGO-CV assumes the local spatial process from all other neighborhoods generalizes to the hold-out. When the local spatial process is not accounted for (ie. Just Risk Factors), some neighborhood hold-outs have MAEs greater than 8 gun violence. However, those large errors disappear when the Spatial Process features are added.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countGunViolence,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countGunViolence,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countGunViolence,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countGunViolence,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf()

error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - countGunViolence, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T),
            SD_MAE =sd(Prediction - countGunViolence, na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  facet_wrap(~Regression) +  
  geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

The table below confirms our conclusion that the Spatial Process features improve the model. Comparatively, random k-fold has a better accuracy than spatial LOGO model. And the maps further visualize where the higher errors occur when the local spatial process is not accounted for. Not surprisingly, the largest errors are in the hotspot locations.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF")

Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	1.02	0.81
Random k-fold CV: Spatial Process	0.89	0.71
Spatial LOGO-CV: Just Risk Factors	2.52	2.17
Spatial LOGO-CV: Spatial Process	2.15	1.70

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Gun Violence Errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

5.3 Generalizability by Neighborhood Context

Further, race data by Census tract is imported to divide the region into majority white and majority non white and test the generalizability. Based on the results, it is evident that the model tends to underestimate the values in Majority_Non_White neighborhoods on average, while overestimating values in Majority_White neighborhoods. The Spatial Process model not only exhibits lower errors overall but also demonstrates a smaller disparity in errors across different neighborhood contexts.

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
  st_centroid() %>%
  st_join(tracts18) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context(spatial cross validation regression)") %>%
  kable_styling("striped", full_width = F)

Mean Error by neighborhood racial context(spatial cross validation regression)
Regression	Majority_Non_White	Majority_White
Spatial LOGO-CV: Just Risk Factors	-1.2269386	1.2950664
Spatial LOGO-CV: Spatial Process	-0.6449861	0.6484019

reg.summary %>% 
  filter(str_detect(Regression, "k-fold")) %>%
  st_centroid() %>%
  st_join(tracts18) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean Error by neighborhood racial context(random k-fold)") %>%
  kable_styling("striped", full_width = F)

Mean Error by neighborhood racial context(random k-fold)
Regression	Majority_Non_White	Majority_White
Random k-fold CV: Just Risk Factors	-1.1533630	1.2634720
Random k-fold CV: Spatial Process	-0.5713985	0.6233398

5.4 Density vs Predictions

The following plot demonstrates the computation of kernel density of the gun violence in 2017 at different search radii. The region of higher kernel density is the hotspot.

# demo of kernel width
Gun_ppp <- as.ppp(st_coordinates(GunViolence), W = st_bbox(final_net))
Gun_KD.1000 <- spatstat.explore::density.ppp(Gun_ppp, 1000)
Gun_KD.1500 <- spatstat.explore::density.ppp(Gun_ppp, 1500)
Gun_KD.2000 <- spatstat.explore::density.ppp(Gun_ppp, 2000)
Gun_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(Gun_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(Gun_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(Gun_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

Gun_KD.df$Legend <- factor(Gun_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=Gun_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii") +
  mapTheme(title_size = 14)

as.data.frame(Gun_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(GunViolence, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 Gun Violence") +
     mapTheme(title_size = 14)

GunViolence18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
  filter(str_detect(Description, "HANDGUN")) %>%
        dplyr::select(-Date, -Updated.On) %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

Gun_KDE_sum <- as.data.frame(Gun_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(Gun_KDE_sum$value, 
                             n = 5, "fisher")
Gun_KDE_sf <- Gun_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(GunViolence18) %>% mutate(countGunViolence = 1), ., sum) %>%
    mutate(countGunViolence = replace_na(countGunViolence, 0))) %>%
  dplyr::select(label, Risk_Category, countGunViolence)

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
Gun_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(GunViolence18) %>% mutate(countGunViolence = 1), ., sum) %>%
      mutate(countGunViolence = replace_na(countGunViolence, 0))) %>%
  dplyr::select(label,Risk_Category, countGunViolence)

The map is generated of the risk categories for both model types with a sample of gun violence points overlaid. As we can see, the model result matches better with the 2018 gun violence cases. Specifically, the highest risk category is uniquely targeted to places with a high density of gun violence points, indicating that the risk precition model fits well than the kernel density in terms of predictions.

rbind(Gun_KDE_sf, Gun_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(GunViolence18, 3000), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 Gun Violence risk predictions; 2018 Gun Violence") +
    mapTheme(title_size = 14)

Finally, I calculate the rate of 2018 gun violence points by risk category and model type. As is clearly shown, the risk predictions capture a greater share of 2018 gun violence in the higher risk category(4th, 5th) relative to the Kernel density.

rbind(Gun_KDE_sf, Gun_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countGunViolence = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countGunViolence / sum(countGunViolence)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2018 Gun Violence",
           y = "% of Test Set Gun Violence (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

6 Conclusion

In the project, I developed a generalizable geospatial risk prediction model which foucus on gun violence in the city of Chicago. In the model, I have selected eight risk factors related to the physical environment, which include 311 reports of abandoned cars, malfunctioning street lights, graffiti remediation, sanitation complaints, abandoned buildings, potholes, garbage carts and the locations of retail stores selling liquor to-go for a comprehensive analysis.

Overall, I would recommend my geospatial risk model as a new predictive policing tool in practice, and my reasons are listed as follows: 1.Improved Predictive Accuracy: My risk prediction model outperforms the traditional kernel density hotspot model in terms of accuracy. It provides more precise predictions of gun violence incidents. 2.Consistency and Precision: The Mean Absolute Error (MAE) of my random k-fold regression model with spatial processes is 0.98, and the Standard Deviation of MAE (SD_MAE) is 0.72. These metrics demonstrate the model’s consistent and accurate performance across different samples. 3.Generalizability: The model’s performance is not limited to a specific context. It exhibits a smaller absolute mean error in both majority white and non-white neighborhoods when analyzed in the context of racial demographics. This suggests that the model has a degree of generalizability in its predictions.

However, it’s important to acknowledge some limitations in the model. There is an inherent selection bias in the choice of risk factors, as I may have been influenced by preconceived notions about what elements to include from specific groups or regions. This could result in a potential oversight of other contributing factors not considered. Additionally, the accuracy of the data used for these risk factors is subject to question. It’s crucial to ensure the data used is reliable and up-to-date for the model to provide accurate predictions. Furthermore, the model’s focus on physical environmental factors means it may not account for a comprehensive range of socio-economic elements that could influence gun violence. Future improvements to the model could involve incorporating a broader array of socio-economic indicators for a more holistic understanding.

Geospatial Risk Prediction

Ling Chen

10/19/2023