library(corrplot)
library(corrr)
library(FNN)
library(ggcorrplot)
library(gifski)
library(grid)
library(gridExtra)
library(ggmap)
library(ggpubr) 
library(ggplot2)
library(ggpmisc)
library(ggstance)
library(gganimate)
library(geojsonio)
library(jtools) 
library(knitr)
library(kableExtra)
library(stargazer)
library(AICcmodavg)
library(lubridate)
library(leaflet)
library(plotROC)
library(pROC)
library(rmarkdown)
library(riem)
library(RSocrata)
library(sf)
library(sp)
library(spdep)
library(spatstat)
library(tableHTML)
library(tigris)
library(tidycensus)
library(viridis)
library(magick)
library(rgdal)
library(tidyverse)

options(scipen=999, tigris_class = "sf")
mapTheme <- theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank())

# setting palettes
palette2 <- c("#efb366","#bbefa9")
palette3 <- c("#ffe15c","#efb366","#bbefa9")
palette5 <- c("#264653","#2a9d8f","#e9c46a","#f4a261","#e76f51")
palette7 <- c("#355070","#6d597a","#915f78","#b56576","#e77c76", "#e88c7d","#eaac8b")
palette9 <- c("#b98b73","#cb997e","#ddbea9","#ffe8d6","#d4c7b0","#b7b7a4","#a5a58d","#6b705c","#3f4238")
palette11 <- c("#797d62","#9b9b7a","#baa587","#d9ae94","#aeb0a4","#d8e2dc", "#ffcb69", "#e8ac65","#d08c60","#b58463", "#997b66")
palette11_1 <- c("#264653","#287271","#2a9d8f","#8ab17d","#babb74","#e9c46a", "#efb366", "#f4a261","#ee8959","#e76f51", "#862906")
palette11_2<- c("#283c55","#355070","#6d597a","#915f78","#b56576","#e56b6f", "#d19999", "#e77c76", "#e88c7d","#eaac8b","#eebba0")

#import functions
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

1.Introduction

This project forecasts origin and destination delays for commuter and intercity trains, such as NJ Transit and Amtrak, around New York City.

Click the picture to watch our presentation video.

The NJ Transit system is a non-cyclical rail network that operates 11 lines serving 162 train stations in New Jersey and connects travelers to New York Penn Station. The region NJ Transit serves has complex system dynamics that affect thousands of people every day. Train delays disrupt many people’s travel and schedules, so a good prediction would give passengers enough lead time to account for delays and plan ahead.

Delay is the extra time it takes a train to run its route, caused by many factors such as weather, stations, lines, and equipment. A delay affects not only that train’s operation but also spreads along the section, causing other trains to run late. Train delays also hold passengers for long periods and bring inconvenience. We want to study these trends and offer a better understanding of the principal factors that contribute to train delays. Our goal, therefore, is a reliable prediction of station delay that helps dispatchers estimate train operation status and make sound dispatching decisions, improving the operation and service quality of rail transit.

2.Use Case

The system faces a serious train delay problem. The NJ Transit website clearly shows the recent train delays, which happen every day. Train delay is a harsh truth that frustrates passengers: it is a problem every passenger faces as soon as they plan to take the train. Delays also hold passengers up for long stretches, bringing frustration and inconvenience.

Our app, Transpotting, is designed for passengers: it tells them whether a train will be delayed, helping them avoid unexpected delays. The app’s users could be anyone who regularly rides the train system in and around NYC, including commuters who take the train to work, tourists visiting the city, and business travelers who need to get around quickly.

2.1 Work Flow

Different professionals work together to support the app. The app’s datasets come from the train companies, and data analysts process, analyze, and visualize them so that users can understand the results easily. The transportation department can also improve traffic policy according to the analysis results. The goal of our app is to provide users with our predicted train arrival times so they can adjust their travel plans; we hope this builds users’ trust in traveling by train.

A reliable prediction of train delays offers users a better ride experience, which can increase rail ridership. The analysis reports also give train operators a better understanding of the reasons for delays, so they can improve train management under different conditions.

In addition, to put a machine learning model into the hands of a non-technical decision maker, the app should be easy to use. It has a user-friendly interface that lets decision makers access and use the model without technical expertise, along with clear instructions on how to use the model and interpret its predictions. The app also includes visualizations and summary statistics that help decision makers understand the predictions, and it lets them adjust parameters to meet their specific needs.

2.2 Interface

According to this use case, we focus on rail passengers. We design three main functions for them: real-time train information, train delay prediction, and customized reports. Users can click the customize button to get a report on the trains they are interested in.

Here is our user interface design. The main page shows the real-time status of trains and the predicted arrival time of each train. Users can zoom in and out on the map for the most intuitive view of train movements, and when they click the alert button they get a quick look at all delayed-train notifications.

Users can get a delay report for the train they searched for. The train report contains two main parts: predicted times and a historical timetable. It gives users a general summary of the train’s punctuality rate, its arrival times over one week, and its predicted arrival time.

In the account management panel, users can customize their upcoming trips and get the predicted delay, actual arrival time, and weather at each stop. They can also get a report of how long they spent waiting for trains this year. We hope everyone can save time with this app.

3.Method

As the issue exhibits very strong space-time dependencies, we create a space-time geospatial machine learning model. The dependent variable is train delay time. After the exploratory data analysis, we engineer both spatial and temporal features. A strong time series model rests on an understanding of the underlying temporal train operation process, and spatial effects such as inclement weather are natural initial features for predicting train delays.

Specifically, we use linear regression (the number of records is large enough) with fixed effects plus time lags and station lags to predict delays. We then compare the predicted data with the observed data and examine the spatial pattern of MAEs in the test set. Finally, we use cross-validation to assess whether the model is accurate and generalizable.
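As a preview of that evaluation step, below is a minimal cross-validation sketch assuming the final space-time panel assembled in Section 6. cv_mae is a hypothetical helper (not part of our pipeline), and its simple formula stands in for the richer specifications fit in Section 7.

# hypothetical helper: k-fold cross-validation, returning one out-of-fold MAE per fold
cv_mae <- function(panel, k = 5) {
  set.seed(508)
  folds <- sample(rep(1:k, length.out = nrow(panel)))
  sapply(1:k, function(i) {
    fit  <- lm(delay_minutes ~ hour(interval60) + dotw + from,
               data = panel[folds != i, ])
    pred <- predict(fit, newdata = panel[folds == i, ])
    mean(abs(pred - panel$delay_minutes[folds == i]), na.rm = TRUE)
  })
}
#cv_mae(Final.panel)  # note: station levels absent from a training fold would need handling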

4.Data

The available dataset is based on past records and reflects past outcomes. We analyze it to draw conclusions that support predictions for similar situations in the future and help companies or agencies make judgments. In this case, the datasets we used are as follows:

  • Granular performance data from 150k+ NJ Transit and Amtrak train trips by PRANAV BADAMI.

  • NJ Transit and Amtrak rail stations

  • NJ and NYC Census Data

  • Real-time weather data

  • Transportation Data

These datasets are then wrangled into a complete Final.panel of observations that includes every possible space-time combination.

4.1 Train Trips

In this case, we choose train trips from October through December 2019, a pre-pandemic period that includes the Thanksgiving and Christmas holidays, giving us 667,754 records. Inspection of the dataset revealed that some temporal fields needed processing, so we also bin the data into 60-minute intervals.

trips201910 <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_10.csv")
trips201911 <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_11.csv")
trips201912 <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_12.csv")

# get rid of NAs and derive temporal features with one shared helper,
# applied to the merged dataset and to each month
prep_trips <- function(df) {
  df %>%
    na.omit() %>%
    #change data type
    mutate(scheduled_time = as_datetime(scheduled_time, tz = "America/New_York"),
           actual_time = as_datetime(actual_time, tz = "America/New_York")) %>%
    #floor the (already parsed) scheduled time to the hour
    mutate(interval60 = floor_date(scheduled_time, unit = "hour"),
           #scheduled time: month, week, day, hour and minute
           month = month(interval60),
           week = week(interval60),
           day = day(interval60),
           dotw = wday(interval60, label=TRUE),
           hour = hour(interval60),
           minute = minute(scheduled_time),
           #time in a decimal format
           time = hour + minute/60,
           #actual time: month, week, day, hour and minute
           act_month = month(actual_time),
           act_week = week(actual_time),
           act_dotw = wday(actual_time, label=TRUE),
           act_day = day(actual_time),
           act_hour = hour(actual_time),
           act_minute = minute(actual_time), 
           #actual time in a decimal format
           act_time = act_hour + act_minute/60) 
}

# merge the three months, then process each dataset with the same pipeline
trips <- prep_trips(rbind(trips201910, trips201911, trips201912))
trips201910 <- prep_trips(trips201910)
trips201911 <- prep_trips(trips201911)
trips201912 <- prep_trips(trips201912)

4.2 Train Stations

Here, we add the NJ Transit and Amtrak stations. We then create trips.sf, attaching station latitude and longitude by merging the station data into the train trip data.

#NJ transits in New Jersey
NJstations <- st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Railroads_Network/Railroads_Network.shp") %>% 
  st_transform(crs = 4326) %>% 
  dplyr::select(OBJECTID, STATION_ID, geometry) 
## Reading layer `Railroads_Network' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Railroads_Network/Railroads_Network.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 167 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 300860.8 ymin: 193014.1 xmax: 634837.8 ymax: 961070.5
## Projected CRS: NAD83 / New Jersey (ftUS)
colnames(NJstations)[2] <- "STNNAME"
  
amtrak <- st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Amtrak_Stations-shp/Amtrak_Stations.shp") %>%
  st_transform(crs = 4326) %>% 
  filter(STATE == "NJ" | CITY2 == "New York") %>% 
  dplyr::select(OBJECTID, STNNAME, geometry) 
## Reading layer `Amtrak_Stations' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Amtrak_Stations-shp/Amtrak_Stations.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 529 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -123.1028 ymin: 25.84955 xmax: -69.96549 ymax: 48.72026
## Geodetic CRS:  WGS 84
stations <- rbind(NJstations, amtrak) %>% st_transform(crs = 4326)
trips.sf <- merge(trips, stations, by.x= "to", by.y="STNNAME")
trips.sf <- merge(trips.sf, stations, by.x= "from", by.y="STNNAME")
trips.sf <- trips.sf %>% 
  dplyr::mutate(to_lon = sf::st_coordinates(geometry.y)[,1],
                to_lat = sf::st_coordinates(geometry.y)[,2]) %>%
  mutate(from_lon = sf::st_coordinates(geometry.x)[,1],
         from_lat = sf::st_coordinates(geometry.x)[,2])
limit <- c(
  left = -75.11,
  bottom = 40.111689,
  right = -73.53,
  top = 41.302571)

basemap <- get_stamenmap(limit, zoom = 9, maptype = "toner-background")

#ggmap(map) + 
#  geom_point(data=trips.sf, 
#             mapping = aes(x=from_lon, y=from_lat, color=line),
#             size=0.8, 
#             alpha=0.3)+
#   labs(title="NJ Transit") +
#  theme_nice() +
#  mapTheme()

4.3 Census Data

In this report, we import New York and New Jersey census data, including transportation characteristics of the population such as the number of commuters and total public transportation use. We join this spatial information to the train trips as origin and destination data, and we also extract the census tracts for mapping and joining purposes, creating an sf object that contains geometry.

# one helper downloads the county-level ACS variables for a state,
# derives rates, and splits the county / state names out of NAME
get_state_census <- function(state) {
  get_acs(geography = "county", 
          variables = c("B01003_001", "B08006_012",
                        "B08013_001", "B08012_001", 
                        "B08301_001", "B08301_010"), 
          year = 2019, 
          state = state, 
          geometry = TRUE,
          output = "wide"
          ) %>%
    rename(Total_Pop =  B01003_001E,
           Rail_Pop = B08006_012E,
           Travel_Time = B08013_001E,
           Num_Commuters = B08012_001E,
           Means_of_Transport = B08301_001E,
           Total_Public_Trans = B08301_010E) %>%
    select(NAME, Total_Pop, Rail_Pop, Travel_Time, Num_Commuters, Means_of_Transport, Total_Public_Trans, GEOID, geometry) %>%
    mutate(Percent_Railpop = Rail_Pop / Total_Pop,
           Mean_Commute_Time = Travel_Time / Total_Public_Trans,
           Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport) %>%
    ## extract county and state name from NAME
    mutate(County.Name = sapply(strsplit(as.character(NAME),','), "[", 1),
           State.Name = trimws(sapply(strsplit(as.character(NAME),','), "[", 2), which = "left")) %>%
    select(., -c("NAME"))
}

NJcensus <- get_state_census("NJ")
NYcensus <- get_state_census("NY")
#combine census data together
censusdata <- rbind(NYcensus, NJcensus)
#extract census tracts
censustracts <- censusdata %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf %>%
  st_transform(crs=4326)

4.4 Transportation Data

These data come from the Bureau of Transportation Statistics. The dataset has 22 columns covering transportation conditions by county, including the percent of bridges in poor condition and the percent of resident workers who commute by transit.

Bridges data is compiled in the National Transportation Atlas Database (NTAD) using data from the Federal Highway Administration National Bridge Inventory (NBI). The dataset includes bridges located on public roads as of December 31, 2018, including Interstate Highways, U.S. highways, State and county roads, as well as publicly-accessible bridges on Federal and Tribal lands.

#transportation data(divided by county)
transdataall <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/County_Transportation_Profiles.csv") %>%
  select(., -c("Non.Commercial..Civil.Public.Use.Airports.and.Seaplane.base", "Non.Commercial..Other.Aerodromes","Total.Marinas","Total.Docks"))
transdata <- subset(transdataall, transdataall$State.Name == "New York" | transdataall$State.Name == "New Jersey" )

#merge census data with transportation data
census_merge <- merge(censusdata,transdata,by=c("County.Name","State.Name")) 
census_merge.st <- st_transform(x=census_merge,crs=4326)
#write_csv(census_merge,"/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/census_merge_original.csv")

4.5 Weather Data

New York City Weather Panel

We treat stations in NYC as destinations. Weather at the destination (fog, gusts, or other bad weather) may also affect a train’s departure time and in turn cause delays.

NYCweather.panel <- 
  riem_measures(station = "NYC", date_start = "2019-10-1", date_end = "2019-12-31") %>%
  dplyr::select(valid, tmpf, sknt, p01i, vsby, gust, ice_accretion_6hr)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Wind_speed = max(sknt),
              Precipitation = sum(p01i),
              Visibility = max(vsby),
              Gust = max(gust),
              Ice = max(ice_accretion_6hr)) %>%
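    # treat temperatures recorded as 0 as missing readings and impute 42F (a typical seasonal value)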
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
  ggplot(NYCweather.panel, aes(interval60,Temperature)) + geom_line(color = "#f4a261", linewidth = 0.3) +
  labs(title="Temperature",
       x="Hour", 
       y="Temperature") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NYCweather.panel, aes(interval60,Wind_speed)) + geom_line(color = "#f4a261", linewidth = 0.3) +
  labs(title="Wind Speed",
       x="Hour", 
       y="Wind Speed") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NYCweather.panel, aes(interval60,Precipitation)) + geom_line(color = "#f4a261", linewidth = 0.3) +
  labs(title="Precipitation",
       x="Hour", 
       y="Precipitation") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NYCweather.panel, aes(interval60, Visibility)) + geom_line(color = "#f4a261", linewidth = 0.2) + 
    labs(title="Visibility", x="Hour", y="Visibility", ) + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NYCweather.panel, aes(interval60, Gust)) + geom_line(color = "#f4a261", linewidth = 0.3) +
    labs(title="Gust", x="Hour", y="Gust") + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NYCweather.panel, aes(interval60, Ice)) + geom_line(color = "#f4a261", linewidth = 0.3) +
    labs(title="Ice Accretion over 6 Hours", x="Hour", y="Ice Accretion") + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  top = "Weather Data Per Hour - New York City - From Oct. to Dec., 2019") 

New Jersey Weather Panel

We use riem_measures to import weather data; in this case we extract weather data from Newark Liberty International Airport in New Jersey. This yields a second panel dataset, NJweather.panel, summarizing temperature, precipitation, wind speed, and other conditions per hour between October and December.

NJweather.panel <- 
  riem_measures(station = "EWR", date_start = "2019-10-1", date_end = "2019-12-31") %>%
  dplyr::select(valid, tmpf, sknt, p01i, vsby, gust, ice_accretion_6hr)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Wind_speed = max(sknt),
              Precipitation = sum(p01i),
              Visibility = max(vsby),
              Gust = max(gust),
              Ice = max(ice_accretion_6hr)) %>%
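    # treat temperatures recorded as 0 as missing readings and impute 42F (a typical seasonal value)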
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
  ggplot(NJweather.panel, aes(interval60,Temperature)) + geom_line(color = "#80ed99", linewidth = 0.3) +
  labs(title="Temperature",
       x="Hour", 
       y="Temperature") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NJweather.panel, aes(interval60,Wind_speed)) + geom_line(color = "#80ed99", linewidth = 0.3) +
  labs(title="Wind Speed",
       x="Hour", 
       y="Wind Speed") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NJweather.panel, aes(interval60,Precipitation)) + geom_line(color = "#80ed99", linewidth = 0.3) +
  labs(title="Precipitation",
       x="Hour", 
       y="Precipitation") +  theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NJweather.panel, aes(interval60, Visibility)) + geom_line(color = "#80ed99", linewidth = 0.2) + 
    labs(title="Visibility", x="Hour", y="Visibility", ) + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NJweather.panel, aes(interval60, Gust)) + geom_line(color = "#80ed99", linewidth = 0.3) +
    labs(title="Gust", x="Hour", y="Gust") + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  ggplot(NJweather.panel, aes(interval60, Ice)) + geom_line(color = "#80ed99", linewidth = 0.3) +
    labs(title="Ice Accretion over 6 Hours", x="Hour", y="Ice Accretion") + theme_nice() + 
    theme(axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(size=10),
          axis.title.x = element_blank()),
  top = "Weather Data Per Hour - New Jersey - From Oct. to Dec., 2019") 

5.Exploratory Data Analysis

In this section, we examine the distribution and characteristics of train delay times.

5.1 Delay Distribution

ggplot(subset(trips.sf, delay_minutes<=10))+
            geom_histogram(aes(delay_minutes), 
                           binwidth = 0.2, 
                           fill="#f7ebe1", 
                           color="#f4a261",
                           alpha=0.8)+
            labs(x="Minutes", 
                 y="Counts",
                   title="Delay Minutes - Less than 10 Minutes",
                   subtitle="On-time") +
  theme_nice()

ggplot(subset(trips.sf, delay_minutes>10))+
              geom_histogram(aes(delay_minutes), 
                             binwidth = 10, 
                             fill="#f7ebe1",
                             color="#f4a261",
                             alpha=0.4)+
              #xlim(0, 100) +
              labs(x="Minutes", 
                   y="Counts",
                   title="Delay Minutes - More Than 10 Minutes",
                   subtitle="Delayed") +
  theme_nice()

By Day

The following diagrams illustrate that Monday and Tuesday, the first two workdays, accumulate the most delay minutes. The charts also show that trains often run longer delays during workday rush hours, a pattern far less noticeable on weekends. We conclude that delay minutes correlate strongly with workdays and rush hours.

trips.sf %>%
  select(dotw, delay_minutes) %>%
  group_by(dotw) %>%
  summarise(sum_delay = sum(delay_minutes)) %>%
  ggplot(aes(y = dotw, x = sum_delay, fill = dotw)) +
  geom_bar(stat="identity",alpha = 0.6, width = 0.6) +
  scale_fill_manual(values = palette11_1) +
  labs(y = "Day",
       x = "Minutes",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       title = "Overall Delay Minutes By Day",
       fill = "Day") +
  theme_nice()

trips %>%
  select(hour, delay_minutes, dotw) %>%
  group_by(hour, dotw) %>%
  summarise(sum_delay = sum(delay_minutes)) %>%
  ggplot(aes(x = hour, y = sum_delay, color = dotw)) +
  scale_color_manual(values = palette11_1) +
  geom_line() +
  labs(title="Overall Delay Minutes by Day",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Hour", 
       y="Delay Minutes",
       color="Day") +
  theme(plot.title = element_text(size=14),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.subtitle = element_text(size = 10)) +
  theme_nice()
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.

By Month

The following plots show overall delay minutes over time from October to December. There is a rapid increase in late December; we speculate this peak occurred around the Christmas holiday. Overall, train delays happened every day of the study period, a sad story for passengers, although most delays were under 100 minutes.

#overall delay minutes by month
ggplot()+
  geom_line(data=trips201910, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#ffe15c","#efb366")) + 
  labs(title="Delay Minutes in October",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Time",
       y="Delay Minutes") +
  theme_nice()

ggplot()+
  geom_line(data=trips201911, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#efb366","#bbefa9")) + 
  labs(title="Delay Minutes in November",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Time",
       y="Delay Minutes") +
  theme_nice()

ggplot()+
  geom_line(data=trips201912, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#f7ede2","#bbefa9")) +
  labs(title="Delay Minutes in December",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Time",
       y="Delay Minutes") +
  theme_nice()

ggplot()+
  geom_line(data=trips, aes(x=interval60, y=delay_minutes, color=factor(month))) +
  scale_colour_manual(values = c("#f7ede2","#ffe15c","#efb366","#bbefa9")) +
  labs(title="Delay Minutes from Oct. to Dec.",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Time",
       y="Delay Minutes") +
  theme_nice()

By Line

The diagrams show that the Princeton Shuttle performed well among these lines, and that delays often occur during the morning and evening rush, which is especially frustrating for commuters.

trips.sf %>%
  select(line, delay_minutes) %>%
  group_by(line) %>%
  summarise(sum_delay = sum(delay_minutes)) %>%
  ggplot(aes(x = sum_delay , y=line, fill = line)) +
  geom_bar(stat="identity",alpha = 0.6, width = 0.6) +
  scale_fill_manual(values = palette11_1) +
  labs(y = "Line",
       x = "Minutes",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       title = "Overall Delay Minutes By Line",
       fill = "Line") +
  theme_nice()

trips %>%
  select(hour, delay_minutes, line) %>%
  group_by(hour, line) %>%
  summarise(sum_delay = sum(delay_minutes)) %>%
  ggplot(aes(x = hour, y = sum_delay, color = line)) +
  scale_color_manual(values = palette11_1) +
  geom_line() +
  labs(title="Overall Delay Minutes by Line",
       subtitle="NJ Transit in NJ and around NYC",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Hour", 
       y="Delay Minutes",
       color="Line") +
  theme(plot.title = element_text(size=14),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.subtitle = element_text(size = 10)) +
  theme_nice()
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.

trips %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, from, time_of_day) %>%
         tally()%>%
  group_by(from, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), fill="#f7ebe1", color="#f4a261", linewidth = 0.4, binwidth = 0.2)+
  labs(title="Mean Number of Hourly Departure Trains Per Station",
       subtitle = "in NJ Transits and Amtrak Stations during quarter 4 in 2019",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  theme_nice()
## `summarise()` has grouped output by 'from'. You can override using the
## `.groups` argument.

By Stations

These maps show the delays at each station on weekdays and weekends. Weekday delays are common at most stations.

ggmap(basemap) +
  geom_point(data = trips.sf %>% 
            mutate(hour = hour(actual_time),
                weekday = ifelse(dotw %in% c("Mon", "Tue","Wed","Thu", "Fri"),"Weekday","Weekend"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(to, to_lat, to_lon, weekday, time_of_day) %>%
              tally(),
            aes(x = to_lon, y = to_lat, color = n), 
            fill = "transparent", alpha = 0.4, size = 0.1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, 
                       option = "D",
                       name = "Sum of delay minutes") +
  facet_grid(weekday ~ time_of_day)+
  labs(title="Delay Minutes By Station",
       subtitle = "during quarter 4 in 2019",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
       ) +
    theme_nice() +
    theme(axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) 

The following time series diagrams show the delay time at each station on each line from October to December. We can compare stations and lines to figure out whether there are spatial patterns in the delay times.

#for (i in linelist)
#  {
#stationlist <- trips.sf %>% filter(line==i)
#stationlist <- unique(stationlist$to)

#for (j in stationlist){
#plot3 <- trips.sf %>%
#  st_drop_geometry()%>%
#  filter(line==i & to==j)%>%
#  na.omit()%>%
#  mutate(interval_hour = ymd_h(substr(scheduled_time, 1, 13)),
#        week = week(scheduled_time))%>%
#  filter(week%in%(40:52))%>%
#  group_by(interval_hour,date)%>%
#  summarise(AverageDelayInHour = mean(delay_minutes))%>%
#  ggplot(aes(interval_hour,AverageDelayInHour)) + 
#  geom_line(color="#f4a261",size=0.2)+
#  labs(title = paste("Station:",j,sep = ""),y="Delay Hour")+
#  theme_nice()

#  g3 <- arrangeGrob(plot3)
#   ggsave(path="/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line",paste(j,"_delay.png",sep = ""),g3)
#}
#}

#list.files(path="/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line",pattern = '*.png', full.names = TRUE) %>%
#  image_read() %>% 
#  image_join() %>% 
#  image_animate(fps=1) %>%
#  image_write("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line/delay.gif")

6.Feature Engineering

6.1 Panel

trips.sf <- merge(trips.sf, NJweather.panel)

Station Lags

Delays at an earlier station have a significant impact on punctuality at later stations: after being delayed at the previous station, it is difficult for a train to arrive at the next station on time. So we build “station lags” to take this factor into account. Arranging the trip data by stop sequence, we lag delay_minutes to capture the influence of the previous stations, choosing a lag range of 3.

#write.csv(station.panel,"stationpanel.csv")
# station panel
station.panel <- trips.sf %>%
  group_by(train_id,date) %>% arrange(stop_sequence) %>%
    mutate(stationlag_1 = dplyr::lag(delay_minutes,1),
     stationlag_2 = dplyr::lag(delay_minutes,2),
     stationlag_3 = dplyr::lag(delay_minutes,3)) %>%
  ungroup() 
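# the first stops of each trip have no prior station, so their lags come back NA; set them to 0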
station.panel[c("stationlag_1", "stationlag_2", "stationlag_3")][is.na(station.panel[c("stationlag_1", "stationlag_2", "stationlag_3")])] <- 0

Time Lags

To test for serial (temporal) correlation, additional feature engineering creates time lags: arrange sorts the data by station, then time; group_by groups observations into 60-minute intervals; and lag returns the delay_minutes for the previous nth time period.

# add time lags
timelags.panel <- 
  trips.sf %>% 
  arrange(from, interval60) %>% 
  group_by(interval60)%>% 
    mutate(
           lag1Hour = dplyr::lag(delay_minutes,1),
           lag2Hours = dplyr::lag(delay_minutes,2),
           lag3Hours = dplyr::lag(delay_minutes,3),
           lag4Hours = dplyr::lag(delay_minutes,4),
           lag5Hours = dplyr::lag(delay_minutes,5),
           lag6Hours = dplyr::lag(delay_minutes,6),
           lag12Hours = dplyr::lag(delay_minutes,12),
           lag1day = dplyr::lag(delay_minutes, 24)) %>% 
  ungroup()
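# observations without a prior record produce NA lags; set them to 0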
timelags.panel[c("lag1Hour", "lag2Hours", "lag3Hours","lag4Hours","lag5Hours","lag6Hours","lag12Hours","lag1day")][is.na(timelags.panel[c("lag1Hour", "lag2Hours", "lag3Hours","lag4Hours","lag5Hours","lag6Hours","lag12Hours","lag1day")])] <- 0

Final Panel

Finally, we merge those panels together to create Final.panel, which includes the station lags, the time lags, train.panel, and the census data.

# study panel 
study.panel <- 
  expand.grid(interval60 = unique(trips.sf$interval60), 
              from = unique(trips.sf$from)) %>%
  left_join(trips.sf)

# train panel
train.panel <- trips.sf %>%
  right_join(study.panel) %>%
  right_join(station.panel) %>%
  right_join(timelags.panel)

# add census data
train.panel.st <- st_as_sf(x = train.panel, 
                        coords = c("to_lon", "to_lat"),
                        crs = 4326)
# final panel
Final.panel <- st_join(train.panel.st,census_merge.st)

6.2 Serial Autocorrelation

These plots show a positive linear relationship between delay time and the time lags.

plotData.lag <-
  filter(as.data.frame(Final.panel),week == 40) %>%
  dplyr::select(starts_with("lag"), delay_minutes) %>%
  gather(Variable, Value, -delay_minutes) %>%
  mutate(Variable = fct_relevel(Variable, "lag1Hour","lag2Hours","lag3Hours","lag4Hours","lag5Hours","lag6Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, delay_minutes)) +
  geom_point(size = 0.05,
             alpha = 0.3,
             color = "black") +
  geom_text(data = correlation.lag, 
            aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, 
            y=Inf, 
            vjust = 2, 
            hjust = -.1,
            color = "#c77746") +
  geom_smooth(method = "lm", se = FALSE, size = 0.8, color = "#ECA063") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Train Delay Time as a Function of Time Lags",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle") +
  theme_nice() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6)) 

According to the plots, station lags have a strong linear correlation with delay minutes, especially the lag from the immediately preceding station.

plotData.lag <-
  filter(as.data.frame(Final.panel),week == 40) %>%
  dplyr::select(starts_with("stationlag"), delay_minutes) %>%
  gather(Variable, Value, -delay_minutes) %>%
  mutate(Variable = fct_relevel(Variable, "stationlag_1","stationlag_2","stationlag_3"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, delay_minutes)) +
  geom_point(size = 0.05,
             alpha = 0.3,
             color = "black") +
  geom_text(data = correlation.lag, 
            aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, 
            y=Inf, 
            vjust = 2, 
            hjust = -.1,
            color = "#c77746") +
  geom_smooth(method = "lm", se = FALSE, size = 0.8, color = "#ECA063") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Train Delay Time as a Function of Station Lags",
       caption = "Data: NYC Open Data, NJGIN Open Data & kaggle") +
  theme_nice() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6)) 

6.3 NJ Station Weather Correlation

Extreme weather can sometimes affect the entire train system. The weather data include temperature, wind speed, precipitation, visibility, gust, and ice accretion. We explored the correlation between weather and delay time: for departure stations in New Jersey, delay times are longer at higher wind and gust speeds and at higher precipitation.

#adding the weather features to the data
temper1 <- trips.sf  %>%
  group_by(Temperature) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Temperature,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Temperature,mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Temperature", y="Avg. Delay (min)") +
  theme_nice()

wind1 <- trips.sf %>%
  group_by(Wind_speed) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Wind_speed,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Wind_speed,mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Wind_speed", y="Avg. Delay (min)") +
  theme_nice()

precip1 <- trips.sf  %>%
  group_by(Precipitation) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Precipitation,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Precipitation,mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Precipitation", y="Avg. Delay (min)") +
  theme_nice()

visib1 <- trips.sf %>%
  group_by(Visibility) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Visibility,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Visibility,mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Visibility", y="Avg. Delay (min)") +
  theme_nice()

gust1 <- trips.sf %>%
  group_by(Gust) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Gust,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Gust,mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Gust", y="Avg. Delay (min)") +
  theme_nice()

ice1 <- trips.sf %>%
  group_by(Ice) %>%
  summarize(mean_delay = mean(delay_minutes)) %>%
  ggplot(aes(Ice,mean_delay)) +
  geom_point(size = 0.5)+
  stat_smooth(aes(Ice, mean_delay), 
              method = "lm", se = FALSE, linewidth = 0.7, color="#80ed99")+
  labs(x="Ice Accretion", y="Avg. Delay (min)") +
  theme_nice()
ggarrange(temper1, wind1, precip1, visib1, gust1, ice1, nrow = 2, ncol = 3)

6.4 Correlations Matrix

The correlation matrix below visualizes correlations across the numeric variables. Darker orange and darker green indicate stronger relationships between variables. The number of resident workers who work at home is highly correlated with the number of resident workers who commute within the county, and similar redundancies appear among the population variables, so we removed “Number.of.resident.workers.who.work.at.home” and “Total_Pop” from the predictors.

numvar <- Final.panel %>% 
  dplyr::select(Temperature, Wind_speed, Precipitation, Visibility, Gust, Ice,
                lag1Hour, lag2Hours, lag3Hours, lag4Hours, lag5Hours, lag12Hours, lag1day,
                Number.of.Bridges, Percent_Taking_Public_Trans,
                Number.of.resident.workers.who.work.at.home,
                Number.of.resident.workers.who.commute.within.county,
                stationlag_1, stationlag_2, stationlag_3,
                Percent_Railpop, Total_Pop, Total_Public_Trans) %>% 
  select_if(is.numeric) %>% na.omit() %>% ungroup()
numvar <- st_drop_geometry(numvar)

ggcorrplot(
outline.color = "white",
  round(cor(numvar), 1), 
  p.mat = cor_pmat(numvar),
  show.diag = TRUE,
  colors = c("#80ed99", "white", "#ECA063"),
  type="lower",
  tl.cex=9,
  pch.cex=10,
  insig = "blank", 
  lab = TRUE,
  lab_size = 1.5,
  lab_col = "grey40") +  
  theme_nice()+
    labs(title = "Correlation Matrix Across Numeric Variables") +
  theme(plot.title = element_text(size = 15, hjust = 1),
        axis.text.x = element_text(size = 5, hjust = 1,vjust = 1, angle = 45),
        axis.text.y = element_text(size = 5, hjust = 1))

7.Regression Model

We created seven regression models to identify the effects of spatial factors, temporal factors, and external features:

  • Model A focuses on just time effects, including temporal controls: hour fixed effects, day of the week.

  • Model B focuses on just space effects, with station fixed effects plus day of the week.

  • Model C includes both time and space fixed effects, and also contains weather.

  • Model D focuses on time lags, with both time and space fixed effects, weather, and transportation features.

  • Model E focuses on station lags, with both time and space fixed effects, weather, and transportation features.

  • Model F focuses on both time lags and station lags, also includes time and space fixed effects, then contains weather and transportation features.

  • Model G includes both time and space fixed effects, and also both time lags and station lags, then contains weather, transportation features and census factors.

7.1 Split the Data

Here, a time series approach is taken: training on nine weeks of data (weeks 40-48) and testing on the following four weeks (weeks 49-52). The code below splits the data by week.

delay.Train <- filter(Final.panel, week <= 48)
delay.Test <- filter(Final.panel, between(week, 49, 52))

7.2 Time series

In this part, we added Mondays (dotted lines) as well as Thanksgiving and Christmas (solid lines) to the graphic. Most weeks exhibit remarkably similar time series patterns, with consistent peaks and troughs, which suggests the presence of serial correlation. The weeks surrounding Thanksgiving and Christmas appear as clear outliers, and there is a noticeable peak in the testing set around December 20 that could influence the accuracy of our model.

tg   <- as.POSIXct("2019-11-28 00:00:00 UTC")
xmas <- as.POSIXct("2019-12-24 00:00:00 UTC")

mondays <- 
  mutate(Final.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 
st_drop_geometry(rbind(
  mutate(delay.Train, Legend = "Training"), 
  mutate(delay.Test, Legend = "Testing"))) %>%
    group_by(Legend, interval60) %>% 
      summarize(delay_minutes = mean(delay_minutes, na.rm = TRUE)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, delay_minutes, colour = Legend)) + geom_line(size = 0.2) +
        scale_colour_manual(values = palette2) +
        geom_vline(xintercept = tg, color = "#29424d") +
        geom_vline(xintercept = xmas, color = "#29424d") +
        geom_vline(data = mondays, linetype = "dotted", aes(xintercept = monday)) +
        labs(title="Overall Delay Minutes: October - December",
             subtitle="Dotted Lines for Mondays, Straight Lines for Thanksgiving and Christmas", 
             x="Day", y="Mean of Delay Minutes") + 
  theme_nice() +
  theme(panel.grid.major = element_blank())  

#reg1: Time_FE
#focuses on just time, including temporal controls: hour fixed effects, day of the week.
reg1 <- 
  lm(delay_minutes ~  hour(interval60) + dotw, data=delay.Train)

##reg2: Space_FE
#focuses on just space effects with the station fixed effects, plus day of the week.
reg2 <- 
  lm(delay_minutes  ~  from + dotw, data=delay.Train)

#reg3: Time_Space_FE_Weather 
#includes both time and space fixed effects, and weather.
reg3 <- 
  lm(delay_minutes  ~  from +  hour(interval60) + dotw + Temperature + Precipitation + Wind_speed + Visibility + Gust + Ice, data=delay.Train)

#reg4: Time_Space_FE_TimeLags_Weather_Transportation
#time and space fixed effects, with weather, time lag and transportation features.
reg4 <- 
  lm(delay_minutes  ~  from +  hour(interval60) + dotw + Temperature + Precipitation + Wind_speed + Visibility + Gust + Ice + lag1Hour + lag2Hours + lag3Hours + lag4Hours + lag5Hours + lag12Hours + lag1day + Number.of.Bridges, data=delay.Train)

#reg5: Time_Space_FE_StationLags_Weather_Transportation
#time and space fixed effects, with weather, station lag and transportation features.
reg5 <- 
  lm(delay_minutes  ~  from +  hour(interval60) + dotw + Temperature + Precipitation + Wind_speed + Visibility + Gust + Ice + stationlag_1 + stationlag_2 + stationlag_3 + Number.of.Bridges, data=delay.Train)

#reg6: Time_Space_FE_TimeLags_Weather_Transportation_StationLags
#time and space fixed effects, with weather, transportation features, time lag and station lag.
reg6 <- 
  lm(delay_minutes  ~  from +  hour(interval60) + dotw + Temperature + Precipitation + Wind_speed + Visibility + Gust + Ice + lag1Hour + lag2Hours + lag3Hours + lag4Hours + lag5Hours + lag12Hours + lag1day + Number.of.Bridges  + Percent_Taking_Public_Trans +  Number.of.resident.workers.who.commute.within.county + stationlag_1 + stationlag_2 + stationlag_3, data=delay.Train)

#reg7: Time_Space_FE_TimeLags_Weather_Transportation_StationLags_Census
#time and space fixed effects, with weather, transportation features, time lag, station lag and census features.
reg7 <- 
  lm(delay_minutes  ~  from +  hour(interval60) + dotw + Temperature + Precipitation + Wind_speed + Visibility + Gust + Ice + lag1Hour + lag2Hours + lag3Hours + lag4Hours + lag5Hours + lag12Hours + lag1day + Number.of.Bridges  + Percent_Taking_Public_Trans + Number.of.resident.workers.who.commute.within.county + stationlag_1 + stationlag_2 + stationlag_3 + Percent_Railpop, data=delay.Train)

7.3 Model Summary

Mathematically, we use some common evaluation metrics to examine the performance of the models (see the computation sketch below):

  1. Mean Absolute Error (MAE): the mean absolute error between observed and predicted values.

  2. R-squared (R2): the higher the R-squared, the better the model.

  3. Residual Standard Error (RSE): the lower the RSE, the better the model.

  4. AIC: the lower the AIC, the better the model.

From the summary table, Models E, F, and G have lower AIC values and much higher R-squared than Models A, B, C, and D. We conclude that adding station lags significantly improves model performance.
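As a minimal sketch, these metrics can be computed for one model on the held-out weeks roughly as follows; reg7 and delay.Test are defined above, and predict() may need care with station levels absent from the training set.

# test-set MAE plus in-sample R-squared, RSE, and corrected AIC for reg7
test.pred <- predict(reg7, newdata = delay.Test)
MAE <- mean(abs(test.pred - delay.Test$delay_minutes), na.rm = TRUE)
R2  <- summary(reg7)$r.squared   # in-sample R-squared
RSE <- summary(reg7)$sigma       # residual standard error
round(c(MAE = MAE, R2 = R2, RSE = RSE, AICc = AICc(reg7)), 2)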

stargazer(reg1, reg2, reg3, reg4, reg5,reg6,reg7,
          type = "html",
          title = "Seven Regression Models Predicting Train Delays in and around NYC",
          column.labels = c("Model A", "Model B", "Model C","Model D", "Model E", "Model F", "Model G"),
          colnames = FALSE,
          add.lines = list(c("Corrected AIC", round(AICc(reg1), 1), round(AICc(reg2), 1), round(AICc(reg3), 1),round(AICc(reg4), 1), round(AICc(reg5), 1), round(AICc(reg6), 1), round(AICc(reg7), 1))),
          out = "/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/table1.html")
[Table: Seven Regression Models Predicting Train Delays in and around NYC. Dependent variable: delay_minutes. Columns (1)-(7) correspond to Models A-G; rows give coefficients with standard errors for hour(interval60), day of week, and the station (from) fixed effects, among the other predictors.]
(0.204) (0.203) (0.202) (0.100) (0.101) (0.101)
fromMorris Plains -0.233 -0.328** -1.702*** 0.207*** -0.219*** -0.218***
(0.164) (0.163) (0.160) (0.079) (0.080) (0.080)
fromMorristown 0.272* 0.218 -1.440*** 0.654*** 0.173** 0.174**
(0.162) (0.161) (0.159) (0.079) (0.079) (0.079)
fromMount Olive -1.708*** -1.815*** -3.342*** 0.402*** -0.293** -0.338**
(0.268) (0.267) (0.264) (0.131) (0.131) (0.132)
fromMount Tabor -1.084*** -1.122*** -2.561*** 0.219** -0.232** -0.231**
(0.202) (0.201) (0.198) (0.098) (0.098) (0.098)
fromMountain Avenue 0.544*** 0.463** -1.090*** 0.617*** 0.208** 0.243**
(0.195) (0.194) (0.191) (0.095) (0.103) (0.103)
fromMountain Lakes 0.183 0.033 -1.652*** 0.183 -0.300** -0.299**
(0.300) (0.298) (0.293) (0.145) (0.145) (0.145)
fromMountain Station -0.121 -0.127 -1.619*** 0.613*** 0.206** 0.242**
(0.182) (0.181) (0.178) (0.088) (0.097) (0.097)
fromMurray Hill -1.787*** -1.790*** -3.068*** -0.565*** -1.166*** -1.050***
(0.198) (0.197) (0.196) (0.097) (0.099) (0.103)
fromNanuet -1.945*** -1.946*** -2.541*** 0.875*** 0.181* 0.090
(0.196) (0.196) (0.204) (0.101) (0.103) (0.105)
fromNetcong -0.091 -0.298 -1.607*** 0.422*** 0.017 0.017
(0.264) (0.263) (0.258) (0.128) (0.128) (0.128)
fromNetherwood -0.969*** -0.975*** -1.962*** 0.156* -0.384*** -0.268***
(0.187) (0.187) (0.186) (0.092) (0.094) (0.098)
fromNew Bridge Landing -1.679*** -1.949*** -3.376*** 3.519*** 3.406*** 3.426***
(0.254) (0.253) (0.248) (0.123) (0.124) (0.124)
fromNew Brunswick -2.169*** -2.153*** -3.451*** 0.029 -0.168** -0.134*
(0.155) (0.154) (0.152) (0.075) (0.075) (0.076)
fromNew Providence -0.590*** -0.591*** -1.481*** 1.345*** 0.825*** 0.939***
(0.198) (0.197) (0.196) (0.097) (0.099) (0.103)
fromNew York Penn Station -2.741*** -2.860*** -3.137*** 2.756*** 3.882*** 3.920***
(0.140) (0.139) (0.149) (0.074) (0.095) (0.096)
fromNewark Airport -0.789*** -0.819*** -1.676*** 0.660*** 0.292*** 0.360***
(0.142) (0.142) (0.140) (0.070) (0.075) (0.077)
fromNewark Broad Street -0.043 -0.072 -1.339*** 3.408*** 3.322*** 3.362***
(0.164) (0.163) (0.161) (0.080) (0.089) (0.089)
fromNewark Penn Station -1.145*** -1.233*** -2.386*** 2.542*** 2.269*** 2.318***
(0.144) (0.143) (0.141) (0.070) (0.078) (0.078)
fromNorth Branch 1.206*** 1.203*** 0.341 1.872*** 1.130*** 1.105***
(0.303) (0.301) (0.297) (0.148) (0.149) (0.149)
fromNorth Elizabeth -0.353** -0.344** -1.594*** 0.882*** 0.399*** 0.477***
(0.169) (0.168) (0.166) (0.083) (0.087) (0.089)
fromOradell -1.805*** -1.820*** -3.322*** 0.102 -0.008 0.012
(0.204) (0.203) (0.199) (0.099) (0.100) (0.100)
fromOrange 0.010 -0.030 -1.384*** 0.345*** -0.032 0.003
(0.157) (0.156) (0.154) (0.076) (0.086) (0.086)
fromOtisville 3.405*** 3.020*** 1.308*** 1.208*** 0.605*** 0.615***
(0.312) (0.310) (0.305) (0.151) (0.151) (0.151)
fromPark Ridge -1.272*** -1.269*** -3.105*** 0.415*** 0.242** 0.263***
(0.204) (0.203) (0.200) (0.099) (0.100) (0.100)
fromPassaic -0.732*** -0.784*** -1.620*** 0.563*** -0.042 -0.139
(0.185) (0.184) (0.187) (0.093) (0.094) (0.096)
fromPaterson -0.607*** -0.699*** -1.529*** 0.751*** 0.148 0.052
(0.185) (0.184) (0.187) (0.093) (0.094) (0.096)
fromPeapack -0.353* -0.400** -1.540*** 1.000*** 0.301*** 0.260***
(0.198) (0.197) (0.196) (0.097) (0.098) (0.099)
fromPearl River -1.646*** -1.675*** -2.762*** 0.407*** -0.014 -0.051
(0.199) (0.198) (0.197) (0.098) (0.098) (0.099)
fromPerth Amboy -0.141 -0.146 -1.695*** 0.291*** 0.067 0.102
(0.167) (0.167) (0.164) (0.081) (0.081) (0.082)
fromPlainfield -0.771*** -0.788*** -2.201*** 0.555*** 0.142 0.217**
(0.184) (0.184) (0.181) (0.090) (0.090) (0.092)
fromPlauderville -1.940*** -1.882*** -3.510*** 0.250** 0.113 0.133
(0.243) (0.242) (0.238) (0.118) (0.119) (0.119)
fromPort Jervis -3.520*** -3.048*** -4.053*** 1.982*** 1.443*** 1.452***
(0.238) (0.237) (0.233) (0.116) (0.116) (0.116)
fromPrinceton -4.729*** -4.770*** -5.549*** 0.865*** 0.327*** 0.351***
(0.188) (0.187) (0.187) (0.093) (0.093) (0.093)
fromRahway -0.364** -0.387*** -1.378*** 1.318*** 0.985*** 1.059***
(0.148) (0.147) (0.145) (0.072) (0.073) (0.075)
fromRaritan -1.983*** -1.888*** -2.805*** 1.592*** 0.912*** 0.870***
(0.182) (0.181) (0.181) (0.090) (0.091) (0.091)
fromRed Bank 1.801*** 1.646*** 0.257 0.574*** 0.264*** 0.232**
(0.208) (0.207) (0.204) (0.101) (0.101) (0.101)
fromRidgewood -0.262 -0.473** -2.221*** 0.655*** 0.509*** 0.529***
(0.187) (0.186) (0.183) (0.091) (0.092) (0.092)
fromRiver Edge -1.836*** -1.865*** -3.514*** -0.106 -0.240** -0.220**
(0.204) (0.203) (0.199) (0.099) (0.100) (0.100)
fromRoselle Park -0.762*** -0.780*** -1.809*** 0.613*** 0.059 0.175*
(0.188) (0.187) (0.186) (0.092) (0.094) (0.098)
fromRutherford 0.103 -0.094 -1.605*** 5.375*** 5.243*** 5.263***
(0.247) (0.246) (0.241) (0.120) (0.121) (0.121)
fromSalisbury Mills-Cornwall 1.952*** 1.901*** 0.534** 0.946*** 0.400*** 0.410***
(0.237) (0.235) (0.231) (0.115) (0.115) (0.115)
fromShort Hills 0.068 0.032 -1.442*** 0.449*** -0.068 0.009
(0.155) (0.154) (0.153) (0.076) (0.081) (0.083)
fromSloatsburg 1.406*** 1.379*** 0.067 0.864*** 0.193* 0.153
(0.220) (0.219) (0.219) (0.109) (0.109) (0.110)
fromSomerville -1.724*** -1.754*** -3.102*** -0.253*** -0.991*** -1.033***
(0.183) (0.183) (0.183) (0.090) (0.092) (0.092)
fromSouth Amboy 0.200 0.174 -1.467*** 1.049*** 0.753*** 0.759***
(0.164) (0.163) (0.160) (0.079) (0.080) (0.080)
fromSouth Orange 0.296* 0.301** -1.335*** 0.521*** 0.104 0.140*
(0.154) (0.153) (0.151) (0.075) (0.084) (0.085)
fromSpring Lake -0.126 -0.178 -1.930*** 1.261*** 0.864*** 0.831***
(0.209) (0.208) (0.205) (0.102) (0.102) (0.102)
fromSpring Valley -3.624*** -3.302*** -3.944*** 1.901*** 1.175*** 1.084***
(0.196) (0.195) (0.204) (0.102) (0.103) (0.105)
fromStirling -0.970*** -0.979*** -2.312*** -0.202** -0.624*** -0.624***
(0.198) (0.197) (0.193) (0.096) (0.096) (0.096)
fromSuffern -2.339*** -2.358*** -3.194*** 1.997*** 1.584*** 1.543***
(0.165) (0.164) (0.165) (0.082) (0.082) (0.083)
fromSummit -0.316** -0.320** -1.597*** 0.912*** 0.486*** 0.528***
(0.148) (0.148) (0.146) (0.072) (0.076) (0.077)
fromTowaco 0.032 -0.125 -1.722*** -0.096 -0.564*** -0.562***
(0.300) (0.298) (0.293) (0.145) (0.145) (0.145)
fromTrenton -3.176*** -3.062*** -3.828*** 1.170*** 0.652*** 0.675***
(0.152) (0.151) (0.151) (0.075) (0.076) (0.076)
fromTuxedo 1.625*** 1.595*** 1.153*** 0.916*** 0.416*** 0.376***
(0.220) (0.219) (0.219) (0.109) (0.109) (0.110)
fromUnion -1.325*** -1.346*** -2.422*** 0.724*** 0.285*** 0.360***
(0.188) (0.187) (0.184) (0.092) (0.096) (0.098)
fromUpper Montclair 0.614*** 0.541*** -0.940*** 0.829*** 0.433*** 0.468***
(0.181) (0.180) (0.177) (0.088) (0.096) (0.097)
fromWaldwick -2.068*** -2.048*** -3.542*** 0.278*** 0.169** 0.189**
(0.158) (0.157) (0.155) (0.077) (0.078) (0.078)
fromWalnut Street 2.294*** 2.000*** 0.614*** 2.992*** 2.621*** 2.655***
(0.224) (0.223) (0.219) (0.109) (0.116) (0.116)
fromWatchung Avenue 0.401** 0.318* -1.288*** 0.737*** 0.319*** 0.354***
(0.182) (0.181) (0.178) (0.089) (0.097) (0.097)
fromWesmont -0.959*** -1.048*** -2.772*** 0.416*** 0.271*** 0.291***
(0.196) (0.196) (0.192) (0.096) (0.096) (0.097)
fromWestfield -0.942*** -0.957*** -2.003*** 0.181** -0.372*** -0.257***
(0.184) (0.183) (0.183) (0.091) (0.093) (0.097)
fromWestwood -1.125*** -1.120*** -2.757*** 0.323*** 0.193* 0.213**
(0.204) (0.203) (0.200) (0.099) (0.100) (0.100)
fromWhite House -0.920*** -0.931*** -1.979*** -1.190*** -1.954*** -1.979***
(0.303) (0.301) (0.298) (0.148) (0.149) (0.149)
fromWoodbridge -0.399** -0.423** -1.778*** -0.210** -0.505*** -0.454***
(0.171) (0.170) (0.167) (0.083) (0.083) (0.084)
fromWoodcliff Lake -0.832*** -0.864*** -2.438*** 0.437*** 0.320*** 0.341***
(0.235) (0.233) (0.229) (0.114) (0.114) (0.115)
dotw.L 0.069* 0.030 0.052 0.027 -0.008 -0.013 -0.013
(0.038) (0.036) (0.037) (0.036) (0.018) (0.018) (0.018)
dotw.Q 1.039*** 1.007*** 1.087*** 0.661*** 0.169*** 0.094*** 0.094***
(0.037) (0.036) (0.037) (0.037) (0.018) (0.018) (0.018)
dotw.C 0.015 0.003 0.234*** 0.139*** 0.024 0.007 0.007
(0.034) (0.033) (0.035) (0.034) (0.017) (0.017) (0.017)
dotw4 0.516*** 0.470*** 0.507*** 0.305*** 0.072*** 0.037** 0.037**
(0.032) (0.030) (0.031) (0.030) (0.015) (0.015) (0.015)
dotw5 0.285*** 0.282*** 0.284*** 0.169*** 0.039*** 0.020 0.020
(0.031) (0.029) (0.029) (0.029) (0.014) (0.014) (0.014)
dotw6 -0.158*** -0.229*** -0.157*** -0.102*** -0.027* -0.019 -0.019
(0.030) (0.029) (0.029) (0.028) (0.014) (0.014) (0.014)
Temperature 0.004*** 0.003** 0.0003 -0.0001 -0.0001
(0.001) (0.001) (0.001) (0.001) (0.001)
Precipitation 0.532*** 0.333*** 0.040 0.011 0.011
(0.058) (0.057) (0.028) (0.028) (0.028)
Wind_speed -0.001 -0.001 0.0001 -0.0003 -0.0003
(0.003) (0.003) (0.001) (0.001) (0.001)
Visibility -0.008 -0.005 0.002 0.003 0.003
(0.008) (0.008) (0.004) (0.004) (0.004)
Gust 0.025*** 0.015*** 0.004*** 0.003*** 0.003***
(0.002) (0.002) (0.001) (0.001) (0.001)
Ice
lag1Hour 0.103*** 0.015*** 0.015***
(0.002) (0.001) (0.001)
lag2Hours 0.054*** 0.013*** 0.013***
(0.002) (0.001) (0.001)
lag3Hours 0.050*** 0.009*** 0.009***
(0.002) (0.001) (0.001)
lag4Hours 0.045*** 0.014*** 0.014***
(0.002) (0.001) (0.001)
lag5Hours 0.061*** 0.009*** 0.009***
(0.002) (0.001) (0.001)
lag12Hours 0.051*** 0.007*** 0.007***
(0.002) (0.001) (0.001)
lag1day 0.042*** 0.006*** 0.006***
(0.002) (0.001) (0.001)
stationlag_1 0.867*** 0.860*** 0.860***
(0.002) (0.002) (0.002)
stationlag_2 0.079*** 0.078*** 0.078***
(0.002) (0.002) (0.002)
stationlag_3 0.019*** 0.021*** 0.021***
(0.002) (0.002) (0.002)
Percent_Railpop -207.401***
(47.656)
Number.of.Bridges 0.003*** 0.002*** 0.002*** 0.001***
(0.0002) (0.0001) (0.0001) (0.0002)
Percent_Taking_Public_Trans 0.330 0.942***
(0.294) (0.326)
Number.of.resident.workers.who.commute.within.county -0.00000*** -0.00000***
(0.00000) (0.00000)
Constant 3.198*** 5.340*** 3.814*** 2.612*** -1.295*** -0.469*** -0.261**
(0.031) (0.121) (0.155) (0.200) (0.099) (0.103) (0.114)
Corrected AIC 2193204.7 2164887.2 2161761.3 2148064.7 1693492.6 1690881.6 1690864.7
Observations 325,757 325,757 325,757 325,757 325,757 325,757 325,757
R2 0.013 0.096 0.105 0.141 0.787 0.789 0.789
Adjusted R2 0.013 0.095 0.104 0.141 0.787 0.789 0.789
Residual Std. Error 7.010 (df = 325749) 6.711 (df = 325614) 6.678 (df = 325608) 6.539 (df = 325600) 3.255 (df = 325604) 3.242 (df = 325595) 3.242 (df = 325594)
F Statistic 610.752*** (df = 7; 325749) 243.182*** (df = 142; 325614) 256.872*** (df = 148; 325608) 343.904*** (df = 156; 325600) 7,929.611*** (df = 152; 325604) 7,563.181*** (df = 161; 325595) 7,517.026*** (df = 162; 325594)
Note: *p<0.1; **p<0.05; ***p<0.01

8.Accuracy

We use the mean absolute error (MAE) to examine the accuracy of our models. Unlike MAPE, MAE does not divide each error by the observed value; because the observed value, delay minutes, is very often exactly zero, MAPE would be undefined for many records, so we choose MAE as our accuracy metric. We use the purrr package to map the prediction function over nested data structures.
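As a quick reference, here is a minimal MAE helper with a worked example; this little function is our own illustration, not part of the original analysis.

mae <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)
# e.g. mae(c(0, 3, 10), c(1, 3, 6)) returns (1 + 0 + 4) / 3, about 1.67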

delay.Test.weekNest <- 
  delay.Test %>%
  nest(-week) 

# return predictions for one week of data from a fitted model
model_pred <- function(dat, fit){
   predict(fit, newdata = dat)
}

week_predictions <- 
  delay.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE_W = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags_w_t = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_stationLags_w_t = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_w_t_stationLags = map(.x = data, fit = reg6, .f = model_pred),
           GTime_Space_FE_timeLags_w_t_stationLags_c = map(.x = data, fit = reg7, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions 
## # A tibble: 28 × 8
##     week data               Regression       Predi…¹ Obser…² Absol…³   MAE sd_AE
##    <dbl> <list>             <chr>            <list>  <list>  <list>  <dbl> <dbl>
##  1    49 <sf [37,071 × 79]> ATime_FE         <dbl>   <dbl>   <dbl>    3.79  5.62
##  2    50 <sf [37,102 × 79]> ATime_FE         <dbl>   <dbl>   <dbl>    3.61  5.55
##  3    51 <sf [36,971 × 79]> ATime_FE         <dbl>   <dbl>   <dbl>    4.58 10.6 
##  4    52 <sf [34,055 × 79]> ATime_FE         <dbl>   <dbl>   <dbl>    3.69  5.42
##  5    49 <sf [37,071 × 79]> BSpace_FE        <dbl>   <dbl>   <dbl>    3.67  5.45
##  6    50 <sf [37,102 × 79]> BSpace_FE        <dbl>   <dbl>   <dbl>    3.51  5.38
##  7    51 <sf [36,971 × 79]> BSpace_FE        <dbl>   <dbl>   <dbl>    4.48 10.5 
##  8    52 <sf [34,055 × 79]> BSpace_FE        <dbl>   <dbl>   <dbl>    3.57  5.10
##  9    49 <sf [37,071 × 79]> CTime_Space_FE_W <dbl>   <dbl>   <dbl>    3.61  5.40
## 10    50 <sf [37,102 × 79]> CTime_Space_FE_W <dbl>   <dbl>   <dbl>    3.47  5.37
## # … with 18 more rows, and abbreviated variable names ¹​Prediction, ²​Observed,
## #   ³​Absolute_Error

8.1 MAE - Histogram

When we compare mean absolute errors across the seven models on the test set, we find that models E, F, and G perform best; these are the time-space fixed-effects models that include time lags and station lags.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), 
             position = "dodge", 
             stat="identity") +
    scale_fill_manual(values = palette11) +
    labs(title = "Mean Absolute Errors by model specification and week",
         subtitle = "in Test Set") +
  theme_nice()

8.2 Compare - Time Series

The following time-series line chart compares prediction performance across the 7 models and confirms the earlier estimate that regressions F and G generally have the best goodness of fit. We therefore choose Model G as our final model.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from = map(data, pull, from)) %>%
    dplyr::select(interval60, from, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -from) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = .6) +
      facet_wrap(~Regression, ncol=2) +
    scale_color_manual(values = palette2) +
      labs(title = "Predicted/Observed Train Delay Time Series", 
           subtitle = "A test set of 4 weeks",  
           x = "Hour", 
           y= "Delay Minutes") +
      theme_nice() +
  theme(axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 5),
        strip.text = element_text(size=7))

8.3 MAE Map

The following animated GIF map makes it clear that the mean absolute error drops sharply at Model E, once station lags enter the model. It indicates that the models with station lags (Models E, F, and G) perform better at predicting train delay time. The map also reveals spatial patterns in the MAEs.

# the seven MAE maps differ only in the regression name and the title letter,
# so one helper builds them all
mae_map <- function(reg_name, model_letter){
  week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon)) %>%
    select(interval60, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
    filter(Regression == reg_name) %>%
    group_by(from_id, from_lon, from_lat) %>%
    summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE)) %>%
    ggplot() +
      geom_sf(data = censustracts, color = "grey", fill = "transparent") +
      geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
                 fill = "transparent", size = 0.8, alpha = 0.5) +
      scale_colour_viridis(direction = -1,
                           discrete = FALSE, 
                           option = "D",
                           limits = c(0, 5)) +
      ylim(min(Final.panel$from_lat), max(Final.panel$from_lat)) +
      xlim(min(Final.panel$from_lon), max(Final.panel$from_lon)) +
      labs(title = paste("Mean Abs Error, Test Set, Model", model_letter)) +
      theme_nice() +
      theme(plot.title = element_text(size = 10),
            axis.text.x = element_blank(),
            axis.text.y = element_blank(),
            axis.title.x = element_blank(),
            axis.title.y = element_blank())
}

A  <- mae_map("ATime_FE", "A")
B  <- mae_map("BSpace_FE", "B")
C  <- mae_map("CTime_Space_FE_W", "C")
D  <- mae_map("DTime_Space_FE_timeLags_w_t", "D")
E  <- mae_map("ETime_Space_FE_stationLags_w_t", "E")
F1 <- mae_map("FTime_Space_FE_timeLags_w_t_stationLags", "F")
G  <- mae_map("GTime_Space_FE_timeLags_w_t_stationLags_c", "G")
# save each map as a PNG, then stitch the frames into an animated GIF
out_dir <- "/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/map"
maps <- list(A = A, B = B, C = C, D = D, E = E, F = F1, G = G)
for (nm in names(maps)) {
  ggsave(path = out_dir, filename = paste0(nm, ".png"), plot = arrangeGrob(maps[[nm]]))
}

list.files(path = out_dir, pattern = '*.png', full.names = TRUE) %>%
  image_read() %>% 
  image_join() %>% 
  image_animate(fps = 1) %>%
  image_write(file.path(out_dir, "MAE.gif"))

The MAE-by-station bar chart indicates that our model performs well at most stations. However, there are two peaks, at station ids 2 and 27.

spatialMAE<-
  week_predictions %>%
      mutate(interval60 = map(data, pull, interval60),
             from_id = map(data, pull, from_id), 
             line = map(data, pull, line)) %>%
      select(interval60, from_id, line, Observed, Prediction, Regression) %>%
      unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_w_t_stationLags_c")%>%
  group_by(from_id)%>%
  summarize(MAE=mean(abs(Observed-Prediction),na.rm=TRUE))%>%
  ggplot(aes(from_id, MAE)) + 
    geom_bar(position = "dodge", stat = "summary", fun = "mean", fill="#f7ebe1",size=0.3, color="#f4a261", width = 1) +
    scale_fill_manual(values = palette2) +
    labs(x="Station", y="MAE", 
         title = "MAE associations with station")+ theme(legend.position = "none") + 
    theme(axis.text.x = element_text(size = 5, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 6, hjust = 1))+ xlim(0, 170) +
  theme_nice()

spatialMAE

The average MAE of each line falls in the 0-2 minute range, which shows our model performs well on most lines, with one peak on the Atl. City Line. Combined with the MAE graphics above, we can be sure our model under-performs at predicting delay for certain stations and lines.

lineMAE<-
  week_predictions %>%
      mutate(interval60 = map(data, pull, interval60),
             from_id = map(data, pull, from_id), 
             line = map(data, pull, line)) %>%
      select(interval60, from_id, line, Observed, Prediction, Regression) %>%
      unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_w_t_stationLags_c")%>%
  group_by(line)%>%
  summarize(MAE=mean(abs(Observed-Prediction),na.rm=TRUE))%>%
  ggplot(aes(line, MAE)) + 
    geom_bar(position = "dodge", stat = "summary", fun = "mean", fill="#f7ebe1",size=0.3, color="#f4a261", width = 0.5) +
    scale_fill_manual(values = palette2) +
    labs(x="Line", y="MAE", 
         title = "MAE associations with the line") +
  theme_nice()+
    theme(legend.position = "none",
          axis.text.x = element_text(size = 5, angle = 20, hjust = 0.7),
        axis.text.y = element_text(size = 6, hjust = 1)) 
lineMAE

9.Evaluation

The model shows similar predictive accuracy across the different time periods, and it over-estimates delay time most of the time.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, from_id, from_lon, 
           from_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_w_t_stationLags_c")%>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(size=0.5, alpha=.3, aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), size=0.5, method = "lm", se = FALSE, color = "#ECA063")+
    geom_abline(slope = 1, intercept = 0, color = "grey")+
  facet_grid(~time_of_day)+
  labs(title="Observed vs Predicted",
       x="Observed Delay Minutes", 
       y="Predicted Delay Minutes")+
  theme_nice() +
  xlim(0, 150) +
  ylim(0, 100)

The model performs well in predicting short delays. For trains delayed more than 30 minutes, there are special factors we did not take into account that push the delay past 30 minutes.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon),
           delay_minutes = map(data, pull, delay_minutes)) %>%
    select(interval60, from_id, from_lon, 
           from_lat, delay_minutes, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_w_t_stationLags_c")%>%
  mutate(time_of_day = case_when(delay_minutes <= 5 ~ "a 0-5 minutes",
                                 delay_minutes > 5 & delay_minutes <= 10 ~ "b 5-10 minutes",
                                 delay_minutes > 10 & delay_minutes <= 15 ~ "c 10-15 minutes",
                                 delay_minutes > 15 & delay_minutes <= 20 ~ "d 15-20 minutes",
                                 delay_minutes > 20 & delay_minutes <= 30 ~ "e 20-30 minutes", 
                                 delay_minutes > 30 & delay_minutes <= 60 ~ "f 30-60 minutes",
                                 delay_minutes > 60 & delay_minutes <= 120 ~ "g 1-2 hours")) %>%
  group_by(from_id, time_of_day, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
  ggplot(.) +
  geom_sf(data = censustracts, color = "grey", fill = "transparent", size=0.01)+
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", size = 0.3, alpha = 0.5)+
  scale_colour_viridis(limits=c(0, 5),direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(Final.panel$from_lat), max(Final.panel$from_lat))+
  xlim(min(Final.panel$from_lon), max(Final.panel$from_lon))+
  facet_wrap(~time_of_day, ncol = 4)+
  labs(title="Mean Absolute Errors, Test Set")+
  theme_nice() +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        strip.text = element_text(size=6))

9.1 Cross Validation

We use cross-validation to ensure that the goodness-of-fit results for a single hold-out set are not a fluke. CV gives insight into how the model will generalize to an independent dataset. Because of the overwhelming amount of data, we run the cross-validation on one week of data.
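The chunk that runs the cross-validation is not echoed above its output. Below is a minimal sketch of the call, assuming the crossValidate() helper from the sourced functions.r (which prints the "This hold out fold is" messages seen below, and whose internal model the project may have adapted); the names delay.cv.week and reg.vars are hypothetical placeholders, not the project's actual objects.

reg.cv <- crossValidate(
  dataset = delay.cv.week,               # hypothetical one-week subset
  id = "from_id",                        # each fold holds out one origin station
  dependentVariable = "delay_minutes",
  indVariables = reg.vars)               # hypothetical vector of predictor names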

## This hold out fold is 37953 
## (the same message repeats for each of the remaining hold-out folds; output truncated)
reg.summary <- mutate(reg.cv, Error = Prediction - delay_minutes,
                      Regression = "Cross Validation on a Week")

error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - delay_minutes, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T),
            SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
error_by_reg_and_fold.long <-
  error_by_reg_and_fold %>%
  gather(Variable, Value, -cvID, -Regression)

9.2 Histogram

The charts show that most folds concentrate at low values of MAE, Mean Error, and SD_MAE. There are still some outliers with large MAE, which means our model did not correctly predict some delayed trains for some reason; we try to figure them out in the discussion part.

ggplot(error_by_reg_and_fold.long, aes(x = Value)) +
  geom_histogram(bins = 20, 
                 fill="#f7ebe1", 
                 color="#f4a261",
                 alpha=0.8) +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  theme_nice()

10.Conclusion

In this final project, we built a regression model to predict train delay minutes in order to help passengers make good travel decisions. We selected six groups of features as our predictors: stations, time effects, weather (Temperature, Precipitation, Wind_speed, Visibility, Gust, Ice), time lags (lag1Hour, lag2Hours, lag3Hours, lag4Hours, lag5Hours, lag12Hours, lag1day), conditions of transportation (Number of Bridges, Percent Taking Public Transportation, Number of resident workers who commute within county), station lags (stationlag_1, stationlag_2, stationlag_3), and census data (Percent_Railpop).

It is worth highlighting one engineered predictor in particular: station lags. We used the delay minutes recorded at the preceding stations of the same run as factors, computing delay lags for the 3 previous stations. MAE fell by roughly 50% after we added station lags to the model, yielding a very well-fitting specification.
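A minimal sketch of how such a feature can be built, assuming the raw data carry train_id, date, and stop_sequence columns (column names as in the Kaggle source; this is our illustration, not the project's exact chunk):

delay <- delay %>%
  group_by(train_id, date) %>%                        # one group per train run
  arrange(stop_sequence, .by_group = TRUE) %>%        # order stops along the run
  mutate(stationlag_1 = dplyr::lag(delay_minutes, 1), # delay one stop back
         stationlag_2 = dplyr::lag(delay_minutes, 2),
         stationlag_3 = dplyr::lag(delay_minutes, 3)) %>%
  ungroup()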

The MAE maps show some spatial correlation in mean absolute error, and we use bar charts to visualize the relationship between MAE and the different lines and stations. The results illustrate that our model's predictive accuracy varies across stations and lines. Among stations, there is a large MAE peak at Absecon. Among lines, our model performs well on the Princeton Shuttle and has low accuracy on the Atl. City Line. Absecon is a station on the Atl. City Line, which runs between Philadelphia and Atlantic City. After searching, we found that Absecon station opened in 1854 and is a very old station. We suspect there may be frequent problems, such as aging equipment and railroad failures, in the high-MAE areas; these kinds of uncontrollable factors may have contributed to our model's lower prediction accuracy there.

For generalizability, we use cross-validation to examine whether this model works well across different partitions of the data.

In conclusion, we highly recommend our model as support for a train-delay prediction service. We hope its high accuracy and generalizability can help people fully anticipate train arrival times!

11.Improvement

  1. Simplifying the independent variables. We currently have more than 20 predictors, even though we used a correlation matrix to remove some strongly correlated variables. We want to use a decision tree or other methods to simplify our predictor set; see the sketch after this list.

  2. We are a little hesitant about such a well-fitting model: is it over-fitted?

  3. How can we improve accuracy on specific lines or stations? We may need to find more spatial features to reduce the unexplained spatial variation.
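For the first item, a hedged sketch of what predictor pruning could look like, using a regression tree's variable importance; rpart and the delay.Train data name are our assumptions here, not the project's actual code.

library(rpart)

# fit a regression tree on delay minutes and rank predictors by importance;
# variables with low importance are candidates for removal
tree <- rpart(delay_minutes ~ ., data = delay.Train, method = "anova")
sort(tree$variable.importance, decreasing = TRUE)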

12.Reference

Ken Steif (2021), Public Policy Analytics: Code & Context for Data Science in Government.

Pranav Badami (2019), NJ Transit and Amtrak (NEC) Rail Performance dataset. Retrieved from https://www.kaggle.com/datasets/pranavbadami/nj-transit-amtrak-nec-performance?select=2018_11.csv

Michael Zhang (2018), What are the chances that NJ Transit will cause you to miss the Dinky? Retrieved from https://medium.com/@mzhang13/what-are-the-chances-that-nj-transit-will-cause-you-to-miss-the-dinky-bfeacd11ebc6

Pranav Badami (2018), The 5 Stages of a System Breakdown on NJ Transit. Retrieved from https://towardsdatascience.com/the-5-stages-of-a-system-breakdown-on-nj-transit-8258127e31e9

Pranav Badami (2018), How Data Can Help Fix NJ Transit (on weather data, service advisories, and Amtrak data as predictive and prescriptive inputs). Retrieved from https://medium.com/@pranavbadami/how-data-can-help-fix-nj-transit-c0d15c0660fe