library(corrplot)
library(corrr)
library(FNN)
library(ggcorrplot)
library(gifski)
library(grid)
library(gridExtra)
library(ggmap)
library(ggpubr)
library(ggplot2)
library(ggpmisc)
library(ggstance)
library(gganimate)
library(geojsonio)
library(jtools)
library(knitr)
library(kableExtra)
library(stargazer)
library(AICcmodavg)
library(lubridate)
library(leaflet)
library(plotROC)
library(pROC)
library(rmarkdown)
library(riem)
library(RSocrata)
library(sf)
library(sp)
library(spdep)
library(spatstat)
library(stargazer)
library(tableHTML)
library(tigris)
library(tidycensus)
library(viridis)
library(magick)
library(rgdal)
library(tidyverse)
options(scipen=999, tigris_class = "sf")
<- theme(axis.text.x = element_blank(),
mapTheme axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
# setting pallettes
<- c("#efb366","#bbefa9")
palette2 <- c("#ffe15c","#efb366","#bbefa9")
palette3 <- c("#264653","#2a9d8f","#e9c46a","#f4a261","#e76f51")
palette5 <- c("#355070","#6d597a","#915f78","#b56576","#e77c76", "#e88c7d","#eaac8b")
palette7 <- c("#b98b73","#cb997e","#ddbea9","#ffe8d6","#d4c7b0","#b7b7a4","#a5a58d","#6b705c","#3f4238")
palette9 <- c("#797d62","#9b9b7a","#baa587","#d9ae94","#aeb0a4","#d8e2dc", "#ffcb69", "#e8ac65","#d08c60","#b58463", "#997b66")
palette11 <- c("#264653","#287271","#2a9d8f","#8ab17d","#babb74","#e9c46a", "#efb366", "#f4a261","#ee8959","#e76f51", "#862906")
palette11_1 <- c("#283c55","#355070","#6d597a","#915f78","#b56576","#e56b6f", "#d19999", "#e77c76", "#e88c7d","#eaac8b","#eebba0")
palette11_2
#import functions
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
This is a forecast to predict the origin or destination time delays in Metro trains like NJ Transit and Amtrak around New York City.
NJ Transit system is a non-cyclical rail network that owns 11 lines and services 162 train stations in New Jersey and connects travelers to New York Penn Station. The region of NJ Transit operations has complex system dynamics that affect thousands of people everyday. Train delays will affect many people’s travel and schedules. Therefore, a good prediction could allow enough time for passengers to consider train delays and anticipate in advance.
Delay is the extra time it takes a train to operate on a route due to many factors, such as weather, stations, lines, and equipment. The delay will not only affect the operation of the train but also spread in the section, causing other trains to be late. Train delays will also cause a long time of passenger retention and bring inconvenience. We want to study trends and offer a better understanding of the principal factors that contribute to training delays. Therefore, our goal is to provide a reliable prediction of station delay that can help dispatchers to estimate the train operation status and make reasonable dispatching decisions to improve the operation and service quality of rail transit.
The system is facing a serious train delays issues. we can know from NJ transit website, it clearly shows the recent train delays, which happens everyday. Train delay is a harsh truth that frustrates passengers. This is a problem that every passenger will face as soon as they plan to take the train. It also cause a long time of passengers retention, and bring them frustrations and inconvenience.
Our app - Transpotting
is designed for passengers and
tells them whether the train will delay or not and avoid unexpected
delays. The app user could be anyone who regularly uses the train system
in and around NYC. This could include commuters who use the train to
work, tourists who are visiting the city, and business travelers who
need to get around the city quickly.
Different professional people work together to support the APP. The datasets of the app are from train companies. And professional data analysts deal with the data, analysis, and visualize them to let users easy to understand. The transportation department can improve traffic policy according to the analysis result. The goal of our app is to provide users with our predicted train arrival times. They can adjust their travel plans with this app. We hope users can get more trust in traveling by train.
A reliable prediction of train delays offers users a better ride experience, which can increase rail ridership. And also the analysis report can let train operators have a better understanding of the reason for train delays. They can improve the train management according to different conditions.
In addition, to put a machine learning model into the hands of a non-technical decision maker, the app should be easy to use. It has a user-friendly interface that allows the decision maker to access and use the model without technical expertise, and has clear instructions on how to use the model and interpret its predictions. The app also includes visualizations and summary statistics that help the decision maker understand the predictions. it allows decision maker to adjust parameters to meet their specific needs.
According to this use case, we will focus on Rail Passengers. We design three main functions for them, which is to get real-time train information, offer train delay prediction. And users can click the customize button to get the train report by adding the train they interested.
Here is our user interface design. The main page shows real-time status of trains, and predicted arrival time of each train. Users can zoom in and out on the map to get the most intuitive view of train moving dynamics. And when users click the alert button, they can have a quick look at all the delayed trains notifications.
Users can get the delayed report of the train they searched for. Train report mainly contains two part: predicted time and history timetable. It will provides users a general summary of train’s punctuality rate, arriving time in one week, and predicted arriving time.
In the account management panel, users can customize their coming trips, and get information of predicted train delayed time, actual arrival time, and weather at stops. And they can also get a report of how long they spent waiting for the train this year. We hope everyone could save their time with this app.
As the issue exhibits very strong space-time dependencies, we will create a space-time geospatial machine learning model in this case. The dependent variable here is train delays time. After the exploratory data analysis, we are going to do feature engineering both spatial and temporal features. In addition, a strong time series model is an understanding of the underlying temporal train operation process. And also the space effects like inclement weather are initial features to predict train time delays.
Specifically, we will use linear regression
(because
records are large enough) with fixed effects like time lags and station
lags to predict. Moreover, we will compare the predicted data with
observed data, and look at the spatial pattern of MAEs in test set.
Finally, we will use cross validation think about whether it is an
accurate and generalizable model.
The available dataset is based on past records and shows a result of the past. We analyze it to get a conclusion in order to make new predictions for similar issues in the future and help companies or sectors make a judgment. In this case, datasets we used are as following:
Granular performance data from 150k+ NJ Transit and Amtrak train trips by PRANAV BADAMI.
NJ Transits and Amtrak rail stations
NJ and NYC Census Data
Real-time weather data
Transportation Data
Those data set is then wrangled into a complete
Final.Panel
of observations that include every possible
space-time combination.
In this case, we choose trains trips during October and December in 2019, which were in the pre-pandemic period and concluded Thanksgiving and Christmas holidays. Here, we got 667,754 pieces of records. Inspection of the dataset revealed that some temporal data needed to be processed. Then, we sort data by 60 minutes intervals respectively.
<- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_10.csv")
trips201910 <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_11.csv")
trips201911 <- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/archive/2019_12.csv")
trips201912
# get rid of NAs and merge them together
<- rbind(trips201910, trips201911, trips201912)
trips <- na.omit(trips) trips
<- trips %>%
trips mutate(scheduled_time = as_datetime(scheduled_time, tz = "America/New_York"),
actual_time = as_datetime(actual_time, tz = "America/New_York")) %>%
mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
#scheduled time: month, week, day, hour and minute
month = month(interval60),
week = week(interval60),
day = day(interval60),
dotw = wday(interval60, label=TRUE),
hour = hour(interval60),
minute = minute(scheduled_time),
#time in a decimal format
time = hour + minute/60,
#actual time: month, week, day, hour and minute
act_month = month(actual_time),
act_week = week(actual_time),
act_dotw = wday(actual_time, label=TRUE),
act_day = day(actual_time),
act_hour = hour(actual_time),
act_minute = minute(actual_time),
#actual time in a decimal format
act_time = act_hour + act_minute/60)
<- trips201910 %>% na.omit() %>%
trips201910 #change data type
mutate(scheduled_time = as_datetime(scheduled_time, tz = "America/New_York"),
actual_time = as_datetime(actual_time, tz = "America/New_York")) %>%
mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
#scheduled time: month, week, day, hour and minute
month = month(interval60),
week = week(interval60),
day = day(interval60),
dotw = wday(interval60, label=TRUE),
hour = hour(interval60),
minute = minute(scheduled_time),
#time in a decimal format
time = hour + minute/60,
#actual time: month, week, day, hour and minute
act_month = month(actual_time),
act_week = week(actual_time),
act_dotw = wday(actual_time, label=TRUE),
act_day = day(actual_time),
act_hour = hour(actual_time),
act_minute = minute(actual_time),
#actual time in a decimal format
act_time = act_hour + act_minute/60)
<- trips201911 %>% na.omit() %>%
trips201911 #change data type
mutate(scheduled_time = as_datetime(scheduled_time, tz = "America/New_York"),
actual_time = as_datetime(actual_time, tz = "America/New_York")) %>%
mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
#scheduled time: month, week, day, hour and minute
month = month(interval60),
week = week(interval60),
day = day(interval60),
dotw = wday(interval60, label=TRUE),
hour = hour(interval60),
minute = minute(scheduled_time),
#time in a decimal format
time = hour + minute/60,
#actual time: month, week, day, hour and minute
act_month = month(actual_time),
act_week = week(actual_time),
act_dotw = wday(actual_time, label=TRUE),
act_day = day(actual_time),
act_hour = hour(actual_time),
act_minute = minute(actual_time),
#actual time in a decimal format
act_time = act_hour + act_minute/60)
<- trips201912 %>% na.omit() %>%
trips201912 #change data type
mutate(scheduled_time = as_datetime(scheduled_time, tz = "America/New_York"),
actual_time = as_datetime(actual_time, tz = "America/New_York")) %>%
mutate(interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
#scheduled time: month, week, day, hour and minute
month = month(interval60),
week = week(interval60),
day = day(interval60),
dotw = wday(interval60, label=TRUE),
hour = hour(interval60),
minute = minute(scheduled_time),
#time in a decimal format
time = hour + minute/60,
#actual time: month, week, day, hour and minute
act_month = month(actual_time),
act_week = week(actual_time),
act_dotw = wday(actual_time, label=TRUE),
act_day = day(actual_time),
act_hour = hour(actual_time),
act_minute = minute(actual_time),
#actual time in a decimal format
act_time = act_hour + act_minute/60)
Here, we add NJ Transit and Amtrak stations. Then, we are going to
create a trip.sf
with latitude and longitude of stations by
merging stations data into train trips data.
#NJ transits in New Jersey
<- st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Railroads_Network/Railroads_Network.shp") %>%
NJstations st_transform(crs = 4326) %>%
::select(OBJECTID, STATION_ID, geometry) dplyr
## Reading layer `Railroads_Network' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Railroads_Network/Railroads_Network.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 167 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 300860.8 ymin: 193014.1 xmax: 634837.8 ymax: 961070.5
## Projected CRS: NAD83 / New Jersey (ftUS)
colnames(NJstations)[2] <- "STNNAME"
<- st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Amtrak_Stations-shp/Amtrak_Stations.shp") %>%
amtrak st_transform(crs = 4326) %>%
filter(STATE == "NJ" | CITY2 == "New York") %>%
::select(OBJECTID, STNNAME, geometry) dplyr
## Reading layer `Amtrak_Stations' from data source
## `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Amtrak_Stations-shp/Amtrak_Stations.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 529 features and 7 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -123.1028 ymin: 25.84955 xmax: -69.96549 ymax: 48.72026
## Geodetic CRS: WGS 84
<- rbind(NJstations, amtrak) %>% st_transform(crs = 4326) stations
<- merge(trips, stations, by.x= "to", by.y="STNNAME")
trips.sf <- merge(trips.sf, stations, by.x= "from", by.y="STNNAME")
trips.sf <- trips.sf %>%
trips.sf ::mutate(to_lon = sf::st_coordinates(geometry.y)[,1],
dplyrto_lat = sf::st_coordinates(geometry.y)[,2]) %>%
mutate(from_lon = sf::st_coordinates(geometry.x)[,1],
from_lat = sf::st_coordinates(geometry.x)[,2])
<- c(
limit left = -75.11,
bottom = 40.111689,
right = -73.53,
top = 41.302571)
<- get_stamenmap(limit, zoom = 9, maptype = "toner-background")
basemap
#ggmap(map) +
# geom_point(data=trips.sf,
# mapping = aes(x=from_lon, y=from_lat, color=line),
# size=0.8,
# alpha=0.3)+
# labs(title="NJ Transit") +
# theme_nice() +
# mapTheme()
In this report, We import new york and new jersey census data,
including transportation conditions of population, such as numbers of
commuters, the total number of public transportation, etc. We use
spatial information as origin and destination data to the train trips.
And also we extract the census tracts for mapping and joining purposes -
creating an sf
object that consists geometry.
<-
NJcensus get_acs(geography = "county",
variables = c("B01003_001", "B08006_012",
"B08013_001", "B08012_001",
"B08301_001", "B08301_010"),
year = 2019,
state = "NJ",
geometry = TRUE,
output = "wide"
%>%
) rename(Total_Pop = B01003_001E,
Rail_Pop = B08006_012E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(NAME, Total_Pop, Rail_Pop, Travel_Time, Num_Commuters, Means_of_Transport, Total_Public_Trans, GEOID, geometry) %>%
mutate(Percent_Railpop = Rail_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
<-
NJcensus %>%
NJcensusmutate(County.Name = sapply(strsplit(as.character(NJcensus$NAME),','), "[", 1))%>%
mutate(State.Name = sapply(strsplit(as.character(NJcensus$NAME),','), "[", 2))%>%
## extract county name from name
select(., -c("NAME"))
$State.Name <- trimws(NJcensus$State.Name, which = c("left")) NJcensus
<-
NYcensus get_acs(geography = "county",
variables = c("B01003_001", "B08006_012",
"B08013_001", "B08012_001",
"B08301_001", "B08301_010"),
year = 2019,
state = "NY",
geometry = TRUE,
output = "wide"
%>%
) rename(Total_Pop = B01003_001E,
Rail_Pop = B08006_012E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(NAME, Total_Pop, Rail_Pop, Travel_Time, Num_Commuters, Means_of_Transport, Total_Public_Trans, GEOID, geometry) %>%
mutate(Percent_Railpop = Rail_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
<-
NYcensus %>%
NYcensusmutate(County.Name = sapply(strsplit(as.character(NYcensus$NAME),','), "[", 1))%>%
mutate(State.Name = sapply(strsplit(as.character(NYcensus$NAME),','), "[", 2))%>%
## extract county name from name
select(., -c("NAME"))
$State.Name <- trimws(NYcensus$State.Name, which = c("left")) NYcensus
#combine census data together
<- rbind(NYcensus, NJcensus)
censusdata #extract census tracts
<- censusdata %>%
censustracts as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
%>%
st_sf st_transform(crs=4326)
These data from Bureau of Transportation Statistics. There are 22 columns in the dataset, which contain percent of poor condition bridges, percent of resident workers who commute by transit, and other transportation conditions divided by census tracts.
Bridges data is compiled in the National Transportation Atlas Database (NTAD) using data from the Federal Highway Administration National Bridge Inventory (NBI). The dataset includes bridges located on public roads as of December 31, 2018, including Interstate Highways, U.S. highways, State and county roads, as well as publicly-accessible bridges on Federal and Tribal lands.
#transportation data(divided by county)
<- read.csv("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/County_Transportation_Profiles.csv") %>%
transdataall select(., -c("Non.Commercial..Civil.Public.Use.Airports.and.Seaplane.base", "Non.Commercial..Other.Aerodromes","Total.Marinas","Total.Docks"))
<- subset(transdataall, transdataall$State.Name == "New York" | transdataall$State.Name == "New Jersey" )
transdata
#merge census data with transportation data
<- merge(censusdata,transdata,by=c("County.Name","State.Name"))
census_merge <- st_transform(x=census_merge,crs=4326)
census_merge.st #write_csv(census_merge,"/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/census_merge_original.csv")
We consider stations in NYC as destinations. Weather at the destination (fog, gust, or bad weather) may also affect the departure time of the train, then cause delays.
<-
NYCweather.panel riem_measures(station = "NYC", date_start = "2019-10-1", date_end = "2019-12-31") %>%
::select(valid, tmpf, sknt, p01i, vsby, gust, ice_accretion_6hr)%>%
dplyrreplace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Wind_speed = max(sknt),
Precipitation = sum(p01i),
Visibility = max(vsby),
Gust = max(gust),
Ice = max(ice_accretion_6hr)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
ggplot(NYCweather.panel, aes(interval60,Temperature)) + geom_line(color = "#f4a261", linewidth = 0.3) +
labs(title="Temperature",
x="Hour",
y="Temperature") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60,Wind_speed)) + geom_line(color = "#f4a261", linewidth = 0.3) +
labs(title="Wind Speed",
x="Hour",
y="Wind Speed") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60,Precipitation)) + geom_line(color = "#f4a261", linewidth = 0.3) +
labs(title="Percipitation",
x="Hour",
y="Perecipitation") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60, Visibility)) + geom_line(color = "#f4a261", linewidth = 0.2) +
labs(title="Visibility", x="Hour", y="Visibility", ) + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60, Gust)) + geom_line(color = "#f4a261", linewidth = 0.3) +
labs(title="Gust", x="Hour", y="Gust") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60, Ice)) + geom_line(color = "#f4a261", linewidth = 0.3) +
labs(title="Ice Accretion over 6 Hours", x="Hour", y="Ice Accretion") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
top = "Weather Data Per Hour - New York City - From Oct. to Dec., 2019")
We use riem_measures
help to import weather data, in
this case We extract weather data of Newark Liberty International
Airport in New Jersey. In addition, there is another panel dataset
NJweather.Panel
summarizing temperature, precipitation, and
wind speed per hour between October and December.
<-
NJweather.panel riem_measures(station = "EWR", date_start = "2019-10-1", date_end = "2019-12-31") %>%
::select(valid, tmpf, sknt, p01i, vsby, gust, ice_accretion_6hr)%>%
dplyrreplace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Wind_speed = max(sknt),
Precipitation = sum(p01i),
Visibility = max(vsby),
Gust = max(gust),
Ice = max(ice_accretion_6hr)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
ggplot(NJweather.panel, aes(interval60,Temperature)) + geom_line(color = "#80ed99", linewidth = 0.3) +
labs(title="Temperature",
x="Hour",
y="Temperature") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NJweather.panel, aes(interval60,Wind_speed)) + geom_line(color = "#80ed99", linewidth = 0.3) +
labs(title="Wind Speed",
x="Hour",
y="Wind Speed") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NJweather.panel, aes(interval60,Precipitation)) + geom_line(color = "#80ed99", linewidth = 0.3) +
labs(title="Percipitation",
x="Hour",
y="Perecipitation") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NJweather.panel, aes(interval60, Visibility)) + geom_line(color = "#80ed99", linewidth = 0.2) +
labs(title="Visibility", x="Hour", y="Visibility", ) + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NYCweather.panel, aes(interval60, Gust)) + geom_line(color = "#80ed99", linewidth = 0.3) +
labs(title="Gust", x="Hour", y="Gust") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
ggplot(NJweather.panel, aes(interval60, Ice)) + geom_line(color = "#80ed99", linewidth = 0.3) +
labs(title="Ice Accretion over 6 Hours", x="Hour", y="Ice Accretion") + theme_nice() +
theme(axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 6),
plot.title = element_text(size=10),
axis.title.x = element_blank()),
top = "Weather Data Per Hour - New Jersey - From Oct. to Dec., 2019")
In this section, we are going to examine the distribution and characteristic of train delays time.
ggplot(subset(trips.sf, delay_minutes<=10))+
geom_histogram(aes(delay_minutes),
binwidth = 0.2,
fill="#f7ebe1",
color="#f4a261",
alpha=0.8)+
labs(x="Minutes",
y="Counts",
title="Delay Minutes - Less than 10 Minutes",
subtitle="On-time") +
theme_nice()
ggplot(subset(trips.sf, delay_minutes>10))+
geom_histogram(aes(delay_minutes),
binwidth = 10,
fill="#f7ebe1",
color="#f4a261",
alpha=0.4)+
#xlim(0, 100) +
labs(x="Minutes",
y="Counts",
title="Delay Minutes - More Than 10 Minutes",
subtitle="Delayed") +
theme_nice()
The following diagrams illustrate Monday and Tuesday, the first two days of workdays, have the largest delay minutes. And also we know from the charts that the train often has longer delayed minutes in rush hours during workdays. And it is less noticeable on weekends. So we conclude the value of delay minutes have a strong correlation with workdays and rush hours.
%>%
trips.sf select(dotw, delay_minutes) %>%
group_by(dotw) %>%
summarise(sum_delay = sum(delay_minutes)) %>%
ggplot(aes(y = dotw, x = sum_delay, fill = dotw)) +
geom_bar(stat="identity",alpha = 0.6, width = 0.6) +
scale_fill_manual(values = palette11_1) +
labs(y = "Day",
x = "Minutes",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
title = "Overall Delay Minutes By Day",
fill = "Day") +
theme_nice()
%>%
trips select(hour, delay_minutes, dotw) %>%
group_by(hour, dotw) %>%
summarise(sum_delay = sum(delay_minutes)) %>%
ggplot(aes(x = hour, y = sum_delay, color = dotw)) +
scale_color_manual(values = palette11_1) +
geom_line() +
labs(title="Overall Delay Minutes by Day",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Hour",
y="Delay Minutes",
color="Day") +
theme(plot.title = element_text(size=14),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.subtitle = element_text(size = 10)) +
theme_nice()
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.
The following plots indicate the overall delay minutes by time during October and December. There is a rapid increasing in late December. We speculate that this peak occurred around Christmas holiday. But overall, train delays happened every day in 2019, which is a sad story for passengers. Meanwhile, most delays are less than 100 minutes.
#overall delay minutes by month
ggplot()+
geom_line(data=trips201910, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#ffe15c","#efb366")) +
labs(title="Delay Minutes in October",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Time",
y="Delay Minutes") +
theme_nice()
ggplot()+
geom_line(data=trips201911, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#efb366","#bbefa9")) +
labs(title="Delay Minutes in November",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Time",
y="Delay Minutes") +
theme_nice()
ggplot()+
geom_line(data=trips201912, aes(x=interval60, y=delay_minutes, color=factor(month))) + scale_colour_manual(values = c("#f7ede2","#bbefa9")) +
labs(title="Delay Minutes in December",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Time",
y="Delay Minutes") +
theme_nice()
ggplot()+
geom_line(data=trips, aes(x=interval60, y=delay_minutes, color=factor(month))) +
scale_colour_manual(values = c("#f7ede2","#ffe15c","#efb366","#bbefa9")) +
labs(title="Delay Minutes from Oct. to Dec.",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Time",
y="Delay Minutes") +
theme_nice()
From the diagrams, We can see that the princeton shuttle performed well among these lines, and the delays often happen in the morning rush and evening rush, which really frustrates commuters.
%>%
trips.sf select(line, delay_minutes) %>%
group_by(line) %>%
summarise(sum_delay = sum(delay_minutes)) %>%
ggplot(aes(x = sum_delay , y=line, fill = line)) +
geom_bar(stat="identity",alpha = 0.6, width = 0.6) +
scale_fill_manual(values = palette11_1) +
labs(y = "Line",
x = "Minutes",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
title = "Overall Delay Minutes By Line",
fill = "Line") +
theme_nice()
%>%
trips select(hour, delay_minutes, line) %>%
group_by(hour, line) %>%
summarise(sum_delay = sum(delay_minutes)) %>%
ggplot(aes(x = hour, y = sum_delay, color = line)) +
scale_color_manual(values = palette11_1) +
geom_line() +
labs(title="Overall Delay Minutes by Line",
subtitle="NJ Transit in NJ and around NYC",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Hour",
y="Delay Minutes",
color="Line") +
theme(plot.title = element_text(size=14),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.subtitle = element_text(size = 10)) +
theme_nice()
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.
%>%
trips mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, from, time_of_day) %>%
tally()%>%
group_by(from, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), fill="#f7ebe1", color="#f4a261", linewidth = 0.4, binwidth = 0.2)+
labs(title="Mean Number of Hourly Departure Trains Per Station",
subtitle = "in NJ Transits and Amtrak Stations during quarter 4 in 2019",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
theme_nice()
## `summarise()` has grouped output by 'from'. You can override using the
## `.groups` argument.
Those maps show the delays at each stations both in weekday and weekend. It is common that train would delay for a while on weekday.
ggmap(basemap) +
geom_point(data = trips.sf %>%
mutate(hour = hour(actual_time),
weekday = ifelse(dotw %in% c("Mon", "Tue","Wed","Thu", "Fri"),"Weekday","Weekend"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(to, to_lat, to_lon, weekday, time_of_day) %>%
tally(),
aes(x = to_lon, y = to_lat, color = n),
fill = "transparent", alpha = 0.4, size = 0.1)+
scale_colour_viridis(direction = -1,
discrete = FALSE,
option = "D",
name = "Sum of delay minutes") +
facet_grid(weekday ~ time_of_day)+
labs(title="Delay Minutes By Station",
subtitle = "during quarter 4 in 2019",
caption = "Data: NYC Open Data, NJGIN Open Data & kaggle",
+
) theme_nice() +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
The following time series diagrams indicate the delay time in each station and each metro line from October and December. We can compare between stations and lines to figure out whether there will be spatial patterns in the delays time.
#for (i in linelist)
# {
#stationlist <- trips.sf %>% filter(line==i)
#stationlist <- unique(stationlist$to)
#for (j in stationlist){
#plot3 <- trips.sf %>%
# st_drop_geometry()%>%
# filter(line==i & to==j)%>%
# na.omit()%>%
# mutate(interval_hour = ymd_h(substr(scheduled_time, 1, 13)),
# week = week(scheduled_time))%>%
# filter(week%in%(40:52))%>%
# group_by(interval_hour,date)%>%
# summarise(AverageDelayInHour = mean(delay_minutes))%>%
# ggplot(aes(interval_hour,AverageDelayInHour)) +
# geom_line(color="#f4a261",size=0.2)+
# labs(title = paste("Station:",j,sep = ""),y="Delay Hour")+
# theme_nice()
# g3 <- arrangeGrob(plot3)
# ggsave(path="/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line",paste(j,"_delay.png",sep = ""),g3)
#}
#}
#list.files(path="/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line",pattern = '*.png', full.names = TRUE) %>%
# image_read() %>%
# image_join() %>%
# image_animate(fps=1) %>%
# image_write("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Final Project/Line/delay.gif")