Load libraries
library("formattable")
library("tidyverse")
library("ggplot2")
library("ggmap")
library("lubridate")
library("caret")
library("ranger")
library("randomForest")
library("ggridges")
library("rpart")
library("rattle")
library("rpart.plot")
library("RColorBrewer")
library("MASS")
library("tree")
library("dplyr")
library("quantreg")
library("MLmetrics")
The first goal of this project is to identify trends and patterns in New York City taxi fares. If we discover any significant findings in the data, we can make recommendations to taxi drivers on how to boost their daily earnings. The second goal is to see whether we can build regression models to predict taxi fares from a number of factors (time, date, pick-up and drop-off locations, etc.).
NYC taxi fares are affected by pick-up and drop-off locations, passenger count, trip distance, pick-up and drop-off time of day, day of the week, and whether the trip is an airport trip. We will examine this hypothesis through exploratory data analysis and by building linear models.
First, load the datasets of NYC Yellow Taxi Trips for January, April, July, and October 2019. The datasets were sourced from the nyc.gov website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), where the data dictionary can also be found.
#Load the datasets of NYC Yellow Taxi Trips for January, April, July, October 2019 and combine them into a single data frame.
yel_jan <- read.csv("yellow_tripdata_2019-01.csv")
yel_apl <- read.csv("yellow_tripdata_2019-04.csv")
yel_jul <- read.csv("yellow_tripdata_2019-07.csv")
yel_oct <- read.csv("yellow_tripdata_2019-10.csv")
all <- rbind(yel_jan, yel_apl, yel_jul, yel_oct)
A quick overview of the dataset: * Check dataset size (note the dataset has 28.6M rows and 18 columns)
dim(all)
## [1] 28625241 18
head(all)
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1 1 2019-01-01 00:46:40 2019-01-01 00:53:20 1
## 2 1 2019-01-01 00:59:47 2019-01-01 01:18:59 1
## 3 2 2018-12-21 13:48:30 2018-12-21 13:52:40 3
## 4 2 2018-11-28 15:52:25 2018-11-28 15:55:45 5
## 5 2 2018-11-28 15:56:57 2018-11-28 15:58:33 5
## 6 2 2018-11-28 16:25:49 2018-11-28 16:28:26 5
## trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID
## 1 1.5 1 N 151 239
## 2 2.6 1 N 239 246
## 3 0.0 1 N 236 236
## 4 0.0 1 N 193 193
## 5 0.0 2 N 193 193
## 6 0.0 1 N 193 193
## payment_type fare_amount extra mta_tax tip_amount tolls_amount
## 1 1 7.0 0.5 0.5 1.65 0.00
## 2 1 14.0 0.5 0.5 1.00 0.00
## 3 1 4.5 0.5 0.5 0.00 0.00
## 4 2 3.5 0.5 0.5 0.00 0.00
## 5 2 52.0 0.0 0.5 0.00 0.00
## 6 2 3.5 0.5 0.5 0.00 5.76
## improvement_surcharge total_amount congestion_surcharge
## 1 0.3 9.95 NA
## 2 0.3 16.30 NA
## 3 0.3 5.80 NA
## 4 0.3 7.55 NA
## 5 0.3 55.55 NA
## 6 0.3 13.31 NA
str(all)
## 'data.frame': 28625241 obs. of 18 variables:
## $ VendorID : int 1 1 2 2 2 2 2 1 1 1 ...
## $ tpep_pickup_datetime : Factor w/ 8808859 levels "2001-02-02 14:55:07",..: 3046 3826 87 76 77 78 79 1558 2181 3692 ...
## $ tpep_dropoff_datetime: Factor w/ 8820499 levels "2001-02-02 15:07:27",..: 2804 4320 87 76 77 78 79 1365 2356 3760 ...
## $ passenger_count : int 1 1 3 5 5 5 5 1 1 2 ...
## $ trip_distance : num 1.5 2.6 0 0 0 0 0 1.3 3.7 2.1 ...
## $ RatecodeID : int 1 1 1 1 2 1 2 1 1 1 ...
## $ store_and_fwd_flag : Factor w/ 3 levels "N","Y","": 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 151 239 236 193 193 193 193 163 229 141 ...
## $ DOLocationID : int 239 246 236 193 193 193 193 229 7 234 ...
## $ payment_type : int 1 1 1 2 2 2 2 1 1 1 ...
## $ fare_amount : num 7 14 4.5 3.5 52 3.5 52 6.5 13.5 10 ...
## $ extra : num 0.5 0.5 0.5 0.5 0 0.5 0 0.5 0.5 0.5 ...
## $ mta_tax : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip_amount : num 1.65 1 0 0 0 0 0 1.25 3.7 1.7 ...
## $ tolls_amount : num 0 0 0 0 0 5.76 0 0 0 0 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 9.95 16.3 5.8 7.55 55.55 ...
## $ congestion_surcharge : num NA NA NA NA NA NA NA NA NA NA ...
Randomly sample a subset of the data. Considering data size and running speed, I randomly sampled 1 million rows from the all data frame for EDA (Exploratory Data Analysis).
set.seed(123)
index <- sample(x = 1:nrow(all), size= 1000000)
Retain the following columns, which are relevant to taxi fare
sampled <- all[index , c("tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "trip_distance", "PULocationID", "DOLocationID","payment_type", "total_amount")]
#Save the sampled dataset
write.csv(sampled, file="NYC Taxi Fare Data Sample.csv", row.names=FALSE)
Load back the sampled dataset
nyc_taxi <- read.csv("NYC Taxi Fare Data Sample.csv")
I am going to use this sample to perform exploratory data analysis and build prediction models to test the hypothesis above.
Also loading the Taxi Zone Lookup Table from the same nyc.gov webpage link.
#Load the taxi zone lookup table, which will be used for location reference
taxi_zone <- read.csv("taxi_zone_lookup.csv")
After reviewing the summary of the taxi_zone table, I found that two LocationIDs (264 and 265) are recorded as an unknown borough. We’ll exclude rows with these two LocationIDs from the all table later.
unknown_loc <- taxi_zone[taxi_zone$Borough == "Unknown", ]$LocationID
Obtain a quick understanding of the dataset. Here are a few issues that I found: + there are negative values in total_amount and the max value is 36090.30 (extreme values) + zero passengers or more than 6 passengers + NA values in two columns, passenger_count and payment_type (the same number of NAs in both)
summary(nyc_taxi)
## tpep_pickup_datetime tpep_dropoff_datetime
## 2019-10-22 17:44:55: 5 2019-04-15 00:00:00: 6
## 2019-01-01 19:43:36: 4 2019-07-03 00:00:00: 6
## 2019-01-04 19:37:06: 4 2019-07-24 00:00:00: 6
## 2019-01-05 11:50:14: 4 2019-07-27 00:00:00: 6
## 2019-01-05 19:01:13: 4 2019-01-18 00:00:00: 5
## 2019-01-05 20:36:15: 4 2019-01-19 19:13:30: 5
## (Other) :999975 (Other) :999966
## passenger_count trip_distance PULocationID DOLocationID
## Min. :0.000 Min. : 0.000 Min. : 1.0 Min. : 1.0
## 1st Qu.:1.000 1st Qu.: 0.970 1st Qu.:125.0 1st Qu.:112.0
## Median :1.000 Median : 1.600 Median :162.0 Median :162.0
## Mean :1.563 Mean : 2.964 Mean :163.6 Mean :161.8
## 3rd Qu.:2.000 3rd Qu.: 3.000 3rd Qu.:233.0 3rd Qu.:234.0
## Max. :9.000 Max. :183.680 Max. :265.0 Max. :265.0
## NA's :2829
## payment_type total_amount
## Min. :1.000 Min. : -224.80
## 1st Qu.:1.000 1st Qu.: 10.55
## Median :1.000 Median : 14.12
## Mean :1.292 Mean : 18.42
## 3rd Qu.:2.000 3rd Qu.: 20.13
## Max. :4.000 Max. :36090.30
## NA's :2829
Inspect missing data. Note that the missing values in passenger_count and payment_type occur in the same rows.
apply(nyc_taxi, 2, function(x) {sum(is.na(x))})
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 0 0 2829
## trip_distance PULocationID DOLocationID
## 0 0 0
## payment_type total_amount
## 2829 0
Remove rows with missing values
nyc_taxi <- nyc_taxi %>% drop_na()
dim(nyc_taxi)
## [1] 997171 8
Here I performed the following data cleaning steps: 1. Remove taxi trips for which the pickup or drop-off location was unknown (filtering out the two unknown LocationIDs).
2. Remove rows with zero passengers, zero trip distance, or zero or negative total amount.
3. Filter for taxi trips paid by credit card or cash, excluding trips recorded as no charge, dispute, unknown, or voided. (Payment type: 1 = credit card, 2 = cash, 3 = no charge, 4 = dispute, 5 = unknown, 6 = voided trip, from the Data Dictionary – Yellow Taxi Trip Records link.)
nyc_taxi <- nyc_taxi %>%
filter((! PULocationID %in% unknown_loc) & (!DOLocationID %in% unknown_loc)) %>%
filter(passenger_count != 0 & total_amount > 0 & trip_distance != 0 ) %>%
filter(payment_type == 1 | payment_type == 2)
Check the dataset summary again after removing outliers. All values (min, max, quantiles) are now more reasonable, so we can move on to feature engineering.
summary(nyc_taxi)
## tpep_pickup_datetime tpep_dropoff_datetime
## 2019-10-22 17:44:55: 5 2019-07-03 00:00:00: 6
## 2019-01-01 19:43:36: 4 2019-07-24 00:00:00: 6
## 2019-01-04 19:37:06: 4 2019-07-27 00:00:00: 6
## 2019-01-05 11:50:14: 4 2019-01-19 19:13:30: 5
## 2019-01-05 19:01:13: 4 2019-04-13 00:00:00: 5
## 2019-01-05 20:36:15: 4 2019-04-15 00:00:00: 5
## (Other) :951605 (Other) :951597
## passenger_count trip_distance PULocationID DOLocationID
## Min. :1.000 Min. : 0.010 Min. : 1.0 Min. : 1.0
## 1st Qu.:1.000 1st Qu.: 0.990 1st Qu.:116.0 1st Qu.:107.0
## Median :1.000 Median : 1.620 Median :162.0 Median :162.0
## Mean :1.596 Mean : 2.942 Mean :162.7 Mean :160.6
## 3rd Qu.:2.000 3rd Qu.: 3.000 3rd Qu.:233.0 3rd Qu.:233.0
## Max. :9.000 Max. :77.540 Max. :263.0 Max. :263.0
##
## payment_type total_amount
## Min. :1.000 Min. : 0.30
## 1st Qu.:1.000 1st Qu.: 10.56
## Median :1.000 Median : 14.12
## Mean :1.276 Mean : 18.12
## 3rd Qu.:2.000 3rd Qu.: 19.80
## Max. :2.000 Max. :403.30
##
Since LocationID is not easily interpretable, we’re going to convert it to a location name. Perform a “vlookup” against the taxi_zone table to identify the borough where each taxi trip began.
nyc_taxi <- nyc_taxi %>%
left_join(taxi_zone[, c(1,2,3)], by = c("PULocationID" = "LocationID"))
nyc_taxi <- nyc_taxi %>%
rename("puBorough" = "Borough", "puZone" = "Zone") #puBorough = pick-up borough
Perform another “vlookup” to identify the borough where each taxi trip ended
nyc_taxi <- nyc_taxi %>%
left_join(taxi_zone[, c(1,2,3)], by = c("DOLocationID" = "LocationID"))
nyc_taxi <- nyc_taxi %>%
rename("doBorough" = "Borough", "doZone" = "Zone") #doBorough = drop-off borough
head(nyc_taxi)
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
## 1 2019-10-25 22:36:03 2019-10-25 23:11:50 3 2.59
## 2 2019-04-20 20:04:44 2019-04-20 20:45:45 2 17.90
## 3 2019-07-25 16:43:39 2019-07-25 17:10:24 1 2.92
## 4 2019-07-23 13:08:56 2019-07-23 13:25:46 1 0.78
## 5 2019-07-06 07:31:05 2019-07-06 07:44:01 1 4.30
## 6 2019-10-14 15:31:31 2019-10-14 15:43:18 1 2.07
## PULocationID DOLocationID payment_type total_amount puBorough
## 1 249 163 1 31.62 Manhattan
## 2 132 230 2 61.42 Queens
## 3 161 211 2 21.80 Manhattan
## 4 100 164 1 17.16 Manhattan
## 5 50 12 1 20.30 Manhattan
## 6 161 137 1 16.62 Manhattan
## puZone doBorough doZone
## 1 West Village Manhattan Midtown North
## 2 JFK Airport Manhattan Times Sq/Theatre District
## 3 Midtown Center Manhattan SoHo
## 4 Garment District Manhattan Midtown South
## 5 Clinton West Manhattan Battery Park
## 6 Midtown Center Manhattan Kips Bay
Create new features to capture when during the day, week, and year each taxi trip was taken: * pickup hour * pickup day of the week * pickup month
nyc_taxi <- nyc_taxi %>%
mutate( pk_hour = hour(tpep_pickup_datetime),
pk_wday = wday(tpep_pickup_datetime, label = TRUE),
pk_month = month(tpep_pickup_datetime, label = TRUE))
Change pk_hour and passenger_count to factor variables
nyc_taxi$pk_hour <- as.factor(nyc_taxi$pk_hour)
nyc_taxi$passenger_count <- as.factor(nyc_taxi$passenger_count)
To classify whether a taxi trip was going to or coming from an airport, we first identify the LocationIDs for airports from the taxi zone table.
airportID <- taxi_zone[str_detect(taxi_zone$Zone, "Airport"), ]$LocationID
airportID <- airportID[! is.na(airportID)]
airportID
## [1] 1 132 138
Create a new column named airports_trip for which TRUE stands for airport trips.
nyc_taxi <- nyc_taxi %>%
mutate(airports_trip = (PULocationID %in% airportID | DOLocationID %in% airportID ))
head(nyc_taxi)
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
## 1 2019-10-25 22:36:03 2019-10-25 23:11:50 3 2.59
## 2 2019-04-20 20:04:44 2019-04-20 20:45:45 2 17.90
## 3 2019-07-25 16:43:39 2019-07-25 17:10:24 1 2.92
## 4 2019-07-23 13:08:56 2019-07-23 13:25:46 1 0.78
## 5 2019-07-06 07:31:05 2019-07-06 07:44:01 1 4.30
## 6 2019-10-14 15:31:31 2019-10-14 15:43:18 1 2.07
## PULocationID DOLocationID payment_type total_amount puBorough
## 1 249 163 1 31.62 Manhattan
## 2 132 230 2 61.42 Queens
## 3 161 211 2 21.80 Manhattan
## 4 100 164 1 17.16 Manhattan
## 5 50 12 1 20.30 Manhattan
## 6 161 137 1 16.62 Manhattan
## puZone doBorough doZone pk_hour pk_wday
## 1 West Village Manhattan Midtown North 22 Fri
## 2 JFK Airport Manhattan Times Sq/Theatre District 20 Sat
## 3 Midtown Center Manhattan SoHo 16 Thu
## 4 Garment District Manhattan Midtown South 13 Tue
## 5 Clinton West Manhattan Battery Park 7 Sat
## 6 Midtown Center Manhattan Kips Bay 15 Mon
## pk_month airports_trip
## 1 Oct FALSE
## 2 Apr TRUE
## 3 Jul FALSE
## 4 Jul FALSE
## 5 Jul FALSE
## 6 Oct FALSE
The distribution of taxi fares is highly right-skewed, indicating many extreme high fares (outliers) in the right tail.
ggplot(nyc_taxi, aes(x = total_amount)) +
geom_histogram(fill = "blue", bins = 50) +
ggtitle("Histogram of NYC Taxi Fares") +
theme(plot.title = element_text(hjust = 0.5))
While log-transforming the target variable can help it conform to normality, I found the resulting plots hard to interpret. I opted instead to exclude outliers in the plots.
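For reference, the log-scale view I decided against could be produced as follows (a sketch, assuming nyc_taxi is loaded as above; log1p is used so fares near zero stay finite):

```r
# Sketch: histogram of log-transformed fares, shown only for comparison
# with the untransformed plots used in the rest of the analysis.
nyc_taxi %>%
  filter(total_amount > 0) %>%
  ggplot(aes(x = log1p(total_amount))) +
  geom_histogram(fill = "blue", bins = 50) +
  ggtitle("Histogram of log(1 + Taxi Fare)") +
  theme(plot.title = element_text(hjust = 0.5))
```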
To “zoom in” on the distribution, I filtered the dataset for taxi fares (total_amount) below the 99.99th percentile ($136.37). This gives a better view of the distribution of the vast majority (99.99%) of the sample. I also highlight the median taxi fare in the plot.
total_amount_99 <-quantile(nyc_taxi$total_amount, .9999)
total_amount_99
## 99.99%
## 136.367
median_fare <- nyc_taxi %>%
summarise(median_fare = median(total_amount))
median_fare
## median_fare
## 1 14.12
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot(aes(x = total_amount)) +
geom_histogram(color = "white", fill = "blue", bins = 50) +
geom_vline(xintercept = median_fare$median_fare,linetype="dashed", color = "black", size=1) +
annotate(geom="text", x=30, y=140000, label="median = 14.12", color="black") +
ggtitle("Histogram of NYC Taxi Fares (median = 14.12)") +
theme(plot.title = element_text(hjust = 0.5))
I am applying this 99.99th-percentile filter on total_amount (taxi fare) throughout the following exploratory data analysis as a way to eliminate noise (outliers), which makes the general trends in taxi fares easier to identify.
Next I plotted the density of taxi fares by pick-up borough (see below). The plot shows clear differences in taxi fares between pickup locations. Though the median taxi fare for Newark (EWR) pick-ups is the highest among all boroughs, the distribution of EWR pick-up fares is also the most widely spread out. Fares for Staten Island pick-ups follow a similar pattern to EWR.
The taxi fares of the other four boroughs (Queens, Manhattan, Brooklyn and the Bronx) follow approximately normal distributions. Taxi trips starting from Queens have a median fare roughly three times that of trips from Manhattan, Brooklyn, and the Bronx.
I also computed the number of taxi trips from each borough and its percentage of the total. It’s worth noting that Manhattan alone accounts for 92.2% of all the taxi trips in New York City, followed by Queens at 6.6% and Brooklyn at 1.1%. The remaining three boroughs have very few taxi trips.
pu_median <- nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
group_by(puBorough) %>%
summarize(median = median(total_amount), num_trips = n()) %>%
mutate(percentage = round(num_trips/sum(num_trips), digits=4))
pu_median
## # A tibble: 6 x 4
## puBorough median num_trips percentage
## <fct> <dbl> <int> <dbl>
## 1 Bronx 17.3 1343 0.0014
## 2 Brooklyn 15.8 10065 0.0106
## 3 EWR 85.3 15 0
## 4 Manhattan 13.6 877632 0.922
## 5 Queens 46.7 62446 0.0656
## 6 Staten Island 49.8 33 0
# Density Plot : Taxi fares by different pick-up boroughs
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot(aes(x =total_amount, y = puBorough, fill = puBorough)) +
geom_density_ridges(scale = 1, quantile_lines = TRUE, quantiles = 2) +
ggtitle("Density and Median Taxi Fares by Pick-up Boroughs") +
geom_text(data = pu_median, aes(x= median, y= puBorough, label= median),
position=position_nudge(y=-0.1), colour="black", size=3.5) +
theme(plot.title = element_text(size = 12, hjust = 0.5))
## Picking joint bandwidth of 6.43
Then, I want to see how taxi fares differ by zone. Compute the average and median taxi fares and the count of taxi trips by pick-up borough, and further by pick-up zone.
puZone_summary <- nyc_taxi %>%
group_by(puBorough, puZone) %>%
summarise(avg_taxi_fare_puzone = mean(total_amount),median_taxi_fare_puzone = median(total_amount), num_trips = n()) %>%
arrange(desc(avg_taxi_fare_puzone))
head(puZone_summary,20)
## # A tibble: 20 x 5
## # Groups: puBorough [6]
## puBorough puZone avg_taxi_fare_pu… median_taxi_fare_… num_trips
## <fct> <fct> <dbl> <dbl> <int>
## 1 EWR Newark Airpo… 104. 96.6 18
## 2 Staten Isl… Port Richmond 99.5 99.5 1
## 3 Staten Isl… Heartland Vi… 95.0 95.0 2
## 4 Staten Isl… Arden Heights 84.4 93.9 7
## 5 Brooklyn Marine Park/… 75.7 73.7 3
## 6 Queens Jamaica Bay 73.7 73.7 1
## 7 Staten Isl… Bloomfield/E… 66.4 86.8 9
## 8 Staten Isl… South Beach/… 65.8 65.8 1
## 9 Bronx Pelham Bay 64.8 79.4 5
## 10 Queens Flushing Mea… 63.9 70.3 233
## 11 Staten Isl… Saint George… 62.9 49.8 3
## 12 Queens Baisley Park 61.1 67.3 305
## 13 Queens South Jamaica 60.5 66.4 139
## 14 Manhattan Randalls Isl… 58.7 68.8 46
## 15 Queens Far Rockaway 58.3 63.9 13
## 16 Queens Springfield … 56.0 61.9 79
## 17 Queens JFK Airport 55.3 61.4 29116
## 18 Queens South Ozone … 52.5 61.4 112
## 19 Queens Jamaica 50.4 61.4 148
## 20 Queens Astoria Park 50.2 59.4 12
The summary table shows that most of the zones with the highest average taxi fares actually have very small numbers of trips. This indicates that high-fare zones are not necessarily the most profitable or the go-to places for taxi drivers once the volume of demand is considered. Since the dataset covers New York Yellow Cab trips, pick-ups from Newark Airport are rare, while drop-offs at Newark Airport are high in volume (presented below). Given the distance from NYC to Newark Airport, a New York taxi driver is far more likely to pick up a passenger within NYC and drop her off at Newark Airport than the reverse. Queens is a very promising choice, as it accounts for 13 of the top 20 zones in this ranking. The median taxi fare for Queens pick-ups is $45.36.
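The pickup/drop-off asymmetry at Newark can be checked directly by counting trips in each direction (a sketch; per the taxi zone table above, LocationID 1 is the EWR zone):

```r
# Sketch: compare EWR pick-up vs drop-off volume in the cleaned sample.
# Assumes LocationID 1 corresponds to Newark Airport, per taxi_zone_lookup.csv.
nyc_taxi %>%
  summarise(ewr_pickups  = sum(PULocationID == 1),
            ewr_dropoffs = sum(DOLocationID == 1))
```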
Then I filtered for zones with more than 100 trips to focus on areas with substantial taxi demand, and sorted the summary table by average taxi fare from highest to lowest.
puZone_summary <- puZone_summary %>%
filter(num_trips > 100) %>%
arrange(desc(avg_taxi_fare_puzone))
We can see that, among the top 20 zones by average taxi fare, half are located within Queens, where passengers are very likely to take an expensive taxi trip; note also that trips from JFK and LaGuardia airports originate in Queens. Manhattan and Brooklyn are the two other boroughs associated with high average taxi fares.
head(puZone_summary,20)
## # A tibble: 20 x 5
## # Groups: puBorough [3]
## puBorough puZone avg_taxi_fare_pu… median_taxi_fare_… num_trips
## <fct> <fct> <dbl> <dbl> <int>
## 1 Queens Flushing Meado… 63.9 70.3 233
## 2 Queens Baisley Park 61.1 67.3 305
## 3 Queens South Jamaica 60.5 66.4 139
## 4 Queens JFK Airport 55.3 61.4 29116
## 5 Queens South Ozone Pa… 52.5 61.4 112
## 6 Queens Jamaica 50.4 61.4 148
## 7 Queens Briarwood/Jama… 47.3 61.4 119
## 8 Queens LaGuardia Airp… 41.8 43.1 24123
## 9 Brooklyn East New York 32.9 25.0 124
## 10 Queens Long Island Ci… 31.5 19.3 805
## 11 Queens Rego Park 31.0 22.4 102
## 12 Queens Steinway 27.2 18.3 478
## 13 Queens Elmhurst 25.5 11.8 268
## 14 Queens East Elmhurst 24.3 21.6 136
## 15 Manhattan Financial Dist… 23.9 22.3 3600
## 16 Queens Forest Hills 23.8 18.3 143
## 17 Manhattan Financial Dist… 22.9 20.8 6873
## 18 Manhattan Battery Park 22.6 22.3 469
## 19 Manhattan World Trade Ce… 22.5 19.8 5456
## 20 Queens Sunnyside 22.4 16.6 1301
Plot density ridges to compare taxi fare distributions between drop-off boroughs. Since a taxi driver cannot predict where the next passenger is going until pickup, this information is less useful for helping drivers boost their earnings.
do_median <- nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
group_by(doBorough) %>%
summarize(do_median = median(total_amount))
do_median
## # A tibble: 6 x 2
## doBorough do_median
## <fct> <dbl>
## 1 Bronx 34.1
## 2 Brooklyn 29.8
## 3 EWR 96.8
## 4 Manhattan 13.5
## 5 Queens 34.8
## 6 Staten Island 83.3
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot(aes(x =total_amount, y = doBorough, fill = doBorough)) +
geom_density_ridges(scale = 1, quantile_lines = TRUE, quantiles = 2) +
geom_text(data = do_median, aes(x= do_median, y= doBorough, label= do_median),
position=position_nudge(y=-0.1), colour="black", size=3.5) +
ggtitle("Density of Taxi Fares by Different Drop-off Boroughs") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
## Picking joint bandwidth of 2.4
Plot the distributions of taxi fares for airport versus non-airport trips. Both the density plot and the boxplot show that rides to or from an airport cost more than non-airport trips, which makes sense as the distance to an airport is usually longer than a within-city trip. With this information, I would suggest that taxi drivers take more airport trips, for example by stopping for passengers carrying suitcases.
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot(aes(x = total_amount, color = airports_trip)) +
geom_density() +
ggtitle("Density of Taxi Fares by Airport-trips or not") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot() +
geom_boxplot(aes(x=airports_trip, y=total_amount, fill = airports_trip)) +
ggtitle("Boxplot of total taxi fare by airport-trip or not ")+
theme(plot.title = element_text(size = 12, hjust = 0.5))
Plot the density of taxi fares by pick-up hour. It looks like there is minimal difference in taxi fares between different hours of the day.
nyc_taxi %>%
filter( total_amount < total_amount_99 ) %>%
ggplot(aes(x=total_amount, color = pk_hour), alpha = 0.5) +
geom_density()+
ggtitle("Density of Taxi Fare by Pick-up Time") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
I computed the average taxi fare by hour of the day, sorted it in descending order, and listed the top 10. Two things to note from the table: * The best times for taxi drivers to work are probably these 10 hours, which I grouped into three time slots: 4am to 6am (early morning), 2pm to 6pm (afternoon rush), and 9pm to 1am (late evening). * Arranging these top 10 hours by number of trips, the morning shift (4am to 6am) has the highest average fare but the lowest demand.
pkh <- nyc_taxi %>%
group_by(pk_hour) %>%
summarize(avg_pkh = round(mean(total_amount), 2), num_trips = n())
pkh_top10 <- pkh %>%
arrange(desc(avg_pkh)) %>%
top_n(wt = avg_pkh, n=10)
pkh_top10 %>%
arrange(desc(num_trips))
## # A tibble: 10 x 3
## pk_hour avg_pkh num_trips
## <fct> <dbl> <int>
## 1 17 18.9 57051
## 2 15 18.4 53556
## 3 21 18.4 53081
## 4 14 18.6 52623
## 5 16 19.6 49905
## 6 22 18.6 49153
## 7 23 19.0 37898
## 8 0 18.8 26689
## 9 5 21.5 8477
## 10 4 20.3 7115
Given the observations above, I created a new variable called rush_hour that groups the 24 hours of the day into four time slots: Early Morning (4am-6am), Mid-day (1pm-6pm), Midnight (10pm-1am), and Regular Hours (any other time). This is a feature engineering technique called discretization: converting a continuous variable into a categorical one by creating meaningful classes.
nyc_taxi <- nyc_taxi %>%
mutate(rush_hour =as.factor(ifelse(pk_hour %in% c(22,23,0), "Midnight(9pm-2am)",
ifelse(pk_hour %in% c(4,5), "Early Morning(3am-5am)",
ifelse(pk_hour %in% c(13,14,15,16,17), "Mid_day(1pm-6pm)" , "Other Hours")))))
rush_hr_median <- nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
group_by(rush_hour) %>%
summarize(median_rush_hr = round(median(total_amount), 2), avg_rush_hr = round(mean(total_amount), 2), num_trips = n())
rush_hr_median
## # A tibble: 4 x 4
## rush_hour median_rush_hr avg_rush_hr num_trips
## <fct> <dbl> <dbl> <int>
## 1 Early Morning(3am-5am) 13.8 20.9 15589
## 2 Mid_day(1pm-6pm) 14.1 18.7 263015
## 3 Midnight(9pm-2am) 14.8 18.8 113729
## 4 Other Hours 13.8 17.6 559201
Plot boxplots of taxi fares by the four time frames. The graph shows some advantage for taxi drivers who work during the three special time frames (early morning, mid-day, and midnight) rather than the other hours.
nyc_taxi %>%
filter( total_amount < total_amount_99 ) %>%
ggplot(aes(y=total_amount, x= rush_hour, color = rush_hour)) +
geom_boxplot()+
stat_summary(fun.y=mean, colour="darkblue", geom="point",
shape=18, size=3, show.legend =FALSE) +
geom_text(data = rush_hr_median, aes(y= avg_rush_hr , x= rush_hour, label= avg_rush_hr ),
position=position_nudge(y = 3), colour="black", size=3.5) +
ggtitle("Boxplot of Taxi Fares by Pick-up Time Frame") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
Let’s look at taxi fares by day of the week. According to this plot, Thursday, Friday, and Monday are the best days to work, as they rank highest in both median and mean taxi fares, while Saturday is the worst day of the week.
pk_wday_median <- nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
group_by(pk_wday) %>%
summarize(median_wday = round(median(total_amount), 2),avg_wday = round(mean(total_amount), 2), num_trips = n())
pk_wday_median
## # A tibble: 7 x 4
## pk_wday median_wday avg_wday num_trips
## <ord> <dbl> <dbl> <int>
## 1 Sun 13.3 17.9 107372
## 2 Mon 13.8 18.4 128949
## 3 Tue 14.2 18.2 156613
## 4 Wed 14.2 18.3 152927
## 5 Thu 14.3 18.4 150020
## 6 Fri 14.2 18.3 131617
## 7 Sat 13.6 17.1 124036
nyc_taxi %>%
filter(total_amount < total_amount_99) %>%
ggplot(aes(y =total_amount, x = pk_wday, color = pk_wday)) +
geom_boxplot() +
geom_text(data = pk_wday_median, aes(y= avg_wday, x= pk_wday, label= avg_wday),
position=position_nudge(y= -3), colour="black", size=3.5) +
stat_summary(fun.y=mean, colour="darkblue", geom="point",
shape=18, size=3, show.legend =FALSE) +
ggtitle("Boxplots & Mean of Taxi Fares by pick-up day of the week") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
I also want to see whether passenger count affects taxi fares. There appears to be little difference in fare between rides with different numbers of passengers.
nyc_taxi %>%
filter(passenger_count %in% c(1:6)) %>%
filter(total_amount <= total_amount_99) %>%
group_by(passenger_count) %>%
summarise(mean= mean(total_amount)) %>%
ggplot(aes(y = mean, x=passenger_count, fill = passenger_count)) +
geom_bar(stat = "identity") +
ggtitle("Average Taxi Fares by Number of Passengers") +
theme(plot.title = element_text(size = 12, hjust = 0.5))
Let’s get ready for some modeling!
First, let’s check the data types
str(nyc_taxi)
## 'data.frame': 951630 obs. of 17 variables:
## $ tpep_pickup_datetime : Factor w/ 943663 levels "2001-02-02 14:55:07",..: 897115 416959 661190 644670 524670 807291 194142 549687 13304 294695 ...
## $ tpep_dropoff_datetime: Factor w/ 943193 levels "2001-02-02 15:07:27",..: 896812 417011 660883 644391 524449 807066 194056 549478 13253 294636 ...
## $ passenger_count : Factor w/ 8 levels "1","2","3","4",..: 3 2 1 1 1 1 2 1 1 1 ...
## $ trip_distance : num 2.59 17.9 2.92 0.78 4.3 2.07 1.13 0.77 1.78 1.41 ...
## $ PULocationID : int 249 132 161 100 50 161 263 170 148 237 ...
## $ DOLocationID : int 163 230 211 164 12 137 140 229 107 236 ...
## $ payment_type : int 1 2 2 1 1 1 2 2 2 1 ...
## $ total_amount : num 31.6 61.4 21.8 17.2 20.3 ...
## $ puBorough : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 5 4 4 4 4 4 4 4 4 ...
## $ puZone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 247 128 157 97 48 157 261 166 144 235 ...
## $ doBorough : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doZone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 159 228 209 160 10 133 136 227 102 234 ...
## $ pk_hour : Factor w/ 24 levels "0","1","2","3",..: 23 21 17 14 8 16 10 12 7 23 ...
## $ pk_wday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 6 7 5 3 7 2 6 4 5 6 ...
## $ pk_month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 4 7 7 7 10 1 7 1 4 ...
## $ airports_trip : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
## $ rush_hour : Factor w/ 4 levels "Early Morning(3am-5am)",..: 3 4 2 2 4 2 4 4 4 3 ...
Note that pk_wday and pk_month are ordered factor variables. I converted them to unordered factors because the ordering does not represent a meaningful ranking for modeling purposes.
nyc_taxi$pk_wday <- factor(nyc_taxi$pk_wday, ordered = FALSE )
nyc_taxi$pk_month <- factor(nyc_taxi$pk_month, ordered = FALSE)
Standard-scale the numeric variable trip_distance
nyc_taxi$trip_distance_scaled <- scale(nyc_taxi$trip_distance, center = TRUE, scale = TRUE)
set.seed(123)
index_2 <- sample(1:nrow(nyc_taxi), 120000)
df_model <- nyc_taxi[index_2 , c("passenger_count", "trip_distance_scaled", "total_amount", "puBorough","doBorough", "pk_hour", "pk_wday", "airports_trip", "rush_hour") ]
# save the dataset
write.csv(df_model, file = "nyctaxi_df_for_model.csv")
Split the dataset into training and test sets
# Re-load data
df <- read.csv(file = "nyctaxi_df_for_model.csv")
# 80/20 split for training and validation
set.seed(12)
shuffle <- sample(nrow(df))
split <- round(nrow(df) * 0.8)
shuffled <- df[shuffle, ]
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(df), ]
Further split the training set to create a validation set
# 80/20 split for training and validation
set.seed(12)
shuffle <- sample(nrow(train))
split <- round(nrow(train) * 0.8)
shuffled <- train[shuffle, ]
train_set <- shuffled[1 : split, ]
val <- shuffled[(split + 1) : nrow(train), ]
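Since caret is already loaded, an alternative is createDataPartition(), which stratifies the split on quantiles of the outcome so train and test have similar fare distributions (a sketch; the resulting rows will not match the manual split above):

```r
# Sketch: 80/20 split with caret, stratified on the outcome variable.
set.seed(12)
train_idx <- createDataPartition(df$total_amount, p = 0.8, list = FALSE)
train_alt <- df[train_idx, ]
test_alt  <- df[-train_idx, ]
```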
head(train)
## X passenger_count trip_distance_scaled total_amount puBorough
## 119130 329956 3 -0.6240004 7.80 Manhattan
## 83191 174456 1 0.2552267 24.80 Manhattan
## 2222 19349 1 -0.3229317 11.80 Manhattan
## 42053 714597 1 -0.5041058 20.76 Manhattan
## 66082 374532 1 -0.2776382 12.95 Manhattan
## 118875 453929 1 -0.2216874 13.57 Manhattan
## doBorough pk_hour pk_wday airports_trip rush_hour
## 119130 Manhattan 10 Sat FALSE Other Hours
## 83191 Queens 7 Tue FALSE Other Hours
## 2222 Manhattan 20 Sun FALSE Other Hours
## 42053 Manhattan 21 Sat FALSE Other Hours
## 66082 Manhattan 5 Tue FALSE Early Morning(3am-5am)
## 118875 Manhattan 6 Fri FALSE Other Hours
#train linear model
lm_model <- lm(total_amount ~ passenger_count + trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
summary(lm_model)
##
## Call:
## lm(formula = total_amount ~ passenger_count + trip_distance_scaled +
## puBorough + doBorough + pk_wday + airports_trip + rush_hour,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.265 -2.327 -0.401 1.684 159.681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.773211 0.464660 23.185 < 2e-16 ***
## passenger_count 0.007267 0.014003 0.519 0.603772
## trip_distance_scaled 12.270246 0.027280 449.793 < 2e-16 ***
## puBoroughBrooklyn 0.519726 0.483835 1.074 0.282745
## puBoroughManhattan 2.210184 0.453001 4.879 1.07e-06 ***
## puBoroughQueens 0.694887 0.466174 1.491 0.136066
## puBoroughStaten Island -14.344327 4.817263 -2.978 0.002905 **
## doBoroughBrooklyn 2.971156 0.229195 12.963 < 2e-16 ***
## doBoroughEWR 32.333726 0.470098 68.781 < 2e-16 ***
## doBoroughManhattan 3.574108 0.216343 16.521 < 2e-16 ***
## doBoroughQueens 2.578734 0.230938 11.166 < 2e-16 ***
## doBoroughStaten Island 8.649402 1.153049 7.501 6.39e-14 ***
## pk_wdayMon -0.410407 0.064384 -6.374 1.85e-10 ***
## pk_wdaySat -1.005385 0.065084 -15.447 < 2e-16 ***
## pk_wdaySun -1.429337 0.067351 -21.222 < 2e-16 ***
## pk_wdayThu 0.224001 0.062076 3.609 0.000308 ***
## pk_wdayTue -0.155243 0.061407 -2.528 0.011470 *
## pk_wdayWed 0.072874 0.062021 1.175 0.240002
## airports_tripTRUE 2.090534 0.139266 15.011 < 2e-16 ***
## rush_hourMid_day(1pm-6pm) 2.422515 0.136903 17.695 < 2e-16 ***
## rush_hourMidnight(9pm-2am) 1.271482 0.141682 8.974 < 2e-16 ***
## rush_hourOther Hours 1.722493 0.134877 12.771 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.66 on 76778 degrees of freedom
## Multiple R-squared: 0.8798, Adjusted R-squared: 0.8797
## F-statistic: 2.675e+04 on 21 and 76778 DF, p-value: < 2.2e-16
What’s the model summary telling us? * The Adjusted R-squared of the linear regression model is 0.8797, which means the model (independent variables) explains about 87.97% of the variance in the data (target variable). * The p-value on the F-statistic is far below 5%, indicating this linear model fits the dataset. * Most of the coefficients of the linear model are significant (using a 5% threshold) except for passenger count, which has a p-value higher than 5%.
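The figures quoted above can also be pulled out of the fit object programmatically rather than read off the printed summary; a minimal sketch using the built-in mtcars data (a stand-in for train_set so the snippet runs standalone):

```r
# Extracting adjusted R-squared and the F-test p-value from an lm fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)
s$adj.r.squared                                   # adjusted R-squared
pf(s$fstatistic[1], s$fstatistic[2],
   s$fstatistic[3], lower.tail = FALSE)           # p-value on the F-statistic
```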
Let’s check the goodness of fit of the linear model by predicting on the validation set and computing the RMSE and MAPE.
# predict on the validation set
pred_val_lm <- predict(lm_model, val)
# compute RMSE and MAPE of the prediction on validation set
MAPE_val_lm <- MAPE(pred_val_lm , val$total_amount)
RMSE_val_lm <- sqrt(MSE(pred_val_lm , val$total_amount))
MAPE_val_lm
## [1] 0.185982
RMSE_val_lm
## [1] 4.725017
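For clarity, the two error metrics used throughout (MAPE and MSE from MLmetrics) reduce to one-liners in base R; on toy numbers:

```r
# Hand-rolled equivalents of MLmetrics::MAPE and sqrt(MLmetrics::MSE).
actual <- c(10, 20, 30)
pred   <- c(12, 18, 33)
mape <- mean(abs((actual - pred) / actual))   # mean absolute percentage error
rmse <- sqrt(mean((pred - actual)^2))         # root mean squared error
```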
Based on the results of the first model, linear regression with all independent variables, not all the variables are significant. I am going to apply stepwise selection to the linear model to see if there are any features we can drop.
Run a step model to see whether any of the variables could be dropped. Again, passenger count is the least important predictor here, as dropping it leads to only a very small decrease in AIC compared to the full model.
step.model <- step(lm_model, direction = "both")
## Start: AIC=236422.5
## total_amount ~ passenger_count + trip_distance_scaled + puBorough +
## doBorough + pk_wday + airports_trip + rush_hour
##
## Df Sum of Sq RSS AIC
## - passenger_count 1 6 1667454 236421
## <none> 1667448 236423
## - airports_trip 1 4894 1672342 236646
## - puBorough 4 6365 1673814 236707
## - rush_hour 3 14747 1682196 237093
## - pk_wday 6 22417 1689866 237436
## - doBorough 5 116744 1784193 241610
## - trip_distance_scaled 1 4393812 6061260 335540
##
## Step: AIC=236420.8
## total_amount ~ trip_distance_scaled + puBorough + doBorough +
## pk_wday + airports_trip + rush_hour
##
## Df Sum of Sq RSS AIC
## <none> 1667454 236421
## + passenger_count 1 6 1667448 236423
## - airports_trip 1 4896 1672351 236644
## - puBorough 4 6369 1673823 236706
## - rush_hour 3 14748 1682202 237091
## - pk_wday 6 22415 1689869 237434
## - doBorough 5 116755 1784209 241608
## - trip_distance_scaled 1 4394075 6061530 335542
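step() comes from base R's stats package, so the same both-directions search can be reproduced standalone on a built-in dataset (the formula below is illustrative, not the taxi model):

```r
# Stepwise selection by AIC on mtcars; trace = 0 suppresses the
# per-step tables like those printed above.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
slim <- step(full, direction = "both", trace = 0)
formula(slim)   # the surviving predictors
```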
summary(step.model)
##
## Call:
## lm(formula = total_amount ~ trip_distance_scaled + puBorough +
## doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.263 -2.328 -0.401 1.685 159.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.78403 0.46419 23.232 < 2e-16 ***
## trip_distance_scaled 12.27034 0.02728 449.809 < 2e-16 ***
## puBoroughBrooklyn 0.51960 0.48383 1.074 0.282857
## puBoroughManhattan 2.21037 0.45300 4.879 1.07e-06 ***
## puBoroughQueens 0.69461 0.46617 1.490 0.136219
## puBoroughStaten Island -14.35272 4.81721 -2.979 0.002888 **
## doBoroughBrooklyn 2.97107 0.22919 12.963 < 2e-16 ***
## doBoroughEWR 32.33440 0.47009 68.783 < 2e-16 ***
## doBoroughManhattan 3.57405 0.21634 16.520 < 2e-16 ***
## doBoroughQueens 2.57843 0.23094 11.165 < 2e-16 ***
## doBoroughStaten Island 8.65367 1.15301 7.505 6.20e-14 ***
## pk_wdayMon -0.41046 0.06438 -6.375 1.84e-10 ***
## pk_wdaySat -1.00486 0.06508 -15.441 < 2e-16 ***
## pk_wdaySun -1.42895 0.06735 -21.218 < 2e-16 ***
## pk_wdayThu 0.22381 0.06207 3.605 0.000312 ***
## pk_wdayTue -0.15555 0.06140 -2.533 0.011301 *
## pk_wdayWed 0.07267 0.06202 1.172 0.241328
## airports_tripTRUE 2.09104 0.13926 15.015 < 2e-16 ***
## rush_hourMid_day(1pm-6pm) 2.42315 0.13690 17.700 < 2e-16 ***
## rush_hourMidnight(9pm-2am) 1.27243 0.14167 8.982 < 2e-16 ***
## rush_hourOther Hours 1.72305 0.13487 12.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.66 on 76779 degrees of freedom
## Multiple R-squared: 0.8798, Adjusted R-squared: 0.8797
## F-statistic: 2.809e+04 on 20 and 76779 DF, p-value: < 2.2e-16
Train the linear regression without the passenger count variable. Note that this modification improves the model slightly, as it yields a marginally smaller RMSE.
lm_model_v2 <- lm(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
summary(lm_model_v2)
##
## Call:
## lm(formula = total_amount ~ trip_distance_scaled + puBorough +
## doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.263 -2.328 -0.401 1.685 159.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.78403 0.46419 23.232 < 2e-16 ***
## trip_distance_scaled 12.27034 0.02728 449.809 < 2e-16 ***
## puBoroughBrooklyn 0.51960 0.48383 1.074 0.282857
## puBoroughManhattan 2.21037 0.45300 4.879 1.07e-06 ***
## puBoroughQueens 0.69461 0.46617 1.490 0.136219
## puBoroughStaten Island -14.35272 4.81721 -2.979 0.002888 **
## doBoroughBrooklyn 2.97107 0.22919 12.963 < 2e-16 ***
## doBoroughEWR 32.33440 0.47009 68.783 < 2e-16 ***
## doBoroughManhattan 3.57405 0.21634 16.520 < 2e-16 ***
## doBoroughQueens 2.57843 0.23094 11.165 < 2e-16 ***
## doBoroughStaten Island 8.65367 1.15301 7.505 6.20e-14 ***
## pk_wdayMon -0.41046 0.06438 -6.375 1.84e-10 ***
## pk_wdaySat -1.00486 0.06508 -15.441 < 2e-16 ***
## pk_wdaySun -1.42895 0.06735 -21.218 < 2e-16 ***
## pk_wdayThu 0.22381 0.06207 3.605 0.000312 ***
## pk_wdayTue -0.15555 0.06140 -2.533 0.011301 *
## pk_wdayWed 0.07267 0.06202 1.172 0.241328
## airports_tripTRUE 2.09104 0.13926 15.015 < 2e-16 ***
## rush_hourMid_day(1pm-6pm) 2.42315 0.13690 17.700 < 2e-16 ***
## rush_hourMidnight(9pm-2am) 1.27243 0.14167 8.982 < 2e-16 ***
## rush_hourOther Hours 1.72305 0.13487 12.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.66 on 76779 degrees of freedom
## Multiple R-squared: 0.8798, Adjusted R-squared: 0.8797
## F-statistic: 2.809e+04 on 20 and 76779 DF, p-value: < 2.2e-16
Check the goodness of fit of the second model
# predict on the validation set
pred_val_lm_2 <- predict(lm_model_v2, val)
# compute RMSE and MAPE of the prediction on validation set
MAPE_val_lm_2 <- MAPE(pred_val_lm_2 , val$total_amount)
RMSE_val_lm_2 <- sqrt(MSE(pred_val_lm_2 , val$total_amount))
MAPE_val_lm_2
## [1] 0.1859735
RMSE_val_lm_2
## [1] 4.724935
# decision tree using rpart
tree <- rpart(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
data = train_set, method = "anova", control=rpart.control(minsplit=100) )
The decision tree model used only two variables, trip distance and drop-off borough, to split the data, while trip distance and airport trip rank as the top two in terms of variable importance.
fancyRpartPlot(tree)
Check feature importance
printcp(tree)
##
## Regression tree:
## rpart(formula = total_amount ~ trip_distance_scaled + puBorough +
## doBorough + pk_wday + airports_trip + rush_hour, data = train_set,
## method = "anova", control = rpart.control(minsplit = 100))
##
## Variables actually used in tree construction:
## [1] doBorough trip_distance_scaled
##
## Root node error: 13867605/76800 = 180.57
##
## n= 76800
##
## CP nsplit rel error xerror xstd
## 1 0.649825 0 1.00000 1.00004 0.0123804
## 2 0.098677 1 0.35018 0.35069 0.0051532
## 3 0.068593 2 0.25150 0.25239 0.0048994
## 4 0.016607 3 0.18291 0.18417 0.0041534
## 5 0.015781 4 0.16630 0.16582 0.0041885
## 6 0.010709 5 0.15052 0.15179 0.0041152
## 7 0.010000 6 0.13981 0.14361 0.0038744
tree$variable.importance
## trip_distance_scaled airports_trip puBorough
## 11785705 5528916 3218144
## doBorough
## 1055740
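The same rpart recipe is runnable standalone on a built-in dataset (minsplit here is an illustrative value, not tuned):

```r
library(rpart)
# Regression tree with the anova method, mirroring the call above.
fit_tree <- rpart(mpg ~ wt + hp + cyl, data = mtcars,
                  method = "anova", control = rpart.control(minsplit = 10))
fit_tree$variable.importance   # same slot inspected above
```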
While the decision tree model is easier to interpret, its predictions of taxi fares on the validation set are not as good as the linear regression model's, as the decision tree model has a higher RMSE.
# predict on the validation set
pred_val_dt <- predict(tree, val)
# compute RMSE and MAPE of the prediction on validation set
MAPE_val_dt <- MAPE(pred_val_dt , val$total_amount)
RMSE_val_dt <- sqrt(MSE(pred_val_dt , val$total_amount))
MAPE_val_dt
## [1] 0.2029875
RMSE_val_dt
## [1] 5.11615
The random forest model shows that trip distance is by far the most important variable, followed by rush hour, pick-up borough, and drop-off borough.
set.seed(123)
rf_model <- randomForest(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
data = train_set, importance = TRUE)
saveRDS(rf_model, "nyctaxi_model_rf.rds")
#rf_model <- readRDS("~/Folders/R/NYC_Taxi_Project/nyctaxi_model_rf.rds")
Model summary
print(rf_model)
##
## Call:
## randomForest(formula = total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour, data = train_set, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 18.96557
## % Var explained: 89.5
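The "% Var explained" figure can be sanity-checked from numbers already printed: randomForest reports roughly 1 - OOB MSE / var(y). Using the OOB MSE above and the per-observation root node error from the rpart output (13867605 / 76800) as the variance estimate:

```r
# Sanity check: % Var explained ~= 100 * (1 - OOB MSE / variance of target).
mse_oob <- 18.96557          # "Mean of squared residuals" printed above
var_y   <- 13867605 / 76800  # root node error per observation (rpart output)
pct_var <- 100 * (1 - mse_oob / var_y)
pct_var                      # close to the 89.5 reported above
```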
Check feature importance
varImp(rf_model)
## Overall
## trip_distance_scaled 67.57463
## puBorough 27.87959
## doBorough 26.74395
## pk_wday 19.24118
## airports_trip 19.45401
## rush_hour 29.56983
varImpPlot(rf_model,type=1)
Take a look at how well the random forest model predicts on the validation set. The RMSE of the random forest model is lower than that of the decision tree model.
# predict on the validation set
pred_val_rf <- predict(rf_model, val)
# compute RMSE and MAPE of the prediction on validation set
MAPE_val_rf <- MAPE(pred_val_rf , val$total_amount)
RMSE_val_rf <- sqrt(MSE(pred_val_rf , val$total_amount))
MAPE_val_rf
## [1] 0.1876614
RMSE_val_rf
## [1] 4.421974
rf_cv <- train(
total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
train_set, method = "ranger",
trControl = trainControl( method = "cv", number = 5, verboseIter = FALSE))
## Growing trees.. Progress: 84%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 86%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 82%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 83%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 9 seconds.
## Growing trees.. Progress: 84%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 83%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 75%. Estimated remaining time: 10 seconds.
## Growing trees.. Progress: 81%. Estimated remaining time: 7 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 9 seconds.
saveRDS(rf_cv, "rf_cv_model.rds")
#cv_model <- readRDS("~/Folders/R/NYC_Taxi_Project/model_cv.rds")
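What trainControl(method = "cv", number = 5) automates can be sketched as a manual loop in base R; a minimal version with a plain lm on mtcars (illustrative, not the taxi model):

```r
# Minimal 5-fold cross-validation: fit on 4 folds, score RMSE on the 5th.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
rmse_folds <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})
mean(rmse_folds)   # cross-validated RMSE estimate
```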
Check model summary
print(rf_cv)
## Random Forest
##
## 76800 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 61441, 61440, 61441, 61440, 61438
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 6.843012 0.7849198 4.773034
## 2 extratrees 7.610567 0.7255375 5.386164
## 11 variance 4.278346 0.8986082 2.668861
## 11 extratrees 4.271784 0.8992138 2.676828
## 21 variance 4.466950 0.8897395 2.801595
## 21 extratrees 4.411736 0.8923715 2.774423
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 11, splitrule =
## extratrees and min.node.size = 5.
print(rf_cv$finalModel)
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
##
## Type: Regression
## Number of trees: 500
## Sample size: 76800
## Number of independent variables: 21
## Mtry: 11
## Target node size: 5
## Variable importance mode: none
## Splitrule: extratrees
## OOB prediction error (MSE): 18.14932
## R squared (OOB): 0.8994888
Use the cross-validated random forest model to predict on the validation set and estimate the accuracy of the predictions. The random forest model with cross-validation yields the lowest RMSE, making it the best of the five models.
# predict on the validation set
pred_val_rfcv <- predict(rf_cv, val)
# compute RMSE and MAPE of the prediction on validation set
MAPE_val_rfcv <- MAPE(pred_val_rfcv , val$total_amount)
RMSE_val_rfcv <- sqrt(MSE(pred_val_rfcv , val$total_amount))
MAPE_val_rfcv
## [1] 0.1760098
RMSE_val_rfcv
## [1] 4.338317
A summary of the validation errors for all five models. Note that the random forest model with cross-validation has the lowest RMSE and MAPE, and is therefore the best model so far.
modelname <- c("linear model","linear modelv2", "decision tree","random forest","random forest with cv")
modelRMSE <- c(RMSE_val_lm, RMSE_val_lm_2, RMSE_val_dt , RMSE_val_rf , RMSE_val_rfcv)
modelMAPE <- c(MAPE_val_lm, MAPE_val_lm_2, MAPE_val_dt , MAPE_val_rf , MAPE_val_rfcv)
val_error <- as.data.frame( cbind(modelname,modelRMSE , modelMAPE) )
colnames(val_error) <- c("Models", "RMSE","MAPE")
val_error
## Models RMSE MAPE
## 1 linear model 4.72501668693151 0.185981961819124
## 2 linear modelv2 4.72493547559144 0.185973510098375
## 3 decision tree 5.11614991470542 0.202987507884024
## 4 random forest 4.42197421988426 0.187661375033456
## 5 random forest with cv 4.33831742298959 0.176009821749615
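One caveat with the table above: cbind() on mixed types coerces everything to character, which is why the RMSE and MAPE columns print as full-precision strings. Building the frame with data.frame() directly keeps them numeric; a sketch with placeholder values:

```r
# data.frame() preserves column types, unlike as.data.frame(cbind(...)).
scores <- data.frame(
  Models = c("linear model", "random forest with cv"),
  RMSE   = c(4.7250, 4.3383),
  MAPE   = c(0.1860, 0.1760)
)
round(scores$RMSE, 2)   # numeric columns can be formatted directly
```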
Apply the best model (the fifth model, random forest with cross-validation) to predict on the test set
# predict on the test set
pred_test_rfcv <- predict(rf_cv, test)
# compute RMSE and MAPE of the prediction on the test set
MAPE_test_rfcv <- MAPE(pred_test_rfcv , test$total_amount)
RMSE_test_rfcv <- sqrt(MSE(pred_test_rfcv , test$total_amount))
MAPE_test_rfcv
## [1] 0.1744527
RMSE_test_rfcv
## [1] 4.152667
Throughout data exploration, visualization, and modeling, I consistently found that trip distance is the most important factor in taxi fares, which makes sense as fares are metered by mileage. Still, there are things taxi drivers can do differently to increase their earnings: they can pay attention to factors like pick-up and drop-off locations, whether the trip is to or from an airport, the time block during the day, and the day of the week.