Load libraries

library("formattable")
library("tidyverse")
library("ggplot2")
library("ggmap")
library("lubridate")
library("caret")
library("ranger")
library("randomForest")
library("ggridges")
library("rpart")
library("rattle")
library("rpart.plot")
library("RColorBrewer")
library("MASS")
library("tree")
library("dplyr")
library("quantreg")
library("MLmetrics")

Project Goals

The first goal of this project is to identify trends and patterns in New York City taxi fares. If we discover significant findings in the data, we can recommend to taxi drivers ways to boost their daily earnings. The second goal is to see whether we can build regression models that predict taxi fares from a number of factors (time, date, pickup and drop-off locations, etc.).

Hypothesis

NYC taxi fares are affected by pick-up and drop-off locations, passenger count, trip distance, pick-up and drop-off time during the day, day of the week, and whether the trip is an airport trip. We will examine this hypothesis through exploratory data analysis and by building linear models.

Load Data

First, we load the datasets of NYC Yellow Taxi Trips for January, April, July, and October 2019. The datasets were sourced from the nyc.gov website link (or copy-paste the URL into your browser: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The data dictionary can be found on the same page (Link).

#loading the datasets of NYC Yellow Taxi Trips for January, April, July, October 2019 and combining them into a single data frame.
yel_jan <- read.csv("yellow_tripdata_2019-01.csv")
yel_apl <- read.csv("yellow_tripdata_2019-04.csv")
yel_jul <- read.csv("yellow_tripdata_2019-07.csv")
yel_oct <- read.csv("yellow_tripdata_2019-10.csv")
all <- rbind(yel_jan, yel_apl, yel_jul, yel_oct)

Data Overview

A quick overview of the dataset:
  • Check dataset size (note the dataset has 28.6M rows and 18 columns)

dim(all)
## [1] 28625241       18
  • Preview dataframe
head(all)
##   VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 1        1  2019-01-01 00:46:40   2019-01-01 00:53:20               1
## 2        1  2019-01-01 00:59:47   2019-01-01 01:18:59               1
## 3        2  2018-12-21 13:48:30   2018-12-21 13:52:40               3
## 4        2  2018-11-28 15:52:25   2018-11-28 15:55:45               5
## 5        2  2018-11-28 15:56:57   2018-11-28 15:58:33               5
## 6        2  2018-11-28 16:25:49   2018-11-28 16:28:26               5
##   trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID
## 1           1.5          1                  N          151          239
## 2           2.6          1                  N          239          246
## 3           0.0          1                  N          236          236
## 4           0.0          1                  N          193          193
## 5           0.0          2                  N          193          193
## 6           0.0          1                  N          193          193
##   payment_type fare_amount extra mta_tax tip_amount tolls_amount
## 1            1         7.0   0.5     0.5       1.65         0.00
## 2            1        14.0   0.5     0.5       1.00         0.00
## 3            1         4.5   0.5     0.5       0.00         0.00
## 4            2         3.5   0.5     0.5       0.00         0.00
## 5            2        52.0   0.0     0.5       0.00         0.00
## 6            2         3.5   0.5     0.5       0.00         5.76
##   improvement_surcharge total_amount congestion_surcharge
## 1                   0.3         9.95                   NA
## 2                   0.3        16.30                   NA
## 3                   0.3         5.80                   NA
## 4                   0.3         7.55                   NA
## 5                   0.3        55.55                   NA
## 6                   0.3        13.31                   NA
  • Inspect columns in the dataset
str(all)
## 'data.frame':    28625241 obs. of  18 variables:
##  $ VendorID             : int  1 1 2 2 2 2 2 1 1 1 ...
##  $ tpep_pickup_datetime : Factor w/ 8808859 levels "2001-02-02 14:55:07",..: 3046 3826 87 76 77 78 79 1558 2181 3692 ...
##  $ tpep_dropoff_datetime: Factor w/ 8820499 levels "2001-02-02 15:07:27",..: 2804 4320 87 76 77 78 79 1365 2356 3760 ...
##  $ passenger_count      : int  1 1 3 5 5 5 5 1 1 2 ...
##  $ trip_distance        : num  1.5 2.6 0 0 0 0 0 1.3 3.7 2.1 ...
##  $ RatecodeID           : int  1 1 1 1 2 1 2 1 1 1 ...
##  $ store_and_fwd_flag   : Factor w/ 3 levels "N","Y","": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PULocationID         : int  151 239 236 193 193 193 193 163 229 141 ...
##  $ DOLocationID         : int  239 246 236 193 193 193 193 229 7 234 ...
##  $ payment_type         : int  1 1 1 2 2 2 2 1 1 1 ...
##  $ fare_amount          : num  7 14 4.5 3.5 52 3.5 52 6.5 13.5 10 ...
##  $ extra                : num  0.5 0.5 0.5 0.5 0 0.5 0 0.5 0.5 0.5 ...
##  $ mta_tax              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ tip_amount           : num  1.65 1 0 0 0 0 0 1.25 3.7 1.7 ...
##  $ tolls_amount         : num  0 0 0 0 0 5.76 0 0 0 0 ...
##  $ improvement_surcharge: num  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
##  $ total_amount         : num  9.95 16.3 5.8 7.55 55.55 ...
##  $ congestion_surcharge : num  NA NA NA NA NA NA NA NA NA NA ...

Randomly sample a subset of the data

Considering data size and running speed, I randomly sampled 1 million rows from the all dataset for EDA (Exploratory Data Analysis).

set.seed(123)
index <- sample(x = 1:nrow(all), size= 1000000)

Retain only the following columns, which are relevant to taxi fare

sampled <- all[index , c("tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "trip_distance", "PULocationID", "DOLocationID","payment_type", "total_amount")]

#Save the sampled dataset
write.csv(sampled, file="NYC Taxi Fare Data Sample.csv", row.names=FALSE)

Load back the sampled dataset

nyc_taxi <- read.csv("NYC Taxi Fare Data Sample.csv")

I am going to use this sampled dataset to perform Exploratory Data Analysis and to build prediction models to test my hypothesis above.

Also loading the Taxi Zone Lookup Table from the same nyc.gov webpage link.

#Loading the taxi zone lookup table, which will be used for location reference
taxi_zone <- read.csv("taxi_zone_lookup.csv")

After reviewing the summary of the taxi_zone table, I found that two LocationIDs (264 and 265) are recorded with an unknown borough. We will exclude rows with these two LocationIDs later in the cleaning step.

unknown_loc <- taxi_zone[taxi_zone$Borough == "Unknown", ]$LocationID
  • Data Summary (check min, max, missing values, any noisy data)

Obtain a quick understanding of the dataset. Here are a few issues that I found:
  • There are negative values in total_amount, and the max value of 36090.30 is extreme
  • Trips with zero passengers or more than 6 passengers
  • NA values in two columns, passenger_count and payment_type (the same number of NAs in both)

summary(nyc_taxi)
##           tpep_pickup_datetime         tpep_dropoff_datetime
##  2019-10-22 17:44:55:     5    2019-04-15 00:00:00:     6   
##  2019-01-01 19:43:36:     4    2019-07-03 00:00:00:     6   
##  2019-01-04 19:37:06:     4    2019-07-24 00:00:00:     6   
##  2019-01-05 11:50:14:     4    2019-07-27 00:00:00:     6   
##  2019-01-05 19:01:13:     4    2019-01-18 00:00:00:     5   
##  2019-01-05 20:36:15:     4    2019-01-19 19:13:30:     5   
##  (Other)            :999975    (Other)            :999966   
##  passenger_count trip_distance      PULocationID    DOLocationID  
##  Min.   :0.000   Min.   :  0.000   Min.   :  1.0   Min.   :  1.0  
##  1st Qu.:1.000   1st Qu.:  0.970   1st Qu.:125.0   1st Qu.:112.0  
##  Median :1.000   Median :  1.600   Median :162.0   Median :162.0  
##  Mean   :1.563   Mean   :  2.964   Mean   :163.6   Mean   :161.8  
##  3rd Qu.:2.000   3rd Qu.:  3.000   3rd Qu.:233.0   3rd Qu.:234.0  
##  Max.   :9.000   Max.   :183.680   Max.   :265.0   Max.   :265.0  
##  NA's   :2829                                                     
##   payment_type    total_amount     
##  Min.   :1.000   Min.   : -224.80  
##  1st Qu.:1.000   1st Qu.:   10.55  
##  Median :1.000   Median :   14.12  
##  Mean   :1.292   Mean   :   18.42  
##  3rd Qu.:2.000   3rd Qu.:   20.13  
##  Max.   :4.000   Max.   :36090.30  
##  NA's   :2829

Inspect missing data. Note that the missing values in passenger_count and payment_type occur in the same rows.

apply(nyc_taxi, 2, function(x) {sum(is.na(x))})
##  tpep_pickup_datetime tpep_dropoff_datetime       passenger_count 
##                     0                     0                  2829 
##         trip_distance          PULocationID          DOLocationID 
##                     0                     0                     0 
##          payment_type          total_amount 
##                  2829                     0
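The claim that the NAs fall on the same rows can be checked directly; a minimal sketch, assuming `nyc_taxi` is loaded as above:

```r
# If the two columns are missing on exactly the same rows, the count of
# jointly-missing rows equals the per-column NA count (2829 here)
sum(is.na(nyc_taxi$passenger_count) & is.na(nyc_taxi$payment_type))
```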

Remove rows with missing values

nyc_taxi <- nyc_taxi %>% drop_na()
dim(nyc_taxi)
## [1] 997171      8

Data Cleaning

Remove unusual data or outliers

Here I performed the following data cleaning steps:

  1. Remove taxi trips for which the pickup or drop-off location was unknown (filtering out the two unknown LocationIDs).

  2. Remove rows with zero passengers, zero trip distance, or zero/negative total amount.

  3. Keep only taxi trips paid by credit card or cash, excluding trips recorded as no charge, dispute, unknown, or voided. (payment_type: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip; from the Data Dictionary – Yellow Taxi Trip Records link)

nyc_taxi <- nyc_taxi %>% 
  filter((! PULocationID %in% unknown_loc) & (!DOLocationID %in% unknown_loc)) %>%
  filter(passenger_count != 0 & total_amount > 0  & trip_distance != 0 ) %>% 
  filter(payment_type == 1 | payment_type == 2)

Check the dataset summary again after removing outliers. Confirm that all values (min, max, quantiles) are now more reasonable and we can move on to feature engineering.

summary(nyc_taxi)
##           tpep_pickup_datetime         tpep_dropoff_datetime
##  2019-10-22 17:44:55:     5    2019-07-03 00:00:00:     6   
##  2019-01-01 19:43:36:     4    2019-07-24 00:00:00:     6   
##  2019-01-04 19:37:06:     4    2019-07-27 00:00:00:     6   
##  2019-01-05 11:50:14:     4    2019-01-19 19:13:30:     5   
##  2019-01-05 19:01:13:     4    2019-04-13 00:00:00:     5   
##  2019-01-05 20:36:15:     4    2019-04-15 00:00:00:     5   
##  (Other)            :951605    (Other)            :951597   
##  passenger_count trip_distance     PULocationID    DOLocationID  
##  Min.   :1.000   Min.   : 0.010   Min.   :  1.0   Min.   :  1.0  
##  1st Qu.:1.000   1st Qu.: 0.990   1st Qu.:116.0   1st Qu.:107.0  
##  Median :1.000   Median : 1.620   Median :162.0   Median :162.0  
##  Mean   :1.596   Mean   : 2.942   Mean   :162.7   Mean   :160.6  
##  3rd Qu.:2.000   3rd Qu.: 3.000   3rd Qu.:233.0   3rd Qu.:233.0  
##  Max.   :9.000   Max.   :77.540   Max.   :263.0   Max.   :263.0  
##                                                                  
##   payment_type    total_amount   
##  Min.   :1.000   Min.   :  0.30  
##  1st Qu.:1.000   1st Qu.: 10.56  
##  Median :1.000   Median : 14.12  
##  Mean   :1.276   Mean   : 18.12  
##  3rd Qu.:2.000   3rd Qu.: 19.80  
##  Max.   :2.000   Max.   :403.30  
## 

Feature engineering

Since location IDs are not easily interpretable, we will convert each location ID to a location name. Perform a “VLOOKUP”-style join against the taxi_zone table to identify the borough where each taxi trip began.

nyc_taxi <- nyc_taxi %>% 
  left_join(taxi_zone[, c(1,2,3)], by = c("PULocationID" =  "LocationID"))

nyc_taxi <- nyc_taxi %>% 
  rename("puBorough" = "Borough", "puZone" = "Zone")    #puBorough = pick-up borough

Perform another “VLOOKUP”-style join to identify the borough where each taxi trip ended.

nyc_taxi <- nyc_taxi %>% 
  left_join(taxi_zone[, c(1,2,3)], by = c("DOLocationID" =  "LocationID"))

nyc_taxi <- nyc_taxi %>% 
  rename("doBorough" = "Borough", "doZone" = "Zone")      #doBorough = drop-off borough
head(nyc_taxi)
##   tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
## 1  2019-10-25 22:36:03   2019-10-25 23:11:50               3          2.59
## 2  2019-04-20 20:04:44   2019-04-20 20:45:45               2         17.90
## 3  2019-07-25 16:43:39   2019-07-25 17:10:24               1          2.92
## 4  2019-07-23 13:08:56   2019-07-23 13:25:46               1          0.78
## 5  2019-07-06 07:31:05   2019-07-06 07:44:01               1          4.30
## 6  2019-10-14 15:31:31   2019-10-14 15:43:18               1          2.07
##   PULocationID DOLocationID payment_type total_amount puBorough
## 1          249          163            1        31.62 Manhattan
## 2          132          230            2        61.42    Queens
## 3          161          211            2        21.80 Manhattan
## 4          100          164            1        17.16 Manhattan
## 5           50           12            1        20.30 Manhattan
## 6          161          137            1        16.62 Manhattan
##             puZone doBorough                    doZone
## 1     West Village Manhattan             Midtown North
## 2      JFK Airport Manhattan Times Sq/Theatre District
## 3   Midtown Center Manhattan                      SoHo
## 4 Garment District Manhattan             Midtown South
## 5     Clinton West Manhattan              Battery Park
## 6   Midtown Center Manhattan                  Kips Bay

Create new features that record at what time during the day and on which day during the week each taxi trip was taken:
  • pickup hour
  • pickup day of the week
  • pickup month

nyc_taxi <- nyc_taxi %>% 
  mutate( pk_hour = hour(tpep_pickup_datetime),
          pk_wday = wday(tpep_pickup_datetime,  label = TRUE),
          pk_month = month(tpep_pickup_datetime, label = TRUE))

Convert pk_hour and passenger_count to factor variables

nyc_taxi$pk_hour <- as.factor(nyc_taxi$pk_hour)
nyc_taxi$passenger_count <- as.factor(nyc_taxi$passenger_count) 

To classify whether a taxi trip went to or came from an airport, we first identify the airport LocationIDs from the taxi zone table.

airportID <- taxi_zone[str_detect(taxi_zone$Zone, "Airport"), ]$LocationID
airportID <- airportID[! is.na(airportID)]
airportID
## [1]   1 132 138

Create a new column named airports_trip, where TRUE indicates an airport trip.

nyc_taxi <- nyc_taxi %>% 
  mutate(airports_trip = (PULocationID %in% airportID | DOLocationID  %in% airportID ))
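As a quick sanity check (a sketch; the exact counts depend on the random sample), we can tabulate the new flag:

```r
# How many trips touch an airport vs not, and the airport share
table(nyc_taxi$airports_trip)
mean(nyc_taxi$airports_trip)   # proportion of airport trips
```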

Exploratory Data Analysis

head(nyc_taxi)
##   tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
## 1  2019-10-25 22:36:03   2019-10-25 23:11:50               3          2.59
## 2  2019-04-20 20:04:44   2019-04-20 20:45:45               2         17.90
## 3  2019-07-25 16:43:39   2019-07-25 17:10:24               1          2.92
## 4  2019-07-23 13:08:56   2019-07-23 13:25:46               1          0.78
## 5  2019-07-06 07:31:05   2019-07-06 07:44:01               1          4.30
## 6  2019-10-14 15:31:31   2019-10-14 15:43:18               1          2.07
##   PULocationID DOLocationID payment_type total_amount puBorough
## 1          249          163            1        31.62 Manhattan
## 2          132          230            2        61.42    Queens
## 3          161          211            2        21.80 Manhattan
## 4          100          164            1        17.16 Manhattan
## 5           50           12            1        20.30 Manhattan
## 6          161          137            1        16.62 Manhattan
##             puZone doBorough                    doZone pk_hour pk_wday
## 1     West Village Manhattan             Midtown North      22     Fri
## 2      JFK Airport Manhattan Times Sq/Theatre District      20     Sat
## 3   Midtown Center Manhattan                      SoHo      16     Thu
## 4 Garment District Manhattan             Midtown South      13     Tue
## 5     Clinton West Manhattan              Battery Park       7     Sat
## 6   Midtown Center Manhattan                  Kips Bay      15     Mon
##   pk_month airports_trip
## 1      Oct         FALSE
## 2      Apr          TRUE
## 3      Jul         FALSE
## 4      Jul         FALSE
## 5      Jul         FALSE
## 6      Oct         FALSE

Histogram of Taxi Fares

The distribution of taxi fares is highly right-skewed, which means the dataset contains many outliers.

ggplot(nyc_taxi, aes(x = total_amount)) +
  geom_histogram(fill = "blue", bins =  50) +
  ggtitle("Histogram of NYC Taxi Fares") + 
  theme(plot.title = element_text(hjust = 0.5))

While taking the logarithm of the target variable can help it conform to normality, I found the resulting plots hard to interpret. I opted instead to exclude outliers from the plots.
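For reference, a log-transformed view is a one-line change with scale_x_log10() (a sketch; safe here because zero and negative fares were removed during cleaning):

```r
# Same histogram on a log10 x-axis to pull in the long right tail
ggplot(nyc_taxi, aes(x = total_amount)) +
  geom_histogram(fill = "blue", bins = 50) +
  scale_x_log10() +
  ggtitle("Histogram of NYC Taxi Fares (log10 scale)") +
  theme(plot.title = element_text(hjust = 0.5))
```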

Histogram of Taxi Fares (first 99.99 percentile)

To “zoom in” on the distribution, I filtered the dataset for taxi fares (total_amount) below the 99.99th percentile ($136.37). This gives a better view of the distribution of the vast majority (99.99%) of the data in our sample. I also highlighted the median taxi fare in the plot.

total_amount_99 <-quantile(nyc_taxi$total_amount, .9999)
total_amount_99
##  99.99% 
## 136.367
median_fare <- nyc_taxi %>% 
  summarise(median_fare = median(total_amount))
median_fare
##   median_fare
## 1       14.12
nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  ggplot(aes(x = total_amount)) +
  geom_histogram(color = "white", fill = "blue", bins =  50) +
  geom_vline(xintercept = median_fare$median_fare,linetype="dashed", color = "black", size=1) +
  annotate(geom="text", x=30, y=140000, label="median = 14.12", color="black") +
  ggtitle("Histogram of NYC Taxi Fares (median = 14.12)") + 
  theme(plot.title = element_text(hjust = 0.5))

I apply this 99.99th-percentile filter on total_amount (taxi fare) throughout the following exploratory analysis as a way to eliminate noise (outliers), which makes the general trends in taxi fares easier to identify.
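To avoid repeating the same filter in each plot below, one could materialize the trimmed data once (a sketch; `nyc_taxi_trim` is an illustrative name not used elsewhere in this analysis):

```r
# Keep the 99.99% of trips below the fare cutoff computed above
nyc_taxi_trim <- nyc_taxi %>%
  filter(total_amount < total_amount_99)
```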

Taxi fares by different pick-up boroughs

Next I plotted the density of taxi fares by pick-up borough (see below). The plot shows clear differences in taxi fares between pickup locations. Though the median taxi fare of Newark (EWR) pick-ups is the highest among all boroughs, the distribution of EWR-pick-up fares is also the most widely spread out. Fares for Staten Island pick-ups follow a similar pattern to EWR.

The taxi fares of the other four boroughs (Queens, Manhattan, Brooklyn, and the Bronx) follow approximately normal distributions. Taxi trips starting from Queens have a median fare roughly three times that of trips from Manhattan, Brooklyn, or the Bronx.

I also computed the number of taxi trips starting in each borough and its percentage of the total. It is worth noting that Manhattan alone accounts for 92.2% of all taxi trips in New York City, followed by Queens (6.56%) and Brooklyn (1.06%). The remaining three boroughs see very few taxi trips.

pu_median <- nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  group_by(puBorough) %>% 
  summarize(median = median(total_amount), num_trips = n()) %>% 
  mutate(percentage = round(num_trips/sum(num_trips), digits=4))
pu_median
## # A tibble: 6 x 4
##   puBorough     median num_trips percentage
##   <fct>          <dbl>     <int>      <dbl>
## 1 Bronx           17.3      1343     0.0014
## 2 Brooklyn        15.8     10065     0.0106
## 3 EWR             85.3        15     0     
## 4 Manhattan       13.6    877632     0.922 
## 5 Queens          46.7     62446     0.0656
## 6 Staten Island   49.8        33     0
# Density Plot : Taxi fares by different pick-up boroughs
nyc_taxi %>%  
  filter(total_amount < total_amount_99) %>% 
  ggplot(aes(x =total_amount, y = puBorough, fill = puBorough)) +
  geom_density_ridges(scale = 1, quantile_lines = TRUE, quantiles = 2) +
  ggtitle("Density and Median Taxi Fares by Pick-up Boroughs") +
  geom_text(data = pu_median, aes(x= median, y= puBorough, label= median), 
            position=position_nudge(y=-0.1), colour="black", size=3.5) +
  theme(plot.title = element_text(size = 12, hjust = 0.5))
## Picking joint bandwidth of 6.43

Taxi fares by different pick-up zones

Then, I want to see how taxi fares differ across zones. Compute the mean and median taxi fare and the count of taxi trips by pick-up borough and, within each borough, by pick-up zone.

puZone_summary <- nyc_taxi %>%
  group_by(puBorough, puZone) %>% 
  summarise(avg_taxi_fare_puzone = mean(total_amount),median_taxi_fare_puzone = median(total_amount), num_trips = n()) %>% 
  arrange(desc(avg_taxi_fare_puzone)) 
head(puZone_summary,20)
## # A tibble: 20 x 5
## # Groups:   puBorough [6]
##    puBorough   puZone        avg_taxi_fare_pu… median_taxi_fare_… num_trips
##    <fct>       <fct>                     <dbl>              <dbl>     <int>
##  1 EWR         Newark Airpo…             104.                96.6        18
##  2 Staten Isl… Port Richmond              99.5               99.5         1
##  3 Staten Isl… Heartland Vi…              95.0               95.0         2
##  4 Staten Isl… Arden Heights              84.4               93.9         7
##  5 Brooklyn    Marine Park/…              75.7               73.7         3
##  6 Queens      Jamaica Bay                73.7               73.7         1
##  7 Staten Isl… Bloomfield/E…              66.4               86.8         9
##  8 Staten Isl… South Beach/…              65.8               65.8         1
##  9 Bronx       Pelham Bay                 64.8               79.4         5
## 10 Queens      Flushing Mea…              63.9               70.3       233
## 11 Staten Isl… Saint George…              62.9               49.8         3
## 12 Queens      Baisley Park               61.1               67.3       305
## 13 Queens      South Jamaica              60.5               66.4       139
## 14 Manhattan   Randalls Isl…              58.7               68.8        46
## 15 Queens      Far Rockaway               58.3               63.9        13
## 16 Queens      Springfield …              56.0               61.9        79
## 17 Queens      JFK Airport                55.3               61.4     29116
## 18 Queens      South Ozone …              52.5               61.4       112
## 19 Queens      Jamaica                    50.4               61.4       148
## 20 Queens      Astoria Park               50.2               59.4        12

The summary table shows that most of the zones with the highest average taxi fares actually have very few trips. This indicates that zones with high fares are not necessarily the most profitable choice for taxi drivers once the volume of demand is considered. Since the dataset covers New York Yellow Cab trips, pick-ups from Newark Airport are very rare, while drop-offs at Newark Airport are far more common (presented below). Given the distance from NYC to Newark Airport, a New York taxi driver is much more likely to pick up a passenger within NYC and drop her off at Newark Airport. Queens is a promising choice, accounting for 10 of the top 20 zones in this ranking; the median fare for Queens pick-ups is $46.7.
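One way to make this fare-versus-demand trade-off explicit is to weight each zone's average fare by its trip count (a hedged sketch; `revenue_proxy` is an illustrative column name, not part of the source data):

```r
# Approximate revenue per pick-up zone: average fare x number of trips
puZone_summary %>%
  mutate(revenue_proxy = avg_taxi_fare_puzone * num_trips) %>%
  arrange(desc(revenue_proxy)) %>%
  head(10)
```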

Then I filtered for zones with more than 100 trips to focus on areas with meaningful taxi demand, and sorted the summary table by average taxi fare from highest to lowest.

puZone_summary <- puZone_summary %>%
  filter(num_trips > 100) %>% 
  arrange(desc(avg_taxi_fare_puzone)) 

We can see that, among the top 20 zones by average taxi fare, fifteen are located within Queens, where passengers are likely to take an expensive taxi trip; these include the JFK and LaGuardia airport zones, both in Queens. Manhattan and Brooklyn are the two other boroughs associated with high average taxi fares.

head(puZone_summary,20)
## # A tibble: 20 x 5
## # Groups:   puBorough [3]
##    puBorough puZone          avg_taxi_fare_pu… median_taxi_fare_… num_trips
##    <fct>     <fct>                       <dbl>              <dbl>     <int>
##  1 Queens    Flushing Meado…              63.9               70.3       233
##  2 Queens    Baisley Park                 61.1               67.3       305
##  3 Queens    South Jamaica                60.5               66.4       139
##  4 Queens    JFK Airport                  55.3               61.4     29116
##  5 Queens    South Ozone Pa…              52.5               61.4       112
##  6 Queens    Jamaica                      50.4               61.4       148
##  7 Queens    Briarwood/Jama…              47.3               61.4       119
##  8 Queens    LaGuardia Airp…              41.8               43.1     24123
##  9 Brooklyn  East New York                32.9               25.0       124
## 10 Queens    Long Island Ci…              31.5               19.3       805
## 11 Queens    Rego Park                    31.0               22.4       102
## 12 Queens    Steinway                     27.2               18.3       478
## 13 Queens    Elmhurst                     25.5               11.8       268
## 14 Queens    East Elmhurst                24.3               21.6       136
## 15 Manhattan Financial Dist…              23.9               22.3      3600
## 16 Queens    Forest Hills                 23.8               18.3       143
## 17 Manhattan Financial Dist…              22.9               20.8      6873
## 18 Manhattan Battery Park                 22.6               22.3       469
## 19 Manhattan World Trade Ce…              22.5               19.8      5456
## 20 Queens    Sunnyside                    22.4               16.6      1301

Taxi fares by different drop-off boroughs

Plot density ridges to compare the taxi fare distributions across drop-off boroughs. Since a taxi driver cannot know a passenger's destination until pick-up, this information is less useful for helping drivers boost their earnings.

do_median <- nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  group_by(doBorough) %>% 
  summarize(do_median = median(total_amount))
do_median
## # A tibble: 6 x 2
##   doBorough     do_median
##   <fct>             <dbl>
## 1 Bronx              34.1
## 2 Brooklyn           29.8
## 3 EWR                96.8
## 4 Manhattan          13.5
## 5 Queens             34.8
## 6 Staten Island      83.3
nyc_taxi %>%  
  filter(total_amount < total_amount_99) %>% 
  ggplot(aes(x =total_amount, y = doBorough, fill = doBorough)) +
  geom_density_ridges(scale = 1, quantile_lines = TRUE, quantiles = 2) +
  geom_text(data = do_median, aes(x= do_median, y= doBorough, label= do_median), 
            position=position_nudge(y=-0.1), colour="black", size=3.5) +
  ggtitle("Density of Taxi Fares by Different Drop-off Boroughs") +
  theme(plot.title = element_text(size = 12, hjust = 0.5))
## Picking joint bandwidth of 2.4

Distribution of Taxi fares between airport taxi trips and non-airport trips

Plot the distributions of taxi fares for airport versus non-airport trips. Both the density plot and the boxplot show that taxi rides to or from an airport cost more than non-airport trips, which makes sense as the distance to an airport is usually longer than a within-city trip. With this information, I would suggest that taxi drivers take more airport trips, for example by stopping for passengers carrying suitcases.

nyc_taxi %>%  
  filter(total_amount < total_amount_99) %>% 
  ggplot(aes(x = total_amount, color = airports_trip)) +
  geom_density() +
  ggtitle("Density of Taxi Fares by Airport-trips or not") +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  ggplot() +
  geom_boxplot(aes(x=airports_trip, y=total_amount, fill = airports_trip)) +
  ggtitle("Boxplot of total taxi fare by airport-trip or not ")+
  theme(plot.title = element_text(size = 12, hjust = 0.5))

Taxi Fares by Hours of the Day

Plot the density of taxi fares by pick-up hour. It looks like there is minimal difference in taxi fares between different hours of the day.

nyc_taxi %>% 
  filter( total_amount < total_amount_99 ) %>% 
  ggplot(aes(x=total_amount, color = pk_hour), alpha = 0.5) +
  geom_density()+
  ggtitle("Density of Taxi Fare by Pick-up Time") +
  theme(plot.title = element_text(size = 12, hjust = 0.5)) 

I computed the average taxi fare by hour of the day, sorted it in descending order, and listed the top 10. Two things stand out from the table:
  • The best hours for taxi drivers to work are probably these 10, which group into three time slots: 4am to 6am (early morning), 2pm to 6pm (afternoon rush), and 9pm to 1am (late evening).
  • Sorting these top 10 hours by number of trips, the early-morning slot (4am to 6am) has the highest average fare but the lowest demand.

pkh <- nyc_taxi %>% 
  group_by(pk_hour) %>% 
  summarize(avg_pkh = round(mean(total_amount), 2), num_trips = n())

pkh_top10 <- pkh  %>% 
  arrange(desc(avg_pkh)) %>% 
  top_n(wt = avg_pkh, n=10)

pkh_top10 %>% 
  arrange(desc(num_trips))
## # A tibble: 10 x 3
##    pk_hour avg_pkh num_trips
##    <fct>     <dbl>     <int>
##  1 17         18.9     57051
##  2 15         18.4     53556
##  3 21         18.4     53081
##  4 14         18.6     52623
##  5 16         19.6     49905
##  6 22         18.6     49153
##  7 23         19.0     37898
##  8 0          18.8     26689
##  9 5          21.5      8477
## 10 4          20.3      7115

Given the observations above, I created a new variable called rush_hour that groups the 24 hours of the day into four time slots: Early Morning (4am-6am), Mid_day (1pm-6pm), Midnight (10pm-1am), and Other Hours (any other time). This is a feature engineering technique known as discretization: converting a continuous variable into a categorical one by creating meaningful classes.

nyc_taxi <- nyc_taxi %>% 
  mutate(rush_hour =as.factor(ifelse(pk_hour %in% c(22,23,0), "Midnight(10pm-1am)", 
                                     ifelse(pk_hour %in% c(4,5), "Early Morning(4am-6am)", 
                                            ifelse(pk_hour %in% c(13,14,15,16,17), "Mid_day(1pm-6pm)" , "Other Hours")))))

rush_hr_median <- nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  group_by(rush_hour) %>% 
  summarize(median_rush_hr = round(median(total_amount), 2), avg_rush_hr = round(mean(total_amount), 2), num_trips = n())
rush_hr_median
## # A tibble: 4 x 4
##   rush_hour              median_rush_hr avg_rush_hr num_trips
##   <fct>                           <dbl>       <dbl>     <int>
## 1 Early Morning(4am-6am)           13.8        20.9     15589
## 2 Mid_day(1pm-6pm)                 14.1        18.7    263015
## 3 Midnight(10pm-1am)               14.8        18.8    113729
## 4 Other Hours                      13.8        17.6    559201
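The nested ifelse() above can be written more readably with dplyr::case_when(); this sketch uses an illustrative column name rush_hour2 so the original rush_hour column is left untouched:

```r
# Equivalent binning with case_when(); conditions are evaluated in order
nyc_taxi %>%
  mutate(rush_hour2 = factor(case_when(
    pk_hour %in% c(22, 23, 0)          ~ "Midnight(10pm-1am)",
    pk_hour %in% c(4, 5)               ~ "Early Morning(4am-6am)",
    pk_hour %in% c(13, 14, 15, 16, 17) ~ "Mid_day(1pm-6pm)",
    TRUE                               ~ "Other Hours"
  ))) %>%
  count(rush_hour2)
```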

Taxi fares by 4 pickup timeframe during the day

Plot boxplots of taxi fares by the four time frames. The graph shows that taxi drivers gain some advantage by working during the three special time frames (early morning, mid-day, and midnight) compared with other hours.

nyc_taxi %>%
  filter( total_amount < total_amount_99 ) %>%
  ggplot(aes(y=total_amount, x= rush_hour, color = rush_hour)) +
  geom_boxplot()+
  stat_summary(fun.y=mean, colour="darkblue", geom="point", 
               shape=18, size=3, show.legend =FALSE) + 
  geom_text(data = rush_hr_median, aes(y= avg_rush_hr  , x= rush_hour, label= avg_rush_hr  ),
            position=position_nudge(y = 3), colour="black", size=3.5) +
  ggtitle("Boxplot of Taxi Fares by Pick-up Time Frame") +
  theme(plot.title = element_text(size = 12, hjust = 0.5)) 

Taxi fares by days in a week

Let’s look at taxi fares by day of the week. According to this plot, Thursday, Friday, and Monday are the best days to work, as they rank highest in both median and mean taxi fares, while Saturday has the lowest mean fare of the week.

pk_wday_median <- nyc_taxi %>% 
  filter(total_amount < total_amount_99) %>% 
  group_by(pk_wday) %>% 
  summarize(median_wday = round(median(total_amount), 2),avg_wday = round(mean(total_amount), 2), num_trips = n())
pk_wday_median
## # A tibble: 7 x 4
##   pk_wday median_wday avg_wday num_trips
##   <ord>         <dbl>    <dbl>     <int>
## 1 Sun            13.3     17.9    107372
## 2 Mon            13.8     18.4    128949
## 3 Tue            14.2     18.2    156613
## 4 Wed            14.2     18.3    152927
## 5 Thu            14.3     18.4    150020
## 6 Fri            14.2     18.3    131617
## 7 Sat            13.6     17.1    124036
nyc_taxi %>%  
  filter(total_amount < total_amount_99) %>% 
  ggplot(aes(y =total_amount, x = pk_wday, color = pk_wday)) +
  geom_boxplot() +
  geom_text(data = pk_wday_median, aes(y= avg_wday, x= pk_wday, label= avg_wday), 
            position=position_nudge(y= -3), colour="black", size=3.5) +
  stat_summary(fun = mean, colour="darkblue", geom="point", 
               shape=18, size=3, show.legend =FALSE) + 
  ggtitle("Boxplots & Mean of Taxi Fares by pick-up day of the week") +
  theme(plot.title = element_text(size = 12, hjust = 0.5)) 

Taxi fares by number of passengers

I also want to see whether passenger count affects taxi fares. It seems there is little difference in fares between rides with different numbers of passengers.

nyc_taxi %>% 
  filter(passenger_count %in% c(1:6)) %>% 
  filter(total_amount <= total_amount_99) %>% 
  group_by(passenger_count) %>% 
  summarise(mean= mean(total_amount)) %>% 
  ggplot(aes(y = mean, x=passenger_count, fill = passenger_count)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Taxi Fares by Number of Passengers") +
  theme(plot.title = element_text(size = 12, hjust = 0.5)) 

Dataset preparation

Let’s get ready for some modeling!

Let’s check the data types

str(nyc_taxi)
## 'data.frame':    951630 obs. of  17 variables:
##  $ tpep_pickup_datetime : Factor w/ 943663 levels "2001-02-02 14:55:07",..: 897115 416959 661190 644670 524670 807291 194142 549687 13304 294695 ...
##  $ tpep_dropoff_datetime: Factor w/ 943193 levels "2001-02-02 15:07:27",..: 896812 417011 660883 644391 524449 807066 194056 549478 13253 294636 ...
##  $ passenger_count      : Factor w/ 8 levels "1","2","3","4",..: 3 2 1 1 1 1 2 1 1 1 ...
##  $ trip_distance        : num  2.59 17.9 2.92 0.78 4.3 2.07 1.13 0.77 1.78 1.41 ...
##  $ PULocationID         : int  249 132 161 100 50 161 263 170 148 237 ...
##  $ DOLocationID         : int  163 230 211 164 12 137 140 229 107 236 ...
##  $ payment_type         : int  1 2 2 1 1 1 2 2 2 1 ...
##  $ total_amount         : num  31.6 61.4 21.8 17.2 20.3 ...
##  $ puBorough            : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 5 4 4 4 4 4 4 4 4 ...
##  $ puZone               : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 247 128 157 97 48 157 261 166 144 235 ...
##  $ doBorough            : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ doZone               : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 159 228 209 160 10 133 136 227 102 234 ...
##  $ pk_hour              : Factor w/ 24 levels "0","1","2","3",..: 23 21 17 14 8 16 10 12 7 23 ...
##  $ pk_wday              : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 6 7 5 3 7 2 6 4 5 6 ...
##  $ pk_month             : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 10 4 7 7 7 10 1 7 1 4 ...
##  $ airports_trip        : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ rush_hour            : Factor w/ 4 levels "Early Morning(3am-5am)",..: 3 4 2 2 4 2 4 4 4 3 ...

Note that pk_wday and pk_month are ordered factor variables. I converted them to unordered factors because there is no actual ranking among these classes.

nyc_taxi$pk_wday <- factor(nyc_taxi$pk_wday, ordered = FALSE )
nyc_taxi$pk_month <- factor(nyc_taxi$pk_month, ordered = FALSE)

Standard-scale the numerical variable trip_distance

nyc_taxi$trip_distance_scaled <- scale(nyc_taxi$trip_distance, center = TRUE, scale = TRUE)
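A base-R detail worth knowing here (not specific to this dataset): `scale()` returns a one-column matrix carrying centering/scaling attributes, not a plain numeric vector. A minimal sketch:

```r
# scale() returns a matrix with "scaled:center" and "scaled:scale" attributes
x <- c(1, 2, 3, 4, 5)
x_scaled <- scale(x, center = TRUE, scale = TRUE)
is.matrix(x_scaled)                   # TRUE
attr(x_scaled, "scaled:center")       # 3 (the mean of x)

# as.numeric() drops the matrix wrapper if a plain column is preferred
x_plain <- as.numeric(x_scaled)
round(mean(x_plain), 10)              # 0
round(sd(x_plain), 10)                # 1
```

Keeping the stored center and scale attributes is useful if the same transformation must later be applied to new data.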

Downsample the dataset to 120,000 rows for modeling

set.seed(123)
index_2 <- sample(1:nrow(nyc_taxi), 120000)
df_model <- nyc_taxi[index_2 , c("passenger_count", "trip_distance_scaled", "total_amount", "puBorough","doBorough", "pk_hour", "pk_wday", "airports_trip", "rush_hour") ]

# save the dataset
write.csv(df_model, file = "nyctaxi_df_for_model.csv")

Train-test split - create train and test sets

Split the dataset into training and test sets

# Re-load data
df <- read.csv(file = "nyctaxi_df_for_model.csv")
# 80/20 split for training and test
set.seed(12)
shuffle <- sample(nrow(df))
split <- round(nrow(df) * 0.8)
shuffled <- df[shuffle, ]
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(df), ] 

Further split the training set to create a validation set

# 80/20 split for training and validation
set.seed(12)
shuffle <- sample(nrow(train))
split <- round(nrow(train) * 0.8)
shuffled <- train[shuffle, ]
train_set <- shuffled[1 : split, ]
val <- shuffled[(split + 1) : nrow(train), ] 
head(train)
##             X passenger_count trip_distance_scaled total_amount puBorough
## 119130 329956               3           -0.6240004         7.80 Manhattan
## 83191  174456               1            0.2552267        24.80 Manhattan
## 2222    19349               1           -0.3229317        11.80 Manhattan
## 42053  714597               1           -0.5041058        20.76 Manhattan
## 66082  374532               1           -0.2776382        12.95 Manhattan
## 118875 453929               1           -0.2216874        13.57 Manhattan
##        doBorough pk_hour pk_wday airports_trip              rush_hour
## 119130 Manhattan      10     Sat         FALSE            Other Hours
## 83191     Queens       7     Tue         FALSE            Other Hours
## 2222   Manhattan      20     Sun         FALSE            Other Hours
## 42053  Manhattan      21     Sat         FALSE            Other Hours
## 66082  Manhattan       5     Tue         FALSE Early Morning(3am-5am)
## 118875 Manhattan       6     Fri         FALSE            Other Hours
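Since the shuffle-and-slice logic appears twice, it could be wrapped in a small helper (a base-R sketch; `split_frame` is a hypothetical name, not part of the original analysis):

```r
# Return a reproducible shuffled train/holdout split of a data frame
split_frame <- function(df, frac = 0.8, seed = 12) {
  set.seed(seed)
  idx <- sample(nrow(df))
  cut <- round(nrow(df) * frac)
  list(train   = df[idx[1:cut], , drop = FALSE],
       holdout = df[idx[(cut + 1):nrow(df)], , drop = FALSE])
}

s <- split_frame(data.frame(x = 1:10), frac = 0.8, seed = 1)
nrow(s$train)    # 8
nrow(s$holdout)  # 2
```

Shuffling the index vector once and slicing it guarantees the two pieces are disjoint and together cover every row.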

Model Training and Evaluation

Baseline - Linear regression with all variables

#train linear model 
lm_model <- lm(total_amount ~ passenger_count + trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
summary(lm_model)
## 
## Call:
## lm(formula = total_amount ~ passenger_count + trip_distance_scaled + 
##     puBorough + doBorough + pk_wday + airports_trip + rush_hour, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.265  -2.327  -0.401   1.684 159.681 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 10.773211   0.464660  23.185  < 2e-16 ***
## passenger_count              0.007267   0.014003   0.519 0.603772    
## trip_distance_scaled        12.270246   0.027280 449.793  < 2e-16 ***
## puBoroughBrooklyn            0.519726   0.483835   1.074 0.282745    
## puBoroughManhattan           2.210184   0.453001   4.879 1.07e-06 ***
## puBoroughQueens              0.694887   0.466174   1.491 0.136066    
## puBoroughStaten Island     -14.344327   4.817263  -2.978 0.002905 ** 
## doBoroughBrooklyn            2.971156   0.229195  12.963  < 2e-16 ***
## doBoroughEWR                32.333726   0.470098  68.781  < 2e-16 ***
## doBoroughManhattan           3.574108   0.216343  16.521  < 2e-16 ***
## doBoroughQueens              2.578734   0.230938  11.166  < 2e-16 ***
## doBoroughStaten Island       8.649402   1.153049   7.501 6.39e-14 ***
## pk_wdayMon                  -0.410407   0.064384  -6.374 1.85e-10 ***
## pk_wdaySat                  -1.005385   0.065084 -15.447  < 2e-16 ***
## pk_wdaySun                  -1.429337   0.067351 -21.222  < 2e-16 ***
## pk_wdayThu                   0.224001   0.062076   3.609 0.000308 ***
## pk_wdayTue                  -0.155243   0.061407  -2.528 0.011470 *  
## pk_wdayWed                   0.072874   0.062021   1.175 0.240002    
## airports_tripTRUE            2.090534   0.139266  15.011  < 2e-16 ***
## rush_hourMid_day(1pm-6pm)    2.422515   0.136903  17.695  < 2e-16 ***
## rush_hourMidnight(9pm-2am)   1.271482   0.141682   8.974  < 2e-16 ***
## rush_hourOther Hours         1.722493   0.134877  12.771  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.66 on 76778 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8797 
## F-statistic: 2.675e+04 on 21 and 76778 DF,  p-value: < 2.2e-16

What’s the model summary telling us? * The Adjusted R-squared of the linear regression model is 0.8797, which means the model (the independent variables) explains about 88% of the variance in the data (the target variable). * The p-value on the F-statistic is far below 5%, indicating this linear model fits the dataset. * Most of the coefficients are significant (using a 5% threshold) except for passenger count, which has a p-value higher than 5%.

Let’s check the goodness of fit of the linear model by predicting on the validation set and computing the RMSE and MAPE.

# predict on the validation set
pred_val_lm <- predict(lm_model, val)

# compute RMSE and MAPE of the prediction on validation set
MAPE_val_lm <- MAPE(pred_val_lm , val$total_amount)
RMSE_val_lm <- sqrt(MSE(pred_val_lm , val$total_amount))
MAPE_val_lm
## [1] 0.185982
RMSE_val_lm
## [1] 4.725017
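For reference, the two metrics are simple to compute by hand; this base-R sketch mirrors the definitions used by `MLmetrics::MAPE()` and `MLmetrics::MSE()` (note MAPE is undefined when an actual value is zero):

```r
# Root mean squared error and mean absolute percentage error
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
mape <- function(pred, actual) mean(abs((actual - pred) / actual))

pred   <- c(10, 20, 30)
actual <- c(12, 18, 30)
rmse(pred, actual)   # sqrt((4 + 4 + 0) / 3) = 1.632993
mape(pred, actual)   # (2/12 + 2/18 + 0) / 3  = 0.09259259
```

RMSE penalizes large errors in the fare's own units (dollars), while MAPE reports a scale-free percentage, which is why both are tracked across the models below.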

Based on the first model, linear regression with all independent variables, not all the variables are significant. I will apply stepwise selection to the linear model to see whether any features can be dropped.

Run the step model to see if any of the variables could be dropped. Again, passenger count is the least important predictor here, as dropping it leads to a very small decrease in AIC compared to the full model.

step.model <- step(lm_model, direction = "both")
## Start:  AIC=236422.5
## total_amount ~ passenger_count + trip_distance_scaled + puBorough + 
##     doBorough + pk_wday + airports_trip + rush_hour
## 
##                        Df Sum of Sq     RSS    AIC
## - passenger_count       1         6 1667454 236421
## <none>                              1667448 236423
## - airports_trip         1      4894 1672342 236646
## - puBorough             4      6365 1673814 236707
## - rush_hour             3     14747 1682196 237093
## - pk_wday               6     22417 1689866 237436
## - doBorough             5    116744 1784193 241610
## - trip_distance_scaled  1   4393812 6061260 335540
## 
## Step:  AIC=236420.8
## total_amount ~ trip_distance_scaled + puBorough + doBorough + 
##     pk_wday + airports_trip + rush_hour
## 
##                        Df Sum of Sq     RSS    AIC
## <none>                              1667454 236421
## + passenger_count       1         6 1667448 236423
## - airports_trip         1      4896 1672351 236644
## - puBorough             4      6369 1673823 236706
## - rush_hour             3     14748 1682202 237091
## - pk_wday               6     22415 1689869 237434
## - doBorough             5    116755 1784209 241608
## - trip_distance_scaled  1   4394075 6061530 335542
summary(step.model)
## 
## Call:
## lm(formula = total_amount ~ trip_distance_scaled + puBorough + 
##     doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.263  -2.328  -0.401   1.685 159.677 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 10.78403    0.46419  23.232  < 2e-16 ***
## trip_distance_scaled        12.27034    0.02728 449.809  < 2e-16 ***
## puBoroughBrooklyn            0.51960    0.48383   1.074 0.282857    
## puBoroughManhattan           2.21037    0.45300   4.879 1.07e-06 ***
## puBoroughQueens              0.69461    0.46617   1.490 0.136219    
## puBoroughStaten Island     -14.35272    4.81721  -2.979 0.002888 ** 
## doBoroughBrooklyn            2.97107    0.22919  12.963  < 2e-16 ***
## doBoroughEWR                32.33440    0.47009  68.783  < 2e-16 ***
## doBoroughManhattan           3.57405    0.21634  16.520  < 2e-16 ***
## doBoroughQueens              2.57843    0.23094  11.165  < 2e-16 ***
## doBoroughStaten Island       8.65367    1.15301   7.505 6.20e-14 ***
## pk_wdayMon                  -0.41046    0.06438  -6.375 1.84e-10 ***
## pk_wdaySat                  -1.00486    0.06508 -15.441  < 2e-16 ***
## pk_wdaySun                  -1.42895    0.06735 -21.218  < 2e-16 ***
## pk_wdayThu                   0.22381    0.06207   3.605 0.000312 ***
## pk_wdayTue                  -0.15555    0.06140  -2.533 0.011301 *  
## pk_wdayWed                   0.07267    0.06202   1.172 0.241328    
## airports_tripTRUE            2.09104    0.13926  15.015  < 2e-16 ***
## rush_hourMid_day(1pm-6pm)    2.42315    0.13690  17.700  < 2e-16 ***
## rush_hourMidnight(9pm-2am)   1.27243    0.14167   8.982  < 2e-16 ***
## rush_hourOther Hours         1.72305    0.13487  12.775  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.66 on 76779 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8797 
## F-statistic: 2.809e+04 on 20 and 76779 DF,  p-value: < 2.2e-16

Second model: Linear Regression without passenger count

Train the linear regression without the passenger count variable. Note that this modification improves the model slightly, yielding a marginally smaller RMSE.

lm_model_v2 <- lm(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
summary(lm_model_v2)
## 
## Call:
## lm(formula = total_amount ~ trip_distance_scaled + puBorough + 
##     doBorough + pk_wday + airports_trip + rush_hour, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.263  -2.328  -0.401   1.685 159.677 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 10.78403    0.46419  23.232  < 2e-16 ***
## trip_distance_scaled        12.27034    0.02728 449.809  < 2e-16 ***
## puBoroughBrooklyn            0.51960    0.48383   1.074 0.282857    
## puBoroughManhattan           2.21037    0.45300   4.879 1.07e-06 ***
## puBoroughQueens              0.69461    0.46617   1.490 0.136219    
## puBoroughStaten Island     -14.35272    4.81721  -2.979 0.002888 ** 
## doBoroughBrooklyn            2.97107    0.22919  12.963  < 2e-16 ***
## doBoroughEWR                32.33440    0.47009  68.783  < 2e-16 ***
## doBoroughManhattan           3.57405    0.21634  16.520  < 2e-16 ***
## doBoroughQueens              2.57843    0.23094  11.165  < 2e-16 ***
## doBoroughStaten Island       8.65367    1.15301   7.505 6.20e-14 ***
## pk_wdayMon                  -0.41046    0.06438  -6.375 1.84e-10 ***
## pk_wdaySat                  -1.00486    0.06508 -15.441  < 2e-16 ***
## pk_wdaySun                  -1.42895    0.06735 -21.218  < 2e-16 ***
## pk_wdayThu                   0.22381    0.06207   3.605 0.000312 ***
## pk_wdayTue                  -0.15555    0.06140  -2.533 0.011301 *  
## pk_wdayWed                   0.07267    0.06202   1.172 0.241328    
## airports_tripTRUE            2.09104    0.13926  15.015  < 2e-16 ***
## rush_hourMid_day(1pm-6pm)    2.42315    0.13690  17.700  < 2e-16 ***
## rush_hourMidnight(9pm-2am)   1.27243    0.14167   8.982  < 2e-16 ***
## rush_hourOther Hours         1.72305    0.13487  12.775  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.66 on 76779 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8797 
## F-statistic: 2.809e+04 on 20 and 76779 DF,  p-value: < 2.2e-16

Check goodness of fit of second model

# predict on the validation set
pred_val_lm_2 <- predict(lm_model_v2, val)

# compute RMSE and MAPE of the prediction on validation set
MAPE_val_lm_2 <- MAPE(pred_val_lm_2 , val$total_amount)
RMSE_val_lm_2 <- sqrt(MSE(pred_val_lm_2 , val$total_amount))
MAPE_val_lm_2
## [1] 0.1859735
RMSE_val_lm_2
## [1] 4.724935

Third model - Decision tree using rpart

# decision tree using rpart
tree <- rpart(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
              data = train_set, method = "anova", control=rpart.control(minsplit=100) )

The decision tree model used only two variables, trip distance and drop-off borough, to split the data, while trip distance and airport trip rank top two in variable importance.

fancyRpartPlot(tree)

Check feature importance

printcp(tree)
## 
## Regression tree:
## rpart(formula = total_amount ~ trip_distance_scaled + puBorough + 
##     doBorough + pk_wday + airports_trip + rush_hour, data = train_set, 
##     method = "anova", control = rpart.control(minsplit = 100))
## 
## Variables actually used in tree construction:
## [1] doBorough            trip_distance_scaled
## 
## Root node error: 13867605/76800 = 180.57
## 
## n= 76800 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.649825      0   1.00000 1.00004 0.0123804
## 2 0.098677      1   0.35018 0.35069 0.0051532
## 3 0.068593      2   0.25150 0.25239 0.0048994
## 4 0.016607      3   0.18291 0.18417 0.0041534
## 5 0.015781      4   0.16630 0.16582 0.0041885
## 6 0.010709      5   0.15052 0.15179 0.0041152
## 7 0.010000      6   0.13981 0.14361 0.0038744
tree$variable.importance
## trip_distance_scaled        airports_trip            puBorough 
##             11785705              5528916              3218144 
##            doBorough 
##              1055740
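The `cptable` above also supports cost-complexity pruning: pick the `CP` value with the smallest cross-validated error (`xerror`) and prune back to it. A sketch of that standard rpart workflow on a built-in dataset, since the taxi data is not reproduced in this chunk:

```r
library(rpart)

# Fit a regression tree, then prune at the cp minimizing cross-validated error
fit     <- rpart(Sepal.Length ~ ., data = iris, method = "anova")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# The pruned tree never has more splits than the original
max(pruned$cptable[, "nsplit"]) <= max(fit$cptable[, "nsplit"])  # TRUE
```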

While the decision tree model is easier to interpret, its predictions of taxi fares on the validation set are not as good as the linear regression model’s, as the decision tree has a higher RMSE.

# predict on the validation set
pred_val_dt <- predict(tree, val)

# compute RMSE and MAPE of the prediction on validation set
MAPE_val_dt <- MAPE(pred_val_dt , val$total_amount)
RMSE_val_dt <- sqrt(MSE(pred_val_dt , val$total_amount))
MAPE_val_dt
## [1] 0.2029875
RMSE_val_dt
## [1] 5.11615

The 4th model: random forest

The random forest model shows that trip distance is by far the most important variable, followed by rush hour and the pick-up and drop-off boroughs.

set.seed(123)
rf_model <- randomForest(total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
                           data = train_set, importance = TRUE)
saveRDS(rf_model, "nyctaxi_model_rf.rds")

#rf_model <- readRDS("~/Folders/R/NYC_Taxi_Project/nyctaxi_model_rf.rds")

Model summary

print(rf_model)
## 
## Call:
##  randomForest(formula = total_amount ~ trip_distance_scaled +      puBorough + doBorough + pk_wday + airports_trip + rush_hour,      data = train_set, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 18.96557
##                     % Var explained: 89.5
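As a sanity check (an inferred relationship, stated with hedging): `randomForest` reports "% Var explained" as roughly 1 − OOB MSE / Var(y). Using the printed OOB MSE and the training-set variance of `total_amount` (≈180.57, the root node error from the rpart output above):

```r
mse_oob <- 18.96557   # "Mean of squared residuals" from print(rf_model)
var_y   <- 180.57     # variance of total_amount in train_set (rpart root node error)

round(100 * (1 - mse_oob / var_y), 1)   # 89.5, matching "% Var explained"
```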

Check feature importance

varImp(rf_model)
##                       Overall
## trip_distance_scaled 67.57463
## puBorough            27.87959
## doBorough            26.74395
## pk_wday              19.24118
## airports_trip        19.45401
## rush_hour            29.56983
varImpPlot(rf_model,type=1)

Take a look at how well the random forest model predicts on the validation set. The RMSE of the random forest model is lower than that of the decision tree model.

# predict on the validation set
pred_val_rf <- predict(rf_model, val)

# compute RMSE and MAPE of the prediction on validation set
MAPE_val_rf <- MAPE(pred_val_rf , val$total_amount)
RMSE_val_rf <- sqrt(MSE(pred_val_rf , val$total_amount))
MAPE_val_rf
## [1] 0.1876614
RMSE_val_rf
## [1] 4.421974

The 5th model : random forest with cross-validation

rf_cv <- train(
  total_amount ~ trip_distance_scaled + puBorough + doBorough + pk_wday + airports_trip + rush_hour,
  train_set, method = "ranger", 
  trControl = trainControl( method = "cv", number = 5, verboseIter = FALSE))
## Growing trees.. Progress: 84%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 86%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 82%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 83%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 9 seconds.
## Growing trees.. Progress: 84%. Estimated remaining time: 5 seconds.
## Growing trees.. Progress: 83%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 75%. Estimated remaining time: 10 seconds.
## Growing trees.. Progress: 81%. Estimated remaining time: 7 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 9 seconds.
saveRDS(rf_cv, "rf_cv_model.rds")
#cv_model <- readRDS("~/Folders/R/NYC_Taxi_Project/model_cv.rds")

Check model summary

print(rf_cv)
## Random Forest 
## 
## 76800 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 61441, 61440, 61441, 61440, 61438 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared   MAE     
##    2    variance    6.843012  0.7849198  4.773034
##    2    extratrees  7.610567  0.7255375  5.386164
##   11    variance    4.278346  0.8986082  2.668861
##   11    extratrees  4.271784  0.8992138  2.676828
##   21    variance    4.466950  0.8897395  2.801595
##   21    extratrees  4.411736  0.8923715  2.774423
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 11, splitrule =
##  extratrees and min.node.size = 5.
print(rf_cv$finalModel)
## Ranger result
## 
## Call:
##  ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      76800 
## Number of independent variables:  21 
## Mtry:                             11 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        extratrees 
## OOB prediction error (MSE):       18.14932 
## R squared (OOB):                  0.8994888
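caret picked the `mtry` grid {2, 11, 21} automatically (21 being the number of dummy-expanded predictors). To search around the winning `mtry = 11` ourselves, a custom grid could be passed via `tuneGrid` (a sketch; only the grid is built here, since retraining is slow):

```r
# Grid of ranger tuning parameters for caret::train(..., tuneGrid = tune_grid)
tune_grid <- expand.grid(mtry          = c(8, 11, 14),
                         splitrule     = c("variance", "extratrees"),
                         min.node.size = 5)
nrow(tune_grid)   # 6 parameter combinations to evaluate
```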

Use the cross-validated random forest model to predict on the validation set and estimate the accuracy of the predictions. The random forest with cross-validation produces the lowest RMSE, so it is the best of all five models.

# predict on the validation set
pred_val_rfcv <- predict(rf_cv, val)

# compute RMSE and MAPE of the prediction on validation set
MAPE_val_rfcv <- MAPE(pred_val_rfcv , val$total_amount)
RMSE_val_rfcv <- sqrt(MSE(pred_val_rfcv , val$total_amount))
MAPE_val_rfcv
## [1] 0.1760098
RMSE_val_rfcv
## [1] 4.338317

A summary of the validation errors for all five models. Note that the random forest with cross-validation has the lowest RMSE and MAPE, making it the best model so far.

modelname <- c("linear model","linear modelv2", "decision tree","random forest","random forest with cv")
modelRMSE <- c(RMSE_val_lm,  RMSE_val_lm_2, RMSE_val_dt , RMSE_val_rf , RMSE_val_rfcv)
modelMAPE <- c(MAPE_val_lm,  MAPE_val_lm_2, MAPE_val_dt , MAPE_val_rf , MAPE_val_rfcv)
val_error <- as.data.frame( cbind(modelname,modelRMSE , modelMAPE)  )
colnames(val_error) <- c("Models", "RMSE","MAPE")
val_error
##                  Models             RMSE              MAPE
## 1          linear model 4.72501668693151 0.185981961819124
## 2        linear modelv2 4.72493547559144 0.185973510098375
## 3         decision tree 5.11614991470542 0.202987507884024
## 4         random forest 4.42197421988426 0.187661375033456
## 5 random forest with cv 4.33831742298959 0.176009821749615
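Incidentally, `cbind()` on a character vector plus numeric vectors coerces everything to character, which is why the RMSE/MAPE values print as long strings above. Building the table with `data.frame()` keeps the metrics numeric (shown with placeholder values from the output above):

```r
modelname <- c("linear model", "random forest with cv")   # placeholder subset
modelRMSE <- c(4.725017, 4.338317)
modelMAPE <- c(0.185982, 0.176010)

val_error <- data.frame(Models = modelname,
                        RMSE   = round(modelRMSE, 3),
                        MAPE   = round(modelMAPE, 3))
is.numeric(val_error$RMSE)   # TRUE, so the columns sort and format properly
```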

Prediction on test set

Apply the best model (the 5th model, random forest with cross-validation) to the test set

# predict on the test set
pred_test_rfcv <- predict(rf_cv, test)
# compute RMSE and MAPE of the prediction on the test set
MAPE_test_rfcv <- MAPE(pred_test_rfcv , test$total_amount)
RMSE_test_rfcv <- sqrt(MSE(pred_test_rfcv , test$total_amount))
MAPE_test_rfcv
## [1] 0.1744527
RMSE_test_rfcv
## [1] 4.152667

Conclusion

Through data exploration, visualization and modeling, I consistently found that trip distance is the most important factor in taxi fares, which makes sense since fares are calculated by mileage. Still, there are things taxi drivers can do to increase their earnings: pay attention to pick-up and drop-off locations, whether the trip is to or from an airport, the time block during the day, and the day of the week.