Household travel mode choice estimation with large-scale data—an empirical analysis based on mobility data in Milan

Abstract

Data analysis plays a key role in supporting the development of sustainable transportation. Using large-scale household mobility survey data collected in Milan, Italy, during 2005–2006, we study whether large-scale data improve the accuracy of household travel mode estimation. We apply three machine learning methods, namely the multinomial logit (MNL) model, random forest (RF) and support vector machine (SVM), to estimate the household travel mode. Under the full sample size, their accuracies are 70.41%, 71.89% and 72.74%, respectively. The accuracies of all three methods fluctuate sharply when the sample size is below 20,000 and gradually stabilize as the sample size increases. After stabilization, accuracy does not increase significantly with further increases in sample size. We also study the travel characteristics derived from the large-scale survey data, which are fundamental for developing a sustainable transportation system. The data comprise five explanatory variables, i.e., household size (HS), vehicle ownership, household income (HI), travel distance and travel time, and one response variable, household travel mode, which takes four values: public transport (PT), private car, combined use of PT and private car, and other travel modes (e.g., walking). We further investigate the importance of the explanatory variables for estimating household travel mode choice with the MNL model. Vehicle ownership is the most critical factor influencing household travel mode choice, followed by travel distance, travel time, HS and HI. This ranking is consistent with that obtained from the RF approach.
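As a minimal sketch of how an MNL model turns explanatory variables into a mode estimate, the snippet below computes choice probabilities as a softmax over linear utilities and picks the most probable mode. The function names, the feature names (`vehicles`, `distance_km`) and all coefficient values are hypothetical and chosen only for illustration; they are not the estimates obtained from the Milan data, and only two of the five explanatory variables are used to keep the example short.

```python
import math

# Mode labels following the four response categories named in the abstract.
MODES = ["PT", "private car", "PT + private car", "other"]


def mnl_probabilities(utilities):
    """Return MNL choice probabilities (softmax of the systematic utilities)."""
    u_max = max(utilities)  # subtract the maximum for numerical stability
    exp_u = [math.exp(u - u_max) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def linear_utilities(features, coefficients):
    """Utility per mode: intercept plus the sum of coefficient * feature value."""
    return [
        coef["intercept"]
        + sum(coef[name] * value for name, value in features.items())
        for coef in coefficients
    ]


# Illustrative coefficients only (assumption); NOT estimates from the survey.
# PT is treated as the reference alternative with all coefficients fixed at 0.
COEFFICIENTS = [
    {"intercept": 0.0, "vehicles": 0.0, "distance_km": 0.0},     # PT
    {"intercept": -1.0, "vehicles": 1.2, "distance_km": 0.05},   # private car
    {"intercept": -2.0, "vehicles": 0.6, "distance_km": 0.08},   # PT + private car
    {"intercept": 0.5, "vehicles": -0.3, "distance_km": -0.4},   # other
]

# A hypothetical household: two vehicles, 10 km trip.
household = {"vehicles": 2, "distance_km": 10.0}

probs = mnl_probabilities(linear_utilities(household, COEFFICIENTS))
predicted_mode = MODES[probs.index(max(probs))]
```

Estimated accuracy in this setting would be the share of households whose most probable mode matches the reported one, which is the quantity compared across sample sizes above.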