This paper presents machine learning-based approaches to the classification of historical traffic crashes in Kansas by severity, applied to a data set that combines highway geometry, weather, and road sensor data. The goal of this work is to identify relevant features using a variety of loss measures and feature selection algorithms. Feature selection is shown to facilitate the discovery of the sensors most relevant to the task of learning to predict severe crashes (those involving bodily injury). The key technical challenges are class imbalance (about 75% of crashes are non-severe) and a highly correlated, redundant set of features coalesced from multiple sources. The major novel contributions of this work are the development of a random oversampling strategy for data augmentation and the systematic application of multiple feature selection measures over a range of supervised inductive learning models and algorithms. The approach is evaluated on a data set of 277 initial ground features and 20,000 vehicle crashes collected over 9 years (2007–2015) by the Kansas Department of Transportation (KDOT). Positive results include models trained on 30 of the 277 features that achieve cross-validation precision and recall comparable to those obtained using the full feature set. These and other results point towards the potential use of the feature selection findings and the resulting models in planning future road construction.
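As a rough illustration of the two ideas named above, the following minimal sketch pairs random oversampling of the minority class with a simple filter-style feature ranking. It is not the authors' implementation: the function names `random_oversample` and `top_k_features`, the toy data, and the use of absolute Pearson correlation as the ranking score are all assumptions made for this example, and the paper itself applies several selection measures that are not reproduced here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def top_k_features(rows, labels, k):
    """Return the indices of the k features with the largest absolute
    Pearson correlation with a 0/1 label (an assumed scoring measure)."""
    n = len(rows)
    mean_y = sum(labels) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean_x = sum(col) / n
        var_x = sum((x - mean_x) ** 2 for x in col)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, labels))
        denom = (var_x * var_y) ** 0.5
        # Constant features carry no signal and receive a score of zero.
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (invented): 75% non-severe (0), 25% severe (1), three features.
rows = [
    [0.1, 1.0, 5], [0.2, 1.0, 3], [0.0, 1.0, 4], [0.1, 1.0, 6],
    [0.3, 1.0, 5], [0.2, 1.0, 4], [0.9, 1.0, 5], [0.8, 1.0, 4],
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
selected = top_k_features(balanced_rows, balanced_labels, k=1)
```

When a sketch like this is combined with cross-validation, the oversampling step would normally be applied to the training folds only, so that duplicated minority examples do not leak into the held-out fold.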