Data Analysis in Action: LA Traffic

Traffic is an issue familiar to pretty much everyone, and to Los Angeles residents especially. Hi, my name is Jai, and as a Los Angeles resident for 7 years I’ve sat in more than my fair share of gridlock.

You know the scene: a dead-stopped line of cars, seemingly regardless of time of day or day of week. So when I stumbled upon a traffic collision data set maintained by the city of Los Angeles, I put my data analytics training into action.

This data is cool for several reasons. It doesn’t directly measure traffic, but it measures a closely related proxy: it’s not a stretch to hypothesize that more traffic correlates with more collisions, which in turn cause more traffic. Data sets like this one can also be used to create safer and more efficient communities for everyone, Los Angeles included.

What a Client Might See

Just below I’ve included a slide presentation. When we do a discovery round for our clients we typically present our findings at a meeting alongside a deck like the one below. Additionally we give our clients a large booklet that contains high level findings and the nitty gritty of how we came to these findings so they can replicate our research. 

While this was not a commissioned study, I wanted to include the slide presentation so you would get a feel for our style. This slide deck will also provide an idea of the quality of work you should expect from your company’s internal data analysts or any external consultants you might hire.

However, if you’re just here because you want to know more about LA traffic or what the work of a data analyst looks like, please feel free to scroll past it for the nitty gritty details.

Defining the Scope

Using the data set Traffic Collision Data from 2010 to Present I settled on 3 overarching questions I would attempt to answer:

  1. Patterns by Time: How do traffic collision patterns vary by time of day, day of week, and time of year?
  2. Patterns by Geography: How are traffic collisions distributed geographically? Is it possible to identify high-risk intersections or areas?
  3. Prediction: Is it possible to predict the number of collisions in a given time frame?

It’s important to note that the data set used, Traffic Collision Data from 2010 to Present, is actively maintained by the city of Los Angeles and is available for anyone to use.

Before diving into the questions, let’s learn a bit more about this data set.

The Data: Warts and All

Firstly, the data begins in January 2010 and is updated weekly. At the time of my endeavor, I had data from January 2010 – July 2019, which ended up being ~500K rows of data. Each row corresponds to a collision. This data is transcribed from original paper traffic reports, so it’s very likely there are some errors. Below is a sample of some of the key fields:

Columns and rows of info from the City of Los Angeles tracking information about collision victims

The availability of these fields is what guided the key questions above.

Secondly, before starting analysis there was some cleaning to be done. For example, there are a few fields with only one value, reflecting the fact that all of the rows in this data correspond to traffic collisions. Additionally, there are multiple fields with the approximate street names of collisions (not shown above).

These text fields needed cleaning, specifically the removal of extra spaces. Similarly, in the image above we can see the latitude/longitude coordinates contained in a string. I extracted these coordinates into separate fields for later use.
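As a rough illustration of these two cleaning steps, here is a minimal Python sketch. The function names are made up for this example, and the coordinate format `(lat, lon)` is an assumption; the actual field in the city’s data may be laid out differently.

```python
import re

# Assumed raw format: "(34.0255, -118.3002)". The real field may differ.
COORD_PATTERN = re.compile(
    r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)


def clean_street(name):
    """Collapse runs of whitespace and trim both ends of a street name."""
    return " ".join(name.split())


def parse_coordinates(raw):
    """Return (latitude, longitude) as floats, or None if nothing matches."""
    match = COORD_PATTERN.search(raw)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(clean_street("  VENTURA      BL "))
print(parse_coordinates("(34.0255, -118.3002)"))
```

Returning `None` for strings that don’t match keeps bad rows visible, so they can be counted and excluded later rather than silently dropped.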

Next, I checked for null or missing values. ~16% of collisions (~78K) don’t have an associated victim age. There is also a small subset of collisions (~400) that do not have valid latitude/longitude coordinates and are excluded from the mapping section below.

An Analyst’s First Steps: Exploration

After cleaning, but before diving into the key questions above, I wanted to do some general data exploration. I started by plotting out a few interesting-looking variables.

A bar graph depicting the age of traffic accident victims, with notable spikes throughout the downward trend

What are some things you can pick out from this graph?

  • There are not many collision victims below age 15.
  • Most collision victims are in their 20s. The number of collision-victims per age generally decreases after age 30.
  • Note the spikes at most multiples of 5 (25, 30, 35, etc). This suggests some ages are estimated and that identification isn’t always used in collision reports.
  • 99 seems to be a catch-all age, perhaps because it’s the maximum age recorded.
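One way to put a number on the spikes at multiples of 5 is to compare the count at a given age with the average count at the two neighbouring ages. The sketch below is only an illustration: `heaping_ratio` is a hypothetical helper and the sample ages are made up, not taken from the collision data.

```python
from collections import Counter


def heaping_ratio(ages, age):
    """Ratio of the count at `age` to the mean count of its two neighbours.

    A ratio well above 1 suggests the age is over-reported (estimated).
    Returns None when neither neighbouring age appears in the data.
    """
    counts = Counter(ages)
    neighbours = (counts[age - 1] + counts[age + 1]) / 2
    if neighbours == 0:
        return None
    return counts[age] / neighbours


# Made-up ages, for illustration only.
sample_ages = [24] * 10 + [25] * 30 + [26] * 10 + [27] * 9
print(heaping_ratio(sample_ages, 25))
```

The same check on age 99 would need a different comparison, since there is no age 100 to compare against.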

From only one graph we should ask ourselves some questions:

    • How are collisions with multiple victims dealt with?
    • Is there something else going on with the age 99 bucket?


I emailed the data owner on 2019-08-30 and am still waiting to hear back, so I’ll speak to what should be done with this outlier data further down.

Exploration: Play with the Data

Usually I find this sort of analysis useful. Even though I don’t have a specific goal in mind, I often find interesting trends or insights. Next, let’s look at collisions by gender.

Bar graph showing the gender of those in traffic accidents. The male count is notably higher.

What are some things you can pick out from this graph?

  • “X” represents unknown gender.
  • Given that a collision occurred, the victim is much more likely to be male than female.

This plot would be more interesting if I had the total number of drivers by gender, allowing for a collisions per capita measure. The lack of a per capita baseline will be a recurring theme with this data, and addressing it would be one of my main extensions of this analysis.

Exploration: Playing in the Raw Data

The next thing I wanted to do was look at the lowest and highest collision days in the data. I limited to the top and bottom 0.5% so I could review the results manually. 
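A minimal sketch of this kind of ranking is shown below. It assumes one collision date per row and is not necessarily how the tables in this post were produced; `extreme_days` is a name made up for the example.

```python
from collections import Counter
from datetime import date
import math


def extreme_days(collision_dates, fraction=0.005):
    """Return (lowest, highest) lists of (date, count) pairs.

    Each list holds the given fraction of all days that appear in the data.
    Days with zero collisions never appear, so they cannot be ranked here.
    """
    per_day = Counter(collision_dates)
    # Sort by count, then by date so that ties come out in a stable order.
    ranked = sorted(per_day.items(), key=lambda item: (item[1], item[0]))
    k = max(1, math.ceil(len(ranked) * fraction))
    lowest = ranked[:k]
    highest = ranked[-k:][::-1]  # largest count first
    return lowest, highest
```

For roughly 3,500 days of data, a fraction of 0.005 gives about 18 days at each extreme, which is small enough to review by hand.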

Here are the lowest collision days:

Columns and rows depicting the day of the year, incident count, day of the week, and closest holiday if applicable

What are some things you can pick out from just this data?

  • First, most low-collision days occur before early 2014. We’ll see later that monthly collisions start rising after 2014.
  • Second, most low-collision days are around holidays.

And here are the highest collision days:

Two columns and many rows showing date, incidents, and day of the week in descending order

Things of note:

  • Most high-collision days are Fridays occurring after 2015.
  • We’ll see later that Fridays typically have the highest number of collisions of any day of the week.

From these two data sets we should ask ourselves some questions:

    • Why are only some holidays associated with low-collision days? For example, MLK Day often has a low number of collisions but Independence Day never does.
    • The 4th of July, Memorial Day, and Labor Day never show up as either low- or high-collision days. Why?
    • Daylight Saving Time does not show up as any kind of outlier in any year. Why?
    • Would it be interesting to look at the weather for high-collision days?

Exploration: Time

Now I’m ready to look into the main questions I identified above. In this section, I analyze how collision patterns vary by time of day, day of the week, and time of year.

First, I look at a plot of daily collisions for 1 year. In addition to collision date, the data has a field for the reporting date. This is the date the collision was actually reported to police. In most cases, the reporting date is the same day or 1 day after the collision date, but not always.

What are some things you can pick out from just these graphs?

  • There’s substantial variation in collisions and collisions reported per day.
  • The outliers in the data don’t have an obvious pattern.
  • At the daily level, collisions and collisions reported have noticeable differences. Look at the mid-April spike in collisions reported…there’s nothing similar in actual collisions!
  • There may be administrative dynamics at play with when collisions are reported or processed.

Exploration: Broaden Your Time View

Let’s look at a similar plot aggregated by month. At this level, I can include the entire time frame from 2010-2019.

What are some things you can pick out from just these graphs?

  • Monthly collisions were roughly constant from 2010-2014, rose from 2014 to 2017, and have been roughly constant since.
  • The overall trend is the same for collisions vs. collisions reported.

From this data set we should ask ourselves a question:

    • Why the rise? Is it the population growth in LA? Ridesharing service growth? 

Exploration: Time From Broad to Narrow

Now I look at the distribution of collisions throughout the day.

Bar graph showing collision incidents by time of day. Graph is an imperfect S curve.

Notice anything from just this graph?

  • Collisions are:
    • sharply increasing from 4/5am to 7:30/8am
    • decreasing from 7:30/8am to 8:30/9am
    • generally increasing from 8:30/9am to 6pm
    • sharply decreasing from 8pm to 4/5am
    • at their daily minimum at 4/5am
    • at their daily maximum at 5/6pm
  • These results likely mirror the number of vehicles on the road.
  • As mentioned previously, it would be more interesting to have a measure of total vehicles on the road per time to get a measure of collisions per capita.
  • No hourly timestamps are available for when collisions are reported.

Next I look at collisions by day of the week.

Bar graph showing collision incidents by the days of the week. Very little change between days.

How would you analyze these graphs?

  • Collisions are:
    • increasing from Sun to Fri, with a sharp increase from Thu to Fri
    • at their weekly minimum on Sundays
    • at their weekly maximum on Fridays
  • My current hypothesis for the high number of Friday collisions is an end-of-week effect.
  • For collisions reported:
    • Sun / Fri are still the lowest / highest
    • weekdays are pretty even
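Counts like the ones behind the time-of-day and day-of-week charts can be tallied in a few lines. This is a sketch only, assuming each collision has been parsed into a Python `datetime`; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def counts_by_hour_and_weekday(timestamps):
    """Tally collisions per hour of day (0-23) and per day of week."""
    by_hour = Counter(ts.hour for ts in timestamps)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)
    return by_hour, by_weekday


# Made-up timestamps, for illustration only.
sample = [
    datetime(2018, 6, 1, 17, 30),  # a Friday
    datetime(2018, 6, 1, 17, 5),   # the same Friday
    datetime(2018, 6, 3, 4, 15),   # a Sunday
]
by_hour, by_weekday = counts_by_hour_and_weekday(sample)
print(by_hour.most_common(1), by_weekday.most_common(1))
```

Raw counts per weekday are only comparable when the data covers about the same number of each weekday, which holds for a multi-year range.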

Finally, I look at collisions by month.

Bar graph showing collision incidents by the month of the year. Very slight change between months, but could be significant.

Based on the graphs we see that:

  • Collisions are:
    • generally constant from April to August
    • generally lower from September to February
    • highest in March
  • Collisions are lower in the colder months.
  • Collisions and collisions reported are very similar at the monthly level.

Further Questions

From this data set we should ask ourselves:

    • Since none of the high-collision outliers occurred on the day of a Daylight Saving Time change, what is causing the overall shift?
    • Could spring break tourists cause the spike in March? Or is it because there are fewer tourists in colder months?

Exploration of Time: Initial Hypothesis

Given all of these plots, here are my takeaways for the temporal pattern of collisions:

  • Collisions and collisions reported vary substantially at the daily level, but not at the monthly level.
  • Monthly collisions were roughly constant from 2010-2014, rose from 2014 to 2017, and have been roughly constant since.
  • Number of collisions is lowest at 4/5 AM and highest at 5/6 PM.
  • Lowest number of collisions happen on Sundays and highest on Fridays.
  • Collisions are highest in March and lowest from September to December.

Exploration: Collisions by Geography

The next topic I want to cover is analyzing collisions geographically. Before getting into mapping, I start by looking at the distribution of collisions by “area”, a field provided in the data set.

Bar graph showing incidents by region in descending order.
  • Some areas obviously have more collisions than others.
  • However, this plot would be more informative with a measure of area size or traffic density.

Exploration: Narrow Geography

The data also includes fields called “location” and “cross_street”. “location” is the main street a collision occurred on, while “cross_street” is the nearest cross street. I’ll look at the 10 most common values for these fields and their combination.

  • The 10 most common “location” values are some of the longest and most used roads in LA.
  • These top 10 streets account for >10% of total collisions. There are >25K distinct “location” values, so there’s a very long tail.
  • 5% of collisions have no associated “cross_street”.
  • Otherwise, the “cross_street” list has a lot of overlap with the “location” list.
  • The most common “location” / “cross_street” combinations contain many of the streets we saw in the previous 2 lists.
  • However, there are exceptions: the components of row 2 (Tampa Ave. and Nordhoff St.) don’t appear in either the most common “location” or “cross_street” lists.
  • Even the most collision-prone intersections account for a small proportion of overall collisions.
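For readers who want to reproduce this kind of ranking, here is a minimal sketch. It assumes the records are available as (location, cross_street) pairs with an empty or missing cross street where none was recorded; `top_intersections` is a made-up name.

```python
from collections import Counter


def top_intersections(records, n=10):
    """Rank (location, cross_street) pairs by number of collisions.

    Returns a list of (pair, count, share_of_all_collisions).
    Records without a cross street are skipped in the ranking but
    still count toward the total used for the share.
    """
    records = list(records)
    total = len(records)
    pairs = Counter((loc, cross) for loc, cross in records if cross)
    return [(pair, count, count / total) for pair, count in pairs.most_common(n)]


# Made-up records, for illustration only.
sample = [("TAMPA AVE", "NORDHOFF ST")] * 3 + [("MAIN ST", None)]
print(top_intersections(sample, n=1))
```

Note that this treats “A at B” and “B at A” as different pairs. Sorting the two street names before counting would merge them, at the cost of losing which street was recorded as the main one.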

Exploration: Broad Geography

Now I’ll use mapping because it’s the easiest way to see bigger patterns in relation to geography.

Heat map of traffic accidents overall in 2018
  • This plot shows the interesting shape of Los Angeles.
  • Blue / Red points indicate coordinates with a low / high number of collisions.
  • Coordinates with a low number of collisions (blue points) are rendered with low intensity and look faded.
  • Even on this zoomed-out map, I can see high collision coordinates in the Valley and East LA.

Broad Geography: East LA

To get a better view, I zoom in on one area.

Heat map of East LA traffic accidents overall in 2018
  • This plot shows much of central and downtown LA.
  • Many high and medium collision coordinates are clearly visible.
  • Some main streets seem to have high or medium collision intersections occurring quite often.

Next, I layer the time of day on the map.

EAST LA: 2018 Collisions by Daypart (Coordinates with 5+ collisions)
Early Morning and Late Morning collision heat map
Afternoon and Evening traffic collision heat map
Late Night traffic collision heat map
  • This plot only includes coordinates with 5+ collisions. Each coordinate is assigned to the daypart in which most of its collisions occur, and coordinates with ties are thrown out.
  • There are no coordinates with a majority of collisions occurring in the early morning. This makes sense given the Collisions by Time analysis above.
  • The afternoon and evening are when many collisions happen. We also saw this in the Collisions by Time analysis.
  • Interestingly, we can see a cluster of coordinates where late night collisions are common.
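The daypart assignment described above can be sketched as follows. The hour boundaries for each daypart are assumptions made for this example, since the exact cut-offs aren’t stated here, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def daypart(hour):
    """Map an hour (0-23) to a daypart. The boundaries are assumptions."""
    if 4 <= hour < 8:
        return "Early Morning"
    if 8 <= hour < 12:
        return "Late Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Late Night"


def dominant_dayparts(collisions, min_collisions=5):
    """collisions: iterable of ((lat, lon), hour) pairs.

    Returns {coordinate: daypart} for coordinates with enough collisions
    and a single most common daypart. Ties are dropped.
    """
    per_coord = defaultdict(Counter)
    for coord, hour in collisions:
        per_coord[coord][daypart(hour)] += 1

    result = {}
    for coord, counts in per_coord.items():
        if sum(counts.values()) < min_collisions:
            continue  # too few collisions at this coordinate
        ranked = counts.most_common(2)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top two dayparts
        result[coord] = ranked[0][0]
    return result
```

Because a coordinate only needs a single most common daypart, not a strict majority, a coordinate with collisions spread across all five dayparts can still be assigned to one of them.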

Broad Geography: Weekday Vs. Weekend

Let’s look at a similar map broken out by weekday/weekend.

Heat map of collisions in East LA comparing weekdays to weekends

What are some things you can pick out from this side by side comparison?

  • There are obviously more days and collisions in the “Weekday” bucket.
  • This map can be used to identify areas with many weekend collisions.
  • It might also be interesting to include part or all of Friday in the “Weekend” bucket.

EAST LA: 2018 Collisions by Season (Coordinates with 3+ collisions)
Heat map of traffic collisions in East LA between Dec – Feb
Heat map of traffic collisions in East LA between Mar – May and June – Aug
Heat map of traffic collisions in East LA between Sep – Nov

What are some things you can pick out from this series of maps?

  • It looks like more collisions occur in the spring and summer months (Mar-Aug). This would line up with the results in the Collisions by Time section.
  • Given that Los Angeles doesn’t have distinct seasons like fall or winter, there may be other ways to split up the year for a plot like this.

Exploration of Geography: Initial Hypothesis

Here are my conclusions for the Collisions by Geography section:

  • Through the location field, cross_street field, and mapping it’s possible to identify the most accident-prone coordinates in Los Angeles.
  • The most common daypart for collisions is the afternoon or evening. Mapping collisions shows areas where particular dayparts are most common.
  • Many more collisions occur during the week than the weekend.
  • More collisions happen in the summer than winter months.

Application of Hypothesis: Collision Prediction

The final section will deal with trying to use our initial insights above to predict collisions. I specifically try to predict the number of collisions that will occur per month and area.

Let’s start by looking at an example of the collision time series for one area (listed below as Area 2):

Line graph of collisions in East LA over the years
  • The trend for area 2 generally matches the trend of overall monthly collisions

Application of Hypothesis: Decompose Time Series

Let’s decompose this time series into trend, seasonality, and remainder components.

  • Keep the shape of the seasonality curve in mind. I’ll compare it against overall monthly collisions next.
  • The trend looks generally as expected.
  • Seasonality curve for the overall data looks very different! This is my first indication that different areas can have different dynamics.
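To make the three components concrete, here is a minimal sketch of a classical additive decomposition in plain Python. This is one common method and not necessarily the one used to produce the plots in this post; the function name is made up, and it assumes monthly data with an even period of 12.

```python
def decompose_additive(series, period=12):
    """Split a series into trend, seasonal, and remainder parts.

    Additive model: series = trend + seasonal + remainder.
    Assumes an even period and at least two full periods of data.
    """
    n = len(series)
    if period % 2 != 0 or n < 2 * period:
        raise ValueError("need an even period and at least two full periods")
    half = period // 2

    # Trend: centred moving average; the two end points get half weight.
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        trend[i] = (sum(window) - 0.5 * (window[0] + window[-1])) / period

    # Seasonality: average detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # centre the seasonal effects on zero
    seasonal = [means[i % period] - offset for i in range(n)]

    # Remainder: whatever trend and seasonality do not explain.
    remainder = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, remainder
```

The trend is undefined for the first and last six months because the centred window does not fit there. Running the same function on one area and on the citywide totals gives two seasonal curves that can be compared directly.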

Next, I look at more exploratory plots for area 2.

  • The ACF (autocorrelation function) measures how correlated lagged values of area 2 collisions are with the current value.
  • The PACF (partial autocorrelation function) measures how much each lag adds once the shorter lags have been accounted for.
  • These plots indicate that any model predicting area 2 collisions should include at least 2 lagged terms.
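For reference, the sample autocorrelation behind an ACF plot can be computed directly. The sketch below covers only the ACF, not the PACF, and uses a hypothetical function name.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        # Pair each value with the value `lag` steps earlier.
        numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
        acf.append(numerator / denominator)
    return acf


# A series that flips sign every step is negatively correlated at lag 1
# and positively correlated at lag 2.
print(autocorrelation([1, -1, 1, -1, 1, -1, 1, -1], max_lag=2))
```

A common rule of thumb treats lags with an absolute autocorrelation above roughly 1.96 divided by the square root of the series length as worth a closer look.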

A Brief Interlude on Modeling Types

I try a few different model types in the examples below:

  • 3 and 6 month moving average models (MA)
  • ARIMA
  • Prophet

MA models average past values to generate predictions and are the simplest time series model. 
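A moving average forecast is simple enough to sketch in a few lines. The version below produces one-step-ahead forecasts, where each prediction uses only the actual values before it. That evaluation setup is an assumption for the example; the function names are made up.

```python
def moving_average_forecast(history, window):
    """Predict the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for this window")
    return sum(history[-window:]) / window


def rolling_forecasts(series, window, horizon):
    """One-step-ahead forecasts for the last `horizon` points of a series.

    Each forecast only sees the actual values that come before it.
    """
    start = len(series) - horizon
    return [
        moving_average_forecast(series[:t], window)
        for t in range(start, len(series))
    ]


# Made-up monthly counts, for illustration only.
print(rolling_forecasts([10, 12, 14, 16, 18, 20], window=3, horizon=2))
```

A longer window smooths out more noise but reacts more slowly to a change in level, which is the trade-off between a 3 month and a 6 month model.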

ARIMA models can use past values, differencing, and previous errors. Prophet is an additive forecasting model where nonlinear trends are fit with yearly and monthly seasonality. 

The MAPE gives me an idea of how far my predictions are from actual collision values while the bias lets me know if I am systematically over- or under-predicting the data.

A Comparison of the Models

I evaluate each model on the final 12 months of data (August 2018 to July 2019) with the following metrics:

  • MAPE: average absolute percentage difference between prediction and actual
  • Bias: average percentage difference between prediction and actual
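Both metrics are short to write down. In this sketch the bias is defined as prediction minus actual, so a positive value means overprediction, which matches how bias is described in this post. The function names are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def bias(actual, predicted):
    """Mean signed percentage error, in percent. Positive means overprediction."""
    errors = [(p - a) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Made-up values: one prediction 10% too high, one 10% too low.
actual = [100, 200]
predicted = [110, 180]
print(mape(actual, predicted), bias(actual, predicted))
```

In the made-up example the two errors cancel in the bias but not in the MAPE, which is why it helps to report both. Neither metric is defined when an actual value is zero.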

Here are the overall model results:

  • This table reflects:
    • Auto-fit ARIMA models for each area. So each area can have a different p, d, and q.
    • The best result of multiple Prophet model specifications.
  • 3 and 6 month MA models have very similar results.
  • ARIMA model has a similar MAPE to the MA models, but worse bias.
  • Prophet model has the worst results by far (even after tuning).
  • It’s surprising that the MA models have the best performance!

Applying the Models to our Data

Let’s look at the model predictions for area 2 only.

Line graph comparing different predictive models
  • Prophet model has much higher predictions than other models for the first 5 months.
  • All models miss the drop in Jan 2019 and the up/down pattern of April 2019-July 2019.
  • 6 month MA model predictions don’t vary that much.
  • MA and ARIMA models seem to make conservative predictions that don’t capture the fluctuating nature of the data.

Next, I’ll look at the average MAPE per month per model.

Line graph comparing different predictive models
  • For the first half of the validation data, the MA and ARIMA models move together.

It’s also worth looking at the average bias per month per model.

  • Interestingly the bias for the MA and ARIMA models moves together.
  • Note that the Prophet model has positive bias (overpredicts) for the entire validation set.

Now, I’ll look at the average MAPE and bias per model per area.

Bar graph comparing models; MA, ARIMA, and Prophet
  • Of note, the average MAPE performance varies substantially by area. For example, model performance on area 12 is pretty good, while performance on area 3 is pretty bad.
Bar graph comparing models; MA, ARIMA, and Prophet
  • Average bias performance also varies substantially by area.
  • Surprisingly, most models have positive bias (overpredict) in most areas!

Modeling Using the Extremes

Next, I’ll look at the worst area/month predictions per model.

What should you notice?

  • Area 14 shows up twice
  • Both Jan and Feb of 2019 show up
  • It’s interesting to see cases where all models struggled (Jan 2019 in area 2) vs. cases where one model in particular struggled (Sept 2018 in area 14).

So based on the worst predictions, let’s zoom into area 14.

Comparing the models using data from Area 14
  • The trend for area 14 is completely different than for area 2 above.
  • Importantly all models except Prophet miss the spike in July 2019.

What have the Models Proved?

Based on the graphed out models, here are my conclusions for the Collision Prediction section:

  • Overall monthly performance is not bad (<10% MAPE and bias in most cases).
  • However, MA models having the best performance indicates that longer-term lagged data, differencing, and previous errors don’t help performance.
  • This is a surprising result, but it suggests that the number of collisions per month/area is largely random within a certain range.
  • Trends seem to vary by area. In theory, this should be addressed by the ARIMA and Prophet methods, which fit a separate model per area.

From Analysis to Science: Next Steps

Most importantly, to extend this analysis I would get a measure of per capita collisions. To do this, I need to know how many vehicles are on the road by time, day, gender, etc. This would add context to much of the Collisions by Time section.

Additionally, I would do more work on Collision Prediction. External data sources (such as weather) may be helpful. We could start by zooming into 1 area and trying to understand the dynamics affecting actual collision values.

I would also need to follow up on some concerns presented by this data: the catch-all ages, the possibility of multiple victims, and unreported minor incidents. If this were an active client relationship I could go back to them and request clarification or design a process for better data collection moving forward.

With that in place, we would have a base for a machine learning algorithm that could eventually help prevent these collisions by dispatching resources to where they are needed most.

Jai is a data scientist with Acorn who specializes in building models that help our team understand the avalanche of numbers that come in from our clients.

If you like the work you just read and want to work with Jai directly, fill out the Contact Us form and mention him or this blog. 
