The Tortoise, the Hare, & the Data Scientist

“Maybe you should quit. This might be too hard for you.”

No, not quit Metis. Metis is going great! That suggestion to quit came from a well-meaning stranger who saw me stumbling through the 2013 Vancouver Marathon. It was my 7th marathon and, by far, my most infamous. I expected to finish in roughly 3 hours, 50 minutes, but instead shuffled and stumbled my way to a 5-hour finish. It’s always been a mystery why.

Another mystery came 2 years later, when the script was flipped. I ran the Vancouver Marathon again, but training was a struggle and I entered the marathon unsure I’d finish. Shockingly, I cruised through the race and pumped my fist at the end after shattering my PR.

My ups and downs in Vancouver exemplify how marathons are notoriously unpredictable. I ran Vancouver 4 years in a row, 2013-2016, and despite being in similar shape all 4 years, my times varied by 75 minutes.

But if something is notoriously unpredictable … why not try to predict it?

Predicting the Chicago Marathon

For my latest data science project, I scraped data on almost 37,000 runners who completed the 2016 Chicago Marathon and used the data to predict finishing times. I was particularly interested in how a runner’s pacing strategy might influence their final time. This website allows you to collect runners’ times during each 5K interval, which I used to essentially test the moral behind “The Tortoise and the Hare” – can a slow and steady tortoise outrun the quicker hare who peaks too early?

Honestly, this project was more fun than I should admit in public. I’m relatively new to web scraping but enjoy it a lot. And I was excited to test conventional wisdom about marathons, which basically says you should start slow and end fast when running. I’ve always hated that advice – not because experts are wrong, but because it’s easy to give that advice and much harder to follow it.

So, after a week of model development and cross-validation, do I think experts are wrong? Um … no. In fact, it was stunning to see how conventional wisdom, a few imperfect variables, and ridge regression could accurately predict runner performance.

How my model evolved

As a first step, I used runners’ sex, age group, nationality, and bib number to predict their time. These basic runner characteristics alone accounted for 63% of the variance in finishing times. (I know “bib number” sounds odd, but in the case of Chicago, it’s essentially a proxy for general speed. Ideally, I would know a runner’s actual speed, but in absence of that, bib number was a reasonable backup.)
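For readers who want to see what "accounted for 63% of the variance" means in practice, here is a minimal sketch of the R^2 calculation. It is an illustration only, not the project's code; the function name `r_squared` and its inputs (lists of actual and predicted finishing times in minutes) are assumptions.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that `predicted` explains.

    Both arguments are equal-length lists of finishing times (e.g. minutes).
    Returns 1.0 for perfect predictions and 0.0 for a model that does no
    better than always guessing the average finishing time.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the "always guess the mean" baseline.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

An R^2 of 0.63 would mean the model's leftover squared error is 37% of what you would get by predicting the average time for everyone.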

Sixty-three percent is a solid start, but it was a dull 63%. Sex, age, nationality, and general speed are features that a runner has no control over on Race Day. If a data scientist told me at the start of a marathon, “63% of your fate is already set in stone – have a nice run,” I’d probably punch him. I don’t care about things I can’t control during a race; I want to hear about things I can control. Besides, the mean error from this preliminary model was 30 minutes. That leaves a lot of room for improvement, especially for hard-core runners, many of whom would donate a kidney if it shaved 1 minute off their PR.

That’s where pacing comes in – can how you pace yourself affect your time? To test this, I carefully selected 3 pacing measures that are not inherently tied to time: A) when runners hit their peak (i.e., which 5K was fastest?), B) percent change during the first half of the marathon, and C) how erratic their pace was, from one 5K to the next, during the first half. Adding these 3 measures to the model improved the R^2 from 63% to 77% and reduced the mean error from 30 minutes to 23 minutes.
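The three pacing measures could be computed along these lines. This is a sketch under stated assumptions, not the formulas actually used in the project: the input is taken to be the time in seconds for each consecutive 5K segment, the "first half" is approximated by the first four segments (20K), percent change compares the last first-half segment to the first, and erraticness is the mean absolute percent change between neighbouring segments. All names are hypothetical.

```python
from statistics import mean


def pacing_features(segment_seconds, first_half_segments=4):
    """Derive three pacing measures from per-5K segment times (seconds).

    Assumed definitions (the original post does not spell them out):
      A) peak_segment: 1-based index of the fastest 5K segment supplied
      B) pct_change_first_half: percent change in segment time from the
         first to the last first-half segment (positive = slowing down)
      C) erraticness: mean absolute percent change between consecutive
         first-half segments
    """
    if first_half_segments < 2 or len(segment_seconds) < first_half_segments:
        raise ValueError("need at least two first-half segments")

    # A) The fastest segment is the one with the smallest time.
    fastest_index = min(range(len(segment_seconds)),
                        key=segment_seconds.__getitem__)
    peak_segment = fastest_index + 1

    first_half = segment_seconds[:first_half_segments]

    # B) Overall drift in pace across the first half.
    pct_change = (first_half[-1] - first_half[0]) / first_half[0] * 100

    # C) Average size of the jump from one segment to the next.
    jumps = [abs(b - a) / a * 100 for a, b in zip(first_half, first_half[1:])]
    erraticness = mean(jumps)

    return {
        "peak_segment": peak_segment,
        "pct_change_first_half": pct_change,
        "erraticness": erraticness,
    }
```

None of the three measures uses the runner's absolute speed: a 3-hour runner and a 5-hour runner who slow down by the same percentage get the same values, which is the sense in which they are "not inherently tied to time."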

It was also fascinating to see how different combinations of pacing could affect your final time. Using the fitted ridge regression model, I predicted the final times of 4 hypothetical runners who have the same characteristics as me (same sex, age, nationality, and general speed) but vary their pacing in different ways. The predicted final times of these 4 runners varied by 72 minutes! Coincidentally or not, that’s almost exactly how much my times varied during those 4 years in Vancouver.
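To make the "hypothetical runners" idea concrete, here is a small, dependency-free sketch of ridge regression and prediction. It is not the project's code (a library implementation was more likely used); the function names and the `alpha` penalty parameter are assumptions. Ridge is ordinary least squares plus a penalty on the size of the coefficients, solved here through the normal equations (X^T X + alpha I) w = X^T y on centered data so that the intercept is not penalized.

```python
def _solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ridge(rows, targets, alpha=1.0):
    """Fit ridge regression; returns (intercept, coefficients).

    rows: list of numeric feature lists, one per runner.
    targets: list of finishing times.
    alpha: penalty strength (alpha=0 reduces to ordinary least squares,
           which requires the features not to be perfectly collinear).
    """
    n, p = len(rows), len(rows[0])
    x_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    # Center features and target so the intercept stays unpenalized.
    xc = [[r[j] - x_means[j] for j in range(p)] for r in rows]
    yc = [t - y_mean for t in targets]
    gram = [[sum(xc[i][j] * xc[i][k] for i in range(n))
             + (alpha if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    xty = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    coefs = _solve(gram, xty)
    intercept = y_mean - sum(c * m for c, m in zip(coefs, x_means))
    return intercept, coefs


def predict(intercept, coefs, row):
    """Predicted finishing time for one runner's feature list."""
    return intercept + sum(c * v for c, v in zip(coefs, row))
```

With a fitted model, comparing hypothetical runners amounts to calling `predict` on feature lists that share the same sex, age group, nationality, and bib values and differ only in the pacing columns. In practice the features would usually be standardized first, since the ridge penalty depends on each feature's scale.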

“But it’s just a model – how do you know you’re right?”

The best way to test a model is to, well … test the model. As luck would have it, the 2017 Chicago Marathon was this past weekend. Because I’m a nerd (and a hyper-competitive nerd at that) I tested my model by randomly picking 10 runners during the marathon on Sunday and using their data during the first half to predict their final times.

How did my model do? Not bad, considering that I threw this model together in a week and didn’t use any measure of their actual speed during the race. My predictions were within 6 minutes for 4 of the 10 runners, and within 15 minutes for 6 of the 10. The mean error was only 20 minutes despite one runner clearly hitting a wall in the second half (causing the model to whiff by 70 minutes on that poor guy).
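The summary numbers above (mean error, share of runners within a given margin) can be computed with a few lines. This sketch assumes "mean error" means mean absolute error in minutes; the function names are hypothetical and no real runner data is included.

```python
def mean_abs_error(actual, predicted):
    """Average size of the miss, ignoring whether it was over or under."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def share_within(actual, predicted, tolerance):
    """Fraction of predictions that miss by at most `tolerance`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)
```

Reporting both is useful with a sample of 10, because a single large miss pulls the mean up sharply while barely changing the share of predictions within 6 or 15 minutes.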

The real question – does the tortoise win?

No. Not unless it’s a very fit tortoise. It was obvious in the model results that nothing replaces being in great condition. As a trainer once told me, “To be fast … first you need to be fast.” According to both conventional wisdom and my model, no magic pacing strategy will turn me into an Olympian. Still, it was staggering to see how much pacing could affect your time – and, during my 4 years in Vancouver, probably did.
