by Jai Bansal
How much time did you invest in your machine learning model? There’s a chance that your results are horribly skewed. If a data scientist doesn’t pay attention to random seeds they could be negating all their work.
What is a Random Seed?
A random seed is used to ensure that results of your model are reproducible. In other words, the use of this parameter ensures that anyone who runs your code will get the exact same outputs. In data science reproducibility is extremely important. Lots of people have already written about reproducibility at length, so I won’t discuss it any further in this post.
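To make the idea concrete, here is a minimal sketch using NumPy's random generator. It is only an illustration: the helper name `draw` and the seed values are hypothetical, chosen to mirror the date-style seeds mentioned later in this post.

```python
import numpy as np

def draw(seed):
    """Return five pseudo-random integers from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.integers(low=0, high=100, size=5)

first = draw(20200301)
second = draw(20200301)  # same seed, so the same sequence of numbers
other = draw(20200302)   # different seed, so almost certainly different numbers

print(first, second, other)
print("Same seed reproduces the draw:", np.array_equal(first, second))
```

The same principle applies to any library that accepts a seed: fixing the seed fixes the sequence of "random" choices the code makes.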
When to use a random seed
You may not need a random seed. However, there are 2 common tasks where they are used:
- When you split data into training/validation/test sets: random seeds ensure that the data is divided the same way every time the code is run
- Model training: Algorithms such as random forest and gradient boosting are non-deterministic (for a given input, the output is not always the same), so they require a random seed argument for reproducible results
Random seeds also matter for benchmarking. If you test multiple versions of an algorithm, all versions should be trained and evaluated on the same data split. The versions should also be as similar as possible (except for the parameters you are testing).
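As a small illustration of the data-splitting case, the sketch below splits a stand-in array of row indices three times. The array `rows` and the seed values 42 and 7 are assumptions made for the example, not values from the analysis in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in for the row indices of a real dataset

# Two splits with the same seed, one with a different seed
train_a, val_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, val_b = train_test_split(rows, test_size=0.2, random_state=42)
train_c, val_c = train_test_split(rows, test_size=0.2, random_state=7)

same_seed_matches = np.array_equal(val_a, val_b)
different_seed_matches = np.array_equal(val_a, val_c)

print("Same seed, same validation rows:", same_seed_matches)
print("Different seed, same validation rows:", different_seed_matches)
```

When benchmarking several model versions, reusing one seeded split like `train_a` / `val_a` keeps the data constant, so any difference in scores comes from the models rather than from the split.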
How Random Seeds Are Usually Set
Despite their importance, random seeds are often set without much thought. I’m guilty of this. I use the date of whatever day I work on (so on March 1st, 2020 I would use the seed 20200301). Some people use the same seed every time. Others randomly generate their seeds. Overall, random seeds are typically treated as an afterthought in the modeling process. In the next few sections we’ll see that this can be problematic because the choice of seed can significantly affect results.
Example: Titanic Data
Now, I’ll demonstrate just how much impact the choice of a random seed can have. I’ll use the well-known Titanic dataset to do this (available for download here: https://www.kaggle.com/c/titanic/data). The following code and plots are created in Python, but I found similar results in R.
First, let’s look at a few rows of this data:
import pandas as pd

train_all = pd.read_csv('train.csv')

# Show selected columns
train_all.drop(['PassengerId', 'Parch', 'Ticket', 'Embarked', 'Cabin'], axis = 1).head()
The Titanic data is already divided into training and test sets. A classic task for this dataset is to predict passenger survival. This is encoded in the “Survived” column above. The test data does not come with labels for the “Survived” column.
Next, I’ll do the following:
- Hold out a portion of the training data to serve as a validation set
- Train a model to predict survival on the remaining training data and evaluate that model against the validation set created in step 1
Titanic Data: Splitting Data
First look at the overall distribution of the “Survived” column.
In : train_all.Survived.value_counts() / train_all.shape[0]
Out:
0    0.616162
1    0.383838
Name: Survived, dtype: float64
When modeling, we want our training, validation, and test data to be as similar as possible. This is so our model is trained on the same kind of data that it’s being evaluated against. Note that this does not mean that any of these 3 data sets should overlap! They should not. But we want the observations contained in each of them to be broadly comparable. I’ll now split the data using different random seeds and compare the resulting distributions of “Survived” for the training and validation sets.
from sklearn.model_selection import train_test_split

# Create data frames for dependent and independent variables
X = train_all.drop('Survived', axis = 1)
y = train_all.Survived

# Split 1
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 135153)

In : y_train.value_counts() / len(y_train)
Out:
0    0.655899
1    0.344101
Name: Survived, dtype: float64

In : y_val.value_counts() / len(y_val)
Out:
0    0.458101
1    0.541899
Name: Survived, dtype: float64
In this case, the proportion of survivors is much lower in the training set than the validation set.
# Split 2
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 163035)

In : y_train.value_counts() / len(y_train)
Out:
0    0.577247
1    0.422753
Name: Survived, dtype: float64

In : y_val.value_counts() / len(y_val)
Out:
0    0.77095
1    0.22905
Name: Survived, dtype: float64
Here, the proportion of survivors is much higher in the training set than the validation set.
Full disclosure, these examples are the most extreme ones I found after looping through 200K random seeds. Regardless, there are a couple of concerns with these results. First, in both cases, the survival distribution is substantially different between the training and validation sets. This will likely negatively affect model training. Second, these outputs are very different from each other. If, as most people do, you set a random seed arbitrarily, your resulting data splits can vary drastically depending on your choice.
I’ll discuss best practices at the end of the post. Next, I want to show how the training set “Survival” percentage varied for all 200K random seeds I tested.
~23% of data splits resulted in a survival rate difference of at least 5 percentage points between training and validation sets. More than 1% of splits resulted in a difference of at least 10 percentage points. The largest difference was ~20 percentage points. The takeaway here is that using an arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process.
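A seed sweep like the one described above could be reproduced along these lines. This is a minimal sketch, not my original code: it assumes a synthetic label column with the same class balance as the Titanic training data (549 non-survivors, 342 survivors) so that it runs without train.csv, it loops over only 200 seeds instead of 200K, and the helper name `survival_gap` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the "Survived" column (assumption: 549 zeros, 342 ones)
y = pd.Series(np.r_[np.zeros(549, dtype=int), np.ones(342, dtype=int)], name="Survived")
X = pd.DataFrame({"row_id": np.arange(len(y))})

def survival_gap(seed):
    """Absolute difference in survivor share between training and validation sets."""
    _, _, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    return abs(y_train.mean() - y_val.mean())

seeds = range(200)  # kept small so the sketch runs quickly
gaps = pd.Series([survival_gap(s) for s in seeds], index=list(seeds))

print(gaps.describe())
print("Share of splits with a gap of at least 5 points:", (gaps >= 0.05).mean())
print("Seed with the largest gap:", gaps.idxmax())
```

With the real data, swapping in `X` and `y` from `train_all` and widening `seeds` would give the distribution summarized above.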
Titanic Data: Model Training
The previous section showed how random seeds can influence data splits. In this section, I train a model using different random seeds after the data has already been split into training and validation sets (more on exactly how I do that in the next section).
As a reminder, I’m trying to predict the “Survived” column. I’ll build a random forest classification model. Since the random forest algorithm is non-deterministic, a random seed is needed for reproducibility. I’ll show results for model accuracy below, but I found similar results using precision and recall.
First, I’ll create a training and validation set.
# These will be my predictors (copy to avoid modifying a slice of the original frame)
X = X[['Pclass', 'Sex', 'SibSp', 'Fare']].copy()

# The “Sex” variable is a string and needs to be one-hot encoded
X['gender_dummy'] = pd.get_dummies(X.Sex)['female']
X = X.drop(['Sex'], axis = 1)

# Divide data into training and validation sets
# I’ll discuss exactly why I divide the data this way in the next section
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)
Now I’ll train a couple models and evaluate accuracy on the validation set.
# Model 1
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 11850)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val)  # Get predictions

In : round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out: 0.765

# Model 2
# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 2298)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val)  # Get predictions

In : round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out: 0.827
I tested 25K random seeds to find these results, but a change in accuracy of more than 6 percentage points is definitely noteworthy! Again, these 2 models are identical except for the random seed.
The plot below shows how model accuracy varied across all of the random seeds I tested.
While most models achieved ~80% accuracy, there are a substantial number of models scoring between 79%-82% and a handful of models that score outside of that range. Depending on the specific use case, these differences are large enough to matter. Therefore, model performance variance due to random seed choice should be taken into account when communicating results with stakeholders.
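The model-seed sweep can be sketched as follows. This is an illustrative version under stated assumptions: it uses a synthetic dataset from `make_classification` in place of the Titanic predictors, tries only 20 seeds rather than 25K, and the helper name `accuracy_for_seed` is hypothetical. The data split is held fixed so that only the model seed changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 891 rows, 4 numeric predictors)
X, y = make_classification(n_samples=891, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# One fixed, stratified split shared by every model
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

def accuracy_for_seed(seed):
    """Validation accuracy of a random forest that differs only in its seed."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_val, clf.predict(X_val))

accuracies = np.array([accuracy_for_seed(s) for s in range(20)])

print("min:", accuracies.min(), "max:", accuracies.max())
print("mean:", accuracies.mean(), "std:", accuracies.std())
```

Reporting the spread (min, max, standard deviation) alongside the mean gives stakeholders a sense of how much of a reported score is down to the seed.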
Now that we’ve seen a few areas where the choice of a random seed impacts results, I’d like to propose a few best practices.
For data splitting, I believe stratified samples should be used. This is so the proportions of the dependent variable (“Survived” in this post) are similar in the training, validation, and test sets. This eliminates the varying survival distributions above and allows a model to be trained and evaluated on comparable data.
The train_test_split function can implement stratified sampling with 1 additional argument. Note that if a model is later evaluated against data with a different dependent variable distribution, performance may be different than expected. However, I believe stratifying by the dependent variable is still the preferred way to split data.
Here’s how stratified sampling looks in code.
# Overall distribution of “Survived” column
In : train_all.Survived.value_counts() / train_all.shape[0]
Out:
0    0.616162
1    0.383838
Name: Survived, dtype: float64

# Stratified sampling (see last argument)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)

In : y_train.value_counts() / len(y_train)
Out:
0    0.616573
1    0.383427
Name: Survived, dtype: float64

In : y_val.value_counts() / len(y_val)
Out:
0    0.614525
1    0.385475
Name: Survived, dtype: float64
With the stratify argument, the proportion of “Survived” is similar in the training and validation sets. I still use a random seed as I still want reproducible results. However, it’s my opinion that the specific random seed value doesn’t matter in this case.
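One way to check the claim that the seed value stops mattering for the class balance is to repeat the stratified split over many seeds. The sketch below does this on a synthetic label column with the Titanic class balance (an assumption, so that it runs without train.csv); the helper name `stratified_gap` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the "Survived" column (assumption: 549 zeros, 342 ones)
y = pd.Series(np.r_[np.zeros(549, dtype=int), np.ones(342, dtype=int)], name="Survived")
X = pd.DataFrame({"row_id": np.arange(len(y))})

def stratified_gap(seed):
    """Train/validation difference in survivor share for a stratified split."""
    _, _, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    return abs(y_train.mean() - y_val.mean())

stratified_gaps = np.array([stratified_gap(s) for s in range(50)])

print("Largest gap across 50 seeds:", stratified_gaps.max())
```

The seed still decides which individual passengers land in each set, so it remains necessary for reproducibility, but the survival proportions stay close to each other for every seed.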
Best Practices in Model Testing
Previously we addressed data splitting best practices, but how about model training? While testing different model specifications, a random seed should be used for fair comparisons but I don’t think the particular seed matters too much.
However, before reporting performance metrics to stakeholders, the final model should be trained and evaluated with 2-3 additional seeds to understand possible variance in results. This practice allows more accurate communication of model performance.
For a critical model running in a production environment, it’s worth considering running that model with multiple seeds and averaging the result. However, this is probably a topic for a separate blog post.
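One possible reading of "multiple seeds and averaging the result" is to train the same model under several seeds and average the predicted probabilities. The sketch below shows that interpretation only; it is not a recommendation for any particular production setup. The synthetic data, the five seed values, and the 0.5 threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 891 rows, 4 numeric predictors, labels 0/1)
X, y = make_classification(n_samples=891, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

seeds = [11, 22, 33, 44, 55]  # arbitrary illustrative seeds
models = [
    RandomForestClassifier(n_estimators=50, random_state=s).fit(X_train, y_train)
    for s in seeds
]

# Average the predicted probability of class 1 across the seeded models
avg_proba = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)
avg_preds = (avg_proba >= 0.5).astype(int)

print("Accuracy of the seed-averaged prediction:", accuracy_score(y_val, avg_preds))
```

For a random forest this is close to simply training one larger forest, so the approach is more interesting for models whose results vary more strongly with the seed.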
Hopefully I’ve convinced you to pay a bit of attention to the often-overlooked random seed parameter. I’ve also hopefully convinced you to use the stratify argument. Feel free to get in touch with other ideas for best practices or experiences you’ve had with the small but mighty random seed!