Building Signal-Based Models with Cross-Validation



One challenge in building analytical models involves separating signal from noise. Signal describes the patterns we consider “real” in the sense that we believe they reflect a true relationship between the factors we’re analyzing. For instance, if our data indicates that a person’s height is related to her weight, we’d probably consider that signal.

Noise, on the other hand, describes patterns present only in the subset of the data we’re examining, and we wouldn’t expect them to continue beyond that. So, if our data tells us that a person’s height is related to her first initial, we’re probably picking up on noise.

Separating Signal from Noise

In predictive analytical models, we hope that basing our models on signal variables will lead to higher accuracy. This consideration invariably leads us to split our data into a training set and a validation set. We feed the training set to our algorithm to detect patterns and fit a model. We then use that model to predict the target values of the validation set we withheld and compare the predicted values to the recorded values. We repeat the process, finally settling on the model whose predictions come the closest:

Figure: Comparing the actual recorded values to the values our model predicted
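The split described above can be sketched in a few lines of Python (standard library only; the 80/20 ratio and the toy height/weight data are illustrative assumptions, not from the article):

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Randomly partition rows into a training set and a withheld validation set."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # randomized sample, per the article's advice
    cut = int(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]

# Hypothetical (height_cm, weight_kg) pairs, used purely for illustration.
data = [(150 + i, 50 + 0.9 * i) for i in range(50)]
train, validation = train_validation_split(data)
print(len(train), len(validation))  # 40 10
```

The model is fit on `train` only; `validation` stays hidden until it is time to score the predictions.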

The idea is this: if we give our algorithm all the data we have, we get a model optimized to “predict” the target values we already have. But that’s not what we want. Instead, we want to optimize our model to predict values we do not already have. So, we withhold a part of our data and force the algorithm to select those model variations that work well on this “unknown” data. That, we hope, will drive the algorithm to use signal variables in fitting the model:

Figure: Breaking up our dataset into training data (used to build different models) and validation data (used to evaluate those models). In most instances, the validation set should be a randomized sample.

Identifying Signal with Cross-Validation

Splitting into training and validation, though, doesn’t eliminate the risk of incorporating noise variables as predictors (also called overfitting). Sure, we’re optimizing a model to predict “unknown” data, but it’s still the same dataset over and over, so we’re building a model that’s optimized to predict a specific subset of our data.

Cross-validation, in response, takes it one step further. Instead of splitting our data into training and validation sets once at the beginning, we go through several rounds, each time re-shuffling and re-splitting our data:

Figure: In cross-validation, we go through multiple rounds of model-building, each time drawing a new validation set.
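One way to picture these repeated rounds is the sketch below (standard-library Python; the trivial mean-only model and the synthetic data are assumptions for illustration, chosen to keep the focus on the re-shuffling loop):

```python
import random
import statistics

def one_round_error(data, seed):
    """One round: re-shuffle, split 80/20, and score a trivial mean-only model."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # a fresh shuffle each round
    cut = int(len(rows) * 0.8)
    train, validation = rows[:cut], rows[cut:]
    prediction = statistics.mean(y for _, y in train)  # always predicts the training mean
    return statistics.mean(abs(prediction - y) for _, y in validation)

# Synthetic (x, y) data; each round draws a different validation set.
data = [(x, 2.0 * x) for x in range(100)]
round_errors = [one_round_error(data, seed) for seed in range(5)]
print(round(statistics.mean(round_errors), 1))  # error averaged over five rounds
```

Averaging the score over several re-shuffled splits gives a steadier estimate of how the model will do on data it has never seen than any single split can.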

A variation of this practice is called k-fold cross-validation, in which we split our data into a specific number of subsets, or folds, and use a different fold as the validation set in each round:

Figure: K-fold cross-validation involves splitting our data into several folds, each of which will serve as the validation set for one of the model-building rounds.
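A minimal k-fold splitter might look like this (standard-library Python; the interleaved "dealing" of shuffled indices into folds is one simple choice among several valid ones):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal shuffled indices into k folds
    for i, validation in enumerate(folds):
        # Every fold except the current one goes into training.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

for train, validation in k_fold_splits(10, k=5):
    print(sorted(validation))  # each sample lands in exactly one validation fold
```

Because every observation serves as validation data exactly once, no part of the dataset is permanently hidden from the model-building process.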

By using these cross-validation methods, we end up with a final model chosen because it performed well across all the different validation sets, not just one. This approach, while resource-intensive, significantly reduces the risk of overfitting and enhances our ability to create robust, accurate, signal-based models.
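As a sketch of how cross-validation scores can drive that choice, the snippet below compares a signal-based candidate (a least-squares line) against a baseline that ignores the input entirely, using average validation error across folds (standard-library Python; the candidate models and the noisy linear data are illustrative assumptions):

```python
import random
import statistics

def k_fold_score(xs, ys, k, fit, predict, seed=0):
    """Average validation error of a candidate model across k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for i, validation in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        errors.append(statistics.mean(abs(predict(model, xs[j]) - ys[j])
                                      for j in validation))
    return statistics.mean(errors)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Noisy linear data: y = 2x + noise, so the relationship with x is real signal.
xs = list(range(60))
ys = [2.0 * x + random.Random(x).gauss(0, 3) for x in xs]

line_error = k_fold_score(xs, ys, 5, fit_line, lambda m, x: m[0] * x + m[1])
mean_error = k_fold_score(xs, ys, 5, lambda xs, ys: statistics.mean(ys),
                          lambda m, x: m)
print(line_error < mean_error)  # True: the signal-based candidate scores better
```

The candidate with the lower average validation error across folds is the one we keep; here the line wins because it captures the signal the baseline ignores.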

What This All Means

Whenever we’re building predictive models, we want to make sure we’re picking up on the underlying truth of the data, rather than the randomness inherent to it. Cross-validation methods are a powerful tool to achieve that end without having to collect additional data, and models built using cross-validation are better equipped to make accurate predictions on unknown data.


More About the Author

Chris Steingass

Analytics Consultant
