Playing with R-squared

While demonstrating our Simulator the other day, we realized that it offers a really great way to play with, and develop an intuitive understanding of, what an R-squared (r²) value represents.

How? Well, it includes an interactive interface to specify the relationship between two variables, which in turn will define the R-squared value of the resulting data.

What can it do?

The Simulator can, amongst other things, simulate a model in which two normally distributed numeric variables have a linear relationship. You can open the model used in this post in DataClassroom here. (You’ll need to register if you don’t have an account.)

This relationship can be adjusted using the interactive interface shown below.
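If you want to see the same idea outside the interface, here is a minimal Python sketch of that kind of model. It is not DataClassroom's implementation: the function names (`simulate`, `r_squared`), the zero means, the default standard deviations, and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(slope, sd_predictor=1.0, sd_residual=1.0, n=5000, seed=0):
    """Draw n (predictor, response) pairs from a simple linear model.

    Assumed model (illustrative only): both means are 0 and
    response = slope * predictor + residual.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, sd_predictor)        # Predictor, normal around its mean
        residual = rng.gauss(0.0, sd_residual)  # residual variation of the Response
        xs.append(x)
        ys.append(slope * x + residual)
    return xs, ys


def r_squared(xs, ys):
    """Squared Pearson correlation.

    With a single predictor this equals the R-squared of the
    least-squares regression line.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


xs, ys = simulate(slope=1.0)
print(round(r_squared(xs, ys), 2))  # should land near 0.5 for these settings
```

Changing `slope` here plays the same role as dragging the orange bar: a steeper slope should push the R-squared value up, a flatter one should pull it down.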

 
 

What you’re seeing:

  • On the X axis, the distribution of the Predictor variable around its mean (the blue curve).

  • On the Y axis, the residual (see below) distribution of the Response variable around its mean (another blue curve).

  • The slope of the relationship, drawn as an orange bar, which can be dragged to adjust the slope.

The purple curve shows the additional variation of the Response variable that is explained by the Predictor variable and the slope of the relationship.

In the example above the resulting R-squared value is displayed as 0.5, which means:

50% of the variation in the Response variable is explained by the variation in the Predictor variable

You can also see this in the picture: on the left-hand (Y) axis, the blue and purple curves are the same size.
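For readers who like to see the arithmetic, the 0.5 can be written out as a ratio of variances. The symbols here are our own shorthand and do not appear in the Simulator: b is the slope, σ_X the standard deviation of the Predictor, and σ_ε the standard deviation of the residual variation, which we assume is independent of the Predictor.

```latex
R^2 = \frac{\text{explained variance}}{\text{total variance}}
    = \frac{b^2 \sigma_X^2}{b^2 \sigma_X^2 + \sigma_\varepsilon^2}

% When the explained and residual variances are equal,
% b^2 \sigma_X^2 = \sigma_\varepsilon^2, so
R^2 = \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + \sigma_\varepsilon^2} = \frac{1}{2}
```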

What are the dots?

The blue dots are a visualization of some possible randomly-taken samples from this model. With a larger R-squared value, you’ll see them cluster more tightly around the line of best fit:

 
 

Now you see that the purple curve is much larger than the blue, and the R-squared value is around 0.95, meaning that

95% of the variation of the Response variable is explained by the Predictor variable

The rest of the variation (the blue curve) could be called the “residual” variation of the Response variable.
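To get a feel for how steep the line has to be before the explained variation dominates like this, here is a small sketch that works backwards from a target R-squared value to a slope. The function names and the unit standard deviations are assumptions for illustration, not anything taken from the Simulator, and only positive slopes are considered.

```python
import math


def theoretical_r2(slope, sd_predictor, sd_residual):
    """R-squared implied by the model: explained variance / total variance."""
    explained = (slope * sd_predictor) ** 2
    return explained / (explained + sd_residual ** 2)


def slope_for_r2(target_r2, sd_predictor, sd_residual):
    """Positive slope that would give the requested R-squared value."""
    if not 0.0 <= target_r2 < 1.0:
        raise ValueError("target_r2 must be in [0, 1)")
    return (sd_residual / sd_predictor) * math.sqrt(target_r2 / (1.0 - target_r2))


# Assumed spreads: both standard deviations set to 1 for simplicity.
for target in (0.05, 0.5, 0.95):
    b = slope_for_r2(target, sd_predictor=1.0, sd_residual=1.0)
    print(target, round(b, 3), round(theoretical_r2(b, 1.0, 1.0), 3))
```

Under these assumptions, an R-squared of 0.5 corresponds to a slope of 1, while 0.95 needs a slope of about 4.36 (the square root of 19), which is why the dots hug the line so tightly.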

 
 

What do we mean by “residual” variation? By this we mean the variation you would see in the Response variable if the Predictor did not vary at all. In real life this would be random variation due to a large number of other effects, and it is that large number of effects that tends to make such variation normally distributed in nature.
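The claim that many small effects add up to something normally distributed can be checked with a quick sketch. Everything here is an assumption made for illustration: each "effect" is drawn from a flat (uniform) distribution between -1 and 1, and we use excess kurtosis (0 for a normal distribution, about -1.2 for a uniform one) as a rough measure of how normal the result looks.

```python
import random
import statistics


def residual_from_many_effects(n_effects, rng):
    """One residual value built as the sum of n_effects small independent effects."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_effects))


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m4 = statistics.fmean((v - m) ** 4 for v in values)
    return m4 / m2 ** 2 - 3.0


rng = random.Random(1)
single = [residual_from_many_effects(1, rng) for _ in range(20000)]
many = [residual_from_many_effects(50, rng) for _ in range(20000)]

print(excess_kurtosis(single))  # clearly negative: one flat effect is not normal
print(excess_kurtosis(many))    # much closer to 0: fifty effects look nearly normal
```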

 
 

Similarly, you can adjust it the other way, so that very little of the variation is due to the Predictor variable, and most of it is this residual variation.

 
 

The above is just a tiny sample of what the DataClassroom Simulations can do, and how DataClassroom can help with an intuitive understanding of methods such as Linear Regression. As you will have noted, we didn’t even run the simulation here!

You can have a play with the above model by opening it here. Register for a free trial on DataClassroom if you don’t already have a login.

You can also read more about the simulation features in our User Guide here. Run some! Have fun!


Dan Temple