Identifying Collinear Relationships with Ames Housing Dataset in Python

Brianna Lytle
5 min read · Jan 24, 2023

In machine learning, it is important to cull down model inputs to avoid overfitting. Two ways to identify which variables to remove are (i) understanding how to deal with null values and (ii) using correlation to identify relationships between variables. Depending on the data, there are also opportunities to use contextual knowledge.

With 81 different columns, the Ames Housing dataset on Kaggle is great for practicing EDA and preprocessing. For this article, I will be referring to my exploration of the training dataset, loaded into a DataFrame called data.

My exploration can be found in my GitHub notebook.

The Ames Dataset starts with 80 different variables (43 categorical, 37 numerical).
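
The snippets below assume a few standard imports and that the Kaggle training file has been loaded into a DataFrame called data. Here is a minimal setup sketch; the train.csv path is an assumption, so adjust it to wherever your copy lives.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle training data; the file path is an assumption
data = pd.read_csv('train.csv')

# Count feature types, excluding the SalePrice target
features = data.drop(columns='SalePrice')
print(features.select_dtypes(include='object').shape[1], 'categorical')
print(features.select_dtypes(include=['int64', 'float64']).shape[1], 'numerical')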

Pt.1 — Identifying & Exploring Missing Data

The most common way to identify missing data is to use the .info() method, which lists each column, its number of non-null values, and its datatype. This is great for a quick glance, but with 80 columns it makes for a super-scrolling notebook, so let's visualize the share of missing values per column instead.

# Columns that contain at least one missing value
vars_missing = [var for var in data.columns if data[var].isnull().sum() > 0]

# Plot the fraction of missing values per column, highest first
data[vars_missing].isnull().mean().sort_values(ascending=False).plot.bar(figsize=(12, 4))
plt.ylabel('Percentage of missing data')
plt.title('Missing data from Ames Dataset')
plt.axhline(0.8, color='green', linestyle='--')   # reference line at 80% missing
plt.axhline(0.5, color='orange', linestyle='--')  # reference line at 50% missing
Bar chart of missing data in the Ames training dataset

We see there are about five columns missing roughly 50% or more of their values. That is too much missing data to fill in accurately without introducing bias into the model. Additionally, these features seem to matter little to a home's sale price. For example, the pool quality column implies there is a pool; the quality itself will not make much difference to the sale price. When a column is missing this much data and your understanding of the target variable says the feature adds little, don't think twice about it: just delete it!

Other missing variables seem to fall into groups relating to either the garage, basement, or masonry of a house.

The shortcut approaches for dealing with null values (sketched below):
- Dropping all columns with null values
- Filling in missing values with the column's mean, median, or mode
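
Neither shortcut is applied blindly here, but for reference, a minimal sketch of both (assuming the data DataFrame from above) looks like this:

# Shortcut 1: drop every column that has any missing values
data_dropped = data.dropna(axis=1)

# Shortcut 2: fill categorical columns with the mode and
# numeric columns with the median, column by column
data_filled = data.copy()
for col in data_filled.columns:
    if data_filled[col].dtype == 'object':
        data_filled[col] = data_filled[col].fillna(data_filled[col].mode()[0])
    else:
        data_filled[col] = data_filled[col].fillna(data_filled[col].median())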

Before taking any action on the missing values, we want to explore the relationships of these variables.

Pt.2 — Variables vs. SalePrice

Let's take a look at each numerical variable's relationship with SalePrice.

# Correlation matrix for the numerical variables
corr_mat = data.select_dtypes(['int64', 'float64']).corr()

# Correlation of each variable with SalePrice, strongest first
(corr_mat[['SalePrice']]
 .sort_values('SalePrice', ascending=False)
 .style.format(precision=3)
 .background_gradient(cmap="Greys"))

After running the code, a lot of the expected variables show the strongest relationships with SalePrice. Some of the stand-out relationships are GarageCars (0.64), GarageArea (0.62), TotalBsmtSF (0.61), and 1stFlrSF (0.6). (Hmmmmm… my spidey senses are telling me this may have an impact on how we deal with the rest of our missing variables.)

Pt.3 — Variables vs. Variables

Heatmaps can help us identify collinear relationships and simplify which variables to include in our model. They also point to where we can begin feature engineering.

# Mask the upper triangle so each variable pair appears only once
mask = np.triu(np.ones_like(corr_mat, dtype=bool))

fig, ax = plt.subplots(figsize=(20, 14))
sns.heatmap(corr_mat, mask=mask,
            annot=True, annot_kws={'size': 8}, fmt=".2f",
            cmap="PiYG")
plt.xticks(fontsize=7)
plt.yticks(fontsize=7)
plt.show()
Heatmap of Ames Dataset Variables

Collinearity can reduce the precision of a model's estimated coefficients and make the model harder to interpret. While there are a lot of very detailed variables in this dataset that help describe different types of houses, many of them have obvious overlaps.

In this heatmap, we can see that “TotalBsmtSF” and “1stFlrSF” have a correlation of 0.82. Many of the garage-related variables also have strong relationships with each other. This can be an indication of multicollinearity. In both cases, we can assume the variables give almost the same information.
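
To pull these pairs out programmatically rather than reading them off the heatmap, a small helper like the one below (not part of the original notebook; the 0.8 threshold is an arbitrary choice) lists every pair of numerical variables whose absolute correlation exceeds a threshold:

def strong_pairs(corr, threshold=0.8):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# Look only at feature-to-feature pairs, excluding the target
print(strong_pairs(corr_mat.drop(columns='SalePrice', index='SalePrice')))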

Pt.4 — Removing & Filling in Columns

  • Columns with ~50%+ missing data — As reasoned above, and as the correlation matrix exercise confirms, none of these features is significant to homebuyers anyway, so we drop them.
  • Garage Features — We know that all of the Garage variables are highly correlated with each other. We are going to remove all of the GarageX features except for “GarageCars” which has the highest correlation with SalePrice.
  • Basement Features — “TotalBsmtSF” is clearly the primary concern for home buyers. We can remove all of the other features to avoid collinear variables. This also helps alleviate our missing data concerns.
  • Masonry Veneer Features — The masonry veneer area has a 0.41 correlation with YearBuilt. We can assume that if we were to dummy out the masonry veneer type, it would also have a strong relationship with YearBuilt. Let’s just lose them.
  • LotFrontage — About 18% of its data is missing, so we could consider filling it in. While it has a somewhat strong correlation with SalePrice (0.35), it has an even stronger correlation with LotArea (0.43), which we are keeping. That overlap makes it worth removing instead.
  • 1stFlrSF — Has a 0.82 correlation with TotalBsmtSF. This can be a situation of multicollinearity. If we think about these two, we can conclude that they give the same information.
  • TotRmsAbvGrd — Has a 0.83 correlation with GrLivArea. We’re going to drop this variable because of this relationship. GrLivArea has a stronger overall correlation with SalePrice. In addition, it’s very easy to label some things as a “Room” when it can barely fit a desk+bed.
  • Electrical — Has a single missing value; drop that observation or fill it in with the column mode.
  • Other — We will also remove the following columns, as their correlation with SalePrice is between -0.05 and 0.05: Id, PoolArea, MoSold, MiscVal, LowQualFinSF. (A code sketch of these removal and fill steps follows this list.)
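
As a reference, here is a minimal sketch of these clean-up steps. The exact lists of high-missingness, garage, and basement column names are my assumptions about which Kaggle columns fall into the groups described above, and they may not match the full 32 columns dropped in the notebook; double-check them against the data dictionary (and the notebook) before running.

# Columns with ~50%+ missing values (assumed names from the Kaggle data dictionary)
mostly_missing = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']

# Garage columns other than GarageCars, and basement columns other than TotalBsmtSF
garage_drop = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageArea',
               'GarageQual', 'GarageCond']
basement_drop = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF']

# Masonry veneer, collinear, and weakly correlated columns
other_drop = ['MasVnrType', 'MasVnrArea', 'LotFrontage', '1stFlrSF', 'TotRmsAbvGrd',
              'Id', 'PoolArea', 'MoSold', 'MiscVal', 'LowQualFinSF']

# errors='ignore' skips any name that is not present in this copy of the data
data = data.drop(columns=mostly_missing + garage_drop + basement_drop + other_drop,
                 errors='ignore')

# Electrical: fill the single missing value with the column mode
data['Electrical'] = data['Electrical'].fillna(data['Electrical'].mode()[0])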

Pt.5 — Summary and Next Steps

After this exploration, we dropped 32 columns. We are now left with 28 categorical and 20 numerical variables to help us predict Ames housing prices. Other preprocessing steps to consider before modeling include:

  • Feature engineering — Creating new variables by combining/consolidating current features (e.g., TotalBath = FullBath + HalfBath + BsmtBath; see the sketch after this list)
  • Categorical variables exploration — Discover relationships of categorical variables with SalePrice and convert them to indicator variables.
  • Outliers — Remove any observations in the dataset that are skewing important variables
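
As one illustration of the feature-engineering idea, here is a sketch of a combined bathroom count. BsmtFullBath and BsmtHalfBath are the Kaggle column names I am assuming stand behind the "BsmtBath" shorthand above.

# One possible engineered feature: a single bathroom count
data['TotalBath'] = (data['FullBath'] + data['HalfBath']
                     + data['BsmtFullBath'] + data['BsmtHalfBath'])

A common variant weights the half-bath columns by 0.5 instead of counting them as full bathrooms.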

Additional Notes:

Leading up to modeling, ensure you remove the same columns from both the training and testing datasets. For Ames, after you perform these actions on the test dataset, you may find missing values in columns where the training dataset didn’t have any nulls. I chose to resolve this by filling in those missing values with the column mode. Review my notebook for the code.
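
A minimal sketch of that alignment step, assuming the test set is loaded into a DataFrame called test and data already has the columns dropped above (the notebook may handle this differently):

# Keep only the feature columns that survived the training-set clean-up
feature_cols = [col for col in data.columns if col != 'SalePrice']
test = test[feature_cols]

# Fill any remaining test-set nulls with the training column's mode
# (the notebook may instead use the test column's own mode)
for col in feature_cols:
    if test[col].isnull().any():
        test[col] = test[col].fillna(data[col].mode()[0])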
