
KAGGLE COMPETITION - PREDICTING HOUSING PRICES USING ADVANCED REGRESSION TECHNIQUES

  • Writer: Amar Haiqal Che Hussin
  • Nov 28, 2021
  • 2 min read

Overview


This is my second time participating in a Kaggle competition; this case study is about predicting house prices based on the given features. It is what Kaggle calls a "Getting Started" competition, suitable for those who want to test the waters in the machine learning world. Even so, I found it rather tough, especially the data pre-processing part.


The flowchart of how I executed the task is shown below:


[Image: flowchart of the workflow]

Execution



I developed the machine learning models and performed all the pre-processing in Google Colab. The first thing I needed to do was import my Kaggle API token and download the dataset from Kaggle into my workspace.
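The setup can be sketched as shell commands in a Colab cell — a sketch, assuming the API token file `kaggle.json` has already been uploaded to the session; the competition slug below is the one for this competition:

```shell
# Install the Kaggle CLI and register the API token
# (kaggle.json must already be uploaded to the Colab session).
pip install -q kaggle
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the competition data into the workspace.
kaggle competitions download -c house-prices-advanced-regression-techniques
unzip -o house-prices-advanced-regression-techniques.zip
```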



[Image: importing the Kaggle token and dataset into Colab]

Once the dataset has been imported, I proceed to pre-process the data, starting with the training data. I can use a simple command such as


df.info()

to see how many non-null values there are in every column. But since there are too many columns to deal with, it is much easier to illustrate them as a heatmap, as shown below. The white splotches indicate missing data. Some columns are missing only a few cells, but you may notice several columns with too much missing data. These columns need to be removed.
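That kind of missing-data heatmap can be sketched with seaborn — a minimal sketch, assuming the training data is in a DataFrame `df`; a tiny toy frame stands in for it here:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the training data; NaNs mark missing cells.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 70.0, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# df.isnull() is a boolean frame; True cells render as the
# contrasting "splotches" that reveal where data is missing.
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
plt.savefig("missing_data_heatmap.png")
```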

[Image: heatmap of missing values in the training data]

My intuition says that any feature with more than 40% missing data is unacceptable, so I find which columns belong to this category.
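The 40% rule can be expressed in a couple of lines — a sketch, with toy data standing in for the real frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 70.0, np.nan, 60.0],  # 40% missing: kept
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],   # 80% missing: dropped
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Fraction of missing values per column.
missing_frac = df.isnull().mean()

# Columns with strictly more than 40% missing data are removed.
to_drop = missing_frac[missing_frac > 0.40].index.tolist()
df = df.drop(columns=to_drop)
print(to_drop)  # → ['PoolQC']
```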


[Image: columns with more than 40% missing data]

Visualising the data as a heatmap a second time:


[Image: heatmap after dropping the high-missing columns]

A bit better, but still not enough.


Next, I replace all NaNs in numerical columns with the mean, and those in categorical columns with the mode. However, for some reason this still did not work for every column, so I handled the problematic features one by one until the data was fully clean, as shown below:
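The mean/mode imputation described above can be sketched like this (toy data stands in for the real frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 70.0, 75.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Numerical NaNs -> column mean.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical NaNs -> column mode (the most frequent value).
cat_cols = df.select_dtypes(exclude="number").columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isnull().sum().sum())  # → 0
```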



[Image: heatmap of the fully cleaned data]

Now that's more like it. We can do the same with the test data. The next step is to encode the categorical features. The thing is, some categorical values may be present in the test set but not in the training set, or vice versa.


For instance:



Type_in_Training_set = ["a" , "b", "d"]
Type_in_Test_set     = ["a", "b", "c", "d"]

Once we have concatenated the two sets, we can encode the categorical features by introducing dummy variables.
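One way to sketch this with pandas, using the `Type` example above: `get_dummies` on the concatenated frame sees every category, so both halves end up with identical columns.

```python
import pandas as pd

train = pd.DataFrame({"Type": ["a", "b", "d"]})
test = pd.DataFrame({"Type": ["a", "b", "c", "d"]})

# Concatenate with keys so the two halves can be split apart again.
combined = pd.concat([train, test], keys=["train", "test"])

# One dummy column per category seen across BOTH sets.
encoded = pd.get_dummies(combined, columns=["Type"])

train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(train_enc.columns))  # → ['Type_a', 'Type_b', 'Type_c', 'Type_d']
```

Without the concatenation step, encoding the sets separately would give the training frame no `Type_c` column and the model's inputs would not line up at prediction time.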


[Image: dummy-encoded categorical features]

Now we can split the dataset back into its original constituents. At this point the dataset is good to go, but I also wanted to observe the correlation between each feature and SalePrice, and categorise the features under highest and lowest correlation.
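A sketch of that correlation split — the 0.5 cutoff here is my illustrative choice, not necessarily the one used in the notebook, and toy data again stands in for the real frame:

```python
import pandas as pd

df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "YrSold": [2008, 2010, 2006, 2009, 2007],
    "SalePrice": [100000, 120000, 150000, 180000, 220000],
})

# Correlation of every numeric feature with the target.
corr = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
corr = corr.sort_values(ascending=False)

# Split into high- and low-correlation groups (cutoff at |r| = 0.5).
high_corr = corr[corr.abs() >= 0.5]
low_corr = corr[corr.abs() < 0.5]
print(high_corr.index.tolist(), low_corr.index.tolist())
```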



[Image: features with the highest and lowest correlation to SalePrice]

I wanted to do a collinearity analysis, but that will be in another submission. So let's proceed with model development. The model is trained with 10-fold cross-validation using RMSE as the metric. The score for each fold is shown below.
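The evaluation loop can be sketched with scikit-learn — Ridge stands in here for whichever regressor the notebook actually used, and synthetic data replaces the real training set:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_regression(n_samples=200, n_features=10,
                       noise=10.0, random_state=42)

model = Ridge(alpha=1.0)

# scikit-learn maximises scores, so RMSE comes back negated.
scores = cross_val_score(model, X, y, cv=10,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print(rmse_per_fold.round(2), rmse_per_fold.mean().round(2))
```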

[Image: RMSE for each of the 10 folds]

I think there is still something I need to adjust, but for now, let's go ahead and predict on the test set right away.
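The final step can be sketched as fitting on the full training set, predicting the test set, and writing a submission CSV. The `Id`/`SalePrice` columns follow the competition's sample submission; the model, data, and placeholder Ids below are synthetic stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-ins for the preprocessed train/test splits.
X_train, y_train = make_regression(n_samples=200, n_features=10,
                                   noise=10.0, random_state=42)
X_test, _ = make_regression(n_samples=50, n_features=10,
                            noise=10.0, random_state=7)

# Fit on the full training set, then predict the held-out test set.
model = Ridge(alpha=1.0).fit(X_train, y_train)
preds = model.predict(X_test)

# Kaggle expects a CSV with an Id column and the predicted SalePrice
# (placeholder Ids here; the real ones come from the test file).
submission = pd.DataFrame({"Id": np.arange(1, len(preds) + 1),
                           "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
```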


[Image: predictions on the test set]

After submission, I was ranked 2542 with a test-set RMSE of 0.14766.


[Image: leaderboard result]

I think this is a good start, and I can't wait to see how the model performs once hyperparameter tuning is done.

 
 
 

©2021 by Amar Haiqal
