import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("../DATA/AMES_Final_DF.csv")
df = df.drop('Unnamed: 0', axis=1)  # drop the leftover index column from the CSV export
df.head()
| | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | ... | Sale Type_ConLw | Sale Type_New | Sale Type_Oth | Sale Type_VWD | Sale Type_WD | Sale Condition_AdjLand | Sale Condition_Alloca | Sale Condition_Family | Sale Condition_Normal | Sale Condition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.0 | 31770 | 6 | 5 | 1960 | 1960 | 112.0 | 639.0 | 0.0 | 441.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | 144.0 | 270.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | 0.0 | 406.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 93.0 | 11160 | 7 | 5 | 1968 | 1968 | 0.0 | 1065.0 | 0.0 | 1045.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | 0.0 | 137.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |

5 rows × 274 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 274 entries, Lot Frontage to Sale Condition_Partial
dtypes: float64(11), int64(263)
memory usage: 6.1 MB
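As a quick sanity check (an extra step, not part of the original tasks), we can confirm the prepared dataset contains no missing values, since ElasticNet cannot handle NaNs:
df.isnull().sum().sum()  # total count of missing values across all columns; should be 0 for this pre-cleaned file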
TASK: The label we are trying to predict is the SalePrice column. Separate out the data into X features and y labels
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
TASK: Use scikit-learn to split up X and y into a training set and test set. Since we will later be using a Grid Search strategy, set your test proportion to 10%. To get the same data split as the solutions notebook, you can specify random_state = 101
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
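As a quick check (an extra step, not in the original notebook), we can confirm the split proportions:
len(X_train), len(X_test)  # roughly a 90/10 split of the 2925 rows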
TASK: The dataset's features have a variety of scales and units. For optimal regression performance, scale the X features. Take careful note of what to use for .fit() vs. what to use for .transform()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)  # fit on the training data only, to avoid leaking test-set statistics
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
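To confirm the scaling behaved as expected (an optional check, not part of the task), the scaled training features should have mean ≈ 0 and standard deviation ≈ 1; the test features will deviate slightly, because the scaler never saw them:
X_train.mean(), X_train.std()  # approximately 0 and 1 by construction
X_test.mean(), X_test.std()    # close to, but not exactly, 0 and 1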
TASK: We will use an Elastic Net model. Create an instance of a default ElasticNet model with scikit-learn
from sklearn.linear_model import ElasticNet
elastic_model = ElasticNet(max_iter=100000)  # raise max_iter so the coordinate-descent solver can converge
TASK: The Elastic Net model has two main parameters, alpha and the L1 ratio. Create a dictionary parameter grid of values for the ElasticNet. Feel free to play around with these values; keep in mind that you may not match up exactly with the solution choices
# l1_ratio = 0 is a pure L2 (ridge) penalty
# l1_ratio = 1 is a pure L1 (lasso) penalty
# values in between blend the two; the grid below skews toward 1 because we expect an L1-dominant penalty to work well
pars = {'alpha': [1, 100], 'l1_ratio': [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]}
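For reference, the objective scikit-learn's ElasticNet minimizes is shown below; it makes explicit how alpha scales the overall penalty strength while l1_ratio splits it between the L1 and L2 terms:

$$\min_w \; \frac{1}{2\,n_{\text{samples}}} \lVert y - Xw \rVert_2^2 \;+\; \alpha \cdot \text{l1\_ratio} \cdot \lVert w \rVert_1 \;+\; \frac{\alpha\,(1 - \text{l1\_ratio})}{2}\, \lVert w \rVert_2^2$$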
TASK: Using scikit-learn, create a GridSearchCV object and run a grid search for the best parameters for your model based on your scaled training data. You may receive warnings for certain parameter combinations; these are convergence warnings from the solver, which is why max_iter was increased when the model was created
from sklearn.model_selection import GridSearchCV
grid_model = GridSearchCV(elastic_model, pars, scoring='neg_mean_squared_error')
grid_model.fit(X_train, y_train)
GridSearchCV(estimator=ElasticNet(max_iter=100000), param_grid={'alpha': [1, 100], 'l1_ratio': [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1]}, scoring='neg_mean_squared_error')
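By default, GridSearchCV runs 5-fold cross-validation over every combination in the grid (2 alphas × 7 l1_ratios = 14 combinations, each fit 5 times). To see how each combination scored (an extra step, not required by the task), the cv_results_ attribute can be loaded into a DataFrame:
cv_results = pd.DataFrame(grid_model.cv_results_)
# scoring is negative MSE, so values closer to zero are better
cv_results[['param_alpha', 'param_l1_ratio', 'mean_test_score']]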
TASK: Display the best combination of parameters for your model
grid_model.best_params_
{'alpha': 100, 'l1_ratio': 1}
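Because the best l1_ratio is 1, the selected model is effectively a pure lasso, so many coefficients may be driven to exactly zero. As an optional check (not part of the original task), we can count how many of the 273 features retain non-zero coefficients:
best_model = grid_model.best_estimator_  # GridSearchCV refits the best model on all training data by default
np.sum(best_model.coef_ != 0)            # number of features with non-zero coefficients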
TASK: Evaluate your model's performance on the unseen 10% scaled test set. In the solutions notebook we achieved an MAE of \$14,149 and an RMSE of \$20,532
pred = grid_model.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error
MAE = mean_absolute_error(y_test, pred)
RMSE = np.sqrt(mean_squared_error(y_test, pred))
MAE
14195.354900562172
RMSE
20558.508566893164
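As an optional final step (not part of the original task), it helps to judge these errors against the scale of the target variable:
y_test.mean()        # typical sale price in the test set, for scale
MAE / y_test.mean()  # MAE as a fraction of the mean sale price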