Since KNN is a simple model to fit for classification, let's focus on using a Pipeline together with the GridSearchCV tool, since these skills can be generalized to any model.
Sonar (sound navigation and ranging) is a technique that uses sound propagation (usually underwater, as in submarine navigation) to navigate, communicate with, or detect objects on or under the surface of the water, such as other vessels.
The data set contains the response metrics for 60 separate sonar frequencies sent out against a known mine field (and known rocks). These frequencies are then labeled with the known object they were beaming the sound at (either a rock or a mine).
Our main goal is to create a machine learning model capable of telling the difference between a rock and a mine based on the responses of the 60 separate sonar frequencies.
Data Source: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
TASK: Run the cells below to load the data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../DATA/sonar.all-data.csv')
df.head()
|   | Freq_1 | Freq_2 | Freq_3 | Freq_4 | Freq_5 | Freq_6 | Freq_7 | Freq_8 | Freq_9 | Freq_10 | ... | Freq_52 | Freq_53 | Freq_54 | Freq_55 | Freq_56 | Freq_57 | Freq_58 | Freq_59 | Freq_60 | Label |
|---|--------|--------|--------|--------|--------|--------|--------|--------|--------|---------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|-------|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.0150 | 0.0085 | 0.0073 | 0.0050 | 0.0044 | 0.0040 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.0110 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |
5 rows × 61 columns
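Before modeling, it's worth a quick look at how balanced the two classes are. A minimal check (an addition to the original exercise, using only the DataFrame loaded above):

# count how many rocks (R) vs. mines (M) the dataset contains
df['Label'].value_counts()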
TASK: Create a heatmap of the correlation between the different frequency responses.
# CODE HERE
# heatmap of pairwise correlations between the 60 frequency columns
# (the non-numeric Label column is ignored by df.corr())
sns.heatmap(df.corr())
TASK: What are the top 5 frequencies most correlated with the target label?
Note: You may need to map the label to 0s and 1s.
Additional Note: We're looking for absolute correlation values.
# map the string label to a numeric target so correlations can be computed
df['Target'] = df['Label'].map({'R': 0, 'M': 1})
# take absolute correlations: a strong negative correlation is just as informative
df.corr()['Target'].abs().sort_values(ascending=False).head(6)
Target     1.000000
Freq_11    0.432855
Freq_12    0.392245
Freq_49    0.351312
Freq_10    0.341142
Freq_45    0.339406
Name: Target, dtype: float64
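To see how the informative bands are spread across the whole spectrum, here is a small optional sketch (not part of the original exercise) that plots every frequency's absolute correlation with the target:

# absolute correlation of each frequency band with the target, in band order
corr_with_target = df.corr()['Target'].abs().drop('Target')

plt.figure(figsize=(12, 4))
corr_with_target.plot(kind='bar')
plt.xlabel('Frequency band')
plt.ylabel('Absolute correlation with Target')
plt.tight_layout()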
Our approach here will be to run cross-validation on 90% of the dataset, and then judge the results on a held-out final test set of the remaining 10%.
TASK: Split the data into features and labels, and then split into a training set and test set, with 90% for Cross-Validation training, and 10% for a final test set.
Note: The solution uses random_state=42
# CODE HERE
# drop BOTH label columns: the numeric 'Target' column created above would
# otherwise leak the answer into the features
X = df.drop(['Label', 'Target'], axis=1)
y = df['Label']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
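The dataset is small (208 rows), so a 10% test set holds only 21 samples. If you want the rock/mine ratio preserved in that split, one option is to stratify; this is a variation on the course solution (the outputs shown further down come from the unstratified split above):

# optional: preserve the R/M class ratio in the small test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=42, stratify=y)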
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
scaler = StandardScaler()

# (step_name, estimator) tuples for the pipeline
operations = [('scaler', scaler), ('knn', knn)]
TASK: Create a Pipeline that contains both a StandardScaler and a KNN model.
# CODE HERE
from sklearn.pipeline import Pipeline
pipe = Pipeline(operations)
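Before grid-searching, it can help to sanity-check the pipeline at its default k=5. A minimal sketch using cross_val_score (an addition, not part of the exercise):

from sklearn.model_selection import cross_val_score

# 5-fold CV accuracy for the default pipeline; the scaler is refit inside
# each fold, so no information leaks between folds
scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy')
print(scores.mean())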
TASK: Perform a grid-search with the pipeline to test various values of k and report back the best performing parameters.
# CODE HERE
from sklearn.model_selection import GridSearchCV
# every integer k from 1 to 29
k_values = list(range(1, 30))

# pipeline parameters are addressed as <step_name>__<param_name>
param_grid = {'knn__n_neighbors': k_values}

grid_model = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_model.fit(x_train,y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                                              11, 12, 13, 14, 15, 16, 17, 18,
                                              19, 20, 21, 22, 23, 24, 25, 26,
                                              27, 28, 29]},
             scoring='accuracy')
grid_model.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=1))])
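Besides the best estimator, the fitted search exposes the winning parameters and their mean cross-validated accuracy directly, which is usually what you'd report back:

# best k found by the search and its mean CV accuracy
print(grid_model.best_params_)
print(grid_model.best_score_)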
(HARD) TASK: Using the .cv_results_ dictionary, see if you can create a plot of the mean test scores per k value.
# mean cross-validated accuracy for each k, in grid order
mean_scores = grid_model.cv_results_['mean_test_score']
mean_scores
array([0.92532006, 0.81280228, 0.84523471, 0.80768137, 0.84025605,
       0.81294452, 0.82944523, 0.80227596, 0.81322902, 0.79687055,
       0.82930299, 0.78634424, 0.83485064, 0.79743954, 0.80782361,
       0.80241821, 0.82375533, 0.80241821, 0.80782361, 0.78648649,
       0.80782361, 0.79231863, 0.79743954, 0.78150782, 0.77055477,
       0.7544808 , 0.77596017, 0.77055477, 0.8027027 ])
plt.plot(k_values, mean_scores)
plt.xlabel('k (n_neighbors)')
plt.ylabel('Mean CV accuracy')
TASK: Using the grid classifier object from the previous step, get a final performance classification report and confusion matrix.
# CODE HERE
# plot_confusion_matrix is deprecated since scikit-learn 1.0;
# ConfusionMatrixDisplay is its replacement
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
pred = grid_model.predict(x_test)
ConfusionMatrixDisplay.from_estimator(grid_model, x_test, y_test)
print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           M       1.00      0.92      0.96        13
           R       0.89      1.00      0.94         8

    accuracy                           0.95        21
   macro avg       0.94      0.96      0.95        21
weighted avg       0.96      0.95      0.95        21
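As promised at the start, this Pipeline + GridSearchCV pattern carries over to any estimator unchanged. As a sketch (not part of the exercise; the SVC parameter values here are arbitrary illustration choices), here is the same workflow with a support vector classifier:

from sklearn.svm import SVC

# same pattern, different model: only the step name and param_grid change
svc_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
svc_param_grid = {'svc__C': [0.1, 1, 10],
                  'svc__gamma': ['scale', 'auto']}
svc_grid = GridSearchCV(svc_pipe, svc_param_grid, cv=5, scoring='accuracy')
svc_grid.fit(x_train, y_train)
print(svc_grid.best_params_)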