Data Engineering: A Feature Selection Example with the Iris Dataset

Justin Zhang · Nerd For Tech · Apr 16, 2021


Introduction

A popular rule of thumb for the ratio of data engineers to data scientists on a team is 8:2. Of course there is no fixed ‘best’ ratio; it depends on a company’s setup, developer availability, and so on. But the ratio gives a rough sense of how the workload splits between two categories: data engineering versus machine learning algorithm research. In practice, a data engineering job done well can greatly benefit the machine learning algorithm, and ends up with faster feedback and cost savings.

In this article, a demonstration is given to show how feature selection can benefit the overall machine learning process.

The following content is limited to supervised classification tasks. The idea may apply to other categories of machine learning tasks, but those may require different data engineering and feature analysis processes.

Data Engineering

From a machine learning perspective, data engineering involves dataset collection, dataset cleansing/transformation, feature selection, and feature transformation. Here we focus on feature selection to show how it benefits a machine learning process.
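These stages are usually chained together in practice. As a rough preview of the components this article uses later, a minimal scikit-learn sketch could look like the following (the stage names ‘scale’, ‘select’, ‘model’ are just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

# chain transformation -> selection -> model into one estimator
pipe = Pipeline([
    ('scale', MinMaxScaler()),                # feature transformation
    ('select', SelectKBest(f_classif, k=3)),  # feature selection
    ('model', GaussianNB()),                  # the learning algorithm
])
# pipe.fit(X, y) would then run all three stages in order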

Feature Analysis

We are using the famous Iris dataset in our example. It is already well-formed, clean, and balanced.

import pandas as pd
from sklearn import datasets

# load data into a Bunch (a dict-derived class)
iris = datasets.load_iris()
target = iris.target

# convert the feature matrix to a DataFrame for processing
iris = pd.DataFrame(iris.data, columns=iris.feature_names)

Plot a histogram of the target to make sure the data is balanced. It is in our case: the same 50 samples in each class.

import seaborn as sns
import matplotlib.pyplot as plt

plt.hist(target)

Check its min, max, and other basic statistics to make sure we don’t have outliers:

iris.describe()

Now let’s normalize it and visualize each feature’s correlation with the classes.

# to normalize the dataset, we use the handy MinMaxScaler
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(iris)
iris_norm = scaler.transform(iris)

# visualize features against the target
iris_norm = pd.DataFrame(iris_norm, columns=iris.columns)
iris_norm_ = pd.DataFrame(np.hstack((iris_norm, target[:, np.newaxis])),
                          columns=iris.columns.tolist() + ['class'])
sns.pairplot(iris_norm_, hue='class', diag_kind='hist')

As we can see, every feature pair separates the three classes quite clearly, except the sepal width/sepal length pair, where class 1 and class 2 are tangled together in the chart.

Furthermore, let’s check the covariances among the features and the class:

# manually verify the covariance among features and the class
iris_cov = iris_norm_.cov()
sns.heatmap(iris_cov, annot = True, cbar = False)
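Note that covariance is scale-dependent even after min-max normalization, since each feature keeps its own spread. If you prefer scale-free numbers, Pearson correlation gives the same kind of ranking; a quick alternative sketch using pandas’ corr():

# correlation is scale-free and often easier to read than covariance
iris_corr = iris_norm_.corr()
sns.heatmap(iris_corr, annot=True, cbar=False)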

Feature Selection

Ideally we want a feature which is a) more relevant to the class and b) less relevant to the other features. a) is the most important factor, because a feature cannot contribute to an algorithm if it is totally irrelevant to the class. b) can make the process more efficient, but that is another topic beyond this article.

From the covariance heatmap, we can see ‘sepal width’ is the least relevant to the class. This explains why classes 1 and 2 are tangled in the pairplot chart from the previous section.

Let’s use sklearn’s SelectKBest model to select the best 3 features. Since all 4 features are continuous, we use an F-test to do this. Our goal is to remove the ‘sepal width’ feature.

from sklearn.feature_selection import SelectKBest, f_classif

bestfeatures = SelectKBest(score_func=f_classif, k=3)
iris_trim = bestfeatures.fit_transform(iris_norm, target)
print(bestfeatures.scores_)
print(bestfeatures.pvalues_)
print(iris_trim.shape)

As you can see, the second feature has the lowest score and the largest p-value. The resulting dataset is of shape 150 x 3; the second feature (sepal width) was removed.
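To double-check which columns survived, SelectKBest exposes a boolean mask through get_support(), which maps straight back to the column names:

# map the selector's boolean mask back to the original column names
selected = iris_norm.columns[bestfeatures.get_support()]
print(selected.tolist())  # everything except 'sepal width (cm)'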

Let’s see the pairplot again:
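To reproduce that chart from the trimmed array, we can rebuild a labeled DataFrame first (a sketch; trim_cols simply drops the removed column name):

# rebuild a labeled DataFrame from the trimmed array, then pairplot it
trim_cols = [c for c in iris_norm.columns if c != 'sepal width (cm)']
iris_trim_ = pd.DataFrame(np.hstack((iris_trim, target[:, np.newaxis])),
                          columns=trim_cols + ['class'])
sns.pairplot(iris_trim_, hue='class', diag_kind='hist')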

We can also draw a 3D chart of the 3 features for a more intuitive view.

from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(8,8))
ax = plt.axes(projection='3d')
ax.scatter3D(iris_trim[:, 0], iris_trim[:, 1], iris_trim[:, 2], c = target, cmap='Accent', marker = '>')
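Optionally, labeling the axes makes the 3D chart easier to read. Since SelectKBest preserves the original column order, the three remaining features should be sepal length, petal length, and petal width:

# label the axes with the remaining feature names
ax.set_xlabel('sepal length (cm)')
ax.set_ylabel('petal length (cm)')
ax.set_zlabel('petal width (cm)')
plt.show()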

Validation

Now let’s compare the 4-feature case and the 3-feature case. Define a training and validation function first, then prepare both datasets.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def train_and_validate(X_train, X_test, y_train, y_test):
    # train a Gaussian Naive Bayes model and plot its confusion matrix
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_calc = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    #print(y_prob)
    mat = confusion_matrix(y_test, y_calc)
    sns.heatmap(mat.T, annot=True, cbar=False)

# prepare both datasets: the full 4-feature set, and 3 features without 'sepal width'
X_train4, X_test4, y_train, y_test = train_test_split(iris_norm, target, test_size=0.10, stratify=None, random_state=0)
X_train3, X_test3 = X_train4.drop(['sepal width (cm)'], axis=1), X_test4.drop(['sepal width (cm)'], axis=1)

Run and compare
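With the function and splits defined above, the comparison presumably boils down to two calls, each rendering one confusion-matrix heatmap:

# 4-feature baseline
train_and_validate(X_train4, X_test4, y_train, y_test)
# 3-feature set with 'sepal width (cm)' removed
train_and_validate(X_train3, X_test3, y_train, y_test)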

As we can see, the reduced feature set has a better result. In the confusion matrices, the 3-feature dataset yields 100% accuracy, while the 4-feature model misses one sample.

I changed the random_state to generate different splits of the data and repeated the process; the 3-feature dataset performs better than, or at least as well as, the 4-feature dataset.
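That repetition is easy to automate. A sketch that loops over several seeds and compares plain accuracy (accuracy_score stands in here for the visual confusion-matrix check):

from sklearn.metrics import accuracy_score

for seed in range(10):
    # re-split the data with a different seed each time
    X_tr4, X_te4, y_tr, y_te = train_test_split(iris_norm, target, test_size=0.10, random_state=seed)
    X_tr3, X_te3 = X_tr4.drop(['sepal width (cm)'], axis=1), X_te4.drop(['sepal width (cm)'], axis=1)
    acc4 = accuracy_score(y_te, GaussianNB().fit(X_tr4, y_tr).predict(X_te4))
    acc3 = accuracy_score(y_te, GaussianNB().fit(X_tr3, y_tr).predict(X_te3))
    print(f'seed={seed}: 4 features {acc4:.2f}, 3 features {acc3:.2f}')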

Conclusion

A well-prepared dataset benefits the machine learning process. A properly selected feature set not only saves model training time and storage space, but also leads to more accurate results.


Justin Zhang · Nerd For Tech
Data Driven Application Architect, tech lead, full stack developer for 15+ years