Machine Learning: Feature Selection and Extraction with Examples

Justin Zhang · Nerd For Tech · Apr 20, 2021

Introduction

It is always worth putting time and effort into understanding the dataset you are dealing with. Selecting a machine learning algorithm without a deep understanding of the dataset is working blindfolded, and very likely ends in frustration and wasted time.

Dataset cleansing, feature selection and feature extraction are the steps to achieve this understanding.

Feature Selection

Machine learning is about extracting target-related information from a given feature set. Given a feature dataset and a target, only the features that contribute to the target are relevant to the machine learning process. Irrelevant features not only waste computing resources but also introduce unnecessary noise. An example of feature selection is described in this article.

Correlation analysis is key to eliminating irrelevant features. Here are the criteria:

  • A feature should not be constant; it should have a certain level of variance.
  • A feature should be correlated with the target, or it contributes nothing to the target estimation.
  • Features should not be highly correlated with each other, or one of them offers no information beyond the others and only adds sampling noise.

There are a couple of tools in the sklearn module for this; please refer to this page for more details:

https://scikit-learn.org/stable/modules/feature_selection.html
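As a hedged sketch of the first two criteria (the variance threshold and k below are illustrative values, not from this article), near-constant features can be dropped with VarianceThreshold and target correlation can be scored with SelectKBest:

from sklearn.datasets import load_digits
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_digits(return_X_y=True)

# Criterion 1: drop near-constant features
var_filter = VarianceThreshold(threshold=0.01)   # threshold is an illustrative choice
X_var = var_filter.fit_transform(X)

# Criterion 2: keep the features most related to the target (ANOVA F-score)
kbest = SelectKBest(score_func=f_classif, k=20)  # k=20 is an illustrative choice
X_kbest = kbest.fit_transform(X_var, y)

print(X.shape, X_var.shape, X_kbest.shape)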

In addition, Lasso regression can also eliminate irrelevant features during the model training process, but it is limited to linear estimations.
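For example, a minimal sketch of Lasso-based selection on a regression dataset might look like this (the alpha value and the use of the diabetes dataset are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)

# The L1 penalty drives the coefficients of irrelevant features to exactly zero
lasso = Lasso(alpha=0.1)               # alpha is an illustrative choice
selector = SelectFromModel(lasso).fit(X, y)

print("kept features:", selector.get_support())
print(selector.transform(X).shape)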

Feature Extraction

This is one step beyond feature selection. To make machine learning effective and responsive, we want a smaller feature dimension space, with each feature contributing more to the estimation target. Feature extraction is a transformation that produces a new feature set which:

  • Has a smaller dimension
  • Has maximum correlation with the target

For linear systems, PCA and ICA are typical algorithms; for nonlinear systems, a variety of manifold-based algorithms and kernelized variants (such as kernel ICA) are used. Text and image datasets often have a large feature dimension with highly correlated features, where deep-learning-based embeddings or CNN/RNN-based algorithms fit well.
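As a minimal sketch of the linear case, PCA from sklearn projects the 64 digit features onto a handful of components (n_components = 10 below is an illustrative choice):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Project the 64-dimensional pixel features onto a few principal components
pca = PCA(n_components=10)             # illustrative choice
X_pca = pca.fit_transform(X)

print(X_pca.shape)                     # (1797, 10)
print(pca.explained_variance_ratio_.sum())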

One thing worth mentioning: in most cases, feature extraction is part of the core machine learning itself. Whether to run feature extraction in a separate processing pipeline depends on the data collection, storage, and processing infrastructure, as well as on engineering and business requirements.

Manifold Example

We use the famous digits dataset from the sklearn module in this example.

import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
# Class distribution of the targets
plt.hist(digits.target, histtype='barstacked', rwidth=0.8)
# Inspect what the dataset bunch contains
for key in digits.keys():
    print(key)
print(digits.data.shape)
print(digits.feature_names)

Let's normalize it and apply the GaussianNB estimator.

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Scale every feature into the [0, 1] range
norm_model = MinMaxScaler()
data_norm = norm_model.fit_transform(digits.data)

X_train, X_test, y_train, y_test = train_test_split(
    data_norm, digits.target, train_size=0.7, random_state=41)

# Baseline: Gaussian naive Bayes on the full 64 features
bay = GaussianNB()
bay.fit(X_train, y_train)
y_model1 = bay.predict(X_test)
print(accuracy_score(y_test, y_model1))

mat1 = confusion_matrix(y_test, y_model1)
sns.heatmap(mat1, annot=True, cbar=False)

We get an 80% accuracy rate, which is not bad for this simple and fast GaussianNB.

Here we want to add feature extraction to reduce the feature dimension and improve accuracy. Let's visualize the dataset first.

fig, axes = plt.subplots(10, 10, figsize=(5, 5),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

# Show the first 100 digit images with their labels
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]), transform=ax.transAxes, color='green')

As we can see from the digit images, dots cluster together to form each number. Manifold learning seems a good fit for this situation. Let's use Isomap to reduce the feature dimension to 15.

from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# Isomap reduces the features to 15 dimensions before GaussianNB
model = make_pipeline(Isomap(n_components=15), GaussianNB())
model.fit(X_train, y_train)
y_model = model.predict(X_test)
print(accuracy_score(y_test, y_model))

mat = confusion_matrix(y_test, y_model)
sns.heatmap(mat, annot=True, cbar=False)

As we can see, the feature dimension goes down from 64 to 15, and at the same time the estimation accuracy improves from 80% to 97%.

As a note, the hyperparameter n_components = 15 was selected with GridSearchCV(). The code is as follows:

from sklearn.model_selection import GridSearchCV

model = make_pipeline(Isomap(), GaussianNB())
grid = GridSearchCV(model, param_grid={'isomap__n_components': [2, 5, 7, 9, 10, 12, 15, 20, 30]}, cv=7)
grid.fit(X_train, y_train)
print(grid.best_params_)  # {'isomap__n_components': 15}

The chart of cross-validation scores shows that 15 is the best number before the model starts to overfit.
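A hedged sketch of how such a chart can be produced from the fitted grid object above (the plot here is illustrative, not the exact chart from the article):

import matplotlib.pyplot as plt

# Mean cross-validated accuracy for each candidate n_components
n_comps = [2, 5, 7, 9, 10, 12, 15, 20, 30]
scores = grid.cv_results_['mean_test_score']

plt.plot(n_comps, scores, marker='o')
plt.xlabel('isomap__n_components')
plt.ylabel('mean CV accuracy')
plt.show()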

VAE Example

Deep learning models work on both linear and nonlinear data. For highly correlated feature sets (like text and images), CNNs and RNNs can dramatically reduce the feature dimension by learning patterns among the features.

In this example, 3 portrait photos are compressed into a 4-D vector each, and later recovered back into images accurately. To make it more interesting, I am using a VAE model instead. Instead of outputting vectors as a plain autoencoder (AE) does, the VAE outputs a Gaussian distribution from which vectors can be sampled. By sampling vectors from the output distributions and decoding them back into images, we can see the image's transition from one to another.
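The full model lives in the linked notebook; as a hedged illustration of the sampling step described above, the reparameterization trick might look like this in Keras-style TensorFlow code (the 4-D latent size follows the description, everything else is an assumption):

import tensorflow as tf

latent_dim = 4  # the 4-D vector mentioned above

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: draw z from N(z_mean, exp(z_log_var))
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

# z_mean and z_log_var come from the encoder; z feeds the decoder:
# z = sample_z(z_mean, z_log_var)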

Here are the TensorBoard scalar training curves:

After training, we have the following vectors decoded into images.

original images → vectors → decoded images:

Since the model is trained on the same 3 images, this extreme result doesn't have practical value. It merely showcases how far deep learning can go in the feature-reduction area.

Interestingly, we can sample vectors from the (z_mean, z_log_var) distribution to get some blended images:
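A hedged sketch of how such blended images could be produced, assuming encoder and decoder are the trained VAE sub-models from the notebook (both names are hypothetical, and the encoder is assumed to return (z_mean, z_log_var)):

import numpy as np

# encoder / decoder are the trained VAE sub-models (hypothetical names)
z_a, _ = encoder.predict(image_a[np.newaxis, ...])
z_b, _ = encoder.predict(image_b[np.newaxis, ...])

# Walk along the line between the two latent means and decode each point
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b
    blended_image = decoder.predict(z)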

The project notebook is available here
