10 Python One-Liners That Will Simplify Feature Engineering

Source: MachineLearningMastery.com

Image by Editor | Midjourney

Feature engineering is a key step in most data analysis workflows, especially when building machine learning models. It involves creating new features from existing raw data to extract deeper analytical insights and enhance model performance. To help turbocharge and streamline your feature engineering and data preparation workflows, this article presents 10 one-liners (single lines of code that accomplish a meaningful task efficiently and concisely) to keep on hand for performing feature engineering on a variety of data and situations, all in a simplified manner.

Before starting, you may need to import some key Python libraries and modules we will use. In addition, we will load two openly available datasets through Scikit-learn's datasets module: the wine dataset and the Boston housing dataset (fetched from OpenML).

from sklearn.datasets import load_wine, fetch_openml

import pandas as pd

import numpy as np

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, KBinsDiscretizer, PolynomialFeatures

from sklearn.feature_selection import VarianceThreshold

from sklearn.decomposition import PCA

# Dataset loading into Pandas dataframes

wine = load_wine(as_frame=True)

df_wine = wine.frame

boston = fetch_openml(name="boston", version=1, as_frame=True)

df_boston = boston.frame

Notice that the two datasets have been loaded into two Pandas dataframes, whose variables are named df_wine and df_boston, respectively.
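
Before moving on, it does not hurt to quickly confirm that both datasets loaded as expected, for example by checking their dimensions and peeking at the first few rows:

# Optional sanity check: dataset dimensions and a peek at the first rows
print(df_wine.shape, df_boston.shape)
print(df_wine.head())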

1. Standardization of Numerical Features (Z-score Scaling)

Standardization is a common approach to scale numerical features in a dataset when their values span varying ranges or magnitudes, and there may be some moderate outliers. This scaling method transforms the numerical values of an attribute to follow a standard normal distribution, with a mean of 0 and a standard deviation of 1. Scikit-learn’s StandardScaler class provides a seamless implementation of this method: all you need to do is call its fit_transform method, passing in the features of the dataframe that need to be standardized:

df_wine_std = pd.DataFrame(StandardScaler().fit_transform(df_wine.drop('target', axis=1)), columns=df_wine.columns.drop('target'))

The resulting standardized attributes will now have small values centered around 0, some positive and some negative. This is completely normal, even if your original feature values were all positive, because standardization not only scales the data but also centers the values around each attribute's original mean.
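
As a quick sanity check, you can verify that every standardized column now has a mean close to 0 and a standard deviation close to 1 (tiny deviations are just floating-point rounding and the slightly different variance estimator Pandas uses):

# Confirm the standardized columns have mean ~0 and standard deviation ~1
print(df_wine_std.mean().round(3))
print(df_wine_std.std().round(3))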

2. Min-Max Scaling

When values in a feature vary rather uniformly across instances (for instance, the number of students per classroom in a high school), min-max scaling can be a suitable way to scale your data. It consists of normalizing the feature values into the unit interval [0, 1] by applying the formula x' = (x - min)/(max - min) to every value x, where min and max are the minimum and maximum values of the feature that x belongs to. Scikit-learn provides a class analogous to the one used for standardization.

df_boston_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df_boston.drop('MEDV', axis=1)), columns=df_boston.columns.drop('MEDV'))

In the above example, we used the Boston housing dataset to scale all features except MEDV (median house value), which is meant to be the target variable for machine learning tasks like regression, hence, it was dropped before normalizing.
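
A similar check confirms that every scaled column now lies within the [0, 1] interval:

# Confirm each min-max scaled column spans the [0, 1] range
print(df_boston_scaled.min().round(3))
print(df_boston_scaled.max().round(3))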

3. Add Polynomial Features

Adding polynomial features can be extremely useful when the data is not strictly linear but exhibits nonlinear relationships. The process boils down to adding new features obtained by raising original features to a power, as well as interactions between them. This example uses the PolynomialFeatures class to create, from the two features describing wines' alcohol and malic acid content, new features that are the squares (degree=2) of the original two, plus another feature capturing the interaction between them as their product:

df_interactions = pd.DataFrame(PolynomialFeatures(degree=2, include_bias=False).fit_transform(df_wine[['alcohol', 'malic_acid']]))

The result is the creation of three new features on top of the original two: “alcohol^2”, “malic_acid^2”, and “alcohol malic_acid”.
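
Note that the dataframe above ends up with generic integer column names, since fit_transform returns a plain NumPy array. If you keep a reference to the fitted transformer, you can recover descriptive names through its get_feature_names_out method (available in recent Scikit-learn versions); a slightly longer sketch:

# Keep the fitted transformer so the generated feature names can be recovered
poly = PolynomialFeatures(degree=2, include_bias=False)
df_interactions = pd.DataFrame(poly.fit_transform(df_wine[['alcohol', 'malic_acid']]), columns=poly.get_feature_names_out())
print(df_interactions.columns.tolist())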

4. One-Hot Encoding Categorical Variables

One-hot encoding consists of taking a categorical variable with "m" possible values or categories and creating "m" numerical (more precisely, binary) features, each indicating the presence or absence of one category in the data instance with a 1 or a 0, respectively. Thanks to Pandas' get_dummies function, the process couldn't be easier. In the example below, we assume that the CHAS attribute should be treated as categorical and apply that function to one-hot encode it.

df_boston_ohe = pd.get_dummies(df_boston.astype({'CHAS': 'category'}), columns=['CHAS'])

Since this feature originally took two possible values, two new binary features are built upon it. One-hot encoding is a very important process in many data analysis and machine learning processes where purely categorical features cannot be handled as such, requiring encoding.
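
If you want to double-check the result, the new binary columns follow the CHAS_&lt;category&gt; naming pattern, so they are easy to list:

# List the binary columns generated from the original CHAS feature
print([col for col in df_boston_ohe.columns if col.startswith('CHAS')])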

5. Discretizing Continuous Variables

Discretizing continuous numerical variables into several bins is a frequent step in analysis tasks such as visualization, helping produce plots like histograms that look less overwhelming while still capturing "the big picture". This example one-liner shows how to discretize the "alcohol" attribute in the wine dataset into four quantile-based bins, labeled 0 to 3:

df_wine['alcohol_bin'] = pd.qcut(df_wine['alcohol'], q=4, labels=False)
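
Note that pd.qcut builds quantile-based bins, so each bin holds roughly the same number of instances. If you prefer equal-width bins instead, pd.cut is the drop-in alternative (stored here in an illustrative alcohol_bin_width column):

# Equal-width alternative: four bins of identical width over the alcohol range
df_wine['alcohol_bin_width'] = pd.cut(df_wine['alcohol'], bins=4, labels=False)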

6. Logarithmic Transformation of Skewed Features

If one of your numerical features is right-skewed, or positively skewed, that is, it exhibits a long tail to the right-hand side due to a few values much larger than the rest, a logarithmic transformation helps rescale it into a form better suited for further analysis. NumPy's log1p function performs this transformation; just pass in the feature(s) of the dataframe that need transforming. The result is stored in a newly created dataframe feature.

df_wine['log_malic'] = np.log1p(df_wine['malic_acid'])
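
To see the effect, you can compare the skewness of the feature before and after the transformation; the log-transformed version should be noticeably closer to 0:

# Compare skewness before and after the logarithmic transformation
print(df_wine['malic_acid'].skew(), df_wine['log_malic'].skew())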

7. Creating a Ratio Between Two Features

One of the most straightforward yet common feature engineering steps in data analysis and preprocessing is the creation of a new feature as the ratio (division) between two that are semantically related. For instance, given the alcohol and malic acid levels of a wine sample, we could be interested in having a new attribute describing the ratio between these two chemical properties, as follows:

df_wine['alcohol_malic_ratio'] = df_wine['alcohol'] / df_wine['malic_acid']

Thanks to Pandas' vectorized operations, the division that produces the new feature is applied to every instance (row) in the dataset without the need for any explicit loops.
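
One caveat: if the denominator may contain zeros (not the case for malic acid here, but possible with other data), the ratio will produce infinite values. A defensive variant, sketched below with an illustrative column name, replaces zeros with NaN before dividing:

# Defensive version: avoid infinite ratios when the denominator could be zero
df_wine['alcohol_malic_ratio_safe'] = df_wine['alcohol'] / df_wine['malic_acid'].replace(0, np.nan)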

8. Removing Features with Low Variance

Oftentimes, some features show so little variability among their values that they not only contribute little to analyses or to machine learning models trained on the data, but may even make results worse. It is therefore a good idea to identify and remove such low-variance features. This one-liner illustrates how to use Scikit-learn's VarianceThreshold class to automatically remove features whose variance falls below a threshold. Try adjusting the threshold to see how it affects the result: the higher the threshold, the more aggressive the removal.

df_boston_high_var = pd.DataFrame(VarianceThreshold(threshold=0.1).fit_transform(df_boston.drop('MEDV', axis=1)))

Note: the MEDV attribute has been removed manually because it is the dataset's target variable, independently of any features removed afterwards for falling below the variance threshold.
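
As with the earlier transformers, fit_transform returns a plain NumPy array, so the column names are lost. If you keep a reference to the fitted selector, its get_support method tells you which features survived, letting you restore the names; a slightly longer sketch:

# Keep the fitted selector so the names of the surviving features can be recovered
features = df_boston.drop('MEDV', axis=1)
selector = VarianceThreshold(threshold=0.1)
df_boston_high_var = pd.DataFrame(selector.fit_transform(features), columns=features.columns[selector.get_support()])
print(df_boston_high_var.columns.tolist())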

9. Multiplicative Interaction

Suppose our client, a wine producer in Lanzarote (Spain), uses for marketing purposes a quality score that combines information about a wine's alcohol content and color intensity into a single number. This can be done via feature engineering: simply take the features involved in computing the new score for every wine and apply the math our client wants reflected, for instance, the product of the two features:

df_wine['wine_quality'] = df_wine['alcohol'] * df_wine['color_intensity']

10. Keeping Track of Outliers

While outliers are often removed from a dataset in most data analysis scenarios, sometimes it is useful to keep track of them after identifying them. Why not do this by creating a new feature that indicates whether or not a data instance is an outlier?

df_boston['tax_outlier'] = ((df_boston['TAX'] < df_boston['TAX'].quantile(0.25) - 1.5 * (df_boston['TAX'].quantile(0.75) - df_boston['TAX'].quantile(0.25))) | (df_boston['TAX'] > df_boston['TAX'].quantile(0.75) + 1.5 * (df_boston['TAX'].quantile(0.75) - df_boston['TAX'].quantile(0.25)))).astype(int)

The one-liner manually applies the inter-quartile range (IQR) method to discover possible outliers for the TAX attribute, which is why it spans a significant length compared to the previous examples. Depending on the dataset and target feature you’re analyzing to discover outliers, none may be found, in which case the newly added feature would have a value of 0 for all instances in the dataset.
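
A quick way to gauge the outcome is to count how many instances were flagged:

# Count flagged (1) vs. non-flagged (0) instances for the TAX outlier indicator
print(df_boston['tax_outlier'].value_counts())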

Conclusion

This article took a glimpse at ten effective Python one-liners that, once you are familiar with them, will turbocharge your process of performing a variety of feature engineering steps efficiently, getting your data into great shape for further analysis or for building machine learning models on it.
