
Creating New Features in pandas
So you've got your dataset loaded into a pandas DataFrame, and you're ready to take your data analysis or machine learning project to the next level. Often, the raw data you start with isn't enough—you need to create new features that better capture the patterns and relationships hidden within. Whether you're working on a predictive model or conducting exploratory analysis, feature engineering is a crucial skill. Let's explore how you can create new features effectively using pandas.
Basic Column Operations
The simplest way to create a new feature is by performing operations on existing columns. pandas makes this incredibly straightforward. Let's say you have a DataFrame with sales data containing 'price' and 'quantity' columns, and you want to create a 'revenue' feature.
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
    'price': [10, 15, 20, 25],
    'quantity': [100, 150, 200, 250]
})
# Create revenue feature
df['revenue'] = df['price'] * df['quantity']
print(df)
This basic arithmetic operation gives you immediate insight into total revenue for each transaction. You can perform all sorts of mathematical operations including addition, subtraction, division, and even more complex calculations.
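The same two columns can yield relative and non-linear features as well. Here's a quick sketch; the column names below ('revenue_share', 'price_squared') are illustrative additions, not part of the original example:
# A few more derived features from the same columns
df['revenue_share'] = df['revenue'] / df['revenue'].sum()  # each row's share of total revenue
df['price_squared'] = df['price'] ** 2  # simple polynomial feature
print(df[['revenue', 'revenue_share', 'price_squared']])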
Working with Dates and Times
DateTime features are incredibly rich sources of information. When you have date columns, you can extract numerous meaningful features that might be predictive or informative.
# Create a sample DataFrame with dates
df_dates = pd.DataFrame({
    'transaction_date': pd.date_range('2023-01-01', periods=5, freq='D')
})
# Extract multiple date features
df_dates['year'] = df_dates['transaction_date'].dt.year
df_dates['month'] = df_dates['transaction_date'].dt.month
df_dates['day_of_week'] = df_dates['transaction_date'].dt.dayofweek
df_dates['is_weekend'] = df_dates['day_of_week'].isin([5, 6])
df_dates['quarter'] = df_dates['transaction_date'].dt.quarter
print(df_dates)
These extracted features can help you identify seasonal patterns, weekly trends, or other temporal effects in your data.
| Date Feature | Description | Potential Use Case |
|---|---|---|
| Year | Extracts the year component | Year-over-year analysis |
| Month | Extracts the month (1-12) | Seasonal pattern detection |
| Day of Week | Monday=0 to Sunday=6 | Weekend vs weekday patterns |
| Quarter | Business quarter (1-4) | Quarterly performance analysis |
| Is Weekend | Boolean for Saturday/Sunday | Weekend behavior analysis |
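As a quick illustration of how these features become useful, here's a sketch comparing weekend and weekday behavior; the 'sales' values are made up purely for demonstration:
# Hypothetical sales figures, added only to illustrate a weekend/weekday comparison
df_dates['sales'] = [120, 95, 130, 210, 180]
print(df_dates.groupby('is_weekend')['sales'].mean())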
Handling Categorical Data
Categorical variables often need transformation before they can be used in machine learning models. pandas provides several methods to handle these effectively.
# Sample categorical data
df_cat = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small']
})
# One-hot encoding
size_dummies = pd.get_dummies(df_cat['size'], prefix='size')
df_cat = pd.concat([df_cat, size_dummies], axis=1)
# Ordinal encoding (manual mapping)
size_map = {'small': 0, 'medium': 1, 'large': 2}
df_cat['size_encoded'] = df_cat['size'].map(size_map)
print(df_cat)
One-hot encoding creates binary columns for each category, which is useful when there's no intrinsic ordering. Ordinal encoding assigns numerical values when there is a meaningful order to the categories.
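If you'd rather not maintain the mapping dictionary by hand, pandas' ordered Categorical dtype gives you the same encoding while keeping the ordering explicit. A minimal sketch:
# Alternative: an ordered Categorical produces the same 0/1/2 codes
df_cat['size_cat'] = pd.Categorical(df_cat['size'], categories=['small', 'medium', 'large'], ordered=True)
df_cat['size_codes'] = df_cat['size_cat'].cat.codes
print(df_cat[['size', 'size_encoded', 'size_codes']])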
Creating Interaction Features
Interaction features capture relationships between variables that might not be apparent from individual features alone. These can be particularly powerful for predictive modeling.
# Create interaction features
df['price_quantity_interaction'] = df['price'] * df['quantity']  # identical to 'revenue' here; shown for illustration
df['price_to_quantity_ratio'] = df['price'] / df['quantity']
# Non-linear transformation of an existing feature
df['log_revenue'] = np.log1p(df['revenue'])  # log(1 + x), so zero values are handled safely
print(df[['price', 'quantity', 'price_quantity_interaction', 'price_to_quantity_ratio', 'log_revenue']])
Interaction features can reveal non-linear relationships and often improve model performance significantly.
Binning and Discretization
Sometimes continuous variables work better when converted into categorical ranges. This process, called binning, can help models capture non-linear effects.
# Create age data
ages = pd.DataFrame({'age': [18, 22, 35, 47, 52, 68, 71, 29, 41, 58]})
# Equal-width binning
ages['age_group_equal'] = pd.cut(ages['age'], bins=3, labels=['young', 'middle', 'senior'])
# Custom binning
bins = [0, 30, 50, 100]
labels = ['young', 'middle_aged', 'senior']
ages['age_group_custom'] = pd.cut(ages['age'], bins=bins, labels=labels)
# Quantile-based binning (equal frequency)
ages['age_group_quantile'] = pd.qcut(ages['age'], q=3, labels=['low', 'medium', 'high'])
print(ages)
Each binning strategy serves different purposes. Equal-width binning creates ranges of equal size, while quantile-based binning ensures each bin has approximately the same number of observations.
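To see the difference in practice, count how many observations land in each bin; with ten ages, qcut should put roughly three or four in each bucket:
# Compare bin occupancy across strategies
print(ages['age_group_equal'].value_counts().sort_index())
print(ages['age_group_quantile'].value_counts().sort_index())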
Text Data Feature Engineering
When working with text data, you can extract numerous features that might be relevant for your analysis or modeling tasks.
# Sample text data
df_text = pd.DataFrame({
    'description': [
        'Excellent product, fast delivery',
        'Good quality but expensive',
        'Not what I expected',
        'Amazing value for money'
    ]
})
# Basic text features
df_text['char_count'] = df_text['description'].str.len()
df_text['word_count'] = df_text['description'].str.split().str.len()
df_text['has_excellent'] = df_text['description'].str.contains('excellent', case=False)
# Crude keyword-based sentiment: positive hits minus negative hits
# (lowercase and strip punctuation so words like 'Excellent' still match)
df_text['sentiment_score'] = df_text['description'].apply(
    lambda x: sum(w.strip('.,!').lower() in ('excellent', 'amazing', 'good') for w in x.split())
              - sum(w.strip('.,!').lower() in ('expensive', 'not') for w in x.split())
)
print(df_text)
These simple text features can provide immediate insights before you dive into more complex natural language processing techniques.
Key text features to consider creating (a few are sketched below):
- Character and word counts
- Presence of specific keywords
- Sentiment indicators
- Readability scores
- Specific pattern matches
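Here's a sketch of a few of these; the column names are just examples:
# Pattern- and shape-based text features
df_text['avg_word_length'] = df_text['char_count'] / df_text['word_count']
df_text['exclamation_count'] = df_text['description'].str.count('!')
df_text['starts_negative'] = df_text['description'].str.match(r'(?i)not\b')
print(df_text[['description', 'avg_word_length', 'exclamation_count', 'starts_negative']])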
Time-Based Rolling Features
For time series data, rolling window calculations can create powerful features that capture trends and patterns over time.
# Create time series data
dates = pd.date_range('2023-01-01', periods=20)
ts_data = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(50, 200, size=20)
}).set_index('date')
# Rolling features
ts_data['rolling_3day_avg'] = ts_data['sales'].rolling(window=3).mean()
ts_data['rolling_7day_max'] = ts_data['sales'].rolling(window=7).max()
ts_data['sales_lag_1'] = ts_data['sales'].shift(1)
ts_data['sales_diff'] = ts_data['sales'].diff()
print(ts_data.tail(10))
Rolling features are particularly valuable for forecasting models as they capture recent trends and patterns.
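One detail worth knowing: the first window-1 rows of a rolling feature are NaN by default. If that's a problem, min_periods lets the window "warm up", and pct_change is another handy trend feature. A brief sketch:
# Rolling mean that produces values from the first row onward
ts_data['rolling_3day_avg_warm'] = ts_data['sales'].rolling(window=3, min_periods=1).mean()
# Day-over-day growth rate
ts_data['sales_pct_change'] = ts_data['sales'].pct_change()
print(ts_data.head())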
Advanced Feature Creation with apply()
For more complex feature engineering, the apply() method gives you maximum flexibility to create custom features.
# Complex feature creation
def create_complex_features(row):
    features = {}
    # Business logic example: flag premium, high-volume transactions
    if row['price'] > 20 and row['quantity'] > 150:
        features['premium_high_volume'] = 1
    else:
        features['premium_high_volume'] = 0
    # Ratio feature with a divide-by-zero safeguard
    features['price_quantity_ratio'] = row['price'] / row['quantity'] if row['quantity'] != 0 else 0
    return pd.Series(features)
# Apply the function
complex_features = df.apply(create_complex_features, axis=1)
df = pd.concat([df, complex_features], axis=1)
print(df)
The apply() method is powerful but can be slower than vectorized operations for large datasets, so use it judiciously.
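For instance, both features from create_complex_features above can be expressed with vectorized operations; a sketch, using new column names so the two approaches can be compared side by side:
# Vectorized equivalents of the apply()-based features
df['premium_high_volume_vec'] = ((df['price'] > 20) & (df['quantity'] > 150)).astype(int)
df['price_quantity_ratio_vec'] = (df['price'] / df['quantity']).where(df['quantity'] != 0, 0)
print(df[['premium_high_volume', 'premium_high_volume_vec']])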
Handling Missing Values in New Features
When creating new features, you often need to handle missing values appropriately to avoid introducing errors.
# Create features with missing value handling
df['safe_ratio'] = df.apply(
    lambda x: x['price'] / x['quantity'] if x['quantity'] != 0 else np.nan,
    axis=1
)
# Fill missing values
df['safe_ratio_filled'] = df['safe_ratio'].fillna(df['safe_ratio'].mean())
# Alternatively, use np.where() for a vectorized version
df['safe_ratio_vectorized'] = np.where(
    df['quantity'] != 0,
    df['price'] / df['quantity'],
    np.nan
)
print(df[['price', 'quantity', 'safe_ratio', 'safe_ratio_filled', 'safe_ratio_vectorized']])
Proper missing value handling is crucial because division by zero or operations on missing values can distort your analysis and model performance.
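A quick habit that pays off: audit the NaNs your new features introduce before moving on.
# Count missing values in the newly created features
print(df[['safe_ratio', 'safe_ratio_filled', 'safe_ratio_vectorized']].isna().sum())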
Feature Scaling and Normalization
After creating new features, you might need to scale them, especially for machine learning algorithms that are sensitive to feature magnitudes.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Create sample features
features_to_scale = df[['price', 'quantity', 'revenue']].copy()
# Standardization (mean=0, std=1)
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features_to_scale)
df_standardized = pd.DataFrame(features_standardized,
                               columns=['price_std', 'quantity_std', 'revenue_std'])
# Normalization (0-1 range)
minmax_scaler = MinMaxScaler()
features_normalized = minmax_scaler.fit_transform(features_to_scale)
df_normalized = pd.DataFrame(features_normalized,
                             columns=['price_norm', 'quantity_norm', 'revenue_norm'])
# Combine with original DataFrame
df = pd.concat([df, df_standardized, df_normalized], axis=1)
print(df.head())
Different scaling methods serve different purposes. Standardization works well for algorithms that assume normally distributed data, while normalization is useful when you need features bounded between 0 and 1.
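If you want to stay within pandas, the same transformations are one-liners. One caveat: pandas' std() uses the sample estimator (ddof=1), so the results differ slightly from StandardScaler, which uses the population estimator:
# Equivalent scaling without the sklearn dependency
df['price_std_pd'] = (df['price'] - df['price'].mean()) / df['price'].std()
df['price_norm_pd'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())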
Best Practices for Feature Engineering
As you create new features, keep these best practices in mind to ensure your feature engineering efforts are effective and maintainable.
- Start with domain knowledge: The most powerful features often come from understanding the business context
- Keep it simple: Complex features aren't always better; sometimes simple transformations work best
- Validate feature usefulness: Use correlation analysis or feature importance methods to validate your new features
- Document your process: Keep track of how each feature was created for reproducibility
- Consider computational efficiency: Vectorized operations are much faster than iterative approaches
# Example: Validating feature usefulness
correlation_matrix = df.corr()
print(correlation_matrix['revenue'].sort_values(ascending=False))
Regularly check how your new features correlate with your target variable to ensure they're adding value rather than noise.
Efficient Feature Creation with Vectorization
Whenever possible, use pandas' vectorized operations instead of iterative approaches. They're significantly faster and more efficient.
# Inefficient way (avoid this)
df['inefficient_feature'] = [row['price'] * 1.1 for index, row in df.iterrows()]
# Efficient vectorized way
df['efficient_feature'] = df['price'] * 1.1
# Another example: conditional operations
df['discounted_price'] = np.where(df['quantity'] > 100, df['price'] * 0.9, df['price'])
Vectorized operations can be hundreds of times faster than using iterrows() or apply() with simple operations, especially on large datasets.
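If you want to see the gap yourself, here's a rough timing sketch on a larger frame; the exact numbers will vary by machine:
# Rough timing comparison on 100,000 rows
import time
big = pd.DataFrame({'price': np.random.rand(100_000)})
t0 = time.perf_counter()
big['slow'] = [row['price'] * 1.1 for _, row in big.iterrows()]
t1 = time.perf_counter()
big['fast'] = big['price'] * 1.1
t2 = time.perf_counter()
print(f"iterrows: {t1 - t0:.2f}s  vectorized: {t2 - t1:.4f}s")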
Remember that feature engineering is both an art and a science. The best features often come from deep understanding of your specific domain and problem. Experiment with different transformations, validate their usefulness, and don't be afraid to create features that might seem unconventional—sometimes they turn out to be the most predictive ones. Happy feature engineering!