
Python Data Analysis Cheatsheet
Welcome to your go-to guide for Python data analysis! Whether you're just starting out or need a quick refresher, this cheatsheet covers the essential libraries, functions, and workflows you'll use every day. Let's dive in!
Essential Libraries
When it comes to data analysis in Python, three libraries are absolutely indispensable: pandas, NumPy, and Matplotlib. Together, they form the backbone of most data tasks.
pandas is your best friend for handling structured data. It provides the DataFrame object, which is like a supercharged Excel spreadsheet right inside your Python code. You can load data from CSV files, databases, or even web APIs, then clean, filter, and aggregate it with ease.
NumPy is the foundation for numerical computing. It offers powerful n-dimensional arrays and a host of mathematical functions to operate on them. Even if you use pandas, under the hood it relies on NumPy for fast computations.
Matplotlib is the classic plotting library. It gives you full control over your visualizations, from simple line charts to complex multi-panel figures. For quick and stylish plots, many analysts also use Seaborn, which builds on Matplotlib.
Here’s how you typically import them:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
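To see how the libraries fit together, here is a minimal sketch on made-up numbers (the column names and values are invented for illustration):
# A tiny DataFrame built from a plain dict
df = pd.DataFrame({'price': [9.99, 14.50, 3.25], 'quantity': [2, 1, 5]})
# NumPy does the numeric heavy lifting behind pandas operations
df['total'] = np.round(df['price'].to_numpy() * df['quantity'].to_numpy(), 2)
# Matplotlib turns the result into a quick bar chart
plt.bar(df.index, df['total'])
plt.ylabel('Total')
plt.show()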
Loading and Inspecting Data
The first step in any analysis is getting your data into Python. pandas makes this straightforward with functions like read_csv(), read_excel(), and read_sql().
# Load a CSV file
df = pd.read_csv('data.csv')
# Load an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Once loaded, you’ll want to inspect your data to understand its structure, check for missing values, and get a sense of the distributions.
# First five rows
print(df.head())
# Summary statistics
print(df.describe())
# Data types and non-null counts
print(df.info())
# Check for missing values
print(df.isnull().sum())
Function | Purpose | Example |
---|---|---|
head() | Show first n rows | df.head(10) |
tail() | Show last n rows | df.tail(5) |
info() | Data types and memory | df.info() |
describe() | Summary statistics | df.describe() |
Data Cleaning and Preparation
Real-world data is often messy. You’ll encounter missing values, duplicates, and inconsistencies. Here’s how to handle them:
- Dropping missing values: Use dropna() to remove rows or columns with missing data.
- Filling missing values: Use fillna() to replace missing values with a specific value or strategy.
- Removing duplicates: Use drop_duplicates() to eliminate duplicate rows.
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values with the mean
df['column'] = df['column'].fillna(df['column'].mean())
# Remove duplicate rows
df = df.drop_duplicates()
Sometimes you need to change data types or create new columns based on existing ones:
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Create a new column
df['total'] = df['price'] * df['quantity']
Data Selection and Filtering
Knowing how to slice and dice your data is crucial. pandas offers multiple ways to select subsets of your DataFrame.
Using column names:
# Select a single column
series = df['column_name']
# Select multiple columns
subset = df[['col1', 'col2']]
Using conditions:
# Rows where column value is greater than 10
filtered = df[df['column'] > 10]
# Multiple conditions
filtered = df[(df['col1'] > 10) & (df['col2'] == 'value')]
Using loc and iloc:
- loc is label-based: you specify row and column labels. Slice endpoints are included, so df.loc[0:4] returns five rows.
- iloc is integer-based: you specify row and column positions. Slices exclude the endpoint, like standard Python indexing.
# Select rows 0 to 4 and columns 'A' and 'B'
subset = df.loc[0:4, ['A', 'B']]
# Select first 5 rows and first 2 columns
subset = df.iloc[0:5, 0:2]
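You can also combine the two ideas: loc accepts a boolean mask for the rows and labels for the columns, which is handy for conditional selection (the column names here are hypothetical):
# Rows where 'value' exceeds 10, keeping only two columns
subset = df.loc[df['value'] > 10, ['col1', 'col2']]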
Aggregation and Grouping
Grouping data is a powerful way to summarize information. The groupby() method lets you split your data into groups based on criteria, then apply functions to each group.
# Group by a column and calculate mean
grouped = df.groupby('category')['value'].mean()
# Multiple aggregations
aggregated = df.groupby('category').agg({'value': ['mean', 'sum'], 'quantity': 'count'})
You can also use pivot_table() for quick cross-tabulations:
# Create a pivot table
pivot = pd.pivot_table(df, values='sales', index='region', columns='month', aggfunc='sum')
Method | Purpose | Example |
---|---|---|
groupby() | Group data | df.groupby('col') |
agg() | Multiple aggregations | df.agg({'col': 'mean'}) |
pivot_table() | Create pivot table | pd.pivot_table(df, values='val', index='row') |
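A related pattern worth knowing is named aggregation, which gives the output columns readable names; the column names below are assumptions for illustration:
# Named aggregation: each output column gets an explicit name
summary = df.groupby('category').agg(
    avg_value=('value', 'mean'),
    total_value=('value', 'sum'),
    n_rows=('quantity', 'count'),
)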
Merging and Joining Data
Often you need to combine data from multiple sources. pandas provides several ways to merge DataFrames.
Concatenation:
# Stack DataFrames vertically
combined = pd.concat([df1, df2])
# Stack horizontally
combined = pd.concat([df1, df2], axis=1)
Merging (like SQL joins):
# Inner join
merged = pd.merge(df1, df2, on='key')
# Left join
merged = pd.merge(df1, df2, on='key', how='left')
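To make the difference concrete, here is a tiny worked example with made-up frames; the left join keeps the unmatched key and fills its missing value with NaN:
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [10, 20]})
# Inner join drops 'c'; left join keeps it with y = NaN
print(pd.merge(left, right, on='key'))
print(pd.merge(left, right, on='key', how='left'))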
Basic Visualization
Visualizing your data helps you spot patterns and communicate findings. Matplotlib and Seaborn are your main tools here.
Line plot:
plt.plot(df['x'], df['y'])
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.title('Line Plot')
plt.show()
Bar chart:
df['category'].value_counts().plot(kind='bar')
plt.title('Count by Category')
plt.show()
Histogram:
df['value'].plot(kind='hist', bins=30)
plt.title('Distribution of Values')
plt.show()
Seaborn for style:
sns.boxplot(x='category', y='value', data=df)
plt.title('Box Plot by Category')
plt.show()
Handling DateTime Data
Time series data is common in analysis. pandas has excellent support for working with dates and times.
# Convert string to datetime
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Set as index
df.set_index('date', inplace=True)
# Resample to monthly means ('ME' in pandas >= 2.2; older versions use 'M')
monthly = df['value'].resample('ME').mean()
Useful pandas Methods
Here’s a quick reference of frequently used pandas methods; a short demo follows the list:
- df.sort_values('column'): Sort by a column.
- df.rename(columns={'old': 'new'}): Rename columns.
- df.drop('column', axis=1): Drop a column.
- df.sample(5): Get five random rows.
- df.corr(): Correlation matrix of numeric columns.
- df.nlargest(10, 'column'): Top N rows by a column's values.
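Here’s how a few of these look on a throwaway DataFrame (the values are invented for the example):
# Toy data to exercise the methods above
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Pune'], 'pop': [0.7, 10.7, 3.1]})
print(df.sort_values('pop'))                       # ascending by population
print(df.nlargest(2, 'pop'))                       # two most populous rows
print(df.rename(columns={'pop': 'population_m'}))  # clearer column name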
NumPy Essentials
While pandas handles most data tasks, sometimes you need to drop down to NumPy for performance or specific operations.
Creating arrays:
arr = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 3))
ones = np.ones((2, 4))
Array operations:
# Element-wise operations
result = arr1 + arr2
# Matrix multiplication
product = np.dot(matrix1, matrix2)
# Statistical functions
mean = np.mean(arr)
std_dev = np.std(arr)
Handling Missing Data with NumPy
NumPy uses np.nan to represent missing numerical values.
# Create an array with missing values
arr = np.array([1, 2, np.nan, 4, 5])
# Check for missing values
mask = np.isnan(arr)
# Remove missing values
clean_arr = arr[~np.isnan(arr)]
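NumPy also ships NaN-aware reductions, so you often don’t need to filter first:
# NaN-aware statistics skip missing values automatically
arr = np.array([1, 2, np.nan, 4, 5])
mean = np.nanmean(arr)    # 3.0; the nan entry is ignored
std_dev = np.nanstd(arr)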
Saving Your Results
After analysis, you’ll want to save your cleaned data or results.
# Save to CSV
df.to_csv('cleaned_data.csv', index=False)
# Save to Excel
df.to_excel('results.xlsx', sheet_name='Data')
# Save to pickle (preserves data types)
df.to_pickle('data.pkl')
Performance Tips
Working with large datasets? These tips can help speed up your code:
- Use vectorized operations instead of loops (see the sketch after this list).
- Specify data types when reading data (the dtype parameter).
- Use pd.eval() for complex expressions on large DataFrames.
- Consider Dask for out-of-core computation on data that doesn't fit in memory.
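A minimal sketch of the first two tips; the file and column names are hypothetical:
# Declare compact dtypes up front to cut memory use at load time
df = pd.read_csv('big_data.csv', dtype={'id': 'int32', 'flag': 'category'})
# Vectorized: one array operation instead of a Python-level loop
df['total'] = df['price'] * df['quantity']
# The slow, row-by-row equivalent to avoid:
# df['total'] = [p * q for p, q in zip(df['price'], df['quantity'])]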
Common Pitfalls and How to Avoid Them
- SettingWithCopyWarning: Raised when you modify a slice of a DataFrame. Use .copy() to create an explicit copy, as shown below.
- Memory errors: With large data, read in chunks (see the sketch after the code below) or use more efficient data types.
- Mixed data types: Ensure columns have consistent types to avoid unexpected behavior.
# Avoid SettingWithCopyWarning
subset = df[df['value'] > 10].copy()
subset['new_col'] = 1 # Safe now
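For the memory-error case, pandas can stream a large CSV in pieces via the chunksize parameter; the file and column names are made up here:
# Process a large CSV in 100,000-row chunks instead of loading it all at once
totals = []
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000):
    totals.append(chunk['sales'].sum())
print(sum(totals))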
Putting It All Together: A Quick Example
Let’s walk through a mini analysis from start to finish:
# Load data
df = pd.read_csv('sales_data.csv')
# Inspect
print(df.head())
print(df.info())
# Clean
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])
# Analyze
monthly_sales = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()
# Visualize
monthly_sales.plot(kind='bar')
plt.title('Monthly Sales')
plt.ylabel('Total Sales')
plt.show()
# Save results
monthly_sales.to_csv('monthly_sales.csv')
This cheatsheet covers the fundamentals, but remember that data analysis is an iterative process. You’ll often go back and forth between cleaning, exploring, and visualizing. The more you practice, the more intuitive these tools will become.
Happy analyzing!