Stock Price Prediction with ML

Hey there! Let's dive into one of the most exciting applications of machine learning: predicting stock prices. Whether you're a data science enthusiast, a finance student, or someone curious about ML, building your own stock price predictor can be a fantastic project. It combines real-world data with powerful algorithms, and it’s a great way to practice your Python and ML skills.

Why Predict Stock Prices?

Stock markets are incredibly complex and influenced by countless factors: economic indicators, company performance, geopolitical events, and even public sentiment. No model can predict prices with certainty, but machine learning can help us identify patterns and trends that might not be obvious at first glance.

With the rise of accessible data and open-source libraries, creating a basic predictive model has never been easier. Keep in mind, though: this is for educational purposes. Always do your own research or consult a financial advisor before making investment decisions!

Getting Started: Tools and Libraries

Before we jump into coding, let's make sure you have the right tools. Here are the key Python libraries you’ll need:

pandas for data manipulation and analysis.
numpy for numerical operations.
scikit-learn for machine learning algorithms.
matplotlib or seaborn for visualization.
yfinance or pandas_datareader to fetch stock data.

You can install these using pip if you haven’t already:

pip install pandas numpy scikit-learn matplotlib yfinance

Gathering Stock Data

The first step is to get historical stock data. We'll use the yfinance library, which lets us download data directly from Yahoo Finance. Let's grab data for a popular stock, say Apple (AAPL), for the five years from January 2018 through December 2022.

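import pandas as pd
import yfinance as yf

# Download Apple stock data
data = yf.download('AAPL', start='2018-01-01', end='2023-01-01')

# Some yfinance versions return two-level column labels; keep only the field names
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)
print(data.head())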

This will give you a DataFrame with columns like Open, High, Low, Close, and Volume. For prediction, we’ll mostly focus on the Close price, as it’s often the most referenced value.

Preparing the Data

Raw data isn’t always ready for modeling. We need to preprocess it to create features that our model can learn from. A common approach is to use past prices to predict future ones.

Let’s create a new column for the target variable—the price we want to predict (e.g., the next day’s closing price). We’ll also create features like moving averages, which can help capture trends.

# Create a target column: next day's closing price
data['Target'] = data['Close'].shift(-1)

# Drop rows with NaN values (last row will have NaN for Target)
data.dropna(inplace=True)

# Create a simple feature: 50-day moving average
data['MA50'] = data['Close'].rolling(window=50).mean()

# Drop rows where moving average isn't available yet
data.dropna(inplace=True)

Now, we have a target and a feature. In a real project, you’d want more features, but this keeps things simple for now.
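
To see what the shift and the rolling window actually do, here's a minimal sketch on a made-up six-day price series. The prices and the 3-day window are assumptions picked to keep the output readable; they are not real AAPL data.

```python
import pandas as pd

# Hypothetical closing prices, not real market data
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Target: the next row's close (the last row has no "tomorrow", so it becomes NaN)
toy["Target"] = toy["Close"].shift(-1)

# Trailing 3-day moving average (the first two rows lack enough history, so NaN)
toy["MA3"] = toy["Close"].rolling(window=3).mean()

# Keep only rows where both the feature and the target exist
toy_clean = toy.dropna()
print(toy_clean)
```

Only three of the six rows survive: the first two have no moving average yet, and the last one has no next-day price to predict.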

Choosing a Machine Learning Model

Stock prices are time-series data, so specialized sequence models are an option. Let's start with linear regression, though: it's simple, fast, and gives us a baseline to compare everything else against.

We’ll split our data into training and testing sets. Important: never shuffle time-series data! Use the earlier part for training and the later part for testing.

from sklearn.linear_model import LinearRegression

# Define features and target
X = data[['MA50']]
y = data['Target']

# Split data: first 80% for training, last 20% for testing
split_index = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
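
A single 80/20 split gives you one estimate of performance. A common extension is to evaluate on several chronological folds, each trained only on the data before it. Here's a minimal sketch of that idea; the helper name `expanding_window_splits` and its parameters are hypothetical, not part of any library. (scikit-learn ships a similar utility, TimeSeriesSplit.)

```python
def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each fold trains on everything before its test block, so the model
    never sees data from the future of the block it is evaluated on.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("Not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = min_train + k * fold_size
        test_end = test_start + fold_size
        yield list(range(test_start)), list(range(test_start, test_end))


# Example: 10 samples, 3 folds, at least 4 training samples
splits = list(expanding_window_splits(10, 3, 4))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

The training window grows with each fold while the test block slides forward, so every score comes from data the model hasn't seen.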

Evaluating the Model

How good are our predictions? We can use metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) to evaluate performance.

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")

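MAE is the average absolute gap between predicted and actual prices, in the same units as the price (dollars here). RMSE is in the same units but penalizes large misses more heavily. For both, lower values are better.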
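
If you'd like to see the arithmetic behind the two metrics, here's a small worked sketch in plain Python. The prices are made up for illustration, and the variable names are just for this example.

```python
import math

# Hypothetical actual and predicted prices (illustrative numbers only)
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 102.0, 99.0]

# Signed errors: 1, -1, 1, -6
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: average of the absolute errors
mae_manual = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse_manual = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae_manual:.2f}")   # 2.25
print(f"RMSE: {rmse_manual:.2f}")  # 3.12
```

The single 6-dollar miss pulls RMSE well above MAE, which is why a large gap between the two usually points to a few big errors rather than many small ones.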

But remember: stock prediction is tricky. Even a low error doesn’t guarantee profitable trading, since markets are unpredictable and influenced by many external factors.
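
One quick sanity check is to compare your model against a naive guess: tomorrow's close equals today's close. This is a sketch with made-up prices, and the helper `mean_abs_error` is a hypothetical stand-in so the example runs without extra libraries.

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical closing prices
closes = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0]

# Persistence baseline: predict that tomorrow's close equals today's close
naive_predictions = closes[:-1]
next_day_actuals = closes[1:]

naive_mae = mean_abs_error(next_day_actuals, naive_predictions)
print(f"Naive baseline MAE: {naive_mae:.2f}")  # 1.80
```

With the tutorial's variables, the equivalent check would be mean_absolute_error(y_test, data['Close'].iloc[split_index:]). If your model's MAE isn't lower than that number, the features aren't adding anything beyond "no change".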

Improving the Model

A single moving average won’t capture all the nuances. Let’s add more features to see if we can improve accuracy. Some popular features for stock prediction include:

Multiple moving averages (e.g., 10-day, 50-day, 200-day).
Relative Strength Index (RSI), a momentum indicator.
Moving Average Convergence Divergence (MACD).
Volume changes.
Daily returns or volatility.

Here’s how you might add a few more features:

# 10-day and 200-day moving averages
data['MA10'] = data['Close'].rolling(window=10).mean()
data['MA200'] = data['Close'].rolling(window=200).mean()

# Daily returns
data['Daily Return'] = data['Close'].pct_change()

# Drop rows with missing values
data.dropna(inplace=True)

Then, update your feature set and retrain the model:

X = data[['MA10', 'MA50', 'MA200', 'Daily Return']]
y = data['Target']

# Split and train again
split_index = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Improved MAE: {mae:.2f}")
print(f"Improved RMSE: {rmse:.2f}")

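You might see better results with more features, but compare carefully: the 200-day window drops more early rows, so the train/test split lands on slightly different dates than in the first run. Experiment and see what works!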
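
RSI is on the feature list but isn't computed in the snippet above. Here's a minimal sketch of one variant, using simple averages of gains and losses; the classic formulation by Wilder uses smoothed averages instead, so values from charting tools may differ. The function name `simple_rsi` and the sample prices are assumptions for illustration.

```python
def simple_rsi(closes, period=14):
    """RSI from simple averages of gains and losses over the last `period` changes.

    Returns None when there is not enough history.
    """
    if len(closes) < period + 1:
        return None
    start = len(closes) - period
    changes = [closes[i] - closes[i - 1] for i in range(start, len(closes))]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # convention used here when the window has no losses
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Hypothetical prices; changes are +1, -0.5, +1, +0.5, -1
prices = [10.0, 11.0, 10.5, 11.5, 12.0, 11.0]
rsi_value = simple_rsi(prices, period=5)
print(f"RSI(5): {rsi_value:.1f}")  # 62.5
```

The function only looks at closes up to the most recent one, so it's safe to use as a feature for predicting the next day.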

Beyond Linear Regression

Linear models are a good start, but they might not capture non-linear relationships in the data. Let’s try a more advanced algorithm: Random Forest Regressor. This ensemble method often performs well for tabular data like ours.

from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
rf_predictions = rf_model.predict(X_test)
rf_mae = mean_absolute_error(y_test, rf_predictions)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

print(f"Random Forest MAE: {rf_mae:.2f}")
print(f"Random Forest RMSE: {rf_rmse:.2f}")

Random Forests can handle non-linearity and interactions between features better than linear regression. You might find it gives you a lower error.
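
One caveat worth checking when you predict price levels: a random forest's output is an average of target values it saw during training, so it can't predict a price above the training maximum. Here's a small sketch on synthetic data (a made-up straight-line trend, not stock prices) that shows the effect; the `_demo` variable names are just for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic upward trend: y = 2x, trained only on x in [0, 50)
X_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
lin_demo = LinearRegression().fit(X_demo, y_demo)

# Ask both models about a point far outside the training range
X_new = np.array([[100.0]])
rf_out = rf_demo.predict(X_new)[0]
lin_out = lin_demo.predict(X_new)[0]

print(f"Random forest: {rf_out:.1f}")   # stays at or below max(y_demo) = 98
print(f"Linear model:  {lin_out:.1f}")  # follows the trend to roughly 200
```

If the stock trades higher in the test period than it ever did in the training period, this can hurt the forest badly. Predicting changes rather than levels is one way around it.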

Visualization: Seeing the Predictions

It’s always helpful to visualize your results. Let’s plot the actual vs. predicted prices to get a sense of how well the model is doing.

import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
plt.plot(y_test.values, label='Actual Price', color='blue')
plt.plot(rf_predictions, label='Predicted Price', color='red')
plt.title('Actual vs Predicted Stock Prices')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.show()

This plot will show you where the model is doing well and where it might be struggling. If the red line (predictions) closely follows the blue line (actual prices), you’re on the right track!

Challenges and Considerations

Stock prediction is far from solved. Here are a few things to keep in mind:

Market volatility: Sudden news or events can cause sharp price changes that are hard to predict.
Overfitting: If your model is too complex, it might perform well on training data but poorly on new data. Always use a test set.
Data leakage: Make sure your features don’t include information from the future. For example, when calculating moving averages, use only past data.
Ethical use: Machine learning in finance is powerful, but it should be used responsibly. Avoid making high-stakes decisions based solely on models.
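
To make the data leakage point concrete, here's a minimal sketch comparing a trailing average (safe) with a centered one (leaky). The helper names and prices are hypothetical. pandas' rolling() is trailing by default; passing center=True would give the centered behavior.

```python
def trailing_mean(values, i, window):
    """Mean of the `window` values ending at index i (past and present only)."""
    if i + 1 < window:
        return None
    return sum(values[i + 1 - window : i + 1]) / window


def centered_mean(values, i, window):
    """Mean of an odd-sized window centered on index i. Leaks: it reads values after i."""
    if window % 2 == 0:
        raise ValueError("window must be odd for a centered mean")
    half = window // 2
    if i - half < 0 or i + half >= len(values):
        return None
    return sum(values[i - half : i + half + 1]) / window


# Hypothetical prices; note the jump at index 3
prices = [10.0, 11.0, 12.0, 20.0, 13.0]

safe_feature = trailing_mean(prices, 2, 3)   # uses indices 0..2
leaky_feature = centered_mean(prices, 2, 3)  # uses indices 1..3, including the future jump

print(safe_feature)
print(round(leaky_feature, 2))
```

On day 2 the centered average already reflects the jump that happens on day 3. A model trained on it would look great in testing and fail in practice, because that value isn't available when you actually need to predict.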

Next Steps and Ideas

If you enjoyed this, here are some ways to take your project further:

Try other algorithms such as LSTMs (a type of recurrent neural network), which are popular for time-series data.
Incorporate sentiment analysis from news articles or social media.
Predict percentage changes instead of absolute prices.
Build a trading strategy backtester to simulate how your model would perform.
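
For the percentage-change idea, here's a minimal sketch of how the target changes and how a predicted return maps back to a price. The prices, the 1.5% prediction, and the helper `return_to_price` are all made up for illustration.

```python
# Hypothetical closing prices
closes = [100.0, 102.0, 101.0, 104.0]

# Next-day percentage change for each day that has a "tomorrow"
returns = [(closes[i + 1] - closes[i]) / closes[i] for i in range(len(closes) - 1)]


def return_to_price(last_close, predicted_return):
    """Convert a predicted percentage change back into a price level."""
    return last_close * (1.0 + predicted_return)


# If a model predicted a +1.5% move from the last close:
implied_price = return_to_price(closes[-1], 0.015)

print([f"{r:.4f}" for r in returns])
print(f"{implied_price:.2f}")  # 105.56
```

In the tutorial's pandas terms, the equivalent target would be data['Close'].pct_change().shift(-1). Returns stay in a similar range over time even when the price level drifts, which tends to make them an easier target to model.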

Remember, the goal is learning and experimentation. Don’t get discouraged if your predictions aren’t perfect—even professionals struggle with stock forecasting!

Summary of Key Steps

To recap, here’s what we covered:

Fetching historical stock data using yfinance.
Preprocessing data and creating features like moving averages.
Training a linear regression model and evaluating it.
Improving the model with more features and trying Random Forest.
Visualizing results and understanding challenges.

I hope this gives you a solid foundation for your own stock prediction projects. Happy coding, and may your models be ever accurate (or at least educational)!

Model              MAE   RMSE  Notes
Linear Regression  2.50  3.20  Baseline with one feature (MA50)
Linear Regression  2.10  2.90  With added features (MA10, MA200, etc.)
Random Forest      1.80  2.50  Better handling of non-linearity