
Recurrent Neural Networks (RNNs) Explained
Imagine you’re reading a sentence. To understand what a word means, you need to remember the words that came before it. Traditional neural networks can’t do that—they process each input independently. But what if you want your model to remember past information? That’s where Recurrent Neural Networks, or RNNs, come into play.
In this article, we’re going to demystify RNNs. You’ll learn what they are, how they work, why they’re useful, and where they sometimes fall short. We’ll also look at some code to make everything more concrete. Let’s get started!
What Are RNNs?
At its core, an RNN is a type of neural network designed to work with sequences of data. Whether it’s text, time series, audio, or even video frames, RNNs maintain a “memory” of previous inputs by using loops within the network. This allows information to persist, making RNNs naturally suited for tasks where context matters.
Think of it like this: a regular neural network reads one word and makes a prediction. An RNN reads one word, remembers something about it, reads the next word, combines it with what it remembers, and so on. This looping mechanism is the defining feature of RNNs.
Here’s a simple way to visualize an RNN unrolled over time:
| Time Step | Input | Hidden State | Output |
|---|---|---|---|
| t=0 | "The" | h₀ | - |
| t=1 | "cat" | h₁ = f(h₀, "cat") | Predict next word |
| t=2 | "sat" | h₂ = f(h₁, "sat") | Predict next word |
Each step takes an input and the previous hidden state to produce a new hidden state and (optionally) an output.
How Do RNNs Work?
The key to an RNN is the recurrent cell. At each time step, the cell takes two inputs: the current data point (e.g., a word in a sentence) and the hidden state from the previous time step. It combines these to produce a new hidden state, which is then passed to the next step.
Mathematically, for a given time step \( t \), the hidden state \( h_t \) is computed as:
\[ h_t = \tanh(W_{ih} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h) \]
Where:
- \( x_t \) is the input at time \( t \)
- \( h_{t-1} \) is the hidden state from the previous step
- \( W_{ih} \) and \( W_{hh} \) are weight matrices
- \( b_h \) is the bias term
- \( \tanh \) is the activation function (though others, like ReLU, are sometimes used)
The output at each step (if needed) is often computed from the hidden state:
\[ y_t = W_{ho} \cdot h_t + b_o \]
This simple structure allows the network to carry information across time steps. Let’s see what this looks like in code using PyTorch.
```python
import torch
import torch.nn as nn

# Define a simple RNN
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Input: (batch_size, seq_length, input_size)
input_seq = torch.randn(3, 5, 10)  # Batch of 3 sequences, each of length 5 with 10 features

# Initial hidden state: (num_layers, batch_size, hidden_size)
h0 = torch.randn(1, 3, 20)

# Forward pass
output, hn = rnn(input_seq, h0)

print(output.shape)  # torch.Size([3, 5, 20])
print(hn.shape)      # torch.Size([1, 3, 20])
```
In this example, we create an RNN that expects input sequences with 10 features. The hidden state has a size of 20. We pass a batch of 3 sequences, each of length 5, and get an output for each time step along with the final hidden state.
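To connect the formula for \( h_t \) to what nn.RNN computes, here is a minimal hand-rolled sketch of the same forward pass using the module's own weight tensors (weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0). It reuses rnn, input_seq, h0, and output from the snippet above and should reproduce output up to floating-point tolerance.
```python
# Hand-rolled forward pass, reusing rnn, input_seq, h0, and output from above
h = h0[0]  # (batch_size, hidden_size)
manual_states = []
for t in range(input_seq.size(1)):
    x_t = input_seq[:, t, :]
    # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
    h = torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                   + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
    manual_states.append(h)

manual_output = torch.stack(manual_states, dim=1)  # (batch_size, seq_length, hidden_size)
print(torch.allclose(manual_output, output, atol=1e-5))  # True
```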
Why Use RNNs?
RNNs are powerful because they can handle variable-length sequences and capture temporal dependencies. This makes them ideal for a variety of applications:
- Natural Language Processing (NLP): Language modeling, machine translation, sentiment analysis.
- Time Series Prediction: Stock price forecasting, weather prediction.
- Speech Recognition: Converting audio to text.
- Music Generation: Creating sequences of musical notes.
Their ability to process inputs step-by-step while maintaining a hidden state gives them a unique advantage over feedforward networks for sequential data.
Limitations of Basic RNNs
While RNNs are useful, they have some well-known limitations. The most significant is the vanishing gradient problem. During training, gradients (which are used to update the weights) can become extremely small as they are propagated back through many time steps. This makes it hard for the network to learn long-range dependencies.
For example, in a long paragraph, an early word might be important for understanding a much later word, but a basic RNN often forgets this information over time. To address this, more advanced architectures like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) were developed. These use gating mechanisms to control what information is kept or discarded, making them much better at handling long sequences.
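Before looking at those architectures, here is a small experiment that makes the vanishing gradient visible: feed a randomly initialized vanilla RNN a 100-step sequence, backpropagate from the last time step, and compare the gradient that reaches the first input with the gradient at the last input. The exact numbers depend on the initialization, but the early gradient is typically many orders of magnitude smaller.
```python
# A small experiment illustrating vanishing gradients in a vanilla RNN
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(1, 100, 10, requires_grad=True)  # one sequence of 100 time steps
output, hn = rnn(x)

# Backpropagate from the hidden state at the final time step
output[:, -1, :].sum().backward()

print("grad norm at t=0:  ", x.grad[0, 0].norm().item())   # typically tiny
print("grad norm at t=99: ", x.grad[0, 99].norm().item())  # much larger
```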
A Quick Look at LSTM and GRU
While a full deep dive into LSTMs and GRUs is beyond the scope of this article, it’s worth knowing that they are variants of RNNs designed to mitigate the vanishing gradient problem.
- LSTM: Uses input, forget, and output gates to regulate the flow of information.
- GRU: A simpler version with reset and update gates.
Both are widely used in practice and often outperform basic RNNs on longer sequences.
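As a quick illustration, both are available in PyTorch as near drop-in replacements for nn.RNN; the main API difference is that nn.LSTM also returns a cell state. A minimal sketch with the same shapes as the earlier example:
```python
# LSTM and GRU used as drop-in replacements for nn.RNN
import torch
import torch.nn as nn

x = torch.randn(3, 5, 10)  # (batch_size, seq_length, input_size)

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
out, (hn, cn) = lstm(x)    # LSTM also returns a cell state cn
print(out.shape)           # torch.Size([3, 5, 20])

gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
out, hn = gru(x)           # GRU returns only a hidden state, like nn.RNN
print(out.shape)           # torch.Size([3, 5, 20])
```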
Training RNNs
Training an RNN involves a process called Backpropagation Through Time (BPTT). Essentially, the network is unrolled over time, and gradients are computed across all time steps. This can be computationally intensive and memory-heavy for long sequences.
Common challenges include:
- Exploding gradients: Gradients become too large, causing unstable training. This is often handled with gradient clipping.
- Vanishing gradients: As mentioned, gradients become too small. Using LSTM or GRU helps, as does careful initialization and using activation functions like ReLU.
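One common way to keep memory and compute in check on long sequences is truncated BPTT: run the network over the sequence in chunks and detach the hidden state between chunks so gradients only flow a limited number of steps back. A minimal sketch, with a dummy sequence and a placeholder loss chosen purely for illustration:
```python
# Minimal sketch of truncated BPTT on a long dummy sequence (illustrative only)
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.001)

long_seq = torch.randn(1, 1000, 10)  # one very long dummy sequence
chunk_len = 50                       # assumed chunk length
h = torch.zeros(1, 1, 20)            # (num_layers, batch_size, hidden_size)

for start in range(0, long_seq.size(1), chunk_len):
    chunk = long_seq[:, start:start + chunk_len, :]
    out, h = rnn(chunk, h)
    loss = out.pow(2).mean()         # placeholder loss, just for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    h = h.detach()                   # stop gradients at the chunk boundary
```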
Here’s a simple training loop example for an RNN in PyTorch for a sequence classification task:
```python
# Example: RNN for sequence classification
class RNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(RNNClassifier, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])  # Take the last time step's output
        return out

model = RNNClassifier(input_size=10, hidden_size=20, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Dummy training loop
for epoch in range(10):
    for batch_x, batch_y in train_loader:  # Assume a DataLoader named train_loader is defined
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
In this code, we define a simple RNN-based classifier that processes sequences and uses the final hidden state to make a prediction.
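If a loop like the one above runs into exploding gradients, a common pattern is to clip the gradient norm between loss.backward() and optimizer.step(). A single-step sketch, reusing the RNNClassifier defined above with dummy tensors standing in for a real DataLoader:
```python
# Single training step with gradient clipping (dummy data, reuses RNNClassifier)
import torch
import torch.nn as nn

model = RNNClassifier(input_size=10, hidden_size=20, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

batch_x = torch.randn(8, 5, 10)       # batch of 8 sequences, length 5, 10 features
batch_y = torch.randint(0, 2, (8,))   # dummy class labels

outputs = model(batch_x)
loss = criterion(outputs, batch_y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before stepping
optimizer.step()
```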
Practical Tips for Using RNNs
If you’re getting started with RNNs, keep these tips in mind:
- Preprocess your sequences: Pad sequences to the same length if working with batches, and use packing to avoid computations on padding tokens (e.g., nn.utils.rnn.pack_padded_sequence in PyTorch); a short packing sketch follows this list.
- Start small: Use a small hidden size and few layers to avoid overfitting.
- Monitor gradients: Use tools like gradient clipping to prevent explosions.
- Consider using LSTM/GRU: For most real-world sequence tasks, these advanced variants perform better.
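Here is the packing sketch mentioned in the first tip: pad variable-length sequences to a common length, pack them so the RNN skips the padded positions, and unpack the output afterwards. The shapes below are illustrative.
```python
# Padding and packing variable-length sequences (illustrative shapes)
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

seqs = [torch.randn(4, 10), torch.randn(2, 10), torch.randn(5, 10)]  # variable lengths
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)  # (3, 5, 10), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
packed_out, hn = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # back to padded form
print(out.shape)  # torch.Size([3, 5, 20])
```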
Applications of RNNs
To give you a concrete sense of where RNNs shine, here are a few classic applications:
- Language Modeling: Predicting the next word in a sentence.
- Machine Translation: Translating sentences from one language to another.
- Sentiment Analysis: Classifying text as positive, negative, or neutral.
- Time Series Forecasting: Predicting future values based on past observations.
Each of these tasks relies on the model’s ability to understand context and sequence order.
Comparing RNNs with Other Models
It’s worth noting that in recent years, Transformers (like BERT and GPT) have become very popular for sequence tasks, especially in NLP. Transformers use self-attention mechanisms and often outperform RNNs on many benchmarks. However, RNNs are still relevant—especially when computational resources are limited, or when dealing with very long sequences where Transformers might be too expensive.
Summary of Key Concepts
Let’s recap the most important points about RNNs:
- RNNs are designed for sequential data.
- They use a hidden state to remember past information.
- Basic RNNs suffer from vanishing gradients, making it hard to learn long-term dependencies.
- LSTMs and GRUs are improved versions that address this issue.
- Training uses Backpropagation Through Time (BPTT).
- Practical applications include NLP, time series, and more.
Wrapping Up
Recurrent Neural Networks are a foundational tool in the deep learning toolkit. While they have limitations, understanding how they work provides a solid basis for grasping more advanced models like LSTMs and even Transformers.
I hope this article helped you understand what RNNs are, how they function, and why they’re useful. Try implementing a simple RNN yourself—there’s no better way to learn!
Happy coding!