In the realm of sequence modeling, Recurrent Neural Networks (RNNs) represent a foundational architecture designed to process sequential data by maintaining an internal memory of previous inputs. Unlike feedforward networks that treat each input independently, RNNs create a temporal dependency structure that allows information to persist across time steps, making them exceptionally well-suited for tasks involving temporal or sequential patterns.
At the heart of an RNN lies the hidden state—a dynamically evolving representation that captures the contextual history of the sequence processed thus far. At each time step t, the network computes a new hidden state by combining the current input with the previous hidden state through learned weight transformations:
$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)$$
Where:
- $h_t$ is the hidden state at time step $t$
- $x_t$ is the input at time step $t$
- $h_{t-1}$ is the hidden state from the previous time step
- $W_{xh}$ and $W_{hh}$ are the input-to-hidden and hidden-to-hidden weight matrices
- $b_h$ is the hidden bias vector
The output at each time step is then computed from the current hidden state, where $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias:
$$y_t = W_{hy} \cdot h_t + b_y$$
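The two equations above can be sketched as a single recurrent step in NumPy. This is an illustrative sketch, not the required class: the parameter shapes and the 0.1 scaling of the random weights are assumptions made for the example.

```python
import numpy as np

# Illustrative random parameters (a real implementation would use Xavier init).
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 1, 5, 1

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

def step(x_t, h_prev):
    # h_t = tanh(W_xh . x_t + W_hh . h_{t-1} + b_h)
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # y_t = W_hy . h_t + b_y
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Feed a short sequence, carrying the hidden state from step to step.
h = np.zeros((hidden_size, 1))
outputs = []
for x in [[1.0], [2.0], [3.0], [4.0]]:
    h, y = step(np.array(x).reshape(-1, 1), h)
    outputs.append(y)
```

Note how the same weight matrices are reused at every time step; only the hidden state `h` changes as the sequence is consumed.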
Training an RNN requires a specialized gradient computation technique called Backpropagation Through Time (BPTT). Since the network's computations unfold across multiple time steps, gradients must flow backward through the entire sequence to capture temporal dependencies.
The key insight of BPTT is that at each time step, the hidden state depends on all previous hidden states. Therefore, when computing gradients, we must:
- Unroll the network across the full sequence, storing the hidden states from the forward pass.
- Propagate the gradient of each hidden state backward through the recurrent connection to the earlier time steps it depends on.
- Accumulate the gradients for each weight matrix over all time steps, since the same weights are shared at every step.
For this problem, use half Mean Squared Error (MSE) as the loss function:
$$L = \frac{1}{2} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2$$
The total loss is the sum of losses at each individual time step.
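The BPTT procedure for this half-MSE loss can be sketched as follows. The variable names and shapes are illustrative assumptions, and the bias gradients are omitted for brevity; the point is how the per-step gradients are accumulated into shared weight matrices and how `dh_next` carries the gradient back through the recurrent connection.

```python
import numpy as np

# Minimal BPTT sketch for the loss L = 1/2 * sum_t (y_t - yhat_t)^2.
T, input_size, hidden_size = 4, 1, 5
rng = np.random.default_rng(42)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((1, hidden_size)) * 0.1

xs = [np.array([[float(t + 1)]]) for t in range(T)]       # 1, 2, 3, 4
targets = [np.array([[float(t + 2)]]) for t in range(T)]  # 2, 3, 4, 5

# Forward pass, storing every hidden state for reuse in the backward pass.
hs = [np.zeros((hidden_size, 1))]
ys = []
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
    ys.append(W_hy @ hs[-1])

# Backward pass: gradients are summed over all time steps because
# the same weight matrices are shared across the whole sequence.
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dW_hy = np.zeros_like(W_hy)
dh_next = np.zeros((hidden_size, 1))
for t in reversed(range(T)):
    dy = ys[t] - targets[t]            # dL/dy_t for the half-MSE loss
    dW_hy += dy @ hs[t + 1].T
    dh = W_hy.T @ dy + dh_next         # gradient from output and from step t+1
    dz = dh * (1.0 - hs[t + 1] ** 2)   # backprop through tanh
    dW_xh += dz @ xs[t].T
    dW_hh += dz @ hs[t].T
    dh_next = W_hh.T @ dz              # flows backward to step t-1
```

After the loop, each `dW_*` holds the total gradient over the sequence and a gradient-descent update would subtract `learning_rate * dW_*` from the corresponding weight.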
Implement a complete recurrent neural network class with the following specifications:
Class: `SequentialRecurrentNetwork`
- Constructor: `__init__(self, input_size, hidden_size, output_size)`, which initializes the weight matrices using Xavier initialization with scale `sqrt(2 / (fan_in + fan_out))`
- Forward Pass: `forward(self, x)`
- Backward Pass: `backward(self, x, y, learning_rate)`
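One plausible sketch of the constructor's weight setup is shown below. The matrix shapes and the order of the random-number calls are assumptions; the exact values produced under seed 42 (and therefore the example outputs) depend on the reference implementation's call order.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier scale: sqrt(2 / (fan_in + fan_out))
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_out, fan_in) * scale

np.random.seed(42)  # seeded first, per the implementation notes
input_size, hidden_size, output_size = 1, 5, 1
W_xh = xavier_init(input_size, hidden_size)   # input -> hidden
W_hh = xavier_init(hidden_size, hidden_size)  # hidden -> hidden
W_hy = xavier_init(hidden_size, output_size)  # hidden -> output
```
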
Important Implementation Notes:
- Use `np.random.seed(42)` at the start of your weight initialization for reproducibility.
- Use the `tanh` activation function for the hidden layer.

Example 1:

input_sequence = [[1.0], [2.0], [3.0], [4.0]]
expected_output = [[2.0], [3.0], [4.0], [5.0]]
input_size = 1
hidden_size = 5
output_size = 1
learning_rate = 0.01

Output:

[[-0.0002], [-0.0005], [-0.0007], [-0.001]]

Explanation: The network is initialized with a 1-dimensional input, 5 hidden units, and 1-dimensional output. The forward pass processes the sequence [1.0, 2.0, 3.0, 4.0], updating the hidden state at each step. With Xavier-initialized weights (using seed 42), the initial predictions are small values near zero. The backward pass then computes gradients to learn the pattern of predicting the next value in the sequence. The small output values reflect the network's initial state before significant training.
Example 2:

input_sequence = [[0.5, 0.5], [1.0, 1.0], [1.5, 1.5]]
expected_output = [[1.0, 1.0], [1.5, 1.5], [2.0, 2.0]]
input_size = 2
hidden_size = 4
output_size = 2
learning_rate = 0.01

Output:

[[0.0001, 0.0002], [0.0001, 0.0004], [0.0002, 0.0006]]

Explanation: This example uses a 2-dimensional input and output with 4 hidden units. The network processes a sequence of 2D vectors, learning to predict the next vector in the pattern. The multi-dimensional setup demonstrates how the recurrent architecture handles higher-dimensional sequential data while maintaining the temporal gradient flow across all dimensions.
Example 3:

input_sequence = [[0.1], [0.2], [0.3], [0.4], [0.5]]
expected_output = [[0.2], [0.3], [0.4], [0.5], [0.6]]
input_size = 1
hidden_size = 10
output_size = 1
learning_rate = 0.001

Output:

[[0.0], [0.0], [0.0001], [0.0001], [0.0001]]

Explanation: A longer sequence with a smaller learning rate and more hidden units (10) demonstrates the network's behavior with increased capacity and smaller gradient steps. The smaller inputs and lower learning rate result in predictions very close to zero initially. The 5-step sequence shows how the hidden state accumulates information across longer time spans.
Constraints