How to predict stock or crypto prices. In 2024. With Python. And PyTorch.
Flying and predicting stock prices: two of humanity’s greatest dreams. While growing wings seems impossible, let’s focus on predicting the prices of stocks, or in this case, a ‘cryptocurrency’! It sounds catchier, doesn’t it?
Intro
Goal
You’ll learn how to leverage PyTorch, a machine learning library/framework, to construct a learning algorithm. Our focus will be on the cryptocurrency ADA, which operates on the Cardano blockchain.* Unlike most tutorials that solely consider price as the input feature for the model, we’ll incorporate volume and trade amounts. Implementing the sliding window method, we’ll introduce an ‘outlook gap,’ a less commonly utilized technique. Exploring various model architectures and optimizers, I’ll provide a rough explanation of their workings. Additionally, we’ll aim to enhance the model’s performance. Lastly, I’ll touch upon why a simple script and numerical data might not suffice for accurate price prediction: stocks, cryptos, or butter. Whatever.
*) I’m not aiming to specifically promote cryptocurrencies. It’s simply convenient to access their historical prices. The primary focus is on PyTorch and machine learning.
Source Code
Jumping straight in without perusing my carefully crafted text? Bold move. But sure, go ahead, dive in. No hard feelings, probably. Check out this Colab:
https://drive.google.com/file/d/1lUef_DQBOV5Ngs5rLn3pGPWRpu9ho2Qd/view?usp=sharing
Let’s get some data
We’ll utilize the stock prices of ADA, a cryptocurrency operating on the Cardano blockchain. Kraken provides an extensive archive of historical data encompassing dozens of currencies across various time frames. I’ve opted for the 60-minute time resolution. Loading this data into a Pandas dataframe isn’t a significant challenge:
import pandas as pd

df = pd.read_csv("data/ADAEUR_60.csv")
# the timestamp column contains Unix seconds; convert it and use it as the index
df['date'] = pd.to_datetime(df['timestamp'], unit='s', errors='coerce')
df.set_index('date', inplace=True)
df
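A quick sanity check never hurts. The column names below are assumptions based on how they are used later in this article, so adjust them if your CSV differs:
print(df.columns.tolist())                           # we expect at least 'close', 'volume' and 'trades'
print(df[['close', 'volume', 'trades']].describe())  # rough feel for the value ranges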
Visualize, Visualize, Visualize
One of the most crucial teachings from Daniel Bourke, a fantastic PyTorch and machine learning tutor (and likely much more). As a fellow enthusiast of visualizations, let’s kick things off by plotting the closing price and volume onto a single chart to gain insights into our data.
import matplotlib.pyplot as plt

# Downsample the data so we don't crash the plotting mechanism; we don't need to plot every row in the dataset
downsampled_df = df.resample('1D').mean()

# close price on the left y axis
plt.plot(downsampled_df.index, downsampled_df['close'], label='Close', color='blue')
plt.ylabel('Close', color='blue')
plt.tick_params(axis='y', labelcolor='blue')

# duplicate the axes to get a second y axis on the right and plot the volume
ax2 = plt.twinx()
ax2.plot(downsampled_df.index, downsampled_df['volume'], label='Volume', color='red')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Title and legend
plt.title('Close Price vs. Volume')
plt.show()
I might not be a chart analyst, but here’s my take: Even if I had bought 1 million ADA in 2020 and sold them in 2021, I’d likely still be here, writing this article. Writing brings me joy.
Preparation
Let’s delve into the realm of writing machine learning code. I’ll begin by setting some hyperparameters. You might not grasp most of them yet, but we’ll cover them in detail later. For now, it’s beneficial to acknowledge their presence.
hidden_units = 64
num_layers = 4
learning_rate = 0.001
learning_rate_step_size = 5
learning_rate_gamma = 0.9
num_epochs = 100
batch_size = 32
window_size = 14
prediction_steps = 7
dropout_rate = 0.2
# stats and metrics
epoch_count = []
train_loss_values = []
test_loss_values = []
train_rmse_values = []
test_rmse_values = []
duration = 0
features = ['close', 'volume', 'trades']
target = 'close'
- Hidden Units: The size of the hidden state (the internal memory cells) in each layer of our model; more units mean more capacity to learn patterns.
- Num_layers: Denotes the number of layers in the model, where each layer possesses its own memory cells.
- Learning Rate: Represents the magnitude by which the model adjusts its weights and biases (parameters) after each step.
- Learning Rate Step Size: Determines after how many epochs the scheduler adjusts the learning rate.
- Learning Rate Gamma: The multiplier applied to the learning rate at each of those adjustments.
- Num_epochs: Signifies the number of iterations our primary loop will undergo during the training process.
- Batch Size: Specifies the quantity of data processed at a given instance.
- Window Size: Defines the number of past data points considered for our prediction.
- Prediction Steps: Indicates the number of data points to skip before reaching the data point we aim to predict.
- Dropout Rate: Determines the proportion of nodes in the model set to zero.
Lastly, we’ll designate the columns intended for our model input: close, volume, trades, and naturally, the column containing the price we aim to predict: close.
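To make the window arithmetic concrete before we get to the actual code: with window_size = 14 and prediction_steps = 7, each training sample is built like this (a small worked example, anticipating the create_sequences function further down):
i = 0  # first sample
feature_rows = list(range(i, i + window_size))        # rows 0..13 form the input window
target_row = i + window_size + prediction_steps - 1   # row 20 holds the close price we want to predict
print(feature_rows[0], feature_rows[-1], target_row)  # 0 13 20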
Data Normalisation
Another essential step involves splitting the data into training and test datasets. Initially, we normalize the data, a crucial process that not only accelerates computations but also enhances the overall quality of the model.
from sklearn.preprocessing import StandardScaler

# start with a small sample for now; as soon as we have enough computing power, we can skip this step
df_sampled = df[features].head(1000).copy()

# scaler = MinMaxScaler() # MinMax would work, too, but a stock price has no real "min/max values", except the 0 ;)
scaler = StandardScaler()

# Extract the selected features and scale them
selected_features = df_sampled[features].values.reshape(-1, len(features))
scaled_features = scaler.fit_transform(selected_features)

# Replace the original features with the scaled features in the DataFrame
df_sampled[features] = scaled_features
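Since we’ll eventually want to read predictions as actual prices again, it helps to know that the fitted scaler stores the statistics needed to undo the transformation. The helper below is a hypothetical convenience (not from the notebook); ‘close’ is the first column in our features list:
def unscale_close(scaled_close, scaler, feature_index=0):
    # StandardScaler keeps per-feature mean_ and scale_; invert z = (x - mean) / scale
    return scaled_close * scaler.scale_[feature_index] + scaler.mean_[feature_index]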
Sliding Window Method
Considering only the current price to predict the subsequent data point might introduce bias and limit the model’s learning ability due to the lack of sufficient context. Relying solely on the preceding data point could lead the model to overly adapt to recent changes, potentially hindering its predictive capabilities for future values.
To mitigate this, we implement the sliding window method, gathering multiple data points together. Additionally, we introduce a prediction gap. Rather than predicting the very next hour’s price, which requires swift reactions and invites overfitting, we shift focus to predicting the price several hours ahead (seven, given our prediction_steps setting). This longer-term prediction enables informed decisions regarding selling or buying.
Consequently, we generate two NumPy arrays: one containing the features (X) and the other (y) holding all the labels/targets, representing the prices we aim to predict.
The mix of upper and lower case might be confusing. Typically, variables representing scalars and vectors use lowercase names, while matrices and tensors are denoted by uppercase names.
import numpy as np

def create_sequences(data, window_size, prediction_steps, features, label):
    X = []
    y = []
    for i in range(len(data) - window_size - prediction_steps + 1):
        # the window of past data points that serves as model input
        sequence = data.iloc[i:i + window_size][features].values
        # the single closing price we want to predict, `prediction_steps` after the window ends
        target = data.iloc[i + window_size + prediction_steps - 1][label]
        X.append(sequence)
        y.append(target)
    return np.array(X), np.array(y)
X, y = create_sequences(df_sampled, window_size, prediction_steps, features, target)
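A quick look at the shapes helps to confirm that the windows were built as intended. With 1,000 sampled rows, a window of 14 and a gap of 7 we expect 1000 - 14 - 7 + 1 = 980 samples:
print(X.shape, y.shape)  # expected: (980, 14, 3) and (980,)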
Splitting and Batching
Now, we’ll divide the data into a training and a test dataset, which is a common step in machine learning. Next, we’ll address some technical requirements by organizing the data into a helper object. This object will then supply batches of data points for the training process.
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

# SPLITTING (no shuffling, because the order of time series data matters)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# BATCHING
X_train_tensor = torch.Tensor(X_train)
y_train_tensor = torch.Tensor(y_train)
X_test_tensor = torch.Tensor(X_test)
y_test_tensor = torch.Tensor(y_test)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
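Another cheap sanity check is to pull a single batch from the DataLoader and look at its shape:
inputs, targets = next(iter(train_dataloader))
print(inputs.shape, targets.shape)  # expected: torch.Size([32, 14, 3]) and torch.Size([32])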
Validation
To ensure that no data was lost during the process, we can easily reverse the procedure. I won’t be posting the code here; you can find it in the notebook. Additionally, I’ve included a small section that generates synthetic data to replicate this entire data wrangling process. We’ll revisit this later, so please remember to keep it in mind.
The initial 14 lines encompass our features derived from the ‘close’, ‘volume’, and ‘trades’ columns. Remember how we set the sliding window size to 14? Notably, 7 steps later, at position 21 (according to our prediction steps), we find a closing price marked as 21: this serves as our target!
Furthermore, I’ve implemented several functions to revert the normalization, splitting, and batching processes. These aid us in verifying that the data preparation was executed accurately. The DataFrame presented here showcases the input we supplied to the ‘DataLoader’.
The first three columns represent the features, and the fourth column, labeled as ‘close,’ acts as the target closing price or ‘label’ that our model endeavors to predict. Although it reads as 2.999-something rather than precisely three due to rounding errors, it seems we’re moving in the right direction.
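The notebook’s synthetic-data cell isn’t reproduced in this article, but a minimal sketch of the idea could look like this (my own stand-in, assuming the values simply count up from 1 so the window/target alignment is easy to eyeball):
# synthetic stand-in: every column just counts 1, 2, 3, ... so window and target are easy to verify
synthetic_df = pd.DataFrame({
    'close': np.arange(1, 101, dtype=float),
    'volume': np.arange(1, 101, dtype=float),
    'trades': np.arange(1, 101, dtype=float),
})
X_syn, y_syn = create_sequences(synthetic_df, window_size, prediction_steps, features, target)
print(X_syn[0][-1], y_syn[0])  # last row of the first window is [14. 14. 14.], its target close is 21.0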
Init Models, Loss Function and Optimizer
The training loop
To train a model, you’ll need three main components: a model, a loss function, and an optimizer:
- The model does the prediction work. In the very first loop (aka epoch) it usually starts with totally random parameters. Remember the X from above? It takes the X (also called input or features) and calculates the y, also called the output, targets or labels.
- The loss function calculates the difference between the predicted output and the actual output, which is also part of the training data set.
- The optimizer takes the loss and decides how to adapt the model’s parameters.
- And then everything starts over until we reach the loop limit (num_epochs).
The model class
We’ll begin experimenting with the LSTM (Long Short-Term Memory) approach, which is a type of recurrent neural network (RNN). Unlike a typical neural network that processes data through layers to generate an output, a recurrent neural network can incorporate past information. This ability is facilitated by its memory cells. Here’s the model class:
import torch.nn as nn

class PricePredictionLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size  # Size of the hidden state in the LSTM
        self.num_layers = num_layers  # Number of LSTM layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)  # LSTM layer
        self.fc = nn.Linear(hidden_size, output_size)  # Fully connected layer for output prediction

    def forward(self, input_data):
        # Initialize hidden and cell states for the LSTM
        initial_hidden = torch.zeros(self.num_layers, input_data.size(0), self.hidden_size).to(input_data.device)
        initial_cell = torch.zeros(self.num_layers, input_data.size(0), self.hidden_size).to(input_data.device)

        # Forward propagate through the LSTM
        lstm_output, _ = self.lstm(input_data, (initial_hidden, initial_cell))  # Output shape: (batch_size, seq_length, hidden_size)

        # Pass the output of the last time step through the fully connected layer
        last_time_step_output = lstm_output[:, -1, :]  # Extract the output from the last time step
        output = self.fc(last_time_step_output)  # Output shape: (batch_size, output_size)
        return output
When inheriting from nn.Module, two methods must be defined: __init__ and forward. The __init__ method sets up the model’s structure; in this instance, it configures the LSTM portion and a linear layer for the output. The forward method handles the computations: it receives the input data (our features), passes it through the model’s modules, and returns the output.
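A quick way to verify that __init__ and forward fit together is to push a random batch through an untrained model and check the output shape:
untrained_model = PricePredictionLSTM(input_size=len(features), hidden_size=hidden_units, num_layers=num_layers)
dummy_batch = torch.randn(batch_size, window_size, len(features))  # (batch_size, seq_length, num_features)
print(untrained_model(dummy_batch).shape)  # expected: torch.Size([32, 1])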
The loss function
To measure the loss we calculate the mean squared error (MSE). Sounds complicated, but basically it’s just the difference between model output and actual values, squared and divided by the number of data points:
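In other words, MSE = mean((prediction - actual)²). In code, a hand-rolled version and PyTorch’s nn.MSELoss give the same number (a tiny example with made-up values):
predictions = torch.tensor([2.5, 3.0, 3.5])
actuals = torch.tensor([3.0, 3.0, 3.0])
mse_manual = ((predictions - actuals) ** 2).mean()   # squared differences, averaged
mse_builtin = nn.MSELoss()(predictions, actuals)
print(mse_manual.item(), mse_builtin.item())         # both ≈ 0.1667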
There are several options available for calculating the loss. Selecting the appropriate loss function is a crucial aspect of fine-tuning your modeling process and is dependent on the goal and the nature of your data. For instance, in image classification, the Cross Entropy Loss is commonly utilized.
The optimizer
The optimizer determines how to adjust the parameters of the model. There are various optimizers to choose from, depending on the specific use case. In simplified terms, here’s what an optimizer does: it modifies the parameters of the model based on the learning rate and the gradients of the loss.
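Stripped of all refinements, a single optimizer update is just a step against the gradient; AdamW, which we’ll use below, adds adaptive per-parameter step sizes and weight decay on top of this basic idea. Here is a toy sketch with one made-up parameter:
w = torch.nn.Parameter(torch.tensor([1.0]))   # a single toy parameter
loss = (w * 3.0 - 6.0).pow(2).sum()           # a toy loss, minimal at w = 2
loss.backward()                               # fills in w.grad
with torch.no_grad():
    w -= learning_rate * w.grad               # the core move of every optimizer: adjust against the gradient
print(w.item())                               # slightly closer to 2 than before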
Run the loop
And this is how we initialize all of them before we start the loop:
model_1 = PricePredictionLSTM(input_size=len(features), hidden_size=hidden_units, num_layers=num_layers)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model_1.parameters(), lr=learning_rate)
And if the model is the brain, this is the heart of our work: “The Loop”. The steps are always the same, following the “unofficial optimization loop song”:
- Activate Training Mode: Set the model to training mode if your model includes components like dropout or batch normalization that behave differently during training and inference.
- Calculate Predictions: Pass input data through the model to obtain predictions.
- Calculate Loss: Compute the loss between the predictions and the actual targets.
- Reset Gradients to Zero: Clear the gradients from the previous iteration using optimizer.zero_grad().
- Backward Propagation (Calculate Gradients): Call loss.backward() to compute the gradients of the loss with respect to the model parameters.
- Update Parameters: Use the optimizer (optimizer.step()) to update the model's parameters based on the computed gradients.
- Start over…
The following code also contains a couple of side features: it keeps track of the loss values, measures the time, and, last but not least, validates the model against our test data set!
The crucial aspect here is the use of torch.inference_mode(). When your model isn’t in training mode, it should operate within this context so that no gradients are tracked. Occasionally, you might encounter torch.no_grad(), which serves essentially the same purpose but is the older implementation.
We additionally compute RMSE as a metric. Like the loss, it is based on the Mean Squared Error (MSE); we just take the square root afterwards. It’s not strictly necessary, as it may not offer additional insights beyond the loss values.
import math
import time
from tqdm import tqdm
from sklearn.metrics import mean_squared_error

start = time.time()
start_epoch = 0 if len(epoch_count) == 0 else epoch_count[-1]  # helpful if you re-run this particular cell

for epoch in tqdm(range(start_epoch, start_epoch + num_epochs)):  # tqdm is our progress bar wrapper
    model_1.train()  # activate training mode

    # handle loss monitoring
    total_train_loss = 0.0
    all_train_targets = []
    all_train_outputs = []

    # process batches in the training dataloader
    for batch_idx, (inputs, targets) in enumerate(train_dataloader):
        optimizer.zero_grad()  # reset gradients
        outputs = model_1(inputs)  # calculate predictions
        loss = loss_fn(outputs.squeeze(), targets)  # calculate the loss
        loss.backward()  # backward propagation
        optimizer.step()  # update parameters

        total_train_loss += loss.item()
        all_train_targets.extend(targets.numpy())
        all_train_outputs.extend(outputs.detach().numpy())

    # scheduler.step()

    model_1.eval()  # activate eval mode

    # handle loss monitoring
    total_test_loss = 0.0
    all_test_targets = []
    all_test_outputs = []

    # process batches in the testing dataloader
    for i, (inputs, targets) in enumerate(test_dataloader):
        with torch.inference_mode():  # activate inference mode / no grad
            outputs = model_1(inputs)  # calculate predictions
            loss = loss_fn(outputs.squeeze(), targets)  # calculate loss

            # monitor loss
            total_test_loss += loss.item()
            all_test_targets.extend(targets.numpy())
            all_test_outputs.extend(outputs.detach().numpy())

    # calculate average epoch losses
    average_epoch_train_loss = total_train_loss / len(train_dataloader)
    average_epoch_test_loss = total_test_loss / len(test_dataloader)

    # calculate RMSE
    train_rmse = math.sqrt(mean_squared_error(all_train_targets, all_train_outputs))
    test_rmse = math.sqrt(mean_squared_error(all_test_targets, all_test_outputs))

    # VISUALIZE
    epoch_count.append(epoch)
    train_loss_values.append(average_epoch_train_loss)
    test_loss_values.append(average_epoch_test_loss)
    train_rmse_values.append(train_rmse)
    test_rmse_values.append(test_rmse)

    # LOG
    if epoch % int(num_epochs / 10) == 0 or epoch == num_epochs - 1:
        current_lr = optimizer.param_groups[0]['lr']  # current learning rate (works with or without a scheduler)
        print(f"Epoch [{epoch + 1}/{start_epoch + num_epochs}], "
              f"Train Loss: {average_epoch_train_loss:.4f} | "
              f"Test Loss: {average_epoch_test_loss:.4f} | "
              f"Train RMSE: {train_rmse:.4f} | "
              f"Test RMSE: {test_rmse:.4f} | "
              f"Current LR: {current_lr:.8f} | "
              f"Duration: {time.time() - start:.0f} seconds")

duration += time.time() - start
Wait… stop… what’s that? The learning rate scheduler?
If you’ve thoroughly read through the loop, you might have noticed a commented-out command: scheduler.step(). What does it do?
The learning rate typically remains constant and dictates how much the algorithm adjusts the model’s parameters in each step. A scheduler assists in dynamically altering this step size during training. While not always necessary, experimenting with a scheduler can be beneficial, potentially improving results within a shorter duration.
This is how you initialize it:
from torch.optim import lr_scheduler

scheduler = lr_scheduler.StepLR(optimizer, step_size=learning_rate_step_size, gamma=learning_rate_gamma)  # adjust step_size and gamma as needed
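To see what the scheduler actually does to the learning rate, here is a small, self-contained sketch with a dummy parameter (the demo_* names are mine, not from the notebook); in the real loop you would simply un-comment scheduler.step() at the end of each epoch:
# with step_size=5 and gamma=0.9, the learning rate shrinks by 10% every 5 epochs
dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = torch.optim.AdamW([dummy_param], lr=learning_rate)
demo_scheduler = lr_scheduler.StepLR(demo_optimizer, step_size=learning_rate_step_size, gamma=learning_rate_gamma)

for demo_epoch in range(15):
    demo_optimizer.step()    # in the real loop, a full epoch of training happens here
    demo_scheduler.step()    # advance the scheduler once per epoch
    print(demo_epoch, demo_scheduler.get_last_lr()[0])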
Run it!
Now we are ready to run everything and see the training/optimization process in action. To get a first impression of how well everything works, I suggest starting with the dummy dataset that we defined above. I will not post the plotting code here. This is how your plot should look:
The result is not too bad, and that within 27 seconds. But considering the source data, it’s not a surprise.
Let’s run our real-world data through the model. This will not take much longer, and the result after 100 epochs looks, well… a little off:
Let’s be realistic: you can’t expect substantial results after just 4 minutes of training on a small dataset like this on the first attempt.
This process involves extensive hyperparameter tuning and selecting a network architecture. Now, let’s explore another approach: GRU, which stands for Gated Recurrent Unit, is another type of recurrent neural network. In contrast to LSTM, GRU features a simplified architecture: GRU employs two gates to manage data flow (reset and update), while LSTM utilizes three gates (input, forget, and output). LSTM is better suited for handling complex dependencies, while GRU offers reduced computational complexity. This is our model class:
class PricePredictionGRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward propagate through the GRU
        out, _ = self.gru(x, h0)  # out: tensor of shape (batch_size, seq_length, hidden_size)

        # Pass the output of the last time step through the fully connected layer
        out = self.fc(out[:, -1, :])  # Output shape: (batch_size, output_size)
        return out
As you can see, it’s the exact same structure as before. That’s because PyTorch provides “templates” for the models and we just need to feed them correctly.
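Swapping architectures is then just a matter of instantiating the other class; the loss function and the training loop stay untouched. Just remember to create a fresh optimizer so it points at the new model’s parameters:
model_1 = PricePredictionGRU(input_size=len(features), hidden_size=hidden_units, num_layers=num_layers)
optimizer = torch.optim.AdamW(model_1.parameters(), lr=learning_rate)  # new optimizer for the new parameters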
This is the result after 200 epochs:
Despite running for almost twice as long (which somewhat contradicts the promised reduction in computational requirements), the results appear noticeably improved. However, they still aren’t sufficient for accurate stock price prediction.
Next steps and final thoughts
Firstly, let me provide an example showcasing the common flaw in many ‘price prediction’ discussions: they often focus on predicting the immediate next data point without considering the data points ahead. This approach simplifies the model’s task significantly. For instance, if presented with today’s closing price, predicting tomorrow’s close becomes relatively easy due to the typical range within which prices fluctuate.
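To see why a prediction gap of 1 flatters the model, consider a naive persistence baseline (a hypothetical check, not part of the notebook): simply ‘predicting’ that each hour’s close equals the previous hour’s close is already hard to beat on this kind of data.
import math
from sklearn.metrics import mean_squared_error

# naive baseline: predict each hour's (scaled) close with the previous hour's (scaled) close
naive_pred = df_sampled['close'].shift(1).dropna()
actual = df_sampled['close'].iloc[1:]
naive_rmse = math.sqrt(mean_squared_error(actual, naive_pred))
print(f"Persistence baseline RMSE (scaled): {naive_rmse:.4f}")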
The following chart shows the prediction when setting the prediction gap to 1, meaning predicting the very next data point:
That is the result after 300 epochs for a sample size of 1,000 after 5 minutes of training. Looks tempting, right? As soon as you try to look further into the future, it gets messy. I increased the outlook window to prediction_steps = 7:
Here I increased the sample size for the validation data to 10,000:
It’s not terrible; at least the curve somewhat aligns with the general trend. Ultimately, the quality of your model depends on the computational power allocated and the chosen hyperparameters.
You’re all set with the basics and the code. Feel free to give it a try on your own. And if you happen to strike it rich, just a reminder: I’ve got a soft spot for a well-made flat white!