Transformer Models
Attention-based architectures for time-series representation learning.
TST
Transformer-based model with masked reconstruction pretraining. Uses BasicEncodingMixin for inference.
- class chronocratic.models.transformer.tst.model.TST(input_dims: int, sequence_length: int, hidden_dims: int = 64, num_heads: int = 8, depth: int = 3, feedforward_dims: int = 256, dropout_rate: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', *, freeze: bool = False, learning_rate: float = 0.001, lr_step: tuple[int, ...] | None = None, lr_factor: float = 0.1, weight_decay: float = 0.0, global_reg: bool = False, sync_dist: bool = False, augmentation: Callable | None = None)
Bases: LightningModule, BasicEncodingMixin
PyTorch Lightning module for the Time Series Transformer (TST).
Representation-learning model trained with a masked-reconstruction pretraining objective. The same model supports both random-mask imputation and structured-mask transduction pretraining — the masking strategy is configured upstream in the dataloader and is transparent to the model.
- Batch format expected from the DataLoader:
(X, targets, target_masks, padding_masks, IDs)
where target_masks marks the positions whose reconstruction is scored, and padding_masks marks valid (non-padded) timesteps.
forward(x, padding_masks) returns transformer representations of shape (batch, seq_len, hidden_dims), not the masked-reconstruction output. The reconstruction head is internal and used only during pretraining.
For downstream classification / regression, use SupervisedModule from chronocratic.models.supervised.
This implementation is based on the code in the GitHub repository gzerveas/mvts_transformer, released under the MIT License.
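To make the roles of the two masks concrete, the sketch below computes a mean squared error over only those positions that are flagged in target_masks and fall on a valid timestep in padding_masks. It is an illustration, not the loss TST uses: the function name masked_mse is hypothetical, plain Python lists stand in for tensors, and the per-element mask shape [seq_len][input_dims] is an assumption that this reference does not confirm.

```python
def masked_mse(pred, target, target_mask, padding_mask):
    """Mean squared error over scored, non-padded positions of one sequence.

    pred, target, target_mask: nested lists of shape [seq_len][input_dims]
        (assumed layout, not taken from the library).
    padding_mask: list of length seq_len, True for valid timesteps.
    """
    total, count = 0.0, 0
    for t, valid in enumerate(padding_mask):
        if not valid:
            continue  # padded timestep: never contributes to the loss
        for d, scored in enumerate(target_mask[t]):
            if scored:
                total += (pred[t][d] - target[t][d]) ** 2
                count += 1
    # Guard against a sequence with no scored positions.
    return total / count if count else 0.0


# Toy sequence: 3 timesteps, 2 channels; the last timestep is padding.
target = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
pred = [[1.0, 0.0], [2.0, 4.0], [9.0, 9.0]]
target_mask = [[False, True], [True, False], [True, True]]
padding_mask = [True, True, False]

loss = masked_mse(pred, target, target_mask, padding_mask)
```

Only two positions are scored here. The third timestep is flagged in target_mask but is padding, so its large error is ignored.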
- forward(x: Tensor, padding_masks: Tensor) → Tensor
Return transformer representations of shape (batch, seq_len, hidden_dims).
- get_representations(x: Tensor, padding_masks: Tensor) → Tensor
Run the transformer trunk, skipping the reconstruction output layer.
- reconstruct(x: Tensor, padding_masks: Tensor) → Tensor
Run the full backbone, including the reconstruction output layer.
Used during masked-reconstruction pretraining; downstream callers should use forward / get_representations instead.
- training_step(batch: tuple, _batch_idx: int) → Tensor
Compute and log the masked-reconstruction training loss for one batch.
- validation_step(batch: tuple, _batch_idx: int) → Tensor
Compute and log the masked-reconstruction validation loss for one batch.
- configure_gradient_clipping(optimizer: Optimizer, gradient_clip_val: float | None = None, gradient_clip_algorithm: str | None = None) → None
Clip gradients by global norm to stabilise training.
- configure_optimizers() → OptimizerLRSchedulerConfig
Return Adam optimizer with MultiStepLR scheduler.
- property encoder: Module
Return the transformer encoder for inspection and checkpointing.
- property representation_dim: int
Flattened representation size handed to a downstream head.
- Returns:
hidden_dims * sequence_length, the number of features per sample after flattening the (batch, seq_len, hidden_dims) representation.
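The arithmetic behind representation_dim can be checked with a small stand-in. The sketch below flattens one sample of shape [seq_len][hidden_dims], written as nested Python lists, and confirms that the result has hidden_dims * sequence_length entries. The helper name flatten_representation and the toy sizes are illustrative assumptions, not part of the library.

```python
def flatten_representation(rep):
    """Flatten one sample of shape [seq_len][hidden_dims] into a flat list."""
    return [value for timestep in rep for value in timestep]


# Toy sizes chosen for readability; the class defaults are larger.
sequence_length, hidden_dims = 4, 3

# One sample of per-timestep representations, filled with placeholder values.
sample = [[float(t * hidden_dims + d) for d in range(hidden_dims)]
          for t in range(sequence_length)]

flat = flatten_representation(sample)
representation_dim = hidden_dims * sequence_length
```

With the default hidden_dims = 64 and, for example, sequence_length = 100, the same product gives 6400 input features for a downstream head.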
Configuration for the TST (Time Series Transformer) model.
Provides TSTModelParameters with all settings for the transformer backbone used during masked-reconstruction pretraining.
- class chronocratic.models.transformer.tst.config.TSTModelParameters(*, input_dims: int, sequence_length: int, hidden_dims: int = 64, num_heads: int = 8, depth: int = 3, feedforward_dims: int = 256, dropout_rate: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', freeze: bool = False, learning_rate: float = 0.001, lr_step: tuple[int, ...] | None = None, lr_factor: float = 0.1, weight_decay: float = 0.0, global_reg: bool = False, sync_dist: bool = False)
Bases: object
Configuration for the TST model.
- Parameters:
input_dims – Number of input features (channels) in the time series.
sequence_length – Maximum sequence length supported by the positional encoding.
hidden_dims – Transformer model (token) dimensionality.
num_heads – Number of attention heads.
depth – Number of stacked transformer encoder layers.
feedforward_dims – Hidden dimensionality of the transformer feed-forward block.
dropout_rate – Dropout probability used throughout the transformer.
pos_encoding – Positional-encoding type (e.g. 'fixed' or 'learnable') passed to the encoder.
activation – Activation function name passed to the transformer feed-forward block.
norm – Normalization layer name ('BatchNorm' or 'LayerNorm') used inside the encoder.
freeze – When True, freezes the backbone weights and only trains the output layer.
learning_rate – Base learning rate for the Adam optimizer.
lr_step – Milestones (in epochs) for the MultiStepLR scheduler. None means no decay (defaults to a single far-future milestone internally).
lr_factor – Multiplicative decay factor applied at each lr_step milestone.
weight_decay – L2 regularization coefficient. Applied to the output layer only when global_reg=False, or to all parameters (via optimizer weight decay) when global_reg=True.
global_reg – Whether weight_decay is applied globally as weight decay (True) or only to the output layer (False).
sync_dist – Whether to synchronize logged metrics across distributed processes.
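How learning_rate, lr_step, and lr_factor interact can be sketched without PyTorch. The helper below (lr_at_epoch, a hypothetical name) follows the closed-form rule of a MultiStepLR-style schedule: the base rate is multiplied by lr_factor once for every milestone already reached. It treats lr_step=None as an empty milestone list, which has the same effect as a milestone that is never reached; the placeholder value the library uses internally is not documented here.

```python
from bisect import bisect_right


def lr_at_epoch(epoch, learning_rate=0.001, lr_step=None, lr_factor=0.1):
    """Learning rate in effect at `epoch` under a multi-step decay schedule."""
    # Assumption: None is modelled as "no milestones", i.e. no decay.
    milestones = sorted(lr_step) if lr_step else []
    # Count the milestones that are <= epoch; each one applies the factor once.
    reached = bisect_right(milestones, epoch)
    return learning_rate * lr_factor ** reached


# Decay by 10x at epochs 2 and 4, starting from the default base rate.
schedule = [lr_at_epoch(epoch, lr_step=(2, 4)) for epoch in range(6)]

# With lr_step=None the rate stays at its base value.
constant = lr_at_epoch(100)
```

In this example the rate stays at the base value for epochs 0 and 1, drops by a factor of 10 at epoch 2, and drops by another factor of 10 at epoch 4.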