Transformer Models
Attention-based architectures for time-series representation learning.
TST
Transformer-based model with masked reconstruction pretraining. Uses BasicEncodingMixin for inference.
- class chronocratic.models.transformer.tst.model.TST(feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', *, freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)
Bases: LightningModule, BasicEncodingMixin
PyTorch Lightning module for the Time Series Transformer (TST).
Representation-learning model trained with a masked-reconstruction pretraining objective. The same model supports both random-mask imputation and structured-mask transduction pretraining — the masking strategy is configured upstream in the dataloader and is transparent to the model.
- Batch format expected from the DataLoader:
(X, targets, target_masks, padding_masks, IDs)
where target_masks marks the positions whose reconstruction is scored, and padding_masks marks valid (non-padded) timesteps.
forward(x, padding_masks) returns transformer representations of shape (batch, seq_len, d_model), not the masked-reconstruction output. The reconstruction head is internal and used only during pretraining.
For downstream classification / regression, use SupervisedModule from chronocratic.models.supervised.
This model is based on the code in the GitHub repository gzerveas/mvts_transformer, released under the MIT License.
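As a rough illustration of the role target_masks plays in the batch tuple above, the sketch below computes a mean squared error only over the positions flagged for scoring, using plain Python lists for a single sequence. The helper name masked_mse and the toy values are assumptions made for illustration; the library's own loss may be implemented differently.

```python
def masked_mse(predictions, targets, target_masks):
    """Mean squared error over positions where target_masks is True.

    All arguments are nested lists shaped (seq_len, feat_dim).
    Hypothetical helper for illustration, not part of chronocratic.
    """
    squared_errors = [
        (p - t) ** 2
        for pred_row, tgt_row, mask_row in zip(predictions, targets, target_masks)
        for p, t, m in zip(pred_row, tgt_row, mask_row)
        if m  # only scored positions contribute to the loss
    ]
    if not squared_errors:
        return 0.0  # nothing was masked, so there is nothing to score
    return sum(squared_errors) / len(squared_errors)


# Toy sequence: 3 timesteps, 2 features.
targets = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
predictions = [[1.0, 0.0], [2.0, 4.0], [5.0, 9.0]]
# Two positions are scored; the large error at the last timestep is not.
target_masks = [[False, True], [True, False], [False, False]]

loss = masked_mse(predictions, targets, target_masks)
```

Only the two flagged positions enter the average, so the unscored error at the final timestep has no effect on the result.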
- forward(x: Tensor, padding_masks: Tensor) -> Tensor
Return transformer representations of shape (batch, seq_len, d_model).
- get_representations(x: Tensor, padding_masks: Tensor) -> Tensor
Run the transformer trunk, skipping the reconstruction output layer.
- reconstruct(x: Tensor, padding_masks: Tensor) -> Tensor
Run the full backbone, including the reconstruction output layer.
Used during masked-reconstruction pretraining; downstream callers should use forward / get_representations instead.
- training_step(batch: tuple, _batch_idx: int) -> Tensor
Compute and log the masked-reconstruction training loss for one batch.
- validation_step(batch: tuple, _batch_idx: int) -> Tensor
Compute and log the masked-reconstruction validation loss for one batch.
- configure_gradient_clipping(optimizer: Optimizer, gradient_clip_val: float | None = None, gradient_clip_algorithm: str | None = None) -> None
Clip gradients by global norm to stabilise training.
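For readers unfamiliar with global-norm clipping, the following is a minimal sketch of the idea on plain Python lists: a single L2 norm is computed over all gradients together, and every gradient is scaled by the same factor when that norm exceeds a threshold. The function name clip_by_global_norm and the max_norm value are assumptions for illustration; the threshold the model actually uses is not stated here, and framework implementations typically add a small epsilon to the denominator.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients jointly so their combined L2 norm is at most max_norm.

    gradients is a list of flat lists, one per parameter.
    Returns (clipped_gradients, norm_before_clipping).
    Hypothetical helper for illustration only.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_norm:
        # Already within the budget: return unchanged copies.
        return [list(grad) for grad in gradients], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm


# Two "parameters" whose joint gradient norm is sqrt(3^2 + 4^2) = 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because one scale factor is shared across all parameters, the direction of the overall update is preserved and only its magnitude shrinks.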
- configure_optimizers() -> OptimizerLRSchedulerConfig
Return Adam optimizer with MultiStepLR scheduler.
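To make the effect of a multi-step schedule concrete, the sketch below evaluates the learning rate per epoch in closed form: the base rate is multiplied by the decay factor once for every milestone that has been reached. The argument names mirror the learning_rate, lr_step and lr_factor settings; the function multistep_lr and the milestone values are assumptions for illustration and do not call the library or PyTorch.

```python
from bisect import bisect_right


def multistep_lr(learning_rate, lr_step, lr_factor, epoch):
    """Learning rate at a given epoch under a multi-step decay schedule.

    Hypothetical helper: counts the milestones <= epoch and applies
    lr_factor once per milestone reached.
    """
    milestones_passed = bisect_right(sorted(lr_step), epoch)
    return learning_rate * lr_factor ** milestones_passed


# Illustrative values: decay by 10x at epochs 2 and 4.
schedule = [
    multistep_lr(learning_rate=0.001, lr_step=[2, 4], lr_factor=0.1, epoch=epoch)
    for epoch in range(6)
]
```

With these illustrative values the rate stays at the base value for epochs 0 and 1, drops by a factor of ten at epoch 2, and drops again at epoch 4.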
- property encoder: Module
Return the transformer encoder for inspection and checkpointing.
- property representation_dim: int
Flattened representation size handed to a downstream head.
- Returns:
d_model * max_len, the number of features after flattening the (batch, seq_len, d_model) representation.
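The flattened size can be checked with a small worked example. The sketch below flattens one (seq_len, d_model) representation, held as nested Python lists, into a single feature vector and compares its length with d_model * max_len. The helper flatten_representation and the sequence length of 100 are assumptions for illustration; 64 is the documented default for d_model.

```python
def flatten_representation(representation):
    """Flatten one (seq_len, d_model) representation into a flat feature list.

    Hypothetical helper for illustration only.
    """
    return [value for timestep in representation for value in timestep]


d_model = 64    # documented default
max_len = 100   # illustrative sequence length, not a library default

# One sample's representation: max_len timesteps of d_model features each.
representation = [[0.0] * d_model for _ in range(max_len)]
flat = flatten_representation(representation)

representation_dim = d_model * max_len
```

A downstream head that consumes the flattened representation would therefore need an input width of representation_dim, here 6400.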
Configuration for the TST (Time Series Transformer) model.
Provides TSTModelParameters with all settings for the transformer backbone used during masked-reconstruction pretraining.
- class chronocratic.models.transformer.tst.config.TSTModelParameters(*, feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)
Bases: object
Configuration for the TST model.
- Parameters:
feat_dim – Number of input features (channels) in the time series.
max_seq_len – Maximum sequence length supported by the positional encoding.
d_model – Transformer model (token) dimensionality.
n_heads – Number of attention heads.
num_layers – Number of stacked transformer encoder layers.
dim_feedforward – Hidden dimensionality of the transformer feed-forward block.
dropout – Dropout probability used throughout the transformer.
pos_encoding – Positional-encoding type (e.g. 'fixed' or 'learnable') passed to the encoder.
activation – Activation function name passed to the transformer feed-forward block.
norm – Normalization layer name ('BatchNorm' or 'LayerNorm') used inside the encoder.
freeze – When True, freezes the backbone weights and only trains the output layer.
learning_rate – Base learning rate for the Adam optimizer.
lr_step – Milestones (in epochs) for the MultiStepLR scheduler. None means no decay (defaults to a single far-future milestone internally).
lr_factor – Multiplicative decay factor applied at each lr_step milestone.
l2_reg – L2 regularization coefficient. Applied to the output layer only when global_reg=False, or to all parameters (via optimizer weight decay) when global_reg=True.
global_reg – Whether l2_reg is applied globally as weight decay (True) or only to the output layer (False).
sync_dist – Whether to synchronize logged metrics across distributed processes.
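The interaction between l2_reg and global_reg can be sketched as follows. With global_reg=False, the example adds an explicit penalty on the output-layer weights to the loss; with global_reg=True, the loss is left unchanged because regularization is delegated to the optimizer's weight decay. The function regularized_loss and the sum-of-squares form of the penalty are assumptions for illustration; the exact penalty the library applies is not specified here.

```python
def regularized_loss(base_loss, output_layer_weights, l2_reg, global_reg):
    """Combine a reconstruction loss with an optional output-layer L2 penalty.

    Hypothetical helper: assumes the penalty is l2_reg times the sum of
    squared output-layer weights.
    """
    if global_reg:
        # Regularization is handled by optimizer weight decay on all
        # parameters, so no explicit term is added to the loss.
        return base_loss
    penalty = sum(w * w for w in output_layer_weights)
    return base_loss + l2_reg * penalty


weights = [1.0, 2.0]  # toy output-layer weights, sum of squares = 5
local = regularized_loss(1.0, weights, l2_reg=0.5, global_reg=False)
global_ = regularized_loss(1.0, weights, l2_reg=0.5, global_reg=True)
```

With the default l2_reg of 0.0, both branches return the unmodified loss.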