Transformer Models

Attention-based architectures for time-series representation learning.

TST

Transformer-based model with masked reconstruction pretraining. Uses BasicEncodingMixin for inference.

class chronocratic.models.transformer.tst.model.TST(feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', *, freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)

Bases: LightningModule, BasicEncodingMixin

PyTorch Lightning module for the Time Series Transformer (TST).

Representation-learning model trained with a masked-reconstruction pretraining objective. The same model supports both random-mask imputation and structured-mask transduction pretraining — the masking strategy is configured upstream in the dataloader and is transparent to the model.
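Because masking is applied in the dataloader, a mask generator can be written without touching the model. The following is a minimal sketch of random-mask imputation using only the standard library. The helpers `make_random_mask` and `apply_mask`, the `mask_ratio` parameter, and the choice of masking each value independently are assumptions made for illustration; they are not part of chronocratic, and a real pipeline may mask contiguous spans instead.

```python
import random


def make_random_mask(seq_len, feat_dim, mask_ratio=0.15, seed=None):
    """Return a seq_len x feat_dim grid of booleans; True marks a masked value.

    Hypothetical helper: each position is masked independently with
    probability mask_ratio.
    """
    rng = random.Random(seed)
    return [
        [rng.random() < mask_ratio for _ in range(feat_dim)]
        for _ in range(seq_len)
    ]


def apply_mask(series, mask, fill_value=0.0):
    """Replace masked values with fill_value, leaving the rest untouched."""
    return [
        [fill_value if masked else value for value, masked in zip(row, mask_row)]
        for row, mask_row in zip(series, mask)
    ]


# Toy series: 8 timesteps, 3 channels.
series = [[float(t + c) for c in range(3)] for t in range(8)]
mask = make_random_mask(seq_len=8, feat_dim=3, mask_ratio=0.25, seed=0)
masked_input = apply_mask(series, mask)
```

The model would receive `masked_input`, while the unmasked `series` serves as the reconstruction target and `mask` selects the positions to score.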

Batch format expected from the DataLoader: (X, targets, target_masks, padding_masks, IDs)

where target_masks marks the positions whose reconstruction is scored, and padding_masks marks valid (non-padded) timesteps.
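To make the role of target_masks concrete, the sketch below scores a reconstruction only at the flagged positions. It is a plain-Python illustration, not the library's loss: the function name `masked_mse`, the use of mean squared error, and the assumption that target_masks has the same (batch, seq_len, feat_dim) layout as the targets are all hypothetical.

```python
def masked_mse(predictions, targets, target_masks):
    """Mean squared error over positions where target_masks is True."""
    squared_error = 0.0
    scored = 0
    for pred_seq, tgt_seq, mask_seq in zip(predictions, targets, target_masks):
        for pred_step, tgt_step, mask_step in zip(pred_seq, tgt_seq, mask_seq):
            for pred, tgt, is_scored in zip(pred_step, tgt_step, mask_step):
                if is_scored:
                    squared_error += (pred - tgt) ** 2
                    scored += 1
    # Guard against a batch with no scored positions.
    return squared_error / scored if scored else 0.0


# One sequence, two timesteps, two features.
targets = [[[1.0, 2.0], [3.0, 4.0]]]
predictions = [[[1.0, 0.0], [5.0, 4.0]]]
target_masks = [[[False, True], [True, False]]]

loss = masked_mse(predictions, targets, target_masks)
```

Only the two flagged values contribute to `loss`; the unflagged positions are ignored even where the prediction differs from the target.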

forward(x, padding_masks) returns transformer representations of shape (batch, seq_len, d_model), not the masked-reconstruction output. The reconstruction head is internal and used only during pretraining.

For downstream classification / regression, use SupervisedModule from chronocratic.models.supervised.

The implementation is based on the code in the GitHub repository gzerveas/mvts_transformer, which is released under the MIT License.

forward(x: Tensor, padding_masks: Tensor) → Tensor

Return transformer representations of shape (batch, seq_len, d_model).

get_representations(x: Tensor, padding_masks: Tensor) → Tensor

Run the transformer trunk, skipping the reconstruction output layer.

reconstruct(x: Tensor, padding_masks: Tensor) → Tensor

Run the full backbone, including the reconstruction output layer.

Used during masked-reconstruction pretraining; downstream callers should use forward / get_representations instead.

training_step(batch: tuple, _batch_idx: int) → Tensor

Compute and log the masked-reconstruction training loss for one batch.

validation_step(batch: tuple, _batch_idx: int) → Tensor

Compute and log the masked-reconstruction validation loss for one batch.

configure_gradient_clipping(optimizer: Optimizer, gradient_clip_val: float | None = None, gradient_clip_algorithm: str | None = None) → None

Clip gradients by global norm to stabilise training.
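Clipping by global norm rescales every gradient by the same factor whenever the joint L2 norm of all gradients exceeds a threshold, which preserves the direction of the update. The sketch below illustrates that rule in plain Python. The function name `clip_by_global_norm` and the threshold value are hypothetical; the source does not state which threshold TST uses.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


# Two parameter tensors, flattened; their joint norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

After clipping, the joint norm equals the threshold and the ratio between individual gradient entries is unchanged.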

configure_optimizers() → OptimizerLRSchedulerConfig

Return Adam optimizer with MultiStepLR scheduler.
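A MultiStepLR schedule multiplies the learning rate by a fixed factor each time a milestone epoch is reached. The sketch below reproduces that rule in plain Python using the `learning_rate`, `lr_step` and `lr_factor` names from the constructor. The helper `multistep_lr` is hypothetical, and it treats `lr_step=None` as an empty milestone list, which is a simplification of whatever the model does internally.

```python
from bisect import bisect_right


def multistep_lr(epoch, learning_rate=0.001, lr_step=None, lr_factor=0.1):
    """Learning rate in effect during `epoch` under a MultiStepLR-style schedule."""
    milestones = sorted(lr_step or [])
    # Every milestone that has been reached multiplies the rate by lr_factor.
    return learning_rate * lr_factor ** bisect_right(milestones, epoch)


# Rates for epochs 0..5 with decay at epochs 2 and 4.
schedule = [multistep_lr(epoch, lr_step=[2, 4]) for epoch in range(6)]
```

With milestones at epochs 2 and 4, the rate stays at its base value for epochs 0 and 1, drops by one factor for epochs 2 and 3, and by two factors from epoch 4 onward.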

property encoder: Module

Return the transformer encoder for inspection and checkpointing.

property representation_dim: int

Flattened representation size handed to a downstream head.

Returns: d_model * max_seq_len, the number of features per sample after flattening the (batch, seq_len, d_model) representation.
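The arithmetic behind this property can be checked with a toy example. The sketch below assumes every sequence is padded to max_seq_len, so that flattening one sample's representation yields d_model * max_seq_len values. The helper names and the value max_seq_len=128 are chosen for illustration only; d_model=64 is the documented default.

```python
def flattened_representation_dim(d_model, max_seq_len):
    """Number of features per sample after flattening (seq_len, d_model)."""
    return d_model * max_seq_len


def flatten_representation(representation):
    """Flatten one (seq_len, d_model) representation into a single feature vector."""
    return [value for timestep in representation for value in timestep]


d_model, max_seq_len = 64, 128  # max_seq_len chosen arbitrarily for this example
representation = [[0.0] * d_model for _ in range(max_seq_len)]
flat = flatten_representation(representation)
expected_dim = flattened_representation_dim(d_model, max_seq_len)
```

A downstream head that consumes the flattened vector would therefore need an input size of `expected_dim`.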

Configuration for the TST (Time Series Transformer) model.

Provides TSTModelParameters with all settings for the transformer backbone used during masked-reconstruction pretraining.

class chronocratic.models.transformer.tst.config.TSTModelParameters(*, feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)

Bases: object

Configuration for the TST model.

Parameters:

feat_dim – Number of input features (channels) in the time series.
max_seq_len – Maximum sequence length supported by the positional encoding.
d_model – Transformer model (token) dimensionality.
n_heads – Number of attention heads.
num_layers – Number of stacked transformer encoder layers.
dim_feedforward – Hidden dimensionality of the transformer feed-forward block.
dropout – Dropout probability used throughout the transformer.
pos_encoding – Positional-encoding type (e.g. 'fixed' or 'learnable') passed to the encoder.
activation – Activation function name passed to the transformer feed-forward block.
norm – Normalization layer name ('BatchNorm' or 'LayerNorm') used inside the encoder.
freeze – When True, freezes the backbone weights and only trains the output layer.
learning_rate – Base learning rate for the Adam optimizer.
lr_step – Milestones (in epochs) for the MultiStepLR scheduler. None means no decay (defaults to a single far-future milestone internally).
lr_factor – Multiplicative decay factor applied at each lr_step milestone.
l2_reg – L2 regularization coefficient. Applied to the output layer only when global_reg=False, or to all parameters (via optimizer weight decay) when global_reg=True.
global_reg – Whether l2_reg is applied globally as weight decay (True) or only to the output layer (False).
sync_dist – Whether to synchronize logged metrics across distributed processes.
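The interaction between l2_reg and global_reg described in the parameter list can be sketched as two small functions: one decides what is added to the loss, the other what is passed to the optimizer as weight decay. This is an illustration under assumptions, not the library's code. The helper names are hypothetical, and the exact form of the output-layer penalty (here a plain sum of squared weights) is assumed.

```python
def l2_penalty(weights):
    """Sum of squared weights (assumed form of the output-layer penalty)."""
    return sum(w * w for w in weights)


def regularized_loss(loss, output_layer_weights, l2_reg=0.0, global_reg=False):
    """Add the explicit output-layer penalty unless weight decay handles it globally."""
    if global_reg:
        # Assumed: optimizer weight decay covers all parameters,
        # so nothing is added to the loss here.
        return loss
    return loss + l2_reg * l2_penalty(output_layer_weights)


def optimizer_weight_decay(l2_reg=0.0, global_reg=False):
    """Weight decay handed to the optimizer: l2_reg when global, otherwise zero."""
    return l2_reg if global_reg else 0.0


local_loss = regularized_loss(1.0, [1.0, 2.0], l2_reg=0.5, global_reg=False)
global_loss = regularized_loss(1.0, [1.0, 2.0], l2_reg=0.5, global_reg=True)
```

In the local case the penalty appears in the loss and the optimizer's weight decay stays at zero; in the global case the loss is untouched and the coefficient moves to the optimizer.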