Transformer Models

Attention-based architectures for time-series representation learning.

TST

Transformer-based model with masked reconstruction pretraining. Uses BasicEncodingMixin for inference.

class chronocratic.models.transformer.tst.model.TST(feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', *, freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)

Bases: LightningModule, BasicEncodingMixin

PyTorch Lightning module for the Time Series Transformer (TST).

Representation-learning model trained with a masked-reconstruction pretraining objective. The same model supports both random-mask imputation and structured-mask transduction pretraining — the masking strategy is configured upstream in the dataloader and is transparent to the model.

Batch format expected from the DataLoader:

(X, targets, target_masks, padding_masks, IDs)

where target_masks marks the positions whose reconstruction is scored, and padding_masks marks valid (non-padded) timesteps.
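
To make the two mask roles concrete, here is a minimal sketch rather than the library's own code. The helper names `padding_mask_from_lengths` and `masked_mse` are hypothetical, and the tensor shapes (X, targets and target_masks as (batch, seq_len, feat_dim); padding_masks as (batch, seq_len)) are assumptions that the text above does not state.

```python
import torch


def padding_mask_from_lengths(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Boolean mask of shape (batch, max_len); True marks valid (non-padded) timesteps."""
    positions = torch.arange(max_len).unsqueeze(0)  # (1, max_len)
    return positions < lengths.unsqueeze(1)         # (batch, max_len)


def masked_mse(pred: torch.Tensor, target: torch.Tensor, target_masks: torch.Tensor) -> torch.Tensor:
    """Mean squared error over the positions selected by target_masks only."""
    squared_error = (pred - target) ** 2
    return squared_error[target_masks].mean()


# Two sequences padded to length 3; the second has a single valid timestep.
lengths = torch.tensor([3, 1])
padding_masks = padding_mask_from_lengths(lengths, max_len=3)

# One sequence, two timesteps, two features (assumed layout).
target = torch.tensor([[[1.0, 2.0], [3.0, 4.0]]])
pred = torch.zeros_like(target)
target_masks = torch.tensor([[[True, False], [False, True]]])

# Only the two masked positions contribute to the loss.
loss = masked_mse(pred, target, target_masks)
```

Positions where target_masks is False do not affect the loss, which is why the same model can be pretrained with either masking strategy.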

forward(x, padding_masks) returns transformer representations of shape (batch, seq_len, d_model), not the masked-reconstruction output. The reconstruction head is internal and used only during pretraining.

For downstream classification / regression, use SupervisedModule from chronocratic.models.supervised.

This implementation is based on the code in the GitHub repository gzerveas/mvts_transformer, released under the MIT License.

forward(x: Tensor, padding_masks: Tensor) -> Tensor

Return transformer representations of shape (batch, seq_len, d_model).

get_representations(x: Tensor, padding_masks: Tensor) -> Tensor

Run the transformer trunk, skipping the reconstruction output layer.

reconstruct(x: Tensor, padding_masks: Tensor) -> Tensor

Run the full backbone, including the reconstruction output layer.

Used during masked-reconstruction pretraining; downstream callers should use forward / get_representations instead.

training_step(batch: tuple, _batch_idx: int) -> Tensor

Compute and log the masked-reconstruction training loss for one batch.

validation_step(batch: tuple, _batch_idx: int) -> Tensor

Compute and log the masked-reconstruction validation loss for one batch.

configure_gradient_clipping(optimizer: Optimizer, gradient_clip_val: float | None = None, gradient_clip_algorithm: str | None = None) -> None

Clip gradients by global norm to stabilise training.
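
As an illustration of what clipping by global norm means, the sketch below works on plain Python lists instead of tensors. The function name `clip_by_global_norm` and the threshold `max_norm=1.0` are hypothetical; the threshold this model actually uses is not stated here. The scaling rule mirrors the one used by `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm.

    grads is a list of per-parameter gradient lists. Returns the (possibly
    rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # A single shared factor keeps the gradient direction unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm


# Two parameters with gradients 3.0 and 4.0 have a joint norm of 5.0.
clipped, total_norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Gradients whose joint norm is already below the threshold pass through unchanged, because the scale factor is capped at 1.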

configure_optimizers() -> OptimizerLRSchedulerConfig

Return an Adam optimizer with a MultiStepLR scheduler.

property encoder: Module

Return the transformer encoder for inspection and checkpointing.

property representation_dim: int

Flattened representation size handed to a downstream head.

Returns:

d_model * max_seq_len: the number of features after flattening the (batch, seq_len, d_model) representation.
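
The flattening arithmetic can be checked with a small sketch. It assumes the sequence-length factor is the constructor's max_seq_len, and the helper names `flattened_dim` and `flatten` are hypothetical, not part of the library.

```python
def flattened_dim(d_model: int, seq_len: int) -> int:
    """Number of features after flattening a (seq_len, d_model) representation."""
    return d_model * seq_len


def flatten(representation):
    """Flatten one (seq_len, d_model) nested list into a single feature vector."""
    return [value for timestep in representation for value in timestep]


# A toy representation with 3 timesteps and d_model = 4.
d_model, seq_len = 4, 3
single = [[float(t * d_model + k) for k in range(d_model)] for t in range(seq_len)]
features = flatten(single)

# With the default d_model = 64 and an illustrative max_seq_len of 128,
# a downstream head would receive 64 * 128 input features.
default_dim = flattened_dim(64, 128)
```

Because the size depends on the sequence length, a head built on this value expects inputs padded to a fixed length.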

Configuration for the TST (Time Series Transformer) model.

Provides TSTModelParameters with all settings for the transformer backbone used during masked-reconstruction pretraining.

class chronocratic.models.transformer.tst.config.TSTModelParameters(*, feat_dim: int, max_seq_len: int, d_model: int = 64, n_heads: int = 8, num_layers: int = 3, dim_feedforward: int = 256, dropout: float = 0.1, pos_encoding: str = 'fixed', activation: str = 'gelu', norm: str = 'BatchNorm', freeze: bool = False, learning_rate: float = 0.001, lr_step: list[int] | None = None, lr_factor: float = 0.1, l2_reg: float = 0.0, global_reg: bool = False, sync_dist: bool = False)

Bases: object

Configuration for the TST model.

Parameters:
  • feat_dim – Number of input features (channels) in the time series.

  • max_seq_len – Maximum sequence length supported by the positional encoding.

  • d_model – Transformer model (token) dimensionality.

  • n_heads – Number of attention heads.

  • num_layers – Number of stacked transformer encoder layers.

  • dim_feedforward – Hidden dimensionality of the transformer feed-forward block.

  • dropout – Dropout probability used throughout the transformer.

  • pos_encoding – Positional-encoding type (e.g. 'fixed' or 'learnable') passed to the encoder.

  • activation – Activation function name passed to the transformer feed-forward block.

  • norm – Normalization layer name ('BatchNorm' or 'LayerNorm') used inside the encoder.

  • freeze – When True, freezes the backbone weights and trains only the output layer.

  • learning_rate – Base learning rate for the Adam optimizer.

  • lr_step – Milestones (in epochs) for the MultiStepLR scheduler. None means no decay (defaults to a single far-future milestone internally).

  • lr_factor – Multiplicative decay factor applied at each lr_step milestone.

  • l2_reg – L2 regularization coefficient. Applied to the output layer only when global_reg=False, or to all parameters (via optimizer weight decay) when global_reg=True.

  • global_reg – Whether l2_reg is applied globally as weight decay (True) or only to the output layer (False).

  • sync_dist – Whether to synchronize logged metrics across distributed processes.
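
How learning_rate, lr_step and lr_factor interact can be sketched in plain Python. The function `lr_at_epoch` is hypothetical and only reproduces the closed-form rule of a multi-step decay schedule: the rate is multiplied by lr_factor once for every milestone that has been reached. Treating lr_step=None as an empty milestone list is a simplification of the far-future milestone described above.

```python
from bisect import bisect_right


def lr_at_epoch(epoch, learning_rate=0.001, lr_step=None, lr_factor=0.1):
    """Learning rate in effect at a given epoch under a multi-step decay."""
    milestones = sorted(lr_step) if lr_step else []
    # Count the milestones at or before this epoch; each one applies lr_factor once.
    reached = bisect_right(milestones, epoch)
    return learning_rate * lr_factor ** reached


# Illustrative milestones at epochs 2 and 4.
schedule = [lr_at_epoch(epoch, lr_step=[2, 4]) for epoch in range(6)]

# Without milestones the rate stays at its base value.
constant = [lr_at_epoch(epoch) for epoch in range(6)]
```

With the defaults and milestones at epochs 2 and 4, the rate stays at the base value for the first two epochs, drops by a factor of ten at epoch 2, and drops by another factor of ten at epoch 4.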