TrainingState
- class fgvc.core.training.training_state.TrainingState(model: Module, path: str = '.', *, ema_model: Module | None = None, optimizer: Optimizer, scheduler: ReduceLROnPlateau | CosineLRScheduler | CosineAnnealingLR | None = None, resume: bool = False, device: device | None = None)
Class for logging scores, tracking the best scores, and saving checkpoints when new best scores are achieved.
- Parameters:
model – PyTorch neural network.
path – Experiment directory for saving training outputs such as checkpoints and logs.
optimizer – Optimizer instance whose state is saved so that training can be resumed after an interruption.
scheduler – Scheduler instance whose state is saved so that training can be resumed after an interruption.
resume – If True, resumes the run from a checkpoint, restoring the optimizer and scheduler state.
device – Device to use (cpu, 0, 1, 2, …).
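A minimal construction sketch based on the signature above; the network, optimizer, scheduler, and the ./runs/exp1 experiment directory are illustrative placeholders, not part of the fgvc API.

```python
import torch
import torch.nn as nn

from fgvc.core.training.training_state import TrainingState

# Placeholder components - substitute your own model, optimizer, and scheduler.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

training_state = TrainingState(
    model,
    path="./runs/exp1",  # hypothetical experiment directory for checkpoints and logs
    optimizer=optimizer,  # keyword-only; its state is stored so training can be resumed
    scheduler=scheduler,
    device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
)
```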
- _save_checkpoint(epoch: int, metric_name: str, metric_value: float)
Save checkpoint to .pth file and log score.
- Parameters:
epoch – Epoch number.
metric_name – Name of the metric (e.g. loss) on which checkpoint saving is based.
metric_value – Value of the metric on which checkpoint saving is based.
- finish()
Log the best scores achieved during training and save a checkpoint of the last epoch.
Call this method after training of all epochs is complete (see the training-loop sketch after step() below).
- resume_training()
Resume the training state from the checkpoint.pth.tar file stored in the experiment directory.
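A hedged resumption sketch, reusing the placeholder components from the construction example above; it assumes a checkpoint.pth.tar already exists in the experiment directory from a previous run.

```python
# Resume an interrupted run: passing resume=True restores the saved training
# state (optimizer and scheduler included) from checkpoint.pth.tar.
training_state = TrainingState(
    model,
    path="./runs/exp1",  # same experiment directory that holds checkpoint.pth.tar
    optimizer=optimizer,
    scheduler=scheduler,
    resume=True,
)

# Alternatively, restore the state explicitly on an existing instance:
# training_state.resume_training()
```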
- step(epoch: int, scores_str: str, valid_loss: float, valid_metrics: dict | None = None)
Log scores and track the best loss and metrics.
Save a checkpoint whenever a new best loss or metric is achieved, and save the training state for resuming if an optimizer and scheduler were passed.
Call this method after training and validation of each epoch.
- Parameters:
epoch – Epoch number.
scores_str – Validation scores to log.
valid_loss – Validation loss based on which checkpoint is saved.
valid_metrics – Other validation metrics based on which checkpoint is saved.
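A sketch of one way a training loop might call step() and finish(); the train_one_epoch and validate helpers, the metric names, and the scores_str format are illustrative assumptions, not part of the fgvc API.

```python
num_epochs = 10

for epoch in range(1, num_epochs + 1):
    # Hypothetical helpers - replace with your own training/validation routines.
    train_loss = train_one_epoch(model, optimizer)
    valid_loss, valid_metrics = validate(model)  # e.g. {"accuracy": 0.91, "f1": 0.88}

    scheduler.step()

    training_state.step(
        epoch=epoch,
        scores_str=f"train_loss={train_loss:.4f} valid_loss={valid_loss:.4f}",
        valid_loss=valid_loss,
        valid_metrics=valid_metrics,  # optional; new best metrics also trigger checkpoints
    )

# After the last epoch, log the best scores and save the last-epoch checkpoint.
training_state.finish()
```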