Different behavior for model checkpoints if last or best | 0 | 173 | July 25, 2023
Getting element 0 error while fine tuning LLM | 3 | 502 | July 17, 2023
Combining loss, predictions in multi GPUs | 3 | 986 | July 9, 2023
How to save new lr hyperparameter after using LRFinder when using wandb | 2 | 474 | July 10, 2023
Any example to launch multiple nodes distributed training with deepspeed strategy? | 2 | 1788 | June 28, 2023
How to use textbooks for fine-tuning LLM | 0 | 421 | June 24, 2023
Data collate_fn makes training process super slow! | 0 | 1040 | June 22, 2023
Using SequentialLR with Step, Epoch and ReduceLROnPlateau | 0 | 583 | June 2, 2023
Finetuning using lit-llama | 3 | 556 | May 24, 2023
Transfer learning | 0 | 221 | May 23, 2023
I am lost on custom batch size definition | 2 | 557 | May 17, 2023
Problem that many symbols are output in val_dataloaders | 2 | 395 | May 6, 2023
Error when predicting from checkpoint | 1 | 775 | May 6, 2023
Does not run validation step after epoch when running with all data | 5 | 1826 | May 1, 2023
Why are my training and validation losses only changing by very little? | 2 | 859 | April 28, 2023
Saving checkpoints and logging models | 1 | 209 | April 28, 2023
Different ways of logging model | 0 | 142 | April 26, 2023
How can we skip a step with NaN loss in the training_step when using Distributed Data Parallel (DDP)? | 1 | 1232 | April 24, 2023
Mac M2 MPS: failed assertion `destination kernel width and filter kernel width mismatch' | 0 | 610 | April 17, 2023
Error on trainer = L.Trainer(max_epochs=2000) | 0 | 289 | April 4, 2023
Custom training - RuntimeError due to unused parameters | 0 | 1583 | April 3, 2023
MLFlowLogger always generates the same run name | 1 | 526 | April 3, 2023
LR Scheduler monitoring multiple metrics | 2 | 680 | April 3, 2023
RAM usage increases quickly over the training step | 2 | 370 | March 30, 2023
Code structuring for text classification with hf bert-uncase | 2 | 412 | March 23, 2023
Use two datasets and distinguish during training | 0 | 149 | March 22, 2023
DeepSpeed: how to execute certain code once? | 0 | 265 | March 22, 2023
How to combine PTL arguments with ArgumentParser | 2 | 1933 | March 22, 2023
Multi GPU - Autolog with multiple runs - lightning2.0 | 2 | 709 | March 22, 2023
Loading saved checkpoint model.model | 2 | 344 | March 16, 2023