KFold vs Monte Carlo cross validation


I wrote two pieces of code that create a new training and validation set for each epoch during training. I used two methods to do that.

  1. I used sklearn's train_test_split without providing a seed to create the two datasets. This constitutes a Monte Carlo method of selection.
  2. I used sklearn's KFold to get my splits up front. Then I used itertools.cycle to cycle through the splits at the end of each epoch to get my two datasets.

Here are the two methods:

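import itertools

import numpy as np
from sklearn.model_selection import KFold, train_test_split

def tts_dataset(ds, split_pct=0.2):
  # No random_state is passed, so every call draws a fresh random split
  train_idxs, val_idxs = train_test_split(np.arange(len(ds)), test_size=split_pct)
  return ds.select(train_idxs), ds.select(val_idxs)

def kfold(ds, n=5):
  # Endless generator that cycles through the n fixed (unshuffled) folds
  idxs = itertools.cycle(KFold(n).split(np.arange(len(ds))))
  for train_idxs, val_idxs in idxs:
    yield ds.select(train_idxs), ds.select(val_idxs)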

ds is a Dataset object from Hugging Face's datasets library. Here are my setup and on_epoch_end methods:

  def setup(self, stage=None):
    dss = datasets.load_from_disk(self.data_params.dataset_dir)
    with open(self.data_params.dataset_dir/'idx2auth.pkl', 'rb') as f:
      self.idx2auth = pickle.load(f)
    self.auth2idx = {v:k for k,v in enumerate(self.idx2auth)}
    self.tokenizer = AutoTokenizer.from_pretrained(self.model_params.model_name, use_fast=False)
    self.ds = dss['train']
    self.test_ds = dss['test']
    self.ds.set_format(type='pt', columns=['input_ids', 'attention_mask', 'labels'])       
    self.test_ds.set_format(type='pt', columns=['input_ids', 'attention_mask', 'labels'])
    if self.model_params.val_type == 'kfold':
      self.kfold_splits = kfold(self.ds, n=int(1/self.model_params.val_pct))
      self.train_ds, self.val_ds = next(self.kfold_splits)
    else:
      self.train_ds, self.val_ds = tts_dataset(self.ds, self.model_params.val_pct) # first split

  def on_epoch_end(self):
    if self.model_params.val_type == 'kfold':
      self.train_ds, self.val_ds = next(self.kfold_splits)
    elif self.model_params.val_type == 'monte-carlo':    
      self.train_ds, self.val_ds = tts_dataset(self.ds, self.model_params.val_pct)

I’d like to first know whether these methods actually make sense to others and whether they would do what I want them to do.

In my experiments, both work, with the validation accuracy reaching the high 90s. In particular, with the Monte Carlo method, the model eventually sees almost all of the data in the original training set, because the training and validation sets are re-drawn at random on every epoch, so an example held out in one epoch can land in the training set in another.
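
The coverage claim above can be checked on bare indices. This is a minimal sketch, not the code from the training run: `monte_carlo_split` is a hypothetical stand-in for `train_test_split` built on the standard library's `random`, and the dataset size, validation fraction, epoch count, and seed are arbitrary assumptions.

```python
import random


def monte_carlo_split(n_items, val_pct, rng):
    """Hypothetical stand-in for train_test_split: shuffle indices, hold out val_pct."""
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    n_val = int(round(n_items * val_pct))
    return idxs[n_val:], idxs[:n_val]  # (train indices, validation indices)


def coverage_per_epoch(n_items=1000, val_pct=0.2, n_epochs=5, seed=0):
    """Fraction of all indices that have appeared in a training split so far."""
    rng = random.Random(seed)
    seen = set()
    coverage = []
    for _ in range(n_epochs):
        # A fresh, independent split is drawn on every epoch
        train_idxs, _ = monte_carlo_split(n_items, val_pct, rng)
        seen.update(train_idxs)
        coverage.append(len(seen) / n_items)
    return coverage


if __name__ == "__main__":
    for epoch, frac in enumerate(coverage_per_epoch(), start=1):
        print(f"epoch {epoch}: {frac:.1%} of examples seen in training")
```

With a validation fraction p, a given example is left out of training in k independent splits in a row with probability p^k, which for p = 0.2 and k = 5 is about 0.03%. After a handful of epochs, each new validation set therefore consists almost entirely of examples the model has already trained on.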

In the case of K-fold, if the number of epochs is greater than the number of folds, the splits start to repeat and the model begins seeing already-seen examples again. This leads me to my second question: for runs where k << number of epochs, does it essentially become similar to the Monte Carlo method?
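
To see what cycling through the folds does across epochs, here is a minimal sketch on bare indices. `kfold_indices` is a hypothetical pure-Python stand-in for `KFold(n).split(...)` with its default of no shuffling, and the sizes are arbitrary assumptions chosen to keep the output small.

```python
import itertools


def kfold_indices(n_items, n_folds):
    """Hypothetical stand-in for unshuffled KFold: contiguous validation folds."""
    fold_sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, val
        start += size


def cycled_kfold_report(n_items=10, n_folds=5, n_epochs=7):
    """Per epoch: (validation indices, share of them already used for training)."""
    splits = itertools.cycle(list(kfold_indices(n_items, n_folds)))
    seen = set()
    report = []
    for _ in range(n_epochs):
        train_idxs, val_idxs = next(splits)
        already_seen = sum(i in seen for i in val_idxs) / len(val_idxs)
        report.append((val_idxs, already_seen))
        seen.update(train_idxs)
    return report


if __name__ == "__main__":
    for epoch, (val_idxs, share) in enumerate(cycled_kfold_report(), start=1):
        print(f"epoch {epoch}: val={val_idxs}, already trained on: {share:.0%}")
```

In this sketch the folds form a fixed rotation: the validation fold of epoch k + 1 is the same as that of epoch 1, and from the second epoch onward every validation example has already appeared in a training split. The Monte Carlo variant differs in that its splits are random and overlap from one epoch to the next, but it shares that second property after a few epochs.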

In both of the above cases, as I mentioned, the validation accuracy reaches the high 90s. However, the test accuracy is what is expected (75-85%) based on the pretrained model used. So for both methods, the end-of-training validation accuracy does not correspond to the eventual test accuracy.

Another method is to just set a hold-out validation set that never changes. In this case, the validation accuracy is similar to the test accuracy. This seems to indicate that this method is perhaps more useful because the validation accuracy actually gives me a sense of what the test accuracy would be.
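
A fixed hold-out split can be sketched as follows. This is an illustration under assumptions, not the code used in the experiments: `fixed_holdout_split` is a hypothetical helper built on the standard library, and the seed and sizes are arbitrary. With sklearn, the same effect should be obtainable by passing an integer `random_state` to `train_test_split` and calling it only once.

```python
import random


def fixed_holdout_split(n_items, val_pct=0.2, seed=42):
    """Split indices once with a fixed seed; the same seed always gives the same split."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    n_val = int(round(n_items * val_pct))
    # Sorted only to make the result easier to inspect
    return sorted(idxs[n_val:]), sorted(idxs[:n_val])


# Computed once (for example in setup) and never reassigned between epochs,
# so validation examples are never used for training.
train_idxs, val_idxs = fixed_holdout_split(100)
```

Because the validation indices never enter a training split, the validation set plays the same role as the test set, which is consistent with the two accuracies being close.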

This leads me to my final question: Amongst the 3 methods of validation discussed above, what is a good approach to take?

Thank you.

Hello, my apologies for the late reply. We are gradually deprecating this forum in favor of the built-in GitHub version. Could we kindly ask you to recreate your question there: Lightning Discussions