Hi all, I’m training CBOW using PyTorch. I have a million sentences, and they generate more than ten million training points. Each sentence contributes a variable number of data points, and I cannot store all of them in memory.
Therefore, I wanted to implement something like the following:
    from torch.utils.data import Dataset

    class DataGenerator(Dataset):
        def __init__(self, sentences):
            self.sents = sentences

        def __len__(self):
            # One item per sentence, not per training point
            return len(self.sents)

        def __getitem__(self, index):
            sentence = self.sents[index]
            # TODO: return a batch of size (len(sentence), _) rather than (1, _)
So my question is: is there a way to modify __getitem__ so that it returns a variable-sized batch instead of a single item of the batch?
For example, my first batch may have size 10 (because the first sentence has length 10) but the second batch may have size 30 (because the second sentence has length 30) and so on… In TensorFlow Keras, I was using data generators and they were working fine. I can’t figure out the equivalent in PyTorch.
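To make the question concrete, here is a minimal sketch of one way this might be done, not a definitive answer. It relies on passing batch_size=None to DataLoader, which disables automatic batching so that whatever __getitem__ returns is yielded as a batch on its own. The class name SentenceCBOWDataset, the window and pad_id parameters, and the assumption that sentences are already lists of integer token ids are all hypothetical choices for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SentenceCBOWDataset(Dataset):
    """Hypothetical dataset: one item = all CBOW pairs of one sentence."""

    def __init__(self, sentences, window=2, pad_id=0):
        # Assumption: each sentence is a list of integer token ids,
        # and pad_id is reserved for padding at sentence boundaries.
        self.sents = sentences
        self.window = window
        self.pad_id = pad_id

    def __len__(self):
        # Number of sentences, which is also the number of batches per epoch
        return len(self.sents)

    def __getitem__(self, index):
        sent = self.sents[index]
        w = self.window
        padded = [self.pad_id] * w + list(sent) + [self.pad_id] * w
        contexts, targets = [], []
        for i, target in enumerate(sent):
            # sent[i] sits at padded[i + w]; take w tokens on each side
            left = padded[i:i + w]
            right = padded[i + w + 1:i + 2 * w + 1]
            contexts.append(left + right)
            targets.append(target)
        # Shapes: (len(sent), 2 * window) and (len(sent),)
        return (torch.tensor(contexts, dtype=torch.long),
                torch.tensor(targets, dtype=torch.long))


# Toy data standing in for the real corpus (made up for this sketch)
sentences = [[5, 8, 2, 9], [3, 7], [4, 6, 1, 2, 8, 5]]

# batch_size=None turns off automatic batching: each dataset item
# is passed through as-is, so batch size varies with sentence length.
loader = DataLoader(SentenceCBOWDataset(sentences), batch_size=None, shuffle=False)

batch_shapes = []
first_context = None
for contexts, targets in loader:
    if first_context is None:
        first_context = contexts[0].tolist()
    batch_shapes.append((tuple(contexts.shape), tuple(targets.shape)))
```

With this setup, training points are generated on the fly one sentence at a time, so only the raw sentences need to fit in memory. If several sentences per step are wanted, an alternative would be to keep a regular batch_size and write a custom collate_fn that concatenates the per-sentence tensors with torch.cat.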