The issue is that you never reduce the spatial size, which results in huge activations and a huge Linear layer (over 700 million parameters) that blows up your GPU memory usage.
Changing your network structure to the following (to reduce the spatial dimensions) works for me:
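To get a sense of the scale involved (the exact numbers depend on your original model, so these are illustrative assumptions, not your actual layer sizes): if the last conv layer keeps, say, 64 channels at 112x112 resolution with no further pooling, a single Linear layer on top of the flattened activations already has hundreds of millions of parameters.

```python
# Hypothetical numbers for illustration, not the exact original model:
# 64 channels at 112x112 spatial resolution, feeding a Linear(in_features, 1024)
in_features = 64 * 112 * 112        # 802,816 flattened features
params = in_features * 1024 + 1024  # weight matrix + bias of the Linear layer
print(params)                       # 822,084,608 -> ~822 million parameters in one layer
```

This is why halving the spatial size a few times with pooling matters so much: the Linear layer's parameter count scales directly with the flattened feature size.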
self.seq1 = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1),
    nn.ReLU(),  # activation added between the stacked convs
    nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(512, 256, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, 128, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(128, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 32, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
self.dense = nn.Sequential(
    nn.Linear(32 * 14 * 14, 1024),  # 14 = 224 // 2**4 for 224x224 inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(1024, 2),
)
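You can sanity-check where the `32 * 14 * 14` comes from (assuming 224x224 inputs, which the `nn.Linear(32 * 14 * 14, 1024)` above implies): the `kernel_size=3, padding=1` convs leave the spatial size unchanged, and each of the four `MaxPool2d(2)` layers halves it.

```python
# Trace the spatial size through the four MaxPool2d(2) layers,
# assuming 224x224 inputs (the kernel_size=3, padding=1 convs keep H and W fixed)
size = 224
for _ in range(4):       # each MaxPool2d(2) halves the resolution
    size //= 2
print(size)              # 14 -> matches the 32 * 14 * 14 in self.dense
print(32 * size * size)  # 6,272 input features for the first Linear,
                         # vs. 32 * 224 * 224 = 1,605,632 without any pooling
```

If you use a different input resolution, recompute this number (or print the activation shape before the Linear layer) so the `in_features` argument matches.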
Also note: you don't have to create a new instance of torch.nn.CrossEntropyLoss in every iteration. Either create the criterion once and reuse it, or use the functional interface.
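Both options look like this (with dummy logits and targets standing in for your model output and labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 2)            # dummy model output: batch of 4, 2 classes
targets = torch.tensor([0, 1, 1, 0])  # dummy class labels

# Option 1: create the criterion once (e.g. in __init__) and reuse it every step
criterion = nn.CrossEntropyLoss()
loss_a = criterion(logits, targets)

# Option 2: the functional interface, no instance needed
loss_b = F.cross_entropy(logits, targets)

print(torch.allclose(loss_a, loss_b))  # True -> both compute the same loss
```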