This report presents experiments on binary sentiment classification using the IMDB movie review dataset with Word2Vec embeddings. All experiments were conducted across 5 runs with different random seeds, reporting mean accuracy ± standard deviation.

Architectural Choices

Fixed parameters: Learning rate 1e-3, batch size 64, epochs 10, min_df=0.0005, max_df=0.5

Hidden Layers

Architecture                      Accuracy (mean ± std)
Linear: 300→2                     0.8309 ± 0.0005
1 layer: 300→(128)→2              0.8598 ± 0.0009
2 layers: 300→(256→128)→2         0.8611 ± 0.0007
3 layers: 300→(256→128→64)→2      0.8613 ± 0.0005
Bottleneck: 300→(10)→2            0.8545 ± 0.0010

Adding the first hidden layer provides a 2.9-point improvement, but additional layers offer diminishing returns (<0.2-point differences). Even a bottleneck of only 10 neurons achieves 85.45%, suggesting the task requires only a compact internal representation.
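All of the table's variants differ only in their hidden-layer sizes, so they can be produced by one builder. A minimal sketch assuming PyTorch (the report does not name its framework; `build_mlp` is a hypothetical helper):

```python
import torch
import torch.nn as nn

def build_mlp(hidden=(128,), in_dim=300, out_dim=2):
    """Build the table's MLP variants: hidden=() gives the linear model,
    hidden=(256, 128) the 2-layer one, hidden=(10,) the bottleneck.
    PyTorch is an assumption; the report does not name its framework."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

# A batch of 4 averaged 300-dim review vectors -> 2 class logits.
logits = build_mlp((128,))(torch.randn(4, 300))
print(tuple(logits.shape))  # (4, 2)
```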

Activation Functions

Activation     Accuracy (mean ± std)
ReLU           0.8598 ± 0.0009
Tanh           0.8595 ± 0.0005
LeakyReLU      0.8607 ± 0.0009

Activation function choice has negligible impact (a 0.12-point spread), within the run-to-run standard deviation.


Learning Configurations

Fixed parameters: Architecture 300→128→2 (ReLU), min_df=0.0005, max_df=0.5

Learning Rate

Learning Rate    Accuracy (mean ± std)
0.0001           0.8437 ± 0.0006
0.0005           0.8577 ± 0.0012
0.001            0.8598 ± 0.0009
0.01             0.8609 ± 0.0010

Learning rate shows the clearest impact: 0.0001 underperforms by 1.7 points due to insufficient convergence within 10 epochs. The 0.001–0.01 range works well.
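As a concrete sketch of the training step these sweeps vary (PyTorch and the Adam optimizer are assumptions; the report names neither):

```python
import torch
import torch.nn as nn

# Architecture 300->128->2 and lr=1e-3 come from the fixed parameters
# above; PyTorch and Adam are assumptions -- the report names neither.
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative update on a random batch of 64 review vectors.
x, y = torch.randn(64, 300), torch.randint(0, 2, (64,))
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
loss_value = float(loss)
optimizer.step()
```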

Batch Size

Batch Size    Accuracy (mean ± std)
16            0.8613 ± 0.0005
32            0.8611 ± 0.0011
64            0.8598 ± 0.0009
128           0.8587 ± 0.0012
256           0.8566 ± 0.0005

Smaller batches perform slightly better (~0.5 points) because they yield more gradient updates per epoch, but the effect is minor.
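The "more gradient updates per epoch" point can be read directly off the loader. A small sketch assuming PyTorch's DataLoader and the 25,000-review IMDB train split:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 25,000 training reviews (the IMDB train split), each a 300-dim vector.
ds = TensorDataset(torch.randn(25000, 300), torch.randint(0, 2, (25000,)))

# Gradient updates per epoch = number of batches the loader yields.
updates = {bs: len(DataLoader(ds, batch_size=bs)) for bs in (16, 64, 256)}
print(updates)  # {16: 1563, 64: 391, 256: 98}
```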

Epochs

Epochs    Accuracy (mean ± std)
5         0.8566 ± 0.0008
10        0.8598 ± 0.0009
15        0.8618 ± 0.0003
20        0.8623 ± 0.0004

Accuracy improves steadily up to 20 epochs with no overfitting observed for this simple architecture.


Dictionary Cutoff Frequencies

Fixed parameters: Architecture 300→128→2 (ReLU), lr=1e-3, batch size 64, epochs 10

Minimum Document Frequency

min_df    Vocab Size    Missing Embeddings    Accuracy (mean ± std)
0.0001    35,827        5,624                 0.8598 ± 0.0008
0.0005    15,862        1,145                 0.8598 ± 0.0009
0.001     10,430        502                   0.8587 ± 0.0007
0.002     6,441         186                   0.8572 ± 0.0008

Despite a 6× difference in vocabulary size, accuracy varies by only 0.26 points. Lower min_df admits more words without Word2Vec coverage, but the averaged-embedding approach is robust to this noise.
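The robustness follows from how the document vectors are built: words missing from the Word2Vec table are simply skipped in the average. A minimal sketch, where the toy 2-dim vectors stand in for the real 300-dim table and `featurize` is a hypothetical helper:

```python
import numpy as np

# Stand-in for the pretrained Word2Vec table (real vectors are 300-dim).
emb = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
dim = 2

def featurize(tokens, emb, dim):
    """Average the embeddings of in-vocabulary tokens; tokens dropped by
    min_df/max_df or missing from Word2Vec are skipped, which is why
    vocabulary-size changes barely move the resulting vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(featurize(["good", "movie", "zzz"], emb, dim))  # [0.5 0.5]
```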

Maximum Document Frequency

max_df    Vocab Size    Accuracy (mean ± std)
0.3       15,827        0.8589 ± 0.0007
0.5       15,862        0.8598 ± 0.0009
0.7       15,877        0.8588 ± 0.0003
0.9       15,883        0.8588 ± 0.0011

max_df has a negligible effect (~0.1-point spread; only ~56 words separate the 0.3 and 0.9 vocabularies). Very frequent words contribute little to the averaged embeddings.


Additional Experiments

Word2Vec vs Random Embeddings

Embedding Type    Accuracy (mean ± std)
Word2Vec          0.8582 ± 0.0027
Random            0.7425 ± 0.0006

Word2Vec outperforms random embeddings by 11.6 percentage points—the largest effect observed. This explains why architectural choices matter little: the pretrained embeddings already encode rich semantic information, making the neural network’s task straightforward.

Overfitting Detection

Model        Epochs    Best Accuracy      Best Epoch    Final Accuracy
Simple       50        0.8639 ± 0.0004    40.2 ± 8.8    0.8600 ± 0.0035
Simple       100       0.8639 ± 0.0004    40.2 ± 8.8    0.8543 ± 0.0029
Deep/Wide    30        0.8622 ± 0.0007    11.4 ± 4.4    0.8506 ± 0.0038
Deep/Wide    50        0.8622 ± 0.0007    11.4 ± 4.4    0.8412 ± 0.0029

The simple model (300→128→2) peaks around epoch 40 and declines by ~1 point by epoch 100. The deep model (300→512→256→128→2) peaks around epoch 11 and declines by ~2 points by epoch 50. Simpler models overfit more slowly and reach higher peak accuracy.
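This peak-then-decline pattern is what early stopping exploits: keep the checkpoint with the best validation accuracy rather than the final one. A framework-independent sketch of the "Best Accuracy / Best Epoch" computation (`best_checkpoint` is a hypothetical helper):

```python
def best_checkpoint(val_accuracies):
    """Return (best_epoch, best_accuracy) from a per-epoch validation
    curve -- the quantities in the Best Epoch / Best Accuracy columns.
    Epochs are numbered from 1."""
    best_idx = max(range(len(val_accuracies)), key=lambda e: val_accuracies[e])
    return best_idx + 1, val_accuracies[best_idx]

# Toy curve that rises to a peak and then declines, like the deep model.
curve = [0.80, 0.84, 0.862, 0.858, 0.851]
print(best_checkpoint(curve))  # (3, 0.862)
```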


Conclusion

Key findings:

  1. Pretrained embeddings dominate: Word2Vec improves over random embeddings by 11.6 percentage points, which explains why architecture matters so little.
  2. Complexity provides minimal benefit: a single hidden layer achieves nearly the same accuracy as deeper networks while overfitting more slowly.
  3. Learning rate is the most sensitive hyperparameter: too low a value (0.0001) costs 1.7 points, while all other hyperparameters shift accuracy by <0.6 points.
  4. Dictionary cutoffs are negligible: vocabulary size varies 6× with only a 0.26-point accuracy change.

Recommended configuration: Single hidden layer (128 neurons), ReLU, learning rate 0.001, batch size 32–64, 30–50 epochs with early stopping → ~86% accuracy.