This report presents experiments on binary sentiment classification using the IMDB movie review dataset. Neural network models were trained using document representations constructed from pre-trained Word2Vec embeddings. The experiments examine the impact of architectural choices, learning configurations, and vocabulary cutoff parameters on classification accuracy. All experiments were conducted across 5 runs with different random seeds, reporting mean accuracy ± standard deviation.
How Accuracy is Affected by Architectural Choices
This section examines how different neural network architectural choices affect sentiment classification performance on the IMDB dataset.
Variables tested:
- Number of hidden layers: 0, 1, 2, 3 layers
- Neurons per layer: 10 (bottleneck), 128, 256→128, 256→128→64
- Activation functions: ReLU, Tanh, LeakyReLU
The hypothesis is that deeper networks might capture more complex patterns, but too many layers could lead to overfitting or vanishing gradients.
Fixed hyperparameters for fair comparison:
- Learning rate: 1e-3
- Batch size: 64
- Epochs: 10
- Dictionary: min_df=0.0005, max_df=0.5
Number of Hidden Layers
| Architecture | Accuracy (mean ± std) |
|---|---|
| Linear (0 hidden): 300→2 | 0.8309 ± 0.0005 |
| 1 hidden layer: 300→128→2 | 0.8598 ± 0.0009 |
| 2 hidden layers: 300→256→128→2 | 0.8611 ± 0.0007 |
| 3 hidden layers: 300→256→128→64→2 | 0.8613 ± 0.0005 |
| Bottleneck: 300→10→2 | 0.8545 ± 0.0010 |
The results show a clear pattern: adding the first hidden layer provides a significant boost (from 83.09% to 85.98%, roughly 2.9 percentage points), but additional layers offer diminishing returns. The differences among 1, 2, and 3 hidden layers are under 0.2 percentage points, comparable to the run-to-run standard deviation.
The linear model (no hidden layers) performs noticeably worse because it cannot learn non-linear decision boundaries. However, even this simple model achieves 83% accuracy, demonstrating how much information is already encoded in the Word2Vec embeddings.
The bottleneck architecture with only 10 neurons still achieves 85.45%, surprisingly close to larger architectures. This suggests the sentiment classification task can be solved with a very compact internal representation.
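The five architectures in the table can all be built with one small helper; a PyTorch sketch (`make_mlp` is a hypothetical name, not from the original experiments):

```python
import torch.nn as nn

def make_mlp(hidden_sizes, in_dim=300, out_dim=2, act=nn.ReLU):
    """Stack Linear layers with an activation after each hidden layer."""
    layers, prev = [], in_dim
    for h in hidden_sizes:
        layers += [nn.Linear(prev, h), act()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

linear       = make_mlp([])              # 300→2
one_hidden   = make_mlp([128])           # 300→128→2
two_hidden   = make_mlp([256, 128])      # 300→256→128→2
three_hidden = make_mlp([256, 128, 64])  # 300→256→128→64→2
bottleneck   = make_mlp([10])            # 300→10→2
```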
Activation Functions
| Activation | Accuracy (mean ± std) |
|---|---|
| ReLU | 0.8598 ± 0.0009 |
| Tanh | 0.8595 ± 0.0005 |
| LeakyReLU | 0.8607 ± 0.0009 |
The choice of activation function has virtually no impact on final accuracy. The difference between the best (LeakyReLU at 86.07%) and worst (Tanh at 85.95%) is only 0.12%, which is within the standard deviation of the measurements.
This result is expected given the simplicity of the network. For deeper architectures, activation function choice becomes more important due to vanishing gradient issues, but for a single hidden layer, all three activations perform equivalently.
How Different Learning Configurations Affect the Results
This section examines how different hyperparameter configurations impact model training and final accuracy while keeping the architecture constant.
Variables tested:
- Learning rate: 0.0001, 0.0005, 0.001, 0.01
- Batch size: 16, 32, 64, 128, 256
- Epochs: 5, 10, 15, 20
Hypotheses:
- Too low learning rates will converge slowly; too high will cause instability
- Smaller batch sizes provide noisier gradients but may converge faster per epoch
- More epochs allow better convergence but risk overfitting
Fixed parameters for fair comparison:
- Architecture: 1 hidden layer (128 neurons), ReLU activation
- Dictionary: min_df=0.0005, max_df=0.5
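A minimal training loop exposing the three hyperparameters under test might look like the sketch below. The report does not name the optimizer or loss; Adam and cross-entropy are assumptions here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, X, y, lr=1e-3, batch_size=64, epochs=10):
    """Train on averaged-embedding features X (float) and labels y (long)."""
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model
```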
Learning Rate Impact
| Learning Rate | Accuracy (mean ± std) |
|---|---|
| 0.0001 | 0.8437 ± 0.0006 |
| 0.0005 | 0.8577 ± 0.0012 |
| 0.001 | 0.8598 ± 0.0009 |
| 0.01 | 0.8609 ± 0.0010 |
The learning rate shows the clearest impact of any hyperparameter tested. The lowest learning rate (0.0001) significantly underperforms at 84.37%, approximately 1.7 percentage points below the best configuration. This confirms that very low learning rates fail to converge within 10 epochs.
Interestingly, the highest learning rate (0.01) performs best at 86.09%, slightly outperforming 0.001. This contradicts the initial expectation that high learning rates would cause instability. For this simple architecture and dataset, even aggressive learning rates work well.
The practical takeaway is that anything in the 0.001–0.01 range works well, but 0.0001 is too conservative for 10 epochs of training.
Batch Size Impact
| Batch Size | Accuracy (mean ± std) |
|---|---|
| 16 | 0.8613 ± 0.0005 |
| 32 | 0.8611 ± 0.0011 |
| 64 | 0.8598 ± 0.0009 |
| 128 | 0.8587 ± 0.0012 |
| 256 | 0.8566 ± 0.0005 |
Smaller batch sizes perform slightly better, with batch size 16 achieving 86.13% compared to 85.66% for batch size 256. This difference of roughly 0.5 percentage points is small but consistent across runs.
The trend matches expectations: smaller batches perform more gradient updates per epoch, so the model converges further within the fixed 10-epoch budget. With batch size 16, the model performs 1,562 gradient updates per epoch, compared to only 97 with batch size 256.
However, the absolute differences are small enough that batch size selection can be based on computational convenience rather than accuracy optimization.
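The update counts quoted above follow directly from the 25,000-review IMDB training split:

```python
n_train = 25_000  # IMDB training split size
for bs in (16, 32, 64, 128, 256):
    # Integer division, i.e. incomplete final batches are dropped.
    print(f"batch_size={bs:>3}: {n_train // bs} updates/epoch")
```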
Number of Epochs Impact
| Epochs | Accuracy (mean ± std) |
|---|---|
| 5 | 0.8566 ± 0.0008 |
| 10 | 0.8598 ± 0.0009 |
| 15 | 0.8618 ± 0.0003 |
| 20 | 0.8623 ± 0.0004 |
The model continues to improve with more epochs, going from 85.66% at 5 epochs to 86.23% at 20 epochs. Notably, there is no sign of overfitting at 20 epochs—the standard deviation actually decreases (from 0.0008 to 0.0004), suggesting more stable convergence.
This indicates that the model architecture is simple enough that it does not easily memorize the training data. With a deeper or wider network, overfitting would likely occur sooner.
Summary of Learning Configuration Experiments
The most impactful parameter was the learning rate when set too low (0.0001), which cost 1.7 percentage points of accuracy. The other parameters showed differences of 0.3–0.6 points, which, while statistically significant, are minor in practical terms.
Dictionary Cutoff Frequencies
This section investigates how vocabulary cutoff thresholds (min_df and max_df) in the CountVectorizer affect model performance. These parameters control which words are included in the feature vocabulary.
Variables tested:
- min_df (minimum document frequency): 0.0001, 0.0005, 0.001, 0.002
- max_df (maximum document frequency): 0.3, 0.5, 0.7, 0.9
Hypotheses:
- Too low min_df includes noisy rare words that do not generalize
- Too high min_df loses important sentiment-specific words
- max_df controls removal of very common words (stopwords)
Fixed hyperparameters:
- Architecture: 1 hidden layer (128 neurons), ReLU
- Learning rate: 1e-3
- Batch size: 64
- Epochs: 10
Minimum Document Frequency (min_df)
| min_df | Vocab Size | Words Not Found | Accuracy (mean ± std) |
|---|---|---|---|
| 0.0001 | 35,827 | 5,624 | 0.8598 ± 0.0008 |
| 0.0005 | 15,862 | 1,145 | 0.8598 ± 0.0009 |
| 0.001 | 10,430 | 502 | 0.8587 ± 0.0007 |
| 0.002 | 6,441 | 186 | 0.8572 ± 0.0008 |
The min_df parameter dramatically affects vocabulary size (ranging from 6,441 to 35,827 words) but has minimal impact on accuracy. The difference between the best (85.98%) and worst (85.72%) is only 0.26%.
An interesting observation is the correlation between vocabulary size and missing Word2Vec embeddings. With min_df=0.0001, 5,624 words (15.7% of vocabulary) have no embedding, likely because they are rare misspellings or domain-specific terms absent from Google News. With min_df=0.002, only 186 words (2.9%) are missing.
Despite including thousands of words with zero vectors, min_df=0.0001 performs identically to min_df=0.0005. The model appears robust to this noise because rare words contribute little to the averaged document embedding.
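Measuring the "Words Not Found" column reduces to a membership check against the pretrained vectors; `vocab_coverage` is a hypothetical helper (gensim's `KeyedVectors` supports the `in` operator, so a plain dict works the same way for testing):

```python
def vocab_coverage(vocab, word_vectors):
    """Return (count, fraction) of vocabulary words with no pretrained vector."""
    missing = sum(1 for word in vocab if word not in word_vectors)
    return missing, missing / len(vocab)
```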
Maximum Document Frequency (max_df)
| max_df | Vocab Size | Accuracy (mean ± std) |
|---|---|---|
| 0.3 | 15,827 | 0.8589 ± 0.0007 |
| 0.5 | 15,862 | 0.8598 ± 0.0009 |
| 0.7 | 15,877 | 0.8588 ± 0.0003 |
| 0.9 | 15,883 | 0.8588 ± 0.0011 |
The max_df parameter has essentially no effect. Vocabulary size changes by only 56 words across the entire range, and accuracy differences (0.1%) are smaller than the standard deviations.
This is expected: max_df removes words appearing in more than X% of documents. Even at max_df=0.3, only extremely common words like “the”, “a”, and “movie” are removed. Since the model uses averaged Word2Vec embeddings, these high-frequency words contribute little discriminative information.
Summary of Dictionary Cutoff Experiments
| Parameter | Effect on Vocab Size | Effect on Accuracy |
|---|---|---|
| min_df | Large (6× range) | Small (~0.26%) |
| max_df | Minimal (~56 words) | Negligible (~0.1%) |
The dictionary cutoffs matter little when using Word2Vec embeddings. The recommended setting is min_df=0.0005 and max_df=0.5, providing a reasonable vocabulary size (~16,000 words) with good Word2Vec coverage.
Additional Experiments: Embedding Quality and Overfitting
These experiments address two questions raised by the previous results: why do architectural choices have such small effects, and when does overfitting actually occur?
Random vs Word2Vec Embeddings
This experiment tests whether the pretrained Word2Vec embeddings are responsible for the strong baseline performance.
Setup:
- Architecture: 1 hidden layer (128 neurons), ReLU
- Two conditions: Word2Vec embeddings vs random 300-dimensional vectors (normalized)
- All other hyperparameters identical
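The random-embedding control can be generated as unit-norm Gaussian vectors. A sketch follows; the setup above says only "random 300-dimensional vectors (normalized)", so the Gaussian distribution and the helper name `random_embeddings` are assumptions:

```python
import numpy as np

def random_embeddings(vocab, dim=300, seed=0):
    """One random unit-norm vector per vocabulary word."""
    rng = np.random.default_rng(seed)
    emb = rng.standard_normal((len(vocab), dim))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```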
| Embedding Type | Accuracy (mean ± std) |
|---|---|
| Word2Vec | 0.8582 ± 0.0027 |
| Random | 0.7425 ± 0.0006 |
The results are striking: Word2Vec outperforms random embeddings by 11.6 percentage points. This is by far the largest effect observed in any experiment.
With random embeddings, the model achieves only 74.25% accuracy, significantly worse than with meaningful embeddings. The neural network cannot learn effective patterns when the input representation carries no semantic information.
This explains why architectural choices in previous experiments had such small effects. The Word2Vec embeddings already encode rich semantic relationships learned from billions of words of text. The neural network’s task is relatively simple: learn a decision boundary in this well-structured embedding space. Whether 1, 2, or 3 hidden layers are used matters little when the input representation is already so powerful.
The extremely low standard deviation for random embeddings (0.0006 vs 0.0027) is also notable. With meaningless input features, every random initialization converges to essentially the same poor solution.
Extended Training and Overfitting Detection
Previous experiments showed no overfitting at 20 epochs with a simple model. This experiment trains for longer and with larger models to identify when overfitting occurs.
Conditions tested:
| Condition | Architecture | Epochs |
|---|---|---|
| 1 | Simple (300→128→2) | 50 |
| 2 | Simple (300→128→2) | 100 |
| 3 | Deep/Wide (300→512→256→128→2) | 30 |
| 4 | Deep/Wide (300→512→256→128→2) | 50 |
Results:
| Condition | Best Test Acc | Best Epoch | Final Test Acc |
|---|---|---|---|
| Simple, 50 epochs | 0.8639 ± 0.0004 | 40.2 ± 8.8 | 0.8600 ± 0.0035 |
| Simple, 100 epochs | 0.8639 ± 0.0004 | 40.2 ± 8.8 | 0.8543 ± 0.0029 |
| Deep/Wide, 30 epochs | 0.8622 ± 0.0007 | 11.4 ± 4.4 | 0.8506 ± 0.0038 |
| Deep/Wide, 50 epochs | 0.8622 ± 0.0007 | 11.4 ± 4.4 | 0.8412 ± 0.0029 |
Overfitting is clearly present, but manifests differently for the two architectures:
Simple model: Peak performance occurs around epoch 40 (86.39%), then slowly declines. By epoch 100, accuracy drops to 85.43%, a decrease of approximately 1%. The overfitting is gradual and mild.
Deep/wide model: Peak performance occurs much earlier, around epoch 11 (86.22%), then declines more steeply. By epoch 50, accuracy drops to 84.12%, a decrease of 2.1%. The deep model overfits faster and more severely.
Interestingly, the simple model achieves slightly higher peak accuracy (86.39%) than the deep model (86.22%), despite having far fewer parameters. This reinforces the finding that model complexity provides no benefit for this task, and simpler models are less prone to overfitting.
The practical recommendation is to use early stopping: monitor validation accuracy and stop training when it begins to decline. For the simple model, training for 30–50 epochs with early stopping is optimal. For deeper models, 10–15 epochs may be sufficient.
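The early-stopping rule described above can be sketched as a selection over recorded per-epoch validation accuracies; the patience value is an assumption, not taken from the experiments:

```python
def early_stopping_epoch(val_accuracies, patience=5):
    """Pick the epoch to keep: stop once `patience` epochs pass with no new best."""
    best_epoch, best_acc = 0, val_accuracies[0]
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            break  # validation accuracy has stalled or declined
    return best_epoch, best_acc
```

In a real run, one would also snapshot the model weights at each new best epoch and restore them after stopping.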
Conclusion
This study examined neural network performance on IMDB sentiment classification using Word2Vec embeddings. The key findings are:
- Pretrained embeddings are the dominant factor. Word2Vec embeddings outperform random embeddings by 11.6 percentage points, explaining why architectural choices have relatively small effects.
- Model complexity provides minimal benefit. A single hidden layer with 128 neurons achieves nearly the same accuracy as deeper architectures (85.98% vs 86.13%), while being more resistant to overfitting.
- Learning rate is the most sensitive hyperparameter. Setting it too low (0.0001) causes a 1.7-percentage-point accuracy drop, while other parameters show differences under 0.6 points.
- Dictionary cutoffs have negligible impact when using Word2Vec embeddings, as the averaging approach naturally handles both rare and common words.
- Overfitting occurs but is mild with simple architectures. Peak accuracy is reached around epoch 40 for a single-layer model, with gradual decline thereafter.
Recommended configuration: A single hidden layer (128 neurons) with ReLU activation, learning rate 0.001, batch size 32–64, and 30–50 epochs with early stopping. This achieves approximately 86% test accuracy while maintaining simplicity and training stability.
Addendum: Extreme min_df Values
Extending the min_df sweep to extreme values shows where this robustness breaks down: vocabulary size now ranges from 189 to 35,827 words. Within the moderate range (0.0001–0.002), accuracy stays within 0.26 percentage points, but aggressive cutoffs eventually hurt. Surprisingly, with only 981 words (min_df=0.02) the model still achieves 83.76% accuracy; even that tiny vocabulary retains enough sentiment-bearing terms to perform reasonably well. Only at a truly extreme cutoff (189 words) does accuracy drop substantially, to 76.31%.
The model is therefore robust to including rare, noisy words (low min_df) but depends on sufficient vocabulary coverage. The optimal range is min_df=0.0005–0.002, balancing vocabulary size with stable accuracy around 86%.