Fine-Tuning Language Models
Course Project for Natural Language Processing (Graduate Level)
Fall 2025
Overview
This project explores fine-tuning strategies for two types of language models: an encoder-only model (BERT) for sentiment classification and an encoder-decoder model (T5) for natural language to SQL translation. The project investigates challenges in model generalization, robustness to out-of-distribution data, and conditional generation of structured outputs.
Part 1: Fine-Tuning BERT for Sentiment Classification
Task
Fine-tuned BERT on the IMDB dataset to perform sentiment analysis, achieving >91% accuracy on the test set.
Data Transformations & Robustness Testing
To evaluate model robustness, I designed realistic out-of-distribution (OOD) transformations that simulate real-world test-time inputs:
Transformation Strategy:
- Typo Addition: Created a keyboard-proximity map to introduce realistic typos (e.g., ‘a’ → ‘q’, ‘w’, ‘s’, ‘z’) at a controlled rate
- Synonym Replacement: Used WordNet to replace words with synonyms while maintaining semantic meaning and implementing case-matching to prevent errors
- Letter-to-Number Substitution: Converted visually similar characters (e.g., O→0, t→7) to test perceptual robustness
These transformations were carefully ordered (synonyms → typos → letter-to-number) to ensure the data remained human-readable while testing model robustness.
Results:
- Original test accuracy: 91.1%
- Transformed test accuracy: ~85% (4-6% decrease)
- The performance drop demonstrated the model’s sensitivity to realistic input variations
Data Augmentation Implementation
Approach: Augmented the training set with 5,000 transformed examples using the same transformation pipeline.
Impact:
- Transformed test set: Accuracy improved from ~85% → 89% (4% gain)
- Original test set: Minimal performance decrease (0.1%, potentially negligible)
This demonstrates that targeted data augmentation can significantly improve robustness to OOD data with minimal cost to in-distribution performance.
Key Implementation:
- Implemented custom
create_augmented_dataloaderfunction to combine original and transformed training data - Designed transformation pipeline with tunable rates for each transformation type
- Balanced augmentation to maintain readability while improving model robustness
Part 2: Fine-Tuning T5 for Natural Language to SQL
Task
Fine-tuned T5-small to translate natural language flight booking queries into executable SQL statements, achieving ≥65 F1 score on structured output generation.
Example:
- Input: “Show me flights from Boston to San Francisco”
- Output:
SELECT * FROM flight WHERE from_airport = 'BOS' AND to_airport = 'SFO'
Implementation Contributions
Working from a skeleton codebase, I implemented:
Evaluation Pipeline:
- Designed and implemented complete evaluation loop for conditional generation
- Integrated three metrics: Record F1, Record Exact Match, and SQL Query Exact Match
- Implemented database execution to validate SQL query correctness
Hyperparameter Tuning:
- Systematically explored learning rates, batch sizes, and stopping criteria
- Optimized for F1 score on development set before final test evaluation
- Documented ablation studies showing impact of different design choices
Data Processing & Experimentation:
- Experimented with T5 tokenization strategies for structured SQL output
- Analyzed error patterns across query types and complexity levels
- Conducted detailed error analysis identifying common failure modes
Technical Stack
Models: BERT (encoder-only), T5-small (encoder-decoder)
Frameworks: PyTorch, Hugging Face Transformers
Datasets: IMDB (sentiment), Custom flight booking natural language to SQL
Tools: WordNet (for synonym replacement), SQL database evaluation
Key Takeaways
- OOD Robustness: Models can be brittle to realistic input variations; data augmentation is an effective mitigation strategy
- Structured Generation: Conditional generation requires careful evaluation beyond string matching
- Design Choices Matter: Systematic experimentation with data processing, tokenization, and architecture decisions significantly impacts performance
- Trade-offs: Improving robustness to transformed data can be achieved with minimal performance cost on original data