Fine-Tuning (LLM)
This module fine-tunes a base language model to extract structured metadata from academic document text. The primary model is LED (Longformer Encoder-Decoder), but other models are also supported.
Run with:
Process Flow
flowchart TD
A[Dataset on disk?] -->|No| B[hugging_face_connection.py\nDownload from HuggingFace]
A -->|Yes| C[Load & Split\ntrain/val/test]
B --> C
C --> D[generate_tokens.py\nTokenize inputs/outputs]
D --> E[model_managment.py\nLoad base model + tokenizer]
E --> F{Use PEFT?}
F -->|Yes| G[peft_configuration.py\nApply LoRA config]
F -->|No| H[Standard fine-tuning]
G --> I[trainer.py\nTrain the model]
H --> I
I --> J[Fine-tuned Model\nSaved to disk]
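The first two steps of the flow (check the disk, download if needed, then split) can be sketched as follows. This is a minimal illustration, not the code in hugging_face_connection.py: the function names, the `repo_id` argument, and the 80/10/10 split ratios are all assumptions.

```python
import random
from pathlib import Path


def ensure_dataset(data_dir, repo_id, token=None):
    """Return the dataset directory, downloading it only when no JSON is on disk.

    `repo_id` is a hypothetical HuggingFace dataset id supplied by the caller.
    """
    path = Path(data_dir)
    if path.is_dir() and any(path.glob("*.json")):
        return path  # dataset already on disk, skip the download
    # Imported lazily so the split helper below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id, repo_type="dataset", local_dir=str(path), token=token
    )
    return path


def split_records(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically and split into (train, val, test) lists."""
    items = list(records)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

Using a fixed seed keeps the three splits reproducible across runs, so evaluation numbers stay comparable.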
Module Structure
| File | Responsibility |
|---|---|
| main.py | Orchestrates the full training pipeline |
| hugging_face_connection.py | Downloads the dataset from HuggingFace if it doesn't exist on disk |
| generate_tokens.py | Converts text + metadata into tokenized inputs/outputs |
| model_managment.py | Loads base models, handles quantization |
| peft_configuration.py | PEFT/LoRA configuration when using parameter-efficient fine-tuning |
| trainer.py | Training: HuggingFace Trainer configuration and also a traditional training loop |
Tokenization Strategies
Two approaches for converting documents to model input:
- Prompt-based: Uses type-specific prompts from constants.py (e.g., PROMPT_TESIS, PROMPT_LIBRO)
- Schema-based: Uses a JSON schema (for the NuExtract model)
Token limits:
- Max input: 2048 tokens
- Max output: 512 tokens
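A minimal sketch of the prompt-based strategy with these limits applied. It assumes a HuggingFace tokenizer is passed in. The prompt text is a placeholder (the real PROMPT_TESIS lives in constants.py), and serializing the metadata as JSON for the target is an assumption. `build_example` is a hypothetical name, not a function from generate_tokens.py.

```python
import json

MAX_INPUT_TOKENS = 2048   # max input length stated above
MAX_OUTPUT_TOKENS = 512   # max output length stated above

# Placeholder text only; the real prompt is defined in constants.py.
PROMPT_TESIS = "Extract the metadata of this thesis as JSON:\n"


def build_example(tokenizer, text, metadata, prompt=PROMPT_TESIS):
    """Tokenize one (document text, metadata) pair for a seq2seq model."""
    # Input side: prompt followed by the document, truncated to the input limit.
    model_inputs = tokenizer(
        prompt + text, max_length=MAX_INPUT_TOKENS, truncation=True
    )
    # Target side: metadata serialized to JSON (assumed output format).
    target = json.dumps(metadata, ensure_ascii=False)
    labels = tokenizer(
        text_target=target, max_length=MAX_OUTPUT_TOKENS, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With a real tokenizer this would be called as `build_example(AutoTokenizer.from_pretrained(name), text, metadata)`, where `name` is whichever base checkpoint is being fine-tuned.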
Supported Models
| Model | Architecture | Optimizations |
|---|---|---|
| LED (default) | Seq2Seq | Long context (16k tokens) |
| LED Large | Seq2Seq | Larger capacity |
| LED Spanish | Seq2Seq | Pre-trained on Spanish |
| LLAMA | Causal | LoRA + 4-bit quantization |
| GEMMA | Causal | LoRA + 4-bit quantization |
| Mistral | Causal | LoRA + 4-bit quantization |
| T5 | Seq2Seq | Standard fine-tuning |
| NuExtract | Schema-based | Schema-guided extraction |
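One way to encode this table in code is a small registry that tells the loader which optimizations to apply. The sketch below is illustrative only: the `ModelSpec` class, the registry keys, and `get_spec` are hypothetical names, and the flags simply mirror the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    architecture: str          # "seq2seq", "causal" or "schema", as in the table
    use_lora: bool = False     # apply the PEFT/LoRA configuration
    load_in_4bit: bool = False # load with 4-bit quantization


# Hypothetical keys; the flags follow the "Optimizations" column.
MODEL_SPECS = {
    "LED": ModelSpec("seq2seq"),
    "LED_LARGE": ModelSpec("seq2seq"),
    "LED_SPANISH": ModelSpec("seq2seq"),
    "LLAMA": ModelSpec("causal", use_lora=True, load_in_4bit=True),
    "GEMMA": ModelSpec("causal", use_lora=True, load_in_4bit=True),
    "MISTRAL": ModelSpec("causal", use_lora=True, load_in_4bit=True),
    "T5": ModelSpec("seq2seq"),
    "NUEXTRACT": ModelSpec("schema"),
}
DEFAULT_MODEL = "LED"


def get_spec(name=DEFAULT_MODEL):
    """Look up a model spec by name, case-insensitively."""
    try:
        return MODEL_SPECS[name.upper()]
    except KeyError:
        raise ValueError(f"Unsupported model: {name!r}") from None
```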
Training Modes
HuggingFace Trainer
Uses the HuggingFace Trainer API with configurable training arguments (learning rate, epochs, batch size, etc.).
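A minimal sketch of this mode for the seq2seq models, assuming the `transformers` library. The hyperparameter values are illustrative defaults, not the ones used in trainer.py, and both function names are hypothetical.

```python
def training_kwargs(output_dir, learning_rate=5e-5, epochs=3, batch_size=2):
    """Collect the configurable training arguments in one place."""
    return {
        "output_dir": output_dir,
        "learning_rate": learning_rate,
        "num_train_epochs": epochs,
        "per_device_train_batch_size": batch_size,
        "per_device_eval_batch_size": batch_size,
        "predict_with_generate": True,  # generate text during evaluation
    }


def build_trainer(model, tokenizer, train_dataset, eval_dataset, **kwargs):
    """Build a Seq2SeqTrainer; `kwargs` are forwarded to training_kwargs."""
    # Imported lazily so training_kwargs can be inspected without transformers.
    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    args = Seq2SeqTrainingArguments(**training_kwargs(**kwargs))
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # Pads inputs and labels to the longest example in each batch.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
```

Calling `build_trainer(model, tokenizer, train_ds, val_ds, output_dir="out").train()` would then start training. For the causal models, the plain `Trainer` and `TrainingArguments` classes would be the analogous choice.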
Traditional Training Loop
Custom training loop with manual optimization steps, gradient clipping, and evaluation. It is useful when more control over the training process is needed.
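A minimal sketch of one epoch of such a loop in PyTorch. It assumes each batch is a dict of tensors and that the model returns an object with a `.loss` attribute, as HuggingFace models do when labels are supplied. The function name and the `max_grad_norm` default are assumptions, not taken from trainer.py.

```python
import torch
from torch.nn.utils import clip_grad_norm_


def train_one_epoch(model, batches, optimizer, max_grad_norm=1.0, device="cpu"):
    """Run one pass over `batches` and return the mean training loss."""
    model.train()
    total_loss, steps = 0.0, 0
    for batch in batches:
        # Move every tensor in the batch to the training device.
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        # Rescale gradients whose global norm exceeds max_grad_norm.
        clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        total_loss += loss.item()
        steps += 1
    return total_loss / max(steps, 1)
```

Evaluation would follow the same pattern with `model.eval()` and `torch.no_grad()`, without the backward pass or optimizer step.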
Optimizations
- PEFT/LoRA: Parameter-Efficient Fine-Tuning to reduce memory usage (configured in peft_configuration.py)
- 4-bit Quantization: BitsAndBytes quantization for large models
- Gradient Clipping: Prevents exploding gradients during training
- CUDA: Automatic GPU detection and usage
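The LoRA and 4-bit quantization options can be combined roughly as below for the causal models. This is a sketch assuming the `peft`, `transformers`, and `bitsandbytes` packages. The rank, alpha, dropout, and NF4 settings are common illustrative values, not the ones in peft_configuration.py, and the function names are hypothetical.

```python
def lora_kwargs(r=16, alpha=32, dropout=0.05):
    """Illustrative LoRA hyperparameters for a causal language model."""
    return {
        "r": r,
        "lora_alpha": alpha,
        "lora_dropout": dropout,
        "bias": "none",
        "task_type": "CAUSAL_LM",
    }


def quantization_kwargs():
    """Illustrative 4-bit (NF4) settings for BitsAndBytesConfig."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    }


def load_quantized_lora_model(model_name):
    """Load `model_name` in 4-bit and wrap it with LoRA adapters."""
    # Heavy dependencies are imported lazily; a CUDA GPU is needed in practice.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **quantization_kwargs()
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**lora_kwargs()))
```

Only the small adapter matrices are trained while the quantized base weights stay frozen, which is where the memory savings come from.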
Output
The fine-tuned model is saved to fine-tuned-model-With-Objeto-Conferencia/ and is loaded by the API's LLM Service at runtime.
Requirements
- Dataset JSON in data/sedici/jsons/ (or downloads from HuggingFace automatically)
- TOKEN_HUGGING_FACE in .env
- GPU recommended for training
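A small sketch of how the token requirement might be checked at startup. It assumes `python-dotenv` is used to read .env when installed and otherwise falls back to variables already set in the environment. `get_hf_token` is a hypothetical helper, not part of the module.

```python
import os


def get_hf_token():
    """Return TOKEN_HUGGING_FACE, loading .env first when python-dotenv is present."""
    try:
        from dotenv import load_dotenv

        load_dotenv()  # does not override variables that are already set
    except ImportError:
        pass  # rely on the process environment only
    token = os.environ.get("TOKEN_HUGGING_FACE")
    if not token:
        raise RuntimeError("TOKEN_HUGGING_FACE is not set; add it to .env")
    return token
```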