Skip to content

Document Metadata Extraction with LLM

Automatic extraction of bibliographic metadata from academic documents (theses, articles, books, conference objects) using fine-tuned language models. Built around the SEDICI repository (Servicio de Difusion de la Creacion Intelectual) at Universidad Nacional de La Plata.

Project Goal

Given an academic PDF document, automatically extract structured metadata such as title, authors, date, language, rights, and type-specific fields (director, ISBN, ISSN, etc.) using fine-tuned LLMs and ML classifiers.

High-Level Flow

flowchart LR
    A[SEDICI\nRepository] -->|Download & Clean| B[Dataset\n2000 docs]
    B -->|Train| C[Fine-Tuned\nLLM]
    B -->|Train| D[Type & Subject\nClassifiers]
    C --> E[API\nMicroservices]
    D --> E
    E -->|Upload PDF| F[Structured\nMetadata JSON]

Project Structure

Folder / File Description
api/ FastAPI microservices (Orchestrator, Extractor, LLM Service)
download_prepare_clean_normalize_sedici_dataset/ Data preparation pipeline: download, extract text, clean metadata
fine_tunning/ LLM fine-tuning module (LED, LLAMA, GEMMA, etc.)
fine_tune_type/ Document type classifier (Libro, Tesis, Articulo, Objeto de conferencia)
fine_tune_subject/ Subject/topic classifier using multiple ML models (SVM, XGBoost, etc.)
validation/ Model evaluation and comparison against ground truth
utils/ Shared utilities (text extraction, normalization, API clients)
data/ Data storage (CSVs, PDFs, extracted texts, JSONs)
constants.py Global configuration, prompts, model definitions, field mappings
run_modules.sh / run_modules.bat Entry-point scripts to run each module

Supported Document Types

Type Spanish Example Specific Fields
Thesis Tesis director, codirector, degree
Book Libro publisher, ISBN
Article Articulo ISSN, journal
Conference Object Objeto de conferencia ISSN, event

Technologies

  • Deep Learning: PyTorch, Transformers (Hugging Face), PEFT, LoRA
  • Base Models: LED, LLAMA, GEMMA, Mistral, T5, NuExtract
  • Document Processing: pdfplumber, PyMuPDF, EasyOCR
  • ML Classifiers: scikit-learn (SVM), XGBoost, Random Forest
  • API: FastAPI, Docker, Docker Compose
  • Data Cleaning: Google Gemini API, OpenAI API (configurable)