Skip to content

Architecture

End-to-End Pipeline

The project follows a sequential pipeline from raw data to deployed API:

flowchart TB
    subgraph Phase1["1. Data Preparation"]
        A1[SEDICI CSV] --> A2[Filter & Map\nMetadata]
        A2 --> A3[Download PDFs]
        A3 --> A4[Extract Text\n+ OCR]
        A4 --> A5[Clean Metadata\nRegex rules for ISSN, ISBN,\nrights, exact attrs\n+ Cloud LLM validation\nGemini or OpenAI]
        A5 --> A6[Final Dataset\nJSON]
    end

    subgraph Phase2["2. Training"]
        B1[Fine-Tune LLM\nLED / LLAMA / GEMMA]
        B2[Train Type\nClassifier]
        B3[Train Subject\nClassifier SVM]
    end

    subgraph Phase3["3. API Deployment"]
        C1[Orchestrator\n:8000]
        C2[Extractor\n:8001]
        C3[LLM Service - Fine-tuned\n:8002]
        C4[LLM Service - DeepAnalyze\n:8003\nOptional, extensible]
    end

    subgraph Phase4["4. Validation"]
        D1[Test Dataset] --> D2[Compare\nPredictions vs\nGround Truth]
        D2 --> D3[Metrics &\nDashboard]
    end

    A6 --> B1
    A6 --> B2
    A6 --> B3
    B1 --> C3
    B2 --> C1
    B3 --> C1
    C1 --> Phase4

API Microservices Architecture

When a user uploads a PDF, the Orchestrator coordinates the services:

sequenceDiagram
    participant Client
    participant Orchestrator as Orchestrator :8000
    participant Extractor as Extractor :8001
    participant LLM as LLM Service :8002
    participant Deep as LLM Service - DeepAnalyze :8003

    Client->>Orchestrator: POST /upload (PDF)
    Orchestrator->>Extractor: POST /extract (PDF)
    Extractor-->>Orchestrator: plain_text + xml_text

    Note over Orchestrator: Classify document type<br/>(TF-IDF + sklearn model)
    Note over Orchestrator: Classify subject<br/>(SVM model)

    Orchestrator->>LLM: POST /consume-llm (text + type prompt)
    LLM-->>Orchestrator: extracted metadata JSON

    opt DeepAnalyze enabled
        Orchestrator->>Deep: POST /consume-llm (metadata for validation)
        Deep-->>Orchestrator: validated/refined metadata
    end

    Orchestrator-->>Client: Final metadata JSON

Extensible LLM Services

The LLM Service structure is reusable. DeepAnalyze runs as a separate instance of the same service on port 8003, using a larger non-fine-tuned model to validate results. New LLM services can be added following the same pattern.

Data Flow Through the System

flowchart LR
    PDF[PDF Document] --> EXT[Text Extraction\npdfplumber + EasyOCR]
    EXT --> TYPE[Type Classification\nTF-IDF + sklearn]
    EXT --> SUBJ[Subject Classification\nSVM]
    EXT --> LLM[LLM Extraction\nFine-tuned LED]

    TYPE --> MERGE[Merge Results]
    SUBJ --> MERGE
    LLM --> MERGE
    MERGE --> JSON[Structured\nMetadata JSON]

Technology Stack

Layer Technology
API Framework FastAPI + Uvicorn
Containerization Docker + Docker Compose
Text Extraction pdfplumber, PyMuPDF, EasyOCR
LLM Fine-Tuning HuggingFace Transformers, PEFT, LoRA
Base Models LED (primary), LLAMA, GEMMA, Mistral
Classification scikit-learn, XGBoost
Data Cleaning Google Gemini API / OpenAI API + regex rules
Frontend (Metrics) React + TypeScript + Recharts