LLM Service
The LLM Service runs language models to extract or validate structured metadata from document text. The same service structure is deployed as two instances:
- Port 8002: fine-tuned model (e.g. LED) for metadata extraction
- Port 8003: DeepAnalyze (e.g. Qwen via Ollama) for result validation (optional, enabled with ENABLE_QWEN_SERVICE=true)
New LLM instances can be added by deploying another copy with different configuration.
Running
Endpoint
POST /consume-llm
Receives text (with the prompt already prepended by the Orchestrator) and returns the extracted metadata. Requires a Bearer token: LLM_LED_TOKEN for the port 8002 instance or LLM_DEEPANALYZE_TOKEN for the port 8003 instance.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| text | string | Prompt + document text sent by the Orchestrator |
Example request:
curl -X POST http://localhost:8002/consume-llm \
-H "Authorization: Bearer $LLM_LED_TOKEN" \
-H "Content-Type: application/json" \
-d '{"text": "Extract metadata from the following text: ..."}'
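The same request can be built from Python. The following is a minimal sketch using only the standard library; the helper names `build_consume_request` and `send` are hypothetical, and the base URL assumes the port 8002 instance from the curl example.

```python
import json
import urllib.request


def build_consume_request(text: str, token: str,
                          base_url: str = "http://localhost:8002") -> urllib.request.Request:
    """Build (but do not send) a POST /consume-llm request with a Bearer token."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/consume-llm",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body (needs a running service)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Splitting construction from sending keeps the request easy to inspect without a running service; `send(build_consume_request(text, token))` performs the actual call.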
Success response:
Error responses:
{"success": false, "data": null, "error": {"code": 400, "message": "No text input"}}
{"success": false, "data": null, "error": {"code": 500, "message": "cannot parse json output"}}
{"success": false, "data": null, "error": {"code": 500, "message": "server internal error"}}
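A client might unwrap this envelope along the following lines. This is a sketch, not the service's own client code: `LLMServiceError` and `unwrap_response` are hypothetical names, and the success shape is assumed to mirror the error shape (`success` set to true, payload under `data`).

```python
import json


class LLMServiceError(Exception):
    """Hypothetical exception carrying the code and message of an error envelope."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap_response(body: str):
    """Return the `data` payload, or raise LLMServiceError for an error envelope.

    Assumption: a success envelope looks like
    {"success": true, "data": {...}, "error": null}.
    """
    envelope = json.loads(body)
    if envelope.get("success"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise LLMServiceError(error.get("code"), error.get("message"))
```

Raising on `success: false` lets callers tell a 400 (bad input) from a 500 (model or parsing failure) by inspecting `code`.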
GET /health
No auth. Returns {"message-info": "server is up"}.
GET /test-integration
Requires Bearer token. Returns {"message": "Integration tests passed"}.
Internal Architecture
The services/ folder is organized in four parts:
flowchart TD
A[Text Input] --> B[llms_extraction.py\nMain extraction logic]
B --> C{Library Strategy?}
C -->|HuggingFace| D[llm_library_strategy.py\nHuggingFaceStrategy]
C -->|Ollama| E[llm_library_strategy.py\nOllamaStrategy]
D --> F[model_managment.py\nSelect correct model class\n+ tokenizer]
F --> G[Generate prediction]
E --> G
G --> H[utils/\nRegex normalization\nof output prediction]
H --> I[Parsed metadata JSON]
llms_extraction.py — Main Extraction Logic
Entry point that receives text, delegates to the appropriate library strategy, and returns the final result.
llm_library_strategy.py — Library Strategies
Two strategies determine how predictions are generated:
| Strategy | Description |
|---|---|
| HuggingFaceStrategy | Loads and runs fine-tuned models locally via Transformers |
| OllamaStrategy | Connects to an Ollama server for model inference |
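The choice between the two strategies can be pictured as a small strategy pattern. The sketch below is illustrative only: the `LibraryStrategy` base class, the `generate` method, and `select_strategy` are hypothetical, and the rule that exactly one of the local and Ollama flags must be set is an assumption about how the IS_LOCAL_MODEL and IS_OLLAMA_MODEL variables interact.

```python
from abc import ABC, abstractmethod


class LibraryStrategy(ABC):
    """Hypothetical common interface shared by both strategies."""

    @abstractmethod
    def generate(self, text: str) -> str:
        """Return the raw model prediction for the prompt + document text."""


class HuggingFaceStrategy(LibraryStrategy):
    def generate(self, text: str) -> str:
        # Placeholder: the real strategy runs a local Transformers model.
        raise NotImplementedError


class OllamaStrategy(LibraryStrategy):
    def generate(self, text: str) -> str:
        # Placeholder: the real strategy calls an Ollama server.
        raise NotImplementedError


def select_strategy(is_local_model: bool, is_ollama_model: bool) -> LibraryStrategy:
    """Pick a strategy from the two flags (assumed to be mutually exclusive)."""
    if is_local_model == is_ollama_model:
        raise ValueError("Exactly one of the local and Ollama flags must be enabled")
    return HuggingFaceStrategy() if is_local_model else OllamaStrategy()
```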
model_managment.py — Model Strategies (HuggingFace)
When using HuggingFace, this file selects the correct model class and tokenizer based on the model type:
| Model | Class | Tokenizer |
|---|---|---|
| LED, LED Large, LED Spanish | LEDForConditionalGeneration | LEDTokenizer |
| LLAMA, Qwen, GEMMA, DeepSeek-Qwen | AutoModelForCausalLM | AutoTokenizer |
| T5 | T5ForConditionalGeneration | T5Tokenizer |
| NuExtract | AutoModelForCausalLM | AutoTokenizer |
The file also handles optional 4-bit quantization (BitsAndBytes) and CUDA device detection.
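One way to express this selection is a lookup table keyed by model name. The sketch below is an assumption about structure, not the contents of model_managment.py: the upper-cased keys, `resolve_classes`, and `load_model_and_tokenizer` are hypothetical, while `from_pretrained`, `BitsAndBytesConfig(load_in_4bit=True)`, and `torch.cuda.is_available()` are standard Transformers and PyTorch calls.

```python
import importlib

# Model name -> (model class name, tokenizer class name), taken from the table above.
# The upper-cased keys are an assumption about how names are normalized.
MODEL_CLASSES = {
    "LED": ("LEDForConditionalGeneration", "LEDTokenizer"),
    "LED LARGE": ("LEDForConditionalGeneration", "LEDTokenizer"),
    "LED SPANISH": ("LEDForConditionalGeneration", "LEDTokenizer"),
    "LLAMA": ("AutoModelForCausalLM", "AutoTokenizer"),
    "QWEN": ("AutoModelForCausalLM", "AutoTokenizer"),
    "GEMMA": ("AutoModelForCausalLM", "AutoTokenizer"),
    "DEEPSEEK-QWEN": ("AutoModelForCausalLM", "AutoTokenizer"),
    "T5": ("T5ForConditionalGeneration", "T5Tokenizer"),
    "NUEXTRACT": ("AutoModelForCausalLM", "AutoTokenizer"),
}


def resolve_classes(model_name: str):
    """Return the (model class name, tokenizer class name) pair for a model name."""
    key = model_name.strip().upper()
    if key not in MODEL_CLASSES:
        raise ValueError(f"Unsupported model: {model_name!r}")
    return MODEL_CLASSES[key]


def load_model_and_tokenizer(model_name: str, model_path: str, quantize: bool = False):
    """Load model and tokenizer; needs transformers, torch, and (if quantize) bitsandbytes."""
    transformers = importlib.import_module("transformers")
    torch = importlib.import_module("torch")
    model_cls_name, tokenizer_cls_name = resolve_classes(model_name)
    kwargs = {}
    if quantize:
        kwargs["quantization_config"] = transformers.BitsAndBytesConfig(load_in_4bit=True)
    model = getattr(transformers, model_cls_name).from_pretrained(model_path, **kwargs)
    tokenizer = getattr(transformers, tokenizer_cls_name).from_pretrained(model_path)
    # Only move unquantized models: calling .to() on a bitsandbytes-quantized
    # model is not supported in many transformers versions.
    if not quantize:
        model = model.to("cuda" if torch.cuda.is_available() else "cpu")
    return model, tokenizer
```

Importing `transformers` lazily keeps the name lookup usable (and testable) on machines without the heavy dependencies installed.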
utils/ — Output Normalization
Regex-based cleanup of the raw model prediction output before parsing it as JSON (fixing malformed strings, normalizing fields, etc.).
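The actual rules in utils/ are not listed here, so the following is only a sketch of what this kind of cleanup can look like. The function name `normalize_prediction` and the three specific fixes (dropping text around the JSON object, removing trailing commas, replacing raw line breaks) are assumptions chosen for illustration.

```python
import json
import re


def normalize_prediction(raw: str) -> dict:
    """Clean a raw model prediction and parse it as JSON (illustrative rules only)."""
    # Keep only the outermost {...} span, dropping any text emitted around it.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("cannot parse json output")
    candidate = match.group(0)
    # Remove trailing commas before a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    # Replace raw line breaks and tabs, which are invalid inside JSON strings.
    candidate = re.sub(r"[\r\n\t]+", " ", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError("cannot parse json output") from exc
```

Regex fixes of this kind are heuristic: a comma followed by a brace inside a string value would also be rewritten, so the rules should stay narrow and be tested against real model output.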
Supported Base Models
| Model | Architecture | Key |
|---|---|---|
| LED (default) | Seq2Seq | allenai/led-base-16384 |
| LED Large | Seq2Seq | allenai/led-large-16384 |
| LED Spanish | Seq2Seq | vgaraujov/led-base-16384-spanish |
| LLAMA | Causal | meta-llama/Llama-3.2-1B |
| GEMMA | Causal | google/gemma-3-1b-pt |
| Qwen | Causal | Qwen/Qwen3-4B |
| DeepSeek-Qwen | Causal | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| Mistral | Causal | mistralai/Mistral-7B-v0.1 |
| NuExtract | Schema-based | numind/NuExtract-tiny |
| T5 | Seq2Seq | google-t5/t5-base |
Fine-tuned Model Files
When using a local fine-tuned model, place the model files in api/app/llm_service/app/models/fine-tuned-model-led/:
fine-tuned-model-led/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer_config.json
├── special_tokens_map.json
├── vocab.json
└── merges.txt
The path is configured via the MODEL_PATH environment variable (the Service 1 table below lists it as MODEL_PATH_SERVICE1).
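Before starting the service, it can help to confirm that the directory contains every file in the listing above. This is a minimal sketch; `missing_model_files` is a hypothetical helper, and treating all seven files as required is an assumption based on the listing.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
)


def missing_model_files(model_dir) -> list:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the directory matches the expected layout; otherwise the returned names show what still needs to be copied in.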
Environment Variables
Service 1 — Fine-tuned (port 8002)
| Variable | Default | Description |
|---|---|---|
| SERVICE_TOKEN | (none) | Bearer token (from LLM_LED_TOKEN) |
| IS_LOCAL_MODEL1 | true | Load the model locally via HuggingFace |
| IS_OLLAMA_MODEL1 | false | Use Ollama instead |
| MODEL_SELECTED_SERVICE1 | LED | Model name |
| MODEL_PATH_SERVICE1 | fine-tuned-model | Path to model weights |
| MAX_TOKENS_INPUT_SERVICE1 | 2048 | Max input token length |
| MAX_TOKENS_OUTPUT_SERVICE1 | 512 | Max new tokens to generate |
| TRUNACTION_SERVICE1 | true | Truncate the input if it exceeds the max input length |
| SPECIAL_TOKENS_TREATMENT_SERVICE1 | true | Skip special tokens in output |
| ERRORS_TREATMENT_SERVICE1 | replace | Encoding error handling |
| QUANTIZATION_SERVICE1 | false | Enable 4-bit quantization (BitsAndBytes) |
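The Service 1 variables and defaults above could be read into a typed settings object as sketched below. The `Service1Settings` class, its field names, and the boolean parsing rule are hypothetical; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: the service's own boolean parsing rules are not documented here.
    return value.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class Service1Settings:
    is_local_model: bool
    is_ollama_model: bool
    model_selected: str
    model_path: str
    max_tokens_input: int
    max_tokens_output: int
    truncation: bool
    skip_special_tokens: bool
    errors_treatment: str
    quantization: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Service1Settings":
        """Read settings from env (defaults to os.environ), using the table's defaults."""
        env = os.environ if env is None else env
        return cls(
            is_local_model=_as_bool(env.get("IS_LOCAL_MODEL1", "true")),
            is_ollama_model=_as_bool(env.get("IS_OLLAMA_MODEL1", "false")),
            model_selected=env.get("MODEL_SELECTED_SERVICE1", "LED"),
            model_path=env.get("MODEL_PATH_SERVICE1", "fine-tuned-model"),
            max_tokens_input=int(env.get("MAX_TOKENS_INPUT_SERVICE1", "2048")),
            max_tokens_output=int(env.get("MAX_TOKENS_OUTPUT_SERVICE1", "512")),
            # Variable name spelled exactly as in the table.
            truncation=_as_bool(env.get("TRUNACTION_SERVICE1", "true")),
            skip_special_tokens=_as_bool(env.get("SPECIAL_TOKENS_TREATMENT_SERVICE1", "true")),
            errors_treatment=env.get("ERRORS_TREATMENT_SERVICE1", "replace"),
            quantization=_as_bool(env.get("QUANTIZATION_SERVICE1", "false")),
        )
```

Accepting an optional mapping instead of reading `os.environ` directly makes the defaults easy to check in tests, for example with `Service1Settings.from_env({})`.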
Service 2 — DeepAnalyze (port 8003)
| Variable | Default | Description |
|---|---|---|
| ENABLE_QWEN_SERVICE | false | Enable this service |
| SERVICE_TOKEN | (none) | Bearer token (from LLM_DEEPANALYZE_TOKEN) |
| IS_LOCAL_MODEL2 | false | Load model locally |
| IS_OLLAMA_MODEL2 | true | Use Ollama |
| MODEL_SELECTED_SERVICE2 | QWEN | Model name |
| OLLAMA_HOST_URL | http://localhost:11434 | Ollama server URL |
Requirements
Location
api/app/llm_service/
├── Dockerfile
├── requirements.txt
├── run_llm_temp.sh
└── app/
    ├── main.py
    ├── routers/router.py
    ├── constants/constant.py
    ├── middleware/security.py
    ├── errors/error.py
    ├── models/
    │   └── fine-tuned-model-led/    # Fine-tuned model weights
    └── services/
        ├── llms_extraction.py       # Main extraction logic
        ├── llm_library_strategy.py  # Ollama & HuggingFace strategies
        ├── model_managment.py       # Model class + tokenizer selection
        └── utils/                   # Regex normalization of output