Vietnamese Medical Information Extraction (NER + Relation Extraction)

UEH NLP course final project: a semi-supervised IE pipeline for Vietnamese medical text

Final project for the UEH NLP course: an end-to-end Information Extraction system for Vietnamese medical text. The team implemented a pipeline architecture (NER → Entity Pairing → Relation Extraction) inspired by PURE, covering 5 entity types and 4 relation types, and used semi-supervised hybrid learning with silver data to compensate for limited labeled data.

NLP

AI/ML

Healthcare

Team Member

NLP Developer

Demo GitHub

Timeline

2025

Type

Project

Status

Completed

Outcome / Impact

• Hybrid semi-supervised RE (MLP + BERT) achieved 81.25% accuracy and 0.631 Macro-F1
• Silver data augmentation improved Macro-F1 from 0.599 (Standard) to 0.631 (Hybrid)
• Built the complete pipeline: Label Studio annotation → NER → Entity Pairing with markers → RE classification
• Deployed an interactive Gradio demo with Knowledge Graph visualization on Hugging Face Spaces for the course presentation

Tech / Skills

Python

PhoBERT

Label Studio

spaCy

Gradio

NER

Relation Extraction

Case Study

1) Context / Problem

Vietnamese medical texts from health consultation websites contain valuable knowledge about diseases, symptoms, causes, diagnoses, and treatments, but that knowledge is unstructured and difficult to query. For the UEH NLP course project, the goal was to build an IE pipeline that turns this noisy medical text into structured entities and relations, despite having only a small amount of labeled data.

2) Your Role

As an NLP developer on the team, I contributed to the demo web interface, BERT-based NER workflow, hybrid labeling mechanism for silver data generation, Word2Vec/BERT vectorization functions, and evaluation logic for comparing relation extraction models.

3) Approach

We applied a PURE-inspired pipeline: (1) NER with spaCy/ViHealthBERT to identify entities; (2) Entity Pairing, which wraps each candidate pair in [S]/[O] markers for all entity combinations in a sentence; (3) RE classification over a grid of vectorizer-model combinations (BoW/TF-IDF/Word2Vec/BERT × LogReg/SVM/RF/MLP). For the semi-supervised hybrid approach, we trained an initial model on 2,248 gold samples, then generated 2,484 silver samples from 618 unlabeled sentences to augment the training set.
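The Entity Pairing step can be illustrated with a minimal sketch. It assumes character-offset entity spans that do not overlap, ordered (subject, object) pairs, and closing markers written as [/S] and [/O]; the Entity type, the helper names, and the DISEASE/SYMPTOM labels are hypothetical and not taken from the project code.

```python
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    text: str
    label: str  # hypothetical entity type name
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def find_entity(sentence: str, text: str, label: str) -> Entity:
    """Locate the first occurrence of `text` and return it as an Entity."""
    start = sentence.index(text)
    return Entity(text, label, start, start + len(text))


def mark_pair(sentence: str, subj: Entity, obj: Entity) -> str:
    """Wrap the subject span in [S]...[/S] and the object span in [O]...[/O]."""
    spans = [(subj.start, subj.end, "S"), (obj.start, obj.end, "O")]
    # Insert markers from right to left so earlier offsets stay valid.
    for start, end, tag in sorted(spans, reverse=True):
        sentence = (
            f"{sentence[:start]}[{tag}]{sentence[start:end]}[/{tag}]{sentence[end:]}"
        )
    return sentence


def build_pairs(
    sentence: str, entities: List[Entity]
) -> List[Tuple[Entity, Entity, str]]:
    """Return one marked copy of the sentence per ordered entity pair."""
    return [(s, o, mark_pair(sentence, s, o)) for s, o in permutations(entities, 2)]


sentence = "Sốt xuất huyết gây sốt cao"
disease = find_entity(sentence, "Sốt xuất huyết", "DISEASE")
symptom = find_entity(sentence, "sốt cao", "SYMPTOM")
pairs = build_pairs(sentence, [disease, symptom])
for subj, obj, marked in pairs:
    print(subj.label, "->", obj.label, ":", marked)
```

Each marked sentence then becomes one input to the RE classifier, so a sentence with n entities yields n × (n − 1) candidates under the ordered-pair assumption.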

4) Result / Impact

MLP + BERT achieved the best performance: Standard (Accuracy 0.780, Macro-F1 0.599) → Hybrid with silver data (Accuracy 0.813, Macro-F1 0.631). BERT embeddings consistently outperformed the traditional vectorizers (BoW, TF-IDF, Word2Vec). The team also turned the coursework output into a Gradio demo with knowledge-graph-oriented exploration, hosted on Hugging Face Spaces.
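Accuracy and Macro-F1 can diverge because Macro-F1 averages per-class F1 scores with equal weight, so a weak minority class pulls it down even when most predictions are correct. The sketch below computes both metrics from scratch on toy labels; the label names and the data are illustrative assumptions, not the project's evaluation set.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced example with hypothetical relation labels.
y_true = ["NO_RELATION"] * 8 + ["CAUSES"] * 2
y_pred = ["NO_RELATION"] * 8 + ["NO_RELATION", "CAUSES"]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

On this toy data a single error on the minority class leaves accuracy at 0.900 while Macro-F1 drops to about 0.804. In practice scikit-learn's f1_score with average="macro" computes the same quantity.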

5) Learnings

Contextual embeddings such as ViHealthBERT/PhoBERT outperform frequency-based features for Vietnamese medical RE, and semi-supervised learning helps offset the scarcity of labeled data. The pipeline design works well for a course project but still suffers from error propagation: any entity that NER misses or mislabels is carried into RE. Joint IE and richer graph reasoning are natural next steps.
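The silver-data idea can be sketched as confidence-thresholded pseudo-labeling. The write-up does not describe how silver labels were actually selected, so this is one plausible variant under stated assumptions: select_silver, the 0.9 threshold, the toy_predict_proba stand-in, and the label names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Probs = Dict[str, float]


def select_silver(
    unlabeled: Sequence[str],
    predict_proba: Callable[[str], Probs],
    threshold: float = 0.9,  # assumed cut-off, not a value from the project
) -> List[Tuple[str, str]]:
    """Keep (text, predicted_label) pairs whose top probability meets the threshold."""
    silver = []
    for text in unlabeled:
        probs = predict_proba(text)
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            silver.append((text, label))
    return silver


# Stand-in for a classifier trained on the gold set (hypothetical outputs).
confident_text = "[S]paracetamol[/S] giảm [O]sốt[/O]"
uncertain_text = "[S]sốt[/S] và [O]ho[/O]"
TOY_OUTPUTS = {
    confident_text: {"TREATS": 0.95, "NO_RELATION": 0.05},
    uncertain_text: {"TREATS": 0.40, "NO_RELATION": 0.60},
}


def toy_predict_proba(text: str) -> Probs:
    return TOY_OUTPUTS[text]


silver = select_silver([confident_text, uncertain_text], toy_predict_proba)
print(silver)
```

The selected silver pairs would then be appended to the gold samples before retraining. A higher threshold trades silver-set size for label quality.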

6) Links

See links above.