Project
completed
2025

VN Stock Analytics – Investment Decision Support System

Multi-model data mining with LLM reasoning for Vietnamese banking stocks

A Data Mining course final project that builds a comprehensive analytics system for 14 Vietnamese banking stocks in the VN30 index. It integrates multi-source data (market OHLCV, financial reports, macro indicators, news sentiment) and trains four XGBoost models: Return Regression, Direction Classification, Risk Forecasting, and Regime Detection. An LLM layer provides reasoning and investment recommendations in natural language.

AI/ML
Finance
Data Mining
NLP
Team Lead
Developer
Timeline

2025

Type

Project

Status

completed

Outcome / Impact

  • Return Regression achieved MAE 0.094 and RMSE 0.119 on 21-day log-return prediction
  • Risk Model achieved 0.98 correlation between predicted and actual volatility
  • Regime Model achieved 61% accuracy in Bear/Sideways/Bull classification under walk-forward validation
  • Deployed a full-stack web app with Dashboard, Explorer, AI Advisor (LLM reasoning), and Admin Console

Tech / Skills

Python
XGBoost
LLMs
Sentiment Analysis
Feature Engineering
Streamlit
FastAPI

Case Study

1) Context / Problem

The Vietnamese stock market has grown rapidly, with 11M+ trading accounts by 2025, but retail investors (F0) still rely on personal experience or simple technical indicators. Information overload from price data, financial reports, macro news, and social media makes manual analysis inefficient and prone to emotional bias. This project aimed to build a data-driven decision support system specifically for banking stocks.

2) Your Role

As Team Lead, I designed the overall system architecture and coordinated 5 team members. I was responsible for data pipeline development, feature engineering (momentum, volatility, technical indicators, sentiment), model training with walk-forward validation, and LLM integration for the reasoning layer.
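To make the feature-engineering step concrete, here is a minimal sketch of two of the feature types named above (momentum and volatility) plus a 21-day forward log-return target. The function names, window lengths, and the use of sample standard deviation are illustrative assumptions, not the project's actual code.

```python
import math

def log_returns(closes):
    """Daily log returns: r_t = ln(P_t / P_{t-1})."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]

def momentum(closes, window):
    """Log price change over the trailing `window` days (None until enough history)."""
    return [
        math.log(closes[t] / closes[t - window]) if t >= window else None
        for t in range(len(closes))
    ]

def rolling_volatility(closes, window):
    """Sample std of the last `window` daily log returns, aligned to `closes`.

    Assumes window >= 2 so the sample variance is defined.
    """
    rets = log_returns(closes)
    out = [None] * len(closes)
    for t in range(window, len(closes)):
        chunk = rets[t - window:t]  # the `window` returns ending on day t
        mean = sum(chunk) / window
        var = sum((r - mean) ** 2 for r in chunk) / (window - 1)
        out[t] = math.sqrt(var)
    return out

def forward_log_return(closes, horizon=21):
    """Regression target: ln(P_{t+h} / P_t); None for the last `horizon` days."""
    n = len(closes)
    return [
        math.log(closes[t + horizon] / closes[t]) if t + horizon < n else None
        for t in range(n)
    ]
```

Features at day t use only prices up to day t, while the target looks forward; keeping the two directions in separate functions makes it easier to check that no future data leaks into the inputs.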

3) Approach

Built an end-to-end pipeline: (1) data collection from vnstock, yfinance, and CafeF/TCBS news; (2) feature engineering with 31 features across 5 groups (Market, Technical, Sentiment, Macro, Bank Fundamentals); (3) four XGBoost models trained on an expanding window; (4) an LLM reasoning layer that uses the Groq API to synthesize model outputs into natural-language recommendations. Walk-forward validation over 2020-2025 was used to guard against look-ahead bias.
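The expanding-window, walk-forward scheme described above can be sketched as a small split generator. This is an illustration under assumptions: the function name, the fold sizes, and the optional `gap` parameter are hypothetical and not taken from the project. The gap is one common way to keep a forward-looking label (such as a 21-day return) from overlapping the test period.

```python
def expanding_window_splits(n_samples, initial_train, test_size, gap=0):
    """Yield (train_indices, test_indices) for walk-forward validation.

    The training range always starts at 0 and grows by `test_size` each fold.
    The test range starts `gap` rows after the end of the training range.
    """
    train_end = initial_train
    while train_end + gap + test_size <= n_samples:
        test_start = train_end + gap
        yield range(0, train_end), range(test_start, test_start + test_size)
        train_end += test_size


# Example: 10 time-ordered rows, 4 initial training rows, 2-row test folds, 1-row gap.
for train_idx, test_idx in expanding_window_splits(10, 4, 2, gap=1):
    print(list(train_idx), list(test_idx))
```

Each fold trains only on rows that precede its test rows, so a model fitted per fold never sees data from the period it is evaluated on.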

4) Result / Impact

Return Regression: MAE 0.094, RMSE 0.119. Direction Classification: 55% accuracy, 0.55 ROC-AUC. Risk Model: 0.98 correlation between predicted and actual volatility. Regime Model: 61% accuracy. Feature importance analysis showed that macro variables (GDP, credit growth, FX) dominate regime detection, while technical features drive short-term signals. Deployed as a Streamlit app with a FastAPI backend.
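For reference, the metric types reported above can be computed as in the sketch below. The helper names are hypothetical, the numbers in the usage lines are toy values rather than project results, and treating the reported correlation as a Pearson correlation is an assumption, since the write-up does not say which kind was used.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def direction_accuracy(y_true, y_pred):
    """Share of samples where predicted and actual returns agree on up (> 0) vs not up."""
    hits = sum((t > 0) == (p > 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def pearson_correlation(x, y):
    """Pearson correlation coefficient (assumed here; the source only says 'correlation')."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy values for illustration only.
actual = [0.10, -0.20, 0.05, -0.10]
predicted = [0.20, -0.10, -0.05, -0.30]
print(mae(actual, predicted), rmse(actual, predicted), direction_accuracy(actual, predicted))
```

Because MAE and RMSE are in the same units as the 21-day log return, they are easiest to interpret next to the typical size of that return over the test period.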

5) Learnings

Integrating multiple data sources improved model robustness compared with single-source approaches. Volatility is highly predictable because of volatility clustering; direction is harder to predict, which is consistent with market efficiency. The LLM reasoning layer bridges the gap between quantitative outputs and actionable insights for retail investors. Future work: portfolio optimization, a backtesting engine, and real-time data streaming.

6) Links

See links above.