HEARSEE: A MULTIMODAL LARGE LANGUAGE MODEL POWERED IMAGE-TEXT-TO-TEXT AND TEXT-TO-SPEECH WEB-BASED LEARNING TOOL FOR UNDERGRADUATE STUDENTS
| dc.contributor.author | FERDINAND MORIENTES ANAK RANTAI | |
| dc.date.accessioned | 2026-04-28T07:45:26Z | |
| dc.date.issued | 2025 | |
| dc.identifier.uri | https://scholarhub.unimas.my/handle/123456789/548 | |
| dc.language.iso | English | |
| dc.publisher | HearSee is a web-based assistive learning tool designed to enhance undergraduate study by integrating multimodal large language models for image-text-to-text and text-to-speech processing. It is implemented with a three-tier architecture: a Gradio UI for presentation, Python service classes for application logic, and configuration utilities for data handling. The system uses Qwen2-VL 7B via the Replicate API to perform OCR, generate descriptive captions, produce concise summaries, and support interactive Q&A on uploaded images. Processed outputs can be converted to natural-sounding speech using the Kokoro TTS engine (based on StyleTTS 2), with user-selectable voice profiles and adjustable playback speed. Deployed on Hugging Face Spaces, HearSee employs a prompt-engineering approach to optimize model responses and manages session state and temporary audio files for seamless user interactions. System efficacy is established through automated evaluation metrics (BERTScore for text quality and Mean Opinion Score for speech naturalness), alongside measured latency benchmarks. The system was validated through automated testing, which achieved 86% code coverage, and through user acceptance testing with undergraduate students. The results confirm that HearSee meets its objectives of improving accessibility and ease of use across diverse learning modalities, with high user satisfaction and excellent semantic quality, while providing a scalable foundation for future enhancements in multimodal educational applications. | |
| dc.relation.ispartofseries | Faculty of Computer Science and Information Technology | |
| dc.title | HEARSEE: A MULTIMODAL LARGE LANGUAGE MODEL POWERED IMAGE-TEXT-TO-TEXT AND TEXT-TO-SPEECH WEB-BASED LEARNING TOOL FOR UNDERGRADUATE STUDENTS | |
| dc.type | Final Year Project | |
