HEARSEE: A MULTIMODAL LARGE LANGUAGE MODEL POWERED IMAGE-TEXT-TO-TEXT AND TEXT-TO-SPEECH WEB-BASED LEARNING TOOL FOR UNDERGRADUATE STUDENTS

FERDINAND MORIENTES ANAK RANTAI

dc.contributor.author	FERDINAND MORIENTES ANAK RANTAI
dc.date.accessioned	2026-04-28T07:45:26Z
dc.date.issued	2025
dc.identifier.uri	https://scholarhub.unimas.my/handle/123456789/548
dc.language.iso	English
dc.publisher	HearSee is a web-based assistive learning tool designed to enhance undergraduate study by integrating multimodal large language models for image-text-to-text and text-to-speech processing. It is implemented with a three-tier architecture: a Gradio UI for presentation, Python service classes for application logic, and configuration utilities for data handling. The system uses Qwen2-VL 7B via the Replicate API to perform OCR, generate descriptive captions, produce concise summaries, and support interactive Q&A on uploaded images. Processed outputs can be converted to natural-sounding speech using the Kokoro TTS engine (based on StyleTTS 2), with user-selectable voice profiles and adjustable playback speed. Deployed on Hugging Face Spaces, HearSee uses prompt engineering to optimise model responses and manages session state and temporary audio files to keep user interactions seamless. System efficacy is assessed with automated evaluation metrics (BERTScore for text quality and Mean Opinion Score for speech naturalness) alongside measured latency benchmarks. The system was validated through automated testing, which achieved 86% code coverage, and through user acceptance testing with undergraduate students. The results confirm that HearSee meets its objectives of improving accessibility and ease of use across diverse learning modalities, with high user satisfaction and excellent semantic quality, while providing a scalable foundation for future enhancements in multimodal educational applications.
dc.relation.ispartofseries	Faculty of Computer Science and Information Technology
dc.title	HEARSEE: A MULTIMODAL LARGE LANGUAGE MODEL POWERED IMAGE-TEXT-TO-TEXT AND TEXT-TO-SPEECH WEB-BASED LEARNING TOOL FOR UNDERGRADUATE STUDENTS
dc.type	Final Year Project
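
The abstract above describes a prompt-engineering approach in which a single vision-language model serves four tasks: OCR, captioning, summarisation, and Q&A. The following is a minimal sketch of how such task-specific prompts might be organised in a Python service layer. This record does not include the project's actual prompts or code, so every name here (`PROMPTS`, `build_prompt`, `run_task`) and every prompt string is a hypothetical illustration. The model call is passed in as a plain callable so that the sketch makes no assumption about any hosted API client.

```python
from typing import Callable

# Hypothetical prompt templates, one per task named in the abstract.
# The project's real prompts are not given in this record.
PROMPTS = {
    "ocr": "Extract all text visible in the image exactly as written.",
    "caption": "Describe the image in detail.",
    "summary": "Summarise the content of the image in a few sentences.",
    "qa": "Answer the following question about the image: {question}",
}


def build_prompt(task: str, question: str = "") -> str:
    """Return the prompt text for a task; only 'qa' uses the question."""
    if task not in PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    if task == "qa":
        if not question.strip():
            raise ValueError("The 'qa' task requires a non-empty question")
        return PROMPTS["qa"].format(question=question.strip())
    return PROMPTS[task]


def run_task(
    model: Callable[[str, str], str],
    image_path: str,
    task: str,
    question: str = "",
) -> str:
    """Build the prompt and pass it, with the image path, to the model callable."""
    return model(image_path, build_prompt(task, question))
```

In a deployment like the one described, `model` would wrap the call to the hosted vision-language model. Keeping it as an injected callable also lets the prompt logic be exercised in automated tests with a stub in place of the real model.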

Files

Original bundle

Name: FERDINAND MORIENTES.pdf
Size: 9.05 MB
Format: Adobe Portable Document Format

Collections

Final Year Project Report/IMRaD