An Iterative Comprehensive Evaluation of Large Language and Vision Models in Medical AI: Benchmarks, Adaptability, and Deployment Challenges
Abstract
This review applies PRISMA principles to provide a thorough analysis of recent advances in large language models (LLMs) and multimodal transformers for medical applications. As LLMs such as GPT-4, BioGPT, and Med-PaLM, and hybrid frameworks such as COMCARE, enter clinical workflows, a rigorous synthesis is essential to improve performance, methodological adaptability, and implementation practicality across diverse healthcare settings. Their capabilities in medical report writing, decision support, and diagnosis are notable, but the literature has not established a cohesive taxonomy that evaluates these models by uniform metrics, domain-specific generalizability, and ethical acceptability. Over 40 studies covering radiology report generation, clinical question answering, cognitive assessment, and causal reasoning were examined. Vision-language transformer architectures such as PEGASUS and ETB MII were assessed for automated imaging-based reporting, and graph-based reasoning was used to evaluate drug safety and the interpretability of knowledge-integrated models such as KELLM. BLEU, ROUGE, F1 score, CIDEr, and qualitative evaluations were applied as appropriate. Domain-adapted and hybrid models improve diagnostic accuracy, task-specific explainability, and clinician workload to varying degrees. Model hallucination, bias, adversarial manipulation, and resource-intensive fine-tuning remain open problems. The review recommends robust benchmarking, public evaluation standards, and ethical frameworks for LLMs in high-stakes medical applications. This study characterizes the clinical utility of LLMs and recommends infrastructure, ethical, and technical measures for their safe and effective integration, laying the groundwork for scalable, interpretable, and equitable medical AI systems.
Article Details

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.