An Iterative Comprehensive Evaluation of Large Language and Vision Models in Medical AI: Benchmarks, Adaptability, and Deployment Challenges
Abstract
This review applies PRISMA principles to provide a thorough analysis of recent advances in large language models (LLMs) and multimodal transformers for medical applications. As LLMs such as GPT-4, BioGPT, and Med-PaLM, and hybrid frameworks such as COMCARE, enter clinical workflows, a rigorous synthesis is essential to improve performance, methodological adaptability, and implementation practicality across diverse healthcare settings. Their capabilities in medical report writing, decision support, and diagnosis are notable, but the literature has not established a cohesive taxonomy that evaluates these models by uniform metrics, domain-specific generalizability, and ethical acceptability. Over 40 studies covering radiology report generation, clinical question answering, cognitive assessment, and causal reasoning were examined. Vision-language transformer architectures such as PEGASUS and ETB MII were assessed for automated imaging-based reporting, and graph-based reasoning was used to evaluate drug safety and the interpretability of knowledge-integrated models such as KELLM. BLEU, ROUGE, F1 score, CIDEr, and qualitative evaluations were applied as appropriate. Domain-adapted and hybrid models improve diagnostic accuracy, task-specific explainability, and clinician workload to varying degrees. Model hallucination, bias, adversarial manipulation, and resource-intensive fine-tuning remain open problems. The review recommends robust benchmarking, public evaluation standards, and ethical frameworks for LLMs in high-stakes medical applications. This study characterizes the clinical utility of LLMs and recommends infrastructure, ethical, and technical measures for their safe and effective integration, laying the groundwork for scalable, interpretable, and equitable medical AI systems.
Article Details

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.