By A-pharmaconsult
A-pharmaconsult warns about AI ‘hallucinations’ in medical writing
Grasse, France – A-pharmaconsult, a consultancy specializing in services for the pharmaceutical and medical device industries, has identified serious risks in relying on AI writing tools to produce medical documentation.
A-pharmaconsult medical manager and project manager Dr. Natalia Bernachon has carried out a literature review of current research on the accuracy of Large Language Model (LLM) AI platforms such as ChatGPT and Claude AI. The review recognizes that the use of such writing tools has become increasingly prevalent in academic and research communities as students, researchers, and professionals across various fields seek to improve their written clarity, structure, and originality. There are also corporate motivations to adopt AI-assisted writing: greater speed, productivity, and efficiency, cost savings, improved language quality, and multi-language support.
Beware hallucinations
However, the review also highlights the vulnerability of LLM-based platforms to ‘hallucinations’ or ‘confabulations’, in which the AI engine generates inaccurate or fabricated content containing partially or entirely false passages.
“This undesirable AI ‘side effect’ remains a significant challenge for potential professional use, particularly in pharmaceutical regulatory areas,” Dr. Bernachon points out.
Various studies have categorized AI hallucinations into different types based on their nature and context. These include generating false responses that do not correspond to real-world data; misinterpreting or over-extracting from the data; overgenerating content; and inventing nonexistent objects, datasets, or erroneous conclusions. Such hallucinations can be caused by limited contextual understanding, overly generalized conclusions, or errant feedback loops, and may occur when the model cannot find a definitive answer and falls back on what seems the most probable or plausible response.
Actual examples of typical AI hallucinations include:
- Factuality Hallucination: an LLM generates factually incorrect content.
- Factual Inconsistency: for example, an LLM naming Yuri Gagarin, the first human in space, as the first person to step onto the Moon, rather than the correct answer, Neil Armstrong.
- Factual Fabrication: an LLM creating a fictitious narrative about research on unicorns, or inventing fictitious articles, non-existent references, and the like.
AI models are not all created equal
The literature reveals that hallucination rates vary markedly between AI models and depend on the task being performed, ranging from a ChatGPT-4 hallucination rate of 28.6% in scientific writing, through 39.6% for the earlier GPT-3.5 in the same context, up to a high of 91.4% for Bard.
One study reviewed (De Wynter et al., 2023) reported that 46.4% of AI-generated texts contained factual errors, 52% had discourse flaws such as self-contradictory statements, 31.3% contained logical fallacies, and 15.4% presented personally identifiable information (PII) concerns, such as incorrect claims and papers misattributed to living scientists.
Experts have found that AI-generated unstructured abstracts often score lower in quality than human-written versions (Cheng et al., 2023), and that while AI tools like ChatGPT have demonstrated high correctness rates in validation rounds, they still struggle with sourcing literature and maintaining contextual accuracy (Riedel et al., 2023; Seth et al., 2023).
Better training, better models
The A-pharmaconsult review notes that several techniques can be employed to reduce hallucination rates in AI systems, such as improving training data quality, incorporating external knowledge sources, and implementing rigorous fact-checking processes.
“In scientific contexts, ensuring that AI models are trained on high quality, diverse, and extensive datasets is crucial. Professional-oriented tools provide regular updates and dataset expansions to help AI models adapt to new information and reduce inaccuracies,” it emphasizes.
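One widely studied way to incorporate external knowledge sources is retrieval-augmented generation (RAG), in which the model’s prompt is grounded in passages retrieved from a trusted corpus rather than relying on the model’s internal memory alone (cf. Soong et al., 2024). The sketch below is a minimal illustration only: the toy corpus, the bag-of-words retriever, and the generate() stub are assumptions made for demonstration, not any vendor’s implementation.

```python
# Minimal RAG sketch: retrieve the most relevant passages from a trusted
# corpus, then ground the LLM prompt in them. All names here are
# illustrative placeholders.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words term counts (toy retriever)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    return sorted(corpus, key=lambda doc: bow_cosine(query, doc), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for a call to any LLM API; simply echoes the prompt here."""
    return f"[LLM would answer, grounded in:]\n{prompt}"

corpus = [
    "Neil Armstrong was the first person to walk on the Moon in 1969.",
    "Yuri Gagarin was the first human in space in 1961.",
    "ISO 13485 specifies quality management requirements for medical devices.",
]

question = "Who was the first person to walk on the Moon?"
context = "\n".join(retrieve(question, corpus))
# Grounding the prompt in retrieved passages constrains the model to
# verifiable sources instead of its parametric memory alone.
print(generate(f"Context:\n{context}\n\nQuestion: {question}"))
```

In a production system the toy keyword retriever would be replaced by embedding-based search over a curated document store, and generate() by a call to an actual LLM, but the grounding principle is the same.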
Dr. Bernachon also identifies several new and relatively accurate AI-based scientific writing tools that offer various features aimed at improving clarity, structure, and originality in scientific documents. These include Jenni AI, Science42 Dora, and Scite AI.
“However, none of the tools identified guarantees 100% accuracy on their respective websites, and further exploration is needed to evaluate their claimed performance. As specified in multiple publications, in the current state of Large Language Model (LLM) development, it appears impossible to reduce AI hallucinations to zero,” she cautions.
Human oversight essential
In conclusion, she states: “Incorporating human review remains one of the most effective safeguards against AI hallucinations. Human fact-checkers can identify and correct inaccuracies that AI systems might miss, ensuring outputs meet reliability standards before publication or use. Particular attention should be paid to reference verification, which should always be conducted manually to ensure authenticity and accuracy.”
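As a complement to the manual reference verification the review calls for, obviously fabricated citations can often be pre-screened automatically. The sketch below is a hedged illustration that checks each DOI against the public Crossref REST API (api.crossref.org): a DOI that does not resolve is a strong signal of a fabricated reference, while a resolving DOI still requires a human to confirm that the source actually supports the claim it is cited for.

```python
# Hedged sketch: pre-screen cited DOIs against the public Crossref REST API
# before the manual verification step. This can flag fabricated references
# but does not replace human review of whether a real source supports the
# claim attributed to it.
import json
import urllib.request

def doi_resolves(doi: str) -> bool:
    """Return True if Crossref knows this DOI (i.e., the reference exists)."""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            record = json.load(resp)
        title = (record["message"].get("title") or ["<no title>"])[0]
        print(f"OK  {doi}: {title}")
        return True
    except Exception:
        # Crossref returns 404 for unknown DOIs, raised here as an error.
        print(f"??  {doi}: not found - possible fabricated reference")
        return False

# Example: one real DOI from this review's bibliography, one invented one.
for doi in ["10.2196/53164", "10.9999/does-not-exist"]:
    doi_resolves(doi)
```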
“While AI hallucinations cannot be entirely eliminated, continuous efforts to refine AI models and implement robust mitigation strategies are crucial for minimizing their adverse effects and enhancing the reliability of AI-generated content. In scientific contexts, using professional-oriented specialized tools is strongly recommended over general-purpose models. However, an objective quality measurement instrument is still needed to evaluate and compare different AI writing tools systematically,” says the review.
Bibliography
1. Athaluri, S. A., Manthena, S. V., Kesapragada, V. S. R. K. M., Yarlagadda, V., Dave, T., & Duddumpudi, R. T. S. (2023). Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus. https://doi.org/10.7759/cureus.37432
2. Buholayka, M., Zouabi, R., & Tadinada, A. (2023). Is ChatGPT Ready to Write Scientific Case Reports Independently? A Comparative Evaluation Between Human and Artificial Intelligence. Cureus. https://doi.org/10.7759/cureus.39386
3. Chelli, M., Descamps, J., Lavoué, V., Trojani, C., Azar, M., Deckert, M., Raynier, J., Clowez, G., Boileau, P., & Ruetsch-Chelli, C. (2024). Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. Journal of Medical Internet Research, 26, e53164. https://doi.org/10.2196/53164
4. Cheng, S., Tsai, S., Bai, Y., Ko, C., Hsu, C., Yang, F., Tsai, C., Tu, Y., Yang, S., Tseng, P., Hsu, T., Liang, C., & Su, K. (2023). Comparisons of Quality, Correctness, and Similarity Between ChatGPT-Generated and Human-Written Abstracts for Basic Research: Cross-Sectional Study. Journal of Medical Internet Research, 25, e51229. https://doi.org/10.2196/51229
5. De Wynter, A., Wang, X., Sokolov, A., Gu, Q., & Chen, S. (2023). An evaluation on large language model outputs: Discourse and memorization. Natural Language Processing Journal, 4, 100024. https://doi.org/10.1016/j.nlp.2023.100024
6. Hamilton, A. (2024). Artificial Intelligence and Healthcare Simulation: The Shifting Landscape of Medical Education. Cureus. https://doi.org/10.7759/cureus.59747
7. Hosseini, M., & Horbach, S. P. J. M. (2023). Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review. Research Integrity and Peer Review, 8(1). https://doi.org/10.1186/s41073-023-00133-5
8. Jamaluddin, J., Gaffar, N. A., & Din, N. S. S. (2023). Hallucination: A key challenge to Artificial Intelligence-Generated writing. Malaysian Family Physician, 18, 68. https://doi.org/10.51866/lte.527
9. Kwong, J. C. C., Wang, S. C. Y., Nickel, G. C., Cacciamani, G. E., & Kvedar, J. C. (2024). The long but necessary road to responsible use of large language models in healthcare research. npj Digital Medicine, 7(1). https://doi.org/10.1038/s41746-024-01180-y
10. Morreel, S., Verhoeven, V., & Mathysen, D. (2024). Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam. PLOS Digital Health, 3(2), e0000349. https://doi.org/10.1371/journal.pdig.0000349
11. Riedel, M., Kaefinger, K., Stuehrenberg, A., Ritter, V., Amann, N., Graf, A., Recker, F., Klein, E., Kiechle, M., Riedel, F., & Meyer, B. (2023). ChatGPT’s performance in German OB/GYN exams – paving the way for AI-enhanced medical education and clinical practice. Frontiers in Medicine, 10. https://doi.org/10.3389/fmed.2023.1296615
12. Sartori, G., & Orrù, G. (2023). Language models and psychological sciences. Frontiers in Psychology, 14. https://doi.org/10.3389/fpsyg.2023.1279317
13. Seth, I., Lim, B., Xie, Y., Cevik, J., Rozen, W. M., Ross, R. J., & Lee, M. (2023). Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study. Aesthetic Surgery Journal Open Forum, 5. https://doi.org/10.1093/asjof/ojad084
14. Siepmann, R., Huppertz, M., Rastkhiz, A., Reen, M., Corban, E., Schmidt, C., Wilke, S., Schad, P., Yüksel, C., Kuhl, C., Truhn, D., & Nebelung, S. (2024). The virtual reference radiologist: Comprehensive AI assistance for clinical image reading and interpretation. European Radiology. https://doi.org/10.1007/s00330-024-10727-2
15. Soong, D., Sridhar, S., Si, H., Wagner, J., Sá, A. C. C., Yu, C. Y., Karagoz, K., Guan, M., Kumar, S., Hamadeh, H., & Higgs, B. W. (2024). Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model. PLOS Digital Health, 3(8), e0000568. https://doi.org/10.1371/journal.pdig.0000568
16. Wong, R. S., Ming, L. C., & Ali, R. A. R. (2023). The Intersection of ChatGPT, Clinical Medicine, and Medical Education. JMIR Medical Education, 9, e47274. https://doi.org/10.2196/47274
About A-pharmaconsult SAS
Part of the A-consult group, A-pharmaconsult SAS specializes in regulatory affairs for medicines and medical devices.
A-pharmaconsult services include medical writing, non-clinical writing, technical writing, regulatory strategy, and quality systems. These offerings cover preparation of marketing authorization dossiers for drugs, technical documentation for medical devices, regulatory consultancy, and documentation for quality management systems (GMP, GDP, ISO 13485, and ISO 22716).
With both Danish and French origins, the consultancy has remained deeply focused on helping pharmaceutical, medical device, and cosmetics businesses operate in an increasingly complex regulatory environment, underpinned by core values of expertise, respect, and confidence.
A-pharmaconsult has built long-term partnerships across a broad range of companies, with a “tailor-made” offering to suit every size of company and its specific needs.
For more information, visit www.a-consult.com
Resources
See A-consult Regulatory Affairs Services for background information.