Integrating Retrieval-Augmented Generation with Large Language Model Mistral 7b for Indonesian Medical Herb
DOI:
https://doi.org/10.14421/jiska.2024.9.3.230-243
Keywords:
LLM, Generative AI, LLAMA2, Retrieval-Augmented Generation, Deep Learning
Abstract
Large Language Models (LLMs) are advanced artificial intelligence systems that use deep learning, particularly transformer architectures, to process and generate text. One such model, Mistral 7b, has 7 billion parameters and is optimized for high performance and efficiency in natural language processing tasks. It outperforms similar models, such as LLaMa2 7b and LLaMa 1, across various benchmarks, especially in reasoning, mathematics, and coding. LLMs have also demonstrated significant advances in answering medical queries. This research draws on Indonesia’s rich biodiversity, which includes approximately 9,600 medicinal plant species out of 30,000 known species. The study is motivated by the observation that LLMs such as ChatGPT and Gemini often rely on internet data of uncertain validity and frequently give generic answers that do not mention specific herbal plants found in Indonesia. To address this, the knowledge base for the model is built from academic journals on Indonesian medicinal herbal plants. The research process involves collecting these journals, preprocessing them with Langchain, embedding the text with sentence transformers, and using Faiss CPU for efficient similarity search. Retrieval-Augmented Generation (RAG) is then applied to Mistral 7b, so that its responses to user queries are grounded in the dataset. The model's performance is assessed with human evaluation, with ROUGE (reported as recall, precision, and F1 measure), and with METEOR. The results show that the RAG Mistral 7b model achieved a METEOR score of 0.22%, outperforming the LLaMa2 7b model, which scored 0.14%.
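The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a toy bag-of-words vector stands in for the sentence-transformer embeddings, a brute-force cosine ranking stands in for the Faiss index, and the prompt is only assembled, not sent to Mistral 7b. The passages, the names `embed`, `retrieve`, and `build_prompt`, and the prompt wording are all hypothetical placeholders.

```python
import math
import re
from collections import Counter

# Placeholder passages (invented for illustration, NOT from the study's corpus).
PASSAGES = [
    "Herb A leaves are brewed as a tea for cough relief.",
    "Herb B bark is applied to the skin for minor wounds.",
    "Herb C rhizome is taken after meals for digestion.",
]


def embed(text):
    """Toy bag-of-words vector; assumption: a real system would use a sentence-transformer model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query (brute force; stand-in for a Faiss search)."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, context):
    """Assemble a prompt that asks the generator to answer from the retrieved context only."""
    bullet_list = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{bullet_list}\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    question = "Which herb helps with cough?"
    top_passages = retrieve(question, PASSAGES, k=2)
    print(build_prompt(question, top_passages))
```

In a full pipeline the assembled prompt would be passed to the language model, which is what lets the answer draw on the curated journal text rather than on the model's general training data.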
License
Copyright (c) 2024 Diash Firdaus, Idi Sumardi, Yuni Kulsum
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.