全部 标题 作者
关键词 摘要

OALib Journal期刊
ISSN: 2333-9721
费用:99美元

查看量下载量

相关文章

更多...

GUARDIAN: A Multi-Tiered Defense Architecture for Thwarting Prompt Injection Attacks on LLMs

DOI: 10.4236/jsea.2024.171003, PP. 43-68

Keywords: Large Language Models (LLMs), Adversarial Attack, Prompt Injection, Filter Defense, Artificial Intelligence, Machine Learning, Cybersecurity

Full-Text   Cite this paper   Add to My Lib

Abstract:

This paper introduces a novel multi-tiered defense architecture to protect language models from adversarial prompt attacks. We construct adversarial prompts using strategies like role emulation and manipulative assistance to simulate real threats. We introduce a comprehensive, multi-tiered defense framework named GUARDIAN (Guardrails for Upholding Ethics in Language Models) comprising a system prompt filter, pre-processing filter leveraging a toxic classifier and ethical prompt generator, and pre-display filter using the model itself for output screening. Extensive testing on Meta’s Llama-2 model demonstrates the capability to block 100% of attack prompts. The approach also auto-suggests safer prompt alternatives, thereby bolstering language model security. Quantitatively evaluated defense layers and an ethical substitution mechanism represent key innovations to counter sophisticated attacks. The integrated methodology not only fortifies smaller LLMs against emerging cyber threats but also guides the broader application of LLMs in a secure and ethical manner.

References

[1]  Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Xiang, J., Xu, K.P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S. and Scialom, T. (2023) Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
[2]  Phute, M., et al. (2023) LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. arXiv preprint arXiv:2308.07308.
[3]  Liu, Y.P., et al. (2023) Prompt Injection Attacks and Defenses in LLM-Integrated Applications. arXiv preprint arXiv:2310.12815v1.
[4]  Robey, A., et al. (2023) SmoothLLM: Defending Large Language Models against Jailbreaking Attacks. arXiv preprint arXiv:2310.03684.
[5]  Cao, B.C., et al. (2023) Defending against Alignment-Breaking Attacks via Robustly Aligned LLM. arXiv preprint arXiv:2309.14348.
[6]  Chen, B.C. et al. (2023) Jailbreaker in Jail: Moving Target Defense for Large Language Models. MTD’23: Proceedings of the 10th ACM Workshop on Moving Target Defense, November 2023, 29-32.
https://doi.org/10.1145/3605760.3623764
[7]  Wei, A., et al. (2023) Jailbroken: How Does LLM Safety Training Fail? arXiv Preprint arXiv:2307.02483.
[8]  Kumar, A., et al. (2023) Certifying LLM Safety against Adversarial Prompting. arXiv Preprint arXiv:2309.02705.
[9]  Mozes, M., et al. (2023) Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities. arXiv preprint arXiv:2308.12833.
[10]  Wu, F.Z., Xie, Y.Q., Yi, J.W., et al. (2023) Defending ChatGPT against Jailbreak Attack via Self-Reminder.
https://doi.org/10.21203/rs.3.rs-2873090/v1
[11]  Deng, G., et al. (2023) MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv preprint arXiv:2307.08715.
[12]  Rao, A., et al. (2023) Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. arXiv preprint arXiv:2305.14965.
[13]  Shen, X.Y., et al. (2023) Do Anything Now: Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models. arXiv Preprint arXiv:2308.03825.
[14]  Wulczyn, E., Thain, N. and Dixon, L. (2017) Ex Machina: Personal Attacks Seen at Scale. In: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Perth, 1391-1399.
https://doi.org/10.1145/3038912.3052591
[15]  Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 4171-4186.
[16]  Tunstall, L., Beeching, E., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A.M. and Wolf, T. (2023) Zephyr: Direct Distillation of LM Alignment. arXiv preprint arXiv:2310.16944.
[17]  (2023) LM Studio—Discover, Download, and Run Local LLMs.
https://lmstudio.ai/
[18]  Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T. and El Sayed, W. (2023) Mistral 7B. arXiv preprint arXiv:2310.06825.

Full-Text

comments powered by Disqus

Contact Us

service@oalib.com

QQ:3279437679

WhatsApp +8615387084133

WeChat 1538708413