[2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models - discu.eu

Reddit

[D] RLHF Preference Tuning: How Things May Go Wrong https://arxiv.org/abs/2307.15043 3 comments 3/8/2023 machinelearning

Linking pages

Why Are LLMs So Gullible? - by Steve - Am I Stronger Yet? https://amistrongeryet.substack.com/p/why-are-llms-so-gullible 101 comments
AI chatbots can fall for prompt injection attacks, leaving you vulnerable - The Washington Post https://www.washingtonpost.com/technology/2023/11/02/prompt-injection-ai-chatbot-vulnerability-jailbreak/ 10 comments
Researchers find 'universal' jailbreak prompts for multiple AI chat models | SC Media https://www.scmagazine.com/news/researchers-find-universal-jailbreak-prompts-for-multiple-ai-chat-models 1 comment
GitHub - llm-attacks/llm-attacks: Universal and Transferable Attacks on Aligned Language Models https://github.com/llm-attacks/llm-attacks 0 comments
Adversarial Attacks on LLMs | Lil'Log https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/ 0 comments
Model alignment protects against accidental harms, not intentional ones https://www.aisnakeoil.com/p/model-alignment-protects-against 0 comments
Improving LLM Security Against Prompt Injection: AppSec Guidance For Pentesters and Developers - Include Security Research Blog https://blog.includesecurity.com/2024/01/improving-llm-security-against-prompt-injection-appsec-guidance-for-pentesters-and-developers/ 0 comments
Dual Use Foundation Artificial Intelligence Models with Widely Available Model Weights | National Telecommunications and Information Administration https://www.ntia.gov/federal-register-notice/2024/dual-use-foundation-artificial-intelligence-models-widely-available 0 comments
Making a SOTA Adversarial Attack on LLMs 38x Faster | Haize Labs Blog 🕊️ https://blog.haizelabs.com/posts/acg/ 0 comments
