Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment (2024)

Janghwan Lee1*, Seongmin Park1*, Sukjin Hong1,2, Minsoo Kim1,
Du-Seong Chang2 and Jungwook Choi1†
1Hanyang University, 2KT
Seoul, Republic of Korea
1{hwanii0288, skstjdals, sjhong7898, minsoo2333}@hanyang.ac.kr
2{sukjin.hong, dschang}@kt.com, †choij@hanyang.ac.kr
*Equal contribution  †Corresponding author

Abstract

The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, post-training quantization (PTQ), a technique used to make LLMs computationally efficient, introduces challenges such as token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), that aligns quantized LLMs with their full-precision counterparts, improving conversational abilities. Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques, marking a significant step forward in the development of efficient and effective conversational LLMs.

1 Introduction

As large language models (LLMs) advance in understanding the context of language and generating relevant sentences, they are evolving into conversational chatbots that can naturally respond to a wide array of user requests OpenAI (2023); Chiang et al. (2023); Team et al. (2023); Touvron et al. (2023b). Particularly noteworthy is the remarkable ability of LLMs to follow user instructions and align with human values, such as providing helpful and engaging responses, through techniques like instruction tuning and reinforcement learning from human feedback (RLHF) Taori et al. (2023); Longpre et al. (2023); Chung et al. (2022); Mukherjee et al. (2023); Ouyang et al. (2022). These advancements have greatly enhanced the capability to fine-tune pre-trained LLMs for various tasks and user preferences.

For the effective implementation of LLM-based chatbots, addressing the computational complexity of LLMs is essential. Weight load overhead, a critical bottleneck in LLM deployment, has driven the development of weight quantization techniques such as post-training quantization (PTQ). PTQ reduces storage requirements by quantizing the weights of trained LLMs, thereby decreasing the bit count needed to store weight data Frantar et al. (2023); Lin et al. (2023). Techniques such as AWQ Lin et al. (2023) mitigate quantization-induced accuracy loss through methods like scaling the data distribution and updating weights to preserve accuracy. However, the effectiveness of these quantization strategies has been measured on task-dependent benchmarks of model accuracy rather than on multifaceted conversational quality.

[Figure 1]

Evaluating the conversational abilities of LLM-based chat assistants, especially on open-ended tasks requiring alignment with human preferences, challenges traditional score-based benchmarks because of the assistants' varied capabilities. To address this, new methods have been introduced for a more objective assessment of LLM chatbot performance Chiang et al. (2023); Zheng et al. (2023). The "LLM as a Judge" approach Zheng et al. (2023) employs advanced LLMs such as GPT-4 OpenAI (2023) to evaluate responsiveness in multi-turn conversations across eight conversational categories, focusing on conversational continuity and adherence to instructions. Furthermore, FLASK Ye et al. (2023) offers fine-grained evaluation criteria that dissect conversational skills linguistically. Yet these methods mainly target full-precision chatbots, leaving the performance of cost-efficient quantized LLM chatbots largely unexplored.

To assess the effect of quantization on the conversational abilities of LLM-based chatbots, we qualitatively compared the responses of quantized LLMs with a 16-bit baseline. Fig. 1 reveals that quantized models often fail to sustain engaging dialogue, producing repetitive phrases. We identify "token-flipping", a phenomenon in which quantization errors skew the token distribution and cause incorrect token selection, as a crucial factor behind this quality degradation. Traditional task-dependent evaluation metrics, such as Common Sense Question Answering (CSQA) Talmor et al. (2019) and Massive Multitask Language Understanding (MMLU) Hendrycks et al. (2020), may not fully detect these nuances. For example, as shown in Fig. 1(a) and (b), 16-bit and W4A16 inference exhibit similar task accuracy, yet W4A16 inference produces responses that are unhelpful to the user. This observation underscores the need for a new quantization approach that preserves user-perceived effectiveness beyond task-dependent benchmarks.

To address token-flipping in quantized LLMs, we propose a novel preference alignment method that aligns quantized LLMs with their full-precision counterparts. Drawing inspiration from direct preference optimization (DPO) strategies Rafailov et al. (2023); Liu et al. (2023a), our approach generates preference datasets directly from the quantized LLM and its full-precision counterpart, enabling quantization-aware optimization that adjusts weights to reflect these preferences. Our quantization-aware direct preference optimization (QDPO) method widens the margin between the top-1 and top-2 logits of the token distribution, reducing token-flipping and fostering more relevant and consistent text output. We rigorously tested QDPO on two instruction-tuned LLMs, Vicuna Zheng et al. (2023) and Mi:dm KT-AI (2023), assessing their conversational performance in both English and Korean. The results, illustrated in Fig. 1(c), demonstrate that QDPO markedly enhances conversational abilities beyond those achieved with established quantization techniques.

2 Background

2.1 Conversational Ability of LLM

In the pre-training phase, LLMs learn from a vast corpus of text data collected from various sources, including the internet, books, articles, and conversations Raffel et al. (2019); Zhu et al. (2015); Gao et al. (2020); Penedo et al. (2023). Through this process, they acquire extensive knowledge on a wide range of topics, which forms the foundation that enables LLMs to flexibly respond to diverse conversational subjects Zhang et al. (2022); Touvron et al. (2023a, b); Brown et al. (2020). Subsequently, LLMs develop the capability to follow instructions through instruction fine-tuning and learn to align with human preferences via RLHF Taori et al. (2023); Longpre et al. (2023); Chung et al. (2022); Mukherjee et al. (2023); Ouyang et al. (2022). Through such processes, LLM-based chatbots like GPT-4 OpenAI (2023) and Vicuna Chiang et al. (2023) have acquired the conversational ability to engage with humans on various topics over multiple turns, distinguishing them from conventional language models.

To evaluate LLM-based chatbots, it is essential to assess their conversational ability, which is their key capability. However, existing task-dependent benchmarks such as MMLU Hendrycks et al. (2020) and HELM Liang et al. (2023) do not adequately capture human preferences, rendering them insufficient for evaluating LLM-based chatbots. In response, new benchmarks such as MT-Bench Zheng et al. (2023) and FLASK Ye et al. (2023) have emerged, focusing on multi-turn questions or alignment with human preferences to evaluate conversational abilities effectively.

2.2 LLM Quantization

LLMs incur high serving costs due to their extensive number of parameters Brown et al. (2020). Weight quantization techniques Lin et al. (2023); Frantar et al. (2023); Lee et al. (2023a); Kim et al. (2023); Lee et al. (2023b) address this issue by representing the model's weights in lower bit-precision, thereby reducing memory size, lowering memory load time, and speeding up inference. Post-training quantization (PTQ) converts the model's weights directly to lower precision without additional training, offering cost benefits. To limit the resulting accuracy loss, PTQ methods such as AWQ Lin et al. (2023) use a small set of calibration samples to minimize layer-wise quantization error. Quantization-aware training (QAT), on the other hand, maintains the performance of a quantized model by applying quantization during the forward pass and training the model accordingly. When applying QAT to LLMs, because the ground-truth labels alone provide insufficient signal, techniques often rely on knowledge distillation (KD), reducing the distance between the logits of the quantized model and those of the full-precision model Kim et al. (2023); Liu et al. (2023b).
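For context on this KD-based QAT approach, which also serves as a baseline in our experiments, a minimal sketch of such a logit-distillation objective might look as follows; the function name and temperature handling are illustrative assumptions, and the exact objectives of the cited works may differ.

```python
import torch.nn.functional as F

def kd_logit_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Logit-level knowledge distillation: KL divergence between the quantized
    (student) model's and the full-precision (teacher) model's next-token
    distributions, averaged over all token positions in the batch."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)  # [tokens, vocab]
    p_teacher = F.softmax(teacher_logits / t, dim=-1)          # [tokens, vocab]
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```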

However, previous quantization studies have evaluated their methods on task-dependent benchmarks, which offer only a limited view of conversational ability. For example, AWQ Lin et al. (2023) emphasizes that the quantized model achieves accuracy comparable to the baseline on CSQA, but does not analyze why only 35% of the sentences generated by the quantized model are judged as good as those from the baseline in GPT-4 OpenAI (2023)'s assessment of conversational abilities. In this research, we analyze how quantization error impacts the conversational abilities of large language models and propose methods to enhance these capabilities.

2.3 Alignment with Human Preferences

RLHF is an advanced method for improving the performance of LLMs by aligning them with human preferences. It comprises three stages:

Supervised Fine-Tuning (SFT). SFT utilizes a dataset of human instructions to refine pre-trained LLMs.

Reward Modeling. This stage develops a reward model based on human preferences over pairs of LLM responses, using the Bradley-Terry (BT) model Bradley and Terry (1952) to quantify these preferences. The BT model represents the preference distribution p^* between y_1 and y_2 as

p^*(y_1 \succ y_2 \mid x) = \frac{e^{r^*(x, y_1)}}{e^{r^*(x, y_1)} + e^{r^*(x, y_2)}},   (1)

where r^* is the optimal reward function. With y_1 and y_2 assumed to be sampled from the optimal preference distribution p^* given prompt x, the parameterized reward model is estimated via maximum likelihood.
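As a concrete reading of Eq. (1), the BT preference probability reduces to a sigmoid of the reward difference, which is also the form used for maximum-likelihood reward-model fitting. A minimal sketch (function names are illustrative, not from any released implementation):

```python
import torch
import torch.nn.functional as F

def bt_preference_prob(r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry probability p*(y1 > y2 | x) = e^{r1} / (e^{r1} + e^{r2}),
    rewritten as sigmoid(r1 - r2) for numerical stability."""
    return torch.sigmoid(r1 - r2)

def reward_model_nll(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of observed preferences under the BT model,
    i.e., the maximum-likelihood objective for fitting the reward model."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```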

Policy Optimization. The LLM policy is optimized under the guidance of the reward model to generate responses that better align with human preferences for the training prompts. The reinforcement learning (RL) objective is defined as

\max_{\pi} \; \mathbb{E}_{x \sim D,\, y \sim \pi} \left[ r(x, y) \right] - \beta D_{\mathrm{KL}} \left[ \pi(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right],   (2)

where \pi represents the LLM policy and \beta is a control parameter that regulates deviation from \pi_{\text{ref}}. A recent approach Ouyang et al. (2022) employs Proximal Policy Optimization Schulman et al. (2017) for RL-based optimization, with the required reward derived from a previously trained reward model.

DPO Rafailov et al. (2023) aligns LLM policies with human preferences via supervised learning, leveraging Eq. (2) to relate the optimal reward directly to the optimal policy:

r^*(x, y) = \beta \log \left( \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} \right) + \beta \log Z(x),   (3)

where Z(x) is the partition function. Substituting this optimal reward into the BT-model objective defines the DPO loss as

\mathbb{E}_{x, y_w, y_l \sim D} \left[ -\log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_{\theta}(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right],   (4)

where \sigma is the logistic function.
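A minimal sketch of the DPO loss in Eq. (4), assuming the sequence-level log-probabilities of the chosen (y_w) and rejected (y_l) responses have already been computed by summing token log-probabilities under the trainable policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """DPO loss of Eq. (4). Each argument is a [batch] tensor of summed log-probs
    of the chosen (w) / rejected (l) response under the policy or reference model."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```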

SRO Liu et al. (2023a) criticizes the preference-sampling procedure of DPO. Sampling y_w and y_l from π* is the optimal way to estimate π_θ; however, all DPO experiments use preference pairs drawn not from π* but from π_ref, and the implications of this choice are under-explored. SRO addresses this by constructing an additional reward-ranking model to form preference pairs directly from an approximated optimal policy via statistical rejection sampling.

[Figure 2]

3 Conversational Abilities of Quantized LLMs

3.1 Observations

Recent advancements in PTQ have demonstrated that 4-bit quantized LLMs are effective for a variety of tasks, as evidenced by AWQ and OPTQ Lin et al. (2023); Frantar et al. (2023). However, our observations reveal that these quantized LLMs struggle to sustain engaging conversations, particularly in multi-turn chatbot interactions. For instance, Fig. 1 illustrates the contrast between the 16-bit baseline and 4-bit quantized LLMs in sentence generation. The baseline model begins its responses with "The circle of fifths is a musical diagram," providing relevant answers. In contrast, the 4-bit quantized model deviates at the seventh token, switching its focus from "musical" to "visual," and often generates limited and repetitive phrases. Although both models display similar task performance metrics, such as accuracy on multiple-choice benchmarks, there is a noticeable difference in the logit probability for the seventh token in the 4-bit model, causing the token to change from "musical" to "visual." This issue of altered text generation, observed across multiple examples (see A.9 for additional examples), prompts an investigation into its underlying causes.

3.2 Breakdown Analysis

To understand the cause of altered text generation in quantized LLMs, we examine how quantization impacts text production. We pinpoint the initial deviation to a flipped token and identify three contributing factors, as shown in Fig. 2(a):

  • Flipped token (T_F): Occurs when the quantized model selects a different token at timestep t = i than the baseline, altering the input for subsequent token generation and leading to deviations.

  • Perturbed KV cache (KV_Perturb): Despite identical token sequences up to timestep t = i−1, quantization errors already affect the Transformer's key-value caches, contributing to further deviations.

  • Quantization error in generation (Q_Error): Starting from timestep t = i+1, ongoing quantization errors continue to influence token generation, causing further divergence from the baseline.

Setup. To evaluate the impact of each identified factor, we analyze the eight possible scenarios shown in Fig. 2(b). Case 1, where all three factors are present, mirrors the standard text generation of a quantized LLM, whereas Case 8, devoid of these factors, corresponds to the baseline model's inference. For this analysis, we generate text using both 4-bit and 16-bit models with 1,000 instruction samples randomly chosen from the Alpaca dataset Taori et al. (2023). For each scenario, we record the first token at which text generation diverges between the two models, along with the key-value cache status up to that point. To quantify the deviation from the baseline text, we use ROUGE-L Lin (2004) as the metric (higher is better). More details of the implementation for the breakdown analysis are provided in A.2.

[Figure 3]

Results. To assess the contribution of each factor to deviations in sentence generation from the baseline, we contrast each scenario with Case 8. As depicted in Fig. 2(b), sentences become more divergent as additional factors are included. Specifically, Case 4 shows that the flipped token (T_F) significantly affects sentence variation, as indicated by the largest decrease in ROUGE scores. Conversely, the effects of the perturbed KV cache (KV_Perturb) and quantization error in generation (Q_Error) are comparatively minor. This pattern is further highlighted in Fig. 2(c), where ROUGE scores, sorted by sample, show that Cases 1-4 cluster on the right, signifying greater deviation. This suggests that even a single token difference, resulting from quantization-induced probability shifts, can substantially alter the overall sentence structure in quantized inference.

Ablation: Advanced Quantization. Advanced quantization techniques designed to minimize errors may not fully address the deviated text generation caused by flipped tokens. Recent PTQ methods that calibrate on a small sample set by scaling weight channels or adjusting quantization step sizes Frantar et al. (2023); Lin et al. (2023); Lee et al. (2023b) aim to lessen layer-specific quantization errors. However, our observations indicate that while these calibrated PTQ models reduce the effects of quantization error, they do not mitigate the issues stemming from flipped tokens. The case study of a 4-bit quantized model calibrated with AWQ Lin et al. (2023), shown in Fig. 2(d), reveals that although calibration decreases the impacts of KV_Perturb and Q_Error, sentence variation is still predominantly driven by T_F. A similar trend is observed for KD-based QAT (Fig. 8), highlighting the need for strategies that specifically address flipped tokens.

3.3 Why Does Token-Flipping Happen?

We hypothesize that token-flipping occurs because token distributions in sentence generation are inherently ambiguous at certain positions, which become prone to flipping when quantization errors introduce perturbations. To validate this empirically, Fig. 3(a) illustrates token-flipping during text generation by a quantized model, showing the probabilities of the top-1 and top-2 tokens throughout auto-regressive generation. Notably, the 16-bit baseline and 4-bit quantized models produce nearly identical probabilities for most tokens. However, at certain points (e.g., t = 0, 7, 11), the probability margin between the top-1 and top-2 tokens is minimal. Token-flipping occurs when quantization-induced deviations in the probability distribution exceed this narrow margin, altering subsequent sentence generation and leading to unnatural phrasing.

Fig. 3(b) shows the average probability margin between the top-1 and top-2 tokens across text samples. Feeding identical inputs to each model, we observe that the 4-bit quantized model has a narrower average margin between the top-1 and top-2 tokens than the 16-bit baseline. This indicates a higher likelihood that the 4-bit model experiences token-flipping, as quantization-error-induced deviations more easily exceed this margin. Additionally, our examination of beam search Graves (2012) in Section 5.5 reveals its limited effectiveness in mitigating this issue. This underscores the need for strategies that ensure the quantized model retains clear decision-making capabilities.
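The margin statistic in Fig. 3 can be computed directly from the per-step next-token logits collected during generation. A minimal sketch (names are illustrative):

```python
import torch

@torch.no_grad()
def top2_margin(logits: torch.Tensor) -> torch.Tensor:
    """Per-step probability margin between the top-1 and top-2 tokens.

    logits: [seq_len, vocab_size] next-token logits collected during generation.
    Returns a [seq_len] tensor; small margins mark positions that are prone to
    token-flipping once quantization perturbs the distribution.
    """
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(k=2, dim=-1).values   # [seq_len, 2]
    return top2[:, 0] - top2[:, 1]
```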

4 QDPO: Quantization-aware Direct Preference Optimization

As described in Section 3, quantization significantly degrades the conversational ability of LLMs. To address this issue, we introduce an algorithm named Quantization-aware Direct Preference Optimization (QDPO), which aims to align the conversational abilities of quantized LLMs with those of the LLMs prior to quantization. QDPO makes two main contributions: 1) an efficient method for generating the dataset D_QDPO without costly human annotations, and 2) a theoretical foundation that guarantees the automatic distinction of preferences during dataset generation.

4.1 Method

Drawing inspiration from the success of DPO in aligning LLMs with human preferences, we have developed a novel approach that extends its application to overcome the challenges introduced by quantization.

The challenge in preference dataset generation arises from human labeling. To mitigate this, we introduce a method for efficiently creating the dataset D_QDPO, which is composed of triplets {y_w, y_l, x}. Here, y_w denotes the response from the full-precision model π_fp, which is also referred to as the optimal policy; y_l represents the corresponding response from the quantized model π_q; and x serves as the prompt. Specifically, y_w is obtained as argmax_y π_fp(y|x) and y_l as argmax_y π_q(y|x). The preference of y_w over y_l is guaranteed by Theorem 4.1 (the proof is in A.8). Unlike conventional DPO methods, QDPO automatically distinguishes preferences without relying on expensive human-annotated datasets.

Theorem 4.1. For any response y in the set of all possible responses Y, if y_1 = argmax_{y∈Y} π_fp(y|x) and y_2 = argmax_{y∈Y} π_q(y|x), then it is guaranteed that p*(y_1 ≻ y_2) ≥ p*(y_2 ≻ y_1).

By precisely distinguishing preferences, we eliminate errors in data labeling, which improves performance by allowing accurate estimation of the policy model's density. The loss L_QDPO is defined over the high-quality preference data D_QDPO as follows:

\mathbb{E}_{x \sim \mathcal{D}_{\text{QDPO}},\, y_w \sim \pi_{\text{fp}},\, y_l \sim \pi_{\text{q}}} \left[ -\log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w|x)}{\pi_{\text{q}}(y_w|x)} - \beta \log \frac{\pi_{\theta}(y_l|x)}{\pi_{\text{q}}(y_l|x)} \right) \right].   (5)

[Figure 4]

Algorithm 1: QDPO

Input: prompts {x^1, x^2, ..., x^N}, full-precision policy π_fp, quantized policy π_q, KL penalty β
Output: updated policy π_θ

Initialize π_θ from π_q
Preference-pair dataset D_QDPO = ∅
for i = 1 to N do
    y_w^i = argmax_y π_fp(y | x^i)
    y_l^i = argmax_y π_q(y | x^i)
    Add pair {y_w^i, y_l^i, x^i} to D_QDPO
end for
for each pair {y_w, y_l, x} in D_QDPO do
    Compute L_QDPO from Eq. (5)
    Compute the gradient with respect to θ:
        ∂L_QDPO/∂θ = (∂L_QDPO/∂θ_q) · (∂θ_q/∂θ) ≈ ∂L_QDPO/∂θ_q   (STE)
    Update π_θ by minimizing L_QDPO
end for
return updated policy π_θ
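A minimal sketch of the preference-pair construction loop of Algorithm 1, assuming HuggingFace-style models and greedy decoding for both policies; function and field names are illustrative, not from the paper's released code.

```python
import torch

@torch.no_grad()
def build_qdpo_pairs(prompts, fp_model, q_model, tokenizer, max_new_tokens=256):
    """For each prompt, the greedy response of the full-precision model is the chosen
    answer (y_w) and the greedy response of the quantized model is the rejected
    answer (y_l), forming the QDPO preference dataset."""
    pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        y_w = fp_model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
        y_l = q_model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
        pairs.append({
            "prompt": prompt,
            "chosen": tokenizer.decode(y_w[0, prompt_len:], skip_special_tokens=True),
            "rejected": tokenizer.decode(y_l[0, prompt_len:], skip_special_tokens=True),
        })
    return pairs
```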

4.2 Implementation

Given π_θ as the quantized model's policy, integrating L_QDPO with QAT accounts for quantization effects during training. The quantization technique we employ uniformly quantizes each channel across its entire min-max range, covering the full spectrum of values within each channel. To overcome the non-differentiable rounding in the quantization process, we employ the straight-through estimator (STE) for gradient approximation, which enables stable training despite quantization. As shown in Fig. 4, QDPO exhibits an increasing chosen reward and a decreasing rejected reward throughout training, indicating effective loss convergence. The complete procedure of QDPO is described in Algorithm 1. Details of the training settings and hyperparameters for QDPO can be found in A.1.
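A minimal sketch of the per-channel min-max fake quantization with a straight-through estimator described above; this is a simplified PyTorch rendering under stated assumptions, not the authors' implementation.

```python
import torch

class ChannelWiseFakeQuant(torch.autograd.Function):
    """Per-channel min-max uniform (round-to-nearest) weight quantization with a
    straight-through estimator: the forward pass rounds to n_bits levels, the
    backward pass lets the gradient flow through unchanged."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, n_bits: int = 4):
        # One quantization range per output channel (row of the weight matrix).
        w_min = w.min(dim=1, keepdim=True).values
        w_max = w.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)
        return torch.round((w - w_min) / scale) * scale + w_min

    @staticmethod
    def backward(ctx, grad_output):
        # STE: d(loss)/dw ≈ d(loss)/dw_q; no gradient for n_bits.
        return grad_output, None
```

In such a setup, θ_q = ChannelWiseFakeQuant.apply(θ, 4) would replace each linear layer's weight in the forward pass, so the gradient of L_QDPO flows through to the underlying full-precision weights θ.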

5 Experiments

5.1 Experimental Settings

Models. We evaluate QDPO on two representative conversational LLMs: Vicuna-v1.5 Zheng et al. (2023), instruction-finetuned from LLaMA2 for improved conversational ability, and Mi:dm KT-AI (2023), a bilingual (English-Korean) LLM used to confirm QDPO's effectiveness across multiple languages. Both models have 7B parameters.

Benchmarks. For a comprehensive evaluation of conversational abilities, we employ three distinct benchmarks: MT-Bench Zheng et al. (2023), Vicuna-Eval Chiang et al. (2023), and FLASK Ye et al. (2023). MT-Bench uses GPT-4 to evaluate the quality of two responses obtained from an initial question and a follow-up question, offering an evaluation of multi-turn responses. For assessing Korean capability, we also translate the MT-Bench dataset into Korean using GPT-4. Vicuna-Eval consists of 80 questions evaluated by GPT-4 to determine which model generates better sentences. FLASK includes 1.7K samples designed to assess LLMs' fine-grained language skills, such as robustness and harmlessness.

Quantization Methods. To understand the impact of quantization on conversational abilities, we consider the following quantization methods:

  • Baseline: 16-bit floating-point weights

  • RTN Jacob et al. (2018): 4-bit round-to-nearest weight quantization

  • AWQ Lin et al. (2023): 4-bit RTN with weight scaling for improved quantization

  • KD Liu et al. (2023b): 4-bit quantization-aware training with a knowledge-distillation (KD) loss from the Baseline

  • QDPO (Ours): 4-bit RTN with QDPO for improved conversational abilities

Details of the experimental settings for each case can be found in A.1.

Table 1: Pairwise comparison on MT-Bench against the 16-bit baseline (GPT-4 as the judge).

Lang. | Model  | Method | Win | Tie | Lose | Lose-rate ↓
Eng   | Mi:dm  | RTN    | 24  | 6   | 66   | 0.69
      |        | AWQ    | 28  | 9   | 52   | 0.58
      |        | KD     | 31  | 16  | 52   | 0.53
      |        | QDPO   | 53  | 14  | 44   | 0.40
      | Vicuna | RTN    | 26  | 22  | 73   | 0.60
      |        | AWQ    | 39  | 22  | 47   | 0.44
      |        | QDPO   | 40  | 27  | 53   | 0.44
Kor   | Mi:dm  | RTN    | 29  | 7   | 55   | 0.60
      |        | AWQ    | 25  | 5   | 48   | 0.62
      |        | QDPO   | 45  | 4   | 61   | 0.55

Table 2: Single-answer grading of Mi:dm on MT-Bench by category (average GPT-4 rating, higher is better). The W4A16 columns correspond to RTN, AWQ, and QDPO inference.

Category   | 16-bit | W4A16 RTN | W4A16 AWQ | W4A16 QDPO
Writing    | 5.82   | 4.13      | 5.39      | 4.74
Roleplay   | 5.61   | 5.53      | 5.00      | 5.13
Reasoning  | 3.37   | 3.06      | 3.61      | 4.31
Math       | 1.71   | 1.45      | 1.60      | 1.40
Coding     | 1.11   | 1.56      | 1.16      | 2.28
Extraction | 3.63   | 2.56      | 3.50      | 3.08
STEM       | 5.24   | 4.39      | 4.68      | 5.69
Humanities | 6.26   | 5.75      | 5.00      | 5.63
Average    | 4.09   | 3.55      | 3.74      | 4.03

5.2 Experimental Results: MT-Bench

We evaluate quantized LLMs on MT-Bench to understand the impact of different quantization methods on conversational abilities. Following convention Zheng et al. (2023), we report both pairwise comparison and single-answer grading results (see A.4 for detailed evaluation metrics).

Pairwise Comparison. Table 1 shows the results of pairwise comparison on MT-Bench for various quantized LLMs. Each quantized LLM is compared against the Baseline (16-bit weights) by GPT-4 based on their multi-turn responses to questions from the various MT-Bench categories. We focus on the lose-rate, since our alignment objective is to make the answer quality of the quantized LLM superior to (win) or comparable with (tie) the 16-bit weight baseline. In most cases, RTN suffers a higher lose-rate than AWQ due to its simple quantization mechanism. However, the lose-rate of the same RTN model can be significantly improved by QDPO, which achieves the lowest lose-rate in all cases. We can further compare QDPO with KD, as both fine-tune the model weights to be quantization-friendly. Interestingly, QDPO outperforms KD with a noticeable increase in winning cases. These results show that QDPO can effectively align answer quality with that of the 16-bit weight baseline.

Single-Answer Grading. Table 2 presents the single-answer grading results of Mi:dm on MT-Bench across eight categories, each with 10 questions, reporting the average GPT-4 rating (higher is better). Across categories, RTN receives the lowest grades due to quantization errors, which AWQ improves only marginally. In contrast, QDPO significantly raises the average grade over RTN, achieving a score on par with the 16-bit weight baseline. This further highlights the effectiveness of QDPO in recovering conversational abilities. Details of the category-wise analysis can be found in A.5.

5.3 Experimental Results: Vicuna-Eval

[Figure 5]

Since Vicuna-Eval is a widely used benchmark for evaluating conversational abilities, we further employ it to evaluate QDPO. We take Mi:dm as the target language model, apply the different quantization methods, and evaluate performance on 80 questions judged by GPT-4. As shown in Fig. 5, the model with QDPO applied exhibits the most wins and the fewest losses, with a lose-rate of 50%, indicating a near recovery of the baseline model's language capabilities.

5.4 Experimental Results: FLASK

We use the FLASK benchmark on Mi:dm to verify how the proposed method enhances the fine-grained skills of the language model. Fig. 6 shows the relative performance of the different quantized LLMs across the 12 fine-grained skills. RTN significantly diminishes certain capabilities of the model, while AWQ and KD slightly improve performance toward the 16-bit weight baseline. In contrast, QDPO shows a significant enhancement in most skills; in particular, QDPO markedly improves metacognition, where RTN and AWQ fall significantly short. (Details of the skill-wise analysis can be found in A.6.) Overall, QDPO achieves the abilities closest to the 16-bit weight baseline, showcasing the effectiveness of its alignment objective in recovering conversational skills.

Table 3: Task accuracy and MT-Bench single-answer score of quantized LLMs with and without QDPO.

Method    | CSQA  | MMLU  | DROP  | BBH   | MT-Bench
Baseline  | 75.16 | 46.55 | 24.95 | 34.23 | 4.07
RTN       | 73.87 | 42.46 | 21.81 | 32.57 | 3.52
RTN+QDPO  | 73.11 | 42.69 | 21.50 | 32.05 | 3.96
AWQ       | 74.75 | 45.06 | 24.07 | 32.63 | 3.75
AWQ+QDPO  | 74.29 | 44.99 | 24.12 | 32.76 | 3.87

[Figure 6]

5.5 Ablation Studies

We further conduct ablation studies to provide insights on QDPO for improving the conversational skills of quantized LLMs.

Conversational Abilities vs. Task Accuracy. As discussed, QDPO's particular role is to align quantized LLMs with the 16-bit weight baseline. What is the impact of this alignment on the task-specific performance of LLMs? To answer this question, we further evaluate the quantized LLMs on well-known benchmarks that test the task-specific capabilities of language models. In particular, Common Sense Question Answering (CSQA) Talmor et al. (2019) and Massive Multitask Language Understanding (MMLU) Hendrycks et al. (2020) assess the models' reasoning and multitask-solving abilities through multiple-choice questions, while DROP Dua et al. (2019) and BBH Srivastava et al. (2023) evaluate the problem-solving abilities of instruction-tuned models in logic and math. (Details of the task-specific benchmarks are in A.7.) Table 3 compares task accuracy (CSQA, MMLU, DROP, BBH) as well as conversational ability (MT-Bench) of the quantized LLM with and without QDPO. RTN degrades both task accuracy and conversational ability. Interestingly, AWQ significantly improves task accuracy while only marginally improving conversational ability. Meanwhile, QDPO improves conversational ability while mostly preserving task accuracy, showcasing its usefulness.

Conversational Abilities vs. Perplexity.

Table 4: Perplexity (PPL) and MT-Bench lose-rate of quantized LLMs.

Language | Model  | Method          | PPL ↓ | Lose-rate ↓
English  | Mi:dm  | 16-bit Baseline | 13.12 | -
         |        | RTN             | 15.16 | 0.69
         |        | AWQ             | 14.23 | 0.58
         |        | QDPO            | 15.55 | 0.40
         | Vicuna | 16-bit Baseline | 6.78  | -
         |        | RTN             | 7.53  | 0.60
         |        | AWQ             | 7.34  | 0.44
         |        | QDPO            | 7.36  | 0.44
Korean   | Mi:dm  | 16-bit Baseline | 5.71  | -
         |        | RTN             | 6.52  | 0.60
         |        | AWQ             | 5.97  | 0.62
         |        | QDPO            | 6.56  | 0.55

Perplexity is a key metric for evaluating language models, measuring the exponentiated average negative log probability of predicted word sequences. We examine whether the conversational capabilities enhanced through QDPO are also reflected in perplexity by comparing perplexity against the MT-Bench lose-rate. We measure perplexity on WikiText-2 Merity et al. (2016) for English and on the Korean textbooks dataset (https://huggingface.co/datasets/maywell/korean_textbooks) for Korean. As shown in Table 4, RTN significantly increases perplexity across all models. While AWQ decreases perplexity in all models, it does not guarantee an improvement in conversational ability. For example, on Mi:dm's Korean benchmark, AWQ reduces perplexity by 0.55 compared to RTN, yet the lose-rate increases by 2%. On the other hand, QDPO significantly enhances conversational ability even though it does not achieve a perplexity as low as the baseline's. We believe this discrepancy between perplexity and conversational ability arises because next-word-prediction perplexity on reference text cannot capture the impact of flipped tokens in auto-regressive generation; as discussed in Sec. 3.2, these tokens contribute substantially to sentence variation.
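For reference, the perplexity reported here corresponds to the exponentiated mean token-level negative log-likelihood. A minimal sketch using a HuggingFace causal LM (chunking of the evaluation text into fixed-length windows is omitted):

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level negative log-likelihood) on reference text.

    input_ids: [1, seq_len] token ids of, e.g., a WikiText-2 chunk. Passing the ids
    as labels makes the HF causal LM return the mean shifted cross-entropy loss.
    """
    out = model(input_ids, labels=input_ids)
    return torch.exp(out.loss).item()
```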

QDPO vs. Beam Search.

Table 5: MT-Bench pairwise comparison under different beam sizes.

Method | Number of Beams | Win | Tie | Lose | Lose-rate ↓
RTN    | 1               | 24  | 6   | 66   | 0.69
AWQ    | 1               | 28  | 9   | 52   | 0.58
       | 3               | 38  | 9   | 50   | 0.52
       | 5               | 35  | 10  | 61   | 0.58
QDPO   | 1               | 53  | 14  | 44   | 0.40

Beam search Graves (2012) generates higher-probability outcomes by considering multiple generation hypotheses simultaneously. Therefore, even if the quantized model makes judgments that differ from the baseline and strongly influence sentence generation, it may still produce unproblematic outputs when the overall sequence probability is considered. We observe how decoding strategies affect quantized generation across three beam sizes (1, 3, 5). Table 5 shows the results of the MT-Bench pairwise comparison for each decoding strategy. With a beam size of 3, generating a more diverse range of sentences slightly reduces the lose-rate, yet many losses remain, and increasing the beam size further does not fundamentally solve the problem, as it also increases the number of losses. In contrast, QDPO achieves a lower lose-rate by making the model robust to quantization.

6 Conclusion

In this work, we address the conversational abilities of quantized LLM-based chatbots. After identifying token-flipping as a crucial factor for degraded text generation quality, we propose a novel quantization-aware direct preference optimization (QDPO) method that effectively aligns quantized and full-precision LLMs, enhancing conversational performance. Tested across multiple languages and models, QDPO outperforms traditional fine-tuning techniques, setting a new benchmark for conversational chatbot development.

Acknowledgement

This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2020-II201373, Artificial Intelligence Graduate School Program (Hanyang University), and 2022-0-00971, Logic Synthesis for NVM-based PIM Computing Architecture) and National Research Foundation of Korea (NRF) (No. RS-2023-00260527). This work was also partly supported by Artificial Intelligence Industrial Convergence Cluster Development Project, funded by the Ministry of Science and ICT (MSIT, Korea) & Gwangju Metropolitan City, and the research fund of Hanyang University (HY-201900000002966).

Limitations

Our objective is to use DPO to realign the language capabilities of a baseline model that are distorted by quantization error. We focus on scenarios where quantization error does not completely ruin conventional benchmark performance yet introduces subtle differences in language capability that are perceptible to humans. Hence, we do not address situations where large quantization errors severely degrade model performance, nor cases using fine-grained quantization where the quantization error is minimal. From a practical standpoint, however, reducing the inference cost of LLMs by transitioning to lower bit-precision remains necessary, and this process should consider various techniques, including group quantization. Additionally, since our approach aligns the baseline model with a relatively minimal training process, there are limitations in utilizing extensive datasets. Nonetheless, the impact of different datasets when aligning the baseline model at a limited number of bits remains an intriguing topic.

References

  • Askell etal. (2021)Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021.A general language assistant as a laboratory for alignment.
  • Bai etal. (2022)Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, etal. 2022.Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862.
  • Bisk etal. (2019)Yonatan Bisk, Rowan Zellers, RonanLe Bras, Jianfeng Gao, and Yejin Choi. 2019.Piqa: Reasoning about physical commonsense in natural language.
  • Bradley and Terry (1952)RalphAllan Bradley and MiltonE Terry. 1952.Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345.
  • Brown etal. (2020)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.Language models are few-shot learners.In Advances in Neural Information Processing Systems, volume33, pages 1877–1901. Curran Associates, Inc.
  • Chiang etal. (2023)Wei-Lin Chiang, Zhuohan Li, ZiLin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE. Gonzalez, Ion Stoica, and EricP. Xing. 2023.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Chung etal. (2022)HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, etal. 2022.Scaling instruction-finetuned language models.arXiv preprint arXiv:2210.11416.
  • Clark etal. (2019)Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019.BoolQ: Exploring the surprising difficulty of natural yes/no questions.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dettmers etal. (2023)Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023.QLoRA: Efficient finetuning of quantized LLMs.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Dua etal. (2019)Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019.DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Frantar etal. (2023)Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023.OPTQ: Accurate quantization for generative pre-trained transformers.In The Eleventh International Conference on Learning Representations.
  • Gao etal. (2020)Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, etal. 2020.The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027.
  • Graves (2012)Alex Graves. 2012.Sequence transduction with recurrent neural networks.arXiv preprint arXiv:1211.3711.
  • Hendrycks etal. (2020)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020.Measuring massive multitask language understanding.CoRR, abs/2009.03300.
  • Hu etal. (2022)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen. 2022.LoRA: Low-rank adaptation of large language models.In International Conference on Learning Representations.
  • Jacob etal. (2018)Benoit Jacob, Skirmantas Kligys, BoChen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018.Quantization and training of neural networks for efficient integer-arithmetic-only inference.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713.
  • Kim etal. (2023)Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, and Jungwook Choi. 2023.Token-scaled logit distillation for ternary weight generative language models.In Thirty-seventh Conference on Neural Information Processing Systems.
  • KT-AI (2023)KT-AI. 2023.Mi:dm-7b.
  • Lee etal. (2023a)Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. 2023a.Owq: Lessons learned from activation outliers for weight quantization in large language models.
  • Lee etal. (2023b)Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Hwang, Wonyong Sung, and Jungwook Choi. 2023b.Enhancing computation efficiency in large language models through weight and activation quantization.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14726–14739, Singapore. Association for Computational Linguistics.
  • Liang etal. (2023)Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, CeZhang, Christian Cosgrove, ChristopherD. Manning, Christopher Ré, Diana Acosta-Navas, DrewA. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, SangMichael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023.Holistic evaluation of language models.
  • Lin (2004)Chin-Yew Lin. 2004.ROUGE: A package for automatic evaluation of summaries.In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Lin etal. (2023)JiLin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023.Awq: Activation-aware weight quantization for llm compression and acceleration.arXiv.
  • Liu etal. (2023a)Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, PeterJ Liu, and Jialu Liu. 2023a.Statistical rejection sampling improves preference optimization.arXiv preprint arXiv:2309.06657.
  • Liu etal. (2023b)Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023b.Llm-qat: Data-free quantization aware training for large language models.
  • Longpre etal. (2023)Shayne Longpre, LeHou, TuVu, Albert Webson, HyungWon Chung, YiTay, Denny Zhou, QuocV. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023.The flan collection: Designing data and methods for effective instruction tuning.
  • Merity etal. (2016)Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016.Pointer sentinel mixture models.
  • Mukherjee etal. (2023)Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023.Orca: Progressive learning from complex explanation traces of gpt-4.
  • OpenAI (2023)OpenAI. 2023.Gpt-4 technical report.arXiv preprint arXiv:2303.08774.
  • Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal. 2022.Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744.
  • Penedo etal. (2023)Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023.The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only.arXiv preprint arXiv:2306.01116.
  • Rafailov etal. (2023)Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, ChristopherD Manning, and Chelsea Finn. 2023.Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290.
  • Raffel etal. (2019)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2019.Exploring the limits of transfer learning with a unified text-to-text transformer.arXiv e-prints.
  • Roemmele etal. (2011)Melissa Roemmele, CosminAdrian Bejan, and AndrewS. Gordon. 2011.Choice of plausible alternatives: An evaluation of commonsense causal reasoning.In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI.
  • Sakaguchi etal. (2019)Keisuke Sakaguchi, RonanLe Bras, Chandra Bhagavatula, and Yejin Choi. 2019.Winogrande: An adversarial winograd schema challenge at scale.
  • Schulman etal. (2017)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017.Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347.
  • Srivastava etal. (2023)Aarohi Srivastava etal. 2023.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research.
  • Talmor etal. (2019)Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019.Commonsenseqa: A question answering challenge targeting commonsense knowledge.
  • Taori etal. (2023)Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and TatsunoriB. Hashimoto. 2023.Stanford alpaca: An instruction-following llama model.https://github.com/tatsu-lab/stanford_alpaca.
  • Team etal. (2023)Gemini Team, Rohan Anil, etal. 2023.Gemini: A family of highly capable multimodal models.
  • Touvron etal. (2023a)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal. 2023a.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971.
  • Touvron etal. (2023b)Hugo Touvron etal. 2023b.Llama 2: Open foundation and fine-tuned chat models.
  • Ye etal. (2023)Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2023.Flask: Fine-grained language model evaluation based on alignment skill sets.
  • Zellers etal. (2019)Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019.Hellaswag: Can a machine really finish your sentence?CoRR, abs/1905.07830.
  • Zhang etal. (2022)Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, XiVictoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, PunitSingh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022.Opt: Open pre-trained transformer language models.
  • Zheng etal. (2023)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, JosephE. Gonzalez, and Ion Stoica. 2023.Judging LLM-as-a-judge with MT-bench and chatbot arena.In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Zhu etal. (2015)Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015.Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.In The IEEE International Conference on Computer Vision (ICCV).

Appendix A Appendix

A.1 Experimental Details

PTQ Calibration Settings. For PTQ calibration, we use the widely adopted AWQ Lin et al. (2023) method, with a calibration set of 64 samples randomly extracted from the C4 Raffel et al. (2019) dataset. We apply channel-wise quantization and do not consider fine-grained quantization (e.g., group quantization), in order to better observe the impact of quantization on the LLM's conversational abilities.

Knowledge Distillation Settings. For the KD setting, we follow the KD method introduced in LLM-QAT Liu et al. (2023b), excluding the data curation process. To facilitate a fair comparison with QDPO, we extract 50,000 prompts from the Anthropic Helpful and Harmless dialogue dataset Bai et al. (2022) and set the learning rate to 3e-6.

Training Settings. In our QDPO experiments, similar to the KD setting, we sample 50,000 prompts in English from the Anthropic Helpful and Harmless dialogue dataset and 21,155 prompts in Korean from the KoAlpaca dataset (https://huggingface.co/datasets/beomi/KoAlpaca-v1.1a). We collect responses using both the full-precision policy (π_fp) and the quantized policy (π_q) to construct a preference-pair dataset. The learning rate is set to 3e-6.

A.2 More Details on Breakdown Analysis

To separate the causes of errors in text generation, we employ the following steps:

  • We provide the same input to both the baseline and quantized models, observe the first 100 generated tokens, and find the timestep at which the first differing token is generated between the two models. We dump these differing tokens, which are the flipped tokens.

  • We dump the KV cache of both the baseline and quantized models up to that timestep. This is facilitated easily through HuggingFace's (https://github.com/huggingface/transformers) past_key_values argument.

  • Based on the dumped flipped tokens and KV caches, we continue generation with either the baseline or the quantized model, depending on whether we intend to reflect Q_Error.
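A minimal sketch of this bookkeeping with HuggingFace models, returning the first divergent timestep together with each model's past_key_values up to that point (simplified: batch size 1, greedy decoding, illustrative names):

```python
import torch

@torch.no_grad()
def first_flipped_token(fp_model, q_model, input_ids, max_steps=100):
    """Greedy-decode with both models step by step; return the first timestep at
    which they choose different tokens, the two tokens, and each model's KV cache."""
    fp_past, q_past = None, None
    fp_in = q_in = input_ids                      # full prompt on the first step
    for t in range(max_steps):
        fp_out = fp_model(fp_in, past_key_values=fp_past, use_cache=True)
        q_out = q_model(q_in, past_key_values=q_past, use_cache=True)
        fp_tok = fp_out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        q_tok = q_out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if fp_tok.item() != q_tok.item():
            return t, fp_tok, q_tok, fp_out.past_key_values, q_out.past_key_values
        fp_past, q_past = fp_out.past_key_values, q_out.past_key_values
        fp_in = q_in = fp_tok                     # tokens identical so far; feed only the new one
    return None
```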

[Figure 7]

[Figure 8]

[Figure 9]

A.3 QDPO’s Compatibility with Existing Techniques

QDPO on RLHF-tuned Models. We conduct additional experiments to investigate whether QDPO can serve as a complementary approach to recover conversational ability in quantized RLHF-tuned models, such as LLaMA2-Chat Touvron et al. (2023b); the results are in Table 6. For LLaMA2-Chat, which has improved conversational abilities by reflecting human preferences via RLHF, W4A16 RTN exhibits a 50% lose-rate against the baseline model, and QDPO recovers further conversational ability. With more aggressive 3-bit quantization, the trend is clearer: RTN experiences a rapid decline in conversational ability, whereas QDPO significantly reduces the lose-rate by restoring the conversational ability of the baseline model. This demonstrates that QDPO effectively enhances the conversational ability of a quantized RLHF-tuned model and is compatible with existing RLHF pipelines.

Table 6: MT-Bench pairwise comparison of quantized LLaMA2-Chat against its 16-bit baseline.

Bit-precision | Method | Win | Tie | Lose | Lose-rate ↓
W4A16         | RTN    | 40  | 20  | 60   | 50.00%
              | QDPO   | 38  | 26  | 55   | 46.22%
W3A16g128     | RTN    | 29  | 18  | 76   | 61.79%
              | QDPO   | 35  | 22  | 65   | 53.28%

QDPO with Memory-Efficient Fine-Tuning Method. We extend our experiments with QDPO training using LoRA Hu et al. (2022). Following the approach in QLoRA Dettmers et al. (2023), we keep the quantized base weights frozen and train only the high-precision adapter. To ensure a fair comparison with other methods, we use INT4 for the base weights instead of NF4 Dettmers et al. (2023). The adapter rank and α used in this experiment are both 64. As shown in Table 7, QDPO with LoRA significantly enhances the conversational ability of quantized LLMs, achieving levels nearly identical to those of QDPO. Moreover, QDPO with LoRA reduces the required memory by keeping the quantized base weights fixed and significantly decreases the number of training parameters by training only the LoRA adapter. These results suggest that QDPO serves as a complementary method that can be utilized alongside other techniques.

Table 7: MT-Bench pairwise comparison and training cost of QDPO with LoRA.

Method    | Win | Tie | Lose | Lose-rate ↓ | # Trainable params | Required Memory for Training | Inference bit-width
RTN       | 24  | 6   | 66   | 0.69        | -                  | -                            | W4A16
AWQ       | 28  | 9   | 52   | 0.58        | -                  | -                            | W4A16
KD        | 31  | 16  | 52   | 0.53        | 7.02 B             | 56.16 GB                     | W4A16
QDPO      | 53  | 14  | 44   | 0.40        | 7.02 B             | 56.16 GB                     | W4A16
QDPO+LoRA | 48  | 14  | 46   | 0.43        | 1.33 B             | 14.15 GB                     | W16A16

A.4 MT-Bench Evaluation Metrics

Table 8: MT-Bench pairwise comparison using the original MT-Bench tie-counting criteria.

Lang. | Model  | Method | Win | Tie | Lose | Lose-rate
Eng   | Mi:dm  | RTN    | 18  | 103 | 55   | 0.31
      |        | AWQ    | 23  | 110 | 46   | 0.26
      |        | KD     | 22  | 115 | 40   | 0.23
      |        | QDPO   | 43  | 118 | 38   | 0.19
      | Vicuna | RTN    | 20  | 95  | 61   | 0.35
      |        | AWQ    | 31  | 107 | 46   | 0.25
      |        | QDPO   | 33  | 113 | 44   | 0.23
Kor   | Mi:dm  | RTN    | 20  | 103 | 45   | 0.27
      |        | AWQ    | 20  | 111 | 37   | 0.22
      |        | QDPO   | 39  | 109 | 40   | 0.21

Pairwise Comparison. In the pairwise comparison on MT-Bench over 80 samples, GPT-4 judges which of the two models provides the better response. However, because most LLMs tend to prefer the first position Zheng et al. (2023), each pair is evaluated twice with the order reversed, and a victory is counted only if one model wins in both orders; if the judgments reverse or both evaluations result in ties, it counts as an actual tie. We observe that GPT-4 declares "tie" more often than usual in MT-Bench comparisons between different models. This increased frequency of ties arises because our study compares similar models (the baseline model and its quantized version). We find that cases judged as a tie in both positions present many obstacles to the alignment-oriented evaluation we intend. For example, as shown in Fig. 12 and Fig. 13, when the baseline model provides an incorrect answer and the quantized model also offers a wrong (but different) answer, GPT-4 declares a "tie" because both are incorrect. However, this "tie" does not reflect our goal of assessing how well the two models are aligned. Therefore, we count a tie only in cases where the win/lose judgment changes when positions are swapped, i.e., where position bias confuses GPT-4. We find this evaluation method to be the most consistent with the results of other benchmarks, such as Vicuna-Eval Chiang et al. (2023). Results obtained using the original MT-Bench evaluation criteria are in Table 8.
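The tie-handling rule described above can be summarized by the following sketch of the counting logic; the verdict labels are illustrative, and the handling of mixed win/tie verdicts follows our reading of the criterion.

```python
def judge_pair(verdict_ab: str, verdict_ba: str):
    """Aggregate two GPT-4 verdicts for the same question, evaluated once in each
    position order. 'model' = the quantized model, 'base' = the 16-bit baseline,
    'tie' = GPT-4 declared a tie. Consistent verdicts count as win/lose, a verdict
    that flips with position counts as a tie, and double ties are excluded."""
    if verdict_ab == verdict_ba:
        if verdict_ab == "tie":
            return None                              # double tie: excluded from the tally
        return "win" if verdict_ab == "model" else "lose"
    return "tie"                                     # judgment changed with position
```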

Single-Answer Grading. In single-answer grading, we directly ask GPT-4 to assign a score of up to 10 points to each response. While this approach is less nuanced than pairwise comparison for contrasting models, measuring absolute scores per category lets us observe how quantization changes performance in the specific categories where a model is strong or weak.

A.5 Detailed Analysis by Category in MT-Bench

As shown in Table 2, QDPO improves overall capability compared to other methods. However, we observe that in some categories QDPO scores lower than the AWQ model. We examined GPT-4's evaluations more closely in the areas where QDPO underperforms. Interestingly, as depicted in Fig. 14, QDPO fails in cases where RTN already provides a good response, nearly identical to the baseline generation, and receives a high score. We believe this occurs because QDPO is trained to reject sentences generated by the quantized model, which can create optimization challenges when those sentences are already good. Additional examples are presented in Fig. 15.

A.6 Skill-wise Analysis in FLASK

Category       | 16-bit Baseline | W4A16 RTN | W4A16 AWQ | W4A16 KD | W4A16 QDPO
Robustness     | 2.029 | 1.839 | 1.927 | 1.830 | 1.924
Correctness    | 2.237 | 2.087 | 2.254 | 2.206 | 2.172
Efficiency     | 2.333 | 1.988 | 2.036 | 2.036 | 2.129
Factuality     | 2.709 | 2.487 | 2.497 | 2.631 | 2.691
Commonsense    | 2.965 | 2.735 | 2.925 | 2.953 | 2.961
Comprehension  | 2.874 | 2.639 | 2.725 | 2.831 | 2.879
Insightfulness | 2.268 | 2.339 | 2.079 | 2.095 | 2.246
Completeness   | 2.858 | 2.587 | 2.518 | 2.666 | 2.784
Metacognition  | 2.891 | 2.562 | 2.625 | 2.663 | 2.863
Readability    | 4.079 | 4.047 | 3.956 | 3.989 | 4.070
Conciseness    | 3.881 | 3.695 | 3.886 | 3.782 | 3.785
Harmlessness   | 4.447 | 4.500 | 4.355 | 4.512 | 4.575
Average        | 2.964 | 2.792 | 2.815 | 2.849 | 2.923

We aim to investigate how QDPO recovers the skills that, according to FLASK's fine-grained categorization, significantly underperform under RTN and AWQ compared to the baseline. As shown in Fig. 17, RTN selects "<[!newline]>" instead of ":", so the subsequent generation degenerates into a bare list, and the sentences become increasingly repetitive as they lengthen. In contrast, the model trained with QDPO follows the baseline and provides an explanation for each item.

A.7 Details of Task-Specific Benchmarks

To assess the reasoning capabilities of large language models (LLMs), benchmarks such as Common Sense Question Answering (CSQA) Talmor et al. (2019) and MMLU Hendrycks et al. (2020) have been widely utilized. CSQA assesses models' reasoning abilities through multiple-choice questions, while MMLU verifies models' multitask-solving capabilities across 57 different tasks, also with multiple-choice questions. Recently, benchmarks like DROP Dua et al. (2019) and BBH Srivastava et al. (2023) have been used to evaluate the problem-solving abilities of instruction-tuned models, testing skills in logic and math. Additionally, the Helpful, Honest, and Harmless (HHH) Askell et al. (2021) benchmark is widely used to assess the extent to which these models are safe and beneficial to humans. In our experiments, we measure zero-shot common-sense QA performance and report the average accuracy across five tasks (WinoGrande Sakaguchi et al. (2019), COPA Roemmele et al. (2011), PIQA Bisk et al. (2019), BoolQ Clark et al. (2019), HellaSwag Zellers et al. (2019)).
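As an illustration of how such a zero-shot average can be computed, the sketch below assumes the lm-evaluation-harness (lm_eval) 0.4-style Python interface; the task names, model path, and metric keys are assumptions and may differ from our actual evaluation pipeline.

```python
# Hypothetical zero-shot common-sense evaluation sketch with lm-evaluation-harness.
# Task names, model arguments, and metric keys are assumptions, not our exact setup.
from lm_eval import simple_evaluate

tasks = ["winogrande", "copa", "piqa", "boolq", "hellaswag"]
results = simple_evaluate(
    model="hf",
    model_args="pretrained=lmsys/vicuna-7b-v1.5,dtype=float16",  # illustrative checkpoint
    tasks=tasks,
    num_fewshot=0,  # zero-shot, as reported above
)

# Average accuracy across the five tasks (metric key varies by harness version).
accs = [results["results"][t].get("acc,none", results["results"][t].get("acc"))
        for t in tasks]
print(f"Average zero-shot accuracy: {sum(accs) / len(accs):.4f}")
```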

A.8 Proof of Theorem 4.1

Theorem 4.1. For any response $y$ in the set of all possible responses $Y$, if $y_1 = \arg\max_{y \in Y} \pi_{\text{fp}}(y \mid x)$ and $y_2 = \arg\max_{y \in Y} \pi_{\text{q}}(y \mid x)$, then it is guaranteed that $p^*(y_1 \succ y_2) \geq p^*(y_2 \succ y_1)$.

Proof.

The definition of $\arg\max$ ensures that for all $y \in Y$, $\pi_{\text{fp}}(y_1 \mid x) \geq \pi_{\text{fp}}(y \mid x)$ and $\pi_{\text{q}}(y_2 \mid x) \geq \pi_{\text{q}}(y \mid x)$ hold. Consequently, $\pi_{\text{fp}}(y_1 \mid x) \geq \pi_{\text{fp}}(y_2 \mid x)$ and $\pi_{\text{q}}(y_2 \mid x) \geq \pi_{\text{q}}(y_1 \mid x)$.
Substituting Eq. (3) into Eq. (1), we obtain:

$$
\begin{aligned}
p^*(y_1 \succ y_2 \mid x)
&= \frac{1}{1+\exp\!\left(\beta\log\frac{\pi_{\text{fp}}(y_2\mid x)}{\pi_{\text{q}}(y_2\mid x)}-\beta\log\frac{\pi_{\text{fp}}(y_1\mid x)}{\pi_{\text{q}}(y_1\mid x)}\right)} \\
&= \sigma\!\left(\beta\log\frac{\pi_{\text{fp}}(y_1\mid x)}{\pi_{\text{q}}(y_1\mid x)}-\beta\log\frac{\pi_{\text{fp}}(y_2\mid x)}{\pi_{\text{q}}(y_2\mid x)}\right) \\
&= \sigma\!\left(\beta\left(\log\frac{\pi_{\text{fp}}(y_1\mid x)}{\pi_{\text{fp}}(y_2\mid x)}-\log\frac{\pi_{\text{q}}(y_1\mid x)}{\pi_{\text{q}}(y_2\mid x)}\right)\right)
\end{aligned}
\tag{6}
$$

Since $\pi_{\text{fp}}(y_1 \mid x) \geq \pi_{\text{fp}}(y_2 \mid x)$ and $\pi_{\text{q}}(y_2 \mid x) \geq \pi_{\text{q}}(y_1 \mid x)$, the term $\log\frac{\pi_{\text{fp}}(y_1 \mid x)}{\pi_{\text{fp}}(y_2 \mid x)} - \log\frac{\pi_{\text{q}}(y_1 \mid x)}{\pi_{\text{q}}(y_2 \mid x)}$ is nonnegative, and because $\beta$ is positive, it follows that $p^*(y_1 \succ y_2 \mid x) \geq 0.5$. Consequently, $p^*(y_1 \succ y_2) \geq p^*(y_2 \succ y_1)$. ∎
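As a quick numerical sanity check of the inequality just derived, the following sketch plugs toy probabilities into Eq. (6); the distributions and the value of $\beta$ are illustrative only.

```python
# Toy numerical check of Theorem 4.1: with y1 the argmax of the full-precision
# policy and y2 the argmax of the quantized policy, p*(y1 > y2 | x) >= 0.5.
import math

beta = 0.1
pi_fp = {"y1": 0.6, "y2": 0.3, "y3": 0.1}   # illustrative full-precision policy
pi_q  = {"y1": 0.2, "y2": 0.7, "y3": 0.1}   # illustrative quantized policy

def pref_prob(a, b):
    # sigma(beta * (log pi_fp(a)/pi_fp(b) - log pi_q(a)/pi_q(b))), as in Eq. (6)
    z = beta * (math.log(pi_fp[a] / pi_fp[b]) - math.log(pi_q[a] / pi_q[b]))
    return 1.0 / (1.0 + math.exp(-z))

print(pref_prob("y1", "y2"))  # ~0.55 here, >= 0.5 as the theorem guarantees
```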

A.9 Generation Examples

Fig. 10 illustrates how generating tokens different from the baseline degrades generation quality. The baseline model selects "Wear" after "1.", whereas the PTQ model, whose token probability ranking has shifted, chooses "Always." The PTQ model then repeats this word, producing phrasing that feels awkward to humans. In contrast, QDPO recovers probabilities close to those of the baseline model and thus continues with a natural generation.
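The flip described above can be reproduced schematically: the toy logits below are illustrative (not extracted from the actual models) and only show how a small quantization-induced perturbation changes the greedy top-1 token.

```python
# Toy illustration of token flipping: a small logit perturbation from weight
# quantization swaps the greedy (top-1) next-token choice.
vocab = ["Wear", "Always", "Keep"]   # hypothetical candidate tokens
fp_logits = [2.10, 2.05, 1.20]       # illustrative full-precision logits
q_logits = [fp + e for fp, e in zip(fp_logits, [-0.04, 0.03, 0.00])]  # small quantization error

def greedy(logits):
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

print(greedy(fp_logits))  # "Wear"   (baseline choice after "1.")
print(greedy(q_logits))   # "Always" (flipped choice under PTQ)
```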

[Figures 10–17: example generations and GPT-4 judgments referenced above; images omitted.]


References
