Empowering Parameter-Efficient Transfer Learning by Recognizing the Kernel Structure in Attention

Y Chen*, Devamanyu Hazarika*, M Namazifar, Y Liu, D Jin, D Hakkani-Tur, NAACL 2022 *Equal Contribution


The massive amount of trainable parameters in pre-trained language models (PLMs) makes them hard to deploy to multiple downstream tasks. To address this issue, parameter-efficient transfer learning methods have been proposed to tune only a few parameters during fine-tuning while freezing the rest. This paper looks at existing methods along this line through the kernel lens. Motivated by the connection between self-attention in transformer-based PLMs and kernel learning, we propose kernel-wise adapters, namely Kernel-mix, that utilize the kernel structure in self-attention to guide the assignment of the tunable parameters. These adapters use guidelines found in classical kernel learning and enable separate parameter tuning for each attention head. Our empirical results, over a diverse set of natural language generation and understanding tasks, show that our proposed adapters can attain or improve the strong performance of existing baselines.
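The connection the abstract alludes to can be made concrete: a softmax self-attention head computes, for each query, a kernel-weighted (Nadaraya-Watson style) average of the value vectors, where the unnormalized kernel is exp(q·k / sqrt(d)). The sketch below illustrates only this attention-as-kernel view, not the paper's Kernel-mix adapters themselves:

```python
import math

def attention_as_kernel_smoothing(q, keys, values, d):
    """One query's attention output, written as kernel smoothing:
    weights come from an exponential kernel of the query-key dot product."""
    # Unnormalized exponential kernel k(q, k_i) = exp(q . k_i / sqrt(d))
    scores = [math.exp(sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d))
              for k in keys]
    z = sum(scores)                      # softmax denominator (normalizer)
    weights = [s / z for s in scores]    # the usual attention weights
    # Output = kernel-weighted average of the value vectors
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0], [4.0]]
weights, out = attention_as_kernel_smoothing(q, keys, values, d=2)
```

The query is more similar to the first key, so the output is pulled toward the first value vector; per-head adapters in this view amount to tuning each head's kernel separately.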

Analyzing Modality Robustness in Multimodal Sentiment Analysis

Devamanyu Hazarika*, Y Li*, B Cheng, S Zhao, R Zimmermann, S Poria, NAACL 2022 *Equal Contribution

[Paper] [Code]

Building robust multimodal models is crucial for achieving reliable deployment in the wild. Despite its importance, little attention has been paid to identifying and improving the robustness of Multimodal Sentiment Analysis (MSA) models. In this work, we hope to address that by (i) proposing simple diagnostic checks for modality robustness in a trained multimodal model. Using these checks, we find MSA models to be highly sensitive to a single modality, which creates issues in their robustness; (ii) analyzing well-known robust training strategies to alleviate the issues. Critically, we observe that robustness can be achieved without compromising the original performance. We hope our extensive study, performed across five models and two benchmark datasets, and proposed procedures will make robustness an integral component in MSA research. Our diagnostic checks and robust training solutions are simple to implement and available at https://github.com/declare-lab/MSA-Robustness
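One diagnostic in this spirit ablates each modality in turn (e.g., zeroing its features) and measures how much the model's output shifts; a disproportionately large shift for one modality signals over-reliance on it. The exact checks are in the linked repository; the following is only an illustrative sketch with a hypothetical toy fusion model:

```python
def fusion_model(text, audio, visual):
    """Toy late-fusion sentiment scorer standing in for a trained MSA model.
    Weights are deliberately text-dominated to illustrate over-reliance."""
    return 0.8 * sum(text) + 0.1 * sum(audio) + 0.1 * sum(visual)

def modality_sensitivity(model, text, audio, visual):
    """Zero out each modality in turn and record the output shift."""
    base = model(text, audio, visual)
    zeros = lambda feats: [0.0] * len(feats)
    return {
        "text":   abs(base - model(zeros(text), audio, visual)),
        "audio":  abs(base - model(text, zeros(audio), visual)),
        "visual": abs(base - model(text, audio, zeros(visual))),
    }

sens = modality_sensitivity(fusion_model, text=[0.5, 0.5], audio=[0.3], visual=[0.2])
```

Here the text ablation moves the prediction far more than the other two, which is exactly the kind of single-modality sensitivity the diagnostic is meant to surface.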

So Different Yet So Alike! Constrained Unsupervised Text Style Transfer (Oral)

AR Kashyap*, Devamanyu Hazarika*, M-Y Kan, R Zimmermann, S Poria, ACL 2022 *Equal Contribution


Automatic transfer of text between domains has become popular in recent times. One of its aims is to preserve the semantic content while adapting to the target domain. However, it does not explicitly maintain other attributes between the source and translated text: e.g., text length and descriptiveness. Maintaining constraints in transfer has several downstream applications, including data augmentation and debiasing. We introduce a method for such constrained unsupervised text style transfer by introducing two complementary losses to the generative adversarial network (GAN) family of models. Unlike the competing losses used in GANs, we introduce cooperative losses where the discriminator and the generator cooperate and reduce the same loss. The first is a contrastive loss and the second a classification loss, both aiming to regularize the latent space further and bring similar sentences closer together. We demonstrate that such training retains lexical, syntactic and domain-specific constraints between domains for multiple benchmark datasets, including ones where more than one attribute changes. We show that the complementary cooperative losses improve text quality, according to both automated and human evaluation measures.
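The contrastive part of such a cooperative objective can be sketched in the standard InfoNCE form: pull the latent code of a positive (similar) sentence toward the anchor and push negatives away, with both generator and discriminator minimizing the same quantity rather than opposing each other. This is a generic illustration, not the paper's exact loss:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss over latent codes: low when the positive is close
    to the anchor and the negatives are far. In a cooperative setup, both
    networks reduce this same loss instead of competing."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    negs = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))

anchor = [1.0, 0.0]     # latent code of a source sentence
close  = [0.9, 0.1]     # a similar sentence's code
far    = [0.0, 1.0]     # a dissimilar sentence's code
loss_good = contrastive_loss(anchor, close, negatives=[far])
loss_bad  = contrastive_loss(anchor, far, negatives=[close])
```

Minimizing this loss draws similar sentences together in latent space, which is the regularization effect the abstract describes.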

Attention Biasing and Context Augmentation for Zero-Shot Control of Encoder-Decoder Transformers for Natural Language Generation

Devamanyu Hazarika, M Namazifar, and Dilek Hakkani-Tur, AAAI 2022

[Paper] [Arxiv]

Controlling neural network-based models for natural language generation (NLG) to realize desirable attributes in the generated outputs has broad applications in numerous areas such as machine translation, document summarization, and dialog systems. Approaches that enable such control in a zero-shot manner would be of great importance as, among other reasons, they remove the need for additional annotated data and training. In this work, we propose novel approaches for controlling encoder-decoder transformer-based NLG models in a zero-shot fashion. While zero-shot control has previously been observed in massive models (e.g., GPT3), our method enables such control for smaller models. This is done by applying two control knobs, attention biasing and context augmentation, to these models directly during decoding, without additional training or auxiliary models. These knobs control the generation process by directly manipulating trained NLG models (e.g., biasing cross-attention layers). We show that not only are these NLG models robust to such manipulations, but their behavior can also be controlled without an impact on their generation performance.
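The attention-biasing knob can be pictured as adding a constant to the cross-attention logits of chosen source positions before the softmax, which shifts probability mass toward those positions at decoding time. A minimal sketch of that mechanism (the positions and bias value here are illustrative, not from the paper):

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def biased_attention_weights(scores, bias_positions, bias=2.0):
    """Add a constant bias to the cross-attention logits of selected source
    positions before the softmax, steering decoding toward them."""
    biased = [s + (bias if i in bias_positions else 0.0)
              for i, s in enumerate(scores)]
    return softmax(biased)

scores = [0.2, 0.5, 0.1, 0.4]            # raw cross-attention logits for 4 source tokens
plain = softmax(scores)
steered = biased_attention_weights(scores, bias_positions={2})
```

Because the bias is applied at decoding time only, no retraining or auxiliary model is needed; the weakly attended position 2 becomes the dominant one.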

Analyzing the Domain Robustness of Pretrained Language Models, Layer by Layer

AR Kashyap, L Mehnaz, B Malik, A Waheed, Devamanyu Hazarika, M-Y Kan, and R Shah, Adapt-NLP, EACL 2021


The robustness of pretrained language models (PLMs) is generally measured using performance drops on two or more domains. However, we do not yet understand the inherent robustness achieved by contributions from different layers of a PLM. We systematically analyze the robustness of these representations layer by layer from two perspectives. First, we measure the robustness of representations by using domain divergence between two domains. We find that i) Domain variance increases from the lower to the upper layers for vanilla PLMs; ii) Models continuously pretrained on domain-specific data (DAPT) (Gururangan et al., 2020) exhibit more variance than their pretrained PLM counterparts; and that iii) Distilled models (e.g., DistilBERT) also show greater domain variance. Second, we investigate the robustness of representations by analyzing the encoded syntactic and semantic information using diagnostic probes. We find that similar layers have similar amounts of linguistic information for data from an unseen domain.

Domain Divergences: a Survey and Empirical Analysis

AR Kashyap, Devamanyu Hazarika, Min-Yen Kan, and R Zimmermann, NAACL 2021


Domain divergence plays a significant role in estimating the performance of a model in new domains. While there is significant literature on divergence measures, researchers find it hard to choose an appropriate divergence for a given NLP application. We address this shortcoming by both surveying the literature and through an empirical study. We develop a taxonomy of divergence measures consisting of three classes (Information-theoretic, Geometric, and Higher-order measures) and identify the relationships between them. Further, to understand the common use-cases of these measures, we recognize three novel applications, 1) Data Selection, 2) Learning Representation, and 3) Decisions in the Wild, and use them to organize our literature. From this, we identify that Information-theoretic measures are prevalent for 1) and 3), and Higher-order measures are more common for 2). To further help researchers choose appropriate measures to predict drops in performance, an important aspect of Decisions in the Wild, we perform correlation analysis spanning 130 domain adaptation scenarios, 3 varied NLP tasks, and 12 divergence measures identified from our survey. To calculate these divergences, we consider current contextual word representations (CWR) and contrast them with older distributed representations. We find that traditional measures over word distributions still serve as strong baselines, while higher-order measures with CWR are effective.
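A representative "traditional measure over word distributions" from the information-theoretic class is the Jensen-Shannon divergence between the unigram distributions of two domains. A minimal sketch (the toy corpora are invented for illustration):

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1]) between two
    distributions given as {word: probability} dicts."""
    vocab = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(w, 0.0) * math.log2(a.get(w, 0.0) / b[w])
                   for w in vocab if a.get(w, 0.0) > 0.0)
    # Mixture distribution M = (P + Q) / 2
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def unigram_dist(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

src = unigram_dist("the drug reduced tumor growth".split())
tgt = unigram_dist("the film was a huge hit".split())
jsd = js_divergence(src, tgt)
```

With almost no vocabulary overlap between the two toy domains, the divergence is close to its maximum of 1, while identical domains score 0; this is the kind of quantity the survey's correlation analysis relates to downstream performance drops.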