Tutorials, September 17-18, 2015
September 17 - Morning
- Semantic Similarity Frontiers: From Concepts to Documents – David Jurgens and Mohammad Taher Pilehvar
- Personality Research for NLP – Yair Neuman
September 17 - Afternoon
- Transparent Machine Learning for Information Extraction: State-of-the-Art and the Future – Laura Chiticariu, Yunyao Li and Frederick Reiss
- Knowledge Acquisition for Web Search – Marius Pasca
September 18 - Morning
- Learning Semantic Relations from Text – Preslav Nakov, Vivi Nastase, Diarmuid Ó Séaghdha and Stan Szpakowicz
- Applications of Social Media Text Analysis – Atefeh Farzindar and Diana Inkpen
September 18 - Afternoon
- Robust Semantic Analysis of Multiword Expressions with FrameNet – Miriam R. L. Petruck and Valia Kordoni
- Computational Analysis of Affect and Emotion in Language – Saif Mohammad and Cecilia Ovesdotter Alm
Semantic Similarity Frontiers: From Concepts to Documents
David Jurgens and Mohammad Taher Pilehvar
September 17, 2015 - Morning - Small Auditorium
Tutorial materials: PDF with slides
Semantic similarity forms a central component in many NLP systems, from lexical semantics, to part-of-speech tagging, to social media analysis. Recent years have seen renewed interest in developing new similarity techniques, buoyed in part by work on embeddings and by SemEval tasks in Semantic Textual Similarity and Cross-Level Semantic Similarity. This increased interest has led to hundreds of techniques for measuring semantic similarity, making it difficult for practitioners to identify which state-of-the-art techniques are applicable and easily integrated into their projects, and for researchers to identify which aspects of the problem require further research.
This tutorial synthesizes the current state of the art in measuring semantic similarity for all types of conceptual or textual pairs and presents a broad overview of current techniques, the resources they use, and the particular inputs or domains to which the methods are most applicable. We survey methods ranging from corpus-based approaches operating on massive or domain-specific corpora to those leveraging structural information from expert-based or collaboratively-constructed lexical resources. Furthermore, we review work on multiple similarity tasks, from sense-based comparisons to word-, sentence-, and document-level comparisons, and highlight general-purpose methods capable of comparing multiple types of inputs. Where possible, we also identify techniques that have been demonstrated to operate successfully in multilingual or cross-lingual settings.
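As a minimal illustration of the corpus-based family of approaches surveyed here, the sketch below compares words by the cosine similarity of their embedding vectors. The toy four-dimensional vectors are invented for illustration only; real systems learn vectors with hundreds of dimensions from large corpora.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; stand-ins for vectors learned from a corpus.
embeddings = {
    "cat": [0.9, 0.1, 0.3, 0.0],
    "dog": [0.8, 0.2, 0.4, 0.1],
    "car": [0.0, 0.9, 0.1, 0.8],
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower
```

Under this measure, words that occur in similar contexts end up with nearby vectors, so "cat" scores much closer to "dog" than to "car".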
Our tutorial provides practitioners who need out-of-the-box solutions with a clear overview of currently-available tools and their strengths, and provides researchers with an understanding of the limitations of the current state of the art and of the open problems that remain in the field. Given the breadth of available approaches, participants will also receive a detailed bibliography of approaches (including those not directly covered in the tutorial), annotated according to each approach's abilities, together with pointers to where open-source implementations of the algorithms may be obtained.
- David Jurgens, Postdoctoral Scholar, McGill University
David Jurgens is a postdoctoral scholar at McGill University. He received his PhD from the University of California, Los Angeles. His research interests include lexical semantics, word sense disambiguation, latent attribute inference, and the relationship between language and location. He is currently co-chairing the 2015 and 2016 International Workshops on Semantic Evaluation (SemEval). His research has been featured in the MIT Technology Review, Forbes, Business Insider, and Schneier on Security.
- Mohammad Taher Pilehvar, Postdoctoral Scholar, Sapienza University of Rome
Mohammad Taher Pilehvar is a postdoctoral scholar at Sapienza University of Rome. He received his PhD from the same university under the supervision of Roberto Navigli. He does research in multiple areas of Lexical Semantics such as semantic similarity and Word Sense Disambiguation (WSD). His main focus is on unified graph-based semantic similarity measures and large-scale frameworks for the evaluation of WSD systems. He has co-organized a task on Cross-Level Semantic Similarity at SemEval-2014 (Jurgens et al., 2014) and is the first author of a paper on semantic similarity that was nominated for the best paper award at ACL 2013 (Pilehvar et al., 2013).
Personality Research for NLP
Yair Neuman
September 17, 2015 - Morning - Room 2
Tutorial materials: PDF with slides
"Personality" is a psychological concept describing an individual's characteristic patterns of thought, emotion, and behavior. In the context of Big Data and granular analytics, measuring an individual's personality dimensions is highly important, as these may be used in various practical applications. However, personality has traditionally been studied through questionnaires and other low-tech methodologies. The availability of textual data and the development of powerful NLP technologies invite the challenge of automatically measuring personality dimensions for applications ranging from granular analytics of customers to the forensic identification of potential offenders. While there are emerging attempts to address this challenge, these attempts focus almost exclusively on one theoretical model of personality and on classification tasks, which are limited when tagged data are not available.
The major aim of the tutorial is to provide NLP researchers with an introduction to personality theories that may broaden the scope of their research. Two secondary aims are to survey some recent directions in computational personality and to point to future directions in which the field may develop (e.g. Textual Entailment for Personality Analytics).
- Yair Neuman, Professor, Ben-Gurion University of the Negev
Prof. Yair Neuman (Ben-Gurion Univ. of the Negev) is the co-director of the Behavioral Insights Research Lab at the University of Toronto and a senior fellow at the Brain Sciences Foundation. Among his fields of interest are the interface of NLP and psychology and the development of novel cognitive-psychological technologies.
Transparent Machine Learning for Information Extraction: State-of-the-Art and the Future
Laura Chiticariu, Yunyao Li and Frederick Reiss
September 17, 2015 - Afternoon - Small Auditorium
Tutorial materials: PDF with slides
The rise of Big Data analytics over unstructured text has led to renewed interest in information extraction (IE). These applications need effective IE as a first step towards solving end-to-end real-world problems (e.g. in biology, medicine, finance, and media and entertainment). Much recent NLP research has focused on addressing specific IE problems using a pipeline of multiple machine learning techniques. This approach requires an analyst with the expertise to answer questions such as: “What ML techniques should I combine to solve this problem?”; “What features will be useful for the composite pipeline?”; and “Why is my model giving the wrong answer on this document?”. The need for this expertise creates problems in real-world applications: it is very difficult in practice to find an analyst who both understands the real-world problem and has deep knowledge of applied machine learning. As a result, the real impact of current IE research does not match up to the abundant opportunities available.
In this tutorial, we introduce the concept of transparent machine learning. A transparent ML technique is one that:
- produces models that a typical real-world user can read and understand;
- uses algorithms that a typical real-world user can understand; and
- allows a real-world user to adapt models to new domains.
The tutorial is aimed at IE researchers in both the academic and industry communities who are interested in developing and applying transparent ML.
- Laura Chiticariu, Researcher, IBM Research – Almaden
Laura Chiticariu is a Research Staff Member at IBM Research – Almaden. She received her Ph.D from U.C. Santa Cruz in 2008. Her current research focuses on improving developmental support in information extraction systems.
- Yunyao Li, Researcher, IBM Research – Almaden
Yunyao Li is a Research Staff Member and Research Manager at IBM Research – Almaden. She received her Ph.D from the University of Michigan, Ann Arbor in 2007. She is particularly interested in designing, developing and analyzing large scale systems that are usable by a wide spectrum of users. Towards this direction, her current research focuses on enterprise-scale natural language processing.
- Frederick Reiss, Researcher, IBM Research – Almaden
Frederick Reiss is a Research Staff Member at IBM Research – Almaden. He received his Ph.D. from U.C. Berkeley in 2006. His research focuses on improving the scalability of text analytics in enterprise applications.
Knowledge Acquisition for Web Search
Marius Pasca
September 17, 2015 - Afternoon - Room 2
Tutorial materials: PDF with slides
The identification of textual items, or documents, that best match a user’s information need, as expressed in search queries, forms the core functionality of information retrieval systems. Well-known challenges are associated with understanding the intent behind user queries; and, more importantly, with matching inherently-ambiguous queries to documents that may employ lexically different phrases to convey the same meaning. The conversion of semi-structured content from Wikipedia and other resources into structured data produces knowledge potentially more suitable to database-style queries and, ideally, to use in information retrieval. In parallel, the availability of textual documents on the Web enables an aggressive push towards the automatic acquisition of various types of knowledge from text. Methods developed under the umbrella of open-domain information extraction acquire open-domain classes of instances and relations from Web text. The methods operate over unstructured or semi-structured text available within collections of Web documents, or over relatively more intriguing streams of anonymized search queries. Some of the methods import the automatically-extracted data into human-generated resources, or otherwise exploit existing human-generated resources. In both cases, the goal is to expand the coverage of the initial resources, thus providing information about more of the topics that people in general, and Web search users in particular, may be interested in.
- Marius Pasca, Research Scientist, Google
Marius Pasca is a research scientist at Google. Current research interests include the acquisition of factual information from unstructured text within documents and queries, and its applications to Web search.
Learning Semantic Relations from Text
Preslav Nakov, Vivi Nastase, Diarmuid Ó Séaghdha and Stan Szpakowicz
September 18, 2015 - Morning - Small Auditorium
Tutorial materials: PDF with slides
Every non-trivial text describes interactions and relations between people, institutions, activities, events and so on. What we know about the world consists in large part of such relations, and that knowledge contributes to the understanding of what texts refer to. Newly found relations can in turn become part of this knowledge that is stored for future use.
To grasp a text’s semantic content, an automatic system must be able to recognize relations in texts and reason about them. This may be done by applying and updating previously acquired knowledge. We focus here in particular on semantic relations which describe the interactions among nouns and compact noun phrases, and we present such relations from both a theoretical and a practical perspective. The theoretical exploration sketches the historical path which has brought us to the contemporary view and interpretation of semantic relations. We discuss a wide range of relation inventories proposed by linguists and by language processing people. Such inventories vary by domain, granularity and suitability for downstream applications.
On the practical side, we investigate the recognition and acquisition of relations from texts. In a look at supervised learning methods, we present available datasets, the variety of features which can describe relation instances, and learning algorithms found appropriate for the task. Next, we present weakly supervised and unsupervised learning methods of acquiring relations from large corpora with little or no previously annotated data. We show how enduring the bootstrapping algorithm based on seed examples or patterns has proved to be, and how it has been adapted to tackle Web-scale text collections. We also show a few machine learning techniques which can perform fast and reliable relation extraction by taking advantage of data redundancy and variability.
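The bootstrapping algorithm mentioned above can be sketched roughly as follows: starting from a few seed pairs for a relation, the system induces textual patterns that connect the pairs in a corpus, then applies those patterns to harvest new pairs, and repeats. The corpus, seeds, and naive pattern matching below are toy stand-ins chosen for illustration, not the actual systems covered in the tutorial.

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Seed-based bootstrapping: alternate between learning
    patterns from known pairs and extracting new pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        # Learn patterns: the text between the members of a known pair.
        patterns = set()
        for x, y in pairs:
            for sentence in corpus:
                m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if m:
                    patterns.add(m.group(1))
        # Apply patterns to harvest new candidate pairs.
        for p in patterns:
            for sentence in corpus:
                m = re.search(r"(\w+)" + re.escape(p) + r"(\w+)", sentence)
                if m:
                    pairs.add((m.group(1), m.group(2)))
    return pairs

corpus = [
    "Paris is the capital of France",
    "Rome is the capital of Italy",
    "Tokyo is the capital of Japan",
]
seeds = {("Paris", "France")}
print(bootstrap(corpus, seeds))
```

From the single seed (Paris, France), the pattern "is the capital of" is learned and then yields (Rome, Italy) and (Tokyo, Japan). Real Web-scale systems add pattern scoring and filtering to keep the harvested pairs from drifting away from the target relation.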
- Preslav Nakov, Senior Scientist, Qatar Computing Research Institute
Preslav Nakov, a Senior Scientist at the Qatar Computing Research Institute, part of Qatar Foundation, holds a Ph.D. from the University of California at Berkeley. His research interests include computational linguistics and NLP, machine translation, lexical semantics, Web as a corpus and biomedical text processing.
- Vivi Nastase, Researcher, Fondazione Bruno Kessler
Vivi Nastase is a researcher at the Fondazione Bruno Kessler in Trento, working mainly on lexical semantics, semantic relations, knowledge acquisition and language evolution. She holds a Ph.D. from the University of Ottawa, Canada.
- Diarmuid Ó Séaghdha, Senior Researcher, VocalIQ
Diarmuid Ó Séaghdha, Senior NLP Researcher at VocalIQ and Visiting Industrial Fellow at the University of Cambridge, holds a Ph.D. from the University of Cambridge. His research interests include discourse and dialog, lexical and relational semantics, machine learning for NLP, scientific text mining and social media analysis.
- Stan Szpakowicz, Emeritus Professor, University of Ottawa
Stan Szpakowicz, an emeritus professor of Computer Science at the University of Ottawa, holds a Ph.D. from the University of Warsaw and a D.Sc. from the Polish Academy of Sciences. He has been active in NLP since 1969. His recent interests include lexical resources, semantic relations and emotion analysis.
Applications of Social Media Text Analysis
Atefeh Farzindar and Diana Inkpen
September 18, 2015 - Morning - Room 2
Tutorial materials: PDF with slides
Analyzing social media texts is a complex problem that is difficult to address using traditional Natural Language Processing (NLP) methods. Our tutorial focuses on presenting new methods for NLP tasks and applications that work on noisy and informal texts, such as those from social media.
Automatic processing of large collections of social media texts is important because they contain a great deal of useful information, owing to the increasing popularity of all types of social media. Use of social media and messaging apps grew 203 percent year-on-year in 2013, with overall app use rising 115 percent over the same period, as reported by Statista, citing data from Flurry Analytics. This growth means that 1.61 billion people are now active in social media around the world, a figure expected to reach 2 billion users in 2016, led by India. The research shows that consumers now spend 5.6 hours daily on digital media, including social media and mobile internet usage.
At the heart of this interest is the ability of users to create and share content via a variety of platforms such as blogs, micro-blogs, collaborative wikis, multimedia sharing sites, and social networking sites. The unprecedented volume and variety of user-generated content, as well as the user interaction network, constitute new opportunities for understanding social behavior and building socially intelligent systems. It is therefore important to investigate methods for knowledge extraction from social media data. Furthermore, we can use this information to detect and retrieve more related content about events, such as photos and video clips that have caption texts.
- Atefeh Farzindar, Adjunct Professor, University of Montreal
Dr. Atefeh Farzindar is the CEO of NLP Technologies Inc. and Adjunct Professor at University of Montreal. She has served as Chair of the technology sector of the Language Industry Association Canada (AILIA) (2009-2013), vice president of The Language Technologies Research Centre (LTRC) of Canada (2012-2014) and a member of the Natural Sciences and Engineering Research Council of Canada (NSERC) Computer Science Liaison Committee (2014-2015). Recently, she authored the book chapter "Social Network Integration in Document Summarization" in Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding (IGI Global, January 2014).
- Diana Inkpen, Professor, University of Ottawa
Dr. Diana Inkpen is a Professor in the School of Electrical Engineering and Computer Science at the University of Ottawa. Her research interests and expertise are in natural language processing, in particular lexical semantics as applied to near-synonyms and nuances of meaning, word and text similarity, classification of texts by emotion and mood, information retrieval from spontaneous speech, extraction of semantic frames, and lexical choice in natural language generation. She has published more than 25 journal papers, 85 conference papers, and 6 book chapters. She is an associate editor of the Computational Intelligence and Natural Language Engineering journals.
Robust Semantic Analysis of Multiword Expressions with FrameNet
Miriam R. L. Petruck and Valia Kordoni
September 18, 2015 - Afternoon - Room 2
Tutorial materials: PDF with slides (part 1), PDF with slides (part 2)
This tutorial will give participants a solid understanding of the linguistic features of multiword expressions (MWEs), focusing on the semantics of such expressions and their importance for natural language processing and language technology, with particular attention to the way that FrameNet (framenet.icsi.berkeley.edu) handles this widespread phenomenon. Our target audience includes researchers and practitioners of language technology, not necessarily experts in MWEs or knowledgeable about FrameNet, who are interested in NLP tasks that involve or could benefit from considering MWEs as a pervasive phenomenon in human language and communication.
NLP research has been interested in automatic processing of multiword expressions, with reports on and tasks relating to such efforts presented at workshops and conferences for at least ten years (e.g. ACL 2003, LREC 2008, COLING 2010, EACL 2014). Overcoming the challenge of automatically processing MWEs remains elusive in part because of the difficulty in recognizing, acquiring, and interpreting such forms.
Indeed, the phenomenon manifests in a range of linguistic forms (as Sag et al. (2001), among many others, have documented), including:
- noun + noun compounds (e.g. fish knife, health hazard);
- adjective + noun compounds (e.g. political agenda, national interest);
- particle verbs (e.g. shut up, take out);
- prepositional verbs (e.g. look into, talk into);
- VP idioms, such as kick the bucket and pull someone’s leg, along with less obviously idiomatic forms like answer the door and mention someone’s name;
- expressions that have their own mini-grammars, such as names with honorifics and terms of address (e.g. Rabbi Lord Jonathan Sacks), kinship terms (e.g. second cousin once removed), and time expressions (e.g. January 9, 2015);
- support verb constructions (e.g. with verbs: take a bath, make a promise; and with prepositions: in doubt, under review).
Linguists address issues of polysemy, compositionality, idiomaticity, and continuity for each type included here.
While native speakers use these forms with ease, the treatment and interpretation of MWEs in computational systems require considerable effort due to the very issues that concern linguists.
- Miriam R. L. Petruck, Research Linguist, International Computer Science Institute
Miriam R. L. Petruck received her PhD in Linguistics from the University of California, Berkeley. A key member of the team developing FrameNet almost since the project’s founding, her research interests include semantics, knowledge base development, grammar and lexis, lexical semantics, Frame Semantics and Construction Grammar.
- Valia Kordoni, Research Professor, Humboldt University
Valia Kordoni received her PhD in Computational Linguistics from the University of Essex, UK. She joined the Department of English Studies, Humboldt University Berlin in 2012, where she is Research Professor of Linguistics. Her main research interests are in deep linguistic processing, semantic analysis, and multiword expressions.
Computational Analysis of Affect and Emotion in Language
Saif Mohammad and Cecilia Ovesdotter Alm
September 18, 2015 - Afternoon - Small Auditorium
Tutorial materials: PDF with slides
Computational linguistics has witnessed a surge of interest in approaches to emotion and affect analysis, tackling problems that extend beyond sentiment analysis in depth and complexity. This area involves basic emotions (such as joy, sadness, and fear) as well as any of the hundreds of other emotions humans are capable of (such as optimism, frustration, and guilt), expanding into affective conditions, experiences, and activities. Leveraging linguistic data for computational affect and emotion inference enables opportunities to address a range of affect-related tasks, problems, and non-invasive applications that capture aspects essential to the human condition and individuals’ cognitive processes. These efforts enable and facilitate human-centered computing experiences, as demonstrated by applications across clinical, socio-political, artistic, educational, and commercial domains. Efforts to computationally detect, characterize, and generate emotions or affect-related phenomena respond equally to technological needs for personalized, micro-level analytics and broad-coverage, macro-level inference, and they have involved both small and massive amounts of data.
While this is an exciting area with numerous opportunities for members of the ACL community, a major obstacle is its intersection with other investigatory traditions, necessitating knowledge transfer. This tutorial comprehensively integrates relevant concepts and frameworks from linguistics, cognitive science, affective computing, and computational linguistics in order to equip researchers and practitioners with the adequate background and knowledge to work effectively on problems and tasks either directly involving, or benefiting from having an understanding of, affect and emotion analysis.
There is a substantial body of work in traditional sentiment analysis focusing on positive and negative sentiment. This tutorial covers approaches and features that migrate well to affect analysis. We also discuss key differences from sentiment analysis, and their implications for analyzing affect and emotion.
The tutorial begins with an introduction that highlights opportunities, key terminology, and interesting tasks and challenges (1). The body of the tutorial covers characteristics of emotive language use with emphasis on relevance for computational analysis (2); linguistic data—from conceptual analysis frameworks via useful existing resources to important annotation topics (3); computational approaches for lexical semantic emotion analysis (4); computational approaches for emotion and affect analysis in text (5); visualization methods (6); and a survey of application areas with affect-related problems (7). The tutorial concludes with an outline of future directions and a discussion with participants about the areas relevant to their respective tasks of interest (8).
In addition to attending the tutorial, participants receive electronic copies of the tutorial slides, a complete reference list, and a categorized, annotated bibliography that concentrates on seminal works, recent important publications, and other products and resources for researchers and developers.
- Saif M. Mohammad, Senior Research Officer, National Research Council Canada
Saif Mohammad has research interests in computational linguistics and natural language processing, especially lexical semantics and affect analysis. He develops computational models for sentiment analysis, emotion detection, semantic distance, and lexical-semantic relations such as word-pair antonymy. His team has developed a sentiment analysis system which ranked first in SemEval shared tasks on the sentiment analysis of tweets and on aspect-based sentiment analysis. His word-emotion association resource, the NRC Emotion Lexicon, is widely used for text analysis and information visualization. His recent work on generating music from emotions in text garnered widespread media attention, including articles in Time, LiveScience, io9, The Physics arXiv Blog, PC World, and Popular Science.
- Cecilia Ovesdotter Alm, Assistant Professor, Rochester Institute of Technology
Cecilia Ovesdotter Alm is a computational linguist dedicated to advancing the understanding of affective and subjective meaning across linguistic modalities and multimodal data. Her work focuses on linguistic annotation and resource development for affect-related problems, as well as computational modeling involving text and speech, image understanding, and linguistic or multimodal sensing in this area. She has published Affect in Text and Speech (2009) as well as articles in proceedings and journals, representing over a decade of related research.