

Query: "topic model" or "topic models" or "latent dirichlet allocation"
Results 1-20 of 1490
Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment
Abstract: The use of topic models to analyze domain-specific texts often requires manual validation of the latent topics to ensure that they are meaningful. We introduce a framework to support such large-scale assessment of topical relevance. We measure the correspondence between a set of latent topics and a set of reference concepts to quantify four types of topical misalignment: junk, fused, missing, and repeated topics. Our analysis compares 10,000 topic model variants to 200 expert-provided domain concepts, and demonstrates how our framework can inform choices of model parameters, inference algorithms, and intrinsic measures of topical quality.
Jason Chuang, Sonal Gupta, Christopher D. Manning, Jeffrey Heer
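The four misalignment types reduce to row and column counts on a thresholded topic-to-concept correspondence matrix. A minimal sketch in Python, assuming cosine similarity and a fixed threshold (both illustrative stand-ins for the paper's correspondence measure):

    import numpy as np

    def misalignment_counts(topics, concepts, threshold=0.3):
        """Classify topical misalignment from a similarity matrix.

        topics:   (T, V) array of topic-word probabilities
        concepts: (C, V) array of reference-concept word weights
        Cosine similarity and the 0.3 threshold are illustrative
        choices, not the correspondence measure used in the paper.
        """
        t = topics / np.linalg.norm(topics, axis=1, keepdims=True)
        c = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
        sim = t @ c.T                      # (T, C) topic-concept similarity
        match = sim >= threshold           # binary correspondence

        junk     = int(np.sum(match.sum(axis=1) == 0))  # topic matches no concept
        fused    = int(np.sum(match.sum(axis=1) >= 2))  # topic matches several concepts
        missing  = int(np.sum(match.sum(axis=0) == 0))  # concept matched by no topic
        repeated = int(np.sum(match.sum(axis=0) >= 2))  # concept matched by several topics
        return {"junk": junk, "fused": fused,
                "missing": missing, "repeated": repeated}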
Online Latent Dirichlet Allocation with Infinite Vocabulary
Abstract: Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This is reasonable in batch settings but not in streaming and online settings. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings, rather than from a finite Dirichlet. We develop inference using online variational inference and, to consider only a finite number of words for each topic, propose heuristics to dynamically order, expand, and contract the set of words in our vocabulary. We show that our model can successfully incorporate new words and that it outperforms topic models with finite vocabularies in evaluations of topic quality and classification performance.
Ke Zhai, Jordan Boyd-Graber
A Hybrid Neural Network-Latent Topic Model
Abstract: This paper introduces a hybrid model that combines a neural network with a latent topic model. The neural network provides a low-dimensional embedding for the input data, whose subsequent distribution is captured by the topic model. The neural network thus acts as a trainable feature extractor, while the topic model captures the group structure of the data. Following an initial pre-training phase to separately initialize each part of the model, a unified training scheme is introduced that allows for discriminative training of the entire model. The approach is evaluated on visual data in a scene classification task, where the hybrid model is shown to outperform models based solely on neural networks or topic models, as well as other baseline methods.
Li Wan, Leo Zhu, Rob Fergus
Rakuten Institute of Technology, New York, 215 Park Avenue South, New York, NY 10003, USA
Abstract: At present, online shopping is typically a search-oriented activity in which a user gains access to the products that best match their query. We instead propose a surf-oriented online shopping paradigm that links associated products, allowing users to "wander around" the online store and enjoy browsing a variety of items. As an initial step toward creating this experience, we constructed a prototype online shopping interface that combines product ontology information with topic model results to let users explore items from the food and kitchen domain. As a novel application of topic models, we also discuss possible approaches to selecting the best product categories to illustrate the hidden topics discovered for our product domain.
(no authors)
Kernel Topic Models
Abstract: Latent Dirichlet allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept in which the documents' mixture-weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM) that model temporal, spatial, hierarchical, social, and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.
Philipp Hennig, David Stern, Ralf Herbrich, Thore Graepel
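The central construction, replacing Dirichlet beliefs over mixture weights with squashed Gaussians whose covariance comes from a kernel over document features, can be sketched generatively. The RBF kernel and softmax squashing below are illustrative assumptions; the paper performs approximate inference via a Laplace approximation in a transformed basis rather than exact sampling:

    import numpy as np

    def sample_ktm_weights(doc_features, n_topics, lengthscale=1.0, seed=0):
        """Sample document-topic weights from a kernel topic model prior.

        doc_features: (D, F) array of per-document covariates (time,
        space, ...). Each topic dimension gets a Gaussian process draw
        over documents; a softmax squashes the Gaussian values into
        mixture weights on the simplex.
        """
        rng = np.random.default_rng(seed)
        d2 = ((doc_features[:, None, :] - doc_features[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * d2 / lengthscale**2)       # RBF kernel (illustrative)
        K += 1e-6 * np.eye(len(doc_features))        # jitter for stability
        L = np.linalg.cholesky(K)
        f = L @ rng.standard_normal((len(doc_features), n_topics))  # GP draws
        e = np.exp(f - f.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)      # squashed mixture weights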
Random Walk Features for Network-aware Topic Models
Abstract: Topic models such as latent Dirichlet allocation (LDA) have been successfully applied as data analysis and dimensionality reduction tools. With the emergence of social networks, many datasets are available in the form of a network with typed nodes (documents, authors, URLs, publication dates, ...) and edges (authorship, citation, friendship, ...). We propose a network-aware topic model that integrates rich, heterogeneous, network-based information, representing it using path-typed random walks. In more detail, the proposed model is based on Dirichlet-multinomial regression, an extension of LDA, as well as on random walks for exploiting network information; each document node is characterized by its connectivity to other nodes in the graph through a given set of random walks. A set of sparse latent parameters relates this characterization to topic assignments. Being sparse, the latent parameters give insight into the effect of different network features on the extracted topics.
Ahmed Hefny, Geoffrey Gordon, Katia Sycara
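A path-typed random walk feature can be read as a product of row-normalized adjacency matrices, one per edge type along the path. A small sketch, with the edge types as hypothetical placeholders; the model's DMR-style coupling of these features to topic priors is not shown:

    import numpy as np

    def row_normalize(A):
        s = A.sum(axis=1, keepdims=True)
        return np.divide(A, s, out=np.zeros_like(A, dtype=float), where=s > 0)

    def walk_features(adjacency_by_type, path):
        """Probability of reaching each node via a typed random walk.

        adjacency_by_type: dict mapping an edge type (e.g. 'authorship',
        'citation' -- placeholder names) to a binary adjacency matrix.
        path: a sequence of edge types defining one typed walk.
        """
        P = None
        for edge_type in path:
            step = row_normalize(adjacency_by_type[edge_type])
            P = step if P is None else P @ step
        return P   # row d = distribution over endpoints of the walk from node d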
Constructing a Class-Based Lexical Dictionary using Interactive Topic Models
Abstract: This paper proposes a new method for constructing arbitrary class-based related-word dictionaries with interactive topic models; we assume that each class is described by a topic. We propose a new semi-supervised method that uses the simplest topic model, yielded by the standard EM algorithm; model calculation is very rapid. Furthermore, our approach allows a dictionary to be modified interactively, and the final dictionary has a hierarchical structure. This paper makes three contributions. First, it proposes a word-based semi-supervised topic model. Second, we apply the semi-supervised topic model to interactive learning; this approach is called the Interactive Topic Model. Third, we propose a score function that extracts the related words occupying the middle layer of the hierarchical structure. Experiments show that our method can appropriately retrieve words belonging to an arbitrary class.
Keywords: Interactive Topic Models, Interactive Unigram Mixtures, Lexical dictionary
Kugatsu Sadamitsu, Kuniko Saito, Kenji Imamura and Yoshihiro Matsuo
(no title)
Uri Shalit, Daphna Weinshall, Gal Chechik
A Study of Language Modeling for Chinese Spelling Check
Abstract: Chinese spelling check (CSC) remains an open problem today. Language modeling is widely used in CSC because of its simplicity and fair predictive power, but most systems use only conventional n-gram models. Our work in this paper continues this general line of research by further exploring different ways to glean extra semantic clues and Web resources to enhance CSC performance in an unsupervised fashion. Empirical results demonstrate the utility of our CSC system.
Kuan-Yu Chen, Hung-Shin Lee, Chung-Han Lee, Hsin-Min Wang, Hsin-Hsi Chen
Identifying Comparable Corpora Using LDA
Abstract: Parallel corpora have applications in many areas of natural language processing but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and a topic detection algorithm to generate pairs of comparable texts without requiring a parallel-corpus training phase. We evaluate the system's performance first on data from the online newspaper domain, and second on Wikipedia cross-language links.
Judita Preiss
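One plausible instantiation of the pairing step; the additive scoring of named-entity overlap plus topic-distribution similarity below is our assumption, not necessarily the paper's exact criterion:

    import numpy as np

    def pair_comparable(entities_a, entities_b, topics_a, topics_b):
        """Pair documents across two languages by named-entity overlap
        and topic-distribution similarity.

        entities_*: list of sets of named entities per document (named
        entities often transfer across languages with little change).
        topics_*: (N, K) arrays of per-document topic distributions in
        a shared topic space. The additive score is an illustrative
        assumption.
        """
        pairs = []
        for i, (ents, theta) in enumerate(zip(entities_a, topics_a)):
            def score(j):
                ne = len(ents & entities_b[j]) / (len(ents | entities_b[j]) or 1)
                cos = float(theta @ topics_b[j] /
                            (np.linalg.norm(theta) * np.linalg.norm(topics_b[j])))
                return ne + cos
            j = max(range(len(entities_b)), key=score)
            pairs.append((i, j, score(j)))   # best cross-language match per doc
        return pairs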
A Biterm Topic Model for Short Texts
Abstract: Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g., LDA and PLSA) to such short texts may not work well. The fundamental reason is that conventional topic models implicitly capture document-level word co-occurrence patterns to reveal topics, and thus suffer from the severe data sparsity of short documents. In this paper, we propose a novel way of modeling topics in short texts, referred to as the biterm topic model (BTM). Specifically, in BTM we learn the topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the whole corpus. The major advantages of BTM are that (1) BTM explicitly models word co-occurrence patterns to enhance topic learning, and (2) BTM uses the aggregated patterns in the whole corpus to learn topics, solving the problem of sparse word co-occurrence patterns at the document level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more prominent and coherent topics, and significantly outperforms baseline methods on several evaluation metrics. Furthermore, we find that BTM can outperform LDA even on normal texts, showing the potential generality and wider applicability of the new topic model.
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng
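The model's basic unit, the biterm, is an unordered pair of words co-occurring in a short context; for short texts, every pair of tokens within a document qualifies. A minimal extraction sketch (the helper name is ours, not the authors'):

    from itertools import combinations

    def extract_biterms(docs):
        """Extract unordered word pairs (biterms) from tokenized short texts.

        docs: list of token lists. BTM then models the generation of this
        corpus-wide biterm set instead of individual documents, which is
        how it sidesteps document-level sparsity.
        """
        biterms = []
        for tokens in docs:
            for w1, w2 in combinations(tokens, 2):
                biterms.append(tuple(sorted((w1, w2))))
        return biterms

    # e.g. extract_biterms([["apple", "pie", "recipe"]])
    # -> [('apple', 'pie'), ('apple', 'recipe'), ('pie', 'recipe')]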
Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?
Abstract: Current topic modeling approaches for information retrieval do not allow query-oriented latent topics to be modeled explicitly. Moreover, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns latent Dirichlet allocation topic models on the top-ranked pseudo-relevant feedback documents, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performance tends to be better when using topics with higher semantic coherence.
Romain Deveaud, Eric SanJuan, Patrice Bellot
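A minimal sketch of the feedback loop, assuming gensim for the LDA and coherence steps; the u_mass coherence, parameter values, and helper name are illustrative stand-ins for the paper's exact choices:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    def coherent_feedback_terms(feedback_docs, num_topics=5, topn=10):
        """Learn LDA on top-ranked pseudo-relevant documents and keep
        the terms of the most semantically coherent topic.

        feedback_docs: list of token lists (the top-ranked documents
        retrieved for a query). All parameter values are illustrative.
        """
        dictionary = Dictionary(feedback_docs)
        corpus = [dictionary.doc2bow(d) for d in feedback_docs]
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                            coherence='u_mass')
        scores = cm.get_coherence_per_topic()
        best = max(range(num_topics), key=lambda k: scores[k])
        return [w for w, _ in lda.show_topic(best, topn=topn)]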
Integrate Multilingual Web Search Results using Cross-Lingual Topic Models
Abstract: With the thriving of the Internet, web users today have access to resources from around the world in more than 200 languages. How to effectively manage multilingual web search results has emerged as an essential problem. In this paper, we introduce ongoing work on leveraging a cross-lingual topic model (CLTM) to integrate multilingual search results. The CLTM detects the underlying topics of results in different languages and uses the topic distribution of each result to cluster the results into topic-based classes. In CLTM, we unify distributions at the topic level by direct translation, which distinguishes our approach from other multilingual topic models that mainly concern parallelism at the document or sentence level (Mimno et al., 2009; Ni et al., 2009). Experimental results suggest that our CLTM clustering method is effective and outperforms the six clustering approaches we compare against.
Duo Ding
Optimizing Semantic Coherence in Topic Models
Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; and (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum
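The metric introduced here is commonly written as C(t) = sum_{m=2..M} sum_{l<m} log((D(v_m, v_l) + 1) / D(v_l)), where D counts document and co-document frequencies over the training collection itself, so no external reference corpus or human annotation is needed. A direct implementation, assuming each top word occurs in at least one training document:

    import math

    def topic_coherence(top_words, docs):
        """Co-document-frequency coherence of one topic's top words.

        top_words: the topic's M highest-probability words, in order.
        docs: list of token lists (the training collection itself).
        Assumes every top word appears in >= 1 document, which holds
        for words drawn from the training data.
        """
        doc_sets = [set(d) for d in docs]
        df = {w: sum(w in s for s in doc_sets) for w in top_words}
        score = 0.0
        for m in range(1, len(top_words)):
            for l in range(m):
                codf = sum(top_words[m] in s and top_words[l] in s
                           for s in doc_sets)
                score += math.log((codf + 1) / df[top_words[l]])
        return score   # higher (closer to 0) means more coherent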
Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments
Abstract: Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree. In this paper, we explore this disagreement from the perspective of the learning process rather than the output. We present a novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations. We use these annotations as a novel approach for constructing a topic model, grounded in human interpretations of documents. We demonstrate that these topic models have features which distinguish them from traditional topic models.
Jonathan Chang
The Inverse Regression Topic Model
Abstract: Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.
Maxim Rabinovich, David M. Blei
A Topic Model for Melodic Sequences
Abstract: We examine the problem of learning a probabilistic model for melody directly from musical sequences belonging to the same genre. This is a challenging task, as one needs to capture not only the rich temporal structure evident in music but also the complex statistical dependencies among different music components. To address this problem, we introduce the Variable-gram Topic Model, which couples the latent topic formalism with a systematic model for contextual information. We evaluate the model on next-step prediction. Additionally, we present a novel method of model evaluation, in which we directly compare model samples with data sequences using the Maximum Mean Discrepancy of string kernels, to assess how close the model distribution is to the data distribution. We show that the model has the highest performance under both evaluation measures when compared to LDA, the Topic Bigram, and related non-topic models.
Athina Spiliopoulou, Amos Storkey
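The evaluation compares model samples with data sequences through Maximum Mean Discrepancy. A sketch of the (biased) MMD-squared estimate with a pluggable kernel; the toy character n-gram kernel is an illustrative stand-in for the string kernels used in the paper:

    import numpy as np

    def mmd2(xs, ys, kernel):
        """Biased estimate of squared Maximum Mean Discrepancy between
        two samples, given any positive-definite kernel on sequences."""
        kxx = np.mean([kernel(a, b) for a in xs for b in xs])
        kyy = np.mean([kernel(a, b) for a in ys for b in ys])
        kxy = np.mean([kernel(a, b) for a in xs for b in ys])
        return kxx + kyy - 2.0 * kxy

    def ngram_kernel(s, t, n=3):
        """Toy string kernel: count shared character n-grams (an
        illustrative stand-in, not the paper's kernel)."""
        grams = lambda u: {u[i:i + n] for i in range(len(u) - n + 1)}
        return float(len(grams(s) & grams(t)))

    # e.g. mmd2(model_samples, data_sequences, ngram_kernel) -> closer to 0
    # means the model distribution is harder to distinguish from the data.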
When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity (arXiv:1308.2853v1 [cs.LG], 13 Aug 2013)
Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher-order" expansion conditions on the topic-word matrix or the population structure of the model. These higher-order expansion conditions allow for overcomplete models and require the existence of a perfect matching from latent topics to higher-order observed words. We establish that random structured topic models are identifiable with high probability in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, so we can handle arbitrarily correlated topics in our framework. Our identifiability results imply the uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions but is more general than the CANDECOMP/PARAFAC (CP) decomposition.
Keywords: Overcomplete representations, topic models, generic identifiability, tensor decomposition
Animashree Anandkumar, Daniel Hsu, Majid Janzamin and Sham Kakade
A Practical Algorithm for Topic Modeling with Provable Guarantees
Abstract: Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for learning topic models that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.
Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu
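The published algorithm builds on anchor words: words that occur with appreciable probability under only one topic, found by a greedy farthest-point traversal of the row-normalized word co-occurrence matrix. A sketch in the spirit of that selection step, using the linear span rather than the paper's exact geometry, and omitting the topic-recovery step that follows:

    import numpy as np

    def select_anchors(Q, k):
        """Greedy anchor-word selection on a word co-occurrence matrix.

        Q: (W, W) word co-occurrence counts, one row per word; rows are
        normalized to distributions. At each step the word farthest from
        the span of the anchors chosen so far becomes the next anchor.
        A sketch only; the paper's version works in an affine geometry
        with random projections for speed.
        """
        Q = Q / Q.sum(axis=1, keepdims=True)
        basis, anchors = [], []
        i = int(np.argmax((Q ** 2).sum(axis=1)))   # first: largest-norm row
        for _ in range(k):
            anchors.append(i)
            v = Q[i].copy()
            for b in basis:                        # Gram-Schmidt step
                v -= (v @ b) * b
            basis.append(v / np.linalg.norm(v))
            # residual of every row after projecting out chosen anchors
            R = Q - sum(np.outer(Q @ b, b) for b in basis)
            i = int(np.argmax((R ** 2).sum(axis=1)))
        return anchors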
Integrating Document Clustering and Topic Modeling
Abstract: Document clustering and topic modeling are two closely related tasks that can mutually benefit each other. Topic modeling can project documents into a topic space, which facilitates effective document clustering. Cluster labels discovered by document clustering can be incorporated into topic models to extract local topics specific to each cluster and global topics shared by all clusters. In this paper, we propose a multi-grain clustering topic model (MGCTM) that integrates document clustering and topic modeling into a unified framework and jointly performs the two tasks to achieve the best overall performance. Our model tightly couples two components: a mixture component used for discovering latent groups in the document collection and a topic model component used for mining multi-grain topics, including local topics specific to each cluster and global topics shared across clusters. We employ variational inference to approximate the posterior of the hidden variables and to learn the model parameters. Experiments on two datasets demonstrate the effectiveness of our model.
Pengtao Xie, Eric P. Xing

