

Query: "machine translation" or SMT
1-20 of 5605
Chained Machine Translation Using Morphemes as Pivot Language
Abstract: As the smallest meaning-bearing elements of languages with rich morphology, morphemes are often integrated into state-of-the-art statistical machine translation to improve translation quality. This paper proposes a novel approach that uses morphemes as the pivot language in a chained machine translation system. A machine-translation-based method is used to find the mapping relations between morphemes and words. Experiments show the effectiveness of the approach, which achieves an 18.6 percent increase in BLEU score over the baseline phrase-based machine translation system.
Wen Li, Lei Chen, Miao Li
A Semi-supervised Approach to Bengali-English Phrase-Based Statistical Machine Translation
Abstract: Large amounts of bilingual data and monolingual data in the target language are usually used to train statistical machine translation systems. In this paper we propose several semi-supervised techniques within a Bengali-English Phrase-based Statistical Machine Translation (SMT) System in order to improve translation quality. We conduct experiments on a Bengali-English dataset and our initial experimental results show improvement in translation quality.
Maxim Roy
Han-Bin Chen, Hen-Hsen Huang, Hsin-Hsi Chen, and Ching-Ting Tan
A Three-Layer Architecture for Automatic Post-Editing System Using Rule-Based Paradigm
Abstract: This paper proposes a post-editing model in which our three-level rule-based automatic post-editing engine, Grafix, refines the output of machine translation systems. The corrections applied to sentences range from lexical transformation to complex syntactic rearrangement. Experimental results from both manual and automatic evaluations show that the proposed system is able to improve the quality of our state-of-the-art English-Persian SMT system.
Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, Mehdi Mohammadi
Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora
Abstract: This paper reports the fully automatic compilation of parallel corpora for Brazilian Portuguese. Scientific news texts available in Brazilian Portuguese, English and Spanish are automatically crawled from a multilingual Brazilian magazine. The texts are then automatically aligned at document- and sentence-level. The resulting corpora contain about 2,700 parallel documents totaling over 150,000 aligned sentences each. The quality of the corpora and their usefulness are tested in an experiment with machine translation.
Wilker Aziz, Lucia Specia
Hybrid SRL with Optimization Modulo Theories
Abstract: Generally speaking, the goal of constructive learning could be seen as, given an example set of structured objects, to generate novel objects with similar properties. From a statistical-relational learning (SRL) viewpoint, the task can be interpreted as a constraint satisfaction problem, i.e. the generated objects must obey a set of soft constraints, whose weights are estimated from the data. Traditional SRL approaches rely on (finite) First-Order Logic (FOL) as a description language, and on MAX-SAT solvers to perform inference. Alas, FOL is unsuited for constructive problems where the objects contain a mixture of Boolean and numerical variables. It is in fact difficult to implement, e.g., linear arithmetic constraints within the language of FOL. In this paper we propose a novel class of hybrid SRL methods that rely on Satisfiability Modulo Theories, an alternative class of formal languages that allow one to describe, and reason over, mixed Boolean-numerical objects and constraints. The resulting methods, which we call Learning Modulo Theories, are formulated within the structured output SVM framework, and employ a weighted SMT solver as an optimization oracle to perform efficient inference and discriminative max margin weight learning. We also present a few examples of constructive learning applications enabled by our method.
Stefano Teso, Roberto Sebastiani, Andrea Passerini
An SMT-driven Authoring Tool
Abstract: (no abstract)
Sriram Venkatapathy, Shachar Mirkin
The Trouble with SMT Consistency
Abstract: SMT typically models translation at the sentence level, ignoring wider document context. Does this hurt the consistency of translated documents? Using a phrase-based SMT system in various data conditions, we show that SMT translates documents remarkably consistently, even without document knowledge. Nevertheless, translation inconsistencies often indicate translation errors. However, unlike in human translation, these errors are rarely due to terminology inconsistency. They are more often symptoms of deeper issues with SMT models instead.
Marine Carpuat, Michel Simard
Integrating a Rule-based with a Hierarchical Translation System
Abstract: Recent developments on hybrid systems that combine rule-based machine translation (RBMT) systems with statistical machine translation (SMT) generally neglect the fact that RBMT systems tend to produce more syntactically well-formed translations than data-driven systems. This paper proposes a method that alleviates this issue by preserving more useful structures produced by RBMT systems and utilizing them in an SMT system that operates on hierarchical structures instead of flat phrases alone. For our experiments, we use Joshua as the decoder (Li et al., 2009). It is the first attempt towards a tighter integration of MT systems from different paradigms that both support hierarchical analyses. Preliminary results show consistent improvements over the previous approach.
Yu Chen, Andreas Eisele
Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, Mehdi Mohammadi
Evaluating Machine Translation Utility via Semantic Role Labels
Abstract: We present the methodology that underlies new metrics for semantic machine translation evaluation that we are developing. Unlike widely-used lexical and n-gram based MT evaluation metrics, the aim of semantic MT evaluation is to measure the utility of translations. We discuss the design of empirical studies to evaluate the utility of machine translation output by assessing the accuracy for key semantic roles. Such roles can be annotated using Propbank-style PRED and ARG labels. Recent work by Wu and Fung (2009) introduced methods based on automatic semantic role labeling into statistical machine translation, to enhance the quality of MT output. However, semantic SMT approaches have so far still only been evaluated using lexical and n-gram based SMT evaluation metrics such as BLEU, which are not aimed at evaluating the utility of MT output. Direct data analysis is still needed to understand how semantic models can be leveraged to evaluate the utility of MT output. In this paper, we discuss a new methodology for evaluating the utility of the machine translation output, by assessing the accuracy with which human readers are able to match the Propbank annotation frames.
Chi-kiu L Dekai W
Joint Learning of a Dual SMT System for Paraphrase Generation
Abstract: SMT has been used in paraphrase generation by translating a source sentence into another (pivot) language and then back into the source. The resulting sentences can be used as candidate paraphrases of the source sentence. Existing work that uses two independently trained SMT systems cannot directly optimize the paraphrase results. Paraphrase criteria, especially the paraphrase rate, cannot be ensured in that way. In this paper, we propose a joint learning method of two SMT systems to optimize the process of paraphrase generation. In addition, a revised BLEU score (called iBLEU), which measures the adequacy and diversity of the generated paraphrase sentence, is proposed for tuning parameters in SMT systems. Our experiments on NIST 2008 testing data with automatic evaluation as well as human judgments suggest that the proposed method is able to enhance the paraphrase quality by adjusting between semantic equivalency and surface dissimilarity.
Hong Sun, Ming Zhou
Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata, Yuji Matsumoto
Unsupervised Search for The Optimal Segmentation for Statistical Machine Translation
Abstract: We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in an SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.
Coskun Mermer
Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation
Abstract: We present the first known empirical test of an increasingly common speculative claim, by evaluating a representative Chinese-to-English SMT model directly on word sense disambiguation performance, using standard WSD evaluation methodology and datasets from the Senseval-3 Chinese lexical sample task. Much effort has been put into designing and evaluating dedicated word sense disambiguation (WSD) models, in particular with the Senseval series of workshops. At the same time, the recent improvements in the BLEU scores of statistical machine translation (SMT) suggest that SMT models are good at predicting the right translation of the words in source language sentences. Surprisingly, however, the WSD accuracy of SMT models has never been evaluated and compared with that of the dedicated WSD models. We present controlled experiments showing the WSD accuracy of current typical SMT models to be significantly lower than that of all the dedicated WSD models considered. This tends to support the view that despite recent speculative claims to the contrary, current SMT models do have limitations in comparison with dedicated WSD models, and that SMT should benefit from the better predictions made by the WSD models.
The authors would like to thank the Hong Kong Research Grants Council (RGC) for supporting this research in part through grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09.
Human Language Technology Center Clear Water Bay, Hong Kong
Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment
Abstract: With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed an SMT system with them to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user with markup highlighting the segments of the translation that need to be checked manually for correctness.
Ventsislav Zhechev, Josef van Genabith
NTT-NAIST SMT Systems for IWSLT 2013
Abstract: This paper presents NTT-NAIST SMT systems for English-German and German-English MT tasks of the IWSLT 2013 evaluation campaign. The systems are based on generalized minimum Bayes risk system combination of three SMT systems: forest-to-string, hierarchical phrase-based, phrase-based with pre-ordering. Individual SMT systems include data selection for domain adaptation, rescoring using recurrent neural net language models, interpolated language models, and compound word splitting (only for German-English).
Katsuhito Sudoh, Graham Neubig, Kevin Duh, Hajime Tsukada
LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation
Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation in developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.
Andrejs Vasiļjevs, Raivis Skadiņš, Jörg Tiedemann
Feature Decay Algorithms for Fast Deployment of Accurate Statistical Machine Translation Systems
Abstract: We use feature decay algorithms (FDA) for fast deployment of accurate statistical machine translation systems taking only about half a day for each translation direction. We develop parallel FDA for solving computational scalability problems caused by the abundance of training data for SMT models and LM models and still achieve SMT performance that is on par with using all of the training data or better. Parallel FDA runs separate FDA models on randomized subsets of the training data and combines the instance selections later. Parallel FDA can also be used for selecting the LM corpus based on the training set selected by parallel FDA. The high quality of the selected training data allows us to obtain very accurate translation outputs close to the top performing SMT systems. The relevancy of the selected LM corpus yields up to an 86% reduction in the number of OOV tokens and up to a 74% reduction in perplexity. We perform SMT experiments in all language pairs in the WMT13 translation task and obtain SMT performance close to the top systems using significantly fewer resources for training and development.
Ergun Bicici

