
Supervised Text Style Transfer For Turkish Texts

Zeynep YILMAZ – CRYPTTECH AI LAB.

https://github.com/zeynobia/stst

Özetçe—Diller zamanla değişime uğrar. Eski metinlerde kullanılan kelimeler günümüz metinlerinde kullanılmayabilir. Eski metinleri günümüz neslinin anlayabilmesi amacıyla, uzman kişiler tarafından metin sadeleştirmesi işleminin yapılması gerekmektedir.  Tarihi dizi ve filmlerdeki diyaloglarda ise, tam tersi bir durum söz konusudur. Uzman kişiler tarafından, diyaloglarda eski dönemlerde kullanılan kelimelerin kullanılması gerekmektedir. Ancak bu metot kaynak ve zaman açısından maliyetlidir. Bu probleme bir çözüm getirebilmek adına bu çalışma yapılmıştır. Çalışmada, problem kelime öbeği tabanlı istatiksel model olarak tanımlanmıştır.  İlk olarak paralel veri kümesi oluşturulmuştur. Kelimeleri hizalamak için IBM modeli kullanılmıştır. Dil modeli için de n gram dil modeli kullanılmıştır.  Önerilen sistemi değerlendirilmesi için, BLEU, Rouge, Meteor, Wer, Word2vec skorları kullanılarak, sonuç olarak,  87 BLEU puanı elde edilmiştir.

Anahtar Kelimeler — Text Stili Transferi, Gözetimli Makine Öğrenmesi, Yeniden Açıklama, Kelime Öbeği Çevirisi

Abstract: Languages change over time, and words used in old texts may no longer be used in today's texts. For today's generation to understand old texts, experts have to simplify them. The opposite situation arises in the dialogues of historical series and films, where experts have to use the vocabulary of earlier periods. This manual approach is costly in both resources and time. This study proposes a solution to the problem, which is defined as a phrase-based statistical model. First, a parallel data set is created. An IBM model is used for word alignment, and an n-gram model is used as the language model. The proposed system is evaluated with BLEU, ROUGE, METEOR, WER, and Word2vec scores, and it achieves a BLEU score of 87.

Key Words — Text Style Transfer, Supervised Machine Learning,  Paraphrasing,  Phrase-Based Translation.

I. INTRODUCTION

As societies develop, some words are used less and less, and some disappear completely [1]. Even texts from relatively recent times, such as the Republic period, contain many words and word groups that are no longer used today. Because the meaning of these words is not widely known, many readers cannot easily understand these texts. For this reason, linguists simplify old texts in full and present them as new texts.

In this study, old Turkish texts are converted to modern Turkish, and modern Turkish texts are converted to old Turkish. The aim is to change the style of a text while preserving its meaning.

Many important studies have addressed the text simplification problem. They use a range of methods, including lexical and syntactic methods, statistical Bayes models, artificial neural networks, and hybrid models [2]. Torunoglu-Selamet et al. [3] created syntactic and morphological rules for the simplification task. Zhu et al. [4] created the PWKP text simplification data set and trained a tree-based statistical Bayes model on it. Xu et al. [5] converted the writing style of texts written by Shakespeare into modern English. Jhamtani et al. [6] rewrote modern English texts in Shakespeare's style by using sequence-to-sequence neural network models.

This study is organized as follows. The second section explains the data set. The third section details the steps of the proposed method. The fourth section presents the experiments and their results. The fifth section gives the conclusion.

II. DATA SET

In this study, the 1938 edition of Nutuk, which has not been simplified, is accepted as old Turkish. The edition simplified by Bedi Yazıcı and published in 1995 is accepted as modern Turkish.

In addition to the Nutuk data set, a parallel data set containing common words was also created.

Large corpora were collected from Wikipedia, news, review texts, and old and modern books. Word, stem, bigram, and phrase statistics were extracted from these corpora. The parallel data were then created from the most frequent synonyms together with these statistics.

III. MODEL

In this study, a Bayes probabilistic model is used for the paraphrasing system. The model is given in equation 1, where p denotes the paraphrased text and s denotes the source text. P(p) is the probability assigned by the language model, and P(s|p) is the probability assigned by the phrase model.

$$\hat{p} = \arg\max_{p} P(p \mid s) = \arg\max_{p} P(s \mid p)\, P(p) \qquad (1)$$
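As an illustration of equation 1, the following minimal sketch picks the candidate with the highest product of phrase-model and language-model probability. The candidate words and all probability values are hypothetical and are not taken from the trained system.

```python
import math

def best_paraphrase(candidates):
    """Return the candidate p that maximizes P(s|p) * P(p).

    `candidates` maps each candidate paraphrase p to a pair
    (P(s|p), P(p)). Scores are summed in log space to avoid underflow.
    """
    def log_score(item):
        _, (phrase_prob, lm_prob) = item
        return math.log10(phrase_prob) + math.log10(lm_prob)

    return max(candidates.items(), key=log_score)[0]

# Hypothetical probabilities for the source word "istikbalde".
candidates = {
    "gelecekte": (0.6, 0.02),
    "atide": (0.4, 0.001),
}
print(best_paraphrase(candidates))
```

Working in log space gives the same ranking as multiplying the raw probabilities, and it matches how n-gram model files usually store their values.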

A. Preprocessing

In the preprocessing step, sentences are first split into words. Next, unwanted characters and punctuation marks are removed from the text. Finally, uppercase letters are converted to lowercase. Without this last step, the same word written with different capitalization is treated as two different words, which reduces the success rate.
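The preprocessing steps can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `preprocess` is an assumption, and the steps are applied in a different order that gives the same result. Turkish casing needs care in Python, because `str.lower()` does not map "I" to "ı" and turns "İ" into "i" followed by a combining dot.

```python
import re

# Handle the two Turkish-specific uppercase letters before calling lower().
_TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})

def preprocess(sentence):
    """Lowercase a Turkish sentence, drop punctuation, and split into words."""
    lowered = sentence.translate(_TURKISH_LOWER).lower()
    # Keep word characters and whitespace; replace everything else with a space.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return cleaned.split()

print(preprocess("İstikbalde dahi, seni bu hazineden mahrum etmek isteyecek!"))
```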

B. Word Alignment

In the word alignment step, translation relationships between the words in each sentence pair are calculated. In supervised text style transfer applications, high-quality word alignment contributes directly to the accuracy score. Brown et al. [7] introduced five IBM models, and many later studies have extended them. In this study, the fast_align library is used, which is a reparameterization of IBM Model 2 [8]. Because some words do not change during paraphrasing, the word alignment step produces very successful results.

Example: Word Alignment

Sentence: İstikbalde dahi, seni bu hazineden mahrum etmek isteyecek, dahili ve harici bedhahların olacaktır.

Paraphrasing: Gelecekte bile, seni bu hazineden yoksun bırakmak isteyecek, iç ve dış düşmanların olacaktır.

Table 1: Word Alignment Example Results

| Source | Target |
|---|---|
| istikbalde | gelecekte |
| dahi | bile |
| seni | seni |
| bu | bu |
| hazineden | hazineden |
| mahrum etmek | yoksun bırakmak |
| isteyecek | isteyecek |
| dahili | iç |
| ve | ve |
| harici | dış |
| bedhahların | düşmanların |
| olacaktır | olacaktır |
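fast_align reads one sentence pair per line in the form `source ||| target` and writes alignments as `i-j` index pairs. The sketch below prepares such a line and maps index pairs back to word pairs. The helper names are assumptions, and the alignment string is a hypothetical output for illustration, not a recorded run of the aligner.

```python
def to_fast_align_line(source_tokens, target_tokens):
    """Format one sentence pair as 'source ||| target'."""
    return " ".join(source_tokens) + " ||| " + " ".join(target_tokens)

def read_alignment(line, source_tokens, target_tokens):
    """Turn 'i-j' index pairs into (source word, target word) pairs."""
    pairs = []
    for link in line.split():
        i, j = map(int, link.split("-"))
        pairs.append((source_tokens[i], target_tokens[j]))
    return pairs

src = "istikbalde dahi seni bu hazineden".split()
tgt = "gelecekte bile seni bu hazineden".split()
print(to_fast_align_line(src, tgt))
# Hypothetical aligner output for this pair.
print(read_alignment("0-0 1-1 2-2 3-3 4-4", src, tgt))
```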

C. Language Model

The language model is used in paraphrasing applications to increase the fluency of the generated text. In this study, a probabilistic n-gram language model is used, and the KenLM implementation is preferred. KenLM uses modified Kneser-Ney smoothing, and it is faster and uses less memory than other n-gram language model implementations [9]. An n-gram language model predicts the next element of a sequence from the previous elements, and it also gives the probability of a whole sentence. N-gram language models are used in speech recognition, machine translation, and spell checking applications.

Simple N-gram Model

Example: Bigrams of a Simple Corpus

seni bu hazineden yoksun bırakmak isteyecek düşmanların olacaktır

seni bu hazineden mahrum etmek isteyecek bedhahların olacaktır

Table 2: Bigram Model for the Simple Corpus
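The bigram statistics of the two example sentences can be reproduced with a short sketch. It uses unsmoothed maximum-likelihood estimates, which is an assumption made for illustration; the system itself relies on a smoothed model.

```python
from collections import Counter

corpus = [
    "seni bu hazineden yoksun bırakmak isteyecek düşmanların olacaktır",
    "seni bu hazineden mahrum etmek isteyecek bedhahların olacaktır",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens[:-1])             # counts of each history word
    bigrams.update(zip(tokens, tokens[1:]))  # counts of adjacent word pairs

def bigram_prob(prev, word):
    """Unsmoothed estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("seni", "bu"))                # 1.0
print(bigram_prob("hazineden", "yoksun"))       # 0.5
print(bigram_prob("isteyecek", "düşmanların"))  # 0.5
```

A pair that never occurs, such as "düşmanların isteyecek", gets probability zero here, which is the problem that smoothing addresses.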

How to Calculate N-gram Probability

Smoothing techniques are used to avoid zero probabilities, so some probability is reserved for unknown words. The <unk> token indicates unknown words, <s> indicates the beginning of a sentence, and </s> indicates the end of a sentence. In the n-gram file, the actual probabilities are replaced by their logarithms, generally in base 10.

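-1.278754 düşmanların -0.2527253

Here -1.278754 is the log probability of the word “düşmanların”, and -0.2527253 is its backoff weight (BW), also stored as a logarithm.

The backoff weight is used to calculate the probability of an expression that does not occur in the corpus.

Example: the expression “düşmanların isteyecek” does not occur in the corpus. Its probability is calculated by backing off to the unigram probability of “isteyecek”. Because the stored values are logarithms, the product P(isteyecek) * BW(düşmanların) becomes a sum:

log10 P(isteyecek|düşmanların) = log10 P(isteyecek) + log10 BW(düşmanların)

= -0.9777236 - 0.2527253 = -1.2304489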
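The backoff calculation can be checked with a small sketch. The two dictionaries stand in for the unigram and bigram sections of an n-gram file. The backoff weight given to "isteyecek" is a placeholder, since the example does not list it.

```python
# word -> (log10 probability, log10 backoff weight)
unigram = {
    "düşmanların": (-1.278754, -0.2527253),
    "isteyecek": (-0.9777236, 0.0),  # placeholder backoff weight
}
bigram = {}  # "düşmanların isteyecek" was not observed in the corpus

def log10_bigram(prev, word):
    """Return log10 P(word | prev), backing off to the unigram if needed."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    log_prob, _ = unigram[word]
    _, backoff = unigram[prev]
    return log_prob + backoff

print(round(log10_bigram("düşmanların", "isteyecek"), 4))  # -1.2304
```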

D. Training

The data are divided into training and test sets for evaluation. In this step, a phrase table is created from the training data. The phrase table holds the translation probabilities between source and target words and word groups.

Example: Simple Phrase Table 

Sentence1: İstikbal göklerdedir

Paraphrase1: Gelecek göklerdedir

Sentence2: İstikbal ne manaya gelir

Paraphrase2: Ati ne anlama gelir

Table 3: Simple Phrase Table Example

| Source | Target | Probability |
|---|---|---|
| istikbal | gelecek | 0.50 |
| istikbal | ati | 0.50 |
| göklerdedir | göklerdedir | 1.0 |
| ne | ne | 1.0 |
| manaya | anlama | 1.0 |
| gelir | gelir | 1.0 |
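The probabilities in the phrase table example follow from relative frequencies over the aligned word pairs. The sketch below assumes the alignments of the two example sentences are already known. The function name `build_phrase_table` is hypothetical.

```python
from collections import Counter, defaultdict

# Word-aligned (source, target) pairs read off the two example sentences.
aligned_pairs = [
    ("istikbal", "gelecek"), ("göklerdedir", "göklerdedir"),
    ("istikbal", "ati"), ("ne", "ne"), ("manaya", "anlama"), ("gelir", "gelir"),
]

def build_phrase_table(pairs):
    """Estimate P(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return dict(table)

table = build_phrase_table(aligned_pairs)
print(table["istikbal"])  # {'gelecek': 0.5, 'ati': 0.5}
print(table["manaya"])    # {'anlama': 1.0}
```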

E. Fine Tuning

In this step, the parameters of the paraphrasing model are adjusted to increase accuracy. The adjustments can include adding new phrases to the phrase table or removing phrases from it, and determining the optimum weights between the phrase model and the language model. Stems can also be used instead of words.

F. Decoding

A given source sentence can have more than one possible rendering on the target side. In this step, the sentence is divided into words or phrases, and candidate sentences are created by paraphrasing these word groups. The candidate sentence with the highest score is returned as the result. During decoding, words can be deleted from the sentence, and one word on the source side can be expressed by more than one word on the target side.
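The idea of generating candidates and returning the best one can be sketched with a toy decoder. This is a deliberately simplified illustration and not the decoder used in the study. It works word by word, tries every combination instead of using a beam search, ignores deletions and multi-word phrases, and uses a hypothetical unigram language model. The `lm_weight` parameter stands in for the weight between the phrase model and the language model.

```python
import itertools
import math

phrase_table = {
    "istikbal": {"gelecek": 0.5, "ati": 0.5},
    "göklerdedir": {"göklerdedir": 1.0},
}
# Hypothetical unigram language model probabilities.
lm = {"gelecek": 0.02, "ati": 0.001, "göklerdedir": 0.005}

def decode(source_tokens, lm_weight=1.0):
    """Score every combination of per-word options and return the best one."""
    # Words missing from the phrase table are copied unchanged.
    options = [list(phrase_table.get(tok, {tok: 1.0}).items())
               for tok in source_tokens]
    best, best_score = None, -math.inf
    for combo in itertools.product(*options):
        score = sum(math.log10(p) for _, p in combo)
        score += lm_weight * sum(math.log10(lm.get(w, 1e-6)) for w, _ in combo)
        if score > best_score:
            best, best_score = [w for w, _ in combo], score
    return best

print(" ".join(decode(["istikbal", "göklerdedir"])))  # gelecek göklerdedir
```

Both targets for "istikbal" have the same phrase probability, so the language model decides between them.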

G. Evaluation

Many different methods have been developed to measure accuracy in text simplification tasks. A good system should produce simplifications that are highly similar to those made by humans.

BLEU

The BLEU metric was first proposed in 2002 to measure machine translation results [10]. It is language independent, and in the following years it became the standard for evaluating machine translation. BLEU compares the candidate translation with the reference text on the basis of n-grams. The standard BLEU formula is given in equation 2, where p_n is the modified n-gram precision, w_n is the weight of the n-gram precision, and BP is the brevity penalty, which penalizes candidates that are shorter than the reference.

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \qquad (2)$$

The output of the metric varies between 0 and 1, but it is often reported multiplied by 10 or 100. In this study, the BLEU score is multiplied by 100. For BLEU4, the weights of the 1-, 2-, 3-, and 4-gram precision values are 0.25 each. For BLEU3, the weights of the 1-, 2-, and 3-gram precision values are 0.333 each.
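A minimal sketch of sentence-level BLEU with a single reference and uniform weights is given below. It is an illustration under simplifying assumptions. It applies no smoothing, so any n-gram order without a match gives a score of 0, and it works on one sentence pair, whereas published scores are usually computed over a whole test set.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights 1/max_n, scaled to 0-100."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precision_sum += math.log(overlap / total) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * brevity_penalty * math.exp(log_precision_sum)

ref = "birinci görevin türk istiklalini muhafaza etmektir".split()
hyp = "birinci ödevin türk istiklalini muhafaza etmektir".split()
print(round(bleu(hyp, ref), 1))
```

A single substituted word breaks several higher-order n-grams at once, which is one reason BLEU can be harsh on short paraphrases.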

WER

The WER metric indicates the word error rate. It is derived from the Levenshtein distance, which was first introduced by Vladimir Levenshtein [11], and it is the most popular metric for speech recognition systems. Because paraphrasing takes place within a single language, the word order of the source and the paraphrase is the same, so WER can also be used to evaluate paraphrasing results.

S indicates the number of substitutions,

D indicates the number of deletions,

I indicates the number of insertions,

C indicates the number of correct words,

N indicates the number of words in the reference (N = S + D + C)

Ref:  Birinci görevin Türk istiklalini muhafaza etmektir

Hyp: Birinci ödevin Türk istiklalini muhafaza etmektir

Substitution: görevin -> ödevin (Count: 1)

Deletion: 0

Insertion: 0

Correct: 5 

In this study, the WordAccuracy score is multiplied by 100 and is presented.
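The counts in the example can be reproduced with a word-level Levenshtein alignment. The sketch assumes the usual definitions WER = (S + D + I) / N and word accuracy = 1 - WER. The function names are hypothetical.

```python
def edit_counts(reference, hypothesis):
    """Return (S, D, I, C) from a word-level Levenshtein alignment."""
    # table[i][j] = (edits, S, D, I, C) for reference[:i] vs hypothesis[:j]
    table = [[None] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(reference) + 1):
        table[i][0] = (i, 0, i, 0, 0)  # only deletions
    for j in range(1, len(hypothesis) + 1):
        table[0][j] = (j, 0, 0, j, 0)  # only insertions
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            e, s, d, ins, c = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, ins, c + 1)
            else:
                diagonal = (e + 1, s + 1, d, ins, c)
            e, s, d, ins, c = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return table[-1][-1][1:]

def wer(reference, hypothesis):
    """Word error rate (S + D + I) / N, with N the reference length."""
    s, d, i, _ = edit_counts(reference, hypothesis)
    return (s + d + i) / len(reference)

ref = "Birinci görevin Türk istiklalini muhafaza etmektir".split()
hyp = "Birinci ödevin Türk istiklalini muhafaza etmektir".split()
print(edit_counts(ref, hyp))                # (1, 0, 0, 5)
print(round(100 * (1 - wer(ref, hyp)), 1))  # 83.3
```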

ROUGE

The ROUGE metric is a popular metric for summarization systems, and it can also be used to evaluate paraphrasing systems. It was first introduced in [12]. ROUGE-N measures the overlap of n-grams between the reference and hypothesis texts; ROUGE-2, for example, measures the overlap of bigrams [12].
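The n-gram overlap behind ROUGE-N can be sketched as a recall score, the share of reference n-grams that also appear in the hypothesis. ROUGE tools often report precision and F-measure as well. Treating recall as the reported value is an assumption made here for brevity.

```python
from collections import Counter

def rouge_n(reference, hypothesis, n=2):
    """ROUGE-N recall: matched n-grams divided by n-grams in the reference."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, hyp = grams(reference), grams(hypothesis)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, hyp[g]) for g, count in ref.items())
    return matched / total

ref = "birinci görevin türk istiklalini muhafaza etmektir".split()
hyp = "birinci ödevin türk istiklalini muhafaza etmektir".split()
print(rouge_n(ref, hyp, n=2))  # 0.6
```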

METEOR

The METEOR metric was first proposed in [13]. It is based on the harmonic mean of unigram precision and recall [13]. In addition to standard exact word matching, it supports stemming and synonym matching. METEOR is implemented in pure Java and requires no installation or dependencies to score output [14].

Word2Vec Semantic Similarity

The word2vec algorithm uses a neural network to learn word associations from a large text corpus. It was first proposed in [15].

The resulting model can detect synonymous or closely related words, and it is also used to calculate the semantic similarity of sentences.

Proposed Evaluation Metric

In the paraphrasing system, word order is not a problem, because the process takes place within a single language. The BLEU score is not well suited to evaluating such systems [16], and the word accuracy metric can evaluate them better than BLEU. Word accuracy has its own weakness, however. A word can have more than one synonym: for example, “harika” (wonderful), “mükemmel” (excellent), “kusursuz” (flawless), and “muhteşem” (magnificent) are synonyms, but only one of them appears in the reference sentence. A hypothesis that uses a different synonym is counted as an error, so the measured success is lower than expected. Word2vec similarity has the opposite weakness. Although “harika” and “kötü” (bad) are not synonyms, they have a certain similarity because both are adjectives, so the Word2vec similarity is higher than expected. To avoid both problems, only the N most similar words in the Word2vec model are taken into account. N is typically between 10 and 50; N = 25 is chosen in this study.

First, the words that differ between the reference and the hypothesis text are determined. If a differing word is among the top N most similar words in the Word2vec model, its Word2vec similarity is added to the score.

Example: ProposedScore, Word2Vec, WordAccuracy

Hyp: Geçen hafta izlediğimiz film harika 

Ref1: Geçen hafta izlediğimiz film mükemmel

Ref2: Geçen hafta izlediğimiz film kötü

DifferentWord(Hyp,Ref1) = mükemmel

DifferentWord(Hyp,Ref2) = kötü

Word2VecSim(Harika->mükemmel)= 0.80 (in top sim)

Word2vecSim(Harika->kötü)= 0.40(not in top sim)

“mükemmel” 

Hyp-Ref1: WER=1 WordAcc: 4 /5 =0.80

Hyp-Ref2: WER=1 WordAcc: 4 /5 =0.80

Hyp-Ref1: Word2vecSimilarity: 0.96

Hyp-Ref2: Word2vecSimilarity: 0.88

Hyp-Ref1  ProposedMetric: 0.96

Hyp-Ref2  ProposedMetric: 0.80
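The example values are consistent with the following reading of the proposed metric: each matching word earns full credit, a differing word earns its Word2vec similarity if it is among the top N neighbours, and the total is divided by the sentence length. The sketch below implements that reading. The function name, the data layout, and the assumption that both sentences have the same length and word order are illustrative choices.

```python
def proposed_score(reference, hypothesis, similarity, top_n_neighbours):
    """Word accuracy in which a top-N near-synonym earns partial credit."""
    credit = 0.0
    for ref_word, hyp_word in zip(reference, hypothesis):
        if ref_word == hyp_word:
            credit += 1.0
        elif ref_word in top_n_neighbours.get(hyp_word, ()):
            credit += similarity[(hyp_word, ref_word)]
    return credit / len(reference)

# Values copied from the example; a real system would query a trained
# Word2vec model for similarities and top-N neighbour lists.
similarity = {("harika", "mükemmel"): 0.80, ("harika", "kötü"): 0.40}
top_n_neighbours = {"harika": {"mükemmel"}}

hyp = "geçen hafta izlediğimiz film harika".split()
ref1 = "geçen hafta izlediğimiz film mükemmel".split()
ref2 = "geçen hafta izlediğimiz film kötü".split()
print(round(proposed_score(ref1, hyp, similarity, top_n_neighbours), 2))  # 0.96
print(round(proposed_score(ref2, hyp, similarity, top_n_neighbours), 2))  # 0.8
```

For the first reference the score is (4 + 0.80) / 5 = 0.96. For the second, "kötü" is outside the top N list, so the score stays at the word accuracy of 4 / 5 = 0.80.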

IV. EXPERIMENTAL RESULTS

Feature Type

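Table 4: Effect of Feature Type

| Type | Train | Test | W2vec | Wacc | Met. | Rouge2 | Bleu |
|---|---|---|---|---|---|---|---|
| Word | 36500 | 4060 | 80.5 | 71.5 | 63.7 | 58.0 | 49.3 |
| Stem | 36500 | 4060 | 87.9 | 68.5 | 58.8 | 53.0 | 43.3 |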

Words performed better than stems, because a stem can take many different suffix forms, and this reduces the word accuracy. For example, the ‘in’ complement suffix has many forms, such as ‘n’, ‘ın’, ‘in’, ‘un’, ‘ün’, ‘nın’, ‘nin’, ‘nun’, and ‘nün’. Splitting words into stems and suffixes also increases the sequence length.

Example: Word or Stem Tokens

Word:bu kelimenin anlamı  nedir|bu sözcüğün manası nedir

Stem:bu kelime nin anlam ı ne dir|bu sözcüğ ün mana  ne dir

Data Size

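Table 5: Effect of Data Size

| Train | Test | Score | W2vec | Wacc | Met. | Rouge | Bleu |
|---|---|---|---|---|---|---|---|
| 74160 | 8295 | 85.6 | 87.3 | 84.7 | 81.2 | 74.5 | 72.7 |
| 143011 | 10767 | 88.1 | 89.7 | 87.7 | 83.6 | 79.0 | 75.6 |
| 182818 | 14694 | 90.0 | 91.0 | 89.4 | 86.3 | 79.9 | 79.4 |
| 215800 | 17930 | 93.8 | 93.9 | 93.5 | 91.7 | 86.8 | 87.0 |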

As the size of the data increases, the success rates increase, because more data gives better word alignment and fewer rare words. In addition, paraphrasing takes place within a single language, so the word order is the same and some words do not change. As a result, the accuracy of the word alignment step is high.

With training data of 215800 sentences, a word accuracy of 93.7% and a BLEU score of 87.0 are obtained.

Sample Results 

Table 6: Sample Paraphrasing Results

Ori: Birinci vazifen Türk istiklalini Türk Cumhuriyetini ilelebet muhafaza ve müdafaa etmektir
Paraph: Birinci görevin Türk bağımsızlığını Türk Cumhuriyetini sonsuza kadar korumak ve savunmaktır
Ori: Mevcudiyetinin ve istikbalinin yegane temeli budur
Paraph: Varlığının ve geleceğinin tek temeli budur.
Ori: Bu temel senin en kıymetli hazinendir
Paraph: Bu esas senin en değerli hazinendir
Ori: İstikbalde dahi seni bu hazineden mahrum etmek isteyecek dahili ve harici bedhahların olacaktır
Paraph: Gelecekte bile seni bu hazineden yoksun bırakmak isteyecek iç ve dış düşmanların olacaktır
Ori: Bir gün istiklal ve cumhuriyeti müdafaa mecburiyetine düşersen vazifeye atılmak için içinde bulunacağın vaziyetin imkan ve şeraitini düşünmeyeceksin
Paraph:  Bir gün bağımsızlık ve Cumhuriyeti savunmak zorunluluğuna düşersen, göreve atılmak için, bulunduğun durumun olanak ve şartlarını düşünmeyeceksin
Ori: Bu imkan ve şerait çok namüsait bir mahiyette tezahür edebilir
Paraph: Bu olanak ve şartlar, çok elverişsiz bir özellikte ortaya çıkabilir
Ori: İstiklal ve cumhuriyetine kastedecek düşmanlar bütün dünyada emsali görülmemiş bir galibiyetin mümessili olabilirler
Paraph:  Bağımsızlık ve cumhuriyetini yok etmek isteyecek düşmanlar bütün dünyada eşi görülmemiş bir galibiyetin temsilcisi olabilirler

V. CONCLUSION

This study presents the details of an automatic style transfer system that uses a supervised phrase-based statistical model. When sufficient data are available, a statistical phrase-based model achieves quite high success rates in paraphrasing. The main reason is the high accuracy of word alignment: because the system works within a single language, the word order is similar and some words remain the same. The system is applied to convert text written in old Turkish to modern Turkish, and text written in modern Turkish to old Turkish. Its importance lies in making old literature accessible to the new generation. At the same time, old-style texts can be generated for historical films and TV series.

REFERENCES

[1] Akay R., “The social and language reasons of the changes in language”, Uluslararası İnsan Bilimleri Dergisi, Vol. 4, No. 1, pp. 1-9, 2007.
[2] Shardlow M., “A survey of automated text simplification”, International Journal of Advanced Computer Science and Applications, 4(1), pp. 58-70, 2014.
[3] Torunoglu-Selamet D., Pamay T., Eryigit G., “Simplification of Turkish sentences”, In The First International Conference on Turkic Computational Linguistics, pp. 55-59, 2016.
[4] Zhu Z., Bernhard D., Gurevych I., “A monolingual tree-based translation model for sentence simplification”, In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 1353-1361, 2010.
[5] Xu W., Ritter A., Dolan B., Grishman R., Cherry C., “Paraphrasing for style”, In Proceedings of COLING 2012, pp. 2899-2914, 2012.
[6] Jhamtani H., Gangal V., Hovy E., Nyberg E., “Shakespearizing modern language using copy-enriched sequence-to-sequence models”, arXiv preprint arXiv:1707.01161, 2017.
[7] Brown P. F., Della Pietra S. A., Della Pietra V. J., Mercer R. L., “The mathematics of statistical machine translation: Parameter estimation”, Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993.
[8] Dyer C., Chahuneau V., Smith N. A., “A simple, fast, and effective reparameterization of IBM Model 2”, In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 644-648, 2013.
[9] Heafield K., “KenLM: Faster and smaller language model queries”, In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187-197, 2011.
[10] Papineni K., Roukos S., Ward T., Zhu W. J., “BLEU: a method for automatic evaluation of machine translation”, In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, 2002.
[11] Levenshtein V. I., “Binary codes capable of correcting deletions, insertions, and reversals”, In Soviet Physics Doklady, Vol. 10, No. 8, pp. 707-710, 1966.
[12] Lin C. Y., “ROUGE: A package for automatic evaluation of summaries”, In Text Summarization Branches Out, pp. 74-81, 2004.
[13] Banerjee S., Lavie A., “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments”, In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005.
[14] https://github.com/cmu-mtlab/meteor (Web access: 14.11.2020)
[15] Mikolov T., Chen K., Corrado G., Dean J., “Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781, 2013.
[16] Sulem E., Abend O., Rappoport A., “BLEU is not suitable for the evaluation of text simplification”, arXiv preprint arXiv:1810.05995, 2018.
