Statistical Machine Translation (SMT) has been on the Machine Translation (MT) scene for some time now. Since its creation, SMT has proved itself to be an invaluable method in MT, shaping the field into what it is today.

SMT has impressive strengths as well as significant flaws in comparison to other MT approaches, such as Rule-Based Machine Translation (RBMT) and Neural Machine Translation (NMT). How does SMT work, what are its pros and cons, and how well does it hold up in the current field of MT?

What Is SMT?

SMT uses statistical analysis and predictive algorithms to learn translation rules from data and to pick the most probable translation for each source sentence. These models are trained on a bilingual corpus, a collection of texts paired with their translations.
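As a rough illustration of the statistical idea, the sketch below scores a few candidate translations by combining two probabilities: how well the candidate explains the source sentence (a translation model) and how fluent it is in the target language (a language model). This is the classic noisy-channel view of SMT. The candidates and all probability values are invented for illustration. A real system would estimate them from a bilingual corpus and from target-language text.

```python
import math

# Hypothetical values for one French source sentence, "la maison est petite".
# P(source | candidate), as a translation model might estimate it.
TRANSLATION_PROB = {
    "the house is small": 0.30,
    "the home is small": 0.25,
    "small is the house": 0.30,
}

# P(candidate), as a target-language model might estimate it.
LANGUAGE_PROB = {
    "the house is small": 0.020,
    "the home is small": 0.010,
    "small is the house": 0.001,
}


def score(candidate):
    """Log of P(source | candidate) * P(candidate)."""
    return math.log(TRANSLATION_PROB[candidate]) + math.log(LANGUAGE_PROB[candidate])


def best_translation(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


print(best_translation(list(TRANSLATION_PROB)))
```

In this toy setup the translation model cannot tell "the house is small" from "small is the house", so the language model's preference for fluent word order decides the outcome.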

Because the model learns from the subject matter of its training corpus, an SMT system is best suited to documents on that same subject. As a rule of thumb, a solid corpus needs around 100 million words and 1 million aligned sentences to be effective.

SMT comes in several variants: word-based, phrase-based, syntax-based and hierarchical phrase-based.
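To make the word-based variant more concrete, here is a minimal sketch of IBM Model 1, a well-known word-based model that learns word translation probabilities from sentence pairs with the expectation-maximization (EM) algorithm. The three-sentence German-English corpus and the function name train_ibm_model1 are made up for this example, and the sketch leaves out details such as the NULL word that full implementations usually include.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word)."""
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-translations
        total = defaultdict(float)  # expected translations per source word

        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction

        # M-step: turn expected counts back into probabilities.
        t = defaultdict(float)
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# Toy corpus: (source tokens, target tokens).
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

t = train_ibm_model1(CORPUS)
print(round(t[("the", "das")], 3), round(t[("house", "das")], 3))
```

Because "das" appears alongside "the" in two sentences but alongside "house" and "book" only once each, the estimated probability of "the" given "das" rises with every iteration. Phrase-based systems build on word alignments of this kind to extract multi-word translation units.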

To put it simply, statistical models must go from Point A to Point B to get to Point C. This is a very different translation approach from NMT, where the models teach themselves how to go directly to Point C and are not exclusively reliant on probability or chance.

Pros of SMT

A big advantage of SMT is the availability of platforms and algorithms. Much of the work of building a corpus and training a model might already be done for you, and it can be had at a much lower cost than starting from scratch. As a result, you can train and add new languages very quickly in comparison to other MT models.

SMT also requires fewer computing resources than other models of MT, which makes it easier to operate and train on smaller systems. This means that a company doesn’t need to dedicate an entire server to MT alone.

A system trained on a well-built, tailored corpus can translate content consistently and is often more accurate than NMT. However, the translated content often contains errors that require post-editing, so it isn’t suitable for external communications until that has been done.

Cons of SMT

One weakness of SMT is the difficulty of translating material that differs from the content of the training corpora. While SMT can excel with material that the training corpora cover, such as technical texts written in a simple style, it will struggle if it’s given text that contains slang, idioms or an overall casual style.

In these cases, the accuracy of SMT falls drastically. As a result, the corpora should be customized for a specific style to be most effective. Even then, SMT is unable to translate idioms and marketing material well, and using it on casual text results in poor accuracy.
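One rough way to gauge how far a new text sits from the training corpora is to measure how many of its words the system has never seen, often called the out-of-vocabulary (OOV) rate. The sketch below assumes simple whitespace tokenization and uses invented sentences. The helper name oov_rate is hypothetical, and unknown words are only one symptom of a domain mismatch, not a full measure of translation quality.

```python
def oov_rate(training_sentences, new_sentence):
    """Fraction of tokens in new_sentence that never occur in the training data."""
    vocabulary = {w for s in training_sentences for w in s.lower().split()}
    tokens = new_sentence.lower().split()
    if not tokens:
        return 0.0
    unknown = [w for w in tokens if w not in vocabulary]
    return len(unknown) / len(tokens)


# Hypothetical training data: technical text written in a simple style.
TRAINING = [
    "press the power button to restart the device",
    "connect the cable to the power supply",
]

in_domain = "connect the device to the power supply"
casual = "gonna grab a bite then hit the sack"

print(oov_rate(TRAINING, in_domain))  # 0.0: every word was seen in training
print(oov_rate(TRAINING, casual))     # 0.875: only "the" was seen in training
```

A high rate on a sample of the material to be translated suggests the corpora should be extended or customized before the system is relied on.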

Another issue is that SMT systems need bilingual content, which can be hard to find for rarer languages. In addition, language pairs that span different language families tend to reach only low accuracy, and the resulting translations will be poor.

SMT can also be expensive. While it’s much less expensive than NMT, it still requires a substantial upfront investment. Preprocessing and corpus creation are not only expensive and time-consuming, but they also require collaboration among computer scientists, translators and linguists. The full process can take months.

Not only that, but it’s harder to fix mistakes once they’ve made their way into the system. With models like RBMT, you can fix errors and remove certain words fairly easily. With SMT, you need to retrain the whole system and then check whether other errors were introduced.

How Does SMT Compare to Other Approaches?

In comparison to other MT approaches, SMT has some clear advantages, especially when it comes to widely used languages that are within the same language family. The automation is another huge benefit, and its availability across most platforms helps with accessibility and compatibility.

If a company is serious about putting time, money, and effort into an MT solution for a specific need, SMT can be the perfect answer. However, other MT models will be more effective if you’re dealing with rare languages, casual text or content that is varied in nature.

Recently, some companies have moved away from a purely statistical approach to MT, instead using other models, such as hybrid MT or NMT. While SMT has done a lot of legwork for the MT field up to this point, the question is whether it will be dropped in the future in favor of other models.