BERT (Bidirectional Encoder Representations from Transformers), and related transformers, have had a huge impact on the field of Natural Language Processing (NLP). At Yext, BERT is one of the main underpinnings of our Answers platform, particularly as it relates to semantic search (we'll touch on this in a bit). For quite a while, it has been a "gray box" because its inner workings haven't been well understood. This is common in the field of machine learning, where the community doesn't entirely understand how machine learning models work. However, the models perform so well that many are willing to accept the risks of using a poorly understood tool. As we build more complex NLP tasks, we need to understand the strengths and weaknesses of our methods in order to make better business and engineering decisions. This blog post will discuss some recent research on how and why BERT works, including its strengths and weaknesses.
Before we can talk about the research into the mysterious internal workings of BERT, we need to review the architecture. BERT is based on the transformer architecture. As the name implies, a transformer is designed to transform data. For example, translation software could use a transformer to transform an English sentence into its equivalent German sentence. As such, transformers have become popular in the field of machine translation.
To accomplish this, the transformer architecture has two main parts: an encoder and a decoder. The encoder is designed to process the input and distill it down to its meaning (independent of the language). For example, let's say we assign the ID "12" to "the phrase used to greet someone in the morning." The encoder would accept "Good morning" as input and output "12." The decoder would then know that "12" encodes the phrase that's used to greet someone in the morning, and it would output "Guten Morgen" (German for "Good morning"). In practice, the encoder outputs more than a single ID like "12." Instead, it outputs hundreds of floating-point numbers that encode the meaning of the input. The decoder then decodes all of these numbers into the second language.
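To make this concrete, here is a minimal sketch of an encoder-decoder transformer doing exactly this kind of translation. It assumes the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-de checkpoint, which are illustrative choices rather than the specific models discussed in this post.

```python
# A minimal sketch of an encoder-decoder transformer used for translation.
# Assumes the Hugging Face `transformers` library and the publicly available
# Helsinki-NLP/opus-mt-en-de checkpoint (illustrative only).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# The encoder condenses "Good morning" into a set of numbers representing its
# meaning; the decoder turns those numbers into the German equivalent.
result = translator("Good morning")
print(result[0]["translation_text"])  # e.g. "Guten Morgen"
```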
It turns out that these numbers output by the encoder are useful for more than just machine translation. Within the context of our Answers platform, BERT helps us translate queries and phrases into values that can be more easily compared, contrasted, or otherwise analyzed. For example, it can be used to detect whether two queries are similar. Given the queries "I want to buy a car that seats 8" and "Where to get a new minivan," it may be difficult for a computer to realize they are similar, as on paper, they share very few words. However, when passed through a series of encoders, they are assigned similar numbers. For example, the encoder output for the first query might be 768 values such as [0.89, 0.20, 0.10, 0.71, …] and the encoder output for the second query might be 768 separate (but similar) values such as [0.87, 0.21, 0.12, 0.68, …]. Once again, in practice, it's slightly more complicated. The encoder actually outputs many sets of 768 numbers (one for each input token), and we combine them together via summation or averaging. These combined outputs are called embeddings or embedding vectors. While it's hard to compare queries as a series of words, it's easy to compare numbers, and the computer can detect that these queries are very similar based on their embeddings. We don't need the decoder part of the transformer to accomplish our goal of detecting similar queries, so we remove it. This is what BERT is: a series of encoder layers.
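The sketch below shows what this comparison might look like in code. It assumes the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint, and it averages the per-token 768-dimensional outputs into a single embedding per query, as described above; this is an illustrative setup, not the exact configuration used in the Answers platform.

```python
# A minimal sketch of comparing two queries with BERT embeddings.
# Assumes Hugging Face `transformers`, PyTorch, and bert-base-uncased
# (illustrative only, not necessarily the production setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Return a single 768-dimensional embedding for `text` via mean pooling."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, num_tokens, 768);
    # average over the tokens to get one vector per query.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query_a = embed("I want to buy a car that seats 8")
query_b = embed("Where to get a new minivan")

# Cosine similarity close to 1.0 suggests the embeddings (and, ideally,
# the queries themselves) are semantically similar.
similarity = torch.nn.functional.cosine_similarity(query_a, query_b, dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```

Because each query is now just a vector of numbers, a simple metric like cosine similarity is enough to quantify how close two queries are in meaning.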
Figure 2 shows the architecture of an example vanilla transformer, while Figure 3 shows an example BERT architecture.