Customer Selection
Yext is selective about the kinds of customers we work with, which dramatically reduces the risk surface for unethical inputs to AI. We do not publish end-user-generated content of the kind found on social media websites (e.g., Facebook or Twitter), which is often ethically complicated. While our AI models must be able to respond to end-user inputs, the data inputs for these models are derived from upstanding businesses that do not engage in ethically risky content production.
Training Data Characteristics
It is important to understand that not all areas of language are equally prone to bias, let alone ethical bias (i.e., bias related to factors like gender, ethnicity, or age). The vast majority of the data that we label and use for machine learning (ML) at Yext represents concrete, verifiable, and specific information about businesses and institutions, provided by those institutions themselves (online on their own websites or in the form of digitized internal documentation).
Unlike generalized web search across all resources available online, as offered by tools like Google Search or Bing, Yext's domain is enterprise search, which means search exclusively within a particular company or institution and its knowledge base. Given this focus, ethically charged topics or concepts only rarely appear in the materials that are used for AI training. Consequently, it is hard to imagine a scenario where a bias could swing a particular annotation one way or another. There is always an external "source of objective truth" that the annotators need to refer to, as instructed in the labeling guidelines. Should any uncertainties arise, the annotators have the option to escalate a particular labeling task to their manager, who provides advice from both the linguistic and content perspectives and involves other subject matter experts as needed.
That being said, our data scientists always train ML algorithms on sufficiently large volumes of data that are representative of the scenarios in which the algorithm is to be deployed. By doing so, we prevent any idiosyncratic occurrences of inaccurate or biased labels from skewing the statistical pattern-matching that produces the AI models.
Data Selection
The majority of labeling tasks begin with collecting datasets from search logs. When constructing a corpus of data for labeling, we avoid over-indexing on large clients by ensuring that no more than 40% of the data comes from any one client and that at least four clients are represented in the dataset, unless there is a good reason to do otherwise (e.g., training a customer-specific model).
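The two corpus constraints above (a 40% cap per client and a minimum of four clients) can be checked mechanically before labeling starts. The following is a minimal sketch, not Yext's actual tooling: the function name, the constants, and the assumption that each record is represented by a client identifier are all hypothetical.

```python
from collections import Counter

# Thresholds taken from the text; the names are illustrative assumptions.
MAX_CLIENT_SHARE = 0.40  # no more than 40% of the data from one client
MIN_CLIENTS = 4          # at least four clients represented


def check_client_balance(client_ids, max_share=MAX_CLIENT_SHARE,
                         min_clients=MIN_CLIENTS):
    """Return a list of violation messages; an empty list means the corpus passes.

    `client_ids` holds one (hypothetical) client identifier per data point.
    """
    counts = Counter(client_ids)
    total = sum(counts.values())
    if total == 0:
        return ["corpus is empty"]

    violations = []
    if len(counts) < min_clients:
        violations.append(
            f"only {len(counts)} client(s) represented, need {min_clients}"
        )
    for client, n in sorted(counts.items()):
        # A share exactly equal to the cap is allowed ("no more than").
        if n / total > max_share:
            violations.append(
                f"client {client!r} supplies {n / total:.0%} of the data"
            )
    return violations
```

A customer-specific model, the exception named above, would simply skip this check or call it with relaxed thresholds.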
Labeling Process and Review Mechanisms
To guarantee the highest possible quality of labeled data, each labeling task must have clear written labeling guidelines that reflect the objectives of the project and explain in detail what labels should be used and how they should be applied. The guidelines are the result of collaboration among a linguistic expert/data-labeling manager, a data scientist, and a product manager.
Each labeling project is first tested on a small amount of data in order to gather feedback for further clarification of the guidelines. After that, the labeling project is passed on to the annotators, who are in constant communication with the labeling manager. The manager's task is to resolve any issues or ambiguities that the annotators raise during the labeling process and to record the applied solutions in the labeling guidelines so that they can be applied consistently in the future.
Marking Problematic Data as Corrupt
In order to maintain the four main ethical objectives stated above, the annotators are instructed to mark as Corrupt any queries and/or responses that contain vulgar, profane, or ethically questionable content. Data with this tag are excluded from all AI training. The same rule applies to queries that contain personally identifiable information (PII) and to otherwise corrupted data (i.e., data that are meaningless or irrelevant for the given business domain).
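In a training pipeline, the rule above amounts to a filter applied before any data reaches a model. The sketch below assumes a hypothetical record layout (a dict with "query" and "label" keys) and a hypothetical function name; only the Corrupt tag itself comes from the text.

```python
CORRUPT = "Corrupt"  # tag named in the text


def drop_corrupt(records):
    """Return only the records that annotators did not tag as Corrupt.

    Each record is assumed (for illustration) to be a dict with a "label" key.
    """
    return [record for record in records if record["label"] != CORRUPT]
```

Filtering on the tag alone keeps the decision with the annotators: the pipeline does not try to detect profanity or PII itself, it only honors the label.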
Consensus Between Multiple Annotators
In order to prevent any unwanted bias, the majority of data used for model training or Search performance analysis are labeled by at least two annotators. If the annotators disagree on the label, the task is escalated for "disagreement resolution," in which an annotator assesses both labels and chooses the more appropriate one. If there are doubts about which label should be chosen, the task is further escalated to the labeling manager, who discusses the optimal resolution with all annotators involved in the process. If agreement cannot be reached (which happens only rarely), the data point is discarded.
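The escalation path described above can be read as a three-stage decision. The sketch below is an illustration under stated assumptions rather than a description of how the labeling software implements it: the function name is hypothetical, and the two human steps are modeled as callables that return a label, or None when the step cannot settle the question.

```python
def resolve_label(labels, adjudicator, manager):
    """Return the final label for one data point, or None if it is discarded.

    labels      -- labels from the primary annotators (at least two)
    adjudicator -- callable modeling the disagreement-resolution annotator;
                   returns the chosen label, or None when in doubt
    manager     -- callable modeling the labeling manager's discussion with
                   the annotators; returns the agreed label, or None when
                   no agreement is reached
    """
    if len(labels) < 2:
        raise ValueError("at least two annotator labels are required")

    # Stage 1: the annotators agree, so no escalation is needed.
    if len(set(labels)) == 1:
        return labels[0]

    # Stage 2: disagreement resolution by an annotator.
    choice = adjudicator(labels)
    if choice is not None:
        return choice

    # Stage 3: escalation to the labeling manager; None means discard.
    return manager(labels)
```

Returning None for a discarded data point lets downstream code drop it with the same kind of filter used for other excluded data.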
Final Review
To add an extra layer of protection against bias and errors that could slip through and compromise the quality of the labeled data, the more experienced annotators manually review most labels assigned during the primary annotation process. Systematic implementation of this review process has been possible since March 2022, when Yext invested in the enterprise edition of Label Studio, a labeling platform for large-scale labeling operations, in which all our labeling tasks are currently carried out.