Snorkel AI looks beyond data labelling for generative AI

June 30, 2023

Snorkel AI has unveiled new features that go beyond traditional data labelling to assist organisations in curating and preparing data for generative artificial intelligence (AI). The company's data platform, which already leveraged large language models (LLMs) for data labelling, now offers the Snorkel Foundry and GenFlow services. The Snorkel Foundry facilitates data curation, helping organisations optimise their data mix for building customised LLMs, while GenFlow assists in filtering poor-quality data points to enhance the output of generative AI models.

A common problem with generative AI tools is hallucination, where a model generates inaccurate responses because it was not trained for the specific task or lacks sufficient information. A Google executive has criticised ChatGPT over this issue; Snorkel Foundry addresses the challenge by curating data from a repository during the pre-training phase and creating a custom software solution. “Hallucinations are another type of error that result from not training the model to perform a specific task in the first place,” said Alex Ratner, CEO and co-founder of Snorkel AI.

By providing the right mix of data, organisations can mitigate bias and reduce the risk of hallucination, resulting in more accurate generative AI models. The tool's data sampling function enables users to identify data relevance programmatically, optimising the data mixture for training machine learning models.
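As a rough illustration of what identifying relevance programmatically and adjusting the data mix could look like, the following is a minimal sketch and not Snorkel Foundry's actual interface. The function names, the keyword-based relevance heuristic, and the `floor` parameter are all assumptions made for the example.

```python
import random

def relevance_score(text, keywords):
    """Hypothetical heuristic: fraction of task keywords that appear in the text."""
    words = set(text.lower().split())
    return sum(1 for kw in keywords if kw in words) / len(keywords)

def sample_mixture(docs, keywords, k, floor=0.05, seed=0):
    """Draw k documents with probability proportional to relevance.

    The small floor keeps off-topic documents in the mix at a low rate
    instead of excluding them entirely (an assumption, not a Snorkel setting).
    """
    rng = random.Random(seed)
    weights = [relevance_score(d, keywords) + floor for d in docs]
    return rng.choices(docs, weights=weights, k=k)

docs = [
    "customer disputes a fraud charge on a bank card",
    "recipe for a quick weeknight pasta",
]
mix = sample_mixture(docs, keywords=["fraud", "bank"], k=1000)
```

In this sketch the on-topic document dominates the sampled mixture, while the off-topic one still appears occasionally; a real pipeline would likely replace the keyword heuristic with a learned relevance model.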

After large language models are pre-trained, additional instruction tuning is often performed. Snorkel AI's GenFlow service supports this stage by gathering feedback and filtering out poor-quality data points according to rules programmed by the software developer. With this tooling and its management capabilities, GenFlow is intended to help generative AI models produce better outputs. The feedback mechanism differs from traditional data labelling in that it captures user preferences for summaries or responses rather than binary classifications.
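The idea of developer-programmed filters over instruction-tuning data can be sketched as follows. This is an illustrative example under stated assumptions, not GenFlow's API: the rule functions, the record fields `response` and `rating`, and the 1-to-5 rating scale are all hypothetical.

```python
def too_short(ex):
    """Flag responses with fewer than three words."""
    return len(ex["response"].split()) < 3

def refuses(ex):
    """Flag responses that open with a refusal phrase."""
    return ex["response"].lower().startswith("i cannot")

def low_rating(ex):
    """Flag responses a user rated below 3 on a hypothetical 1-5 scale."""
    rating = ex.get("rating")
    return rating is not None and rating < 3

FILTERS = [too_short, refuses, low_rating]

def filter_examples(examples, filters=FILTERS):
    """Split examples into kept ones and dropped ones with the rules that fired."""
    kept, dropped = [], []
    for ex in examples:
        reasons = [f.__name__ for f in filters if f(ex)]
        if reasons:
            dropped.append((ex, reasons))
        else:
            kept.append(ex)
    return kept, dropped

examples = [
    {"response": "The report summarises quarterly fraud trends by region.", "rating": 4},
    {"response": "OK.", "rating": 5},
    {"response": "I cannot help with that request today.", "rating": 4},
    {"response": "The summary covers all three sections in order.", "rating": 1},
]
kept, dropped = filter_examples(examples)
```

Recording which rule dropped each example keeps the filtering auditable, and the `rating` field stands in for the preference-style feedback described above, as opposed to a binary label.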

While generative AI has received significant attention, Ratner emphasises that traditional predictive AI will likely deliver most of the enterprise value from AI in the long run. Data labelling remains crucial for predictive AI tasks like fraud classification, providing feedback to improve model performance.

Although generative AI requires a different form of feedback, the need for feedback persists, whether in the form of labels, long-form answers, or ratings. Snorkel AI aims to streamline and accelerate this feedback process through programmatic and well-managed approaches.