TruEra, a vendor that provides tools for testing, debugging and monitoring machine learning (ML) models, today announced the release of TruLens, open-source software designed for testing applications built on large language models (LLMs) such as the GPT series.
TruLens, which is available for free today, gives organisations a quick and easy way to review and iterate on their LLM applications, helping reduce the risk of hallucination and bias before those applications reach production.
Currently, just a few vendors provide tools to address this element of LLM app development, even as enterprises across industries continue to investigate the potential of generative AI for various use cases.
Why use TruLens for LLM applications?
LLMs are popular, but when it comes to developing apps based on these models, businesses must go through a time-consuming trial phase that relies on human-driven response scoring. Once the first version of an app is created, teams must manually test and assess its responses, tweak prompts, hyperparameters and models, and then re-test until a satisfactory result is obtained.
This process takes a long time and is difficult to scale.
TruEra is addressing this issue with TruLens by offering a programmatic approach to evaluation known as “feedback functions.” According to the company, a feedback function evaluates the quality and efficacy of an LLM application’s output by analysing both the text generated by the LLM and the response’s metadata.
“Consider it a way to track and evaluate direct and indirect feedback on the performance and quality of your LLM app. This enables developers to create credible and powerful LLM apps more quickly. You can use it for a wide range of LLM use cases, such as chatbot question answering, information retrieval, and so on,” Anupam Datta, TruEra’s cofounder, president and chief scientist, told VentureBeat.
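To make the idea concrete, here is a minimal, hypothetical sketch in Python of what a feedback function could look like. The helper names and scoring heuristics below are illustrative assumptions, not TruEra’s implementation; the point is simply that each response can be scored programmatically rather than by hand.

# Hypothetical illustration of the "feedback function" concept: a function that
# takes an LLM app's input and output and returns a score that can be logged
# alongside the response. These heuristics are placeholders, not TruEra's code.

def response_verbosity(prompt: str, response: str) -> float:
    """Return 1.0 for concise answers, lower scores as answers grow long."""
    ratio = len(response.split()) / max(len(prompt.split()), 1)
    return 1.0 if ratio <= 3 else max(0.0, 1.0 - (ratio - 3) / 10)

def prompt_coverage(prompt: str, response: str) -> float:
    """Crude relevance proxy: share of the prompt's vocabulary reused in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / len(prompt_words) if prompt_words else 1.0

# Each app call can then be evaluated programmatically instead of by hand.
prompt = "Summarise our refund policy in two sentences."
response = "Refunds are issued within 30 days of purchase. Contact support to start a claim."
print({
    "verbosity": response_verbosity(prompt, response),
    "prompt_coverage": prompt_coverage(prompt, response),
})

Swapping in an LLM-graded scorer (for relevance, toxicity or truthfulness, say) follows the same shape: a function from inputs and outputs to a score.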
TruLens can be integrated into the development process with a few lines of code. Once it’s up and running, users can design their own feedback functions tailored to specific use cases, or rely on the built-in options.
Currently, the software includes built-in feedback functions that assess truthfulness, question-answering relevance, harmful or toxic language, user sentiment, language mismatch, response verbosity, and fairness and bias. It also reports how often the LLM is called within the application, providing a convenient way to track usage costs.
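As a rough illustration of what that integration might look like, the snippet below follows the quickstart pattern TruEra has published for its trulens_eval package, wrapping a simple LangChain chain with the built-in language-match feedback function. Exact module paths, class names and method signatures shift between releases of both libraries, so treat the specific calls as assumptions rather than a definitive reference.

# Sketch based on the trulens_eval quickstart pattern; names and signatures
# are assumptions and may differ between releases.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from trulens_eval import Feedback, Huggingface, Tru, TruChain

# A minimal LangChain app to instrument.
prompt = PromptTemplate(input_variables=["question"], template="Answer briefly: {question}")
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

# Built-in feedback function: flag a language mismatch between question and answer.
hugs = Huggingface()
f_lang_match = Feedback(hugs.language_match).on_input_output()

# Wrap the chain so every call is recorded along with its feedback scores
# and LLM usage, which is what feeds the cost tracking described above.
tru = Tru()
truchain = TruChain(chain, app_id="qa_app_v1", feedbacks=[f_lang_match])

truchain("What does our refund policy say about digital goods?")
tru.run_dashboard()  # local dashboard for reviewing records, scores and costs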
“This also assists you in determining how to build the best version of the app at the lowest possible ongoing cost. Every ping adds up,” Datta observed.
Other offerings for LLM applications
While testing LLM-driven apps for performance and response accuracy is critical, only a few players have introduced ways to address it. Datadog’s OpenAI model monitoring integration, Arize’s Phoenix solution and Israel-based Mona Labs’ recently debuted generative AI monitoring solution are among them.
According to TruEra, TruLens is best employed during the development phase of LLM applications.
“This is actually the phase that most companies are in right now — they’re experimenting with development and have a real need for tools to help them iterate faster and zero in on application versions that are both effective at their tasks and risk-free. Of course, you can use it on both development and production models,” Datta explained.
According to a survey conducted by Accenture, 98% of global executives believe that AI foundation models will play a major part in their organisations’ strategies over the next three to five years. This indicates that enterprise demand for products like TruLens will expand in the near future.