Deepchecks: Continuous Validation for LLMs

An Intellyx Brain Candy Brief

Generative AI responses can include hallucinations, incorrect answers, bias, and potentially harmful content. 

Deepchecks’ testing framework continuously monitors and scores LLM responses to ensure you get the results you are looking for while avoiding such errors.

Without such a testing framework, validating LLM responses becomes a trial-and-error exercise involving significant manual effort.

Deepchecks structures responses into a spreadsheet-like view, filtering and sorting them by properties and topics for easy validation and comparison. Deepchecks also highlights the interesting parts of the response text to streamline the LLM evaluation process.

Deepchecks compares versions of LLM applications and versions of prompts, automatically indicating which responses are good and which are not.

The automated process allows you to continue developing and refining prompts at a measured, regular pace. You can choose an LLM property such as “conciseness” and score responses against that property, without writing a lot of code on the back end.
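As a rough illustration of what property-based scoring looks like in practice, the sketch below computes a simple “conciseness”-style metric over a batch of responses. The function names and the scoring heuristic are hypothetical stand-ins for illustration, not Deepchecks’ actual API.

```python
# Hypothetical sketch of property-based response scoring; not Deepchecks' actual API.
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    text: str
    conciseness: float  # 0.0 (verbose) .. 1.0 (concise)


def conciseness_score(text: str, target_words: int = 60) -> float:
    """Toy heuristic: responses at or under the target length score 1.0;
    longer responses are penalized proportionally."""
    n_words = len(text.split())
    if n_words <= target_words:
        return 1.0
    return max(0.0, target_words / n_words)


def score_responses(responses: list[str]) -> list[ScoredResponse]:
    """Score a batch of LLM responses on the 'conciseness' property."""
    return [ScoredResponse(r, conciseness_score(r)) for r in responses]


if __name__ == "__main__":
    batch = [
        "Paris is the capital of France.",
        "Well, there are many ways to think about this question, but ... " * 10,
    ]
    for scored in score_responses(batch):
        print(f"{scored.conciseness:.2f}  {scored.text[:40]}...")
```

In a continuous-validation workflow, a score like this would be tracked across prompt and application versions rather than inspected one response at a time.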

Deepchecks also monitors LLMs in production for new use cases, and checks LLM upgrades that might affect accuracy or performance.

At its core, Deepchecks provides an automatic scoring mechanism for LLMs that combines properties, similarity, and judgment into a single response score, so that you can test LLMs much as you would test classic software.
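To make that idea concrete, here is a minimal sketch of how property scores, a similarity measure against a reference answer, and an LLM-as-judge verdict might be blended into one response score. The weights, the token-overlap similarity, and the component functions are assumptions for illustration, not Deepchecks’ published scoring formula.

```python
# Hypothetical combined-scoring sketch; weights and components are illustrative,
# not Deepchecks' published formula.


def jaccard_similarity(response: str, reference: str) -> float:
    """Crude stand-in for semantic similarity: token-set overlap."""
    a, b = set(response.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_score(
    property_scores: dict[str, float],  # e.g. {"conciseness": 0.9, "relevance": 0.8}
    similarity: float,                  # similarity to a known-good reference answer
    judge_verdict: float,               # 0.0-1.0 verdict from a judge LLM
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),
) -> float:
    """Weighted blend of property, similarity, and judge signals into one score."""
    prop_avg = sum(property_scores.values()) / len(property_scores)
    w_prop, w_sim, w_judge = weights
    return w_prop * prop_avg + w_sim * similarity + w_judge * judge_verdict


if __name__ == "__main__":
    score = combined_score(
        property_scores={"conciseness": 0.9, "relevance": 0.8},
        similarity=jaccard_similarity(
            "Paris is the capital of France.",
            "The capital of France is Paris.",
        ),
        judge_verdict=1.0,  # in practice, produced by a judge model
    )
    print(f"response score: {score:.2f}")  # threshold this like a unit-test assertion
```

Scoring each response this way is what lets LLM checks be thresholded and gated in a pipeline, much like assertions in classic software tests.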

Copyright © Intellyx BV. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. None of the organizations mentioned in this article is an Intellyx customer. No AI was used to produce this article. To be considered for a Brain Candy article, email us at pr@intellyx.com.
