Unit testing LLMs - Lessons from Vertex AI
Intro
To help LLMs, such as those on Vertex AI, align more closely with expected outputs, specific unit testing methods can assess output quality across a variety of scenarios. This article covers four main testing techniques: classification testing to confirm output categories, text generation testing to check format and content guidelines, semantic equivalence testing to check consistency of meaning, and pattern matching to verify structural requirements. Together, these methods provide a structured framework for evaluating and refining LLM responses.
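All of the code examples below assume that a Vertex AI text model and a dictionary of generation parameters are already available as model and parameters. The following is a minimal setup sketch; the project ID, location, model name, and parameter values are assumptions to adapt to your own environment.
Setup
import vertexai
from vertexai.language_models import TextGenerationModel

# Assumed project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# Assumed model name; any Vertex AI text model exposing predict() with these parameters will do.
model = TextGenerationModel.from_pretrained("text-bison")

parameters = {
    "temperature": 0.0,        # deterministic output keeps unit tests more stable
    "max_output_tokens": 256,
}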
Testing Method 1: Classification Testing
Purpose: Verify if the model correctly classifies inputs into predefined categories, such as sentiment or intent.
Common Scenarios:
- Sentiment analysis (Positive, Negative, Neutral).
- Message categorization (e.g., determining if a message is a greeting, question, or statement).
- Customer intent recognition in support chat (e.g., billing inquiry, technical support).
Testing Approach:
- Provide specific inputs and compare the model’s output with the expected category.
Code Example:
Function Definition
def getPositiveOrNegative(prompt):
    # model and parameters are the Vertex AI text model and generation settings from the setup above.
    response = model.predict(
        """Context: You look at messages and categorize them as positive, negative, or neutral.
Output only Positive, Negative, or Neutral.
Message: {0}.
Output: """.format(prompt),
        **parameters
    )
    return response.text.strip()
Unit Test
import unittest

class TestPositiveOrNegative(unittest.TestCase):
    def test_getPositiveOrNegative1(self):
        response = getPositiveOrNegative("Dinner was great")
        self.assertEqual(response, "Positive")

    def test_getPositiveOrNegative2(self):
        response = getPositiveOrNegative("That broccoli was undercooked and cold")
        self.assertEqual(response, "Negative")

    def test_getPositiveOrNegative3(self):
        response = getPositiveOrNegative("We want to try the new Italian place for dinner")
        self.assertEqual(response, "Neutral")
Summary: Classification testing verifies if the model categorizes text inputs accurately. It’s straightforward and useful for tasks with clearly defined categories.
Testing Method 2: Text Generation Testing
Purpose: Validate if the model-generated response meets specific content, formatting, or quality requirements.
Common Scenarios:
- Generating promotional content (e.g., tweets, ads).
- Creating customer support replies that adhere to a certain format.
- Generating summaries or descriptions with specific structure guidelines.
Testing Approach:
- Provide a prompt to generate content, then check if the response meets predefined requirements (e.g., includes certain keywords, respects character limits).
Code Example:
Function Definition
def writeTweet(prompt):
    # Generate a short marketing tweet; the character limit and hashtag rule live in the prompt.
    response = model.predict(
        """Context: You write Tweets for the Marketing Department at Luigi’s Italian Cafe.
1. Keep your Tweets below 100 characters.
2. Include the hashtag #EatAtLuigis at the end of every tweet.
Input: {0}.
Output: """.format(prompt),
        **parameters
    )
    return response.text.strip()
Unit Test
class TestTextGeneration(unittest.TestCase):
    def test_tweet_results(self):
        actual_result = writeTweet("Write a tweet about our half-price bottles of wine every Thursday")
        # Example of an acceptable response:
        # "Thirsty Thursday is here! Enjoy half-price bottles of wine at Luigi's! #EatAtLuigis"
        self.assertIn("#EatAtLuigis", actual_result)
        self.assertLessEqual(len(actual_result), 100)
Summary: Text generation testing ensures that model outputs adhere to certain guidelines, such as including specific phrases or staying within character limits, which is important for consistent quality in content creation.
Testing Method 3: Semantic Equivalence Testing
Purpose: Ensure that two generated outputs convey the same meaning, even if worded differently.
Common Scenarios:
- Checking if rephrased responses carry the same intent.
- Verifying consistent answers for similar questions.
- Ensuring paraphrased content maintains key information.
Testing Approach:
- Generate two responses for similar prompts and check if they are fundamentally equivalent in meaning. You can prompt the model to determine if the two responses are "the same."
Code Example:
Function Definition
def are_responses_equivalent(response1, response2):
    # Use the model itself as a judge of whether two responses mean the same thing.
    response = model.predict(
        """Compare the following responses. Are they fundamentally the same?
Only return Yes or No
Response 1: {0}
Response 2: {1}
Output: """.format(response1, response2),
        **parameters
    )
    return response.text.strip()
Unit Test
class TestSemanticEquivalence(unittest.TestCase):
    def test_semantic_equivalence(self):
        response1 = "The Eiffel Tower is a famous landmark in Paris."
        response2 = "Paris is known for its famous Eiffel Tower."
        same = are_responses_equivalent(response1, response2)
        self.assertEqual(same, "Yes")
Summary: Semantic equivalence testing checks if the model’s responses are consistent in meaning, which is valuable when multiple acceptable outputs exist for the same input.
Testing Method 4: Pattern Matching and Rule-Based Testing
Purpose: Ensure that generated outputs meet specific structural or formatting requirements, like including required phrases or adhering to a character limit.
Common Scenarios:
- Enforcing branding language (e.g., specific hashtags in social media posts).
- Validating customer support responses for key phrases.
- Checking generated content for required structure (e.g., date, location information).
Testing Approach:
- Use regular expressions or predefined rules to verify that the response includes necessary keywords or adheres to formatting guidelines.
Code Example:
Function Definition
def writeResponseWithFormat(prompt):
    # Generate a customer support reply that must follow the format rules stated in the prompt.
    response = model.predict(
        """Context: Respond to customer inquiries following our format guidelines.
- Keep responses under 150 characters.
- Include the phrase 'Thank you for reaching out'.
Input: {0}.
Output: """.format(prompt),
        **parameters
    )
    return response.text.strip()
Unit Test
class TestPatternMatching(unittest.TestCase):
    def test_response_format(self):
        actual_result = writeResponseWithFormat("Customer wants to know about our opening hours.")
        self.assertIn("Thank you for reaching out", actual_result)
        self.assertLessEqual(len(actual_result), 150)
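The testing approach above mentions regular expressions, while the test in the example relies on substring and length checks. Where a stricter structural rule is needed, unittest's regex assertions can be used instead; the test class and patterns below are an illustrative sketch rather than part of the original example.
import re

class TestPatternMatchingRegex(unittest.TestCase):
    def test_response_matches_pattern(self):
        actual_result = writeResponseWithFormat("Customer wants to know about our opening hours.")
        # The required branding phrase must appear somewhere in the response.
        self.assertRegex(actual_result, r"Thank you for reaching out")
        # No unresolved template placeholders (e.g. {0}) should leak into the output.
        self.assertIsNone(re.search(r"\{\d\}", actual_result))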
Summary: Pattern matching and rule-based testing ensures responses include essential elements and adhere to specific formatting requirements, which is crucial for maintaining brand consistency and meeting customer service standards.
Summary
While these testing methods help bring LLM outputs closer to expectations, some ambiguous cases will still require manual review*. These approaches provide a structured assessment, but they do not guarantee perfectly aligned responses every time, so manual review remains essential where model responses vary or miss nuances.
*Manual review:
In classification testing: for a phrase like "We want to try the new Italian place for dinner", words like "want to try" and "new Italian place" may carry positive associations, even if they aren't explicitly positive. If the response doesn't meet expectations, we may need to add more examples to the prompt, as sketched below.
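One way to add more examples is few-shot prompting: extend the classification prompt with a handful of labeled messages so that borderline cases are anchored. The function name and example messages below are an illustrative sketch, not part of the original tests.
def getPositiveOrNegativeFewShot(prompt):
    # Same classifier as before, with a few labeled examples to steer ambiguous messages.
    response = model.predict(
        """Context: You look at messages and categorize them as positive, negative, or neutral.
Output only Positive, Negative, or Neutral.
Message: The pasta was delicious.
Output: Positive
Message: The service was painfully slow.
Output: Negative
Message: We want to try the new Italian place for dinner.
Output: Neutral
Message: {0}.
Output: """.format(prompt),
        **parameters
    )
    return response.text.strip()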
Ref: Google Cloud documentation