Prompt Engineering Course
Evaluation Prompts
Evaluation prompts are used to judge, score, or validate outputs produced by a model.
Instead of generating new content, the model is asked to analyze existing output against clear criteria.
In real-world GenAI systems, evaluation prompts decide whether an output is acceptable, needs revision, or must be rejected.
Why Evaluation Prompts Matter
Generating content is only half of the problem.
The bigger challenge is deciding whether the generated content is:
- Correct
- Relevant
- Complete
- Safe
Evaluation prompts turn subjective judgment into structured decision-making.
How Evaluation Prompts Are Used in Practice
Evaluation prompts are commonly used in:
- Content moderation pipelines
- RAG answer validation
- Automated grading systems
- Prompt refinement loops
They let AI systems check their own outputs automatically.
Basic Evaluation Prompt Structure
Most evaluation prompts follow a clear pattern:
- Provide the output to be evaluated
- Define evaluation criteria
- Specify expected judgment format
Without structure, evaluations become inconsistent.
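As a sketch only, here is how the three parts might be assembled in Python. The function name, prompt wording, and example criteria are illustrative assumptions, not a fixed API:

def build_eval_prompt(output: str, criteria: list[str], fmt: str) -> str:
    # Assemble the three parts: the output, the criteria, and the judgment format.
    lines = ["Evaluate the following output.", "", "Criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines += ["", f"Return your judgment as: {fmt}", "", "Output:", f'"{output}"']
    return "\n".join(lines)

print(build_eval_prompt(
    "SQL joins combine tables using keys.",
    ["Is the explanation technically accurate?",
     "Does it answer the question fully?"],
    "one short verdict per criterion",
))

Keeping prompt assembly in one function means every evaluation in a pipeline uses the same structure, which is what makes results comparable.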
Simple Evaluation Example
Evaluate the following answer for correctness.
Criteria:
- Is the explanation technically accurate?
- Does it answer the question fully?
Answer:
"SQL joins combine tables using keys."
This prompt tells the model exactly what to check instead of leaving the judgment open-ended.
Why Criteria Are Critical
If criteria are vague, the evaluation becomes unreliable.
Compare these two approaches:
- "Is this answer good?"
- "Check accuracy, completeness, and relevance"
Only the second produces repeatable results.
Scoring-Based Evaluation Prompts
Many systems require numerical scores instead of free-form opinions.
Score the following answer from 1 to 5 based on clarity and correctness.
Return only a number.
Answer:
"The transformer uses attention mechanisms."
This is commonly used in automated pipelines.
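In a pipeline, the reply should be validated before it is trusted. The sketch below assumes a hypothetical call_model(prompt) function that returns the model's reply as a string, standing in for whatever client you actually use:

def parse_score(reply: str, lo: int = 1, hi: int = 5) -> int | None:
    # Accept only a bare integer inside the requested range.
    reply = reply.strip()
    if reply.isdigit() and lo <= int(reply) <= hi:
        return int(reply)
    return None  # malformed reply: retry or flag for human review

def score_answer(call_model, answer: str) -> int | None:
    # call_model is a hypothetical placeholder, not a real library function.
    prompt = (
        "Score the following answer from 1 to 5 based on clarity and correctness.\n"
        "Return only a number.\n\n"
        f'Answer:\n"{answer}"'
    )
    return parse_score(call_model(prompt))

Rejecting malformed replies instead of guessing keeps downstream metrics trustworthy.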
Binary Evaluation Prompts
Sometimes a simple yes/no decision is enough.
Does the answer contain any factual errors?
Respond only with YES or NO.
Answer:
"GPT was released in 1995."
Binary evaluation is fast and easy to automate.
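The same validate-before-trust pattern applies to yes/no checks, again assuming the hypothetical call_model helper from the scoring sketch:

def contains_factual_errors(call_model, answer: str) -> bool | None:
    # call_model is a hypothetical placeholder for your model client.
    prompt = (
        "Does the answer contain any factual errors?\n"
        "Respond only with YES or NO.\n\n"
        f'Answer:\n"{answer}"'
    )
    reply = call_model(prompt).strip().upper()
    if reply in ("YES", "NO"):
        return reply == "YES"
    return None  # anything else violates the format; escalate instead of guessing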
Using Evaluation Prompts for Refinement
Evaluation prompts are often paired with revision prompts.
A typical flow looks like:
- Generate output
- Evaluate output
- Revise if needed
This creates an iterative improvement loop.
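A minimal sketch of that loop, reusing the hypothetical call_model helper; the PASS/FAIL verdict wording and the round limit are illustrative choices:

def generate_with_refinement(call_model, task: str, max_rounds: int = 3) -> str:
    draft = call_model(task)  # step 1: generate
    for _ in range(max_rounds):
        verdict = call_model(  # step 2: evaluate
            "Does the answer fully satisfy the task? "
            "Respond only with PASS or FAIL.\n\n"
            f"Task:\n{task}\n\nAnswer:\n{draft}"
        ).strip().upper()
        if verdict == "PASS":
            break
        draft = call_model(  # step 3: revise
            "Revise the answer so it fully satisfies the task.\n\n"
            f"Task:\n{task}\n\nAnswer:\n{draft}"
        )
    return draft

Capping the number of rounds keeps the loop from running forever when the evaluator never passes a draft.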
Evaluation in RAG Systems
In retrieval-augmented generation, evaluation prompts check:
- Answer groundedness
- Use of retrieved context
- Hallucination risk
This prevents models from inventing unsupported facts.
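A groundedness check is just another evaluation prompt. The template below is one possible phrasing, not a standard:

def groundedness_prompt(context: str, answer: str) -> str:
    # Ask the judge to verify every claim against the retrieved context only.
    return (
        "Using ONLY the context below, decide whether every claim in the "
        "answer is supported.\n"
        "Respond only with GROUNDED or UNGROUNDED.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )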
Common Evaluation Mistakes
Common errors include:
- Overly subjective criteria
- Asking the model to judge itself without constraints
- Mixing generation and evaluation in one prompt
Evaluation prompts should stay focused on judgment only.
Best Practices
Strong evaluation prompts:
- Use explicit criteria
- Define output format clearly
- Avoid ambiguous language
Consistency is more important than complexity.
Practice
What is the most important part of an evaluation prompt?
What role do evaluation prompts play in AI pipelines?
What are evaluation prompts mainly used to do with model outputs?
Quick Quiz
What do evaluation prompts rely on most?
When is numeric evaluation most useful?
What do evaluation prompts focus on, as opposed to generation?
Recap: Evaluation prompts convert subjective review into structured, repeatable judgment.
Next up: Meta-prompting — prompts that control and generate other prompts.