31 GenAI in Banking & Finance : Faithfulness Metrics: Evaluating the Factual Consistency of Generated Responses
Why Faithfulness Matters in Generation
In retrieval-augmented generation (RAG), a fluent answer is not necessarily a correct one. Faithfulness asks two questions of every generated response:
- Is the response factually accurate?
- Is the information grounded in the provided context, or is it hallucinated?
How Faithfulness Is Defined: Concepts and Examples
Stepwise Calculation:
- Extract all factual claims from the generated response.
- Cross-check each claim against the provided context.
- Score faithfulness as the ratio of supported claims to total claims:
Faithfulness = (Number of claims supported by the context) / (Total number of claims in the response)
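The ratio above can be sketched directly in Python, assuming the claims have already been extracted from the response and labeled as supported or unsupported (the `faithfulness_score` helper and the example claims are illustrative):

```python
def faithfulness_score(claims: list[tuple[str, bool]]) -> float:
    """Ratio of context-supported claims to total extracted claims."""
    if not claims:
        return 0.0  # no claims to verify
    supported = sum(1 for _, is_supported in claims if is_supported)
    return supported / len(claims)


# Example: three claims extracted from a response, checked against context.
claims = [
    ("The first Super Bowl was played on January 15, 1967.", True),
    ("It was held in Los Angeles.", True),
    ("It took place in 1970.", False),
]
print(faithfulness_score(claims))  # 2 of 3 claims supported
```

Two of the three claims are supported, so the score is 2/3 ≈ 0.67.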
Python Example: Calculating Faithfulness Score
import subprocess
import json
import textwrap


def call_ollama(model: str, prompt: str) -> str:
    """
    Call a local Ollama model via the CLI and return the generated text.
    Requires: `ollama` installed and the model already pulled.
    """
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8").strip()


def build_judge_prompt(question: str, answer: str, contexts: list[str]) -> str:
    context_block = "\n\n".join(f"- {c}" for c in contexts)
    prompt = f"""
You are a strict evaluation assistant.
Your task is to evaluate whether the ANSWER is fully supported by the CONTEXT for the given QUESTION.

QUESTION:
{question}

CONTEXT:
{context_block}

ANSWER:
{answer}

Instructions:
- If every factual claim in the ANSWER can be inferred from the CONTEXT, output JSON: {{"faithfulness": 1.0, "explanation": "..."}}.
- If some claims are unsupported or contradicted, output a score between 0 and 1 (e.g. 0.0, 0.3, 0.5, 0.7, 0.9) as {{"faithfulness": <score>, "explanation": "..."}}.
- Do not add any extra text outside the JSON object.

Now respond with ONLY the JSON object.
"""
    return textwrap.dedent(prompt).strip()


def evaluate_faithfulness_with_ollama(
    model: str,
    question: str,
    answer: str,
    contexts: list[str],
) -> dict:
    prompt = build_judge_prompt(question, answer, contexts)
    raw_output = call_ollama(model, prompt)
    # Try to parse JSON from the model's reply
    try:
        result = json.loads(raw_output)
    except json.JSONDecodeError:
        # Fallback: extract the JSON substring if the model added extra text
        start = raw_output.find("{")
        end = raw_output.rfind("}")
        if start != -1 and end != -1 and end > start:
            result = json.loads(raw_output[start:end + 1])
        else:
            raise ValueError(f"Could not parse JSON from model output:\n{raw_output}")
    return result


if __name__ == "__main__":
    # Example data: the answer states the wrong year, so a faithful judge
    # should return a low score.
    question = "When was the first Super Bowl?"
    answer = "The first Super Bowl was held on January 15, 1970, in Los Angeles."
    contexts = [
        "The First AFL–NFL World Championship Game (later known as Super Bowl I) was played on January 15, 1967, in Los Angeles."
    ]

    model_name = "gemma3:4b"  # any open-source model you pulled with Ollama

    result = evaluate_faithfulness_with_ollama(
        model=model_name,
        question=question,
        answer=answer,
        contexts=contexts,
    )
    print(f"Faithfulness score: {result.get('faithfulness')}")
    print(f"Explanation: {result.get('explanation')}")

Real-Life Application: Hallucinations and Trust
- Early detection of misinformation
- Improved reliability for high-stakes industries (finance, healthcare, legal)
- Auditing and continuous improvement with transparent error analyses
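For auditing, the single-example judge above can be run over a small evaluation set and the scores aggregated; here is a minimal sketch of that summary step (the `audit_scores` helper, the score values, and the 0.7 review threshold are illustrative assumptions, not part of any particular framework):

```python
def audit_scores(scores: list[float], threshold: float = 0.7) -> dict:
    """Summarize a batch of faithfulness scores for review and reporting."""
    flagged = [s for s in scores if s < threshold]  # responses needing human review
    return {
        "mean_faithfulness": sum(scores) / len(scores),
        "flagged_count": len(flagged),
        "flagged_ratio": len(flagged) / len(scores),
    }


# Example: scores returned by the judge for four evaluated responses.
print(audit_scores([1.0, 0.9, 0.3, 0.7]))
```

Tracking the mean score and the flagged ratio over time gives a simple, transparent signal for the continuous-improvement loop described above.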
Conclusion:
- Faithfulness is essential for judging the factual soundness of AI-generated text.
- It operates by checking each claim against context, with a clear mathematical formulation.
- Python frameworks like Ragas and DeepEval provide automated, scalable faithfulness evaluation with customizable options for real-world deployments.
- Consistent scoring and review cycles help keep AI systems grounded, boosting reliability and reducing misinformation.
✍️ Author’s Note
This blog reflects the author’s personal point of view — shaped by 22+ years of industry experience, along with a deep passion for continuous learning and teaching.
The content has been phrased and structured using Generative AI tools, with the intent to make it engaging, accessible, and insightful for a broader audience.