30 GenAI in Banking & Finance : Understanding ROUGE Metrics: Evaluating Summarization Quality
How ROUGE Works
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores a machine-generated summary by measuring how many words, word sequences (n-grams), or word pairs it shares with one or more human-written reference summaries. The assumption is simple: a good automated summary should capture the key ideas present in the human-written ones.
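To make the idea of overlap concrete, here is a minimal sketch that counts shared n-grams by hand. It assumes plain whitespace tokenization with lowercasing and no stemming, and the helper names `ngrams` and `overlap_count` are made up for illustration; it is not how any particular ROUGE library is implemented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_count(reference, candidate, n):
    """Count n-grams shared by both texts, clipping repeats to the smaller count."""
    ref_ngrams = ngrams(reference.lower().split(), n)
    cand_ngrams = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    return sum((ref_ngrams & cand_ngrams).values())


# Hypothetical example sentences
reference = "the bank raised interest rates"
candidate = "the bank raised rates"

unigram_overlap = overlap_count(reference, candidate, 1)  # shared single words
bigram_overlap = overlap_count(reference, candidate, 2)   # shared consecutive pairs
print(unigram_overlap, bigram_overlap)
```

Here all four candidate words appear in the reference, but only two of its consecutive word pairs do, which is why bigram-based scores tend to be stricter than unigram-based ones.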
Key Considerations
- Recall: Measures how much of the relevant information from the reference summaries appears in the generated summary. High recall means the summary captures most of the important points.
- Precision: Measures how much of the generated summary’s content is actually relevant or found in the reference summaries. High precision indicates that the model avoids including unnecessary or irrelevant details.
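The trade-off between the two can be sketched with unigram counts. The example below assumes whitespace tokenization without stemming, and the function `unigram_prf` and the sample sentences are hypothetical, chosen only to show one summary with high precision and another with high recall.

```python
from collections import Counter


def unigram_prf(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    # Precision divides by the candidate length, recall by the reference length
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the bank raised interest rates to curb inflation"

# A terse summary: everything it says is in the reference, but it leaves a lot out
terse = "bank raised rates"
# A padded summary: covers the whole reference, then adds unrelated filler
padded = "the bank raised interest rates to curb inflation amid loud market chatter yesterday"

terse_p, terse_r, terse_f1 = unigram_prf(reference, terse)
padded_p, padded_r, padded_f1 = unigram_prf(reference, padded)

print(f"terse:  P={terse_p:.2f} R={terse_r:.2f} F1={terse_f1:.2f}")
print(f"padded: P={padded_p:.2f} R={padded_r:.2f} F1={padded_f1:.2f}")
```

The terse summary reaches perfect precision with low recall, while the padded one reaches perfect recall with lower precision. The F1 score balances the two, which is why it is usually the number reported.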
Example in Practice
# Install the library if not already installed
# !pip install rouge-score

from rouge_score import rouge_scorer

# Define the reference and machine-generated summaries
reference_summary = "Company X launches a new smartphone with advanced features."
generated_summary = "Company X has launched a new smartphone with improved camera capabilities."

# Initialize the ROUGE scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate the ROUGE scores (reference first, generated summary second)
scores = scorer.score(reference_summary, generated_summary)

# Display the scores
for rouge_type, score in scores.items():
    print(f"{rouge_type}: Precision={score.precision:.2f}, Recall={score.recall:.2f}, F1={score.fmeasure:.2f}")
rouge1: Precision=0.64, Recall=0.78, F1=0.70
rouge2: Precision=0.50, Recall=0.62, F1=0.56
rougeL: Precision=0.64, Recall=0.78, F1=0.70
- ROUGE-1 (unigram): After stemming ("launches" and "launched" both reduce to "launch"), 7 words match. That covers 7 of the 9 reference words (recall 0.78) and 7 of the 11 generated words (precision 0.64), giving an F1 of 0.70.
- ROUGE-2 (bigram): Only 5 pairs of consecutive words match, out of 8 pairs in the reference and 10 in the generated summary, so the scores are lower.
- ROUGE-L (longest common subsequence): The 7 matching words appear in the same order in both summaries, so this score equals ROUGE-1 here.
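ROUGE-L differs from ROUGE-1 once word order changes. The sketch below computes the longest common subsequence with a standard dynamic-programming table. It assumes whitespace tokenization without stemming, and `lcs_length`, `rouge_l`, and the sample sentences are illustrative names and data rather than part of any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the LCS of the two texts."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / len(cand_tokens) if cand_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


# Hypothetical example: same words as the reference, different order
reference = "the bank raised interest rates"
reordered = "interest rates the bank raised"

lcs = lcs_length(reference.split(), reordered.split())
precision, recall, f1 = rouge_l(reference, reordered)
print(lcs, f"{precision:.2f}", f"{recall:.2f}", f"{f1:.2f}")
```

Every word of the reordered sentence appears in the reference, so a unigram comparison would find a perfect match. The longest in-order match is only "the bank raised" (3 of 5 words), so the LCS-based precision and recall both drop to 0.6.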
Why ROUGE Matters
✍️ Author’s Note
This blog reflects the author’s personal point of view — shaped by 22+ years of industry experience, along with a deep passion for continuous learning and teaching.
The content has been phrased and structured using Generative AI tools, with the intent to make it engaging, accessible, and insightful for a broader audience.