Pegasus Large Privacy Policy Summarization V2
A Google PEGASUS Large model fine-tuned on privacy policy documents and their corresponding summaries.
Model Details
- Model Type: Transformer-based abstractive summarization model
- Architecture: Google PEGASUS Large
- Fine-tuning Dataset: A curated dataset of privacy policy documents and their corresponding summaries.
- Intended Use: Summarizing long and complex privacy policies into concise and readable summaries.
- Limitations: May miss critical nuances, legal jargon, or context-dependent details in privacy policies.
Uses
Direct Use
This model can be used for summarizing lengthy privacy policy documents into concise summaries. It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.
Downstream Use
This model can be fine-tuned further for domain-specific summarization tasks related to legal, business, or government policy documents.
Out-of-Scope Use
- Legal Advice: The model is not a replacement for professional legal consultation.
- Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
- High-Stakes Decision-Making: Should not be used in critical legal or compliance decisions without human oversight.
Bias, Risks, and Limitations
Risks
- Summarization Bias: The model may overemphasize certain parts of privacy policies while omitting crucial information.
- Misinterpretation: Legal terms might not be accurately represented in layman's summaries.
- Data Sensitivity: Summaries could be misleading if the input policy is incomplete or if the training data is biased.
Recommendations
- Human verification of summaries is advised, especially for legal and compliance use cases.
- Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model, including potential biases in the training data.
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)


def summarize(text):
    # Wrap the document in the prompt and tokenize it.
    # Input beyond 1024 tokens is truncated.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)

    # Generate the summary and decode it back to plain text.
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
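Because the tokenizer call above truncates input at 1024 tokens, anything past that point in a long policy never reaches the model. One possible workaround, not described by the model author, is to split the policy into sentence-aligned chunks and summarize each chunk separately. The sketch below is a minimal illustration: `chunk_by_sentences`, `summarize_long`, and the 600-word budget are hypothetical, and the word count is only a rough proxy for the token limit.

```python
import re


def chunk_by_sentences(text, max_words=600):
    """Group sentences into chunks of at most roughly max_words words.

    max_words is an assumed, rough stand-in for the 1024-token limit.
    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text, summarize_fn, max_words=600):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_by_sentences(text, max_words))
```

Passing the `summarize` function defined above as `summarize_fn` would apply the model chunk by chunk. Summaries produced this way may repeat points or lose cross-section context, so they still call for human review.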
Training Details
Training and Evaluation Data
The documents and summaries were extracted from the ToS;DR website's API. Only comprehensively reviewed website documents with a rating were used.
Training Procedure
Preprocessing
The TextRank algorithm was used to extract the top-ranked sentences from both the documents and the summaries, keeping at most 30 sentences per document and 20 per summary. The BeautifulSoup library was used to parse the HTML text, and regular expressions were applied to remove excess whitespace. The dataset was then split into training and validation sets, with 20% held out for validation (test size of 0.2) and a random seed of 42.
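The preprocessing steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's pipeline: it uses only the Python standard library, so the standard `html.parser` module stands in for BeautifulSoup, and the TextRank variant (word-overlap similarity with a PageRank-style iteration) is a simplified one. All function names, the damping factor, and the iteration count are hypothetical, and the split will not reproduce the author's exact partition.

```python
import math
import random
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(raw):
    """Strip tags, then collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _similarity(a, b):
    """Word overlap normalized by log sentence lengths."""
    wa, wb = set(a), set(b)
    if len(wa) < 2 or len(wb) < 2:  # avoids log(1) + log(1) == 0
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_top_n(text, n, damping=0.85, iterations=50):
    """Return up to n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    if len(sentences) <= n:
        return sentences
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    size = len(sentences)
    weights = [
        [_similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(size)]
        for i in range(size)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * size
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(size) if out_sum[j] > 0
            )
            for i in range(size)
        ]
    top = sorted(range(size), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


def train_val_split(items, test_size=0.2, seed=42):
    """Shuffle with a fixed seed and hold out test_size of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * test_size)
    return shuffled[n_val:], shuffled[:n_val]
```

With the limits given above, a document would be reduced with `textrank_top_n(clean_html(raw), 30)` and its summary with `textrank_top_n(summary, 20)`.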
Training Hyperparameters
- Epochs: 10
- Weight decay: 0.01
- Batch size: 2 (train & eval)
- Logging steps: 10
- Warmup steps: 500
- Evaluation strategy: epoch
- Save strategy: epoch
- Metric for best model: ROUGE-1
- Load best model at end: True
- Prediction mode: predict_with_generate=True
- Optimizer: Adam with learning rate 0.001
- Scheduler: Linear scheduler with warmup: num_warmup_steps=500, num_training_steps=1500
- Reporting: MLflow
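To make the scheduler setting concrete, the sketch below computes the learning rate that a linear warmup-then-decay schedule yields at a given step, using the values listed above (learning rate 0.001, 500 warmup steps, 1500 training steps). It follows the shape of the `get_linear_schedule_with_warmup` helper in `transformers`; that the author used that exact helper is an assumption based on the parameter names, and `linear_warmup_lr` is a hypothetical name.

```python
def linear_warmup_lr(step, base_lr=0.001, num_warmup_steps=500, num_training_steps=1500):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < num_warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay linearly from base_lr down to 0 at num_training_steps.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor
```

Under these settings the rate peaks at 0.001 at step 500, falls to half that at step 1000, and stays at zero from step 1500 onward.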
Evaluation
Metrics
- ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) were used to measure summarization quality.
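For readers unfamiliar with the metric, the sketch below shows how a ROUGE-N F1 score is computed from n-gram overlap between a reference summary and a generated one. It is a simplified illustration (lowercasing and word tokenization only, no stemming) and is not the implementation used to produce the reported scores; `rouge_n_f1` is a hypothetical name.

```python
import re
from collections import Counter


def rouge_n_f1(reference, candidate, n=1):
    """F1 over overlapping n-grams between a reference and a candidate."""

    def ngrams(text):
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Setting `n=1` corresponds to ROUGE-1 and `n=2` to ROUGE-2. ROUGE-L uses the longest common subsequence instead of fixed-size n-grams and is not covered by this sketch.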
Results
- rouge1: 0.5141839409652631
- rouge2: 0.2895850459169673
- rougeL: 0.27764589200709305
- rougeLsum: 0.2776501244969102
Model tree for AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2
Base model
google/pegasus-large