Pegasus Large Privacy Policy Summarization V2

A Google PEGASUS Large model fine-tuned on privacy policy documents and their corresponding summaries.

Model Details

  • Model Type: Transformer-based abstractive summarization model
  • Architecture: Google PEGASUS Large
  • Model Size: 571M parameters (float32 tensors)
  • Fine-tuning Dataset: A curated dataset of privacy policy documents and their corresponding summaries.
  • Intended Use: Summarizing long and complex privacy policies into concise and readable summaries.
  • Limitations: May miss critical nuances, legal jargon, or context-dependent details in privacy policies.

Uses

Direct Use

This model can be used for summarizing lengthy privacy policy documents into concise summaries. It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

Downstream Use

This model can be fine-tuned further for domain-specific summarization tasks related to legal, business, or government policy documents.

Out-of-Scope Use

  • Legal Advice: The model is not a replacement for professional legal consultation.
  • Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
  • High-Stakes Decision-Making: Should not be used in critical legal or compliance decisions without human oversight.

Bias, Risks, and Limitations

Risks

  • Summarization Bias: The model may overemphasize certain parts of privacy policies while omitting crucial information.
  • Misinterpretation: Legal terms might not be accurately represented in layman's summaries.
  • Data Sensitivity: Summarization results could be misleading if applied to incomplete or biased datasets.

Recommendations

  • Human verification of summaries is advised, especially for legal and compliance use cases.
  • Users should be aware of potential biases in the training data.
  • Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Load the fine-tuned checkpoint and move the model to GPU if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    # Tokenize the document, truncating to the model's 1024-token input limit.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)

    # Generate the summary and decode it back to plain text.
    outputs = model.generate(**inputs)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
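
A quick usage sketch (the document text below is a placeholder, not a real policy):

policy_text = "We collect your email address, device identifiers, and usage data to provide and improve the service..."
print(summarize(policy_text))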

Training Details

Training and Evaluation Data

The documents and summaries were extracted via the ToS;DR website's API. Only documents from websites that had been comprehensively reviewed and assigned a rating were used.

Training Procedure

Preprocessing

The TextRank algorithm was used to extract the top-ranked sentences from both the documents and the summaries, with a maximum of 30 sentences per document and 20 per summary. The BeautifulSoup library was used to parse HTML text, and regular expressions were applied to remove excess whitespace. The dataset was then split into training and validation sets, with a test size of 0.2 and a random seed of 42.
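
A minimal sketch of this preprocessing pipeline. The record field names and the textrank_top_sentences helper are hypothetical (any TextRank implementation can fill that role); scikit-learn handles the split:

import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split

def clean_html(raw_html):
    # Strip HTML tags, then collapse runs of whitespace into single spaces.
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def build_pairs(records, textrank_top_sentences):
    # textrank_top_sentences(text, n) is a hypothetical helper returning the
    # n highest-ranked sentences according to TextRank.
    pairs = []
    for record in records:
        document = textrank_top_sentences(clean_html(record["document_html"]), 30)
        summary = textrank_top_sentences(clean_html(record["summary_html"]), 20)
        pairs.append({"document": document, "summary": summary})
    return pairs

# 80/20 train/validation split with the seed reported above:
# train_pairs, val_pairs = train_test_split(pairs, test_size=0.2, random_state=42)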

Training Hyperparameters

  • Epochs: 10
  • Weight decay: 0.01
  • Batch size: 2 (train & eval)
  • Logging steps: 10
  • Warmup steps: 500
  • Evaluation strategy: epoch
  • Save strategy: epoch
  • Metric for best model: ROUGE-1
  • Load best model at end: True
  • Prediction mode: predict_with_generate=True
  • Optimizer: Adam with learning rate 0.001
  • Scheduler: Linear scheduler with warmup: num_warmup_steps=500, num_training_steps=1500
  • Reporting: MLflow
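
A minimal sketch of how these settings map onto Hugging Face Seq2SeqTrainingArguments and a Seq2SeqTrainer. The output_dir, dataset variables, and compute_metrics function are assumptions, and model and tokenizer refer to the objects loaded in the getting-started snippet above:

import torch
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    get_linear_schedule_with_warmup,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-privacy-policy",  # assumed output directory
    num_train_epochs=10,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=10,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
    predict_with_generate=True,
    learning_rate=1e-3,
    report_to="mlflow",
)

# Adam optimizer and linear warmup schedule matching the reported settings.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=1500
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,   # assumed tokenized datasets
#     eval_dataset=val_dataset,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
#     optimizers=(optimizer, scheduler),
# )
# trainer.train()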

Evaluation

Metrics

  • ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum) were used to measure summarization quality.
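
A minimal sketch of computing these scores with the evaluate library (the prediction and reference lists are placeholders):

import evaluate

rouge = evaluate.load("rouge")

predictions = ["generated summary text"]  # placeholder model outputs
references = ["reference summary text"]   # placeholder ToS;DR summaries

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum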

Results

  • ROUGE-1: 0.5142
  • ROUGE-2: 0.2896
  • ROUGE-L: 0.2776
  • ROUGE-Lsum: 0.2777