Pegasus Large Privacy Policy Summarization V2

A Google PEGASUS Large model fine-tuned on privacy policy documents and their corresponding summaries.

Model Details

  • Model Type: Transformer-based abstractive summarization model
  • Architecture: Google PEGASUS Large
  • Model Size: 571M parameters (float32 tensors)
  • Fine-tuning Dataset: A curated dataset of privacy policy documents and their corresponding summaries.
  • Intended Use: Summarizing long and complex privacy policies into concise and readable summaries.
  • Limitations: May miss critical nuances, legal jargon, or context-dependent details in privacy policies.

Uses

Direct Use

This model can be used for summarizing lengthy privacy policy documents into concise summaries. It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

Downstream Use

This model can be fine-tuned further for domain-specific summarization tasks related to legal, business, or government policy documents.

Out-of-Scope Use

  • Legal Advice: The model is not a replacement for professional legal consultation.
  • Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
  • High-Stakes Decision-Making: Should not be used in critical legal or compliance decisions without human oversight.

Bias, Risks, and Limitations

Risks

  • Summarization Bias: The model may overemphasize certain parts of privacy policies while omitting crucial information.
  • Misinterpretation: Legal terms might not be accurately represented in layman's summaries.
  • Data Sensitivity: Summarization results could be misleading if applied to incomplete or biased datasets.

Recommendations

  • Human verification of summaries is advised, especially for legal and compliance use cases.
  • Users should be aware of potential biases in the training data.
  • Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Load the fine-tuned checkpoint and move the model to GPU if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    # Tokenize the document, truncating to the model's 1024-token input limit.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)

    # Generate the summary and decode it back to plain text.
    outputs = model.generate(**inputs)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
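
A quick usage sketch (the document text below is a placeholder, not a real policy):

policy_text = "We collect your email address, device identifiers, and usage data to provide and improve the service..."
print(summarize(policy_text))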

Training Details

Training and Evaluation Data

The documents and summaries were extracted via the ToS;DR website's API. Only documents from websites that had been comprehensively reviewed and assigned a rating were used.

Training Procedure

Preprocessing

The TextRank algorithm was used to extract the top-ranked sentences from both the documents and the summaries, with a maximum of 30 sentences per document and 20 per summary. The BeautifulSoup library was used to parse HTML text, and regular expressions were applied to remove excess whitespace. The dataset was then split into training and validation sets, with a test size of 0.2 and a random seed of 42.
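
A minimal sketch of this preprocessing pipeline. The record field names and the textrank_top_sentences helper are hypothetical (any TextRank implementation can fill that role); scikit-learn handles the split:

import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split

def clean_html(raw_html):
    # Strip HTML tags, then collapse runs of whitespace into single spaces.
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def build_pairs(records, textrank_top_sentences):
    # textrank_top_sentences(text, n) is a hypothetical helper returning the
    # n highest-ranked sentences according to TextRank.
    pairs = []
    for record in records:
        document = textrank_top_sentences(clean_html(record["document_html"]), 30)
        summary = textrank_top_sentences(clean_html(record["summary_html"]), 20)
        pairs.append({"document": document, "summary": summary})
    return pairs

# 80/20 train/validation split with the seed reported above:
# train_pairs, val_pairs = train_test_split(pairs, test_size=0.2, random_state=42)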

Training Hyperparameters

  • Epochs: 10
  • Weight decay: 0.01
  • Batch size: 2 (train & eval)
  • Logging steps: 10
  • Warmup steps: 500
  • Evaluation strategy: epoch
  • Save strategy: epoch
  • Metric for best model: ROUGE-1
  • Load best model at end: True
  • Prediction mode: predict_with_generate=True
  • Optimizer: Adam with learning rate 0.001
  • Scheduler: Linear scheduler with warmup: num_warmup_steps=500, num_training_steps=1500
  • Reporting: MLflow
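
A minimal sketch of how these settings map onto Hugging Face Seq2SeqTrainingArguments and a Seq2SeqTrainer. The output_dir, dataset variables, and compute_metrics function are assumptions, and model and tokenizer refer to the objects loaded in the getting-started snippet above:

import torch
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    get_linear_schedule_with_warmup,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-privacy-policy",  # assumed output directory
    num_train_epochs=10,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=10,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
    predict_with_generate=True,
    learning_rate=1e-3,
    report_to="mlflow",
)

# Adam optimizer and linear warmup schedule matching the reported settings.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=1500
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,   # assumed tokenized datasets
#     eval_dataset=val_dataset,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
#     optimizers=(optimizer, scheduler),
# )
# trainer.train()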

Evaluation

Metrics

  • ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum) were used to measure summarization quality.
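
A minimal sketch of computing these scores with the evaluate library (the prediction and reference lists are placeholders):

import evaluate

rouge = evaluate.load("rouge")

predictions = ["generated summary text"]  # placeholder model outputs
references = ["reference summary text"]   # placeholder ToS;DR summaries

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum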

Results

  • ROUGE-1: 0.5142
  • ROUGE-2: 0.2896
  • ROUGE-L: 0.2776
  • ROUGE-Lsum: 0.2777