fixing the readme a bit

This commit is contained in:
rushil-thareja 2025-12-25 09:23:10 +04:00
parent 75af1f5a00
commit 30707e809a

View file

@ -97,6 +97,7 @@ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Initialize Tagger # Initialize Tagger
tagger = Tagger(api_key="your_api_key") tagger = Tagger(api_key="your_api_key")
tagger.set_model("gpt-oss-120b") # Strong extraction model
tagger.set_constitution("LEGAL") # Options: LEGAL, HEALTH, FINANCE tagger.set_constitution("LEGAL") # Options: LEGAL, HEALTH, FINANCE
``` ```
@ -105,7 +106,7 @@ tagger.set_constitution("LEGAL") # Options: LEGAL, HEALTH, FINANCE
The library applies differential privacy only to segments marked as private, allowing precise control over which content receives protection. The library applies differential privacy only to segments marked as private, allowing precise control over which content receives protection.
```python ```python
dpf = DPFusion(model=model, tokenizer=tokenizer, tagger=tagger) dpf = DPFusion(model=model, tokenizer=tokenizer, max_tokens=100, tagger=tagger)
# Sample document with sensitive information # Sample document with sensitive information
document = """The applicant was born in 1973 and currently resides in document = """The applicant was born in 1973 and currently resides in
@ -113,18 +114,41 @@ Les Salles-sur-Verdon, France. In the early 1990s, a new criminal
phenomenon emerged in Denmark known as 'tax asset stripping cases'.""" phenomenon emerged in Denmark known as 'tax asset stripping cases'."""
# Build context with privacy annotations # Build context with privacy annotations
dpf.add_message("system", "You're job is to re-write documents for privacy. You will be provided a document out a paraphrase that preserves privacy and doesn't leak personally identifiable information. Just output the paraphrase only, nothing else.", is_private=False) dpf.add_message("system", "You are a helpful assistant that paraphrases text.", is_private=False)
dpf.add_message("user", document, is_private=True) dpf.add_message("user", document, is_private=True) # Mark sensitive content as private
dpf.add_message("user", "I just passed the document to you, you can paraphrase it for privacy.", is_private=False) dpf.add_message("system", "Now paraphrase this text for privacy", is_private=False)
dpf.add_message("assistant", "Here is the paraphrased document:", is_private=False) dpf.add_message("assistant", "Sure, here is the paraphrase of the above text that ensures privacy:", is_private=False)
``` ```
### Step 3: Execute Private Generation ### Step 3: Run Tagger to Build Private and Public Contexts
The tagger automatically identifies sensitive phrases and creates two parallel contexts:
```python ```python
# Run tagger to identify and redact sensitive phrases # Run tagger to extract PII and build redacted context
dpf.run_tagger() dpf.run_tagger()
# Extracted phrases: ['1973', 'Les Salles-sur-Verdon', 'early 1990s', 'tax asset stripping cases']
# View the two contexts that DP-Fusion uses:
print(dpf.private_context) # Original text with real values
print(dpf.public_context) # Redacted text with ____ placeholders
```
**Private context** (what the model sees with full information):
```
The applicant was born in 1973 and currently resides in Les Salles-sur-Verdon, France.
In the early 1990s, a new criminal phenomenon emerged in Denmark...
```
**Public context** (redacted version):
```
The applicant was born in ____ and currently resides in _________.
In the _______, a new criminal phenomenon emerged in Denmark...
```
### Step 4: Generate with Differential Privacy
```python
# Generate with differential privacy # Generate with differential privacy
output = dpf.generate( output = dpf.generate(
alpha=2.0, # Rényi order alpha=2.0, # Rényi order
@ -135,7 +159,7 @@ output = dpf.generate(
print(output['text']) print(output['text'])
``` ```
### Step 4: Compute Privacy Guarantee ### Step 5: Compute Privacy Guarantee
The library provides two epsilon values for comprehensive privacy accounting: The library provides two epsilon values for comprehensive privacy accounting: