Redaction Detection & Resolution Engine
"The government redacted names but left SSNs, emails, phone numbers, and account numbers visible in surrounding text. Their own DLP failures become our resolution engine."
DLP (Data Loss Prevention) patterns are well-defined regular expressions for SSNs, emails, phone numbers, credit cards, and credentials. They work in both directions:
Offense (Redaction Resolution): Find unredacted PII near [REDACTED] markers.
An SSN found within 200 characters of a blacked-out name is a strong lead on who was redacted, not proof.
Cross-reference that SSN against other documents where the name IS visible. Resolution.
Defense (Our DLP): Find PII our search API is currently serving. If someone's SSN is in our index, we mask it before returning results. Protect victims.
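A minimal sketch of the defensive masking step, assuming results pass through a single function before they leave the search API. The names PII_PATTERNS and mask_pii are hypothetical, the two patterns are deliberately simple, and a real rule set would cover the other PII types listed above:

```python
import re

# Hypothetical pattern table (an assumption, not the project's actual rules).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def mask_pii(text: str) -> str:
    """Replace every PII match with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} MASKED]", text)
    return text


if __name__ == "__main__":
    print(mask_pii("SSN 123-45-6789, mail a.b@example.com"))
```

Masking at response time leaves the index itself untouched, so a rule that turns out to be too aggressive can be relaxed without re-indexing.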
Confidence scoring: Same PII in unredacted doc with name (+30), matches across 2+ docs (+20), cross-index hit (+15), same date/location (+10), same doc_type (+5), co-occurrence (+5). Cap: 95% (epistemic humility).
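The scoring rule above is a capped weighted sum, which can be sketched as follows. The signal names and the base parameter are assumptions introduced for illustration; the weights and the cap come from the text. Note that the six listed weights add up to 85, so the 95 cap only takes effect if some starting score is added, which the text does not specify:

```python
from typing import Iterable

# Weights taken from the list above; the key names are hypothetical.
SIGNAL_WEIGHTS = {
    "named_in_unredacted_doc": 30,
    "matches_across_2plus_docs": 20,
    "cross_index_hit": 15,
    "same_date_or_location": 10,
    "same_doc_type": 5,
    "co_occurrence": 5,
}
CONFIDENCE_CAP = 95


def confidence(signals: Iterable[str], base: int = 0) -> int:
    """Sum the weights of the observed signals and clamp at the cap.

    `base` is an assumed starting score; it defaults to 0.
    """
    observed = set(signals)
    unknown = observed - set(SIGNAL_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    score = base + sum(SIGNAL_WEIGHTS[s] for s in observed)
    return min(score, CONFIDENCE_CAP)


if __name__ == "__main__":
    print(confidence(["named_in_unredacted_doc", "cross_index_hit"]))
```

Passing signals as a set of names means each signal counts at most once, so a duplicated hit cannot inflate the score.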
False positive control: The SSN pattern (\d{3}-\d{2}-\d{4}) also matches any other number written in the same 3-2-4 digit grouping,
such as case, docket, and reference numbers, and some phone-number fragments. Context-dependent filtering: require "SSN", "social security", or "tax id" in the surrounding text.
Reject matches near "phone", "case no", or "docket".
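The context filter can be sketched as an allow list and a deny list checked in a window around each match. The window size (60 characters), the function name find_ssns, and the choice to let a deny term override an allow term are all assumptions; the text only names the keywords:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Leading word boundary only, so "tax id" also covers "tax identification"
# and "case no" also covers "case number".
REQUIRE_RE = re.compile(r"\b(?:ssn|social security|tax id)", re.IGNORECASE)
DENY_RE = re.compile(r"\b(?:phone|case no|docket)", re.IGNORECASE)
WINDOW = 60  # characters of context on each side (assumed value)


def find_ssns(text: str) -> list:
    """Return SSN-shaped matches whose surrounding text supports an SSN reading."""
    hits = []
    for m in SSN_RE.finditer(text):
        context = text[max(0, m.start() - WINDOW): m.end() + WINDOW]
        if DENY_RE.search(context):
            continue  # deny terms win when both kinds of keyword appear
        if REQUIRE_RE.search(context):
            hits.append(m.group())
    return hits


if __name__ == "__main__":
    print(find_ssns("Social Security number: 123-45-6789"))
    print(find_ssns("Docket 123-45-6789"))
```

Letting deny terms win trades recall for precision. For the masking use case the opposite trade may be safer, since a missed SSN is worse than an over-masked docket number.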