Hacker News
Show HN: A local-first, reversible PII scrubber for AI workflows
I’m one of the maintainers of Bridge Anonymization. We built this because the existing solutions for translating sensitive user content are insufficient for many of our privacy-conscious clients (governments, banks, healthcare providers, etc.).
We couldn't send PII to third-party APIs, but standard redaction destroyed the translation quality. If you scrub "John" to "[PERSON]", the translation engine loses gender context (often defaulting to masculine), which breaks grammatical agreement in languages like French or German.
So we built a reversible, local-first pipeline for Node.js/Bun. Here is how we implemented the tricky parts:
0. The Mapping
We use XML-like tags with IDs that uniquely identify the PII, e.g. `<PII type="PERSON" id="1">`. Translation models and the systems around them have worked with XML data structures since the dawn of Computer-Aided Translation tools, so this improves compatibility with existing workflows and systems. A `PIIMap` is stored locally for rehydration after translation (AES-256-GCM-encrypted by default).
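As a rough sketch of the tag-and-map step (function names and the `PIIMap` shape here are my own illustration, not the library's actual API):

```typescript
type PIIMap = Map<number, { type: string; value: string }>;

// Replace each detected entity with an XML-like placeholder and record the
// original value under a numeric id so it can be restored after translation.
function maskPII(
  text: string,
  entities: { type: string; value: string }[]
): { masked: string; map: PIIMap } {
  const map: PIIMap = new Map();
  let masked = text;
  entities.forEach((e, i) => {
    const id = i + 1;
    map.set(id, e);
    masked = masked.replace(e.value, `<PII type="${e.type}" id="${id}"/>`);
  });
  return { masked, map };
}

// Rehydration: swap each placeholder back for the stored original value.
function rehydrate(masked: string, map: PIIMap): string {
  return masked.replace(
    /<PII type="[^"]*" id="(\d+)"\/>/g,
    (_m: string, id: string) => map.get(Number(id))?.value ?? ""
  );
}
```

So `maskPII("John lives in Paris", …)` yields a fully tagged string plus the local map, and `rehydrate` inverts it after translation, assuming the tags come back intact (see the fuzzy matcher below for when they don't).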
1. Hybrid Detection Engine
Obviously neither Regex nor NER was enough on its own.
- Structured PII: We use strict Regex with validation checksums for things like IBANs (Mod-97) and credit cards (Luhn).
- Soft PII: For names and locations, we run a quantized `xlm-roberta` model via `onnxruntime-node` directly in the process. This lets us avoid a Python sidecar while keeping the package "lightweight" (still ~280 MB for the quantized model, but acceptable for desktop environments).
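The two checksums are standard and easy to sketch (my own implementations, not the library's code):

```typescript
// Luhn check for credit card numbers: double every second digit from the
// right, subtract 9 from any result above 9, and require the sum mod 10 = 0.
function luhnValid(num: string): boolean {
  const digits = num.replace(/\D/g, "");
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// Mod-97 check for IBANs: move the first four chars to the end, convert
// letters to numbers (A=10 … Z=35), and require the value mod 97 to equal 1.
function ibanValid(iban: string): boolean {
  const s = iban.replace(/\s/g, "").toUpperCase();
  const rearranged = s.slice(4) + s.slice(0, 4);
  const numeric = rearranged.replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));
  // Compute the remainder digit by digit to avoid overflowing Number.
  let rem = 0;
  for (const ch of numeric) rem = (rem * 10 + Number(ch)) % 97;
  return rem === 1;
}
```

Running the checksum before tagging is what keeps the regex pass from flagging every 16-digit number as a credit card.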
2. The "Hallucination" Guard (Fuzzy Rehydration)
LLMs often "mangle" the XML placeholders during translation (e.g., turning `<PII id="1"/>` into `< PII id = « 1 » >`). We implemented a Fuzzy Tag Matcher that uses flexible regex patterns to detect these artefacts. It identifies the tag even if attributes are reordered or quotes are changed, ensuring we can always map the token back to the original encrypted value.
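A hypothetical matcher in that spirit (the library's actual patterns may differ) tolerates stray whitespace, reordered attributes, and non-ASCII quote characters around the id:

```typescript
// Character class covering straight, guillemet, and curly quote variants.
const QUOTE = `["'«»„“”]`;

// Matches mangled placeholders like <PII id="1"/>, < PII id = « 1 » >,
// and <pii type="PERSON" ID='1'> regardless of attribute order or casing.
const FUZZY_PII = new RegExp(
  `<\\s*PII\\b[^<>]*?\\bid\\s*=\\s*${QUOTE}?\\s*(\\d+)\\s*${QUOTE}?[^<>]*?>`,
  "gi"
);

// Extract the numeric id from a possibly mangled placeholder, or null.
function matchFuzzyTag(fragment: string): number | null {
  FUZZY_PII.lastIndex = 0; // reset the global regex between calls
  const m = FUZZY_PII.exec(fragment);
  return m ? Number(m[1]) : null;
}
```

Since the id is the only thing the rehydration step actually needs, recovering it is enough to map the token back to the encrypted original even when the rest of the tag is garbage.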
3. Semantic Masking
We are currently working on "Semantic Masking": adding context to the PII tag (like `<PII type="PERSON" gender="female" id="1"/>`) to preserve gender context for the translation. For now, we are relying on a lightweight lookup-table approach to avoid the overhead of a second ML model or the hassle of fine-tuning. So far this works nicely for most use cases.
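The lookup-table idea is simple enough to sketch (the table contents and function name below are illustrative, not the shipped data):

```typescript
// Tiny first-name → gender table; the real one would be much larger and
// locale-aware. Unknown names simply get no gender attribute.
const GENDER_TABLE: Record<string, "female" | "male"> = {
  anna: "female",
  maria: "female",
  john: "male",
  pierre: "male",
};

// Annotate the placeholder with gender context when the first name is known,
// so the translator can keep grammatical agreement; otherwise emit the plain tag.
function tagPerson(name: string, id: number): string {
  const gender = GENDER_TABLE[name.split(" ")[0].toLowerCase()];
  return gender
    ? `<PII type="PERSON" gender="${gender}" id="${id}"/>`
    : `<PII type="PERSON" id="${id}"/>`;
}
```

The trade-off is coverage: a table handles common names cheaply, while ambiguous or unseen names fall back to the bare tag and the translator's default.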
The code is MIT licensed. I’d love to hear how others are handling the "context loss" problem in privacy-preserving NLP pipelines! I think this could quite easily be generalized to other LLM applications as well.
bob1029
I would be tempted to try a pseudonymous approach where inbound PII is mapped to a set of consistent, "known good" fake identities as we transition in and out of the AI layer.
The key with PII is to avoid combining factors over time that produce a strong signal. This is a wide spectrum: some scenarios are slightly identifying just because they are rare. Zip+gender isn't a very strong signal, but Zip+DOB+gender uniquely identifies a large fraction of the population. You don't need to screw up with an email address or tax ID; account balance over time might eventually be sufficient to target one person.
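That pseudonymous approach could be sketched as a deterministic, keyed mapping into a small pool of "known good" fake identities, so the same real person is consistent across requests but unlinkable without the key (the pool and function here are my own illustration):

```typescript
import { createHmac } from "node:crypto";

// Pool of consistent fake identities. A real deployment would need a much
// larger pool plus collision handling, since distinct people can land on
// the same fake identity here.
const FAKE_POOL = ["Alice Martin", "Bob Fischer", "Carol Weber", "Dan Keller"];

// HMAC the normalized name with a secret key, then index into the pool.
// Same input + same key -> same fake identity, every time.
function pseudonym(realName: string, secret: string): string {
  const digest = createHmac("sha256", secret)
    .update(realName.toLowerCase())
    .digest();
  const idx = digest.readUInt32BE(0) % FAKE_POOL.length;
  return FAKE_POOL[idx];
}
```

Keying the mapping matters: with a plain hash, anyone could confirm a guess ("does John Smith map to Bob Fischer?") by hashing candidate names themselves.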
dsp_person
I'm curious to what extent you can randomly replace words and reverse the substitution later.
Even with code. Like say take the structure of a big project, but randomly remap words in function names, and to some extent replace business logic with dummy code. Then use cloud LLMs for whatever purpose, and translate back.
bigiain
> Security First
> Because the “PII Map” (the link between ID:1 and John Smith) effectively is the PII, we treat it as sensitive material.
> The library includes a crypto module that enforces AES-256-GCM encryption for the mapping table. The raw PII never leaves the local memory space, and the state object that persists between the masking and rehydration steps is encrypted at rest.
I've bookmarked this for inspiration for a medium/long term project I am considering building. I'd like to be able to take dumps of our production database and automatically (one-way) anonymize it. Replacing all names with meaningless but semantically representative placeholders (gender matching where obvious - Alice, Bob, Mallory, Eve, Trent perhaps, and gender neutral like Jamie or Alex when suitable). Use similar techniques to rewrite email addresses (alice@example.org, bob@example.com, mallory@example.net) and addresses/placenames/whatever else can be pulled out with Named Entity Recognition.
I suspect I'll generally be able to do a higher-accuracy version of this, since I'll have an understanding of the database structure and we're already in the process of adding metadata about table and column data sensitivity. I will definitely be checking out the regexes and NER models used here.
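The AES-256-GCM handling of the mapping table mentioned upthread could be sketched roughly like this in Node (my own illustration, not the library's actual crypto module):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a PII map at rest: serialize to JSON, encrypt with AES-256-GCM,
// and pack nonce + auth tag + ciphertext into one blob.
function encryptMap(map: Record<string, string>, key: Buffer): Buffer {
  const iv = randomBytes(12); // 96-bit nonce, the standard size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([
    cipher.update(JSON.stringify(map), "utf8"),
    cipher.final(),
  ]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]);
}

// Decrypt and verify: GCM's auth tag means tampering with the stored blob
// makes decipher.final() throw instead of returning corrupted mappings.
function decryptMap(blob: Buffer, key: Buffer): Record<string, string> {
  const iv = blob.subarray(0, 12);
  const tag = blob.subarray(12, 28); // GCM auth tag is 16 bytes
  const ct = blob.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  const pt = Buffer.concat([decipher.update(ct), decipher.final()]);
  return JSON.parse(pt.toString("utf8"));
}
```

GCM is a good fit here precisely because it authenticates: an attacker who can't read the map also can't silently swap which fake id points at which person.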