I recently wrote about an experimental library for detecting PII. While discussing it with an acquaintance, I was told that PII can also be disguised. For example, a body of text such as a review or a comment can contain an email address in the form “johndoeatgmaildotcom”. This led me to update the library so that emails like these are also flagged. In a nutshell, I had to update the regex used to find email addresses.
Example
This is best explained with a few examples. In all of the examples, we begin with a proper email and disguise it one step at a time.
column = Column(name="comment")
All of these assertions pass, and the regex detector flags every one of these examples as an email.
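To make the idea concrete, here is a minimal sketch of a regex that tolerates "at" and "dot" in place of "@" and ".". This is not the library's actual pattern; the names `DISGUISED_EMAIL` and `looks_like_email` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical pattern for illustration only; not the library's real regex.
DISGUISED_EMAIL = re.compile(
    r"""
    [a-z0-9._%+-]+        # local part, e.g. johndoe
    \s*(?:@|at)\s*        # the at-sign or the literal word at, spaces optional
    [a-z0-9-]+            # domain name, e.g. gmail
    \s*(?:\.|dot)\s*      # a period or the literal word dot, spaces optional
    [a-z]{2,}             # top-level domain, e.g. com
    """,
    re.IGNORECASE | re.VERBOSE,
)


def looks_like_email(text: str) -> bool:
    """Return True if the text contains a plain or disguised email address."""
    return DISGUISED_EMAIL.search(text) is not None


# Start from a proper email and disguise it one step at a time.
samples = [
    "johndoe@gmail.com",
    "johndoe at gmail.com",
    "johndoe at gmail dot com",
    "johndoeatgmaildotcom",
]
results = [looks_like_email(s) for s in samples]
```

The fully fused form works because the regex engine backtracks: the local part first consumes the whole string, then gives characters back until "at" and "dot" can be matched further along. A pattern this permissive may also flag ordinary prose that happens to contain "at" and "dot" in the right places, so it trades some precision for recall.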