Out Of The Box Algorithm Frameworks¶

This section describes the algorithm frameworks that are available out of the box (Secure Lookup, Segment Mapping, and so on).

Secure Lookup Algorithm Framework¶

Secure Lookup is the most commonly used type of algorithm. It is easy to generate and works with different languages. When this algorithm replaces real, sensitive data with fictional data, it is possible that it will create repeating data patterns, known as “collisions.” For example, the names “Tom” and “Peter” could both be masked as “Matt.” Because names and addresses naturally recur in real data, this mimics an actual data set. However, if you want the Masking Engine to mask all data into unique outputs, you should use Segment Mapping.

Starting with version 6.0.4.0, a built-in Extended Secure Lookup Algorithm Framework is available alongside the legacy one described above. The new framework uses the SHA-256 hashing method and allows case configuration for input and output (that is, masked) values. For details, see Secure Lookup Algorithm Framework.
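The collision behavior described above can be illustrated with a minimal Python sketch. This is not the Masking Engine's implementation: the function name `secure_lookup`, the sample lookup list, and the key handling are all hypothetical, and SHA-256 is used here only because the extended framework is described as hash-based.

```python
import hashlib

# Hypothetical lookup values; a real lookup file would be much larger.
LOOKUP_NAMES = ["Matt", "Anna", "Luis"]


def secure_lookup(value: str, key: str = "demo-key") -> str:
    """Pick a replacement by hashing the input (illustrative only)."""
    digest = hashlib.sha256(f"{key}:{value}".encode("utf-8")).digest()
    # Reduce the hash to an index into the lookup list.
    index = int.from_bytes(digest[:8], "big") % len(LOOKUP_NAMES)
    return LOOKUP_NAMES[index]


if __name__ == "__main__":
    for name in ["Tom", "Peter", "Maria", "Chen"]:
        print(name, "->", secure_lookup(name))
```

Because the sketch maps four inputs onto three lookup values, at least two inputs must share a masked value, which is the kind of collision described above. The same input always yields the same output as long as the key and lookup list are unchanged.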

Character Mapping Algorithm Framework¶

The Character Mapping framework maps text values, defined by a set of character groups, to other text values generated from the same character groups. Mappings are calculated algorithmically, so it is not necessary to provide the set of mapping values. The algorithm preserves any characters not assigned to a group. Any characters from the first Unicode plane can be mapped - this covers most characters used in modern languages. Other (supplementary) characters can only be preserved.

As an example, a Character Mapping algorithm could be defined with a single character group, "[0-9]". It might mask as follows:

"(603) 867-5309" -> "(463) 638-0193"
"999-12-3456" -> "453-71-6283"
"Call Tom at 8:00PM" -> "Call Tom at 2:75PM"

The particular set of permutations used is determined by the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input.

The algorithm has the following properties:

The masked value for each input is consistent unless the algorithm is rekeyed.
No two text inputs produce the same text output. Collisions are possible for some data types, such as Numeric, where multiple text values, such as "001" and "1", are treated as the same value.
As long as at least one maskable character is present in the input, the masked value will never match the input.
Each masked position influences the mapping done at every other masked position.

For these reasons, this algorithm is useful for masking columns with uniqueness requirements, such as primary and foreign key columns.
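As a rough illustration of some of these properties, the following Python sketch masks the single character group "[0-9]". It is a deliberately simplified stand-in, not the actual algorithm: the names `character_map` and `_position_permutation` and the key format are hypothetical, and each position is permuted independently, so the sketch does not reproduce the property that every masked position influences every other one.

```python
import random
import string

DIGITS = string.digits  # the single character group "[0-9]"


def _position_permutation(key: str, position: int) -> dict:
    """Build a permutation of DIGITS with no fixed points for one position."""
    rng = random.Random(f"{key}:{position}")  # string seeds are deterministic
    items = list(DIGITS)
    # Sattolo's algorithm produces a single cycle, so no digit maps to itself.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i)
        items[i], items[j] = items[j], items[i]
    return dict(zip(DIGITS, items))


def character_map(text: str, key: str = "demo-key") -> str:
    out = []
    position = 0  # counts maskable characters only
    for ch in text:
        if ch in DIGITS:
            out.append(_position_permutation(key, position)[ch])
            position += 1
        else:
            out.append(ch)  # characters outside the group are preserved
    return "".join(out)


if __name__ == "__main__":
    print(character_map("(603) 867-5309"))
```

In this sketch, changing the key changes the permutations (the analogue of rekeying), distinct text inputs always produce distinct outputs, and any input containing a digit is changed.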

This algorithm was introduced in version 6.0.5.0, and uses the algorithm extensibility framework, allowing it to be called from other algorithms using that framework.

Choosing Between Character and Segment Mapping Frameworks¶

The Character Mapping algorithm is intended to replace Segment Mapping for many use cases. That said, it does not replicate every feature of that algorithm, so the specific masking application will determine which one is appropriate.

Reasons to choose Character Mapping over Segment Mapping:

Character Mapping has no limit on the number of positions masked. Segment Mapping cannot handle inputs longer than 36 maskable characters.
Character Mapping can mask all characters in the first Unicode plane. Segment Mapping can only mask "[a-zA-Z]" and "[0-9]".
Character Mapping automatically preserves all non-masked characters. Segment Mapping requires configuration of all preserve characters, which can be impossible due to the limit on the number of preserve characters. Character Mapping is much easier to use when the data is potentially "dirty" or not consistently formatted.
Character Mapping always changes the input (unless no maskable characters are present). With Segment Mapping, there is typically a small chance an input will mask to the same value.
Character Mapping can process preserve ranges in reverse, allowing the last positions of an input to be preserved when inputs have different lengths. Segment Mapping preserve ranges are always processed from the beginning of input.
Character Mapping uses a more complex masking computation, so that every maskable position influences every other position in the masked value. Segment Mapping pre-computes the permutations for each segment independently.

Reasons to choose Segment Mapping over Character Mapping:

Segment Mapping can mask different parts of the input differently, depending on position. Character Mapping always masks the same groups of characters regardless of position.
Segment Mapping can map inputs to different outputs at a position, like { A, B, C, D } -> { W, X, Y, Z }, by specifying different Real and Mask values. This is not possible with Character Mapping.
Segment Mapping supports numeric segments, with segments of up to 4 digits masked to a specific range. Character Mapping doesn't allow this kind of range limiting.
Segment Mapping can be used for tokenization. Character Mapping does not support tokenization at this time.

Segment Mapping Algorithm Framework¶

Segment Mapping algorithms produce no overlaps or repetitions in the masked data. They let you create unique masked values by dividing a target value into separate segments and masking each segment individually.

Segment Mapping can mask inputs containing up to 36 maskable characters. You might use this method if you need columns with unique values, such as Social Security Numbers, primary key columns, or foreign key columns. When masking primary and foreign keys, you must use the same Segment Mapping algorithm for each so that the masked keys still match. You can set the algorithm to produce alphanumeric results (letters and numbers) or only numbers.

With Segment Mapping, you can set the algorithm to ignore specific characters. For example, you can choose to ignore dashes [-] so that the same Social Security Number will be identified no matter how it is formatted. You can also preserve certain values. For example, to increase the randomness of masked values, you can preserve a single number such as 5 wherever it occurs. Or if you want to leave some information unmasked, such as the last four digits of Social Security numbers, you can preserve that information.

Segment Mapping Example¶

Perhaps you have an account number for which you need to create a segment mapping algorithm. You can separate the account number into segments, preserving the first two-character segment, replacing a segment with a specific value, and preserving a hyphen. The following is a sample value for this account number:

NM831026-04

Where:

NM is a plan code number that you want to preserve, always a two-character alphanumeric code.
831026 is the uniquely identifiable account number. To ensure that you do not inadvertently create actual account numbers, you can replace the first two digits with a sequence that never appears in your account numbers in that location. (For example, you can replace the first two digits with 98 because 98 is never used as the first two digits of an account number.) To do that, you want to split these six digits into two segments.
-04 is a location code. You want to preserve the hyphen and you can replace the two digits with a number within a range (in this case, a range of 1 to 77).
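The segmentation above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how the Masking Engine implements Segment Mapping: the key, the helper `keyed_permutation`, and the function `mask_account` are hypothetical, and the sketch assumes every input has the exact layout shown, with a location code already between 01 and 77.

```python
import random


def keyed_permutation(values, key):
    """Map each value to a distinct value from the same list."""
    shuffled = list(values)
    random.Random(key).shuffle(shuffled)
    return dict(zip(values, shuffled))


# Hypothetical key; one lookup table is pre-computed per masked segment.
KEY = "demo-key"
ACCOUNT_TAIL = keyed_permutation([f"{n:04d}" for n in range(10000)], KEY + ":tail")
LOCATION = keyed_permutation([f"{n:02d}" for n in range(1, 78)], KEY + ":location")


def mask_account(value: str) -> str:
    plan_code = value[0:2]      # segment 1: preserved
    # value[2:4] is segment 2: always replaced with the fixed value "98"
    account_tail = value[4:8]   # segment 3: masked to another 4-digit value
    hyphen = value[8]           # preserved character
    location = value[9:11]      # segment 4: masked within the range 01..77
    return plan_code + "98" + ACCOUNT_TAIL[account_tail] + hyphen + LOCATION[location]


if __name__ == "__main__":
    print(mask_account("NM831026-04"))
```

Each masked segment uses its own one-to-one table, so two inputs that differ in a masked segment never receive the same output for that segment. A location code outside 01 to 77 raises a `KeyError` in this sketch.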

Mapping Algorithm Framework¶

A Mapping algorithm allows you to state what values will replace the original data. It sequentially maps original data values to masked values that are pre-populated to a lookup table through the Masking Engine user interface. There will be no collisions in the masked data because each unique input is assigned its own mapping value, and the same input always maps to the same output. For example, “David” will always become “Ragu,” and “Melissa” will always become “Jasmine.” The algorithm checks whether an input has already been mapped; if so, the algorithm changes the data to its designated output.

You can use a Mapping algorithm on any set of values, of any length, but you must know how many values you plan to mask. You must supply at least as many mapping values as there are unique values to mask; supplying more is acceptable. For example, if there are 10,000 unique values in the column you are masking, you must give the Mapping algorithm at least 10,000 values.
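The sequential assignment and the sizing requirement can be sketched as follows. The class name `MappingAlgorithm` and its `mask` method are hypothetical and stand in for the lookup table maintained by the Masking Engine; the sketch only shows why at least one mapping value is needed per unique input.

```python
class MappingAlgorithm:
    """Assign mapping values to inputs in order of first appearance."""

    def __init__(self, mapping_values):
        # Drop duplicate mapping values so no two inputs can share an output.
        self._unused = iter(dict.fromkeys(mapping_values))
        self._assigned = {}

    def mask(self, value):
        if value not in self._assigned:
            try:
                self._assigned[value] = next(self._unused)
            except StopIteration:
                raise ValueError("not enough mapping values for the unique inputs") from None
        return self._assigned[value]


if __name__ == "__main__":
    algorithm = MappingAlgorithm(["Ragu", "Jasmine"])
    print(algorithm.mask("David"))    # first unseen input takes the first value
    print(algorithm.mask("Melissa"))  # next unseen input takes the next value
    print(algorithm.mask("David"))    # already mapped, so the same value is reused
```

With two mapping values, a third unique input raises an error in this sketch, which mirrors the requirement to supply at least as many values as there are unique inputs.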

Info

When you use a Mapping algorithm, you cannot mask more than one table at a time. You must mask tables serially.

Binary Lookup Algorithm Framework¶

A Binary Lookup algorithm is much like the Secure Lookup algorithm but is used when entire files are stored in a specific column. This algorithm replaces objects that appear in object columns. For example, if a bank has an object column that stores images of checks, you can use a Binary Lookup algorithm to mask those images. The Delphix Engine cannot change data within images themselves, such as the names on X-rays or driver’s licenses. However, you can replace all such images with a new, fictional image. This fictional image is provided by the owner of the original data.

Tokenization Algorithm Framework¶

A Tokenization algorithm is the only type of algorithm that allows you to reverse its masking. For example, you can use a Tokenization algorithm to mask data before you send it to an external vendor for analysis. The vendor can then identify accounts that need attention without having any access to the original, sensitive data. Once you have the vendor’s feedback, you can reverse the masking and take action on the appropriate accounts.

Like Mapping, a Tokenization algorithm creates a unique token for each input such as “David” or “Melissa.” The actual data (for example, names and addresses) are converted into tokens that have similar properties to the original data – such as text and length – but no longer convey any meaning. The Delphix Masking Engine stores both the token and the original so that you can reverse masking later.
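A minimal sketch of this idea, assuming an in-memory store, is shown below. The class `TokenVault` and its methods are hypothetical and are not the Delphix API; the sketch only illustrates that keeping both the token and the original makes the masking reversible, and that a token can keep the length and character types of the input.

```python
import secrets
import string


class TokenVault:
    """Keep token <-> original pairs so that masking can be reversed."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def _candidate(self, value):
        # Keep length and character class: digits stay digits, letters stay letters.
        chars = []
        for ch in value:
            if ch.isdigit():
                chars.append(secrets.choice(string.digits))
            elif ch.isalpha():
                chars.append(secrets.choice(string.ascii_letters))
            else:
                chars.append(ch)
        return "".join(chars)

    def tokenize(self, value):
        if value in self._token_by_value:
            return self._token_by_value[value]
        for _ in range(1000):  # retry if a random candidate is already in use
            token = self._candidate(value)
            if token not in self._value_by_token:
                self._token_by_value[value] = token
                self._value_by_token[token] = value
                return token
        raise RuntimeError("could not find an unused token")

    def detokenize(self, token):
        return self._value_by_token[token]


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("David")
    print(token, "->", vault.detokenize(token))
```

The tokens are random, so the printed token differs from run to run; within one vault, the same input always returns the same token.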

Min Max Algorithm Framework¶

The Delphix Masking Engine provides a "Min Max algorithm" to normalize data within a range – for example, 10 to 400. Values that are extremely high or low in certain categories allow viewers to infer someone’s identity, even if their name has been masked. For example, a salary of $1 suggests a company’s CEO, and some age ranges suggest higher insurance risk. You can use a Min Max algorithm to move all values of this kind into the midrange. This algorithm allows you to make sure that all the values in the database are within a specified range.

If the Out of range Replacement Values checkbox is selected, a default value is used when the input cannot be evaluated.
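The following Python sketch shows one way such a range check could behave. Several details are assumptions that the text above does not specify: the function name `min_max`, the rule that an out-of-range value is replaced by a keyed pseudo-random value inside the range, the default of 100, and returning the input unchanged when the checkbox is not selected.

```python
import hashlib


def min_max(value, low=10, high=400, use_default=True, default=100, key="demo-key"):
    """Return a value guaranteed to lie in [low, high] (illustrative only)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # The input cannot be evaluated as a number.
        # Assumption: without the checkbox, the input passes through unchanged.
        return default if use_default else value
    if low <= number <= high:
        return number  # already within the range, left as-is
    # Assumption: derive a repeatable in-range replacement from a hash.
    digest = hashlib.sha256(f"{key}:{value}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return low + fraction * (high - low)


if __name__ == "__main__":
    for sample in [250, 1, 5_000_000, "n/a"]:
        print(sample, "->", min_max(sample))
```

Replacing outliers with a value from inside the range, rather than clamping them to the nearest bound, avoids leaving a cluster of values at the boundary that would still mark them as outliers.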

Data Cleansing Algorithm Framework¶

A data cleansing algorithm does not perform any masking. Instead, it standardizes varied spellings, misspellings, and abbreviations for the same name. For example, “Ariz,” “Az,” and “Arizona” can all be cleansed to “AZ.” Use this algorithm if the target data needs to be in a standard format prior to masking.
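Conceptually, cleansing is a lookup from variant spellings to a standard form, as in this small Python sketch. The table contents and the function name `cleanse` are hypothetical, and matching that ignores case and a trailing period is an assumption made for the example.

```python
# Hypothetical cleansing table: variant spelling -> standard form.
STATE_CLEANSING = {"ariz": "AZ", "az": "AZ", "arizona": "AZ"}


def cleanse(value: str, table=STATE_CLEANSING) -> str:
    """Return the standard form of a value, or the value itself if unknown."""
    lookup_key = value.strip().rstrip(".").casefold()
    return table.get(lookup_key, value)


if __name__ == "__main__":
    for raw in ["Ariz", "Az", "Arizona", "Texas"]:
        print(raw, "->", cleanse(raw))
```

Values that are not in the table pass through unchanged, so the step standardizes data without masking it.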

Free Text Redaction Algorithm Framework¶

A Free Text Redaction algorithm helps you remove sensitive data that appears in free-text columns such as “Notes.” This type of algorithm requires some expertise to use because you must set it to recognize sensitive data within a block of text.

One challenge is that individual words might not be sensitive on their own, but together they can be. The algorithm uses profiler sets to determine what information it needs to mask. You can decide which expressions the algorithm uses to search for material such as addresses. For example, you can set the algorithm to look for “St,” “Cir,” “Blvd,” and other words that suggest an address. You can also use pattern matching to identify potentially sensitive information. For example, a number that takes the form 123-45-6789 is likely to be a Social Security Number.

You can use a Free Text Redaction algorithm to show or hide information by using either a “denylist” or an “allowlist.”

Denylist – Designated material will be redacted (removed). For example, you can set a denylist to hide patient names and addresses. The denylist feature matches the data in the lookup file against the input file.

Allowlist – ONLY designated material will be visible. For example, if a drug company wants to assess how often a particular drug is being prescribed, you can use an allowlist so that only the name of the drug appears in the notes. The allowlist feature enables you to mask data using both the lookup file and a Profiler Set.

For either option, a list of words can be imported from an external text file, or alternatively, you can use Profiler Sets to match words based on regular expressions, defined within Profiler Expressions. You can also specify the redaction value that will replace the masked words. Regular expressions defined using Profiler Sets will match individual words within the input text, rather than phrases.
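The two modes can be sketched in Python as word-by-word matching against a word list and a set of regular expressions. This is an illustration only: the function `redact`, its parameters, and the way words are split are hypothetical, and the patterns stand in for Profiler Expressions rather than reproducing them.

```python
import re

WORD = re.compile(r"[\w-]+")  # a "word" here is a run of letters, digits, or hyphens


def redact(text, words=(), patterns=(), mode="denylist", redaction="XXXX"):
    """Redact listed words (denylist) or everything except them (allowlist)."""
    if mode not in ("denylist", "allowlist"):
        raise ValueError("mode must be 'denylist' or 'allowlist'")
    listed = {w.casefold() for w in words}
    compiled = [re.compile(p) for p in patterns]

    def is_listed(word):
        # Patterns must match a whole word, not a phrase.
        return word.casefold() in listed or any(p.fullmatch(word) for p in compiled)

    def replace(match):
        word = match.group(0)
        if mode == "denylist":
            return redaction if is_listed(word) else word
        return word if is_listed(word) else redaction

    return WORD.sub(replace, text)


if __name__ == "__main__":
    note = "Patient John lives at 12 Elm St, SSN 123-45-6789"
    print(redact(note, words=["John", "Elm", "St"], patterns=[r"\d{3}-\d{2}-\d{4}"]))
    print(redact("Prescribed Lipitor 20mg to John", words=["Lipitor"], mode="allowlist"))
```

In denylist mode only the listed words and pattern matches are replaced with the redaction value; in allowlist mode every word except the listed ones is replaced.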