The Character Mapping framework maps text values, defined by a set of character groups, to other text values generated from the same character groups. Mappings are calculated algorithmically, so it is not necessary to provide the set of mapping values. The algorithm preserves any characters not assigned to a group. Any characters from the first Unicode plane can be mapped - this covers most characters used in modern languages. Other (supplementary) characters can only be preserved.
The particular set of permutations used is determined by the algorithm's key, so rekeying the algorithm will cause different outputs to be generated for each input.
The algorithm has the following properties:
- The masked value for each input is consistent unless the algorithm is rekeyed.
- No two text inputs produce the same text output. Collisions are possible for some data types, such as Numeric, where multiple text values, such as "001" and "1", are treated as the same value.
- As long as at least one maskable character is present in the input, the masked value will never match the input.
- Each masked position influences the mapping done at every other masked position.
For these reasons, this algorithm is useful for masking columns with uniqueness requirements, such as primary and foreign key columns.
This algorithm was introduced in version 22.214.171.124, and uses the algorithm extensibility framework, allowing it to be called from other algorithms using that framework.
To decide whether Character Mapping or Segment Mapping is the correct option for your use case, see Choosing Between Character and Segment Mapping Frameworks.
The character mapping algorithm can be used for tokenization and reidentification jobs.
Creating a Character Mapping Algorithm via UI¶
In the upper right-hand region of the Algorithm tab under Settings, click Add Algorithm.
Select Character Mapping Algorithm. The "Create Character Mapping Algorithm" pane appears.
Enter an Algorithm Name.
This MUST be unique.
Enter a Description.
Define Character Groups for each group of characters among which you would like to map. Each group may defined either by specifying each literal character in the group, such as "0123456789", or using Java Regular Expression style character ranges, such as "[0-9]". The algorithm will freely map characters to other characters within the same group, so by defining groups "[0-9]" and "[A-Z]", numbers would be replaced by other numbers, and letters by other letters, but a number would never be replaced by a letter. Groups should not contain duplicate characters, and each character may belong to only one group. Any character that is not assigned to a group will be preserved (not masked) by the algorithm.
The box below the entry area allows selection of character groups defined for other, preexisting Character Mapping algorithms.
Check the Case Sensitive box to cause the algorithm to treat upper and lower case characters as distinct characters for mapping.
Select a value for Minimum Masked Position, which sets the minimum number of characters that the algorithm must mask; fewer positions triggers non-conformant data handling. Null, empty, and all-whitespace values never trigger non-conformant data handling.
Check the Preserve Leading Zeros box to cause the algorithm to preserve any number of '0' characters at the beginning of each input. This is only useful if '0' has been assigned to a character group in step 5.
Masked results are not guaranteed to be unique if Preserve Leading Zeros is used, and the algorithm cannot be used for tokenization/re-identification jobs.
If desired, define ranges of the input value to ignore using the Preserve Ranges controls. For Character Mapping algorithms, only characters that would otherwise be masked count when determining position for preserve ranges. Each preserve range is defined by:
- Start Position - The position at which to start preserving, starting from 0.
- Length - The number of characters to preserve.
- Direction - The direction, either forward or reverse, determining whether to process from the beginning or end of input for this range.
Be wary of the following Preserve Range processing differences between Segment Mapping and Character Mapping: Segment Mapping ranges start with index 1, while Character Mapping ranges begin with index 0. Segment Mapping includes perserved characters when determining position, while Character Mapping only counts maskable characters. For example, to ignore the first two characters, you would enter Starting Position 0 for Character Mapping, but Starting Position 1 for Segment Mapping. If both algorithms were configured to preserve "-", and preserve the first two positions, Character Mapping might mask "--0000" to "--0073", while Segment Mapping might mask "--0000" to "--4638".
As an example, a Character Mapping algorithm could be defined with a single character group, "[0-9]". It might mask as follows:
- "(603) 867-5309" → "(463) 638-0193"
- "999-12-3456" → "453-71-6283"
- "Call Tom at 8:00PM" → "Call Tom at 2:75PM"