This site is supported by donations to The OEIS Foundation.

User:M. F. Hasler/Representing large digits

From OeisWiki
Jump to: navigation, search

In several situations, for example when dealing with sequences that list permutations or numbers written in base-N, one has often the wish to be able to represent digits larger than 9 (in an unambiguous way, of course). Here I propose several practical solutions to this problem. This also provides a means of representing numbers with leading zeros.

The basic problem

In the OEIS, sequences are recorded as lists of possibly signed integers, written in decimal notation and without leading zeros. The decimal notation of course allows to write integers in bases smaller than 9 without any problem, but for a base b>9, an unambiguous interpretation is not possible if digits are written in decimal and just concatenated; e.g. "10" might be one digit '10' (in some base > 10) or a digit '1' followed by a digit '0' (which would mean 13 in base 13). Similarly, "111" might be a digit '1' followed by digit '11' or the converse, representing either 1 x 12 + 11 = 23 or the much larger 11 x 12 + 1 = 133, in base 12.

The well known standard approach is to use upper or lower case letters A, B, C, ... to represent digits 10, 11, 12, ..., but for the reasons given above, this is not possible in the OEIS (DATA sections and b-files).

Another problem occurs due to the impossibility of recording "leading zeros", e.g., one cannot list the permutations of "0123", actually not even the initial term. In this precise case, the problem can be circumvented by shifting all digits by 1, i.e., by listing permutations of "1234" instead, but this is not always a valid solution.

Terms without digits 0

A very simple solution

I was lead to these considerations when looking at sequence A247701 where only nonzero digits occur. This is a case where an extremely simple solution to the announced problem exists:

One possibility of encoding digits d > 9 is to write them as (d-9k)*10^k for 9*k < d < 9*k+10, i.e., d=10 as "10", d=11 as "20",..., d=18 as "90", then d=19 as "100", d=20 as "200", etc.

This solution is very convenient in many situations, especially when the limit of base b=9 is reached soon, requiring a generalized notation, but base b=19 is beyond practical need. This is frequently the case, since typically the number of terms would grow from 9! = 362880 (which can be considered small in many regards) to 19! = 121645100408832000, which is already quite large, e.g., far beyond the capacity (in bytes) of the database for the next years.

It is also a very natural solution in the sense that the digit 10 is simply and completely unambiguously represented by "10". (Of course, it is understood that numbers must be decoded from right to left, i.e., one looks first for the (rightmost) '0' and the digits preceding it are grouped together according to the coding scheme.)

Up to d=18, this is also certainly the most compact representation possible, since it is clear that digits > 9 cannot be coded with a single digit and therefore require at least two digits, and the provided solution remains sharply at this minimal bound.

Alternate solutions for larger digits

Beyond d=18 (and especially beyond d=27 represented as "900"), the growing number of somehow "unused" digits "0" looks as a waste of space. If larger digits are to be frequently encoded, there are two possible solutions, inspired by existing standards: (1) reserving a multiple of additional digits for each "escape character 0", or (2) using one digit (or more digits) to determine the "width" of the character.

Duplicating block size

One possibility to avoid the large number of digits "0" which would occur using the first encoding scheme for d>27, is to reserve twice the number of digits with each additional "0". More explicitly,

x0, with 1 <= x <= 9, would mean, as before, a digit x+9 in the range from 10 to 18, 
yx00, with 0 <= y <= 9, 1 <= x <= 9 as above (x=0 still forbidden), would mean one of 90 possible digits ranging from 19 to 108,
zzzx000, with 000 <= zzz <= 999, would mean a digit in the range from 109 to 9108 (*)

etc.: If there are L consecutive "0"s, then 2^(L-1) characters (decimal digits) to the left thereof are to be grouped together. Any other multiplier m > 2 could be chosen if this would be more useful in a given situation.

(*) There is no problem of leading zeros here: if the "extended-width" digit is the first in the number, those leading 0's are not written, without creating an ambiguity, but if preceded by any other digit, they are written to their full width as indicated.

(**) It is crucial to observe the constraint that the last digit preceding the string of "0"s must be nonzero to maintain unambiguous interpretation. Therefore, admissible values for yx are 01, ..., 09, 11, ...19, 21, etc., for zzzx one can use 0001, ..., 0009, 0011, ..., 0099, 0101, ..., 0999, 1001, ..., 9989, 9991, ..., 9999. The "0"s in zzz do not cause an ambiguity due to the global right-to-left parsing of the number.

Specified block size

As an alternative to use the "tally mark 0"s for indicating duplicated block size/"width" of the extended digit, one can use one decimal digit to make that size precise, again in various ways:

yx0 where 0 <= x <= 9 indicates the number of characters to be reserved for the "extended digit" y >= 0.

Either x > 0 literally gives the number of characters (0 would not make much sense here, except that 00 could represent a digit zero), or via some adequate formula another number w(x) of (preceding) characters would be reserved for y, e.g., w(x) = m*x or w(x) = m^x (where x = 0 also can make sense), with m = 2 or more.

  • Since digits 1..9 are already represented in non-extended notation, y = 0 could again represent 10 (or zero, if needed), etc. For x > 1, depending on the formula of w(x), y = 0 could represent the next larger digit which cannot be represented for a smaller x (cf. Example below), unless it is preferred to allow multiple representations for the same number in order to make encoding and decoding easier.
  • In this case, y may have trailing 0's without risk of ambiguity.
  • x = 9 (or x = 0) might be used to switch to a further extended coding scheme, where y not only represents the value of the digit but also specifies a further number of characters to be considered part of this extended digit.

Examples

  • With w(x) = x, ("00"; "010", "110", "210", ..., "910"; "0020", "0120", ..., "9920"; "00030" .. "99930"; ... "99999999990")
could represent digits (0; 10, 11, 12, ..., 19; 20, 21, ..., 119; 120 .. 1119; ... 1111111119 ≈ 1.1e9).
One can check that the fact that leading zeros disappear in the "code", does not pose a problem to represent numbers: leading zeros in the "plain text" (original number) are irrelevant and possibly missing '0's in the code of a leading nonzero extended digit can be reconstructed, knowing the number of digits supposed to be there.
Leading zeros can't be coded in that way. If this is needed, one could reserve "010" for zero and shift the code for all digits > 9 by one.
  • With w(x) = 2^x, "00" would not be valid and can't be used to represent 0, since 2^0 = 1 digits are expected to the left of this.
("0010" .. "9910"; "000020" .. "999920") could represent, e.g., digits (10 .. 109; 110 .. 10109), etc.

Comparative analysis

The latter coding scheme seems preferable to the "duplicating block size" one, since it allows to determine the number of characters in the "extended digit" in a more flexible and adjustable way, without requiring more space beyond the 2-digit case, and without the need to exclude trailing 0's in y.

Depending on the application, on should choose between the latter scheme (with convenient choice of w(x)) and the first and simplest one, possibly with a fixed convention for "00" to introduce a further extended scheme (such as the latter one with 0 replaced by that "00").

Coding digits zero

All preceding proposals use the digit "0" as an "escape character" to introduce multi-character digit coding, so it cannot be used on its own to code a digit 0. Often, shifting all digits by one (e.g., as to list permutations of "1234" instead of "0123") is not desirable.

In this case, the most natural solution appears to use the code proposed in the above for d=10 to code the digit 0, and shift the remaining codes by one, i.e., for example in the first scheme, "10" would encode a digit d=0, "11" would encode d=10, "12" would encode d=11 etc.

References

See also

Authorship

The first version of this page was written by M. F. Hasler, 11:09, 5 October 2014 (UTC-400)

M. F. Hasler, User:M._F._Hasler/Representing_large_digits. — From the On-Line Encyclopedia of Integer Sequences® (OEIS®) wiki.Available at https://oeis.org/wiki/User:M._F._Hasler/Representing_large_digits