Article

Summary:
The UTF-8 encoding keeps the standard ASCII characters unchanged and encodes the accented letters of our alphabets in two bytes. Standard 8-bit TeX is not ready for UTF-8 input because it has to handle such a single character as two tokens. This means that you cannot assign a \catcode, \uccode, etc. to such a character, and \futurelet cannot inspect the next character in the usual sense. The second version of my encTeX solves these problems. encTeX is fully backward compatible with the original TeX. It adds ten new primitives with which you can set or read the conversion tables used by TeX's input processor and during output to the terminal, the log and \write files. The second version makes it possible to convert multi-byte sequences to one byte or to a control sequence. You can implement up to 256 UTF-8 codes as single bytes and an unlimited number of other UTF-8 codes as control sequences. All internals of 8-bit TeX work in the same way as if a "normal one-byte encoding" of the input files were used. I think that the UTF-8 encoding will come into more common use. In that situation there is no other way than to modify TeX's input processor; otherwise 8-bit TeX will die out in a short time.
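The conversion idea described in the summary can be illustrated with a small sketch. It is written in Python rather than TeX, and it is not encTeX's implementation or syntax: the table `TABLE`, the function `convert`, the choice of ISO-8859-2 slots as single-byte targets and the control sequence name `\euro` are all assumptions made for illustration. The sketch only shows how an input processor could map a multi-byte UTF-8 sequence either to one byte or to a control sequence before tokenization.

```python
# Illustrative sketch only: names and table contents are hypothetical,
# not taken from encTeX.

# Each key is a UTF-8 byte sequence. A value is either an int (the single
# byte the sequence is converted to) or a str (a control sequence name).
TABLE = {
    "á".encode("utf-8"): 0xE1,      # assumed target: ISO-8859-2 slot of á
    "ř".encode("utf-8"): 0xF8,      # assumed target: ISO-8859-2 slot of ř
    "€".encode("utf-8"): "\\euro",  # hypothetical control sequence
}


def convert(data: bytes, table: dict) -> list:
    """Convert input bytes using longest-match lookup in `table`.

    Returns a list whose items are ints (single bytes) or strs
    (control sequence names). Bytes not covered by the table pass
    through unchanged.
    """
    out = []
    max_len = max(len(key) for key in table)
    i = 0
    while i < len(data):
        # Try the longest candidate sequence first, down to two bytes.
        for n in range(max_len, 1, -1):
            chunk = data[i:i + n]
            if len(chunk) == n and chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No multi-byte match: keep the byte as it is (e.g. ASCII).
            out.append(data[i])
            i += 1
    return out


# Accented letters of Latin alphabets occupy two bytes in UTF-8.
two_bytes = "ř".encode("utf-8")

# After conversion, "ř" is a single byte and "€" is a control sequence.
tokens = convert("Ař €".encode("utf-8"), TABLE)
```

After such a conversion every mapped letter is one 8-bit code again, so per-character settings in the style of \catcode or \uccode can apply to it, which is the effect the summary describes.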