Latin-1 Text Annotations (tEXt, zTXt)

The PNG Guide is an eBook based on Greg Roelofs' book, originally published by O'Reilly.




Status: PNG Specification
Location: anywhere
Multiple: yes

That brings us to PNG's original text chunks, which are perhaps its most popular nonessential chunks. Regardless of how many words a picture is worth, it is often useful or necessary to add a few more in order to record pertinent information like title and author, store requisite legal notices such as a copyright or disclaimer, or merely to transfer text from one image to another.

PNG supports two types of Latin-1-based text chunks, uncompressed (tEXt) and compressed (zTXt). There is also a new Unicode-based chunk (iTXt) that I'll discuss next. For the first two, the format is basically the same: an uncompressed keyword or key phrase, a null (zero) byte, and the actual text. In zTXt the text is compressed; the first byte after the null indicates the compression method, for which only deflate is currently defined (method zero). The remainder is the compressed stream, which for method zero must be in zlib 1.x format, just as for image data. (The zlib 1.x format is described by revision 3.3 of the zlib specification, which is available from http://www.zlib.org/zlib_docs.html/.)

Both keyword and raw text should be encoded with the Latin-1 (ISO/IEC 8859-1) character set; neither may contain null bytes. Since the keyword is intended to be recognizable by both humans and computer programs, additional restrictions are placed on it: it may not contain leading, trailing, or consecutive spaces, and it is restricted to characters in the range 32-126 and 161-255 (which, in particular, rules out both control characters and the nonbreaking space, decimal value 160). The only other restriction on the main text of the chunk is that newlines should be in Unix format, i.e., represented by a single line-feed character (decimal value 10).
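
The layout just described can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function names are hypothetical, the sketch handles only the data field of a tEXt or zTXt chunk (not the surrounding chunk length, type, and CRC), and normalizing carriage returns to line feeds is one assumed way of honoring the newline rule.

```python
import zlib


def validate_keyword(keyword):
    """Encode a keyword as Latin-1 and enforce the keyword restrictions."""
    if keyword != keyword.strip(" ") or "  " in keyword:
        raise ValueError("no leading, trailing, or consecutive spaces allowed")
    raw = keyword.encode("latin-1")  # raises for characters outside Latin-1
    for byte in raw:
        # Allowed ranges are 32-126 and 161-255; this excludes control
        # characters, the null byte, and the nonbreaking space (160).
        if not (32 <= byte <= 126 or 161 <= byte <= 255):
            raise ValueError("keyword byte %d is not allowed" % byte)
    return raw


def encode_text(text):
    """Encode the main text as Latin-1 with Unix-style newlines."""
    # Assumption: foreign line endings are converted rather than rejected.
    raw = text.replace("\r\n", "\n").replace("\r", "\n").encode("latin-1")
    if b"\x00" in raw:
        raise ValueError("text may not contain null bytes")
    return raw


def build_text_data(keyword, text):
    """tEXt data: keyword, null byte, uncompressed text."""
    return validate_keyword(keyword) + b"\x00" + encode_text(text)


def build_ztxt_data(keyword, text):
    """zTXt data: keyword, null byte, method byte (0), zlib stream."""
    compressed = zlib.compress(encode_text(text))
    return validate_keyword(keyword) + b"\x00" + b"\x00" + compressed


def parse_text_data(data, compressed=False):
    """Split chunk data back into (keyword, text)."""
    keyword, separator, rest = data.partition(b"\x00")
    if not separator:
        raise ValueError("missing null separator after keyword")
    if compressed:
        if not rest or rest[0] != 0:
            raise ValueError("unknown compression method")
        rest = zlib.decompress(rest[1:])
    return keyword.decode("latin-1"), rest.decode("latin-1")
```

For instance, `build_text_data("Title", "Sunset")` yields the bytes `Title`, a zero byte, and `Sunset`, and `parse_text_data` reverses the process; pass `compressed=True` for data produced by `build_ztxt_data`.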

I mentioned in Chapter 7, "History of the Portable Network Graphics Format", that the Unicode UTF-8 character set was one of the items in the design of PNG that was voted down. In retrospect this was, perhaps, a lamentable decision; it was finally addressed early in 1999 with the iTXt chunk. But at the time, UTF-8 was very new and had not been extensively tested in the field. In particular, it had little or no operating-system support and no support in standard programming libraries, either for encoding and decoding or for the translation and display of UTF-8 characters in the native character set(s) of existing systems. Since PNG's design goals included both the use of well-tested technologies and the avoidance of undue burdens on developers of PNG applications, support for UTF-8 was dropped in favor of the more familiar Latin-1 character set.

The following list summarizes all of the keywords that are either included in the specification itself or officially registered as extensions to the spec:

*Author* The name of the author of the image. If the original image were a painting or other nonelectronic medium, both the original artist and the person who scanned the image might be listed.

*Title* A one-line title or caption. Longer captions should generally use the Description keyword, but see the end of this section for an unofficial alternative.

*Description* A longer description of or caption for the image, perhaps including details about the tools and settings used; the name, age, and/or location of the subject matter; or the mood the artist was trying to convey. See also the Software and Source keywords.

*Creation Time* The time the image was created, in whatever sense is most appropriate.
The recommended format is that prescribed by Internet RFC 822 (Section 5), as amended by RFC 1123 (Section 5.2.14); specifically:

    day month year hour:minute timezone

where day is either one or two digits; month is a three-letter English abbreviation such as Jun; year is two or four digits (though the latter is strongly recommended); hour and minute are two digits each; and timezone is either a three-letter abbreviation (e.g., PST for Pacific Standard Time), or a one-letter U.S. military designation, or a four-digit number with a leading positive or negative sign indicating the hour:minute offset from Coordinated Universal Time (e.g., -0800 for Pacific Standard Time, which is eight hours and zero minutes earlier than UTC). In addition, the entire string may optionally be preceded by a weekday field, where weekday is a three-letter English abbreviation (e.g., Fri). A colon and two-digit seconds field may also be appended to the time (that is, hour:minute:second). Note that this is merely a recommendation; strings such as "circa 1492" are allowed, as is explanatory text following an RFC-style date string.

*Copyright* The legal copyright notice for the image. For example, "Copyright 1999 by Greg Roelofs. This image may be freely used and distributed provided that it is not modified in any way and that this notice remains intact."

*Disclaimer* A legal disclaimer notice for the image. This might include a company's standard boilerplate on all copyrighted works; in particular, it might be lengthy enough to store in a compressed (zTXt) chunk, while the copyright notice remains uncompressed.

*Warning* A warning about the content or effects of the image. For example, certain types of popular material may not be suitable for minors, or a random-dot stereogram ("Magic Eye" 3D image) may induce headaches in some people.

*Software* The name and possibly the version of the software used to create the image. This is most often generated automatically, but it need not be.
More than one software application may be listed.

*Source* Information about the device used to generate the image, such as a digital camera or a scanner.

*Comment* A miscellaneous comment, often converted from a GIF comment (which lacks keywords).

In addition to these official keywords, one of the technical reviewers of this book and I have been known to make use of a few unofficial keywords. The Caption keyword is used to provide a brief description of an image that is more specifically tailored for use as a publishable caption than the generic Description keyword; it is also generally lengthier than is appropriate for the Title keyword. The E-mail keyword stores the email address of the author in standard Internet format (RFC 822, Section 6, as amended by RFC 1123, Sections 5.2.15 through 5.2.19); for example, roelofs@pobox.com.

And the URL keyword is for a standard WWW Uniform Resource Locator (RFC 2068, Section 3.2); for example, http://www.oreilly.com/. If the URL is reasonably self-explanatory, it is recommended that the chunk consist of the single URL and nothing else, but this is not a requirement. Multiple URLs should be separated by newline characters. Note that spaces and other white space (tabs, newlines, and so forth) are considered unsafe by the URL standard and therefore must be escaped within a conforming URL. For example, a space character must be encoded as %20. This allows easy parsing of optional explanatory text after a URL: the URL ends when the first white space (space, tab, or newline) is encountered.
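
Two of the conventions in this section are mechanical enough to sketch in Python: producing a Creation Time string in the recommended RFC 822/1123 style, and splitting the text of a URL chunk at the first white space. Both function names are hypothetical, and the URL helper assumes each line holds one URL optionally followed by explanatory text. The standard library's `email.utils.format_datetime` emits the optional weekday and seconds fields along with a numeric time zone offset.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def creation_time(when):
    """Format a timezone-aware datetime in the recommended RFC style."""
    if when.tzinfo is None:
        raise ValueError("an explicit time zone is needed for the offset")
    return format_datetime(when)


def split_url_text(text):
    """Return (url, explanatory_text) pairs, one per non-empty line.

    Assumption: each URL starts its own line; it ends at the first
    space or tab, and anything after that is explanatory text.
    """
    entries = []
    for line in text.split("\n"):
        line = line.strip(" \t")
        if not line:
            continue
        parts = re.split(r"[ \t]+", line, maxsplit=1)
        note = parts[1] if len(parts) > 1 else ""
        entries.append((parts[0], note))
    return entries


# Example values only; the date and offset are illustrative.
pst = timezone(timedelta(hours=-8))
stamp = creation_time(datetime(1999, 6, 4, 9, 30, tzinfo=pst))
links = split_url_text("http://www.oreilly.com/ the original publisher")
```

Because a conforming URL has its spaces escaped as %20, splitting on the first unescaped space or tab cannot cut a URL short.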

Last Update: 2010-Nov-26