Abstract
This document addresses issues raised by the Unicode Technical Committee and builds on “Proposed Changes to Gurmukhi” (L2/05-088). Relevant information from that proposal is repeated here for completeness; the additional evidence it contains is not duplicated.
Thanks to everyone who has contributed both time and effort to the research involved in making this document. Special thanks to:
Jeevan Deol
Anoop Singh
Manudeep Singh
Serjinder Singh
Kulbir Thind
…and all the members of the Indic mailing list who have constructively discussed and debated the contents of these proposals.
Page 1 of 13
A1. Double Vowel Signs
Older Gurmukhi (for example, in the Sikh holy book – the Sri Guru Granth Sahib) is known to use two vowel signs on one consonant. This behaviour is restricted to Hora (Vowel Sign OO, ◌ੋ, U+0A4B) and Aunkar (Vowel Sign U, ◌ੁ, U+0A41). This particular combination represents “the metrical shortening of ‘ō’ or lengthening of ‘u’ depending on context.”¹ The additional vowel sign is added to a syllable and lengthens or shortens the vowel based on the original vowel sign. It is designed to keep the meaning of the original word intact, while indicating how the vowel should be pronounced in poetry.²
A1.1 SGGS page 1386 (੧੩੮੬)
A1.2 SGGS page 1396 (੧੩੯੬)
Example
Umāhā (ਉਮਾਹਾ) becomes Ūmāhā (ਓੁਮਾਹਾ)
Gōbind (ਗੋਿਬੰਦ) becomes Gobind (ਗਿਬੰਦ)
Both examples maintain the original meaning of the word while altering the pronunciation.

Proposed Changes
It was originally suggested that this phenomenon be accommodated using existing characters. However, after further discussion, it was realised that this would break older rendering engines and would introduce unnecessary exceptions to Gurmukhi rendering. Assigning new code points would not only benefit users with older implementations of Unicode, but would also be more consistent with the rendering rules of Gurmukhi and other Indic scripts. The two-part vowel sign also behaves similarly to two-part vowel signs in other Indic scripts such as Bengali.

Two new characters are recommended for inclusion in the standard to accommodate this phenomenon:
ਓੁ – U+0A11 – GURMUKHI LETTER SHORT OO OR LONG U
◌ – U+0A49 – GURMUKHI VOWEL SIGN SHORT OO OR LONG U
0A11;GURMUKHI LETTER SHORT OO OR LONG U;Lo;0;L;;;;;N;;;;;
0A49;GURMUKHI VOWEL SIGN SHORT OO OR LONG U;Mn;0;NSM;0A4B 0A41;;;;N;;;;;
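The two proposed records use the semicolon-separated, 15-field layout of UnicodeData.txt. The following is a minimal sketch of how such records might be read; the `Record` type and `parse_record` function are hypothetical names introduced for illustration, and only plain (untagged) decomposition mappings are handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """Subset of a UnicodeData.txt-style record (hypothetical helper type)."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: tuple  # code points of the mapping; empty if none

def parse_record(line):
    """Parse one semicolon-separated record with 15 fields."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    # Assumption: field 5 holds only space-separated hex code points
    # (no <tag> prefix), as in the two proposed records.
    decomposition = tuple(int(cp, 16) for cp in fields[5].split())
    return Record(int(fields[0], 16), fields[1], fields[2],
                  int(fields[3]), fields[4], decomposition)

# The two records proposed in A1, copied from the text above.
PROPOSED = [
    "0A11;GURMUKHI LETTER SHORT OO OR LONG U;Lo;0;L;;;;;N;;;;;",
    "0A49;GURMUKHI VOWEL SIGN SHORT OO OR LONG U;Mn;0;NSM;0A4B 0A41;;;;N;;;;;",
]
records = [parse_record(line) for line in PROPOSED]
```

Read this way, the second record carries the mapping to U+0A4B followed by U+0A41, while the first has no decomposition.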
1 Jeevan Deol, Research Fellow in Indian History, St. John’s College, University of Cambridge.
2 Sahib Singh, Gurbani Vyakaran (Gurbani Grammar), (1994, in Punjabi), p. 405.
These code points occupy the positions that correspond to the independent and dependent forms of Devanagari Candra O (U+0911 and U+0949), although the proposed characters have no relation to that character. ‘Short OO or Long U’ is used instead of ‘O or UU’ as it more accurately conveys the use of the actual character.

GURMUKHI VOWEL SIGN SHORT OO OR LONG U may also be constructed as shown:
U+0A49 (◌) = U+0A4B (◌ੋ) + U+0A41 (◌ੁ)
The reverse sequence, U+0A41 followed by U+0A4B, is not equivalent and, as such, U+0A4B should be forced to stand alone.
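The equivalence stated above can be sketched as a pair of string substitutions. This is only an illustration under the assumption that U+0A49 is assigned as proposed; the function names are hypothetical and no existing normalisation library is implied.

```python
# Proposed code point from this document; not an existing library constant.
VOWEL_SIGN_SHORT_OO = "\u0A49"
VOWEL_SIGN_OO = "\u0A4B"   # Hora
VOWEL_SIGN_U = "\u0A41"    # Aunkar
DECOMPOSED = VOWEL_SIGN_OO + VOWEL_SIGN_U

def decompose(text):
    """Expand the proposed sign into U+0A4B followed by U+0A41."""
    return text.replace(VOWEL_SIGN_SHORT_OO, DECOMPOSED)

def compose(text):
    """Fold U+0A4B U+0A41 (in that order only) into the proposed sign."""
    return text.replace(DECOMPOSED, VOWEL_SIGN_SHORT_OO)

def has_reversed_pair(text):
    """Detect U+0A41 followed by U+0A4B, which is not equivalent."""
    return (VOWEL_SIGN_U + VOWEL_SIGN_OO) in text
```

For example, `compose` leaves Ga followed by U+0A41 U+0A4B untouched, because only the order U+0A4B U+0A41 is treated as equivalent to U+0A49.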
A2. Recommended Character Sequences
After the submission of the initial proposals, it became apparent that there were problems with multiple ways of representing Gurmukhi syllables that could not be addressed with normalisation. In response to this, the following rules have been formulated:
• No additional vowel signs should attach to independent vowels – this is especially true for Aira (Letter A).
• Ura and Iri are only designed for singular representation and have no inherent meaning on their own. They should not combine with any signs in the Gurmukhi block, including Nukta.
• Only one vowel sign should attach to a consonant unless a specific exclusion is listed. The only exclusion should be for the new code point U+0A49, which decomposes to U+0A4B and U+0A41. The sequence U+0A41 and U+0A4B is not a valid sequence.
In response to the recommendations by the UTC, the following table lists the acceptable and unacceptable forms for a given graphical appearance. Graphical Appearance
Vowel signs should not be attached to the standalone forms of the vowel bearers (U+0A05, U+0A72 and U+0A73). The pre-composed code points should be used instead.
*Indicates code points recommended for inclusion into the Unicode Standard
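The three rules in A2 could be checked mechanically along the following lines. This is a rough sketch, not a complete syllable validator: the character sets are written out by hand, U+0A11 and U+0A49 are included on the assumption that they are assigned as proposed, and the names `clusters` and `problems` are hypothetical.

```python
import unicodedata

# Dependent vowel signs, plus the proposed U+0A49 (assumption: assigned as proposed).
VOWEL_SIGNS = frozenset("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A49")
# Independent vowels including Aira, plus the proposed U+0A11.
INDEPENDENT_VOWELS = frozenset("\u0A05\u0A06\u0A07\u0A08\u0A09\u0A0A\u0A0F\u0A10\u0A13\u0A14\u0A11")
BARE_BEARERS = frozenset("\u0A72\u0A73")   # Iri and Ura
ALLOWED_PAIR = "\u0A4B\u0A41"              # the single permitted double vowel sign

def is_gurmukhi_mark(ch):
    """True for combining marks in the Gurmukhi block (and the proposed U+0A49)."""
    in_block = "\u0A00" <= ch <= "\u0A7F"
    return in_block and (ch in VOWEL_SIGNS or unicodedata.category(ch).startswith("M"))

def clusters(text):
    """Yield (base, marks) pairs: a base character and the marks that follow it."""
    base, marks = "", ""
    for ch in text:
        if is_gurmukhi_mark(ch):
            marks += ch
        else:
            if base or marks:
                yield base, marks
            base, marks = ch, ""
    if base or marks:
        yield base, marks

def problems(text):
    """Return a list of rule violations found in text (empty if none)."""
    found = []
    for base, marks in clusters(text):
        vowel_signs = "".join(m for m in marks if m in VOWEL_SIGNS)
        if base in BARE_BEARERS and marks:
            found.append("sign attached to Ura or Iri")
        elif base in INDEPENDENT_VOWELS and vowel_signs:
            found.append("vowel sign attached to independent vowel")
        elif len(vowel_signs) > 1 and vowel_signs != ALLOWED_PAIR:
            found.append("more than one vowel sign on a consonant")
    return found
```

Under these assumptions, Ga followed by U+0A4B U+0A41 passes, while Ga followed by U+0A41 U+0A4B, or Aira followed by a vowel sign, is reported.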
B1. Named Sequence Corrections
The recommendations in this section are based on UAX #34. Although that document is considered an ‘integral part of the Unicode Standard’, it does not contain any details of the stability of named sequences. Although stability may be inferred, there is no specific statement that named sequences cannot be added, removed or renamed.

In Unicode 4.1, six named sequences were added for Gurmukhi. Of these, two are incorrect:
GURMUKHI HALF YA;0A2F 0A4D
GURMUKHI PARI YA;0A4D 0A2F

Half Ya was recognised as a conjunct in the Unicode Standard 4.0³ and is listed incorrectly as a named sequence. Half Ya is a C2-conjoining consonant – i.e. it takes an alternative form in the second half of a conjunct:
ਦ + ◌੍ + ਯ = ਦ
As such, the current listing for Half Ya should be changed to:
GURMUKHI HALF YA;0A4D 0A2F
If renaming is possible, it should be renamed to:
GURMUKHI ADDA YA;0A4D 0A2F
Adda (ਅੱਧਾ) is the Punjabi word for half and remains consistent with ‘Pari’ or ‘Pairin’. This poses a problem for the existing listed ‘Pari Ya’, which should be removed. Further details on Pairin Ya are listed in C2.

In addition, the existing named sequences are labelled as ‘Pari’, which is an incorrect transliteration. ‘ਪੈਰੀਂ’ should be transliterated as ‘Pairin’ or ‘Pairīn’, and this should be reflected in the existing named sequences. If the named sequences cannot be changed, the new additions mentioned below should be consistent and use ‘Pari’ instead of ‘Pairin’.
3 The Unicode Standard 4.0, (2003), p. 235, table 9-4.
B2. Subjoined Consonants
The following subjoined consonants should be recognised. All are archaic and are not used in modern Gurmukhi. They should all be added as named sequences. Virtually all of the subjoined consonants are equivalent to their full form but without the top bar.

Virama (U+0A4D) + Ka (U+0A15) = ◌੍ + ਕ = GURMUKHI PAIRIN KA
Virama (U+0A4D) + Ga (U+0A17) = ◌੍ + ਗ = GURMUKHI PAIRIN GA
Virama (U+0A4D) + Ca (U+0A1A) = ◌੍ + ਚ = GURMUKHI PAIRIN CA
Virama (U+0A4D) + Ja (U+0A1C) = ◌੍ + ਜ = GURMUKHI PAIRIN JA
Virama (U+0A4D) + Tta (U+0A1F) = ◌੍ + ਟ = GURMUKHI PAIRIN TTA
Virama (U+0A4D) + Ttha (U+0A20) = ◌੍ + ਠ = GURMUKHI PAIRIN TTHA
Virama (U+0A4D) + Ta (U+0A24) = ◌੍ + ਤ = GURMUKHI PAIRIN TA
Virama (U+0A4D) + Tha (U+0A25) = ◌੍ + ਥ = GURMUKHI PAIRIN THA
Virama (U+0A4D) + Da (U+0A26) = ◌੍ + ਦ = GURMUKHI PAIRIN DA
Virama (U+0A4D) + Dha (U+0A27) = ◌੍ + ਧ = GURMUKHI PAIRIN DHA
Virama (U+0A4D) + Na (U+0A28) = ◌੍ + ਨ = GURMUKHI PAIRIN NA

The conjuncts already recognised by the Unicode Standard should be listed as named sequences (Pairin Va is already listed; for Half Ya see B1):
Virama (U+0A4D) + Ra (U+0A30) = ◌੍ + ਰ = GURMUKHI PAIRIN RA
Virama (U+0A4D) + Ha (U+0A39) = ◌੍ + ਹ = GURMUKHI PAIRIN HA
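The entries requested in B2 all share one pattern: Virama (U+0A4D) followed by the consonant. As an illustration only, the corresponding lines in the `NAME;XXXX XXXX` form used in B1 could be generated as below. The function name and the `prefix` parameter are hypothetical; the parameter exists because B1 leaves open whether ‘Pairin’ or ‘Pari’ will be used.

```python
import unicodedata

VIRAMA = 0x0A4D
# Consonants listed in B2, in the order given there.
SUBJOINED = [0x0A15, 0x0A17, 0x0A1A, 0x0A1C, 0x0A1F, 0x0A20, 0x0A24,
             0x0A25, 0x0A26, 0x0A27, 0x0A28, 0x0A30, 0x0A39]

def named_sequence_line(consonant, prefix="PAIRIN"):
    """Build one 'NAME;VIRAMA CONSONANT' line for a subjoined consonant."""
    # e.g. 'GURMUKHI LETTER KA' -> 'KA'
    letter = unicodedata.name(chr(consonant)).split("GURMUKHI LETTER ", 1)[1]
    return f"GURMUKHI {prefix} {letter};{VIRAMA:04X} {consonant:04X}"

lines = [named_sequence_line(cp) for cp in SUBJOINED]
```

Calling `named_sequence_line(0x0A15, prefix="PARI")` gives the alternative spelling should the existing names prove immutable.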
C1. Udaat (ਉਦਾਤ)
Initially it was determined that Udaat was a variant form of subjoined Ha (Pairin Haha); however, after further research this is now believed to be incorrect. Treating the two as distinct also explains why both subjoined Ha and Udaat are used concurrently in the same document.

Udaat⁴ looks like the Halant or Virama character in Devanagari, but it is not that character. It is found in the Sri Guru Granth Sahib 1188 times⁵. The Udaat was used for a non-segmental phoneme (akhãndi tòni) known as the high tone⁶. This sign is related to Ha, because Ha itself is used to distinguish tones, but it is not a variant form. Udaat may be related to Devanagari Udatta (U+0951), which also indicates a high tone in Sanskrit literature.

High tone is still present in modern Punjabi; however, the Udaat is not used in modern Gurmukhi. In modern Gurmukhi, there are no symbols that highlight the high or low tones. At places where the Udaat was used earlier, another symbol known as the Pairin Haha is now used. This does not mean that the Udaat is equivalent to Pairin Haha. The orthographical rules of Gurmukhi suggest that Pairin Haha is used for the pronunciation of an aspirated sound of the initial letter⁷. However, Punjabi dialects show a variety of pronunciations. In the Majhi of Central Punjab, words written with Pairin Haha would certainly be pronounced with a high tone; in most other dialects (both Western Punjabi and Eastern Punjabi dialects), either a complete or seminal /h/ would be found, or in places the aspirate sound⁸.

In the Old Gurmukhi of the Sri Guru Granth Sahib, both Udaat and Pairin Haha have been used. This is a result of the wide range of Punjabi dialects, apart from other languages, being represented in Gurbani, at different stages in their evolution (from the 12th century to the 17th century). The Udaat suggests the high tone, while the Pairin Haha denotes the aspirate /h/ with the inherent vowel being suppressed. In modern Gurmukhi, only Pairin Haha is used, but orthographically it does not suggest the high tone. Both high tone and /h/ pronunciation are to be found among the Punjabi dialects.

The Halant or Virama of Devanagari, which has a similar form to Udaat, is used in English-Punjabi dictionaries to transcribe the correct pronunciation of English words and in other technical writings, such as lexicons.

It is recommended that Udaat be encoded as a separate Unicode character, with the following properties:
0A51;GURMUKHI SIGN UDAAT;Mn;0;NSM;;;;;N;;;;;

Udaat differs very slightly in its graphical appearance when compared to Halant. Udaat starts with a small tip and slopes inward to the right, whereas Halant has a more uniform thickness and slopes outwards to the right.

Udaat should push down U and UU in the same way that existing subjoined consonants do.
4 The Punjabi-English dictionary, published by Punjabi University, Patiala (1994), gives the following meanings of the term Udaat: ‘sublime; acutely accentuated, sharply intoned’ (p. 9).
5 Kulbir S Thind, Text Trivia in Gurbani-CD 2004. The basis of the file is the Sri Guru Granth Sahib, published by the Shiromani Gurdwara Parbandak Committee in 1994.
6 Harkirat Singh, Gurbani di Bhasha te Vyakaran (1997, in Punjabi), pp. 102-3.
7 Joginder Singh Talwara, Gurbani da Saral Viakarn-Bodh, part I, pp. 27-8.
8 Ibid, p. 103.
Sukhjinder Sidhu
Udaat should be placed after the consonant whose tone is being changed but before the vowel. In many ways, Udaat should be treated as a subjoined consonant. In the following examples, an acute accent indicates the high tone.
ਖੋਿਲਓ (Khōlí 'ō) 0A16 ਖ
ਸੰਮਾਲੇਹ (Samhā́lēhāṁ) 0A38 ਸ
ਓਲਾਮੇ (Ōlāmhḗ) 0A13 ਓ
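The ordering rule for Udaat (after the consonant, before the vowel sign) can be sketched as follows. This is illustrative only: U+0A51 is the code point proposed in this document, the helper names are hypothetical, and the choice of Ma (U+0A2E) with Vowel Sign EE (U+0A47) in the usage note is simply an example pairing.

```python
UDAAT = "\u0A51"  # code point proposed in C1
# Dependent vowel signs of the Gurmukhi block, written out by hand.
VOWEL_SIGNS = frozenset("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")

def with_udaat(consonant, vowel_sign=""):
    """Build consonant + Udaat + optional vowel sign, in the recommended order."""
    return consonant + UDAAT + vowel_sign

def udaat_order_ok(text):
    """True if no Udaat immediately follows a vowel sign."""
    return all(not (prev in VOWEL_SIGNS and cur == UDAAT)
               for prev, cur in zip(text, text[1:]))
```

For instance, `with_udaat("\u0A2E", "\u0A47")` yields Ma, Udaat, Vowel Sign EE in that order, and `udaat_order_ok` rejects the same three characters when Udaat is typed last.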
C2. Yakash (ਯਕਸ਼)
Yakash is found in the Sri Guru Granth Sahib a total of 268 times⁹. The Yakash is commonly said to be a form of the Half Yaiyya character of Gurmukhi¹⁰. Yakash may take up to three variant forms, but it is most commonly shown in Sikh religious texts as a small hook below a consonant. In other texts it is shown as a subjoined Yaiyya without the top bar.

Unlike the forms of Haha and Udaat, which are related to aspirated and high tones, the conjoined forms of Yaiyya have a different explanation¹¹. The pronunciation of the Half Yaiyya character is less ambiguous. It represents the /y/ sound, with the inherent vowel /a/ being suppressed. The problem is related to the Yakash (Pairin Yaiyya). The prevalent view among a section of Gurbani scholars¹² is that ‘y’ is to be regarded as both a vianjan (consonant) and an ardh-svar (semi-vowel). This means that ‘y’ represents both the sounds of /y/ and a number of sounds close to those of Gurmukhi vowels. Giani Harbans Singh (2000) has likewise formulated that the Yakash is used at places where a semi-vowel is to be pronounced.

Here is an example to illustrate this view. We use the word ਿਸਿਖਆ (‘sikhi'ā’), where the ਿ◌ and ਅ are to be replaced by the forms of Yaiyya:
ਿਸਖਯਾ should be pronounced ‘sikhayā’.
ਿਸਖ ਾ should be pronounced ‘sikhyā’.
ਿਸਖਾ should be pronounced with a semi-vowel sound as between ‘sikhyā’ and ‘sikhiā’.
This is related to the evolution of Sanskrit words, from their tatsam (original) to tadbhav (derived) stages. Half Yaiyya was to be used where writings were transcribed into Gurmukhi but their pronunciation remained close to the original term. The second form, with the Yakash, suggests the change in pronunciation, where the consonant sounds moved towards a semi-vowel sound. The present way of writing, where we now use vowel signs, denotes the modern pronunciation of the term.

It is recommended that Yakash be encoded as a separate Unicode character, with the following properties:
0A75;GURMUKHI SIGN YAKASH;Mn;0;NSM;;;;;N;;;;;

Yakash looks like a hook and attaches to the bottom of the bearing consonant.
Yakash, like Udaat, should push down U and UU in the same way that existing subjoined consonants do.
Yakash should be treated as a subjoined consonant.
9 Thind, op.cit.
10 Harkirat Singh, op.cit. p. 104.
11 The information given in this part is largely based upon Giani Harbans Singh, Gurbani Viyakaran (2000), p. 247-50. The views presented herein should not be regarded as scholarly sound, as other writers, such as Harkirat Singh, op.cit., pp. 104-5, have presented alternative views.
12 See Joginder Singh Talwara, op.cit. pp. 24-6 and 32-3, and Giani Harbans Singh, op.cit, p. 248.
D1. Character Annotations
The main Gurmukhi characters should be annotated with their formal Gurmukhi names. The table below lists the code point, letter name, formal transliteration and requested annotation. In some annotations for Nukta characters the word ‘Pairin’ is used. If the named sequences are not changed to ‘Pairin’, then ‘Pari’ should be used for consistency. Letters are listed in alphabetic and not code point order.
Code point
Sassa Pairin Bindi
Khakha Pairin Bindi
Gagga Pairin Bindi
Jajja Pairin Bindi
Phapha Pairin Bindi
Lalla Pairin Bindi Kana
*Denotes a proposed code point.
E1. Proposal Summary
A. Administrative
1. Title: Proposed Changes to Gurmukhi 2
2. Requester’s name: Sukhjinder Sidhu (Punjabi Computing Resource Centre)
3. Requester type (Member body/Liaison/Individual contribution): Individual contribution.
4. Submission date: 2005-08-01
5. Requester’s reference (if applicable):
6. Choose one of the following:
6a. This is a complete proposal: Yes.
6b. More information will be provided later: No.
B. Technical – General
1. Choose one of the following:
1a. This proposal is for a new script (set of characters): No.
1b. The proposal is for addition of character(s) to an existing block: Yes.
1c. Name of the existing block: Gurmukhi
2. Number of characters in proposal: 4
3. Proposed category (see section II, Character Categories): Category C
4a. Proposed Level of Implementation (1, 2 or 3) (see clause 14, ISO/IEC 10646-1: 2000): Level 1
4b. Is a rationale provided for the choice? No.
4c. If YES, reference:
5a. Is a repertoire including character names provided? Yes.
GURMUKHI LETTER SHORT OO OR LONG U
GURMUKHI VOWEL SIGN SHORT OO OR LONG U
GURMUKHI SIGN UDAAT
GURMUKHI SIGN YAKASH
5b. If YES, are the names in accordance with the character naming guidelines in Annex L of ISO/IEC 10646-1: 2000? Yes.
5c. Are the character shapes attached in a legible form suitable for review? Yes.
6a. Who will provide the appropriate computerized font (ordered preference: True Type, or PostScript format) for publishing the standard? Dr K Thind, True Type
6b. If available now, identify source(s) for the font (include address, e-mail, ftp-site, etc.) and indicate the tools used: Development version of AnmolUniBani available by request by emailing [email protected]
7a. Are references (to other character sets, dictionaries, descriptive texts etc.) provided? In document L2/05-088.
7b. Are published examples of use (such as samples from newspapers, magazines, or other sources) of proposed characters attached? In document L2/05-088.
8. Does the proposal address other aspects of character data processing (if applicable) such as input, presentation, sorting, searching, indexing, transliteration etc. (if yes please enclose information)? No.
9. Submitters are invited to provide any additional information about Properties of the proposed Character(s) or Script that will assist in correct understanding of and correct linguistic processing of the proposed character(s) or script. Yes. See above.
C. Technical – Justification
1. Has this proposal for addition of character(s) been submitted before? If YES, explain. Yes, an incomplete proposal was submitted in “Proposed Changes to Gurmukhi” (L2/05-088).
2a. Has contact been made to members of the user community (for example: National Body, user groups of the script or characters, other experts, etc.)? Yes.
2b. If YES, with whom? Jeevan Deol, Anoop Singh, Manudeep Singh, Serjinder Singh, Kulbir Thind and others.
2c. If YES, available relevant documents:
3. Information on the user community for the proposed characters (for example: size, demographics, information technology use, or publishing use) is included? No.
4a. The context of use for the proposed characters (type of use; common or rare): Common (Archaic)
4b. Reference:
5a. Are the proposed characters in current use by the user community? No.
5b. If YES, where?
6a. After giving due considerations to the principles in Principles and Procedures document (a WG 2 standing document) must the proposed characters be entirely in the BMP? Yes.
6b. If YES, is a rationale provided? Yes.
6c. If YES, reference: Additional Gurmukhi characters.
7. Should the proposed characters be kept together in a contiguous range (rather than being scattered)? No.
8a. Can any of the proposed characters be considered a presentation form of an existing character or character sequence? No.
8b. If YES, is a rationale for its inclusion provided?
8c. If YES, reference:
9a. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? Yes.
9b. If YES, is a rationale for its inclusion provided? Yes.
9c. If YES, reference: Yes, see A1. Compatibility with existing conventions.
10a. Can any of the proposed character(s) be considered to be similar (in appearance or function) to an existing character? Yes.
10b. If YES, is a rationale for its inclusion provided? Yes.
10c. If YES, reference: See C1, C2.
11a. Does the proposal include use of combining characters and/or use of composite sequences (see clauses 4.12 and 4.14 in ISO/IEC10646-1: 2000)? Yes.
11b. If YES, is a rationale for such use provided? Yes.
11c. If YES, reference: See above.
12a. Is a list of composite sequences and their corresponding glyph images (graphic symbols) provided? No.
12b. If YES, reference:
13a. Does the proposal contain characters with any special properties such as control function or similar semantics? No.
13b. If YES, describe in detail (include attachment if necessary)
14a. Does the proposal contain any Ideographic compatibility character(s)? No.
14b. If YES, is the equivalent corresponding unified ideographic character(s) identified?