Add some research on diacritic deletion
author Aryeh Gregor <AryehGregor+gitcommit@gmail.com>
Mon, 08 Aug 2011 12:28:58 -0600
changeset 508 ae81251fdcbf
parent 507 1d982f193ec3
child 509 2cf21b27eea3

Someone pointed out to me IRL that at least some version of Word doesn't
behave like the spec for forwardDelete of characters with combining
marks after them. I wasn't able to reproduce in Word 2007, which acts
like the spec says, but IE9 turns out to have quite interesting
behavior. Will follow up.

Reported-By: Tali Fuss
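
For reference, the forwardDelete behavior discussed above can be sketched in isolation. This is a minimal illustration, not the code in implementation.js: the helper names `forwardDeleteEndOffset` and `forwardDeleteText` are hypothetical, the input is assumed to be a plain string standing in for a Text node's data, and `offset` is assumed to be less than the string's length. The character class is the one this changeset uses, so it only covers U+0300-U+036F and a few Hebrew marks, not full UAX#29 grapheme clusters or non-BMP characters.

```javascript
// Hypothetical helper (not in implementation.js): given the data of a Text
// node and a cursor offset, return the offset where a forward delete stops.
function forwardDeleteEndOffset(data, offset) {
	// Same character class as the patched loop in this changeset: basic
	// combining marks plus a handful of Hebrew diacritics.
	var combining = /^[\u0300-\u036f\u0591-\u05bd\u05c1\u05c2]$/;
	// Always delete at least the code unit after the cursor.
	var endOffset = offset + 1;
	// Then swallow any run of combining marks that follows it.
	while (endOffset != data.length && combining.test(data[endOffset])) {
		endOffset++;
	}
	return endOffset;
}

// Hypothetical helper: the string left over after a forward delete at offset.
function forwardDeleteText(data, offset) {
	return data.slice(0, offset) + data.slice(forwardDeleteEndOffset(data, offset));
}
```

With the cursor before the second "o" in "foo\u0308\u0327bar", `forwardDeleteText(data, 2)` removes the base letter together with both combining marks. A precomposed "\u00f6" is a single code unit, so only that one unit goes.
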
editing.html
implementation.js
source.html
tests.js
--- a/editing.html	Mon Aug 08 11:39:48 2011 -0600
+++ b/editing.html	Mon Aug 08 12:28:58 2011 -0600
@@ -63,7 +63,7 @@
 <body class=draft>
 <div class=head id=head>
 <h1>HTML Editing APIs</h1>
-<h2 class="no-num no-toc" id=work-in-progress-&mdash;-last-update-5-august-2011>Work in Progress &mdash; Last Update 5 August 2011</h2>
+<h2 class="no-num no-toc" id=work-in-progress-&mdash;-last-update-8-august-2011>Work in Progress &mdash; Last Update 8 August 2011</h2>
 <dl>
  <dt>Editor
  <dd>Aryeh Gregor &lt;<a href=mailto:[email protected]>[email protected]</a>&gt;
@@ -367,6 +367,9 @@
 
   <li>Allow some type of switch to affect non-editable regions too, perhaps on a
   per-command basis.
+
+  <li>Things like delete, forwardDelete, insertText need to handle non-BMP
+  characters.
 </ul>
 
 <p>Also TODO: Things that are only implemented by a couple of browsers and may
@@ -6405,6 +6408,11 @@
     we'll usually merge the <var title="">offset</var>th child of <var title="">node</var> with
     the last descendant of the <var title="">offset</var> &minus; 1st.
   </ol>
+
+  <p>Unlike forwardDelete, there's no special case for diacritics.  This means
+  backspacing will just delete the last combining diacritic typed, or the whole
+  character if it's precomposed.  This matches everything I tested (IE9,
+  Firefox 7.0a2, Chrome 14 dev, etc.).
   </div>
 
   <p>If <var title="">node</var> is a <code class=external data-anolis-spec=domcore><a href=http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#text>Text</a></code> node and <var title="">offset</var> is not zero,
@@ -7061,11 +7069,27 @@
     <li>Let <var title="">end offset</var> be <var title="">offset</var> plus one.
 
     <li>
-    <p class=comments>TODO: This is probably not right.  We probably want to
-    normalize to grapheme cluster boundaries, using UAX#29 or something.  We
-    also need to handle non-BMP stuff.  The idea is that if the cursor is
-    before a character that precedes a combining mark, you need to delete the
-    combining mark too.
+    <div class=comments>
+    <p>Firefox 7.0a2, Chrome 14 dev, Word 2007, and OpenOffice.org 3.2.1 Ubuntu
+    act as the spec says, getting rid of all diacritics on forward delete.  IE9
+    and Opera 11.50 have no special case and just delete the next character.  I
+    go with Firefox/Chrome/Word/OO.
+
+    <p>However, when I actually type in the text box as opposed to running
+    semi-automated tests, IE9 has magical behavior: it replaces the base
+    character with something that looks like &#9676; U+25CC DOTTED CIRCLE.
+    Further strikes of the delete key remove the diacritics, and the circle
+    vanishes along with the last of them.  I wasn't able to get it to actually
+    replace the base character, so I'm not sure what the point is.  The circle
+    doesn't seem to appear in the DOM, and apparently it disappears in some
+    circumstances.  This might be worth standardizing somehow, I don't know.
+
+    <p>TODO: The way we remove diacritics is probably not right.  We probably
+    want to normalize to grapheme cluster boundaries, using UAX#29 or
+    something.  We also need to handle non-BMP stuff.  The idea is that if the
+    cursor is before a character that precedes a combining mark, you need to
+    delete the combining mark too.
+    </div>
 
     <p>While <var title="">end offset</var> is not <var title="">node</var>'s <a class=external data-anolis-spec=domrange href=http://html5.org/specs/dom-range.html#concept-node-length title=concept-node-length>length</a> and the
     <var title="">end offset</var>th <a href=http://es5.github.com/#x8.4>element</a> of <var title="">node</var>'s <code class=external data-anolis-spec=domcore title=dom-CharacterData-data><a href=http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#dom-characterdata-data>data</a></code> has
@@ -8699,6 +8723,7 @@
   Ehsan Akhgari,
   Tim Down,
   Markus Ernst,
+  Tali Fuss,
   Daniel Glazman,
   Cameron Heavon-Jones,
   Ryosuke Niwa,
@@ -8715,8 +8740,7 @@
   Brett Zamir,
   and
   Boris Zbarsky
-  for their feedback on drafts of this document and participation in related
-  mailing list and Bugzilla discussions
+  for their feedback, participation, or other helpful contributions
   <li>Tab Atkins, Ian Hickson, Glenn Maynard, Ms2ger, Simon Pieters, and most
   of the rest of the <a href=irc://irc.freenode.net/whatwg>#whatwg</a> crowd
   for giving quick online feedback when I have questions or need to solicit
--- a/implementation.js	Mon Aug 08 11:39:48 2011 -0600
+++ b/implementation.js	Mon Aug 08 12:28:58 2011 -0600
@@ -6188,7 +6188,7 @@
 //@}
 ///// The forwardDelete command /////
 //@{
-commands["forwarddelete"] = {
+commands.forwarddelete = {
 	action: function() {
 		// "If the active range is not collapsed, delete the contents of the
 		// active range and abort these steps."
@@ -6267,9 +6267,11 @@
 			// as a Unicode code point, add one to end offset."
 			//
 			// TODO: Not even going to try handling anything beyond the most
-			// basic combining marks.
+			// basic combining marks, since I couldn't find a good list.  I
+			// special-case a few Hebrew diacritics too to test basic coverage
+			// of non-Latin stuff.
 			while (endOffset != node.length
-			&& /^[\u0300-\u036f]$/.test(node.data[endOffset])) {
+			&& /^[\u0300-\u036f\u0591-\u05bd\u05c1\u05c2]$/.test(node.data[endOffset])) {
 				endOffset++;
 			}
 
--- a/source.html	Mon Aug 08 11:39:48 2011 -0600
+++ b/source.html	Mon Aug 08 12:28:58 2011 -0600
@@ -296,6 +296,9 @@
 
   <li>Allow some type of switch to affect non-editable regions too, perhaps on a
   per-command basis.
+
+  <li>Things like delete, forwardDelete, insertText need to handle non-BMP
+  characters.
 </ul>
 
 <p>Also TODO: Things that are only implemented by a couple of browsers and may
@@ -6453,6 +6456,11 @@
     we'll usually merge the <var>offset</var>th child of <var>node</var> with
     the last descendant of the <var>offset</var> &minus; 1st.
   </ol>
+
+  <p>Unlike forwardDelete, there's no special case for diacritics.  This means
+  backspacing will just delete the last combining diacritic typed, or the whole
+  character if it's precomposed.  This matches everything I tested (IE9,
+  Firefox 7.0a2, Chrome 14 dev, etc.).
   </div>
 
   <p>If <var>node</var> is a [[text]] node and <var>offset</var> is not zero,
@@ -7115,11 +7123,27 @@
     <li>Let <var>end offset</var> be <var>offset</var> plus one.
 
     <li>
-    <p class=comments>TODO: This is probably not right.  We probably want to
-    normalize to grapheme cluster boundaries, using UAX#29 or something.  We
-    also need to handle non-BMP stuff.  The idea is that if the cursor is
-    before a character that precedes a combining mark, you need to delete the
-    combining mark too.
+    <div class=comments>
+    <p>Firefox 7.0a2, Chrome 14 dev, Word 2007, and OpenOffice.org 3.2.1 Ubuntu
+    act as the spec says, getting rid of all diacritics on forward delete.  IE9
+    and Opera 11.50 have no special case and just delete the next character.  I
+    go with Firefox/Chrome/Word/OO.
+
+    <p>However, when I actually type in the text box as opposed to running
+    semi-automated tests, IE9 has magical behavior: it replaces the base
+    character with something that looks like &#x25cc; U+25CC DOTTED CIRCLE.
+    Further strikes of the delete key remove the diacritics, and the circle
+    vanishes along with the last of them.  I wasn't able to get it to actually
+    replace the base character, so I'm not sure what the point is.  The circle
+    doesn't seem to appear in the DOM, and apparently it disappears in some
+    circumstances.  This might be worth standardizing somehow, I don't know.
+
+    <p>TODO: The way we remove diacritics is probably not right.  We probably
+    want to normalize to grapheme cluster boundaries, using UAX#29 or
+    something.  We also need to handle non-BMP stuff.  The idea is that if the
+    cursor is before a character that precedes a combining mark, you need to
+    delete the combining mark too.
+    </div>
 
     <p>While <var>end offset</var> is not <var>node</var>'s [[length]] and the
     <var>end offset</var>th [[strel]] of <var>node</var>'s [[cddata]] has
@@ -8774,6 +8798,7 @@
   Ehsan Akhgari,
   Tim Down,
   Markus Ernst,
+  Tali Fuss,
   Daniel Glazman,
   Cameron Heavon-Jones,
   Ryosuke Niwa,
@@ -8790,8 +8815,7 @@
   Brett Zamir,
   and
   Boris Zbarsky
-  for their feedback on drafts of this document and participation in related
-  mailing list and Bugzilla discussions
+  for their feedback, participation, or other helpful contributions
   <li>Tab Atkins, Ian Hickson, Glenn Maynard, Ms2ger, Simon Pieters, and most
   of the rest of the <a href=irc://irc.freenode.net/whatwg>#whatwg</a> crowd
   for giving quick online feedback when I have questions or need to solicit
--- a/tests.js	Mon Aug 08 11:39:48 2011 -0600
+++ b/tests.js	Mon Aug 08 12:28:58 2011 -0600
@@ -286,8 +286,16 @@
 		'<span>foo[</span><span>]bar</span>',
 		'foo<span style=display:none>bar</span>[]baz',
 		'foo<script>bar</script>[]baz',
+
 		'fo&ouml;[]bar',
 		'foo&#x308;[]bar',
+		'foo&#x308;&#x327;[]bar',
+		'&ouml;[]bar',
+		'o&#x308;[]bar',
+		'o&#x308;&#x327;[]bar',
+
+		'&#x5e9;&#x5c1;&#x5b8;[]&#x5dc;&#x5d5;&#x5b9;&#x5dd;',
+		'&#x5e9;&#x5c1;&#x5b8;&#x5dc;&#x5d5;&#x5b9;[]&#x5dd;',
 
 		'<p>foo</p><p>[]bar</p>',
 		'<p>foo</p>[]bar',
@@ -1116,6 +1124,12 @@
 		'fo[]&ouml;bar',
 		'fo[]o&#x308;bar',
 		'fo[]o&#x308;&#x327;bar',
+		'[]&ouml;bar',
+		'[]o&#x308;bar',
+		'[]o&#x308;&#x327;bar',
+
+		'[]&#x5e9;&#x5c1;&#x5b8;&#x5dc;&#x5d5;&#x5b9;&#x5dd;',
+		'&#x5e9;&#x5c1;&#x5b8;&#x5dc;[]&#x5d5;&#x5b9;&#x5dd;',
 
 		'<p>foo[]</p><p>bar</p>',
 		'<p>foo[]</p>bar',