The Carbon version of Gecko doesn’t interoperate with anything
but other Carbon Gecko processes. I figured I should try to do better
with the Cocoa nsClipboard
.
This stuff is so underdocumented that it isn’t even funny. This document is written so that others might find something when they search the Web.
The clipboard in Gecko is implemented around two interfaces:
nsIClipboard
and nsITransferable
.
There’s a single service instance of nsIClipboard
. It
provides methods for putting an nsITransferable
on the
clipboard, querying the clipboard for flavor availability and filling
an nsITransferable
from the contents of the clipboard.
An nsITransferable
instance is a wrapper for the data
being transferred through the clipboard. An nsITransferable
instance can wrap multiple alternative representations of the
data—for example HTML and plain text. On copy, the nsITransferable
instance advertises an array of flavors it can provide. The array is
ordered by the most high-fidelity representations coming first. On
paste, the nsITransferable
instance advertises an array
of flavors it can accept.
A flavor has a name (a C string—in practice a pseudo-MIME type
in ASCII), data (nsISupports
) and length (I have no idea
why). In theory, the data can be any XPCOM object instance. In
practice, it is always an nsISupportsString
, an
nsISupportsCString
or an nsIImage
. (More on
this later.)
At least in theory, an nsITransferable
can promise
data. That is, the data is not necessarily traveling in the
transferable until it is requested.
For historical reasons, the clipboard in Cocoa is called “pasteboard” in the API. On the UI layer, it is called “clipboard” for consistency with the Mac tradition.
On OS X, the inter-process pasteboards are implemented as a server
process called pbs. (Yes, there’s support for multiple
pasteboards.) Applications don’t talk to pbs directly. Instead,
they use a Cocoa or Carbon API that takes care of the inter-process
communication with pbs. In Cocoa, the communication happens through
the NSPasteboard
class.
The app asks the NSPasteboard
class to provide an
NSPasteboard
object for a particular pasteboard. The
object can be queried for available flavors and the flavor can be
retrieved. Also the app can use the object to promise flavors. The
data can then be written immediately or provided on a later callback.
Gecko used to have an internal plain text flavor called
text/plain
, which meant an nsISupportsCString
in the platform encoding. The concept of “platform encoding” is
seriously defective. Luckily we no longer need to pretend that Mac
users can only deal with MacRoman.
I did not implement any support for the obsolete text/plain
flavor.
The contemporary flavor for moving plain text around inside Gecko
is called text/unicode
. (Notice the completely bogus
MIME type.) It is host-order UTF-16 without a BOM using LF line
breaks (must use LF) as an nsISupportsString
.
The Cocoa way of passing around Unicode plain text is
NSStringPboardType
. It is usually written and read using
convenience methods that use NSString
s. The actual
clipboard data format is UTF-8 without a BOM and without a \0
terminator.
Cocoa apps typically write LF line breaks to the pasteboard. Line
breaks are preserved between Cocoa apps. Cocoa apps automatically
also see “NeXT plain ascii pasteboard type
” as the
last available flavor on the pasteboard when NSStringPboardType
has been provided. No sane app should try to tamper with the lossy
legacy NeXT flavor. When a Cocoa app has provided NSStringPboardType
,
Carbon and Classic apps see utxt
and TEXT
flavors on the clipboard. LF linebreaks are automatically converted
to CR linebreaks.
TEXT
is MacRoman (with lossy conversion, obviously)
without \0
-termination. Sane apps should avoid it if
they can help it.
Traditionally on PPC, utxt has been big-endian UTF-16 (or more likely originally UCS2) without a BOM and without U+0000-termination. At 10.2, Mac OS X started putting a BOM in utxt when converting from NSStringPboardType. Apple backed out the change due to compatibility problems, but said that it will reappear later. Frankly, I think it wasn’t the right thing to do. Retroactively changing the format of a clipboard flavor is not the “right thing” to do. The Right Thing to do would be either keeping it big-endian ad infinitum even on little-endian hosts or redefining the flavor as being in the host order and making Rosetta byte-swap utxt to and from PPC apps. I didn’t find proper documentation on this and I don’t have an Intel Mac to test with, but according to the Universal Binary Programming Guidelines, 2nd ed., utxt is now BOMless for good and there’s a new ut16 flavor that has a BOM. The document doesn’t say whether utxt is big-endian or in the host order, though. Also, the benefits of the flavor proliferation are not obvious to me. And, of course, Apple didn’t revise the technote stating that BOM in utxt may reappear.
Fortunately, Cocoa developers don’t need to worry about utxt
byte order or BOM. When a Carbon app (or a Classic app) has put utxt
(or TEXT
) on the clipboard, Cocoa apps see
NSStringPboardType
in addition to the original flavors
written by the Carbon app. The Cocoa app just reads
NSStringPboardType
and leaves the byte order issue to
Apple. There’s one catch though: unlike NSStringPboardType
to utxt
conversion, utxt
to
NSStringPboardType
conversion does not change line
breaks. Therefore, NSStringPboardType
has LF line breaks
if the data was exported by a typical Cocoa app but has CR line
breaks if the data was exported (as utxt
) by a typical
Carbon app. Since Gecko requires LF line breaks, the clipboard
implementation has to make sure that each CR is replaced with a LF.
(Note that sane Mac apps don’t export CRLF line breaks to the
pasteboard, so there’s no need to check for those.)
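The line break fix-up described above is a one-liner. The following is a minimal Python sketch of the idea, not the actual Gecko code; the function name is made up for illustration.

```python
def normalize_line_breaks(text: str) -> str:
    """Replace every CR with LF, since Gecko's text/unicode flavor requires LF.

    CRLF pairs are deliberately not special-cased: per the text above, sane
    Mac apps are assumed not to export them.
    """
    return text.replace("\r", "\n")


# Text as a typical Carbon app would export it (CR line breaks).
carbon_text = "first line\rsecond line\rthird line"
print(normalize_line_breaks(carbon_text).split("\n"))
```

Text that already uses LF (the typical Cocoa case) passes through unchanged.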
HTML is so common nowadays that one might expect there to be a system-wide pasteboard flavor for HTML. Would be reasonable, right? After all, Windows is known to enable copying and pasting HTML between apps.
The documentation for NSPasteboard lists a type called
NSHTMLPboardType
. Whoopee!
The type is documented as follows: “HTML (which an NSTextView
object can read from, but not write to)”. That’s it! Really. Google
for it and you find someone asking for the exact format but no one
replies. I tried to find apps on my system that export
NSHTMLPboardType
but found none.
Since docs were lacking, I dumped the clipboard exports on my own system and visited other Mac users who have apps that could potentially export HTML.
When HTML is copied in Gecko, four interrelated flavors are put in
the nsITransferable
instance: text/html
,
text/_moz_htmlcontext
, text/_moz_htmlinfo
and text/x-moz-url-priv
. These all QI to
nsISupportsString
(host-order UTF-16 string without a
BOM).
text/html
contains a rootless serialization of the
selection DOM range that was copied. text/_moz_htmlcontext
contains a serialized doctypeless tag soup document that parses into
a branchless document tree that only contains the element nodes from
the root to the parent of the selection (including the parent) with
attributes present. (At least I think that’s what it contains. I
haven’t investigated what happens if the range starts and ends in a
different parent.) text/_moz_htmlinfo
contains a string
representation of two numbers (base ten?) separated by a comma. I
have no idea what they mean. text/x-moz-url-priv
contains the URI of the document from which the selection was copied
(about:blank
if a real URI is unavailable).
Note how the concept of alternative representations is abused
here. The different flavors augment each other instead of being
alternatives. Also note how the MIME type text/html
is
used for labeling a fragment instead of a full document and how the
other types are private and two of the private types don’t follow
the naming convention for private types (those two are also
undocumented).
Upon paste, if text/html
is unavailable, Gecko tries
to read application/x-moz-nativehtml
, which means the
Microsoft Windows CF_HTML clipboard data as an nsISupportsCString
.
(Note that Gecko internals don’t themselves export this
flavor.)
Carbon Gecko writes the data from the nsISupportsString flavors as BOMless UTF-16 to the clipboard. (I haven’t investigated the byte order on Intel, but I’d expect host order and I’ve heard about problems with Gecko apps in Rosetta and native Gecko apps not interoperating, which makes perfect sense.)
Each Gecko-internal flavor is mapped to a Carbon scrap type, whose
16 most significant bits are “MZ
” when interpreted
as MacRoman. The lower 16 bits contain an integer. (Carbon scrap
types are 32-bit integers that are usually interpreted as four
MacRoman characters with the most-significant byte as the leftmost
character.) The Carbon Gecko clipboard implementation assigns an
integer from a counter to each Gecko-internal nsISupportsString
-based
flavor as they are encountered by the clipboard implementation. The
mappings are remembered for the lifetime of the Gecko process.
As the least favored scrap type, Carbon Gecko exports a MOZm
scrap whose data contains mapping from the generated MZ
types to Gecko-internal flavors, so that other Carbon Gecko processes
can import the autogenerated scrap types.
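To make the scrap type scheme concrete, here is a rough Python model of it. This is a sketch under assumptions, not the Carbon Gecko code: the class and function names are invented, and starting the counter at zero is a guess.

```python
class ScrapTypeRegistry:
    """Hands out 32-bit scrap types whose upper 16 bits spell "MZ"."""

    def __init__(self):
        self._by_flavor = {}
        self._next_index = 0  # assumption: the real counter may start elsewhere

    def scrap_type_for(self, flavor: str) -> int:
        # A mapping, once made, is remembered for the lifetime of the registry.
        if flavor not in self._by_flavor:
            if self._next_index > 0xFFFF:
                raise OverflowError("ran out of 16-bit indexes")
            prefix = (ord("M") << 24) | (ord("Z") << 16)
            self._by_flavor[flavor] = prefix | self._next_index
            self._next_index += 1
        return self._by_flavor[flavor]

    def mapping(self) -> dict:
        """The information a MOZm-style mapping scrap would have to carry."""
        return {scrap: flavor for flavor, scrap in self._by_flavor.items()}


def four_char_prefix(scrap_type: int) -> str:
    """Interpret the two most significant bytes as MacRoman characters."""
    return (scrap_type >> 16).to_bytes(2, "big").decode("mac_roman")


registry = ScrapTypeRegistry()
html_type = registry.scrap_type_for("text/html")
print(hex(html_type), four_char_prefix(html_type))
```

Because the indexes depend on the order in which flavors are encountered, two processes can disagree about what a given MZ type means, which is why the mapping has to travel along on the clipboard.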
Curiously, Opera 9.0 doesn’t export any HTML flavor at all. It
just exports a plain text representation as utxt
(and
TEXT
). Apparently, Opera isn’t maintaining an
app-internal HTML clipboard, either, because Opera-to-Opera copying
and pasting (into a contenteditable
part of an HTML
document) doesn’t preserve HTML formatting.
As of Mac OS X 10.3.9, WebKit exports copied HTML to the
pasteboard in its Web Archive format. According to Apple’s
documentation, the constant WebArchivePboardType
is
available only starting with Mac OS X 10.4. The header from Mac OS X
10.3.9 seems to have the constant, though. Hmm… Anyway, to be safe,
the constant resolves to the NSString
“Apple Web
Archive pasteboard type
”.
I don’t know if the format of the Web Archive is documented. I didn’t find any documentation. However, there’s a documented API for looking inside the mystery bag of bytes. As far as I can tell, there’s no documentation on what one should expect to find inside the archive in the pasteboard case.
If the copied data originated in a text/html
document, the main resource of the Web Archive claims to have the
MIME type text/html
. However, as with Gecko, the MIME
type label doesn’t mean that you should expect to find a full HTML
document. Instead, the main resource is a rootless serialization of
the selection range like the Gecko text/html
flavor with
these exceptions:
If the document of origin had a doctype, the serialization starts with a reconstruction of that doctype.
Elements whose computed style differs from what it would be
with the UA style only get a style
attribute with the
differing computed style serialized.
Each text node is wrapped in a span
element
whose class is Apple-style-span
. The span
has a style
attribute that repeats computed style that
differs from UA defaults at that point in the document.
The resulting document fragment is exceedingly crufty. Clearly, whoever designed the requirements for this feature did not think in terms of semantic markup but instead continued to believe in the MacWrite legacy notion of rich text where rich text is a string of characters with font properties attached to each character and line breaks acting as paragraph separators. It seems that whoever implemented this tried to make the semantic markup recoverable (if the recipient cares to do some scrubbing) while also satisfying very structure-hostile presentational requirements.
The weirdest and most extreme symptom of the MacWrite
legacy-influenced structure-hostile thinking is what happens when
pasting structural markup. Suppose you have an h2
element with text content on the clipboard. If the insertion point is
on what looks like a blank line on its own when you paste, you get an
h2
element in the DOM. However, if there is text on the
line when you paste, you don’t get an h2
element in
the DOM! Instead, you get a span
with a style
attribute that reproduces the font properties of an h2
!
The joke is that Gecko and WebKit have a reputation of being more
standards-oriented than Trident, but when it comes to editing block
elements, Trident gets it right and both Gecko (in the form of br
elements) and WebKit exhibit block-hostile presentationalism that
anyone trying to build a CMS for structural markup will hate.
But back to the format.
The Web Archive is byte-oriented, so the HTML fragment needs to be in some character encoding. WebKit seems to write it always in UTF-8 regardless of the encoding of the source document, although I don’t see it promised anywhere in documentation. I’m going to expect that it is always in UTF-8. If some program other than WebKit chooses to export something other than UTF-8 to the clipboard inside a Web Archive, instead of dealing with it in my code, I think the developer of the other program needs to be attitude-readjusted with a cluestick. (See GoLive below.)
There’s another kind of source document leak, though. If the
source document had the MIME type application/xhtml+xml
,
the fragment exported to the clipboard will be doctypeless (good) and
have that MIME type even though the fragment is rootless and produced
with a namespace-unaware serializer.
The main improvement of the Web Archive format over what has
existed before is that if the copied selection encompasses img
elements, the images are also transferred inside the Web Archive.
Unfortunately, this feature does not map nicely to Gecko internals.
On Windows MS Office is known to export CF_HTML to the clipboard,
and this is even documented
(well, on the 0.9 level at least even if the export is 1.0). It turns
out that MS Office on Mac exports CF_HTML as well. Interestingly,
whatever Carbon type they use shows up as NSHTMLPboardType
on the Cocoa side (no conversion of data—just flavor name mapping)!
When Googling for NSHTMLPboardType
, I had discovered
a version control log entry for WebKit stating that they don’t use
NSHTMLPboardType
due to problems with Word. Perhaps they
were exporting a straight fragment without the CF_HTML wrapping or
something.
Word puts what appear to be descriptors pointing to internal stuff
and RTF on the export list before CF_HTML. Since I wouldn’t be
exporting RTF from Gecko, I checked that NSTextView
really works with CF_HTML. I captured the clipboard from Word and
re-exported only CF_HTML and plain text. NSTextView
accepted CF_HTML just fine (so the docs didn’t lie). However,
normally when pasting from Word NSTextView
takes the RTF
version.
Note that CF_HTML is exported even if the document being edited is not an HTML document.
MS Office exports CF_HTML 1.0 but imports 0.9 just fine.
If I recall my experiences correctly, OpenOffice.org on Windows supports CF_HTML. Not so on Mac, alas. NeoOffice 2.0 Alpha 4 patch 3 can copy and paste HTML internally. However, it exports only RTF and plain text to the system clipboard. Interestingly, it doesn’t even export a private marker on the system clipboard, so it has to have another mechanism for tracking whether another app has put something on the clipboard since the time NeoOffice last put some HTML on its internal clipboard.
Anyway, under these circumstances, I can’t make Cocoa Gecko interoperate on the HTML level with NeoOffice.
It turns out that Dreamweaver’s idea of an HTML clipboard flavor
is closest to Gecko’s internal text/html
flavor.
Dreamweaver exports and imports a scrap flavor called DwUH
.
It contains the HTML source (as seen in the source view)
corresponding to the selection in the layout view as
U+0000-terminated BOMless UTF-16. On PPC the byte order is
big-endian, but there’s no way of knowing whether Macrodobia
deliberately or accidentally changes the meaning of the scrap type on
Intel (to host-order making native and Rosetta apps not
interoperate).
The U+0000-terminator is important. The system pasteboard is designed for transferring arbitrary binary data. Hence, the data buffer has an explicit length. Dreamweaver happily ignores the length and reads until it sees a U+0000—even if that means reading past the end of the buffer.
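The DwUH byte layout described above can be sketched in Python as follows. The function names are hypothetical, and the big-endian byte order is only what was observed on PPC.

```python
def encode_dwuh(html_source: str) -> bytes:
    """Big-endian UTF-16, no BOM, followed by a U+0000 terminator."""
    return html_source.encode("utf-16-be") + b"\x00\x00"


def decode_dwuh(data: bytes) -> str:
    """Decode DwUH data, stopping at the first U+0000 if there is one.

    Unlike Dreamweaver as described above, this never looks past the end
    of the buffer: the explicit length of `data` is the hard limit.
    """
    text = data.decode("utf-16-be")
    end = text.find("\x00")
    return text if end == -1 else text[:end]


payload = encode_dwuh("<p>Hello</p>")
print(len(payload), decode_dwuh(payload))
```

An exporter that forgets the two trailing zero bytes produces data that this decoder still handles, but that Dreamweaver would read past.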
Like Dreamweaver, GoLive also exports the piece of HTML source as seen in the source view that corresponds to the selection in the layout view. There’s a crucial difference, though: Whereas all the apps discussed above use either UTF-8 or UTF-16 on the clipboard depending on the app, GoLive uses the character encoding of the source document on the clipboard! That’s some bad craziness!
GoLive exports the HTML fragment twice: first as GLTE
and then as GLML
. I was unable to come up with any test
conditions that would cause these two flavors to get different
contents. The data does not have a null character at the end. The
interpretation of these flavors depends on a third flavor: MENV
.
It contains an XML document like this:
<MarkupEnv version="1"> <base url="URL"/> <urlsettings> <urlhandling version="1" casesensitive="yes" linksabsolute="no" autoaddmailto="yes" honorcgiparameters="yes" cgiparameters="?" hhescapingState="complete" encoding="UTF8"/> </urlsettings> <markuptype markuptype="2020111469"/> <encoding charset="utf-8"/> <structure kind="area" boxClass="htmT"/> <actions/> <selectedVar name=""/> </MarkupEnv>
Note how UTF-8 is labeled differently when describing markup and when describing URLs.
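For illustration, a reader that wanted to honor GoLive's scheme would have to pull the charset out of MENV before decoding GLML. Here is a minimal Python sketch, assuming MENV always has the shape shown above; the function names and the shortened sample document are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def charset_from_menv(menv_xml: str) -> Optional[str]:
    """Return the charset declared by the <encoding> child of <MarkupEnv>."""
    root = ET.fromstring(menv_xml)
    element = root.find("encoding")
    return element.get("charset") if element is not None else None


def decode_golive_fragment(fragment: bytes, menv_xml: str) -> str:
    """Decode GLML/GLTE bytes using the charset named in MENV."""
    charset = charset_from_menv(menv_xml)
    if charset is None:
        raise ValueError("MENV does not declare a charset")
    # Raises LookupError if Python has no codec for the declared charset.
    return fragment.decode(charset)


# Shortened, hypothetical MENV for a document that is not in UTF-8.
SAMPLE_MENV = (
    '<MarkupEnv version="1"><base url="URL"/>'
    '<encoding charset="iso-8859-1"/></MarkupEnv>'
)
print(decode_golive_fragment(b"<p>caf\xe9</p>", SAMPLE_MENV))
```

Note that the sketch reads the `charset` attribute of the `encoding` element, not the `encoding` attribute of `urlhandling`, which describes URLs.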
GoLive also writes a flavor called GLBx
, but I
haven’t been able to guess the purpose of this flavor. It seems to
always contain the same six bytes.
In the Cocoa clipboard code that I wrote for Gecko, as the last
resort, if the flavor does not get special treatment and the data QIs
to nsISupportsString
, I generate a Cocoa flavor name by
prepending “Mozilla nsISupportsString
” to the Gecko
flavor string and use the UTF-8 representation of the
nsISupportsString
as the pasteboard data. (Neither BOM nor \0-termination.)
Since Cocoa works with flavor strings like Gecko, there’s no need to map strings to 32-bit flavor identifiers. However, as a result, you cannot read these flavors through the legacy Carbon API.
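The fallback mapping is simple enough to model in a few lines of Python. This is a sketch, not the Objective-C++ code in Gecko; the function names are invented, and the single space between the prefix and the Gecko flavor is inferred from the example name “Mozilla nsISupportsString text/html”.

```python
PREFIX = "Mozilla nsISupportsString "  # trailing space is an assumption


def to_pasteboard(gecko_flavor: str, text: str):
    """Return (Cocoa type name, data bytes) for a string-valued Gecko flavor."""
    # UTF-8, with neither a BOM nor a trailing zero byte.
    return PREFIX + gecko_flavor, text.encode("utf-8")


def from_pasteboard(type_name: str, data: bytes):
    """Reverse the mapping; return None for types that lack the prefix."""
    if not type_name.startswith(PREFIX):
        return None
    return type_name[len(PREFIX):], data.decode("utf-8")


name, data = to_pasteboard("text/_moz_htmlinfo", "0,0")
print(name, data)
```

Since the Gecko flavor name is embedded in the Cocoa type name, no side table like the Carbon MOZm scrap is needed.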
text/html
is exported the same way—that is, as
“Mozilla nsISupportsString text/html
”. However, also
representations in the Web Archive, CF_HTML, Dreamweaver and GoLive
formats are written to the pasteboard (in that order). I didn’t
implement support for transferring HTML to and from Carbon Gecko. I
figured that once Firefox and Thunderbird switch over to Cocoa, users
will upgrade all their Gecko-based apps in active use to versions
that use the Cocoa widget implementation, so interop with the old
Carbon builds would be wasted effort. (Of course, you still get plain
text copying and pasting between Carbon and Cocoa Geckos.)
When exporting to Web Archive, CF_HTML and to the GoLive flavors,
the Gecko text/x-moz-url-priv
flavor is appropriately
mapped to the source URL of the fragment in these formats.
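As an example of the source URL mapping, here is a rough Python sketch of wrapping a fragment in CF_HTML. The header layout (byte offsets padded to a fixed width, the StartFragment and EndFragment comments, the optional SourceURL line) follows my understanding of Microsoft's description of the format and should be checked against it. The function name and the minimal html/body wrapper are choices made for this sketch, not what Gecko writes.

```python
def build_cf_html(fragment: str, source_url: str = "") -> bytes:
    """Wrap an HTML fragment in a CF_HTML 0.9 envelope encoded as UTF-8."""
    before = b"<html><body>\r\n<!--StartFragment-->"
    after = b"<!--EndFragment-->\r\n</body></html>"
    body = fragment.encode("utf-8")

    def header(start_html, end_html, start_frag, end_frag):
        lines = [
            "Version:0.9",
            f"StartHTML:{start_html:010d}",
            f"EndHTML:{end_html:010d}",
            f"StartFragment:{start_frag:010d}",
            f"EndFragment:{end_frag:010d}",
        ]
        if source_url:
            lines.append("SourceURL:" + source_url)
        return ("\r\n".join(lines) + "\r\n").encode("utf-8")

    # The offsets are zero-padded to a fixed width, so the header length can
    # be measured with dummy values before the real offsets are known.
    start_html = len(header(0, 0, 0, 0))
    start_frag = start_html + len(before)
    end_frag = start_frag + len(body)
    end_html = end_frag + len(after)
    return header(start_html, end_html, start_frag, end_frag) + before + body + after


print(build_cf_html("<b>hi</b>", "http://example.com/").decode("utf-8"))
```

All offsets count bytes from the start of the data, not characters, which matters as soon as the fragment contains non-ASCII text.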
The DwUH
flavor is considered to be big-endian
regardless of host, since now the way to run Dreamweaver on an Intel
Mac is on Rosetta. I hope Macrodobia keeps the flavor big-endian on
Intel Macs when they ship a Universal Binary or use a different
flavor identifier for a little-endian version. Of course, chances are
that they don’t care and make it impossible to interoperate with both a
native version and a Rosetta-hosted version without guesswork. (I
have posted about this to macromedia.dreamweaver.appdev, but it appears
that members of the Dreamweaver team don’t respond there.)
On pasting, when the nsITransferable instance advertises that it accepts text/html, the clipboard implementation looks for “Mozilla nsISupportsString text/html”, “Apple Web Archive pasteboard type”, DwUH and GLML (in that order).
In the case of Web Archive and GLML
, only UTF-8 is
accepted. With Web Archive, the code checks that the data claims to
be UTF-8. With GLML
the code checks whether the data
looks like UTF-8. New documents in GoLive default to UTF-8, and
supporting non-UTF-8 craziness is just not worth it. With Web
Archive, the URL of the main resource is mapped to Gecko’s
text/x-moz-url-priv
flavor. Doing the source URL mapping
when pasting from GoLive didn’t seem worth the trouble. Besides,
with Dreamweaver and GoLive the main use case is copying from Gecko.
Pasting to Gecko is there only for completeness.
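The “looks like UTF-8” test for GLML can be approximated by attempting a strict decode. This is a Python sketch of the idea rather than the check the Gecko code actually performs, and the function name is made up.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("naïve".encode("utf-8")))
print(looks_like_utf8("naïve".encode("latin-1")))
```

Pure ASCII data passes the test whatever encoding GoLive thinks it is in, which is harmless because the bytes are the same either way.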
I did not implement any scrubbing of the style
attribute cruft that WebKit exports. However, such scrubbing is
needed. Suppose a user uses Safari as the browser and Thunderbird as
the mail app and copies something from a Web page to an HTML email
message. It is really hard if not impossible (I couldn’t figure it
out quickly) to get rid of WebKit’s character formatting cruft
using the Thunderbird UI.
Frankly, I think Apple’s approach to copying “rich text” is just wrong. In general, regardless of HTML, if you have one text in 14 pt Times and another in 12 pt Palatino, why would you want the font and size to be preserved when copying? The first thing you have to do after pasting is making the font match the font of the target document. I often paste into TextWrangler and recopy in order to get rid of RTF on the clipboard. Perhaps I should write a small program to automate this. But I digress.
When the nsITransferable
instance advertises that it accepts application/x-moz-nativehtml
,
the clipboard implementation looks for CF_HTML, i.e.
NSHTMLPboardType
.
When the user invokes the context menu on an image, Gecko-based apps typically offer a menu command for copying the image to clipboard. As far as I can tell there are no cases where an existing Gecko-based app tried to read a bitmap from the clipboard.
The IDL for nsITransferable
has defines for four
image flavors: image/png
, image/gif
,
image/jpeg
and application/x-moz-nativeimage
.
As far as I can tell, the three image/*
are not actually
used. However, code for other platforms triggers image copying for all
the four types, so I did, too. When the transferable is an image, the
data object QIs to nsIImage
.
When the QuickDraw gfx is used, the image QIs to nsIImageMac, which can write itself into a PicHandle. Even the Cocoa pasteboard supports a type called NSPICTPboardType, so writing the image to the pasteboard is easy.
When the Thebes gfx is used, there’s no easy way to get a PICT, so it makes sense to endeavor to get a TIFF that can be put onto the pasteboard as NSTIFFPboardType. The part of the Cocoa API that makes it possible to take a raw bitmap buffer and turn it into a TIFF is NSBitmapImageRep.
NSBitmapImageRep
looks versatile on the surface in
terms of the raw buffer formats that it accepts. But it isn’t that
versatile really. It assumes a row order (bottom to top?) that is the
opposite of what Thebes uses (top to bottom?), so I had to reverse
the row order manually. Also NSBitmapImageRep
understands ARGB (which is what Thebes uses) only since Tiger and
Gecko is still supposed to support Panther, so I had to rotate the
pixel words to the RGBA format myself.
Thebes uses ARGB as the sample order within the 32-bit pixel word
as seen through C bitwise operations. This means that on
little-endian systems Thebes actually uses BGRA if you look at the
memory buffer iterating by bytes. The documentation for
NSBitmapImageRep
does not say whether its notion of RGBA
means ABGR on Intel.
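The two bitmap fix-ups (reversing the row order and rotating each pixel word) look like this in a Python sketch. The real code works on raw buffers in C++; the function names here are invented. Serializing each word most significant byte first is one way to sidestep the host byte order question, since the bytes then come out as R, G, B, A on any host.

```python
def argb_to_rgba(pixel: int) -> int:
    """Rotate a 32-bit 0xAARRGGBB word left by 8 bits to get 0xRRGGBBAA."""
    return ((pixel << 8) & 0xFFFFFFFF) | (pixel >> 24)


def flip_rows(pixels: list, width: int) -> list:
    """Reverse the order of the rows in a flat, row-major pixel list."""
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    rows.reverse()
    return [pixel for row in rows for pixel in row]


def to_rgba_bytes(pixels: list, width: int) -> bytes:
    """Flip the rows, convert each pixel, and emit R, G, B, A byte order."""
    converted = (argb_to_rgba(p) for p in flip_rows(pixels, width))
    return b"".join(p.to_bytes(4, "big") for p in converted)


# A 1x2 image: a half-transparent pixel above an opaque one.
print(to_rgba_bytes([0x80112233, 0xFF445566], 1).hex())
```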
Of course, when dealing with Unicode and dealing with it properly, one has to test with astral characters. It turns out that Word, Dreamweaver and GoLive don’t support them, so there’s nothing to deal with.