Outlining the “Ultimate” Blogging Server

Since I read Matthew “mpt” Thomas’ outline of “The ultimate Weblogging system”, I’ve been thinking about what a really good blogging system or news site content management system would be like. Here’s my attempt at outlining the “ultimate” blogging server.

I have no experience in keeping a personal blog. However, I read some blogs and I have observed others trying to make tag soup in—tag soup out blogging systems behave. Also, I have participated in improving the content management system of Macsanomat—a news site which is technically rather blog-like. I also write news items for Macsanomat.

Input
- Check input for syntactic sanity. Do not just copy the bytes into a database.
  - Most blogging systems and content management systems in general seem to be tag soup in—tag soup out systems.
- Ability to ingest nearly-valid tag soup
  - Needed for common blogging APIs
  - Needed for tag souper form interface
  - Reject tag soup that cannot be reasonably converted to the internal storage format.
- Unicode-savvy
  - The XML-RPC spec (on which common blogging APIs are based) says that the string data type is an ASCII string type, which seems nonsensical considering that the XML spec requires XML processors (on which XML-RPC implementations are supposed to be based) to handle Unicode strings. On the other hand, the update section of the spec says that strings can contain any characters. I suppose these can be dismissed as spec bugs and the string type can in practice be used to transfer any XML characters.
    
    Update: Dave Winer has removed “ASCII” from the spec.
- Interfaces
  - HTML form-based interface
  - At least one popular XML-RPC-based blogging API
- Help the author with boring tasks
  - Extract metadata from link targets for use in attributes of <a> (unless the author has explicitly provided the attributes).
    - Provide the title attribute from target page <title>
    - Provide the hreflang attribute from target page language
      - French bloggers in particular like to warn readers about English-language link targets by using different styling based on the hreflang attribute.
- Make it possible to upload images.
  - Crush PNG images using pngcrush automatically.
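The link metadata extraction outlined under “Help the author with boring tasks” could be sketched roughly as follows. This is only an illustration using Python’s standard `html.parser`; the names `LinkTargetMetadata` and `suggest_link_attributes` are hypothetical, and the sketch assumes the target page has already been fetched and decoded to a string.

```python
from html.parser import HTMLParser


class LinkTargetMetadata(HTMLParser):
    """Collects the <title> text and the root language of a target page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.language = None
        self.title = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.language is None:
            attributes = dict(attrs)
            # Accept either the HTML or the XHTML way of labeling language.
            self.language = attributes.get("lang") or attributes.get("xml:lang")
        elif tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of white space so the value suits an attribute.
            self.title = " ".join("".join(self._title_parts).split())

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def suggest_link_attributes(target_html, explicit=None):
    """Return attributes for <a>; values the author typed always win."""
    parser = LinkTargetMetadata()
    parser.feed(target_html)
    parser.close()
    suggested = {}
    if parser.title:
        suggested["title"] = parser.title
    if parser.language:
        suggested["hreflang"] = parser.language
    suggested.update(explicit or {})
    return suggested
```

Attributes the author supplied explicitly are merged last, so the extracted values only fill in what is missing.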
Output
- mpt-approved URLs
  - If the author is writing in a mainly Latin-based alphabet, generate URL-safe ASCII representations of the entry titles automatically.
    - URLs with non-ASCII characters are ambiguous, because historically ISO-8859-1 has been assumed but now there is a tendency to assume UTF-8.
    - URLs presented with the % notation are less readable than URLs that have been downgraded to ASCII.
    - Europeans tend to prefer URLs that work over URLs that contain non-ASCII characters.
    - Allow language-sensitive mappings
      - Finnish context: ö becomes o
      - German context: ö becomes oe
  - As a bonus, consider transliterating titles written using other alphabets or syllabics to URL-safe ASCII when readers are likely to understand such transliteration.
    - Candidates for transliteration include the Cyrillic and Greek alphabets.
  - Named undated pages
    - Make it possible to use the same content management infrastructure for maintaining “About” and “Feedback” (etc.) pages that don’t have date-based URLs.
- Enforce output correctness
  - For text/html, use HTML 4.01 Strict
  - Combine content and templates by manipulating DOM trees (or perhaps by using SAX).
    - Do not fall into the tag soup trap of string and regexp-based XML manipulation
  - Language-label output
    - Implies the need to get language of input
- RSS feeds
  - Title-only RSS 0.9 feed
  - Tag soup over RSS 0.92 feed with full content.
- Possibility not to include long articles in full on the front page.
  - Separate listing of long articles.
- <link> navigation.
  - Should “next” mean “Newer item” (as in Karl Ove Hufthammer’s blog for example) or “Down in the front page item order” (as in Macsanomat—so that you can open the newest item in Opera 7 and keep pressing the space bar to catch up with older items)?
- Clueful caching
  - Many content management systems deliberately break HTTP caching and regenerate pages from a database on each request.
    - Breaking cacheability by default and regenerating pages on every request makes rapid development of dynamic pages easy. However, if cacheability isn’t built into the architecture of the CMS, it is hard to add later.
  - Instead of just converting slashdotted entries to static pages (as mpt suggests), cache the most recently requested pages as byte buffers in any case.
    - Optionally increase HTTP object expiry time upon slashdotting to allow proxies to serve out cached copies of pages without even verifying freshness with the originating server. (Always allow proxies to do this with images and style sheets.)
      - The vanity consideration is, of course, that if proxies don’t need to revalidate the freshness of pages, all page accesses aren’t logged on the origin server.
  - Invalidate cache entries as needed when the master data changes as opposed to checking whether the master data has changed on every hit.
  - Support HTTP 1.1 conditional GETs.
  - Avoid varying responses based on request headers.
    - Do not vary on User-Agent. There are so many UA strings these days that HTTP caching in proxies is rather pointless if the response varies depending on the UA string.
  - Virtualize style sheet URLs.
    - Allow nearly-permanent caching of style sheet HTTP objects by setting the expiry to, e.g., one year.
    - When a style sheet changes, change the style sheet URL and the references to the style sheet automatically.
    - Redirect old style sheet URLs to the newest version just in case.
  - Allow nearly-permanent caching of image HTTP objects by setting the expiry to, e.g., one year.
    - Perhaps, as a bonus, provide URL virtualization as with style sheets. However, this isn’t crucial, because images are tweaked less frequently than style sheets.
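The style sheet URL virtualization described above could be sketched along these lines. This is a rough illustration, not a finished design: the `StyleSheetRegistry` class, the `/style/` URL layout, and the choice of a content hash as the version marker are all my assumptions. The one-year expiry is expressed here with the HTTP `Cache-Control: max-age` header.

```python
import hashlib

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


class StyleSheetRegistry:
    """Maps style sheet names to URLs that change whenever the content does."""

    def __init__(self):
        self._current = {}   # name -> (url, content)
        self._old_urls = {}  # superseded url -> name

    def publish(self, name, content):
        """Register new content and return the URL templates should reference."""
        digest = hashlib.sha256(content).hexdigest()[:12]
        url = f"/style/{name}.{digest}.css"
        previous = self._current.get(name)
        if previous is not None and previous[0] != url:
            self._old_urls[previous[0]] = name
        # If the content was reverted to an earlier version, that URL is live again.
        self._old_urls.pop(url, None)
        self._current[name] = (url, content)
        return url

    def resolve(self, url):
        """Return (status, headers, body) for a requested style sheet URL."""
        for current_url, content in self._current.values():
            if url == current_url:
                headers = {"Cache-Control": f"max-age={ONE_YEAR_SECONDS}"}
                return 200, headers, content
        name = self._old_urls.get(url)
        if name is not None:
            # Temporary redirect, because the newest version keeps moving.
            return 302, {"Location": self._current[name][0]}, b""
        return 404, {}, b""
```

Because the URL is derived from the content, a changed style sheet automatically gets a new URL, and the old URL keeps working by redirecting to the newest version.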
Storage
- Storage abstraction that can be implemented using either flat files or a database
  - Default flat file back end implementation
    - Database disaster recovery is harder than fixing a directory structure with sanely formatted flat files.
- Store the entry text body using a sane, well-defined application of XML.
  - Straightforward mapping to and from HTML needed.
    - In practice, a sane subset of XHTML 1.1 would be good (although h / section from XHTML 2 would be nice to have).
  - Comply with charmod
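The storage abstraction with a default flat file back end might look something like the sketch below. The `EntryStore` interface, the `FlatFileStore` class, and the one-file-per-entry layout are hypothetical names and choices of mine, and the well-formedness check stands in for whatever validation the real storage format would need.

```python
import re
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from pathlib import Path


class EntryStore(ABC):
    """Storage interface; a database back end could implement it as well."""

    @abstractmethod
    def save(self, entry_id, body_xml):
        """Store the XML body of an entry."""

    @abstractmethod
    def load(self, entry_id):
        """Return the XML body of an entry."""


class FlatFileStore(EntryStore):
    """Default back end: one UTF-8 XML file per entry in a directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, entry_id):
        # Restrict identifiers so they cannot escape the storage directory.
        if not re.fullmatch(r"[a-z0-9-]+", entry_id):
            raise ValueError(f"unsafe entry id: {entry_id!r}")
        return self.root / f"{entry_id}.xml"

    def save(self, entry_id, body_xml):
        # Refuse to store anything that is not well-formed XML.
        ET.fromstring(body_xml)
        self._path(entry_id).write_text(body_xml, encoding="utf-8")

    def load(self, entry_id):
        return self._path(entry_id).read_text(encoding="utf-8")
```

With files laid out like this, recovering from a disaster is a matter of inspecting and fixing individual files with ordinary tools.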
Metadata
- Keep it simple.
  - Processing RDF the right way is not simple.
    - You can have metadata without RDF.
    - Parsing RDF with a couple of regexps is not the right way to process RDF.
- Don’t overdo it.
  - Entering metadata is tedious.
  - By default, avoid requiring the author to enter any per-item metadata beyond the item title, which the author has to type anyway in order to display meaningful headings.
- Declare the natural language of the content
  - However, for a monolingual site, this should be a configuration-time setting.
- Automatically record the creation date of an entry.
- Record the latest modification date of an entry.
  - Preferably allow flagging typo fixes as minor edits that don’t show up as substantial updates.
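As a minimal sketch of the metadata outlined above, assuming a monolingual site: the `EntryMetadata` class, its field names, and the `SITE_LANGUAGE` constant are hypothetical. The only thing the author has to supply is the title; the dates are recorded automatically, and minor edits do not count as substantial updates.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical configuration-time setting for a monolingual site.
SITE_LANGUAGE = "en"


def _now():
    return datetime.now(timezone.utc)


@dataclass
class EntryMetadata:
    title: str
    language: str = SITE_LANGUAGE
    created: datetime = field(default_factory=_now)
    modified: Optional[datetime] = None
    substantially_updated: Optional[datetime] = None

    def record_edit(self, minor=False, when=None):
        """Record an edit; minor edits leave substantially_updated untouched."""
        timestamp = when or _now()
        self.modified = timestamp
        if not minor:
            self.substantially_updated = timestamp
```

A template could then show an “updated” note only when `substantially_updated` is set, while `modified` still serves cache validation.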
Statistics and interlinking
- Support Pingback and Referer logging
  - Send outgoing pingbacks by default.
    - Also set the Referer header when requesting the target page for Pingback autodiscovery.
  - Do not automatically link (from publicly accessible pages) to incoming URLs by default in order to avoid being open to Pingback and Referer spam by default.
  - Provide anti-spam measures such as moderation for displayed incoming Pingbacks and Referers.
- If User-Agent statistics are shown, show browser engine statistics
  - Generic detection code doesn’t work because of spoofing. Use up-to-date code that tries to find cases that are known to exist.
  - Mozilla, Camino, Netscape 6+, Galeon, etc. as one Gecko group.
  - Don’t confuse KHTML/WebKit (Safari and OmniWeb 4.5) with Gecko.
  - Don’t confuse KHTML/Konqueror with Gecko or IE.
  - Separate Tasman and Trident (Mac and Windows IE).
  - Count Opera and iCab as Opera and iCab regardless of where in the UA string the name appears.
  - Separate statistics for the RSS feeds.
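The engine grouping above could be sketched as an ordered list of known cases, checked from most to least specific. This is an illustration only: the `classify_engine` function and the substring markers are my assumptions, a real list would need to be kept up to date, and treating every Mac MSIE string as Tasman is a simplification.

```python
# Ordered (marker, group) rules. Earlier rules win, so browsers that
# mention other engines in their UA strings are caught first.
KNOWN_CASES = [
    ("Opera", "Opera"),                # may also claim to be MSIE
    ("iCab", "iCab"),
    ("AppleWebKit", "KHTML/WebKit"),   # UA string says "like Gecko"
    ("Konqueror", "KHTML/Konqueror"),
    ("Gecko/", "Gecko"),               # Mozilla, Camino, Netscape 6+, Galeon
]


def classify_engine(user_agent):
    """Return the engine group for a User-Agent string."""
    for marker, group in KNOWN_CASES:
        if marker in user_agent:
            return group
    if "MSIE" in user_agent:
        # Simplification: any MSIE on a Mac is counted as Tasman.
        return "Tasman" if "Mac" in user_agent else "Trident"
    return "Other"
```

The order matters: checking for Gecko or MSIE first would misfile Safari and spoofing Opera respectively.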
Extras (mainly for news sites)
- Calendar of upcoming events related to the subject matter of the site.
  - Available in the iCal format.
- A special “review” category and template with summary fields about the product being reviewed.
- If the system has a commenting feature (Should a blogging system even have discussion forum features?)
  - Provide comments as an RSS feed.
  - Make it possible to include links in comments.
  - A robust moderation facility.
- Polls for entertainment purposes
  - Avoid pretending that the polls have any statistical credibility.