
July 22, 2006

it's just semantics

Via Bill de hÓra came a link to "Google Maps, the fool's gold of mashups" by Phil Wainewright: what makes mapping mashups so easy, and so untypical of the kind of mashup challenges people face in the real world: the critical data is already structured according to a specification that all of us internalize by the time we graduate from junior school. The same is true of names, dates, time of day, quantities and dimensions.

Actually, even things that feel like they should be trivial to map on Google Maps can turn out not to be. For example, there is no freely available source of data to allow geocoding of Australian street addresses, and even the data that is freely available doesn't all use the same geodetic system. So if (for example) you merge data from the Geographical Names Board of NSW with Australia Post's Postcode Data File and then plot the location of postcode 2041 (Balmain), it turns out to be an uninhabited island in Sydney Harbour.

Or how about plotting a map showing all the places where Australia's two largest retailers compete head to head? Something like "Software Wars", but less amusing. As Stevey said: "It was easy to think, so it must be easy to do."

It turns out this is yet another area for which there are indeed publicly available sources of data, but those sources don't line up with a layperson's expectation that things that are considered competitors should be equivalent. For example, in the Australian Federal Government's "Australian Business Register", Woolworths Limited has a single ABN with multiple trading names, whereas a search for Coles Myer returns lots of different ABNs, none of which link to any of their wholly owned businesses (like Target or KMart). If you go to the Coles Myer website you can search for CML stores (which includes the Shell service stations that are also co-branded "Coles Express", as well as the Myer stores, even though Myer is no longer owned by Coles Myer Limited). The closest equivalent on the Woolworths Limited site is an incomplete list of brands. Amongst other things, there's no mention of the 100+ pubs that make up the Australian Leisure & Hospitality Group (which aren't actually on Woolworths' listing in the Australian Business Register either, since they are only 75% owned by Woolworths).

But one shouldn't be too surprised at how hard it is to find a complete and unambiguous list of all Coles Myer or Woolworths stores on the web - things aren't any clearer when you're inside the corporate firewalls either.

This issue is not really about the "format" of data - whether you use XML or CSV, or whether numbers are zero padded. It's about the "meaning" of data, i.e. the semantics. So would the semantic web help here at all? One could imagine a program trying to build a list of stores belonging to each retailer by yoking together a chain of ownership inferred from the Australian Business Register with store location data from corporate websites. But there are still lots of ad hoc judgement calls to be made. Should the list include entities that are partly owned? If so, do you include only majority owned stores? What about franchised outlets (where someone else owns the store and pays for the right to use a brand name)?
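To make those judgement calls concrete, here is a minimal Python sketch in which the "list of businesses belonging to a retailer" changes depending on a policy the caller has to pick. The record layout, the function name and every entry except the 75% hotel group figure are made up for illustration; nothing here comes from the real Australian Business Register.

```python
# Hypothetical ownership records: (parent, business, fraction owned, franchised?)
# Only the 75% figure for the hotel group comes from the post; the rest is invented.
HOLDINGS = [
    ("Woolworths Limited", "Supermarkets", 1.00, False),
    ("Woolworths Limited", "Australian Leisure & Hospitality Group", 0.75, False),
    ("Woolworths Limited", "Example Joint Venture", 0.40, False),
    ("Woolworths Limited", "Example Franchise Outlet", 0.00, True),
]


def businesses_of(parent, min_ownership=1.0, include_franchises=False):
    """Return the businesses counted as belonging to `parent` under one policy."""
    selected = []
    for owner, business, fraction, franchised in HOLDINGS:
        if owner != parent:
            continue
        if franchised:
            # Franchises are a separate yes/no decision, not an ownership level.
            if include_franchises:
                selected.append(business)
            continue
        if fraction >= min_ownership:
            selected.append(business)
    return selected


# Three defensible policies, three different answers to the "same" question.
wholly_owned = businesses_of("Woolworths Limited")
majority_owned = businesses_of("Woolworths Limited", min_ownership=0.5)
everything = businesses_of(
    "Woolworths Limited", min_ownership=0.0, include_franchises=True
)
print(len(wholly_owned), len(majority_owned), len(everything))
```

The code is trivial; the hard part is that nothing in the data says which of the three calls is the "right" one.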

All of these decisions are things that a normal person would probably feel they know the answer to, but they could not define a general method to arrive at that answer before the specific question is asked ("I know it when I see it..."). Identifying and resolving all these ambiguities is what makes defining a data model really, really hard.

And, as I previously wrote, people who are motivated to go to all the effort of creating a complete and unambiguous data model will have very different concerns than a casual consumer of that data model.

the fallacy of using commercial identifiers in applications aimed at the general public

On the barcodepedia forum there's a discussion about using EANs in a database of 'ethical information' of products.

This is just one of a number of efforts I've seen where people have tried to use a commercially developed identifier as the index into a system to be used by the general public, another being Jon Udell's LibraryLookup bookmarklet that lets you search your local library's catalogue by ISBN.

But these repurposings always come up against the problem that the original identification scheme uses a much finer grained categorisation than the general public care about. So there is no single barcode for 'coca cola'; there are thousands, varying not just by flavour (classic, cherry, zero, diet, diet with lemon, diet with lemon and a backrub) but also by country of manufacture, container size, whether the container is sold standalone or as a 6 pack, and lots of other variables that really matter to manufacturers, distributors, wholesalers and retailers of products (who created the UPC/EAN/GTIN scheme in the first place), but don't matter at all to someone wanting to discuss the ethics, nutritional value, or social history of coca cola.

Similarly, as Jon Udell discovered, an ISBN denotes a species, not a genus, so there is no unique ISBN for "On The Road". And the need for members of the book trade to differentiate between books that have the same content but differ in publisher, edition, binding etc. was one of the motivations for creating the ISBN system in the first place.

In the LibraryLookup case, the proposed solution is to use a service like xISBN, which takes an ISBN and returns a list of 'related' ISBNs, i.e. entries in an ISBN catalogue that have the same title.

And I suspect if anyone attempts to build a list of information about products for consumers and uses the GTIN barcode as the index, they will eventually hit the same problem of wanting to combine multiple GTINs into a single 'product', but the only way of combining multiple records will be by looking for items that have the same name.
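As a rough illustration of what that combining step looks like, the Python sketch below groups barcode records under a normalised name. The catalogue rows, the placeholder GTINs and the normalisation rules are all assumptions made up for this example; a real catalogue would need far messier rules.

```python
from collections import defaultdict

# Hypothetical catalogue rows: (GTIN, product name as printed on the pack).
# The GTINs are placeholders, not real barcodes.
CATALOGUE = [
    ("0000000000001", "Coca-Cola Classic 375ml can"),
    ("0000000000002", "Coca Cola Classic 6 x 375ml"),
    ("0000000000003", "Diet Coca-Cola with Lemon 1.25L"),
    ("0000000000004", "Lemonade 375ml can"),
]

# Words that describe the packaging rather than the product (an arbitrary list).
PACKAGING_WORDS = {"can", "bottle", "pack", "x"}


def normalise(name):
    """Lower-case, turn punctuation into spaces, drop sizes and packaging words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    kept = [
        word
        for word in cleaned.split()
        if word not in PACKAGING_WORDS and not any(c.isdigit() for c in word)
    ]
    return " ".join(kept)


def group_by_name(rows):
    """Map each normalised name to the list of GTINs that share it."""
    groups = defaultdict(list)
    for gtin, name in rows:
        groups[normalise(name)].append(gtin)
    return dict(groups)


products = group_by_name(CATALOGUE)
for name, gtins in products.items():
    print(name, gtins)
```

Note that the barcode contributes nothing to the grouping: the key that does the work is the name, and whether "diet with lemon" belongs with "classic" is still a judgement call.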

So why not just index and search by name in the first place?

Clay Shirky wrote about this, and a whole lot more, (and a whole lot better) in "Ontology is Overrated"

July 20, 2006

random links 2006-07-20

Dog or higher proposes a corollary to Godwin's law: When Dvorak enters the debate, the other side has won.

The 10000 Lines Of Code Project - We believe that most software exceeding this maximum is bloated and seriously wrong

The Register: Multivalued datatypes considered harmful

Hugo Ortega, Woolworths and Microsoft Australia does a pretty good job describing life at Norwest, where every employee walks around with RFID tags but business cards are an unjustifiable expense.

July 14, 2006

rendering html from sql server

I can't think of a single legitimate reason for ever doing this, but here's my latest post on rendering HTML from within a REPL, this time from SQL Server Management Studio (formerly known as Query Analyzer). Previous posts show how to do this in irb (ruby) and in PowerShell.

CREATE PROC usp_Show_HTML (@html varchar(8000))
AS
    -- Start Internet Explorer via OLE Automation and hide its chrome
    DECLARE @ie int
    EXEC sp_OACreate 'InternetExplorer.Application',@ie OUT
    EXEC sp_OASetProperty @ie,'menubar',0
    EXEC sp_OASetProperty @ie,'toolbar',0
    EXEC sp_OASetProperty @ie,'statusbar',0
    -- Load a blank page so there is a document object to write into
    EXEC sp_OAMethod @ie,'navigate',null,'about:blank'
    DECLARE @doc int
    EXEC sp_OAGetProperty @ie,'document',@doc OUT
    -- Write the HTML, then make the window visible
    EXEC sp_OAMethod @doc,'write',null,@html
    EXEC sp_OASetProperty @ie, 'Visible', 'true'

This only works if you are connecting to a SQL Server running on localhost, and if the SQL Server service is set to run under the 'Local System Account' with the 'Allow service to interact with desktop' checkbox ticked.

To use:
SET NOCOUNT ON
CREATE TABLE #t(output varchar(256))
INSERT INTO #t
EXEC xp_cmdshell 'SET'

DECLARE c CURSOR FAST_FORWARD FOR SELECT output FROM #t WHERE output IS NOT NULL
DECLARE @output varchar(256)
DECLARE @html varchar(8000)

SET @html='<table>'
OPEN c
FETCH NEXT FROM c INTO @output

WHILE @@FETCH_STATUS>-1 BEGIN
    SET @html=@html+'<TR><TD>'+REPLACE(@output,'=','</TD><TD>')+'</TD></TR>'
    FETCH NEXT FROM c INTO @output
END
CLOSE c
DEALLOCATE c
SET @html=@html+'</TABLE>'
EXEC usp_Show_HTML @html
DROP TABLE #t

stevey's right, marketing matters

A few months ago, I read Stevey's drunken blog rant on the different marketing styles of python and ruby.

I was reminded of that tonight when I went looking for a rich text REPL environment. I'm sure such a beast is out there. Something called Leo might even be it, but after looking at the screen shots, beginners guide and FAQ, I still don't really know what it's all about, or feel at all inspired to give it a shot.

Compare and contrast with 'try ruby'

 

July 09, 2006

rendering html from monad

I've just started dabbling with monad (aka "Windows PowerShell"), so I tried porting my code for 'rendering HTML from irb' to it. It turns out to be a pretty good task for getting to know a new REPL environment.

Just for the record, this is what I came up with:

function Show-HTML ([string]$html) {
    $ie=New-Object -ComObject InternetExplorer.Application
    $ie.menubar=0
    $ie.toolbar=0
    $ie.statusbar=0
    $ie.navigate("about:blank")
    $ie.document.write($html)
    $html=[string]$input
    $ie.document.write($html)
    $ie.visible="true"
}

You can either pass a string param, as in Show-Html "Hello World!", or you can pipe HTML in. Monad also has a nifty builtin called 'ConvertTo-HTML' which pretty much does what it says. So in order to get the same HTML table of environment variables from the irb example, you could use

dir Env:  | ConvertTo-HTML Name,Value | Show-HTML

 

random links 2006-07-09

Ted Neward - More on Monad vs Ruby and The Vietnam Of Computer Science

Barcodepedia - the online barcode database

 

payment systems blogroll

Blogs that are mostly about payment systems technology: 
Blogs that are mostly about the business of payment systems:
  • Aneace's Blog - Ideas from around the world on how to re-invent the payment transaction as a privileged moment of contact, communication and exchange between customers, merchants and their banks
  • Linkdump on Payments - in the Netherlands - Europe - the World...
  • Bankwatch - Which banks understand the web lifestyle?
  • AllPayNews - Making Sense of Payments
Coder blogs that sometimes talk about payment systems stuff:
  • Rambling Comments - Len Holgate's thoughts on this and that... Mainly test driven software development in C++ on Windows platforms...

If you've got any other suggestions for this list, post a comment.

July 08, 2006

cascading filetypes - proof of concept

Following on from my proposal for a cascading filetypes demultiplexer, I've hacked up a quick and dirty 'proof of concept' implementation. The zip file is here, which also includes a readme and the C# source code.

cascading filetype demultiplexing on windows

Yesterday, Leon (aka secretGeek) came up with the cool idea of 'cascading filetypes'. (go read that now or the rest of this will make no sense at all)

Update 2006-07-08: proof of concept implementation

Leon's original proposal was to have the final extension be unique to each application, e.g.

MyTimesheetSettings.txt.sgml.xml.snapper

The problem here is that this would require Microsoft to modify the Windows shell to understand the hierarchy of file types. This is a classic bootstrapping 'chicken and egg' problem - there's no incentive to modify the shell unless there are apps that would use the functionality, but the apps can't get built unless the shell is already modified.

But a slight tweak would turn this into something that could be incrementally deployed, namely using a standard extension that points at an app that does nothing but parse the file name, walk the hierarchy of file types, and then look in the registry for the appropriate application to execute.

I propose that the file extension that is used be 'CFD', partly because 'Cascading Filetype Demultiplexing' sounds kinda cool, and also because not much else seems to be associated with that extension.

Also, even though XML strictly speaking is SGML, I don't think there will be anyone out there who has an editor associated with SGML but not with XML, so I'd suggest dropping the SGML extension from the name as well.

So the example above would change to:

MyTimesheetSettings.txt.xml.snapper.cfd

So to get this idea to float, this is what needs to happen:

First, there needs to be a freely distributable app that can bind to ".cfd". Actually there can be a whole bunch of implementations of apps that claim that extension, as long as they all follow the same convention of parsing the filename to find the appropriate app to hand off to.

Then developers wanting the advantage of cascading filetypes (namely, being able to bind a doc to your own app for some verbs, such as 'open', and have the shell hand off to another app for other verbs, such as 'edit') need to do the following:

  1. make up a new extension unique to your application (as in .snapper, from Leon's example) 
  2. during your application install, you need to register your unique extension with the shell, and tell it what DDE verbs you implement (e.g. 'open')
  3. your install should ALSO make sure that a cascading filetype demultiplexer is installed (i.e. that something is registered with the shell for the .cfd extension). If nothing is associated with that extension, then you should also install the cascading filetype demultiplexer.
  4. when your app creates files, make sure the filenames follow the cascading filetype convention, with your unique extension and then ".cfd" at the end.
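The hand-off a demultiplexer performs can be sketched in a few lines of Python. This is only an illustration of the idea and not the proof-of-concept code: the ASSOCIATIONS dictionary is a stand-in for the shell's registry entries, the command names are invented, and the sketch assumes the base filename itself contains no dots.

```python
# Stand-in for the shell's file associations: extension -> {verb: command}.
# Every entry here is hypothetical.
ASSOCIATIONS = {
    "snapper": {"open": "snapper.exe"},
    "xml": {"edit": "xmleditor.exe"},
    "txt": {"open": "notepad.exe", "edit": "notepad.exe", "print": "notepad.exe"},
}


def filetype_chain(filename):
    """Return the cascaded extensions, most specific first."""
    parts = filename.split(".")
    if len(parts) < 3 or parts[-1].lower() != "cfd":
        raise ValueError("not a cascading filetype name: " + filename)
    # parts[0] is the base name and parts[-1] is 'cfd'; the types sit in between.
    return [ext.lower() for ext in reversed(parts[1:-1])]


def resolve(filename, verb, associations=ASSOCIATIONS):
    """Find the command for `verb`, falling back to more general file types."""
    for ext in filetype_chain(filename):
        command = associations.get(ext, {}).get(verb)
        if command is not None:
            return command
    return None  # no file type in the chain implements this verb


name = "MyTimesheetSettings.txt.xml.snapper.cfd"
for verb in ("open", "edit", "print", "play"):
    print(verb, "->", resolve(name, verb))
```

With these made-up associations, 'open' goes to the app that owns .snapper, 'edit' falls through to the XML editor, 'print' falls all the way back to the .txt handler, and an unknown verb resolves to nothing.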

If this idea takes off,  and apps start using this, then maybe MS can build the demultiplexing directly into the shell, in which case the need to install the separate demultiplexing app goes away.

Anyone wanting to start implementing a demultiplexer should probably start by reading this MSDN article on Verbs and File Associations

July 03, 2006

geocoded NSW postal areas

 A list of suburbs in NSW, including their postcodes and approximate location (using the GDA94 datum).

This was made by merging data from the Geographical Names Board of NSW with Australia Post's Postcode Data File 

July 01, 2006

random links 2006-07-01

Happy new (financial) year!

balloon - one of those things where if you don't know why it's cool as soon as you see it, no-one could explain it to you.

Very detailed and serious comparison of functional and imperative programming styles

A site all about real programmers (thank you, OdeToCode!)