Monday, 5 October 2009

Free data

The differences in policy towards government IP in the US versus the UK have been well documented - some of the best examples include NASA's photos from Hubble and the many, many datasets on data.gov.

The general idea is that, as taxpayers, the American people own the data they have paid for. What a great idea.

The UK is a Luddite in comparison.

It seems that all our taxpayers' money has been used to bail out banks, so getting access to Ordnance Survey data costs a fair whack.

At the same time, I'm unsure what the Post Office is spending its money on (definitely not doing up its outlets, but maybe funding a few holidays for Roger Moore), as it charges for access to its postcode data and has issued notices to those using similar postcode services. I guess the Post Office is a fan of monopolies.

Rant over.

The government finally seems to be getting its act together with the launch of its Open Data Developers initiative, which is currently in beta with a Google group. By joining the group, you're provided with a login and password for the beta website. From what I've read on the Google group, SPARQL is the language used for querying the databases. Have a look at Wikipedia or Google it if you want to find out more.
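
To give a flavour of what's involved, here's a minimal sketch of firing a SPARQL query at an endpoint from r using RCurl. The endpoint URL below is a placeholder (the real one sits behind the beta login), and I'm assuming the service accepts the standard "query" parameter that most SPARQL endpoints do.

    # A sketch only: the endpoint URL is a placeholder, since the real
    # one sits behind the beta login. Most SPARQL endpoints accept a
    # "query" parameter on a GET request.
    library(RCurl)

    sparql <- "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
    result <- getForm("http://example.gov.uk/sparql",  # placeholder endpoint
                      query = sparql)
    cat(result)  # raw response, typically XML or JSON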

I'm still exploring the site and the datasets available, but first impressions are very promising and, IMHO, it looks a lot easier to use than something like open.gov.uk, which is well meaning but a nightmare to use.

I'll write a few more posts once I've got my head around some of what is possible.

Saturday, 12 September 2009

Getting started with r

One of the most popular free data visualisation tools available (it also does lots of other very cool things) is r (Google "r" and you'll be OK; Google "r" plus a problem you're having and it's a nightmare). There are loads and loads of great resources for getting started with it.

I don't intend to replace any of these, just share a few useful things that I've come across in the few months since I started using it.

The general idea with r is that it's open source and there are also loads of packages that you can add on to the base. Think of it like Excel, but with every function you could ever want also available to you (once you've downloaded the package that contains it).
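
For instance, adding a package is a one-off install plus a library() call each session:

    # Install an add-on package from CRAN (one-off), then load it
    # for the current session.
    install.packages("ggplot2")
    library(ggplot2)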

ggplot2
Written by Hadley Wickham, this is a great tool for visualising data - some of my favourite features are facets and histograms, since these are an absolute nightmare to do in Excel (but shouldn't be).
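
As a taster, here's a minimal faceted histogram using ggplot2's built-in diamonds dataset - the sort of chart that takes an age in Excel:

    # Histogram of diamond prices, split into one panel per cut grade.
    library(ggplot2)

    ggplot(diamonds, aes(x = price)) +
      geom_histogram(binwidth = 500) +  # bin prices into buckets of 500
      facet_wrap(~ cut)                 # one facet per level of cut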

RCurl
Besides ggplot2, RCurl is one of my favourite packages. It allows you to interact with web resources and get at data that is held on websites. This means you can make use of something like the Yahoo! Placemaker API to get the coordinates of a list of different placenames (think cities or train stations, for instance).
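
Here's a hedged sketch of the sort of call involved - the endpoint and parameter name are placeholders rather than the real Placemaker ones:

    # getForm() builds the query string and fetches the response.
    library(RCurl)

    response <- getForm("http://example.com/geocode",  # placeholder URL
                        q = "King's Cross")            # placeholder parameter
    cat(response)  # raw response to parse for coordinates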

However, you can push it further and also interact with non-API web resources in a couple of ways:

  1. Running the query that causes a website to return a dataset. Wherever a website offers you the chance to download a file, you can replicate the request that the browser sends using r (sometimes this is easier than others). This is worthy of a separate post.
  2. Ripping data from websites. In some cases, a website returns the data you need within the HTML page itself. When this happens, combining RCurl with XML (another package) lets you parse the HTML page and rip the data you need out of it (there's a sketch of this just after the list). As others note, this is the option of last resort, as any change in the page structure will screw up your code.
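
Here's a sketch of the second approach with a placeholder URL - readHTMLTable() from the XML package pulls every HTML table on a page into a list of data frames:

    # Fetch the raw page with RCurl, then let XML do the parsing.
    library(RCurl)
    library(XML)

    html   <- getURL("http://example.com/stats.html")  # placeholder URL
    doc    <- htmlParse(html, asText = TRUE)           # parse the raw HTML
    tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
    head(tables[[1]])                                  # first table on the page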

reshape
Again written by Hadley. "melt" (part of the reshape package) is turning into one of my favourite r functions. It allows you to quickly and easily reshape a dataset into a flat format that you can plot with ggplot2. This is particularly useful if you have a regional time series dataset with a separate column for each region. I've only scratched the surface with this one.
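
A quick made-up example of what melt does to a wide regional time series:

    # One column per region (wide) becomes one row per region-year
    # observation (long), which is the shape ggplot2 wants.
    library(reshape)

    wide <- data.frame(year   = 2005:2008,
                       london = c(10, 12, 15, 14),
                       leeds  = c(8, 9, 11, 10))

    long <- melt(wide, id.vars = "year")
    # long now has three columns: year, variable (the region), value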

maps
Handy for plotting US state data and world data. The documentation, IMHO, is quite poor from what I've seen to date. The general idea (I'll write a more detailed example once I've worked through one myself) is that you build up a map layer by layer (or that's how I've got it to work), with the key step being to identify how states/countries have been coded in the relevant map.
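
A minimal sketch of the layering idea - the states shaded here are arbitrary:

    # Draw the blank US state map, then add a shaded layer on top.
    library(maps)

    map("state")                                  # base layer: state outlines
    map("state", regions = c("texas", "ohio"),
        col = "grey80", fill = TRUE, add = TRUE)  # second layer: shaded states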

Here's an example of what is possible from FlowingData (I figured I'd better start putting some visualisations into posts). As Nathan notes in the post, this map was created with only 5 or 6 lines of code in r (and a bit of post-r work in Adobe Illustrator).

Tuesday, 11 August 2009

The right information in the right place just changes your life or The Jerry Maguire post

The quote to keep in mind
"On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life.

On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other."

Stewart Brand
Now, a bit of background

Over the past 18 months, there has been an explosion in data visualisation on the web (a visualisation of this explosion might be done at some point). Part of this explosion has been the expansion of freely available data sources and of the tools to visualise them. Never has it been easier to get hold of data and do something with it.

From what I've come across to date (so not an exhaustive list), there are quite a few free (typically open source) tools and data sources out there:

Free Tools
  1. r & associated packages to accomplish specific tasks
  2. ManyEyes (by IBM)
  3. Verifiable
  4. Processing
  5. OECD Data Visualiser - become your very own Hans Rosling
  6. Charles (a web debugging proxy - it records the information being sent by your browser to a website, so you can replicate the request in r or other programs)
  7. Dapper (I haven't tried this one out)
Data Sources

Unofficially, anything on the web that is in a half-decent structure - either as a table or a CSV (a full post on using r & RCurl for some of these to follow).
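
For the simplest case, base r will read a CSV straight off the web (placeholder URL again):

    # read.csv() accepts a URL as well as a local file path.
    prices <- read.csv("http://example.com/data.csv")
    head(prices)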

Officially, lots of things have an API; here are a few:

  1. TheyWorkForYou (UK MPs information)
  2. Yahoo (lots of different APIs - the Placemaker one is nice)
  3. Twitter
There are also sources where data is stored in some form or other:

  1. The Guardian has started a data store, which attempts to liberate data from various places, including crowdsourcing the MPs' expenses.
  2. The Office for National Statistics
  3. Transport for London
  4. National Rail
  5. Wikipedia (great if you need a list of, say, Zone 1 Tube stations)
And there are also lots of people creating and critiquing visualisations:

  1. FlowingData
  2. Junk Charts
  3. Data Visualisation
  4. Indexed (not strictly data)

All of these lists are in no way exhaustive, and will be added to over time.