Saturday, 12 September 2009

Getting started with R

One of the most popular free data visualisation tools available (it also does lots of other very cool things) is R (google "R" and you'll be OK, google "R" plus a problem that you're having and it's a nightmare). There are loads and loads of great resources for getting started with it.

I don't intend to replace any of these, just to share a few useful things that I've come across in the few months since I started using it.

The general idea with R is that it's open source and there are loads of packages that you can add on to the base. Think of it like Excel, but with every function that you could ever want also available to you (once you've downloaded the package that contains it).

ggplot2
Written by Hadley Wickham, this is a great tool for visualising data. Some of my favourite things are facets and histograms, since these are an absolute nightmare to do in Excel (but shouldn't be).
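
As a rough sketch of what facets and histograms look like together in ggplot2, the snippet below draws one histogram per group. The data frame, its column names and the region labels are all made up for illustration.

```r
library(ggplot2)

# Made-up data: 300 random values split across three hypothetical regions.
set.seed(1)
sales <- data.frame(
  region = rep(c("North", "South", "West"), each = 100),
  value  = c(rnorm(100, 50, 10), rnorm(100, 60, 15), rnorm(100, 40, 5))
)

# One histogram of `value` per region, laid out side by side as facets.
p <- ggplot(sales, aes(x = value)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ region)

print(p)
```

Swapping facet_wrap(~ region) for facet_grid(region ~ .) stacks the panels vertically instead, which can make the distributions easier to compare.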

RCurl
Besides ggplot2, RCurl is one of my favourite packages. It allows you to interact with web resources and gain access to data that is held on websites. This means you can make use of something like the Yahoo! Placemaker API to get the coordinates of a list of different placenames (think cities or train stations, for instance).
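
A minimal sketch of the idea, assuming a service that returns CSV in response to a GET request with query parameters. The helper name fetch_csv, the address and the parameter names are all hypothetical; a real API will have its own URL, parameters and response format.

```r
library(RCurl)

# Hypothetical helper: request `url` with `params` sent as the query string,
# then read the returned text as CSV into a data frame.
fetch_csv <- function(url, params = character()) {
  txt <- getForm(url, .params = params)
  read.csv(text = txt, stringsAsFactors = FALSE)
}

# Usage, with a made-up address and made-up parameter names:
# stations <- fetch_csv("http://example.com/export.csv",
#                       c(region = "north", year = "2009"))
```

getForm() builds the query string from the named vector, so there is no need to paste the URL together by hand.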

However, you can push it further and also interact with non-API web resources in a couple of ways:

  1. Running the query that will cause a web site to return a dataset. In any case where a web site offers you the opportunity to download a file, you can replicate the request that the browser sent using R (sometimes this is easier than other times). This is worthy of a separate post.
  2. Ripping data from websites. In some cases, a website will return the data that you need within the HTML web page itself. When this happens, combining RCurl with XML (another package) allows you to parse the HTML web page and rip the data that you need from it. As others note, this is the choice of last resort as any changes in the web page structure will screw up your code.
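
The second approach can be sketched roughly as follows. To keep the example self-contained, the page is an inline string rather than something downloaded; the table contents are made up, and in practice the string would come from a call such as RCurl's getURL() on the page's address.

```r
library(XML)

# Stand-in for a downloaded page (made-up content).
html <- '<html><body>
  <table>
    <tr><th>station</th><th>passengers</th></tr>
    <tr><td>Central</td><td>1200</td></tr>
    <tr><td>Harbour</td><td>850</td></tr>
  </table>
</body></html>'

# Parse the HTML text into a document that can be queried.
doc <- htmlParse(html, asText = TRUE)

# Pull every table on the page into a list of data frames; keep the first.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
stations <- tables[[1]]
stations$passengers <- as.numeric(stations$passengers)

# XPath handles anything that is not a neat table,
# e.g. the text of the first cell in each row.
names_only <- xpathSApply(doc, "//tr/td[1]", xmlValue)
```

Both the table lookup and the XPath expression depend on the page layout, which is exactly why a change to the page structure breaks this kind of code.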

reshape
Again, written by Hadley. "melt" (which is part of the reshape package) is turning into one of my favourite R functions. It allows you to quickly and easily reshape a wide data set into a long, flat format that you can plot with ggplot2. This is particularly useful if you have a regional time series dataset with a separate column for each region. I've only scratched the surface with this one.
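
Here is a small sketch of the regional time series case, using made-up numbers and column names: one row per year and one column per region going in, one row per year and region coming out.

```r
library(reshape)

# Made-up wide data: one row per year, one column per region.
wide <- data.frame(
  year  = c(2007, 2008, 2009),
  north = c(10, 12, 15),
  south = c(20, 19, 23)
)

# melt() keeps `year` as the identifier and stacks the remaining columns
# into a `variable` column (the region name) and a `value` column.
long <- melt(wide, id.vars = "year")

print(long)
```

The long version can then go straight into ggplot2, for example with year on the x axis, value on the y axis and variable mapped to colour.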

maps
Handy for plotting US state data and world data. The documentation, IMHO, is quite poor from what I've seen to date. The general idea (I'll write a more detailed example once I've worked through one myself) is that you build up a map layer by layer (or that's how I've got it to work), and the key is to identify how states/countries have been coded in the relevant map.
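
A rough sketch of the layer-by-layer approach, assuming the "state" database that ships with the maps package. The choice of highlighted states and colours is arbitrary.

```r
library(maps)

# Layer 1: outlines of the US states.
map("state", col = "grey70")

# Regions are identified by lower-case names; listing them shows the coding.
state_names <- map("state", plot = FALSE, namesonly = TRUE)
print(head(state_names))

# Layer 2: fill a couple of states on top of the existing outline.
highlight <- c("texas", "ohio")
map("state", regions = highlight, fill = TRUE, col = "steelblue", add = TRUE)
```

Some states are split into several named pieces (islands and so on), so it is worth checking the name list before trying to match your own data against it.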

Here's an example of what is possible from FlowingData (I figured I'd better start putting some visualisations into posts). As Nathan notes in the post, this map was created with only 5 or 6 lines of code in R (and a bit of post-R work in Adobe Illustrator).