Sunday, 9 December 2012

Viziblogging about a viziblog

Viziblogging about a viziblog - "Download the data"

This week I am sharing my final piece of work for Alberto Cairo's Introduction to Infographics and Data Visualisation MOOC. I have chosen to investigate a week's worth of posts from The UK Guardian's datablog. It was a particularly busy data week in Britain, as our Chancellor announced his spending plans for the years ahead, and the site provided detailed analysis before, during and after. The centrepiece of the week was this great interactive by Simon Rogers and Garry Blight. 

There is a story behind this story. Early on in this course I made more than a few comments about how surprised I was that the data we were encouraged to download from the Guardian did not seem to be formatted with further visualisation in mind. Having found a total of two examples of this, I jumped the gun and slated the Guardian's entire policy on providing data.

Readers of last week's viziblog will see the results of that, as I looked again to see how widespread the problem was. The answer was, "not at all"! My apology may seem a little over the top for someone with only 9 followers on Twitter, but it matched my mortification. 

So I have designed the following visualisation to let you make up your own minds, and to explore both the variety of visualisation types and tools the Guardian uses and the availability of the data. I have scored each report on whether it includes:
A) a data summary on the landing page, which can be copied and pasted into Excel or similar to be manipulated further (1 point)
B) a rationally formatted spreadsheet to download directly (1 point)
Either one of these is sufficient to disprove my previous precipitous opinion.
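For anyone who wants to reproduce the scoring, the scheme above can be sketched in a few lines of Python. This is only an illustration: the `Report` class, its field names and the two example posts are made up for the sketch and are not taken from the actual survey data.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One datablog post in the survey (hypothetical structure)."""
    title: str
    has_page_summary: bool       # criterion A: copy-and-paste summary on the landing page
    has_clean_spreadsheet: bool  # criterion B: rationally formatted spreadsheet to download


def availability_score(report: Report) -> int:
    """Return 0, 1 or 2: one point for each criterion the report meets."""
    return int(report.has_page_summary) + int(report.has_clean_spreadsheet)


# Made-up examples, purely to show the shape of the result.
reports = [
    Report("Example post one", has_page_summary=True, has_clean_spreadsheet=True),
    Report("Example post two", has_page_summary=False, has_clean_spreadsheet=True),
]
scores = {r.title: availability_score(r) for r in reports}
```

A score of 1 or more means the data behind that post can be reused without a clean-up step.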

Also, because the Guardian both posts its own visualisations and showcases those of others, you may like to investigate the correlation between author and data availability score.

[Interactive visualisation, powered by Tableau]

Wednesday, 5 December 2012

Not my final MOOC submission

Not my final MOOC submission!

Hello again, so much to share and so little time

I'm sorry for posting such a small viziblog*, but it's late and I can't wait till the morning to share three things with you... and start a little rant.

First the 3 things:

1. This link to Duke University Library - an excellent dataviz resource, which includes some of Mike Bostock's most recent d3.js examples. The site feels a bit old-fashioned, but is clearly updated regularly.

2. This map from Shawn Allen @shawnbot, which, if I could find a way to bind it to the US Unemployment data, and then found a way to manipulate the colours, and then found a way to animate it to scroll through the years from 2003-2012... would have been my viz for last week's task. I've forked the code, so watch this space (for a very long time, I expect).

3. And finally, a first step towards this week's final submission for Alberto Cairo's excellent MOOC. Playing with IBM's oft-neglected Many Eyes, I took the data behind today's excellent Guardian Datablog visualisations and pushed it into an interactive treemap. What I came up with is below. But before you enjoy it, my rant (as promised).

Why? Oh Why? Oh Why... does the leading promoter of data visualisation and data journalism in the UK insist on providing its source data in heavily formatted Excel files? Lately I have been clicking on their 'Download the data' link quite a bit, and then spending far more time than I'd like removing table junk: extra cells with unnecessary titles; blank columns inserted randomly (and narrowed so you don't find them until they've broken your .csv file!); text commentary in cells; dates as column headings (not part of the record row); numbers as text; dates as numbers; multiple tables on one tab... I could go on, but I already have.

Please @GuardianData, I have loved you since childhood, and am so proud that you are leading in data journalism in the UK as in so many other areas. I ask only one thing - that when I click on 'Download the data' I at least have the option of taking a nice clean Comma Separated Value file with no bells and whistles, with all the measures starting on the left and then the dimensions. No Row Headings!
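For readers who would rather script that kind of tidy-up than do it by hand, here is a minimal Python sketch using only the standard library. It assumes the sheet has already been saved as CSV text with one header row of the form label, period, period, ... and it handles only some of the complaints above: blank rows, blank columns, numbers stored as text, and dates as column headings. The function names, the field names `value`, `region` and `year`, and the sample data are all made up for illustration. Stray title rows and multiple tables on one tab would still need handling separately.

```python
import csv
import io


def drop_blank_columns(rows):
    """Remove every column whose cells are all empty or whitespace."""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return [[row[i] for i in keep] for row in padded]


def to_number(text):
    """Convert text such as '1,200' to a float; return None if it is not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def wide_to_long(rows):
    """Turn [label, period1, period2, ...] rows into one record per (label, period)."""
    header, *body = rows
    records = []
    for row in body:
        label = row[0].strip()
        for period, cell in zip(header[1:], row[1:]):
            value = to_number(cell)
            if label and value is not None:
                # Measure first, then the dimensions (hypothetical field names).
                records.append({"value": value, "region": label, "year": period.strip()})
    return records


def clean(raw_csv_text):
    """Drop blank rows and columns, then reshape the table into tidy records."""
    rows = [r for r in csv.reader(io.StringIO(raw_csv_text)) if any(c.strip() for c in r)]
    if not rows:
        return []
    return wide_to_long(drop_blank_columns(rows))


# Made-up sample: a blank column, a blank row, a number stored as text, a text note.
messy = 'Region,,2011,2012\nNorth,,"1,200",1300\n,,,\nSouth,,900,n/a\n'
records = clean(messy)
```

Each record can then be written out with `csv.DictWriter` to give a plain file with one row per observation.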

(deep breath)
Not deep enough. I have been in communication with Simon Rogers, the award-winning editor of the Guardian datablog, who has been both gracious and tolerant of my insistence that the Guardian should do better in ensuring that the data they promote for use by readers is better formatted. I have done this publicly, both here and on Twitter. Belatedly, I decided to investigate further, in order to determine the extent of what I perceived as a lack of rigour in preparing the data behind their excellent, world-class data visualisations and data journalism. What I discovered is that I have been guilty of drawing conclusions based on insufficient data, and my survey of seven days of @GuardianData has revealed the high data curation standards maintained by Simon's team, which underpin the excellent journalism for which the Guardian is rightly renowned. I will present the results of my investigation as my final assignment for Alberto Cairo's Data visualisation MOOC.

I am posting this now because I believe that I have already discovered enough to completely and unconditionally withdraw my criticism and apologise without reservation for my overly smug and entirely precipitous judgement. It would be remiss of me to leave an unjust claim in the public domain. My criticism was based on an unrepresentative sample, and I am sufficiently experienced to have realised what I should have done before writing a single word on this matter: the research I have now undertaken to assess the extent, or otherwise, of the problem I perceived. I failed in that responsibility and seek to make amends by displaying the facts so that others may judge for themselves. I am still in the process of doing that, but the weight of evidence is already sufficient to justify this action.

I hope that I can do a good job at that at least, and apologise to all those who have been kind enough to visit this blog, to whom I have a responsibility to pursue best practice. I apologise publicly to Simon here and to all members of his team who may have felt that they were also unjustly criticised; I will endeavour to apologise to him personally as well.

If any of you feel the same, or know other sites that are guilty of providing their work-up sheets rather than the data - just the data - then email, blog, tweet, or post a comment on their sites in support of Clean Data for All. Or if you disagree, comment below and let the debate begin.  

Now, I also promised you a chart and here it is - just a first stab (I've been thinking about a Slopegraph as well). My actual contribution will be delivered just under the wire (as usual).

*viziblog - soon to be the international term to denote data visualisation blogs, therefore removing the concern over whether you should search Twitter for #datavisualisation or #datavizualisation or #datavisualization or #... you get the idea. #viziblog saves anxiety and, importantly for Twitter, saves characters.