OUseful Info

Syndicate content OUseful.Info, the blog...
In the expectation that Google will axe feedburner, please change any subscriptions to this feed to http://blog.ouseful.info/feed/
Updated: 3 hours 1 min ago

Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships

Fri, 05/24/2013 - 8:13am

Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.

One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.

Co-citation
The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.

cocitation analysis
Image via Wikipedia

In graph terms, we might also represent this as simpler graph within which edges between two articles indicate that they have been co-cited by documents within a particular corpus, with the weight of each edge representing the number of documents within that corpus that have co-cited them.

Bibliographic coupling
Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.

Bibliographic coupling
Image via Wikipedia

Again, in graph terms, we might think of a simpler undirected network in which edges between two articles act as an indicator that they have cited or referenced the same work, with the weight of the edge representing the number of documents that they have co-cited.

A comparison of co-citation and bibliographic coupling networks shows one to be “retrospective” and the other to be “forward looking”. The articles referenced in bibliographic coupling network can be generated directly from a corpus set of articles, and to this extent bibliographic coupling looks to the past. In a co-citation network, the edges that connect two articles can only be generated when a future published article cites them both.

Co-citation, Bibliographic Coupling and Company Director Networks

For some time I’ve been tinkering with the notion of co-director networks, using OpenCorporates data as a data source (eg Mapping Corporate Networks With OpenCorporates). What I’ve tended to focus on are networks built up from active companies and their current directors, looking to see which companies are currently connected by virtue of currently sharing the same directors. On the to do list are timelines showing the companies that a particular director has been associated with, and when, as well as directorial appointments and terminations within a particular company.

In both co-citation and bibliographic analyses, the nodes are the same type of thing (that is, works that are citated, such as articles). A work cites a work. (Note: does author co-citation analysis rely on mappings from works to cited authors, or citing authors to cited authors?). In company-director networks, we have bipartite representation, with directors and companies representing the two types of node and where edges connect companies and directors but not companies and companies or directors and directors; unless a company is a director, but we generally fudge the labelling there.

If we treat “companies that retain directors” as “articles that cite other articles”:

- under a “co-citation” style view, we generate links between companies that share common directors;
- under a “bibliographic coupling” style view, we generate links between directors of the same companies.

I’ve been doing this anyway, but the bibliographic coupling/co-citation distinction may help me tighten it up a little, as well as improving ways of calculating and analysing these networks by reusing analyses described by the bibliometricians?

Pondering the “future vs. past” distinction, the following also comes to mind:

- at the moment, I am generating networks based on current directors of active companies;
- could we construct a dynamic (temporal?) hypergraph from hyperedges that connect all the directors associated with a particular company at a particular time? If so, what could we do with this graph?! (As an aside, it’s probably worth noting that I know absolutely nothing about hypergraphs!)

I’ve also started wondering about ‘director pathways’ in which we define directors as nodes (where all we require was that a person was a director of a company at some time) and directed “citation” edges. These edges would go from one director to other director nodes under the condition that the “citing” director was appointed to a particular company within a particular time period t1..t2 before the appointment to the same company of a “cited” director. If one director follows another director into more than one company, we increase the weight of the edge accordingly. (We could maybe also explore modes in which edge weights represent the amount of time that two directors are in the same company together.)

The aim is… probably pointless and not that interesting. Unless it is… The sort of questions this approach would allow us to ask would be along the lines of: are there groups of directors whose directorial appointments follow similar trajectories through companies; or are there groups of directors who appear to move from one company to another along with each other?


Categories: Faculty

Notes on Narrative Science and Automated Insights

Wed, 05/22/2013 - 10:40am

In October 2009, the New York Times Media Decoder blog picked up on a story that had been doing the rounds about a research project called Stats Monkey from the Intelligent Information Laboratory at Northwestern University. The Robots Are Coming!, it declared, with the immediate rejoinder, Oh, They’re Here. Using play by play baseball data, Stats Monkey produced human readable reports of a baseball game, formulaic admittedly, but good enough, particularly when complemented by quotes from a post-match press conference report. Mechanical churnalism complementing data-driven analysis, cast into prose. (It’s worth noting that the Media Decoder post itself is little more than a restatement of what was presumably the Stats Monkey website blurb at the time.)

In April 2010, Bloomberg Businessweek Magazine asked Are Sportswriters Really Necessary?, describing how Narrative Science, a company that incorporated at the start of that year and spun out off the back of the Stats Monkey project had teamed up with the Big Ten Network to produce automatically generated sports reports, a relationship that presumably continues to this day.

BTN and Narrative Science?

A year later, and Forbes magazine produced a report in June 2011 about GameChanger and Narrative Science: Fulfilling the Heretofore Unrealized Demand for Stilted Stories About Children’s Games, describing a tie-up between Narrative Science and GameChanger, a company that produces a scorekeeping app that allows sports fans, parents and coaches to capture data about a match.

(What other companies/apps are out there for crowdsourcing sports analytics in this way, I wonder?)

Using GameChanger data and narrative Science story generation tools, it was possible to automate the creation of match reports for small number audiences. I don’t know if these stories used to be freely accessible, but today the match reports appear to take the form of paywalled notion of recap stories.

recap stories commercial

Paywall aside, examples of other stories generated by Narrative Science using GameChanger data can be found using a simple web search on the phrase “Powered by Narrative Science and GameChanger Media”

powered by gamechanger media

You can also just search for the byline, as for example it appears in this report:

Narrative science byline

In passing, it’ll be interesting to see how automatically generated stories start to feed into the glitch aesthetic (h/t @danmcquillan for introducing me to this phrase and the related notion of the new aesthetic in his presentation at #opentech last week).

GLitch aesthetic

September 2011 saw a media outlook report from Mediabistro’s Media Jobs Daily noting that Narrative Science’s ‘Robot Journalists’ Now Tackling Real Estate. The story links through to a page on Builder Online that provides a summary report of housing data for various US cities.

Canned reporting

What this example, and the GameChanger example, show is how the generation of timely text stories can be automated on top of the regularly updated datasets. The use of natural language interpretive text to describe patterns observed in the underlying data presumably also has SEO benefits.

That same month, September 2011, saw another stats-to-insight company, again emerging from the automated interpretation of sports data, renaming itself from StatSheet to Automated Insights. Today, Statsheet continues to publish game recaps combining short natural language summaries with statistical charts, all of which are presumably automatically generated. Within a year, the parent company, Automated Insights had scaled up and begun publishing recaps for Yahoo!’s fantasy sports matches.

More recently, Automated Insights have started producing realtime content feeds to support sports commentators – Real-time Insights for MLB – as well as feeding consumers via the stat.us powered Twitter feeds.

(See also: yseop, a French company that generates automated reports from data. [Any more?])

Fast forward to the start of 2013, and Narrative Science started publishing human readable prose reports based on US schools data (ProPublica: How To Edit 52,000 Stories at Once). They’re also doing a lot more work with financial reporting, for example with Forbes as well as for financial services clients, as this interview with Narrative Science’s Stuart Frankel describes.

narrative science forbes

Generating human readable reports from Google Analytics data and dashboards also appears to be a hot topic, with both Narrative Science (Automated Insight From Google Analytics With Quill) and Automated Insights (With Site Ai, Automated Insights Provides A Cliffs Notes Version Of Your Web Analytics) recently developing tools around this topic.

What I thought was particularly interesting about the ProPublica example was how it suggests a possible widespread future use of “automatically generated insight” pulling out headline interpretations from open data sets, as touched on in this great introductory technical presentation by Narrative Science’s Larry Adams (which also happens to mention the possibility of Narrative Science offering platform services via an API…? It also mentions work with the NHS?):

At one point during that presentation, Larry Adams suggests that Narrative Science use small set of narrative templates or story types (“the horserace” for example, or “top 10″) to frame the construction of their stories, as well as mentioning the sorts of feature that they look for within a data set (trends and changes in trends, for example, or outliers). Another presentation, this time by Narrative Science’s Kris Hammond also hints at some of the features they look for in data: “inflexion points, trends, correlations”.

So what sorts of techniques might we use ourselves to start generating the insights that we might be able to work up into simple narrative sentences, at least for starters?

Top 10, bottom 5 are easy pickings if we can rank the data somehow. I thought this trick for detecting inflexions by coding a time series symbolically and then using a regular expression to detect features was really interesting: Finding patterns in time series using regular expressions. And I wonder, how does the OpenSecrets anomaly tracker define the anomalies it detects?

Other posts you might be interested in:
- The Tesco Data Business – Notes on “Scoring Points”
- More Remarks on the Tesco Data Play


Categories: Faculty

Are We Just Google’s Lab Rats?

Sun, 05/19/2013 - 7:19am

There are some interesting comments relating to my previous post on Google Lock-In Lock-Out in a comment thread on OSnews: Why Google gets so much credit. Here are some of my own lazy Sunday morning notes/thoughts relating to that, and other comments…

- killing Google Reader does not kill RSS/there was no “malicious intent” mapping out the Reader/RSS strategy:

A nice phrase in an #opentech talk yesterday was that we (technologists and engineers and data scientists, for example) have to “act responsibly”. Google Reader helped popularise feed reading when some of us were hopeful for its future (“We ignore RSS at OUr Peril”), and as such attracted many readers away from other clients (myself included), with the result that competition was harder (“compete against Google? Hmm… maybe not…”). Google Reader’s infrastructure and unofficial APIs enabled folk to build services off the back of the Google Reader infrastructure turning it into de facto infrastructure for other peoples’ applications and services. (Remember: the Google Maps API was unofficial at first). There aren’t many OPML bundlers out there, for example, but for hackers into appropriating tech Google Reader is one. Since I moved away from Google Reader (to theoldreader) I haven’t used Flipboard so much, which as far as I was concerned was using Reader essentially as infrastructure. Caveat emptor, I guess, for developers building on top of other companies services (as many Twitter and Facebook app developers keep discovering).

With Feedburner, Google bought up a service that acted as a proxy, taking public syndication feeds, instrumenting them with analytics, and then encouraging the people taking up the syndicated content to subscribe to the Feedburner feed. Where RSS and Atom were designed to support syndication between independent parties, Feedburner – and then Google – insinuated itself between those parties. By replacing self-controlled feeds as the subscription endpoint with Google controlled endpoints, publishers gave up control of their syndication infrastructure. With Google losing interest in open syndication feeds as it pursues its own closed content network agenda, we are faced with a situation whereby Google can potentially trash a widespread syndication infrastructure that would have remained resilient if Google hadn’t insinuated itself into it. Or if we hadn’t been so stupid as to simplistically accept it’s overtures.

Hmmm… thinks… do we need a Google users’ motto? Don’t be stupid perhaps…?!

I applaud Google for developing the services it does, getting them to scale and opening up API access. But as these services become de facto infrastructure, the question of how Google acknowledges any responsibility, that flows from this (even if this responsibility is incorrectly assumed) becomes an issue. Responsibilities arise in other areas too, of course. Such as taxation and corporate transparency. But that’s another issue. (Would Google act differently if its motto was “Be responsible” or “Act responsibly” rather than “Don’t be evil”? It strikes me that “Act responsibly” could work as a motto for both companies and their users?)

It seems to me that with Google+, Google is not adopting open syndication standards in two ways: not using it “internally”, and not making feeds publicly available. There may be good technical reasons for the first, but by the second Google is *not allowing* its community members to participate in a open content syndication network/system. Google’s choice, but I’m not playing.

Google is not killing the open standards by closing off access to them in commercial licensing terms, but it may contribute to stifling their adoption by adopting alternative standards that others feel they have to adopt because of the influence Google has on web traffic.

Consider this other way of looking at it – Google is presumably trying to get other parties to adopt WebP by developing it as an openstandard. Google assumes that it can drive adoption of this as a web standard by adopting it itself. In terms of argumentation, it doesn’t follow that by not adopting something Google can prevent it being adopted, (i.e. not adopting or by stopping its own use of a standard, Google kills it generally) but people follow bad logic all the time (and if they follow Google for their technology choices, or have a technology model based on being parasitic on Google infrastructure, Google’s dropping of a standard effectively kills it for those people) …

- control of what we see

Google makes money by putting ad-links in front of eyeballs that people click on. By presenting “relevant” ads, Google presumably tries to maximise the click-thru rate so that it can make more money per displayed link.

To encourage you to spend your attention on pages that Google controls, Google has adopted the idea that by presenting you (and me; us) with “relevant” content, we are likely to remain engaged. With Google web search, the relevance of search results supposedly attracts us back to the Google search tool. With services such as Google now, Google pre-emptively tries to present you with information it thinks you need, presumably based on predictive models of sequences of action that other people (or you yourself) have demonstrated in the past.

I’m not really up on behavioural psychology models, but I have a vague memory that intermittent reinforcement schedules were demonstrated to be one of the more effect modes of behaviourist training/operant conditioning. So I wonder: how effective are predictive intermittent positive reinforcement schedules. (You get the idea, right? We’re pigeons that peck at Android phones and Google is the experimenter trying to get us to peck the right way, by reinforcing us every now and again by satisfying out intent. That is, has there been in a flip away from Google using us to provide reinforcement training signals to its algorithms in to a situation in which we have become Google’s experimental lab rats that are coupled in a series of ongoing experiments that train us and its algorithms, jointly, together, to maximise… something…)

There is a danger, I think, in Google chasing the “relevance” thing too far, seeing the maximisation of whatever conversion metrics it decides on as being a sign that it has “got things right” for us, that it is satisfying our “intent”. And if operant conditioning does influence the way we behave, maybe we do actually need to start thinking about what the machine algorithms are training us to do. Are training us to do. Training us.

Google’s stated aim is to “organize the world’s information and make it universally accessible and useful”.

- Through web search, it started to organise information it presented to use through search results that were more appealingly ranked (seemed “more relevant”) than the other search engines did.

- Through personalised search, it started to organise the way it presented results to each of us individually.

- Through web tracking, it presents us with information – adverts – organised in a way it presumably thinks are more personally meaningful to use (but maximising what metic exactly? More likely to cause us to act in a particular way, as measured by whether we click the link, or linger on a page, or engage in a particular behaviour that can be captured – for model building and exploitation purposes – by web tracking algorithms?)

- Through Google Now, and the new Google image gallery tools, Google is seeking to organise our information (we’re part of the world, right?) on our behalf and present it back to us in a way that the Google algorithms decide.

The old photos in a drawer back at my family home are sorted howsoever (by whatever algorithm “use” and random access results in). Now they’ll be sorted by Google. Maybe the algorithms are similar. Or maybe they’re not. What would be evil, I think, was if the ranking algorithms that are used to decide the order in which organic information is presented us start to be influenced by the algorithms that are tied to advertising or marketing, that is, to algorithms that are used to try to maximise the extent to which we are influenced in accord with the goals, beliefs, desires and intents of others (with a hat tip there to agent logic and the theories of intelligent software agents).

At the moment I believe that Google believes it is trying to develop algorithms that benefit us personally, in an utilitarian way. But I’m not sure what function it is they are maximising or how they think it maps onto any personal theories or preferences we may have about what is “accessible” and “useful”. I guess we might also ask whether “accessible” and “useful” are the road to a Good Life (because in the end this comes down to philosophy and ethics, doesn’t it?) or whether we should be “organising the world’s information” with some other purpose in mind?

PS Just by the by, it’s worth noting that the educational arena is seeking to use learning analytics to instrumentalise our behaviour and engagement within learning systems and contexts for our, erm, learning benefit. (Measured how?)


Categories: Faculty

Google Lock-In Lock-Out

Thu, 05/16/2013 - 2:13pm

As John Naughton feels obliged to remind folk every now and again, the web is not the internet. Because we all know that for many people, Facebook apparently is. Or Google is.

And as anyone following my tweets over the last year or two will know, I’ve started finding Google more and more irksome.

It’s not just that the one or two people I know who use Google Plus (Google+?) are now all but lost to me as sources of neat ideas because I don’t do Gooplus and it doesn’t do RSS…

It’s not just because Google is shutting down the Google Reader backbone that powers a lot of RSS and Atom syndication feed services (and leaves me wondering: how long is Feedburner for this world? Maybe it’s time to start moving your feeds and trying to get folk off that piece of infrastructure…)…

It’s not just that geocoding done within Fusion Tables is not exported – if you look at a KML feed from Google Fusion Tables, you’ll find there’s no lat-long data there. To get a geo-view, you need to stick in Google Fusion Tables or wire the feed into Google Earth, which will then “initiate geocoding of location descriptions while viewing [the] KML file”…

It’s not just that Google is deprecating gadgets from spreadsheets, which as Martin points out means that if I want to visualise data in a spreadsheet all I’m going to be left with is Google’s crappy charts

It’s not just that Google moved away from using CalDav to support calendar interoperability… (announcement: “CalDAV API will become available for whitelisted developers, and will be shut down for other developers on September 16, 2013. Most developers’ use cases are handled well by Google Calendar API, which we recommend using instead.”)

It’s not just that Google is moving away from using the XMPP instant messaging protocol (and nor, I think, making a move towards using MQTT?)…

It’s not just that Google will be using your photos to create photos you never took and presumably offer them up via your image gallery in favour of photos it thinks aren’t up to scratch…

Though I’m sure that Google wouldn’t start pushing images in just the WebP image format so that you’d feel obliged to use Chrome…

And also in the browser, I’m sure Google wouldn’t start using Google Public DNS as a Chrome default setting. (Is the same true of Chromebook? Presumably folk connected to Google Fiber use Google Public DNS?) But does it use SPDY as a default? How about on Android?

It’s not just that Google will tag your social media posts using tags you might never use yourself, and as it does so altering the externalised memory embodied by that post…

It’s not just that as web search gets increasingly personalised and localised, we lose any sense of Google ground truth; I’m not quite sure how the info-skills trainers are going to address this when training a motley crew of different learners to discover a particular resource other than by using known-item search strategies (which sort of misses the point). Or maybe it’s right that a cohort of students should all get different results when they run ostensibly the same search?

Hmmm.. thinks: if personalised/localised search could be reduced to raw search phrase (whatever I put in the search box) plus a set of invisible search limits that reflect the personalisation/localisation tweaks applied to my search, how might my hidden/invisible search limits compare with yours?

It’s not just that Google uses tax efficient corporate structures to minimise its tax bill, because lots of companies do that…

It’s not just any one of these things, taken on its own merits… it’s all of them taken together…

“Embrace, extend, extinguish”… where have we heard that before?

Drip; drip; drip…

PS see also M. Wunsch on The Great Google Goat Rodeo

PPS Although not an open standard, I forgot this one – Google dropped support for the closed Microsoft ActiveSync protocol (see also Google Sync End of Life)


Categories: Faculty

Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language

Wed, 05/15/2013 - 6:00am

There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structure queries languages… In the post Asking Questions of Data – Some Simple One-Liners, we can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.

To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).

IW Poll spreadsheet

The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned using OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.

To query the data, I’ve popped up a simple query form on Scraperwiki: Google Spreadsheet Explorer

Google spreadsheet explorer

To use the explorer, you need to:

  1. provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
  2. preview the table headings;
  3. construct a query using the column letters;
  4. select the output format;
  5. run the query.

So what sort of questions might we want to ask of the data? Let’s build some up.

We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F

SImple query

We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC

Results in order

Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').

Papers spoilt included

How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC

Postal Turnout

Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to an including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the number for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results ( FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'

Good vote turnout

Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.

How about the three electoral divisions with the lowest turn out? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…

Worst 3 by turnout

The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A

Summing rows in a group

In terms of popularity, who were the top 5 candidates in terms of people receiving the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5

Top 5 by votes cast

How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5

TOp 5 as percentage on roll

Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC

VOtes by party

How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC

Average votes per candidate

To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.


Categories: Faculty

To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info

Wed, 05/08/2013 - 6:50am

In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)

From the Scraperwiki database, we need to interrogate the API:

scraperwiki api

To do this, we’ll use a database query language – SQL.

What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:

select * from support where support in (select candinit from support)

Note that “support” refers to two things here – these are columns:

select * from support where support in (select candinit from support)

and these are the table the columns are being pulled from:

select * from support where support in (select candinit from support)

Here’s the result of Runing the query:

sql select on scraperwiki

We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).

candidates mutual table

There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:

Tidier intra-candidate support map

Simples:-)


Categories: Faculty

More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?

Wed, 05/08/2013 - 5:05am

In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)

This is what we’ll be working towards:

Tidier intra-candidate support map

If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.

scraperwiki council elections - assentors

Here’s a reminder of what the original PDF doc looked like (archive copy):

IW poll notice assentors

Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…

When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last. We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).

We can download a CSV representation of the data from the scraper directly:

Scraperwiki CSV download

The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.

Gephi.org

To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit as target and the column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.

csv rename

Start Gephi, select Data Laboratory tab and then New Project from the File menu.

geohi data lab new project

You should now see a toolbar that includes an “Import Spreadsheet option”:

gephi import spreadsheet

Import the CSV file as such, identifying it as an Edges Table:

import data into gephi data laboaratory

You should notice that the Source and Target columns have been identified as such and we have the choice to import the other column or not – let’s bring them in…

SOurce and Target recognised

You should now see the data has been loaded in to Gephi…

Data loaded in

If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.

mess...

Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:

force layout tweaks

If you look closely, you’ll be able to see that there are many separate groupings of connected circles – this represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two noes are connected when in fact they aren’t…)

Close up simple patterns

However, there are also other groupings in which one candidate may support another:

candidate support

These connections may allow us to see grouping of candidates supporting each other along party lines.

One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree one or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (oit degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:

lots going on

Hmmm.. how about if we extend out filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also rezise the nodes/labels so we can better see the candidates’ names. The Neigbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth two). Which is to say, it will being in the candidates who are supported by candidates, and their supporters:

A few more tweaks

That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:

improve the layout

We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:

colour the groups

Let’s reset the colours and go back to colourings along party lines:

Gephi reset colours

If we go to the Preview view, we can generate a prettified view of the network:

Preview layout

In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:

this is odd

Going back to the Graph view and zooming in, we see that Paul G. taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, I wonder, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)

To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:

Filter on nodes out degree 2

Just that one then…

Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtereed network to date up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.

nested filters

Here’s the result:

Nesting filters in gephi

And again we can go into Preview mode to generate a nice vectorised version of the graph:

Tidier intra-candidate support map

This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).

So there we have it a quick tour of how to use Gephi to look at the co-support structure of group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particular amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check (whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.

If this all seems like too much effort, remembers that there’s always the One-Liner, Thirty Second Route to the Info.

PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)


Categories: Faculty

Ephemeral Citations – When Presentations You Have Cited Vanish from the Public Web

Tue, 05/07/2013 - 1:35pm

A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)

Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.

Since I discovered the presentation, I’ve been recommending it to variou folk, particularly slides 20-22? that refer to the educational example. Trying to find the slidedeck today, a websearch failed to turn it up so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:

no narrative science slideshow

The Wayback machine had grabbed bits and pieces of text, but not the actual slides…

wayback narrative science

Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(

For what it’s worth, here’s the commentary:

Introduction to Narrative Science — Presentation Transcript

We Transform Data IntoStories and Insight…In Seconds
Automatically,Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products Improve Decision-MakingOptimize Customer Interactions
Customer Types Media and Data Business Publishing Companies Reporting
How Does It Work? The Data The Facts The Angles The Structure Stats Tests Calls The Narrative Language Completed Text Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data InTransforming Data into Stories
Sports Journalism: Big Ten Network – NarrativeTransforming Data into Stories
Financial Journalism: Forbes – Data InTransforming Data into Stories
Financial Journalism: Forbes – NarrativeTransforming Data into Stories
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview North America Consumer Services Short Interest Update There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the companys stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the companys shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – OverviewThursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY:The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcarestocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, andthe S&P500 was up 0.4%.Here are a few Healthcare stocks that bucked the sectors downward trend.MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recentlyannounced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRKsvolume was 86.1% lower than usual with 2.5 million shares trading hands. Todays gains still leavethe stock about 11.1% lower than its price three months ago.LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volumewas 34,155. Todays gains still leave the stock 21.8% below its 52-week high of $34.43. The stockremains about 16.3% lower than its price three months ago.Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from theprevious close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data InTransforming Data into Stories
Real Estate: Hanley Wood – NarrativeTransforming Data into Stories
BI: Leading Fast Food Chain – Data InTransforming Data into Stories
BI: Leading Fast Food Chain – Store Level Report January Promotion Falling Behind Region The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match Sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year.Transforming Data into Stories
Education: Standardized Testing – Data InTransforming Data into Stories
Education: Standardized Testing – Study RecommendationsTransforming Data into Stories
Marketing Services & Digital Media: Data InTransforming Data into Stories
Marketing Services & Digital Media: NarrativeTransforming Data into Stories

Bah…:-(

PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?


Categories: Faculty

Questioning Election Data to See if It Has a Story to Tell

Sun, 05/05/2013 - 7:38pm

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way): are the candidates standing in different seats local to that ward/electoral division?. By “local”, I mean – can they vote in that ward by virtue of having a home address that lays within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect living within, or outside of, the electoral division a candidate is standing in, and so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electroal division

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?

One more thing we could try to do is to geocode the postcode of the address of the each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into froms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

sdata

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows
parse latlong to lat

A similar trick can be used to generate a column containing just the longitude data.

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:2222

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Table lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “none-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, although albeit a minor one…

So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possbily! caveat emptor applies!) that might confirm, helpdevelop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.


Categories: Faculty

A Wrangling Example With OpenRefine: Making “Oven Ready Data”

Fri, 05/03/2013 - 7:33pm

As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:

onthewight results

There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:

Slightly tidier

Or how about something like this, that shows the size of the electorate for each ward:

turnout

So how can we get from the OnTheWight results into a ready data format?

Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-v); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:

Paste data into OpenRefine

* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:

_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714

_Votes won by figures - filled in
Lots of the 'Votes won by figures' had the wrong number in them. It's one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn't possible.

'Postal votes (inc)' line inserted between 'Total votes cast' and 'Papers spoilt'

Deleted an empty row from Ventnor West

The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.

import data as TSV no blanks

Here’s what we end up with:

ELection data raw import

The data format I want is has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:

COpy a column

Let’s define the new column as having exactly the same value as the original column:

Create new col as copy of old

Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)

The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:

Facet the numbers

Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.

filter and blank

Then remove the numeric facet filter:

filter update

Here’s the result, much tidier:

Much tidier

Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the TExt filter to highlight the rows:

Selectt OpenRefine filter

Simply state the value you want to filter on and blitz the matching rows…

CHoose rows then blitz them

…then remove the filter:

then remove the filter

We can now fill down a the blanks in the Electoral Division column:

Fill down on Electoral Division

Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.

Filled down - now flag unwanted row

Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:

facet on flagged row

Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.

Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:

Remove header rows

And then this time just retain the non-numeric rows.

Electoral ward properties

Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:

Rename column

Like so:

columns renamed

Now let’s just make those comma mangled numbers into numbers, by transforming them:

transform the cells by removeing commas

The transform we’re going to use is to replace the comma by nothing:

replace comma

Then convert the values to a number type.

then convert to number

We can the do the same thing for the Number on Roll column:

reuse is good

We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.

Final stages of electroal division data

There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.

a little more tidying to do

Having done that, we can take the opportunity to export the data.

openrefine exporter

Using the custom tabular exporter, we can select the columns we wish to export.

Export column select

Then we can export the data to the desktop as a file in a variety of formats:

OPenrefine export download

Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:

OPenRefine upload to goole

Here’s the data I uploaded.

If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)

change filter settings

When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:

reorder columns

Here’s what I uploaded to a Google spreadsheet:

Spreadsheet

[OpenRefine project file]

So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…

Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.


Categories: Faculty

UK Local Elections 2013 – Live Data, Live Results Maps…

Thu, 05/02/2013 - 10:47am

I wonder whether this will be the last round of elections without a national live data feed from somewhere pushing out the results in a standardised form? So far, here are the local “live election data” initiatives I’ve spotted/had pointed out to me:

Lincolnshire

The Lincolnite – Lincolnshire Local Elections 2013, described here: The Lincolnite to cover elections live with interactive map. (Ex-?) University of Lincoln developer Alex Bilbie (@alexbilbie), who built the Lincolnshire map app, describes a little of the process behind it here Developing an interactive county council election map (part one).

Lincolnshore live election map

Warwickshire

2013 Elections: Warwickshire area, described here: 2013 Elections – In Real Time.

warwickshire live election map

The Warwickshire team are also making shapefiles/KML files (for the plotting of boundary line maps) and live results data (via Google Fusion Tables) too, as well as making data available about previously elected candidates: 2013 Elections – In Real Time

Here’s the map after the fact… I like the clear statement of seats by part in the bottom left corner too…

Warwickshire

Surrey

Surrey has the Surrey elections dashboard (via @BenUnsworth) that will switch to a live map as the results come in, but currently offers a search box that accepts a postcode and then tells you who your candidates are and where you can vote:

Surrey elections dashboard

Kent
This looks pretty, from Kent County Council:

Kent live election results

I was looking forward to seeing how this view played out once the results started to come in, but, erm, oops?!

oops - Kent

Managing the bursty load on an election results service server is probably something worth building into the planning…

Bristol
Bristol opted for a tabbed display showing a range of views over their council elecitons results. A simple coloured symbol maps shows the distribution of seats by ward across the parties:

Bristol election map

Bristol also provided a view over the turnout:
Bristol turnout

(Would it be useful to also be able to see this as percentage turnouts? Or to depict the proportional turnout on a map to see if any geographical reasons jump out as a source of possible differences?)

Cumbria
Cumbria County Council show how to make use of boundary files to mark or <choropleth election maps relate the party affiliation of the candidate taking each particular seat by electoral division area:

Cumbria election map

Cumbria also provided a view of seat allocations in districts; I don’t understand the scale they used to on the x-axes though? It differs from district to district. For example, it looks to me as if more seats went to Conservatives in Eden than in Carlise? Or is the scale related to the percentage of seats in the district? I’d class as “infographic-standard”, i.e. meaningless as a visualisation;-)
Cumbria district summary  - bad scales?

Norfolk

Norfolk’s election map looks quite, erm, “child-friendly” (chunky?! kids TV?) to me?

Norfolk election map

Norfolk also produced a graphic showing how seats might be distributed in the chamber:

Norfolk seats

I think one of the major issues with this sort of graphic is how you communicate the possible structurings of the chamber based on what sort of affiliations and groupings play out?

Wales Online

Wales online have a nice clean feel to their results map for Anglesey, but what’s going on with the legend? They don’t make it easy to get the branding into the screengrab either?!

Wales online report - Anglesey

National Reporting

THe Telegraph produced a map showing results of the elections at national scale based on control of councils by party:

Telegraph

And ever helpful, the Guardian datablog made the data available (will they do data broken down at seat level too, I wonder?) Here’s the map they produced:

guardian elecotion map

Hmmm… do I recognise that sort of layout? Ah, I know, it reminds me of this example of Data Journalists Engaging in Co-Innovation… around boundary changes.

Other…
For lists of current councillors, see OpenlyLocal, which has data available via a JSON API relating to current councillors and their affiliations. (It would be good if a frozen snapshot of this could be grabbed today, for comparison with the results following today’s election?)

This may also be of interest… UK Data Service British Local Election Database, 1889-2003 and Andrew Teale’s Local Elections Archive Project.

Data relating to general elections can be found on the Electoral Commission website: General Elections. TheyWorkForYou provide an API over current MPs by constituency, and MySOciety also produce the MapIt service for accessing constituency and electoral division boundary line data files.

If you’re interested in doing something data related around the election, or would like to learn how to do something with the data generated by the election, check out this informal resource co-ordination document. If you’re interested in checking out your local council website to see whether they publish any #opendata that would help you generate you own live maps, dashboards or consoles, the School of Data post “wot I wrote” on Proving the Data – A Quick Guide to Mapping England and Wales Local Elections may provide you with a quick start guide to making use of some of it…

If you know of any other councils or local presses publishing election related data warez, maps, live data feeds, etc, please post a link and brief description in the comments, and I’ll try to keep this post up to date…


Categories: Faculty

Boundary Files for Electoral Wards Covered by a Particular Geography

Thu, 05/02/2013 - 5:02am

A week or so ago, I went looking for boundary lines for UK electoral wards, with half a mind towards trying to do something with them related to this week’s local council elections. One source I came across was the UK 2011 Census, (2011 Census geography products for England and Wales) which publishes the appropriate data from across England and Wales in large single shapefiles. For the amateur cartographer (which I very much am!) wanting to work at a local level, this presents something of a challenge: not only do I have to find a way of downloading and opening the large dataset, I also need to find a way of extracting from it the data for my local area, which is what I actually want. (Getting the local data from the national dataset can be blocking, in other words.)

Another source of data is MySociety’s MapIt service, which provides data about various geographies, including electoral wards, covered by other geographical areas:

mapit areas covered

as well as boundary files for each geography in a variety of formats:

Mapit Geometry

Unfortunately, the MapIt API doesn’t (yet?!) support the ability to generate a single file that contains boundary data for all the electoral wards in a single council or local authority area. So here’s a quick hack, posted as a view on Scraperwiki, that generates a single KML file for the electoral wards contained by a council area (including the wider boundary of that council area) – KML Merge Test. (There are probably better ways of doing this?! I would if I should try to make sense of the MapIt code to see if I can work out how to submit a proper patch…)

import scraperwiki,simplejson,urllib2 from lxml import etree from copy import deepcopy #--via @mhawksey # query string crib https://views.scraperwiki.com/run/python_querystring_cheat_sheet/? import cgi, os qstring=os.getenv("QUERY_STRING") key='65791' #Use the Isle of Wight as a default typ='UTE' #typs are: #CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency) if qstring!=None: get = dict(cgi.parse_qsl(qstring)) if 'key' in get: key=get['key'] if 'typ' in get: typ=get['typ'] #--- #Get a stub KML file for the local council level url='http://mapit.mysociety.org/area/'+str(key)+'.kml' xmlraw = urllib2.urlopen(url).read() xml=etree.fromstring(xmlraw) #Get the list of electoral wards covered by that council area wards=simplejson.load(urllib2.urlopen('http://mapit.mysociety.org/area/'+str(key)+'/covers?type='+typ)) #Get the KML for each ward, extract the Placemark data, and add it to our comprehensive KML tree for ward in wards: url='http://mapit.mysociety.org/area/'+ward+'.kml' xmlraw = scraperwiki.scrape(url) xml2=etree.fromstring(xmlraw) p= xml2.xpath('//geo:Placemark',namespaces={'geo':'http://www.opengis.net/kml/2.2'}) xml.append( deepcopy(p[0] )) scraperwiki.utils.httpresponseheader("Content-Type", "text/xml") print etree.tostring(xml)

The key value is the ID of the council area within which you want to find the electoral wards. So for example the Isle of Wight parliamentary constituency page -http://mapit.mysociety.org/area/65791.html – gives us the ID 65791, which we pass as an argument to the Scraperwiki view. The subdivision we want to grab data for is given by the typ parameter:

CTY (county council), CED (county ward), COI (Isles of Scilly), COP (Isles of Scilly parish), CPC (civil parish/community), CPW (civil parish/community ward), DIS (district council), DIW (district ward), EUR (Euro region), GLA (London Assembly), LAC (London Assembly constituency), LBO (London borough), LBW (London ward), LGD (NI council), LGE (NI electoral area), LGW (NI ward), MTD (Metropolitan district), MTW (Metropolitan ward), NIE (NI Assembly constituency), OLF (Lower Layer Super Output Area, Full), OLG (Lower Layer Super Output Area, Generalised), OMF (Middle Layer Super Output Area, Full), OMG (Middle Layer Super Output Area, Generalised), SPC (Scottish Parliament constituency), SPE (Scottish Parliament region), UTA (Unitary authority), UTE (Unitary authority electoral division), UTW (Unitary authority ward), WAC (Welsh Assembly constituency), WAE (Welsh Assembly region), WMC (UK Parliamentary constituency)

So for example, here’s a link to an aggregate KML file for Unitary authority electoral divisions (UTE) on the Isle of Wight – https://views.scraperwiki.com/run/kml_merge_test/?key=65791; and here’s one for Unitary authority wards (UTW) in Milton Keynes: https://views.scraperwiki.com/run/kml_merge_test/?key=2588&typ=UTW

If you save the resulting file as a .kml file (for example, as kml_merge_test_mk.kml) you can then load it into something like Google Fusion tables to view it:

GOogle fusion table map

Note that the MapIt API is rate limited (I think), so be gentle ;-)


Categories: Faculty

Simple Map Making With Google Fusion Tables

Wed, 05/01/2013 - 5:03pm

A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…

Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:

data in a fusion table

Now let’s map it:-)

First of all we need to tell the application which column contains the data we want to geocode – that is, the addrerss we want Fusion Tables to find the latitude and longitude co-ordinates for…

tweak the column

Then we say we want the column to be recognised as a column type:

change name make location

Computer says yes, highlighting the location type cells with a yellow background:

fusion table.. yellow...

As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as apart of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)

map tab...

Once the points have been geocoded, they’re placed onto a map:

mapped

We can now publish the map in preparation for sharing it with the world…

publish map

We need to change the visibility of the map to something folk can see!

privacy and link

Public on the web, or just via a shared link – your choice:

make seeable

Here’s my map:-)

The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.


Categories: Faculty

A Simple OpenRefine Example – Tidying Cut’n'Paste Data from a Web Page

Wed, 05/01/2013 - 4:23pm

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t/ The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column

Here’s the data – neat and tidy:-)

Neat and tidy...

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)


Categories: Faculty

A Quick Peek at Some Charities Data…

Wed, 05/01/2013 - 7:06am

As part of the Maximising Your Data Impact launch event of the Civil Society Data Network (reviewed in part by the Guardian here: Open data and the charity sector: a perfect fit), a series of mini-data expeditions provided participants with the opportunity to explore a range of data related questions relevant to the third sector. This was my first data exploration, and rather than deep dive into the data (which I probably should have done!) I was rather more interested in getting a feel for how potential data users actually discussed data related questions.

The topic my group was exploring was how to decide whether it made more sense to make a donation to a small or large charity. Unpacking this question a little forces a consideration of what the donor perceives to be a good use of their donation. To try to make the question more tractable, we also focussed on a particular scenario: how might a donor wishing to contribute to a charity related in some way to the hospice movement draw comparisons between them in order to inform their decision.

A scan of the Charity Commission website reveals that information is available for many charities relating to the size of the organisation as given by the number of staff or number of volunteers, as well as a certain amount of financial reporting:

charity commission comparison

Unfortunately, there doesn’t seem to be a way of comparing charities across a sector, nor does the data appear to be available as such. However, OpenCharities does make the data available in a rather more convenient form (that is, in a machine readable form) at web addresses/URLs of the form http://opencharities.org/charities/OPENCHARITIESID.json.

opencharities json

OpenCharities also publish a CSV file (a simple text based tabular data file) that contains crude summary information about UK registered charities, including such things as the charity name, its OpenCharities ID number, a summary description of each charity’s major activities, its registered address, and its social media profile information. If we grab this data, and pull from it the charities that are of interest to us, we can then use the OpenCharities IDs to create URLs from which we can pull the more detailed data.

Grabbing the CSV data file as charities.csv, we can filter out the rows containing items relating to hospices. From a Linux/Mac terminal command line, we can use the grep tool to grab rows that mention “hospice” somewhere in the line, and then create a new file using those rows appended to the column header row from the original file, putting the result into the file hospices.csv:

grep hospice charities.csv > tmp.csv
head -1 charities.csv | cat tmp.csv > hospice_charities.csv

and then upload the resulting filtered file to a Google spreadsheet.

Alternatively, we could load the whole OpenCharities CSV file into a tool such as OpenRefine and then filter the rows using the text filter on an appropriate column to select just rows mentioning “hospice” within a given column. Using the Custom Tabular Export, we could then upload the data directly to Google Spreadsheet.

Having got a list of hospice related charities into a Google Spreadsheet, we can Share the spreadsheet and also publish it (form the File menu). Publishing Google Spreadsheets makes the data available in a variety of formats, such as CSV data, via a web location/URL.

Our recipe so far is as follows:

- get CSV file of charities on OpenCharities;
- filter the file to get charities associated with hospices;
- upload the filtered file to Google Spreadsheets;
- publish the spreadsheet so we have it available as CSV at known URL.

The next step is to grab the more detailed data down from OpenCharities using the OpenCharities ID to construct the web addresses for where we can find that data for each hospice. We could use a tool such as OpenRefine to do this, but instead I’m going to write a short programme on Scraperwiki (OpenCharities Scraper) using the Python programming language to do the task.

#We need to load in some programme libraries to make life easier import scraperwiki,csv,urllib2,json #Here's where we load in the spreadsheet data published on Google Spreadsheets as a CSV file charities = csv.DictReader((urllib2.urlopen('https://docs.google.com/spreadsheet/pub?key=0AirrQecc6H_vdFVlV0pyd3RVTktuR0xmTTlKY1gwZ3c&single=true&gid=1&output=csv'))) #This function will grab data about a charity from OpenCharities given its OpenCharities ID def opencharitiesLookup(id): url = 'http://opencharities.org/charities/'+id+'.json' jsondata = json.load(urllib2.urlopen(url)) return jsondata #This routine cycles through each row of data/charity pulled from the spreadsheet for charity in charities: #print charity #For each charity, I'm going to pull out several data fields that we'll then save to a database of our own #We start with some info from the spreadsheet - charity ID, name and activity summary data={'cid':charity['charity_number']} for tmp in ['title','activities']: data[tmp]=charity[tmp] #Here's where we pull in the more detailed data for each charity jdata = opencharitiesLookup(charity['charity_number']) #Then we start to go hunting for the data... chdata = jdata['charity'] fdata = chdata['financial_breakdown'] #The data will include employee and volunteer numbers.. for tmp in ['volunteers','employees']: data[tmp] = chdata[tmp] #...as well as financial information for tmp in ['assets','spending','income']: if fdata != None and tmp in fdata: for tmp2 in fdata[tmp]: data[tmp+'_'+tmp2] = fdata[tmp][tmp2] #print data #Here's where we save all the data for a charity to a database scraperwiki.sqlite.save(unique_keys=['cid'], table_name='hospices', data=data)

When we run the scraper, it looks up the hospice data in the Google Spreadsheet, gets the richer data from OpenCharities, and pops it into a local database on Scraperwiki from where we can download the data:

scraperwiki opencharities

We can now download this enriched data as a CSV file and then, from Google Drive:

open a fusion table from drive

upload it to Google Fusion tables.

The recipe has now been extended as follows:

- pull the list of hospices into Scraperwiki from the Google Spreadsheet
- for each hospice, create an OpenCharities URL that points to the detailed data for that charity
- for each hospice, grab the corresponding data from that OpenCharities URL
- for each hospice, pull out the data elements we want and pop it into a Scraperwiki database
- download the database as a CSV file that now contains detailed data for each charity
- upload the detailed data CSV file to Google Fusion Tables

With the data in Google Fusion Tables, we can now start to analyse it and see what stories it might have to tell:

import csv to fusion table

Here it is…

data in fusion tables

We can then start to build charts around the data:

fusion tables add chart

Using the chart builder, you can specify which quantities to plot against each axis:

building up a fusion chart

(?I don’t know how to associate the title column (name of each charity) with points on the chart, so that when we hover over a point we can see which charity it relates to?)

We can also define out own calculations, much as you would in a spreadsheet.

fusion table add formula

For example, we might be interesting in knowing the ratio of income received from voluntary contributions versus the amount spent soliciting voluntary contributions. If this ratio is greater than 1, more money is spent soliciting these contributions than is received as a result. If the ratio is small (close to zero) then either the money spent soliciting voluntary contributions is effective, or voluntary income is being generated by other other means…

fusion table calc formula

We can then use one or more filters to explore which hospices meet different criteria. For example, how about organisations that seem to get a good voluntary income return versus spend on generating voluntary income, with the proviso that the voluntary income is less than £250,000:

build up filter

Along similar lines, we might want to calculate the ratio of volunteers to employees, and then then view the data for organisations with a certain number of employees (10 to 50, for example), or a certain number of volunteers (less than 500, say).

This is now the point at which story making – story generation, and data story telling – can start to kick in, based on the questions we want to ask of the data, or the stories we want to try to get it to tell us…


Categories: Faculty

Viz to Nothing… Forum Structure

Thu, 04/25/2013 - 8:28am

As a “shot to nothing”, it being that snooker time of year. A minimal netwrok visualisation of some of the #octel forum activity, which has a structure of: forum(-forum)-topic-reply. Replies can be threaded in a topic, in which case the replies have the same parent but also carry an incremental “menu_order” attribute that gives the accession order of the reply within that thread (I think?).

A post_type attribute identifies whether an entry is a forum, topic or reply.

I grabbed three columns of data defining an edge list – parentID, postID, post_type (forum, topic or reply), and threw the data into Gephi. Different layout types bring out different structural elements, but I just wanted something quick and dirty to look at the data in “unthreaded” form to see how we might make use of the thread/menu_order data.

Here’s an example – the whole structure (not very compellingly laid out, it has to be said):

forums, topics, posts

We can also filter by edge type to get a hint of the actual structure, for example, the forum structure (i.e. how forums and subforums (child forums) relate to each other):

forum filter

Or the topic structure (how topics relate to forums):

topic filter

Or the reply structure (how replies fall into topics):

reply filter

Note that this is the one we need to work on, in respect of finessing the data to allow us to see the hierarchical nature of the thread structure within a topic.

Anyway, as I said, a shot to nothing, and something to ponder representation wise as I walk the dog…;-)


Categories: Faculty

Fragmentary Observations from the Outside About How FutureLearn’s Developing

Thu, 04/25/2013 - 5:46am

I’m outside the loop on all matters FutureLearn related, so I’m interested to see what I can pick up from fragments that do make it onto the web.

So for example, from a presentation by Hugh Davis to the M25 Libraries conference April 2013 about Southampton’s involvement with FutureLearn, Collaboration, MOOCs and Futurelearn, we can learn a little bit about the FutureLearn pitch to partners:

FutureLEarn Overview

More interesting, I think, is this description of what some of the FutureLearn MOOCs might look like:

MOOC Structure

“miniMOOCs” containing 2 to 3 learning units, each 2-6 hours of study time, broken into 2-3 self-contained learning blocks (which suggests 1-2 hours per block).

So I wonder, based on the learning block sequence diagram, and the following learning design elements slide:

learning design

Will the platform be encouraging a learning design approach, with typed sequences of blocks that offer templated guides as to how to structure that sort of design element? Or is that way off the mark. (Given the platform is currently being built, (using Go Free Range for at least some of the development, I believe), it’s tricky to see how this is being played out, given courses and platform both need to ready at the same time, and it’s hard to write courses using platform primitives if the platform isn’t ready yet?)

Looking elsewhere (or at least, via @patlockley), we may be able to get a few more clues about the line partners are taking towards FutureLearn course development:

futurelearn job ad - LEeds

Hmm, I wonder – would it be worth subscribing to jobs feeds from the partner universities over the next few months to see whether any other FutureLearn related posts are being opened up? And does this also provide an opportunity for the currently rather sparse FutureLearn website to start promoting those jobs ads? And come to that, how come the jobs that have been appointed at FutureLearn weren’t advertised on the FutureLearn website…?

Because jobs have been appointed, as LinkedIn suggests… Here’s who’s declaring an association with the company at the moment:

futurelearn on linkedIN

We can also do a slightly broader search:

futurelearn search

There’s also a recently closed job ad with a role that doesn’t yet appear on anyone’s byline:

global digital marketing sstrategist

So what roles have been filled according to this source?

  • CEO
  • Head of Content
  • Head of UK Education & HE Partnerships
  • CTO
  • Senior Project Manager / Scrum Master (Contract)
  • Agile Digital Project Manager
  • Product manager
  • Marketing and Communications Assistant
  • Interim HR Consultant
  • Learning Technologist
  • Commercial and Operations Director for Launch
  • Global Digital Marketing Strategist

Here’s another one, Academic Lead [src].

By the by, I also notice that the OU VC, Martin Bean, has just been appointed as a director of FutureLearn Ltd.

Exciting times, eh…?!;-)

Related: OU Launches FutureLearn Ltd

PS v loosely related (?!) – (Draft) Coursera data export policy

PPS I also noticed this the other day – OpenupEd (press release) an EADTU co-ordinated portal that looks like a clearing house for OER powered MOOCs from universities across the EU (particularly open universities, including, I think, The OU…;-)


Categories: Faculty

When Open, Cross-Platform Tools Aren’t

Wed, 04/24/2013 - 5:43am

One of the myths I tell myself is that by trying to only use tools that are free to download, based on open source code and work cross-platform in the sense of working on both Windows and Macs (I tend to assume if it runs on a Mac, it’ll also run on Linux), I’m choosing tools to use tools that anyone will be able to use.

This casual thinking is compounded both by working in the education sector, and by the “personal research: label I apply to much of what I do. That is, I don’t work for a private company to use these tools to produce something that will be sold for profit.

So when I read Martin Belam’s review of the FT’s Emily Cadnam’s thoughts on data journalism in a news:rewired panel last week, I was bumped into rethinking things:

Along the way she made a little point with a big implication, certainly one that doesn’t always occur to me. Because the FT is a paid for product, they are actually restricted from using a lot of open source and free tools from the web, which have licenses that forbid commercial use. As an unintended consequence of having a website with a subscription model, they’ve had to make their own versions of several fundamental tools that others might take for granted.

Hmmm… so which of the tools and techniques that I try to advocate are actually closed for journalistic use in commercial contexts?

PS Thinks: when I download and use a software tool, accepting terms and license of use conditions as I do so, under what legal framework does the tool publisher claim ownership of the software application and the right to issue and enforce the license conditions or terms of use? Copyright applies to the source code, and I guess the compiled code? So to the extent that a musician may own copyeight over an expression (recording) of a song, maybe a software publisher owns copyright in the executable version of a piece of software, and running that application is like listening to a piece of recorded music. And just as you are limited in the extent to which you can play recorded music (not to a large audience, for example?) presumably similar conditions can be made to apply to the use to which you put a piece of software? (See for example, The Closed Route to Open Data.)

If the route is by copyright, then is there any notion of a fair dealing exception in the use of tools for the purpose of reporting the news?! (A ludicrous thought, but I thought I’d capture it anyway…;-)

PPS Hmmm, I wonder.. I can imagine patent trolls thinking this through…:

  • teach a man to fish, get a royalty payment each time he catches one;
  • sell a man a fishing rod, get a royalty payment each time he catches a fish;
  • sell a man a fishing rod, get a royalty payment each time he uses it;
  • teach a man the idea of fishing, get a royalty payment every time he catches a fish;
  • teach a man the idea of fishing, get a royalty payment every time he catches anything (use of idea in derived work..);

There must be an edu-startup killer trollable patent in there somewhere?!;-)


Categories: Faculty

Mapping Corporate Networks With OpenCorporates

Tue, 04/23/2013 - 5:17am

I was due to be at #odw13 today, but circumstances beyond my control intruded…

The presentation I was down to give related to some of the things we could do with company data from OpenCorporates. Here’s a related thing that covers some of what I was intending to talk about…

(I’m experimenting with a new way of putting together presentations by actually writing notes for each slide. Please let me know via the comments whether you think this approach makes my slidedecks any easier to understand!)


Categories: Faculty

A Few More Thoughts on the Forensic Analysis of Twitter Friend and Follower Timelines in a MOOCalytics Context

Mon, 04/22/2013 - 10:35am

Immediately after posting Evaluating Event Impact Through Social Media Follower Histories, With Possible Relevance to cMOOC Learning Analytics, I took the dog out for a walk to ponder the practicalities of constructing follower (or friend) acquisition charts for accounts with only a low number of followers, or friends, as might be the case for folk taking a MOOC or who have attended a particular event. One aim I had in mind was to probe the extent to which a MOOC may help developing social ties between folk taking a MOOC, whether MOOC participants know each other prior taking the MOOC, or whether they come to develop social links after taking the MOOC. Another aim was simply to see whether we could identify from changes in velocity or makeup of follower acquisition curves whether particular events led either to growth in follower numbers or community development between followers.

To recap on the approach used for constructing follower acquisition charts (as described in Estimated Follower Accession Charts for Twitter, and which also works (in principle!) for plotting when Twitter users started following folk):

  • you can’t start following someone on Twitter until you join Twitter;
  • follower lists on Twitter are reverse chronological statements of the order in which folk started following the corresponding account;
  • starting with the first follower of an account (the bottom end of the follower list), we can estimate when they started following the account from the most recent account creation date seen so far amongst people who started following before that user.

A methodological problem arises when we have a low number of followers, because we don’t necessarily have enough newly created (follower) accounts starting to follow a target account soon after the creation of the follower account to give us solid basis for estimating when folk started following the target account. (If someone creates a new account and then immediately uses it to follow a target account, we get a good sample in time relating to when that follower started following the target account…If you have lots of people following an account there’s more of a chance that some of them will be quick-after-creation to start following the target account.)

There may also be methodological problems with trying to run an analysis over a short period of time (too much noise/lack of temporal definition in the follower acquisition curve over a limited time range).

So with low follower numbers, where can we get our timestamps from?

In the context of a MOOC, let’s suppose that there is a central MOOC account with lots of followers, and those followers don’t have many friends or followers (certainly not enough for us to be able to generate smooth – and reliable – acquisition curves).

If the MOOC account has lots of followers, let’s suppose we can generate a reasonable follower acquisition curve from them.

This means that for each follower, fo_i, we can associate with them a time when they started following the MOOC account, fo_i_t. Let’s write that as fo(MOOC, fo_i)=fo_i_t, where fo(MOOC, fo_i) reads “the estimated time when MOOC is followed by fo_i”.

(I’m making this up as I’m going along…;)

If we look at the friends of fo_i (that is, the people they follow), we know that they started following the MOOC account at time fo_i_t. So let’s write that as fr(fo_i, MOOC)=fo_i_t, where fr(fo_i, MOOC) reads “the estimated time when fo_i friends MOOC”.

Since public friend/follower relationsships are symmetrical on Twitter (if A friends B, then B is at that instant followed by A), we can also write fr(fo_i, MOOC) = fo(MOOC, fo_i), which is to say that the time when fo_i friends MOOC is the same time as when MOOC is followed by fo_i.

Got that?!;-) (I’m still making this up as I’m going along…!)

We now have a sample in time for calibrating at least a single point in the friend acquisition chart for fo_i. If fo_i follows other “celebrity” accounts for which we can generate reasonably sound follower acquisition charts, we should be able to add other timestamp estimates into the friend acquisition timeline.

If fo_i follows three accounts A,B,C in that order, with fr(fo_i,A)=t1 and fr(fo_i,C)=t2, we know that fr(fo_i,B) lies somewhere between t1 and t2, where t1 < t2, let’s call that [t1,t2], reading it as [not earlier than t1, not later than t2]. Which is to say, fr(fo_i,B)=[t1,t2], or “fo_i makes friends with B not before t1 and not after t2″, or more simply “fo_i makes friends with B somewhen between t1 and t2″.

Let’s now look at fo_j, who has only a few followers, one of whom is fo_i. Suppose that fo_j is actually account B. We know that fo(fo_j,fo_i), and furthermore that fo(fo_j,fo_i)=fr(fo_i,fo_j). Since we know that fr(fo_i,B)=[t1,t2], and B=fo_j, we know that fr(fo_i,fo_j)=[t1,t2]. (Just swap the symbols in and out of the equations…) But what we now also have is a timestamp estimate into the followers list for fo_j, that is: fo(fo_j,fo_i)=[t1,t2].

If MOOC has lots of friends, as well as lots of followers, and MOOC has a policy of following back followers immediately, we can use it to generate timestamp probes into the friend timelines of its followers, via fo(MOOC,X)=fr(X,MOOC), and its friends, via fr(MOOC,Y)=fo(Y,MOOC). (We should be able to use other accounts with large friend or follower accounts and reasonably well defined acquisition curves to generate additional samples?)

We can possibly also start to play off the time intervals from friend and follower curves against each other to try and reduce the uncertainty within them (that is, the range of them).

For example, if we have fr(fo_i,B)=[t1,t2], and from fo(B,fo_i)=[t3,t4], if t3 > t1, we can tighten up fr(fo_i,B)=[t3,t2]. Similarly, if t2 < t4, we can tighten up fo(B,fo_i)=[t3,t2]. Which I think in general is:

if fr(A,B)=[t1,t2] and fo(B,A)=[t3,t4], we can tighten up to fr(A,B) = fo(B,A) = [ greater_of(t1,t3), lesser_of(t2,t4) ]

Erm, maybe? (I should probably read through that again to check the logic!) Things also get a little more complex when we only have time range estimates for most of the friends or followers, rather than good single point timestamp estimates for when they were friended or started to follow…;-) I’ll leave it as an exercise for the reader to figure hout how to write that down and solve it!;-)]

If this thought experiment does work out, then a several rules of thumb jump out if we want to maximise our chances of generating reasonably accurate friend and follower acquisition curves:

- set up your MOOC Twitter account close to the time you want to start using it so it’s creation date is as late as possible;
- encourage folk to follow the MOOC account, and follow back, to improve the chances of getting reasonable resolution in the follower acquisition curve for the MOOC account. These connections also provide time-estimated probes into follower acquisition curves of friends and friend acquisition curves of followers;
- consider creating new “fake” timestamp Twitter accounts than can immediately on creation follow and be friended by the MOOC account to place temporal markers into the acquisition curves;
- if followers follow other celebrity accounts (or are followed (back) by them), we should be able to generate timestamp samples by analysing the celebrity account acquisition curves.

I think I need to go and walk the dog again.

PS a couple more trivial fixed points: for a target account, the earliest time at which they were first followed or when they first friended another account is the creation date of the target account; the latest possible time they acquired their most recent friend or follower is the time at which the data was collected.


Categories: Faculty