Data Collection

This page really needs one of those old-school "under construction" images (and, for this weekend, an "under snow" one too).

A warning

The first thing to note is that Twitter does not guarantee to return all matching Tweets, whatever method is used (unless you have direct access to their "fire hose", which I don't). There is no public information, to my knowledge, about exactly how Twitter determines which tweets to include or exclude; the best I have found is the documentation on Twitter Search Best Practises. This filtering should be taken into account when looking at the analysis presented here. You can see the result of this when comparing the values I get for the most-popular retweets with those from Twitter; my retweet counts are up to 10 percent lower.

What is a sensible search?

A few years ago there was discussion amongst the attendees whether to use #AAS or #AAS<meeting number>, but fortunately the AAS have been promoting the use of the latter for the last few meetings. This makes data collection easier (assuming people read the tweets of the AAS Office!). As I also wanted to follow the participation in the first ever hack day at AAS, I decided to track the following three terms: aas221, aas and 221, and hackaas. Note that case does not matter, that I am not actually searching on the hash tags, and having the middle term means that I would select a tweet saying "The 221st meeting of the AAS was awesome!". An open question is how much irrelevant material was returned by this approach; anecdotal evidence suggests the rate is low but I should try to quantify this.
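
The way the three terms combine can be sketched in a few lines of Python (this is not the Haskell code used for the actual search). It is only an illustration under simplifying assumptions: a comma separates alternatives, a space inside a term means every word must be present, and matching is a case-insensitive substring test. Twitter's real tokenization rules are not reproduced here, and the name matches_track is made up for this example.

```python
# The three terms as a single track string: commas mean OR, spaces mean AND.
TRACK = "aas221,aas 221,hackaas"

def matches_track(text, track=TRACK):
    """Return True if text satisfies at least one phrase of the track string.

    Assumption: plain case-insensitive substring matching, which is only
    an approximation of what Twitter does.
    """
    lowered = text.lower()
    for phrase in track.lower().split(","):
        words = phrase.split()
        # Every word of the phrase must appear somewhere in the text.
        if words and all(word in lowered for word in words):
            return True
    return False
```

Under these assumptions matches_track("The 221st meeting of the AAS was awesome!") is true because of the middle term, whereas a tweet that only mentions "AAS" is not selected.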

As an aside, the use of #AAS for the meetings, although having the advantage of saving 3 characters, can lead to a lot of noise due to unrelated tweets. I did not try it this year but it has been the case in previous years.

For users following along with a meeting, I would suggest using the search #aas<meeting number> -RT to hide re-tweets, since they form a significant fraction of the volume, but as I am interested in seeing what is re-tweeted, and by whom, I've left these in.

How did I search?

Unlike the AAS 219 meeting, my primary analysis was based on a search using the Twitter Streaming API. The code to do this is available at my astrosearch bitbucket account. The process requires running the astroserver, which acts as the database, and then astrosearch, which deals with the Twitter search. Building it is likely to be non-trivial since it is written in Haskell, which most astronomers will not have installed on their system, and it needs several patched modules, which are listed in the README. In previous searches this approach was not robust; rather than fixing the issues I ended up treating the search service as a daemon which would be automatically restarted whenever it crashed. This only happened once during the run, much later than the main conference, and led to a down time of less than 2 seconds.
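
The restart-on-crash idea can be sketched as a small supervisor loop. This is a minimal Python illustration, not the mechanism actually used for the run (which is not described here); the function name supervise and its max_restarts and pause parameters are assumptions made for the example.

```python
import subprocess
import sys
import time

def supervise(command, max_restarts=None, pause=1.0):
    """Run command, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts that were performed. A clean exit
    (status 0) ends the supervision; max_restarts=None means no limit.
    """
    restarts = 0
    while True:
        status = subprocess.call(command)
        if status == 0:
            return restarts  # clean exit: nothing to restart
        if max_restarts is not None and restarts >= max_restarts:
            return restarts  # give up after the allowed number of restarts
        restarts += 1
        time.sleep(pause)  # short pause so a broken program does not spin
```

The command argument is whatever list of strings starts the search program. A short pause keeps the down time small while avoiding a tight loop if the program fails immediately on start-up.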

Unlike previous meetings I did not run a search using the Twitter REST API with my grabtweets code since it was not needed. However, I did take advantage of the TAGSExplorer archive and visualization system which does use the REST API.

I did not use the Archivist site since this service seems to have evolved since I last used it, and didn't offer something that met my needs.

When was the search run?

The search started at 2013-01-02 14:20:56.941684 UTC and ended around Fri Feb 1 20:34:01 EST 2013 (apologies for the mix in precision, time zone, and format). As explained below there are actually two tweets included in the dataset that were made before the search started; they have been left in since they are not going to significantly bias the results of any analysis I present.
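
To put the two time stamps on the same footing, they can both be converted to UTC. The following Python sketch does this using only the standard library; it assumes that "EST" means a fixed offset of five hours behind UTC, and the helper names parse_start and parse_end are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "EST" is a fixed UTC-5 offset (no daylight saving in winter).
EST = timezone(timedelta(hours=-5), "EST")

def parse_start(text):
    """Parse a stamp such as '2013-01-02 14:20:56.941684 UTC'."""
    stamp, zone = text.rsplit(" ", 1)
    if zone != "UTC":
        raise ValueError("expected a UTC time stamp")
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

def parse_end(text):
    """Parse a date-command style stamp such as 'Fri Feb 1 20:34:01 EST 2013'."""
    _weekday, month, day, clock, zone, year = text.split()
    if zone != "EST":
        raise ValueError("expected an EST time stamp")
    local = datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)

start = parse_start("2013-01-02 14:20:56.941684 UTC")
end = parse_end("Fri Feb 1 20:34:01 EST 2013")
duration = end - start  # a timedelta covering the whole run
```

With these inputs the end time corresponds to 2013-02-02 01:34:01 UTC, so the search ran for roughly 30 days and 11 hours.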


Since I used the Twitter Streaming API, the search continually produced results, which were written to disk as JSON (in previous versions I had tried converting them to Haskell data structures but to simplify things I did no processing in the search program). At intervals I would process all the matches, creating an RDF graph of the results, which was passed to a 4store instance. This instance was then queried using SPARQL to produce the results, as described in the Analysis section.
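
As a rough idea of what the JSON-to-RDF step involves, here is a minimal Python sketch that turns stored tweets into N-Triples lines. It is not the Haskell code used for the analysis: the example.org namespaces and property names are hypothetical, it assumes one JSON object per line on disk, and it only uses the id_str, text, user.screen_name and retweeted_status fields of a tweet.

```python
import json

# Hypothetical namespaces, used only for this illustration.
TWEET = "http://example.org/tweet/"
USER = "http://example.org/user/"
VOCAB = "http://example.org/vocab#"

def literal(text):
    """Encode a string as a quoted N-Triples literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"' + escaped + '"'

def tweet_to_triples(tweet):
    """Describe one tweet (a decoded JSON object) as N-Triples lines."""
    subject = f"<{TWEET}{tweet['id_str']}>"
    author = f"<{USER}{tweet['user']['screen_name']}>"
    triples = [
        f"{subject} <{VOCAB}author> {author} .",
        f"{subject} <{VOCAB}text> {literal(tweet['text'])} .",
    ]
    original = tweet.get("retweeted_status")
    if original is not None:
        # Link a retweet to the tweet it repeats.
        target = f"<{TWEET}{original['id_str']}>"
        triples.append(f"{subject} <{VOCAB}retweetOf> {target} .")
    return triples

def convert(lines):
    """Convert an iterable of JSON lines into a list of N-Triples lines."""
    out = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            out.extend(tweet_to_triples(json.loads(line)))
    return out
```

The resulting lines could then be loaded into a triple store and queried with SPARQL; keeping the conversion separate from the search program means it can be re-run over the stored JSON whenever the mapping changes.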

I used the 1.1 POST statuses/filter API for the search, using the track parameter. When transforming the JSON from Twitter, two types were processed: tweets and retweets (although you can see other message types, I didn't). The retweets include full information on the original tweet; as well as letting me link the two tweets together in the RDF, this let me find a few tweets which had not been matched by Twitter. Two of these are due to re-tweets of messages which were sent before the search was started.
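
The idea of recovering unmatched tweets from the retweets can be sketched as follows. This is a Python illustration rather than the code used for the analysis; it relies on the id_str, user.screen_name and retweeted_status fields of the stream messages, and the function names find_missing and group_by_author are made up for the example.

```python
def find_missing(messages):
    """Return original tweets that were only seen embedded in a retweet.

    messages is an iterable of decoded stream messages; the result maps
    tweet id to the embedded original tweet.
    """
    matched_ids = set()
    embedded = {}
    for message in messages:
        if "id_str" not in message:
            continue  # not a tweet (for example a delete or limit notice)
        matched_ids.add(message["id_str"])
        original = message.get("retweeted_status")
        if original is not None:
            embedded[original["id_str"]] = original
    # Keep only the originals that the search itself never returned.
    return {tid: tweet for tid, tweet in embedded.items()
            if tid not in matched_ids}

def group_by_author(tweets):
    """Group the ids of the given tweets by the author's screen name."""
    groups = {}
    for tweet in tweets.values():
        name = tweet["user"]["screen_name"]
        groups.setdefault(name, []).append(tweet["id_str"])
    return groups
```

Any original tweet that was sent before the search started will also show up in this list, so those need to be identified separately using their creation time.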

Missing tweets and the AAS

Overall, 30 "missing" tweets were found, so it is not a huge number, but they do indicate a phenomenon observed at this meeting (and one I have seen with previous AAS meetings): the AAS Twitter accounts do not seem to have achieved enough "twitter-cred" to be included in searches.

Below are the missing tweets, grouped by author, and excluding the two that were from before the main search started.

Note that all the posts from AAS_Office and AAS_Press were missed by the search (i.e. they were only found because they were retweeted). This means I will have missed any posts from these two accounts that were not retweeted.

AAS Executive Office
AAS Press Office

So, what should the AAS Twitter accounts do?

It looks like the AAS accounts need to improve their "Twitteriness", presumably by tweeting regularly outside the conference, including being involved in conversations (i.e. replying to and being replied to by other accounts), although this is a guess on my part. I wonder whether other scholarly societies see this (or have seen this)?


To write.


The data collection and analysis code is written in Haskell, using version 7.4.2 of the GHC Haskell compiler, and uses a bunch of packages from the Haskell package database (Hackage).

The visualizations presented on this web site use the d3.js JavaScript library to create groovy data-driven documents. I have also used Gephi and BioFabric to visualize and explore the user network (i.e. the hair ball and matrix views).

Last, but not least, thank you to all the astronomers who used Twitter to discuss the meeting, and to those who followed along.
