Data Collection

This page really needs one of those old-school "under construction" images (and, for this weekend, an "under snow" one too).

A warning

The first thing to note is that Twitter does not guarantee to return all matching Tweets, whatever method is used (unless you have direct access to their "fire hose", which I don't). There is no public information, to my knowledge, about exactly how Twitter determines which tweets to include or exclude; the best I have found is the documentation on Twitter Search Best Practices. This filtering should be taken into account when looking at the analysis presented here. You can see its effect in the comparison I made (for AAS221) between my values for the most-popular retweets and those from Twitter; my retweet counts are up to 10 percent lower.

What is a sensible search?

A few years ago there was discussion amongst the attendees about whether to use #AAS or #AAS<meeting number>, but fortunately the AAS have been promoting the use of the latter for the last few meetings. This makes data collection easier (assuming people read the tweets of the AAS Office!). As I also wanted to follow the participation in the AAS hack day and the new effort to track poster comments, I decided to track the following four terms: aas223, aas and 223, hackAAS, and AASviz. As of Saturday, Jan 4, I added a fifth: astroamb2014. Note that case does not matter, that I am not actually searching on the hash tags, and that having the second term (aas and 223) means that I would select a tweet saying "The 223rd meeting of the AAS was awesome!". An open question is how much irrelevant material was returned by this approach; anecdotal evidence suggests the rate is low but I should try to quantify this.
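To make the matching rules above concrete, here is a rough Python sketch (not the Haskell code used for the actual search). It assumes that a term containing a space, such as "aas 223", matches when every word is present, and that matching is a plain case-insensitive substring test. That is an approximation; Twitter's own tokenization may be stricter. The names TRACK_PHRASES and matches_track are made up for this illustration.

```python
# Approximation of the matching described above; not the author's Haskell code.
TRACK_PHRASES = ["aas223", "aas 223", "hackAAS", "AASviz", "astroamb2014"]


def matches_track(text, phrases=TRACK_PHRASES):
    """Return True if any phrase has all of its words present in the text.

    Assumption: plain case-insensitive substring matching, which is looser
    than whatever tokenization Twitter actually applies.
    """
    lowered = text.lower()
    return any(
        all(word in lowered for word in phrase.lower().split())
        for phrase in phrases
    )
```

With these assumptions, matches_track("The 223rd meeting of the AAS was awesome!") is true because of the "aas 223" phrase, even though the text contains no hash tag.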

As an aside, the use of #AAS for the meetings, although having the advantage of saving 3 characters, can lead to a lot of noise due to unrelated tweets. I did not try it this year but it has been the case in previous years.

For users following along with a meeting, I would suggest using the search #aas<meeting number> -RT to hide re-tweets, since they form a significant fraction of the volume. As I am interested in seeing what is re-tweeted, and by whom, I've left these in.
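If you collect everything and want to separate out the re-tweets afterwards, something like the following Python sketch would do it. It is an illustration, not part of my code: it assumes each tweet is a dict decoded from the API's JSON, that native re-tweets carry a "retweeted_status" entry, and that manual ones start with "RT @". The function names are invented for this example.

```python
def is_retweet(tweet):
    """Guess whether a tweet (a dict decoded from the API's JSON) is a re-tweet.

    Assumption: native re-tweets carry a "retweeted_status" object and
    manual ones conventionally start with "RT @".
    """
    if "retweeted_status" in tweet:
        return True
    return tweet.get("text", "").lstrip().startswith("RT @")


def retweet_fraction(tweets):
    """Fraction of tweets that look like re-tweets (0.0 for an empty input)."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    return sum(is_retweet(t) for t in tweets) / len(tweets)
```

Calling retweet_fraction on the collected tweets gives a quick estimate of how much of the volume the -RT filter would hide.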

How did I search?

As with the AAS 221 meeting, my primary analysis was based on a search using the Twitter Streaming API. The code to do this is available at my astrosearch bitbucket account. The process requires running the astroserver, which acts as the database, and then astrosearch, which deals with the Twitter search. Building it is likely to be non-trivial since it is written in Haskell, which most Astronomers will not have installed on their system, and it depends on several patched modules, which are listed in the README. In previous searches this approach was not robust; rather than fixing the issues I ended up treating the search service as a daemon which would be automatically restarted whenever it crashed.
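The "restart it whenever it crashes" idea can be sketched in a few lines of Python. This is only an illustration of the approach, not the mechanism I actually used; the function name, the max_restarts cap, and the delay are all assumptions made for the example.

```python
import subprocess
import time


def run_with_restarts(cmd, max_restarts=None, delay=5.0):
    """Run cmd, restarting it each time it exits with a non-zero status.

    Returns the number of restarts performed. max_restarts=None means
    keep restarting for as long as the command keeps failing.
    """
    restarts = 0
    while True:
        status = subprocess.call(cmd)
        if status == 0:
            return restarts  # clean exit: stop supervising
        if max_restarts is not None and restarts >= max_restarts:
            return restarts  # give up once the cap is reached
        restarts += 1
        time.sleep(delay)  # brief pause so a crash loop does not spin


# Hypothetical usage; the real astrosearch command line may need arguments:
# run_with_restarts(["astrosearch"])
```

A cap on the number of restarts is optional, but it makes the loop easier to test and stops a permanently broken service from restarting for ever.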

Unlike previous meetings, I did not run a search using the Twitter REST API with my grabtweets code, since it was not needed.

I did not use the Archivist site since this service seems to have evolved since I last used it, and didn't offer something that met my needs.

When was the search run?

The search started at 2014-01-03 14:05 UTC.


The data collection and analysis code is written in Haskell, using version 7.6.3 of the ghc Haskell compiler, and uses a number of packages from the Haskell package database (hackage).

The visualizations presented on this web site use the d3.js Javascript library to create groovy data-driven documents. I have also used Gephi and BioFabric to visualize and explore the user network (i.e. the hair ball and matrix views).

Last, but not least, thank you to all the Astronomers who use Twitter to discuss the meeting, and those who followed along.
