Data Collection

The first thing to note is that Twitter does not guarantee to return all matching Tweets, whatever method is used (unless you have direct access to their "firehose", which I don't). There is no public information, to my knowledge, about exactly how Twitter decides which Tweets to include or exclude; the best I have found is the documentation on Twitter Search Best Practises. This filtering should be taken into account when looking at the analysis presented here.

The second important note is that the AAS declared, prior to the meeting, that the official hashtag was #aas219, so I did not try to track other tags (such as #aas). The AAS also provided a page of links, blogs, tweets, and other resources about the meeting, which you should go and look at.

I used three techniques to get the Tweets:

  1. I set up an archive using the Archivist on-line web site. This should give the same results as my REST-based search (method 2), since the volume of Tweets was probably not high enough for the site's collection strategy to miss any.

    This site provides basic analysis of the Tweets, some of which I repeat, but you cannot get access to the raw data due to the Twitter terms of service.

  2. A search for the tag #aas219 (case insensitive) using the Twitter REST API. The search was periodic (normally every two hours, but less frequently outside office hours), and is the same approach used in my previous studies (AAS 218 and AAS 215).

    The code to do this is available at my grabtweets bitbucket account; the program is grabtweets and the version used was changeset c75dd4d1bbe4.

    One thing to note with the REST API is that you can suddenly get a group of tweets sent to you that you already have; this appears to be because Twitter is sending updated metadata for these tweets, but I haven't investigated the reasons why.

    An improvement over the code used during AAS 218 is that the REST API now supports the include_entities option (or I just wasn't aware of it in the summer of 2011), which adds extra information to each tweet to indicate who is referenced in it, the hashtags used, and the URLs included. This simplifies some down-stream processing.

  3. I used the Twitter Streaming API to search for the term aas219 (again case insensitive), using the track method. I was interested to see if it returned different results (it does), and whether the extra information it provided over the REST API was useful (yes, but not essential).

    The code to do this is available at my astrosearch bitbucket account. The process requires running the astroserver, which acts as the database, and then astrosearch, which deals with the Twitter search. The revision used was changeset a07b88b1b044, although you would also need your own Twitter OAuth credentials and would have to build against my fork of the twitter-enumerator package.

    Unfortunately the program wasn't completely robust and crashed several times; one of these crashes resulted in no collection over the Wednesday evening and Thursday morning of the meeting (as indicated by the red regions in the Twitter frequency graph).

    The Streaming API provides more information, per Tweet, than the REST API; the most interesting for this analysis is an indication of the number of followers of a user and whether a message was actually a Re-Tweet of another tweet, which enables the separation in the histograms seen in the frequency graph. The determination of whether a particular message is a re-tweet or not appears to be down to whether the client indicates this rather than some clever processing by Twitter; this means that the re-tweet values - e.g. as shown in the frequency graph - are a lower limit.
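
To show why the re-tweet counts are a lower limit, here is a minimal Python sketch (an illustration only, not the Haskell code used for this analysis). It assumes each Streaming API message has been decoded into a dictionary, that a client-flagged re-tweet carries a retweeted_status field, and that a "manual" re-tweet is one whose text starts with "RT @". The function names and the sample messages are made up for the example.

```python
def is_flagged_retweet(tweet):
    """True when the client marked the message as a re-tweet.

    Assumption: flagged re-tweets carry a 'retweeted_status' field.
    """
    return "retweeted_status" in tweet


def looks_like_manual_retweet(tweet):
    """Heuristic: unflagged message whose text starts with 'RT @'."""
    text = tweet.get("text", "").lstrip()
    return not is_flagged_retweet(tweet) and text.upper().startswith("RT @")


def count_retweets(tweets):
    """Return (flagged, manual) re-tweet counts for a list of messages."""
    flagged = sum(1 for t in tweets if is_flagged_retweet(t))
    manual = sum(1 for t in tweets if looks_like_manual_retweet(t))
    return flagged, manual


# Hypothetical messages, invented for this example.
sample = [
    {"id": 1, "text": "Great talk at #aas219"},
    {"id": 2, "text": "RT @someone: Great talk at #aas219",
     "retweeted_status": {"id": 1}},
    {"id": 3, "text": "RT @someone: Great talk at #aas219 (agreed!)"},
]

flagged, manual = count_retweets(sample)
print(flagged, manual)
```

Only the flagged count corresponds to what the Streaming API reports; messages like the third one are missed, which is why the flagged value underestimates the true number of re-tweets.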

How different are the REST and Streaming API results?

I expected to get more matches with the Streaming API than the REST API because:

  1. The Twitter documentation points out that the streaming API does not apply as much filtering and ranking as the search API.

  2. The search terms aren't the same; the streaming API search looks for the text aas219 whereas the REST API search uses the hashtag #aas219 (both case insensitive).
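
The difference between the two search terms can be sketched in Python as follows. This is only an approximation for illustration: the word-splitting rule and the regular expression are my assumptions, not Twitter's actual matching rules, and the function names are hypothetical.

```python
import re

# Assumption: a hashtag match needs '#aas219' not embedded in a longer word.
HASHTAG = re.compile(r"(?<!\w)#aas219(?!\w)", re.IGNORECASE)


def matches_hashtag(text):
    """Approximates the REST search for the hashtag #aas219."""
    return HASHTAG.search(text) is not None


def matches_term(text):
    """Approximates the streaming 'track' search for the term aas219.

    The text is split into lower-cased words and compared word by word.
    """
    return "aas219" in re.findall(r"\w+", text.lower())


# Invented example messages.
for text in ["Off to #AAS219!", "AAS219 starts tomorrow"]:
    print(text, matches_hashtag(text), matches_term(text))
```

Under these assumptions every message that matches the hashtag also matches the plain term, but not the other way round, so the term search can only return the same number of messages or more.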

As a rough guide, for the dataset I have amassed, I find 6453 Tweets in total, 6178 from the Streaming API and 4119 from the REST API. As the two searches didn't cover the same time ranges they can't be directly compared.
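
The overlap implied by these three numbers can be worked out with a little set arithmetic, shown here as a short Python sketch. It assumes that the total counts each distinct Tweet once (that is, it is the size of the union of the two sets); the variable names are mine.

```python
total = 6453      # distinct Tweets overall (assumed to be the union)
streaming = 6178  # Tweets seen by the Streaming API search
rest = 4119       # Tweets seen by the REST API search

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
both = streaming + rest - total
streaming_only = total - rest
rest_only = total - streaming

print(both, streaming_only, rest_only)  # 3844 2334 275
```

These values describe only how the collected sets overlap; because of the differing time coverage they say nothing about how complete either API is.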

Merging the results

The two data sets were merged based on the id values given by Twitter to users and tweets. The data is stored in a Resource Description Framework (RDF) graph, which is then queried to extract data. The use of RDF as the data representation supports more flexibility than provided by a traditional relational database, but nothing presented here actually relies on this extra power.
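
As a rough illustration of merging on the id value, here is a Python sketch (not the Haskell code used here). Records with the same id are combined, and the result is flattened into subject/predicate/object triples of the kind an RDF graph stores. The field names, the "tweet:" prefix, and the sample records are hypothetical and do not reflect the vocabulary actually used.

```python
def merge_by_id(*sources):
    """Combine records from several sources, keyed on the Twitter id.

    Fields from later sources overwrite those from earlier ones.
    """
    merged = {}
    for source in sources:
        for tweet in source:
            merged.setdefault(tweet["id"], {}).update(tweet)
    return merged


def to_triples(merged):
    """Flatten merged records into (subject, predicate, object) triples."""
    triples = []
    for tweet_id, record in sorted(merged.items()):
        subject = f"tweet:{tweet_id}"  # hypothetical naming scheme
        for key, value in sorted(record.items()):
            if key != "id":
                triples.append((subject, key, value))
    return triples


# Invented records standing in for the two collections.
rest_tweets = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
stream_tweets = [{"id": 2, "text": "b", "followers": 10},
                 {"id": 3, "text": "c"}]

merged = merge_by_id(rest_tweets, stream_tweets)
triples = to_triples(merged)
print(len(merged), len(triples))
```

A Tweet found by both searches ends up as a single record that carries the extra Streaming API fields, which is the behaviour wanted from the merge.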

Credits

Thanks to the people who provide the archivist store.

The data collection and analysis code is mostly written in Haskell, using version 7.0.4 of the GHC Haskell compiler, and uses a number of packages from the Haskell package database (Hackage).

The visualizations presented on this web site use the d3.js JavaScript library to create groovy data-driven documents. I plan to do some more analysis, using Gephi, but I would not hold your breath for this.

Last, but not least, thank you to all the astronomers who used Twitter to discuss the meeting, and to those who followed along.
