2. Collecting data

These are basic directions to collect and preprocess data so you can do something awesome with them. These directions aren’t comprehensive; you may need to consult other sources to fill in some gaps.

Note

You may not need to do this. If you have collaborators who are already collecting data, you probably should not do it yourself; just use their data.

2.1. Wikipedia

Use the scripts wp-get-dumps and wp-get-access-logs. No authentication is needed, but you may wish to communicate with Wikimedia and/or mirror admins if you are planning a large download.

These scripts use rsync; depending on your firewall, you may need to set the RSYNC_PROXY environment variable.
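
For example (the proxy host and port below are hypothetical; substitute whatever your site requires):

```shell
# Point rsync at the firewall proxy before running the download
# scripts. Host and port are placeholder examples.
export RSYNC_PROXY=proxy.example.com:8080

# Then fetch the data as usual:
# wp-get-dumps
# wp-get-access-logs
```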

2.2. Twitter

These instructions will help you collect and archive tweets as they appear in the Streaming API. QUAC currently cannot acquire past tweets.

2.2.1. Set up authentication

You need both a user account and an application, as well as four different authentication parameters, to access the streaming API using OAuth.

  1. Create an account on Twitter to use for collection. (I suggest you do not use your normal Twitter account, if you have one.)
  2. Create a Twitter application (https://dev.twitter.com/apps; sign in as the user above).
  3. Click Create my access token.
  4. The four authentication parameters are on the Details tab (you may need to reload it after the above step).
    • consumer key
    • consumer secret
    • access token
    • access secret

2.2.2. Run the collector

  1. Create directories to hold the collected tweets (e.g., tweets) and your configuration and logs (e.g., config).

  2. In config, create a file sample.cfg; look through the options in default.cfg and copy into sample.cfg the ones you need to customize.

    Warning

    Because this file will contain authentication secrets, ensure that it has appropriate permissions.

  3. Run the collector for a while, e.g.:

    $ collect --verbose --config /path/to/sample.cfg
    

    (Type Control-C to stop.)
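
Steps 1 and 2 above might look like the following sketch. The option names and values in sample.cfg are placeholders; copy the real ones from default.cfg.

```shell
mkdir -p tweets config

# Placeholder option names and values -- take the real ones you need
# to override from default.cfg.
cat > config/sample.cfg <<'EOF'
consumer_key = XXXXXXXX
consumer_secret = XXXXXXXX
access_token = XXXXXXXX
access_secret = XXXXXXXX
EOF

# The file holds authentication secrets: restrict it to the owner.
chmod 600 config/sample.cfg
```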

2.2.3. Build the TSV files

$ $QUACBASE/misc/parse.sh 1 /path/to/tweets

2.2.4. Doing it seriously

The above will get you a few tweets to play with. If you want to actually collect tweets in a serious and reliable way (i.e., without gaps):

  1. Run collect with the --daemon option, and set up logcheck to watch the log files and e-mail you if something goes wrong.

  2. Set up a cron job to build the TSVs regularly, e.g.:

    27 3 * * *  nice bash -l -c '$QUACBASE/misc/parse.sh 4 /path/to/tweets >> /path/to/logs/parse.log'
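
One way to set up the cron job, sketched with the placeholder paths from the example above: write the entry to a file so you can review it before installing it with crontab.

```shell
# Write the cron entry to a file for review before installing.
# QUACBASE and the tweet/log paths are placeholders.
cat > cronjobs <<'EOF'
27 3 * * *  nice bash -l -c '$QUACBASE/misc/parse.sh 4 /path/to/tweets >> /path/to/logs/parse.log'
EOF

# Once it looks right, install it for the current user:
# crontab cronjobs
```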