2. Collecting data
These are basic directions to collect and preprocess data so you can do something awesome with them. These directions aren’t comprehensive; you may need to consult other sources to fill in some gaps.
Note
You may not need to do this. If you have collaborators who are already collecting data, you should probably use their data instead.
2.1. Wikipedia
Use the scripts wp-get-dumps and wp-get-access-logs. No authentication is needed, but you may wish to communicate with Wikimedia and/or mirror admins if you are planning a large download.
These scripts use rsync, so setting the environment variable RSYNC_PROXY may be needed depending on your firewall.
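As a minimal sketch of the proxy setting, rsync reads RSYNC_PROXY as a host:port pair and uses it for connections to rsync daemons. The proxy host and port below are placeholders, not values from this project; substitute whatever your site provides.

```shell
# Hypothetical proxy host and port; replace with your site's web proxy.
export RSYNC_PROXY="proxy.example.com:3128"

# Confirm what child processes (such as the wp-get-* scripts) will inherit.
echo "rsync daemon connections will go through: $RSYNC_PROXY"
```

Set the variable in the same shell session (or shell profile) from which you run the download scripts, so that rsync inherits it.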
2.2. Twitter
These instructions will help you collect and archive tweets as they appear in the Streaming API. QUAC currently cannot acquire past tweets.
2.2.1. Set up authentication
You need both a user account and an application, as well as four different authentication parameters, to access the streaming API using OAuth.
- Create an account on Twitter to use for collection. (I suggest you do not use your normal Twitter account, if you have one.)
- Create a Twitter application (https://dev.twitter.com/apps; sign in as the user above).
- Click Create my access token.
- The four authentication parameters are on the Details tab (you may need to reload it after the above step):
  - consumer key
  - consumer secret
  - access token
  - access secret
2.2.2. Run the collector
Create directories to hold the collected tweets (e.g., tweets) and your configuration and logs (e.g., config).
In config, create a file sample.cfg; look through the options in default.cfg and add to sample.cfg the ones that need to be customized.
Warning
Because this file will contain authentication secrets, ensure that it has appropriate permissions.
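One way to set up the directories and restrict access to the configuration file is sketched below. The BASE_DIR variable is a hypothetical name introduced only for this illustration, and owner-only access (modes 700 and 600) is an assumption about what "appropriate" means on a shared machine; adjust to your own policy.

```shell
# BASE_DIR is a hypothetical variable for this sketch; it falls back to a
# scratch directory so the commands are safe to try out.
base="${BASE_DIR:-$(mktemp -d)}"

# Directories for collected tweets and for configuration and logs.
mkdir -p "$base/tweets" "$base/config"

# Create the config file, then make it readable and writable by its owner only.
touch "$base/config/sample.cfg"
chmod 700 "$base/config"
chmod 600 "$base/config/sample.cfg"

# Show the resulting permissions.
ls -l "$base/config/sample.cfg"
```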
Run the collector for a while, e.g.:
$ collect --verbose --config /path/to/sample.cfg
(Type Control-C to stop.)
2.2.3. Build the TSV files
$ $QUACBASE/misc/parse.sh 1 /path/to/tweets
2.2.4. Doing it seriously
The above will get you a few tweets to play with. If you want to actually collect tweets in a serious and reliable way (i.e., without gaps):
Run collect with the --daemon option, and set up logcheck to watch the log files and e-mail you if something goes wrong.
Set up a cron job to build the TSVs regularly, e.g.:
27 3 * * * nice bash -l -c '$QUACBASE/misc/parse.sh 4 /path/to/tweets >> /path/to/logs/parse.log'
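For readers less familiar with cron, the sketch below restates the entry above with its schedule fields annotated. The five leading fields are minute, hour, day of month, month, and day of week, so this entry runs every day at 03:27. The single quotes leave $QUACBASE unexpanded until the login shell started by bash -l runs the command, which assumes the variable is set in your login profile. The entry variable is a name used only in this illustration.

```shell
# Cron fields: minute hour day-of-month month day-of-week, then the command.
# "27 3 * * *" means every day at 03:27.
# The backslash keeps $QUACBASE literal so the login shell expands it later.
entry="27 3 * * * nice bash -l -c '\$QUACBASE/misc/parse.sh 4 /path/to/tweets >> /path/to/logs/parse.log'"

# Print the entry for review.
printf '%s\n' "$entry"

# To append it to your existing crontab after reviewing it:
#   (crontab -l 2>/dev/null; printf '%s\n' "$entry") | crontab -
```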