Reddit Data
Motivation
Reddit is a treasure trove of data. It is a platform where users can submit content, comment on posts, and vote on submissions. It is a great source of data for a variety of reasons.
Reddit API
There are a lot of tutorials on how to use the Reddit API. I will not go into the details of how to use the API, but I will share some tips that I learned.
The general idea is use praw to authenticate and then use the API to get the data. You need to create an app to get the credentials (“client_id” and “secret”).
The critical problem I met was that Reddit has the rate limit and the date of data is not far from today. The data I got is only about 1k posts which is the latest at the time of my request. My longitudinal study needs more data and the data should be from the past.
Project Arctic Shift
Luckily, I found a project called Project Arctic Shift which aims to archive all the posts and comments on Reddit. This Github repository is self-explanatory so I only mention several points that I think are important.
- Do not unzip the .zst file since it takes too much storage. Use code to read it and extract the data you need.
- If you are extracting the data and save in csv format in Python, be careful with the “selftext” field. It contains the main text of the post/comment. If it is too long, the file will be corrupted which create break lines in the rows.