Make your own Reddit Dataset
Antone Christianson-Galina
Co-Founder of Pathfinder Data Strategy. Vice Chair of the Barnes Housing Trust Fund.
I have had a lot of fun working with Reddit data. If you want to build your own Reddit dataset, all you need is Python. In this post, I'll explain how to use an API (a tool for retrieving data) to pull data from Reddit.
Harvesting data from Reddit using Python is straightforward with PRAW, a well-documented Python wrapper for the Reddit API: [https://praw.readthedocs.io/en/stable/getting_started/quick_start.html]. For documentation on how API calls work in general, you may consult the following: [https://tray.io/blog/how-do-apis-work].
To call the Reddit API, you must first set up a connection. The most straightforward way is to register an application with Reddit via OAuth. I registered a web application and stated “Research” as the purpose.
As part of this registration, I was granted a client id and a client secret. The following snippet shows how the connection is set up in Python:
import praw

reddit = praw.Reddit(client_id='xxxxx',
                     client_secret='xxxxx',
                     user_agent='xxxxx')
Then I needed to specify which thread to request from the API. Each Reddit thread has a unique key that you can pull from its URL. That key is used to create the object I named "submission":
submission = reddit.submission(id='xxxxx')  # 'xxxxx' is the key pulled from the thread URL
This uses PRAW to pull the data related to a thread.
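PRAW can also take the full thread URL instead of the key. A minimal sketch, using the reddit object from above and a made-up placeholder URL:

# Alternatively, pass the whole thread URL and let PRAW extract the key
submission = reddit.submission(url='https://www.reddit.com/r/example/comments/xxxxx/example_thread/')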
This returns an unstructured forest of comments. At the top is the first post in the thread, which can have multiple comments, each of which can in turn have further replies. Therefore, it was necessary to structure the data.
I sorted the comments to get the oldest first with the specification:
submission.comment_sort = 'old'
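Before iterating, the "load more comments" placeholders can be expanded so the full comment tree is available. A minimal sketch using PRAW's replace_more helper on the submission object from above:

# Expand the "load more comments" placeholders so every comment is fetched
submission.comments.replace_more(limit=0)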
Then, I created a for loop to iterate through each comment. For each comment, the loop harvests the time the comment was created and appends it to a list.
ltime = []
for comment in submission.comments.list():
    # created_utc is the comment's creation time as a Unix timestamp (UTC)
    ltime.append(comment.created_utc)
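Since created_utc is a Unix timestamp (seconds since the epoch, in UTC), one way to turn the collected values into readable datetimes is sketched below, assuming the ltime list built in the loop above:

from datetime import datetime, timezone

# Convert each Unix timestamp into a timezone-aware datetime object
times = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in ltime]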
In the for loop, I implemented either tokenisation or stemming, depending on the experiment.
Tokenisation simply turns a sentence string into a list of words. Treating text as an unordered collection of words like this is known in natural language processing as the "bag of words" technique. It allows for easy processing but completely ignores sentence structure.
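As a minimal sketch of the idea, using Python's built-in Counter and a simple whitespace split, with a made-up example string standing in for a comment body:

from collections import Counter

# Example text; in practice this would be something like comment.body
entry = 'the quick brown fox jumps over the lazy dog and the fox runs'

# Lowercase, split into tokens, and count them as a bag of words
bag = Counter(entry.lower().split())
print(bag.most_common(3))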
Stemming removes word endings. For example, “carefully” and “careful” both become the word “careful”. I used a Python implementation of the Porter stemmer for my code, developed by Martin Porter in 1979 and still maintained by him: [https://tartarus.org/martin/PorterStemmer/]
from nltk.stem import PorterStemmer  # assuming NLTK's implementation of the Porter stemmer

stemmer = PorterStemmer()
# Stem each whitespace-separated word in the entry and lowercase it
stems = [stemmer.stem(w).lower() for w in entry.split()]
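Tying it together, a sketch of stemming every comment in the thread, assuming the submission and stemmer objects created above and PRAW's comment.body attribute:

# Stem every word of every comment body in the thread
stemmed_comments = []
for comment in submission.comments.list():
    stemmed_comments.append([stemmer.stem(w).lower() for w in comment.body.split()])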