Make your own Reddit Dataset

I have had a lot of fun working with Reddit data. If you want to build your own Reddit data set, all you need is Python. In this post, I'll explain how you can use an API (a tool for retrieving data) to pull data from Reddit.

Harvesting data from Reddit using Python is straightforward thanks to its well-documented API and PRAW, the Python Reddit API Wrapper: https://praw.readthedocs.io/en/stable/getting_started/quick_start.html. For background on how API calls work in general, you may consult the following: https://tray.io/blog/how-do-apis-work.

To call the Reddit API, you must first set up a connection. The most straightforward way is to register an application with Reddit via OAuth. I registered a web application and stated “Research” as the purpose.

As part of this registration, I was granted a client id and a client secret. The following snippet shows how the connection is set up in Python:

import praw

reddit = praw.Reddit(client_id='xxxxx',
                     client_secret='xxxxx',
                     user_agent='xxxxx')
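With just a client id, client secret, and user agent, PRAW runs in read-only mode, which is enough for harvesting public threads. As a quick sanity check (my addition, not part of the original post), you can print the read_only flag:

# True when PRAW has only application credentials and no user login,
# which is all that is needed to read public submissions and comments.
print(reddit.read_only)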

Then I needed to tell the API which thread to fetch. Each Reddit thread has a unique key that you can pull from its URL. That key is what I pass to PRAW to get the object I named "submission":

submission = reddit.submission(id='<key from the URL>')

This uses the PRAW API to pull the data related to a thread.
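If you would rather not copy the key out of the URL by hand, PRAW can also resolve a submission directly from the full thread URL. A minimal sketch (the URL below is only a placeholder):

# Equivalent to passing the key: PRAW extracts the thread ID from the URL itself.
submission = reddit.submission(url='https://www.reddit.com/r/<subreddit>/comments/<key>/<title>/')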

The submission's comments come back as an unstructured forest: at the top is the first post in the thread, which can have multiple comments, each of which can in turn have further comments. Therefore, it was necessary to structure the data.

I sorted the comments so that the oldest come first with this setting:

submission.comment_sort = 'old'
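Because the comment forest described above can contain "load more comments" stubs in long threads, it helps to expand them before iterating. This step is not in the original post, but a minimal sketch would be:

# Replace every "load more comments" stub so that comments.list() below
# returns real Comment objects; for large threads this may take many API requests.
submission.comments.replace_more(limit=None)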

Then, I created a for loop to iterate through the comments. For each comment, it harvests the time the comment was created and appends it to a list.

ltime = []  # UTC creation timestamp for each comment

for comment in submission.comments.list():
    ltime.append(comment.created_utc)
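The created_utc values are raw epoch seconds. For later analysis it can help to turn them into readable datetimes; this conversion is my addition rather than part of the original post:

from datetime import datetime, timezone

# Convert each Unix timestamp into a timezone-aware datetime,
# which is easier to plot or group by day.
timestamps = [datetime.fromtimestamp(t, tz=timezone.utc) for t in ltime]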

Inside the for loop, I applied either tokenisation or stemming, depending on the experiment.

Tokenisation simply turns each sentence string into a list of words. Treating the text as an unordered collection of those words is what natural language processing calls the "bag of words" approach: it allows for easy processing but completely ignores sentence structure.
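The post does not show the exact tokeniser used, so the snippet below is only a sketch: a hypothetical tokenise() helper that lowercases a comment and splits it into word tokens with a regular expression:

import re

def tokenise(text):
    # Lowercase the text and keep runs of letters (and apostrophes),
    # returning a plain list of tokens - the "bag of words".
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenise("This comment was tokenised carefully.")
# -> ['this', 'comment', 'was', 'tokenised', 'carefully']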

Stemming removes word endings so that related forms collapse to a common stem. For example, "careful", "cared", and "caring" all reduce to the stem "care". For my code I used a Python implementation of the Porter stemmer, an algorithm developed by Martin Porter in 1979 and still maintained by him: https://tartarus.org/martin/PorterStemmer/

from nltk.stem import PorterStemmer  # assuming NLTK's Porter stemmer, whose stem() call matches this usage

stemmer = PorterStemmer()
stems = []
for w in words(entry):  # words(entry) tokenises one comment's text, as described above
    stems.append(stemmer.stem(w).lower())
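Putting the pieces together, the per-comment processing inside the loop might look like the sketch below; it reuses the hypothetical tokenise() helper from above and is not the author's exact code:

processed = []  # one bag of (stemmed) words per comment
for comment in submission.comments.list():
    tokens = tokenise(comment.body)             # tokenisation experiment
    stems = [stemmer.stem(w) for w in tokens]   # stemming experiment
    processed.append(stems)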
