Make your own Reddit Dataset
Antone Christianson-Galina
Co-Founder of Pathfinder Data Strategy. Vice Chair of the Barnes Housing Trust Fund.
I have had a lot of fun working with Reddit data. If you want to build your own Reddit dataset, all you need is Python. In this post, I'll explain how to use an API (a tool for retrieving data) to pull data from Reddit.
Harvesting data from Reddit using Python is straightforward with PRAW, a well-documented Python wrapper for the Reddit API: [https://praw.readthedocs.io/en/stable/getting_started/quick_start.html]. For documentation on how API calls work in general, you may consult the following: [https://tray.io/blog/how-do-apis-work].
To call the Reddit API, you must first set up a connection. The most straightforward way is to register an application with Reddit via OAuth. I registered a web application and stated “Research” as the purpose.
As part of this registration, I was granted a client id and a client secret. The following snippet shows how the connection is set up in Python:
import praw

reddit = praw.Reddit(client_id='xxxxx',
                     client_secret='xxxxx',
                     user_agent='xxxxx')
Then I needed to specify which thread to request from the API. Each Reddit thread has a unique key that you can pull from its URL. That key is used to create the object I named "submission":
submission = reddit.submission(id='xxxxx')  # 'xxxxx' is the key pulled from the thread URL
This uses PRAW to pull the data related to a thread.
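PRAW can also take the full thread URL instead of the key. A minimal sketch, using the reddit object from above and a made-up placeholder URL:

# Alternatively, pass the whole thread URL and let PRAW extract the key
submission = reddit.submission(url='https://www.reddit.com/r/example/comments/xxxxx/example_thread/')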
This returns an unstructured forest of comments. At the top is the first post in the thread, which can have multiple comments, each of which can in turn have further replies. Therefore, it was necessary to structure the data.
I sorted the comments to get the oldest first with the specification:
submission.comment_sort = 'old'
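Before iterating, the "load more comments" placeholders can be expanded so the full comment tree is available. A minimal sketch using PRAW's replace_more helper on the submission object from above:

# Expand the "load more comments" placeholders so every comment is fetched
submission.comments.replace_more(limit=0)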
Then, I created a for loop to iterate through each comment. For each comment, the loop harvests the time the comment was created and appends it to a list.
ltime = []
for comment in submission.comments.list():
    # created_utc is the comment's creation time as a Unix timestamp (UTC)
    ltime.append(comment.created_utc)
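Since created_utc is a Unix timestamp (seconds since the epoch, in UTC), one way to turn the collected values into readable datetimes is sketched below, assuming the ltime list built in the loop above:

from datetime import datetime, timezone

# Convert each Unix timestamp into a timezone-aware datetime object
times = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in ltime]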
In the for loop, I implemented either tokenisation or stemming, depending on the experiment.
Tokenisation simply turns a sentence string into a list of words. Treating text as an unordered collection of words like this is known in natural language processing as the "bag of words" technique. It allows for easy processing but completely ignores sentence structure.
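As a minimal sketch of the idea, using Python's built-in Counter and a simple whitespace split, with a made-up example string standing in for a comment body:

from collections import Counter

# Example text; in practice this would be something like comment.body
entry = 'the quick brown fox jumps over the lazy dog and the fox runs'

# Lowercase, split into tokens, and count them as a bag of words
bag = Counter(entry.lower().split())
print(bag.most_common(3))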
Stemming removes word endings. For example, “carefully” and “careful” both become the word “careful”. I used a Python implementation of the Porter stemmer for my code, developed by Martin Porter in 1979 and still maintained by him: [https://tartarus.org/martin/PorterStemmer/]
from nltk.stem import PorterStemmer  # assuming NLTK's implementation of the Porter stemmer

stemmer = PorterStemmer()
# Stem each whitespace-separated word in the entry and lowercase it
stems = [stemmer.stem(w).lower() for w in entry.split()]
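Tying it together, a sketch of stemming every comment in the thread, assuming the submission and stemmer objects created above and PRAW's comment.body attribute:

# Stem every word of every comment body in the thread
stemmed_comments = []
for comment in submission.comments.list():
    stemmed_comments.append([stemmer.stem(w).lower() for w in comment.body.split()])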