Spooky Author identification - A machine learning competition by Kaggle
Spooky bus stop at Acadia National Park. Photo credit - Andy Thrasher, Public Domain, Flickr

This is a follow-up to my earlier article on Kaggle and machine learning. As most of you know, Kaggle runs machine learning competitions. One recent competition was called "Spooky Author Identification". It ended on December 15th and the results have been published. I entered this competition under the team name "Johnscreekers". My team placed 41st out of the 1245 teams that competed - not bad for a first-time entry!

The goal of this competition was to create a machine learning model that predicts the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft. We were given several thousand lines of sample text from the books of each of these authors, labeled with the author's name, and were expected to train our model on that sample text. We were also given a few thousand lines of unlabeled text to use as test data once the model was trained.

So first, we train the machine learning model on the sample text. Then we run the trained model on the test data and submit the resulting matrix of predictions, one row per test excerpt, to Kaggle. Kaggle then scores the submission against the true authors, which are hidden from the participants.
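The train-then-predict workflow described above can be sketched in a few lines of scikit-learn. This is only a minimal illustration, not the model I entered: the sentences are invented placeholders rather than competition data, the author codes `EAP`, `MWS` and `HPL` are assumed labels, and TF-IDF with naive Bayes is just one simple choice of technique.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder sentences, NOT taken from the competition data.
train_texts = [
    "the raven tapped at my chamber door",
    "the pendulum swung above the pit",
    "the creature fled across the frozen ice",
    "my creator abandoned me in the laboratory",
    "the ancient cult worshipped a nameless horror",
    "strange geometry rose from the sunken city",
]
# Assumed author codes: Poe, Shelley, Lovecraft.
train_authors = ["EAP", "EAP", "MWS", "MWS", "HPL", "HPL"]

# Unlabeled excerpts standing in for the test data.
test_texts = [
    "a raven sat above the door",
    "the cult awaited the sunken horror",
]

# Step 1: train the model on the labeled sample text.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_authors)

# Step 2: run the trained model on the test text.
# The result has one row per excerpt and one probability column per author.
results = model.predict_proba(test_texts)
authors = list(model.classes_)
predicted = [authors[row.argmax()] for row in results]

for text, row, guess in zip(test_texts, results, predicted):
    print(guess, dict(zip(authors, row.round(3))), "<-", text)
```

The `results` array here plays the role of the results matrix: it would be written to a file and uploaded for scoring.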

Participants could code in the language of their choice. Most people used either Python or R. I used Python, because I am quite comfortable with it and it has extensive machine learning support. I used the following Python libraries -

  • nltk - This is a great library for natural language processing tasks and is amazingly easy to use. It has dozens of functions, such as tokenizers and stemmers, that take care of most natural language processing needs. It is well documented.
  • scikit-learn (also known as sklearn) - This library provides the foundation for machine learning in Python. It has a vast collection of algorithms for classification, regression, clustering and feature extraction. It is a really impressive library and makes machine learning in Python a joy.
  • keras - As you know, neural networks are increasingly the tool of choice for machine learning. You might also have noticed that Google's TensorFlow is fast becoming the de facto standard framework for neural networks. But TensorFlow takes a bit of getting used to and is not exactly easy to get started with. This is where Keras comes in. Keras is a high-level neural networks API that runs on top of TensorFlow. It makes using TensorFlow a breeze, which makes it very handy for any project that involves neural networks.
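To give a flavor of the kind of text preparation nltk makes easy, here is a small sketch that splits a sentence into tokens and reduces each word to its stem. The sentence is an invented example, and I am not claiming these exact steps were part of my competition model. `wordpunct_tokenize` and `PorterStemmer` are used because they work without downloading any extra nltk data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Invented example sentence, not taken from the competition data.
sentence = "It was a dark, stormy night."

# Split the sentence into word and punctuation tokens.
tokens = wordpunct_tokenize(sentence)

# Keep only alphabetic tokens and reduce each one to its stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

print(tokens)
print(stems)
```

Stems like these can then be fed to a scikit-learn vectorizer in place of the raw words.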

Before entering the competition, I thought I would need to spend 2-3 hours a day, but I ended up spending more like 6-8 hours a day. Still, I think it was time well spent. I thoroughly enjoyed taking part in this competition and learned a lot about machine learning. I will post the source code for my model on GitHub in a few days and update this article with the link.

Once again, I encourage those of you who are interested in machine learning to check out the Kaggle website.

Deepal Jayagoda

Sr. Staff Architect, GE Digital, Atlanta

7 years ago

Well done...congrats


More articles by Ramakrishna Nemani

  • Number systems and their Bases

    The topic of 'Number systems and their bases' has always fascinated me. Over the weekend, while looking for something…

  • Kaggle and Machine Learning

    I first came across Kaggle about two years ago, while searching for some datasets. I had really liked it at that time.