Tokenizer Speed Testing With Bling Fire

Recently, engineers at Microsoft released the Bling Fire tokenizer and published it on GitHub. The tokenizer was created by the team behind the Bing search engine, and it is significantly faster than other popular tokenizers, which makes it very useful for natural language processing. It is also easy to install: if you already have a Python environment on your system, you can simply run pip install blingfire.

If you don't have Python installed, Anaconda is probably the easiest way to set up Python on Windows. After reading about the Bling Fire release on Hacker News, I decided to test the tokenizer myself to see how my results compared to the published benchmarks, using Keras, spaCy, and NLTK for comparison.

My main conclusion was that Bling Fire does appear to be about 20 times as fast as the NLTK tokenizer. I also compared Bling Fire to spaCy and the Keras tokenizer, but I may have been using those tokenizers incorrectly, so those results are less conclusive.

spaCy took more time to process the text than NLTK, which indicates my script was doing more than tokenization; by default, spaCy runs its full pipeline, including part-of-speech tagging and parsing, on every document. The Bing engineers already compared Bling Fire to spaCy and found Bling Fire about 10 times faster in their tests, and that is a better benchmark for comparison purposes.
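One way to isolate just spaCy's tokenizer, and avoid timing the tagging and parsing work, is a blank pipeline. This is a sketch, assuming spaCy is installed (spacy.blank builds a pipeline containing only a tokenizer, so no model download is needed):

```python
# spacy.blank("en") creates a tokenizer-only pipeline, skipping the
# tagging and parsing that the full pipeline performs on each call.
import spacy

nlp = spacy.blank("en")
doc = nlp("Tokenize this sentence, nothing more.")
tokens = [token.text for token in doc]
print(tokens)
```

Timing this call instead of a full model's nlp() call would give a fairer tokenizer-to-tokenizer comparison.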

As for Keras, my tests showed that Bling Fire and Keras frequently took about the same amount of time to tokenize a text, although Keras took twice as long on some documents. I used the Keras text_to_word_sequence function to run these experiments. Again, this result is inconclusive, and the Bing engineers would be better placed to perform a rigorous comparison between Bling Fire and Keras.
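The Keras call is also a one-liner. A sketch with a sample sentence of mine; note that text_to_word_sequence lowercases the text and strips punctuation by default, so it does somewhat less work than the other tokenizers, and newer Keras releases have moved or removed this helper:

```python
# Keras tokenization: text_to_word_sequence lowercases the input and
# strips punctuation by default, returning a plain list of words.
from keras.preprocessing.text import text_to_word_sequence

tokens = text_to_word_sequence("Bling Fire and Keras took similar times.")
print(tokens)
```

Because punctuation is discarded rather than emitted as tokens, the token counts are not directly comparable to NLTK or Bling Fire.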

To perform these comparisons, I had to learn how to measure elapsed time in milliseconds using the datetime module. Simply comparing how many seconds each script took to run would not have shown any differences unless the documents were extremely long or there were a large number of them.
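The timing itself can be sketched with a small stdlib-only helper (the helper name and sample input are mine); subtracting two datetime values yields a timedelta, which converts cleanly to milliseconds:

```python
from datetime import datetime

def time_ms(fn, *args):
    """Run fn(*args) once and return (result, elapsed milliseconds)."""
    start = datetime.now()
    result = fn(*args)
    delta = datetime.now() - start
    return result, delta.total_seconds() * 1000.0

# Example with a trivial "tokenizer": str.split on whitespace.
tokens, elapsed = time_ms(str.split, "a short benchmark sentence")
print(f"{len(tokens)} tokens in {elapsed:.3f} ms")
```

For a steadier measurement, each tokenizer call could be wrapped this way and run several times, keeping the minimum.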

This result shows that NLTK is still very useful for personal projects, and for larger projects the NLTK tokenizer is probably not the bottleneck slowing down the script. The speed difference matters more for programs that need to search through a huge number of documents, such as a search engine.

Beyond speed, the Bling Fire results differ slightly from the NLTK results. According to the GitHub page, two of the biggest differences are that Bling Fire splits up hyphenated words, and that it can split a file path into individual tokens, one for each directory in the path.

When I tested this with spaCy, it also split up hyphenated words, giving the hyphen its own token, but it returned an error when I included a file path in the input. Keras split the hyphenated word and removed the hyphen, and it also returned an error when a file path was included. So Bling Fire may have advantages over other tokenizers beyond processing speed.

I also uploaded the script I used for these experiments to my GitLab portfolio. The project is called Tokenizer Speedtest, and it contains only this script. To use it, open the script in a development environment such as Spyder and run it, then paste a text article into the Python console to see how long each tokenization function takes. Spyder comes with Anaconda, so it requires no additional installation effort.
