Tokenizer Speed Testing With Bling Fire

Recently, engineers at Microsoft released the Bling Fire tokenizer and published it on GitHub. The tokenizer was created by the team behind the Bing search engine, and it is significantly faster than other popular tokenizers, which makes it very useful for natural language processing. It is also easy to install: if you already have a Python environment on your system, you can simply run pip install blingfire.

If you don't have Python installed, Anaconda is probably the easiest way to set up Python on Windows. After reading about the Bling Fire release on Hacker News, I decided to test the tokenizer myself to see how my results compared to the published benchmarks, using Keras, spaCy, and NLTK for comparison.

My main conclusion was that Bling Fire does appear to be about 20 times as fast as the NLTK tokenizer. I also compared Bling Fire to spaCy and the Keras tokenizer, but I may have been using those tokenizers incorrectly, so those results are less conclusive.

spaCy took more time to process the text than NLTK, which indicates my script was doing more than tokenization; by default, spaCy runs its full pipeline, including part-of-speech tagging and parsing, on every document. The Bing engineers already compared Bling Fire to spaCy and found Bling Fire about 10 times faster in their tests, and that is a better benchmark for comparison purposes.
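One way to isolate just spaCy's tokenizer, and avoid timing the tagging and parsing work, is a blank pipeline. This is a sketch, assuming spaCy is installed (spacy.blank builds a pipeline containing only a tokenizer, so no model download is needed):

```python
# spacy.blank("en") creates a tokenizer-only pipeline, skipping the
# tagging and parsing that the full pipeline performs on each call.
import spacy

nlp = spacy.blank("en")
doc = nlp("Tokenize this sentence, nothing more.")
tokens = [token.text for token in doc]
print(tokens)
```

Timing this call instead of a full model's nlp() call would give a fairer tokenizer-to-tokenizer comparison.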

As for Keras, my tests showed that Bling Fire and Keras frequently took about the same amount of time to tokenize a text, although Keras took twice as long on some documents. I used the Keras text_to_word_sequence function to run these experiments. Again, this result is inconclusive, and the Bing engineers would be better placed to perform a rigorous comparison between Bling Fire and Keras.
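The Keras call is also a one-liner. A sketch with a sample sentence of mine; note that text_to_word_sequence lowercases the text and strips punctuation by default, so it does somewhat less work than the other tokenizers, and newer Keras releases have moved or removed this helper:

```python
# Keras tokenization: text_to_word_sequence lowercases the input and
# strips punctuation by default, returning a plain list of words.
from keras.preprocessing.text import text_to_word_sequence

tokens = text_to_word_sequence("Bling Fire and Keras took similar times.")
print(tokens)
```

Because punctuation is discarded rather than emitted as tokens, the token counts are not directly comparable to NLTK or Bling Fire.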

To perform these comparisons, I had to learn how to measure elapsed time in milliseconds using the datetime module. Simply comparing how many seconds each script took to run would not have shown any differences unless the documents were extremely long or there were a large number of them.
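The timing itself can be sketched with a small stdlib-only helper (the helper name and sample input are mine); subtracting two datetime values yields a timedelta, which converts cleanly to milliseconds:

```python
from datetime import datetime

def time_ms(fn, *args):
    """Run fn(*args) once and return (result, elapsed milliseconds)."""
    start = datetime.now()
    result = fn(*args)
    delta = datetime.now() - start
    return result, delta.total_seconds() * 1000.0

# Example with a trivial "tokenizer": str.split on whitespace.
tokens, elapsed = time_ms(str.split, "a short benchmark sentence")
print(f"{len(tokens)} tokens in {elapsed:.3f} ms")
```

For a steadier measurement, each tokenizer call could be wrapped this way and run several times, keeping the minimum.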

This result shows that NLTK is still very useful for personal projects, and for larger projects the NLTK tokenizer is probably not the bottleneck slowing down the script. The speed difference matters more for programs that need to search through a huge number of documents, such as a search engine.

Beyond speed, the Bling Fire results differ slightly from the NLTK results. According to the GitHub page, two of the biggest differences are that Bling Fire splits up hyphenated words, and that it can split a file path into individual tokens, one for each directory in the path.

When I tested this with spaCy, it also split up hyphenated words, giving the hyphen its own token, but it returned an error when I included a file path in the input. Keras split the hyphenated word and removed the hyphen, and it also returned an error when a file path was included. So Bling Fire may have advantages over other tokenizers beyond processing speed.

I also uploaded the script I used for these experiments to my GitLab portfolio. The project is called Tokenizer Speedtest, and it contains only this script. To use it, open the script in a development environment such as Spyder and run it, then paste a text article into the Python console to see how long each tokenization function takes. Spyder comes with Anaconda, so it requires no additional installation effort.
