A Sentimental Invitation Part III-b

Recap

To recap: in Part I, I was the evangelist; in Part II, I constructed a viable, simple Naïve Bayesian Classifier; in Part III-a, I introduced a bit of algebra and motivated (a) a definition of Bag of Words in terms of a commutative Syntactic Monoid, and (b) gave an example of a simple language in which it is clear that the language itself is not commutative. In this instalment, I'll introduce the notion of a Sentimentally Recognizable language.

Disclaimer repeated and Prelude to Fun

Now, I warned you that Part I would be easy, Part II a bit less so, and Part III not so easy. But trust me, the math behind this doesn't take a lot of time to learn. Still, to appease my friends who aren't excited about math, or those who are but find this boring, I'll serve up one little preview of a fun exercise. I used to read the blog of one John Baez, a very talented mathematician and physicist at UCR. I'm a bit biased, as he's into Category Theory. In any event, he cracked me up (no pun intended) with his notion of a crackpot index. It got me wondering: could we build an algorithm, trained in a supervised fashion, to detect crackpot behaviour? We shall see, as I intend to hack out a web app that will allow you, the reader, to copy and paste suspected crackpot writings and "train" the algo yourself. Now, I warn you, and I know some of my friends out there are mischievous at best: I will be checking to see whether you copy and paste my work, this work, into the app. I am not a crackpot.

Sentimentally Recognizable

For ease of reference, the language we introduced at the end of Part III-a consists of all the sentences you could write in one of the two forms:

  1. This <noun> is not good but is rather bad.
  2. This <noun> is not bad but is rather good.

Let’s call this language L.

An essay in L would not be very interesting. An example of such a fine piece of work would be:

“This book is not good but is rather bad. This dog is not bad but is rather good. This car is not good but is rather bad. This mathematician is not bad but is rather good. Etc., etc.”

We encoded this as follows: {n, g, b} is the alphabet (if one is thinking of things in terms of language theory) or dictionary (if one is thinking of things in terms of sentiment). I'm simplifying, as n could be any noun from some set of nouns, but we lose no mathematical generality by assuming one noun in what follows. Under this encoding, sentence form 1 becomes the string ngb and sentence form 2 becomes nbg, so L consists of all concatenations of ngb and nbg.
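As a quick sanity check, here's a minimal sketch of a recognizer for the encoded language, assuming the single-noun encoding above, so that L = (ngb | nbg)*:

import re

# Encoded essays: any concatenation of the two sentence codes, ngb and nbg.
SENTENCES = re.compile(r'(?:ngb|nbg)*')

def in_L(word: str) -> bool:
    """True iff word is a well-formed essay in L."""
    return SENTENCES.fullmatch(word) is not None

print(in_L('ngbnbg'))     # True: one negative sentence, then one positive one
print(in_L('nbgnbgngb'))  # True
print(in_L('ngbgb'))      # False: the trailing 'gb' is Gibberish
print(in_L(''))           # True: I take the empty essay to be in L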

In Part III-a, I drew a rather ugly-looking automaton that recognizes L. The diagram below on the left is a much cleaner representation of the same automaton, but I have not included the (important) state that represents “Gibberish”. An automaton, by definition, has the property that for every state and every letter, there is an “arrow” emanating from that state to another state, labelled with that letter. Thus, as a convention here, wherever you don't see a letter-labelled arrow emanating from a state, there should be an arrow labelled with that letter going into an (invisible) “Gibberish” state. This convention is driven purely by a desire to make the diagrams pretty and less cluttered. The other point is that I carefully noted in Part III-a that this is an automaton that recognizes L; it is not the minimal automaton, since we don't really need two different accepting states (below left, in green). The automaton on the right, however, is the minimal automaton that recognizes L, and visually we can see that it is obtained by “collapsing” the two accepting states on the left into one state on the right.

[Figure: left, an automaton recognizing L, with two accepting states in green; right, the minimal automaton obtained by collapsing them. The Gibberish state is omitted from both.]

I get (and someone can check) that the Syntactic Monoid of the language has the following properties (in particular, there are 20 elements):

[Table of the monoid's members and relations from the original post.]

Notes:

  1. I worked out the members by hand.
  2. I got a bit lazy and built a program to compute relations.
  3. I have not listed all relations in the table.
  4. The member with ID 5 is a “zero” of the monoid.
  5. The member with ID 8 exhibits the core commutativity relation; I think all other commutativity relations stem from this one.
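If you'd like to check the 20-element count yourself, here is a minimal sketch that computes the transition monoid of the minimal automaton (using my own state numbering, with 4 as Gibberish, and assuming the empty essay is accepted); for a regular language, the transition monoid of the minimal automaton coincides with the Syntactic Monoid:

# Minimal DFA for L = (ngb | nbg)*, with my own state numbering:
# 0 = start/accept, 1 = after n, 2 = after ng, 3 = after nb, 4 = Gibberish.
DELTA = {
    'n': (1, 4, 4, 4, 4),
    'g': (4, 2, 4, 0, 4),
    'b': (4, 3, 0, 4, 4),
}

def compose(f, g):
    """Action of a word uv, where f is the action of u and g the action of v."""
    return tuple(g[f[s]] for s in range(5))

identity = (0, 1, 2, 3, 4)        # action of the empty word
monoid = {identity: ''}           # transformation -> a word that induces it
frontier = [identity]
while frontier:
    f = frontier.pop()
    for letter in 'ngb':
        h = compose(f, DELTA[letter])
        if h not in monoid:
            monoid[h] = monoid[f] + letter
            frontier.append(h)

print(len(monoid))                # 20, matching the count above

zero = (4, 4, 4, 4, 4)            # sends every state to Gibberish
print(monoid[zero])               # a word inducing the zero, e.g. 'bb' or 'nn'

# The core commutativity relation: gb and bg act identically.
print(compose(DELTA['g'], DELTA['b']) == compose(DELTA['b'], DELTA['g']))  # True

Up to my arbitrary choice of IDs, the zero and the gb = bg relation are exactly notes 4 and 5 above.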

Denote by L_Pos the sublanguage of L consisting of all strings that contain more copies of the substring nbg than of ngb. We can take this to mean that a given document has more sentences with positive sentiment than sentences with negative sentiment. Suppose now we wanted to “recognize” L_Pos, but using some structure we have created from the automaton that recognizes L. Just as in our simple example in Part III-a, consisting of strings of the form gbgbggbbggg, L_Pos cannot be recognized by a finite state automaton. An “infinite” state automaton that recognizes L_Pos can be visualized in the following diagram, in which I have again used the convention that there is a unique invisible state “Gibberish”, so that any missing labelled arrow in the diagram means that letter should have an arrow pointing to the invisible “Gibberish” node.

[Figure: the infinite state automaton recognizing L_Pos, with accepting states in green and the corresponding negative-sentiment states in red; the Gibberish state is again invisible.]

In the diagram, the green states are the accepting states, whereas the red states would be the accepting states of a corresponding automaton that recognizes documents of negative sentiment. It is interesting that the diagram gives rise to a symmetry and perhaps (or perhaps not) an accidental/coincidental tessellation of the plane by hexagons. I don't know if there is much to that. It in fact looks like an “unravelling” of the previous finite state automaton, and is very reminiscent of a “covering” from topology. One thing to note is that, unlike the case of the language in {g,b} with more g's than b's, we cannot “collapse” this diagram into one line (I leave that as an exercise).
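One way to make the infinite automaton concrete is to simulate it with the finite control of L's automaton plus a single integer counter, the counter playing the role of the unbounded direction in the diagram. A minimal sketch, assuming the encoding above:

def in_L_pos(word: str) -> bool:
    """Recognize L_Pos: a well-formed essay with strictly more 'nbg'
    (positive) sentences than 'ngb' (negative) ones."""
    balance = 0                       # (# of nbg) - (# of ngb) read so far
    for i in range(0, len(word), 3):  # every sentence of L is 3 letters long
        sentence = word[i:i + 3]
        if sentence == 'nbg':
            balance += 1
        elif sentence == 'ngb':
            balance -= 1
        else:
            return False              # Gibberish
    return balance > 0

print(in_L_pos('nbg'))            # True
print(in_L_pos('ngbnbg'))         # False: balanced
print(in_L_pos('nbgngbnbg'))      # True: two positive, one negative

The counter is the unbounded direction in the diagram: no finite bound on it suffices, which is precisely why no finite state automaton can recognize L_Pos.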

Other exercises left to the reader (with M(L_Pos) denoting the Syntactic Monoid of L_Pos):

  1. M(L_Pos) is noncommutative.
  2. M(L_Pos) contains a submonoid isomorphic to the integers Z under addition.
  3. What is M(L_Pos)?
  4. There is a homomorphism naturally associated with the constructions involved, as follows:

[Diagram from the original post.]

This suggests defining a Sentimentally Recognizable Language as a sextuple

[Sextuple from the original post]

such that the following commutative diagram commutes:

[Commutative diagram from the original post.]

We also require that L is recognizable and, in particular, that M(L) is finite. Here, M(L) denotes the Syntactic Monoid associated with a language L, the horizontal arrows are the homomorphisms from the free semigroups on the alphabets in question onto their respective Syntactic Monoids, and the arrows on the left are inclusions at the level of sets.
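Schematically, and assuming the sextuple collects the two alphabets, the two languages, and the two vertical maps, say (Σ′, Σ, L′, L, ι, φ), the square has the shape:

\[
\begin{array}{ccc}
(\Sigma')^{+} & \xrightarrow{\ \eta'\ } & M(L') \\
\downarrow{\scriptstyle \iota} & & \downarrow{\scriptstyle \varphi} \\
\Sigma^{+} & \xrightarrow{\ \eta\ } & M(L)
\end{array}
\]

In the running example, Σ′ = Σ = {n, g, b}, L′ = L_Pos, and φ is the homomorphism of exercise 4, which I read as mapping M(L_Pos) onto the finite M(L).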

Quick reversion to Bag of Words

Now, let's suppose we were really naïve and attempted a Bag of Words classification to extract exactly the same sentiments that could be derived from the previous example L. Suppose further that we knew g meant positive and b meant negative. Suppose we just wanted to count the number of occurrences of g vis-à-vis b. Well, good luck with that: it won't work, for the obvious reason. If you try to construct three automata that fit into the above framework, you'll come up with a big fat zero; there is an obstruction to doing so. Furthermore, in some sense, which I'll try to hash out next instalment, the failure to do so I think… think… can be captured in the kernel of a homomorphism. Maybe, maybe not.
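Here is the obvious reason, made concrete: every sentence of L, whether ngb or nbg, contributes exactly one g and one b, so the tally carries no sentiment signal whatsoever. A quick sketch:

import random

def random_essay(n_sentences: int) -> str:
    """A random essay in L with a known mix of sentiments."""
    return ''.join(random.choice(['nbg', 'ngb']) for _ in range(n_sentences))

for _ in range(5):
    essay = random_essay(random.randint(1, 10))
    positive = essay.count('nbg') > essay.count('ngb')
    print(essay.count('g'), essay.count('b'), positive)
    assert essay.count('g') == essay.count('b')   # always equal: BoW sees nothing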

Other kinds of noncommutativity

As we saw, noncommutativity of languages can lead to bad results if one uses a “commutative model” like Bag of Words. I admit this one was contrived, and it is pretty much isomorphic (or at least homotopic) to the case of considering “negation”, i.e., the case of saying NG, where N means NOT and G is some adjective such as GOOD or BAD. But you can also have noncommutativity of the following form:

X V Y

vs.

Y V X

where X is a noun as subject, V is a verb, and Y is a noun as object. For example, if I write “Tyson Fury beat Wladimir Klitschko”, that is positive for Tyson Fury and negative for Wladimir Klitschko. But if I write “Wladimir Klitschko beat Tyson Fury”, the sentiments are reversed.
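Bag of Words literally cannot see the difference; a two-line sketch:

from collections import Counter

bag = Counter("Tyson Fury beat Wladimir Klitschko".split())
reversed_bag = Counter("Wladimir Klitschko beat Tyson Fury".split())
print(bag == reversed_bag)   # True: identical bags, opposite sentiments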

More to follow.

Comments

I am quickly realising the value of a good UX designer. A bit like Pininfarina.

Works in Safari and IE for me, Philipppos. Counterintuitively, not in Chrome. UX is proper art, I must say.

Took some sparse data on US airline tweets from Kaggle. Trained the base model and used WebGL. It doesn't work on all browsers; alas, I'm not a UX rockstar. But you can see it here: https://toposglobeclient.azurewebsites.net

Well, it's been some time since I continued with this series. I've been a bit busy setting up some apps to monitor sentiment. Here's one for US Presidential Election sentiment derived from tweets. I'm using the free API, which is not ideal, since the best approach is to go through GNIP, Twitter's service for delivering tweets robustly. That's the next step. https://toposuspresidentelectionclient.azurewebsites.net It may take some time to load before you see numbers and graphs, and it does depend on the server I'm using being up (I haven't worked on any contingency backups for this yet). This is a client-server setup, with WebSockets (thanks Artur) for realtime updates and smoothie.js, which is a rather cool graphing JavaScript library.
