What is Lexical Analysis? How it works.

The first phase of compilation (i.e. the translation of one language into another) is known as lexical analysis. This phase takes as input the source code produced by the preprocessing phase of the toolchain, as a stream of characters. In simple words, lexical analysis converts a sequence of characters into a sequence of tokens. It also removes extra whitespace and comments written in the source code, and it does several other things, which we are going to discuss in this post.

The programs that perform lexical analysis in compiler design are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a token is invalid, it generates an error. The role of the lexical analyzer is to read character streams from the source code, check for legal tokens, and pass the data to the syntax analyzer on demand.


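/* Basic algorithm (pseudocode; `tokenizer` is a hypothetical object) */
while (tokenizer.hasToken())
{
    if (tokenizer.isValidToken())
    {
        auto currToken = tokenizer.nextToken();
        // use currToken here
    }
    else
    {
        // invalid token: report an error and skip the offending characters
    }
}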
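The loop above can be turned into something that compiles. The following is only a minimal sketch under simplifying assumptions: the function name `tokenize` and the small set of accepted characters are made up for this post, and a real C lexer handles far more (multi-character operators, string literals, comments and so on).

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits `src` into lexemes (identifiers/keywords,
// integer literals, single-character punctuation), skipping whitespace.
// Throws std::runtime_error on a character that cannot start any token.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }      // whitespace yields no token
        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {            // identifier or keyword
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
        } else if (std::isdigit(c)) {                 // integer literal
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
        } else if (std::string("(){}=,;").find(src[i]) != std::string::npos) {
            ++i;                                      // one-character token
        } else {
            throw std::runtime_error(std::string("invalid character: ") + src[i]);
        }
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}
```

Calling `tokenize("int a=10;")` yields the five lexemes `int`, `a`, `=`, `10` and `;`, while an input containing a character such as `@` ends in an exception, which plays the role of the "invalid token" branch.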

Basic terms used in a lexer

Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern of a token. In other words, a lexeme is an instance of a token.

Token

A token is a named category (such as keyword, identifier or numeric constant) for a sequence of characters that represents one unit of information in the source program.

Pattern

A pattern is a description of the form that the lexemes of a token can take. In the case of a keyword used as a token, the pattern is simply the sequence of characters that spells the keyword.
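To make the three terms concrete, here is a small sketch that writes each pattern as a regular expression and reports which token a given lexeme belongs to. The function name `classify`, the token names and the patterns themselves are assumptions chosen for illustration; they are not the token set of any real compiler.

```cpp
#include <regex>
#include <string>

// Returns an illustrative token name for one lexeme by testing the lexeme
// against one pattern per token kind.
std::string classify(const std::string& lexeme) {
    static const std::regex keyword("int|return");
    static const std::regex identifier("[A-Za-z_][A-Za-z0-9_]*");
    static const std::regex number("[0-9]+");

    // Keywords are checked first because they also match the identifier pattern.
    if (std::regex_match(lexeme, keyword))    return "keyword";
    if (std::regex_match(lexeme, identifier)) return "identifier";
    if (std::regex_match(lexeme, number))     return "numeric_constant";
    return "unknown";
}
```

Here `classify("int")` gives `keyword`, `classify("token")` gives `identifier` and `classify("10")` gives `numeric_constant`: three different lexemes, each an instance of the token whose pattern it matches.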

Let's take a basic C code example:


int token()

{
  int a=10,b=20,c=30;
  printf("Numbers are a=%d b=%d c=%d",a,b,c);
  return 0;
}

You can see the tokens generated from this source code by running the following command with the clang compiler (here the code is saved as a.c):

clang -fsyntax-only -Xclang -dump-tokens a.c &>out

The tokens will be stored in the file out; you can view them with the command below:

>cat out
int 'int' [StartOfLine] Loc=<a.c:1:1>
identifier 'token' [LeadingSpace] Loc=<a.c:1:5>
l_paren '(' Loc=<a.c:1:10>
r_paren ')' Loc=<a.c:1:11>
l_brace '{' [StartOfLine] Loc=<a.c:3:1>
int 'int' [StartOfLine] [LeadingSpace] Loc=<a.c:4:3>
identifier 'a' [LeadingSpace] Loc=<a.c:4:7>
equal '=' Loc=<a.c:4:8>
numeric_constant '10' Loc=<a.c:4:9>
comma ',' Loc=<a.c:4:11>
identifier 'b' Loc=<a.c:4:12>
equal '=' Loc=<a.c:4:13>
numeric_constant '20' Loc=<a.c:4:14>
comma ',' Loc=<a.c:4:16>
identifier 'c' Loc=<a.c:4:17>
equal '=' Loc=<a.c:4:18>
numeric_constant '30' Loc=<a.c:4:19>
semi ';' Loc=<a.c:4:21>
identifier 'printf' [StartOfLine] [LeadingSpace] Loc=<a.c:5:3>
l_paren '(' Loc=<a.c:5:9>
string_literal '"Numbers are a=%d b=%d c=%d"' Loc=<a.c:5:10>
comma ',' Loc=<a.c:5:38>
identifier 'a' Loc=<a.c:5:39>
comma ',' Loc=<a.c:5:40>
identifier 'b' Loc=<a.c:5:41>
comma ',' Loc=<a.c:5:42>
identifier 'c' Loc=<a.c:5:43>
r_paren ')' Loc=<a.c:5:44>
semi ';' Loc=<a.c:5:45>
return 'return' [StartOfLine] [LeadingSpace] Loc=<a.c:6:9>
numeric_constant '0' [LeadingSpace] Loc=<a.c:6:16>
semi ';' Loc=<a.c:6:17>
r_brace '}' [StartOfLine] Loc=<a.c:7:1>
eof '' Loc=<a.c:7:2>


        


In simplified form, the tokens are generated as follows:

Lexeme                          Token
int                             keyword
token                           identifier
(                               operator
)                               operator
{                               operator
int                             keyword
a                               identifier
=                               operator
10                              numeric_constant
,                               operator
b                               identifier
=                               operator
20                              numeric_constant
,                               operator
c                               identifier
=                               operator
30                              numeric_constant
;                               operator
printf                          identifier
(                               operator
"Numbers are a=%d b=%d c=%d"    string_literal
,                               operator
a                               identifier
,                               operator
b                               identifier
,                               operator
c                               identifier
)                               operator
;                               operator
return                          keyword
0                               numeric_constant
;                               operator
}                               operator


        

Tokens are not generated for the following:

1.  Comments
2.  Macros (they are expanded by the preprocessor)
3.  Whitespace
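As an illustration of how whitespace and comments can be dropped before a token is recognized, here is a minimal sketch. The helper name `skipTrivia` is hypothetical, and only C-style `//` and `/* */` comments are handled.

```cpp
#include <cctype>
#include <string>

// Returns the index of the first character at or after `pos` that is not
// whitespace and not inside a // or /* */ comment. An unterminated block
// comment consumes the rest of the input.
std::size_t skipTrivia(const std::string& src, std::size_t pos) {
    while (pos < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[pos]))) {
            ++pos;                                        // whitespace
        } else if (src.compare(pos, 2, "//") == 0) {
            while (pos < src.size() && src[pos] != '\n')  // line comment
                ++pos;
        } else if (src.compare(pos, 2, "/*") == 0) {
            const std::size_t end = src.find("*/", pos + 2);
            pos = (end == std::string::npos) ? src.size() : end + 2;
        } else {
            break;                                        // start of a real token
        }
    }
    return pos;
}
```

A lexer would call such a helper before reading each token, so that the characters it skips never show up in the token stream.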

Let's discuss the architecture of a lexical analyzer

The basic job of the lexical analyzer is to read input characters and produce tokens from them.

To do this, the lexical analyzer (lexer) scans the source code of the program and identifies the tokens one by one. Scanners are usually implemented to produce a token only when the parser requests one.

[Figure: lexical analyzer architecture diagram (image not available)]

Let's understand it in detail:

The lexical analyzer follows these steps to generate a token:

  1. The parser sends a command to the lexical analyzer to get the next token. Let's call this command “Get next token”.
  2. Once the lexical analyzer receives this command, it scans the input until it finds the next token.
  3. Once it finds the next token, it returns it to the parser.

The lexical analyzer skips whitespace and comments while creating these tokens. If an error is present, the lexical analyzer correlates that error with the source file and line number.
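The request-driven flow described above can be sketched as a small class whose `next()` method plays the role of "Get next token". Everything here is an assumption made for illustration: the names `Token` and `Lexer`, the token kinds, and the tiny set of punctuation characters. Keywords are not separated from identifiers, and comments are not handled. Each token carries the line it was found on, so an error can be reported against a source line.

```cpp
#include <cctype>
#include <string>
#include <utility>

struct Token {
    std::string kind;    // "identifier", "number", "punct", "error" or "eof"
    std::string lexeme;  // the matched characters
    int line;            // 1-based line on which the token starts
};

// Hypothetical pull-style lexer: it does no work until next() is called.
class Lexer {
public:
    explicit Lexer(std::string src) : src_(std::move(src)) {}

    Token next() {
        // Skip whitespace, counting newlines to keep the line number current.
        while (pos_ < src_.size() && std::isspace(byte(pos_))) {
            if (src_[pos_] == '\n') ++line_;
            ++pos_;
        }
        if (pos_ >= src_.size()) return {"eof", "", line_};

        const std::size_t start = pos_;
        std::string kind;
        if (std::isalpha(byte(pos_)) || src_[pos_] == '_') {
            while (pos_ < src_.size() && (std::isalnum(byte(pos_)) || src_[pos_] == '_'))
                ++pos_;
            kind = "identifier";
        } else if (std::isdigit(byte(pos_))) {
            while (pos_ < src_.size() && std::isdigit(byte(pos_)))
                ++pos_;
            kind = "number";
        } else {
            // Any single character outside the known set is reported as an error.
            kind = std::string("(){}=,;").find(src_[pos_]) != std::string::npos
                       ? "punct" : "error";
            ++pos_;
        }
        return {kind, src_.substr(start, pos_ - start), line_};
    }

private:
    unsigned char byte(std::size_t i) const { return static_cast<unsigned char>(src_[i]); }

    std::string src_;
    std::size_t pos_ = 0;
    int line_ = 1;
};
```

A parser would hold a `Lexer` and call `next()` each time it needs one more token, stopping when it receives a token of kind `eof`. For the input `"int a;\n@ b"` the fourth call returns an `error` token for `@` with `line` equal to 2.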


Roles of the Lexical analyzer

The lexical analyzer performs the following tasks:

  • Helps to identify tokens and enter them into the symbol table
  • Removes whitespace and comments from the source program
  • Correlates error messages with the source program
  • Helps to expand macros if they are found in the source program
  • Reads input characters from the source program
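The symbol table mentioned in the first point can be pictured with a short sketch. This is a simplified, hypothetical design (the class name `SymbolTable` and its methods are made up for this post): it only assigns a small integer id to each distinct identifier, whereas a real compiler also stores types, scopes and other attributes.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical symbol table: maps each distinct identifier to a small
// integer id, so later phases can refer to the id instead of the text.
class SymbolTable {
public:
    // Returns the existing id for `name`, or assigns the next free id.
    int intern(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(names_.size());
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }

    // Looks up the identifier text for an id handed out by intern().
    const std::string& name(int id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> names_;
};
```

Each time the lexer recognizes an identifier it could call `intern()` with the lexeme; seeing `a` a second time returns the same id as the first time, so the table holds one entry per distinct name.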

Lexical Errors

A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:

  • Lexical errors are not very common, but they should be handled by the scanner
  • Misspellings of identifiers, operators and keywords are considered lexical errors
  • Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token

Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:

  • Remove one character from the remaining input
  • In panic mode, ignore successive characters until a well-formed token is reached
  • Insert the missing character into the remaining input
  • Replace a character with another character
  • Transpose two adjacent characters
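Of the techniques above, panic mode is the simplest to sketch in code. The following is an illustration under assumptions, not a production strategy: the function names and the set of characters that may start a token are invented for a toy language. On an illegal character it records the position once, then discards input until it reaches whitespace or a character that can start a token.

```cpp
#include <cctype>
#include <string>
#include <vector>

// True if `c` can begin a token in this toy language (an assumption for the
// example): letters, digits, '_' and a few punctuation characters.
bool canStartToken(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
           std::string("(){}=,;").find(c) != std::string::npos;
}

// Panic-mode sketch: returns the start position of every run of illegal
// characters. Each run is reported once and then skipped as a whole.
std::vector<std::size_t> findLexicalErrors(const std::string& src) {
    std::vector<std::size_t> errors;
    std::size_t i = 0;
    while (i < src.size()) {
        const bool space = std::isspace(static_cast<unsigned char>(src[i])) != 0;
        if (space || canStartToken(src[i])) { ++i; continue; }
        errors.push_back(i);                       // report the error once
        while (i < src.size() &&
               !std::isspace(static_cast<unsigned char>(src[i])) &&
               !canStartToken(src[i]))
            ++i;                                   // discard until we can resync
    }
    return errors;
}
```

For the input `"a = @#$ 10; b ` c"` the function reports two errors, at positions 4 and 14: the run `@#$` counts as a single error, and scanning resumes normally at `10`.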
