What is Lexical Analysis? How it works.

abhinav Ashok kumar

Curating Insights & Innovating in GPU Compiler | Performance Analyst at Qualcomm | LLVM Contributor | Maintain News Letter | AI/ML in Compiler

Published: September 25, 2022

The first phase of compilation (i.e. the translation of one language into another) is known as lexical analysis. This phase takes its input, as a stream of characters, from the pre-processing stage of the toolchain. In simple words, lexical analysis converts a sequence of characters into a sequence of tokens. It also discards extra whitespace and comments written in the source code and does a few other things, which we are going to discuss in this post.

The programs that perform lexical analysis in compiler design are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects an invalid token, it reports an error. The role of the lexical analyzer in compiler design is to read character streams from the source code, check for legal tokens, and pass them to the syntax analyzer when it asks for them.

/* Basic algorithm (pseudocode) */
while (tokenizer.hasToken())
{
    auto currToken = tokenizer.nextToken();
    if (currToken.isValid())
    {
        // use the token here
    }
    else
    {
        // invalid token: report an error
    }
}
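
The loop above can be turned into a small runnable sketch. Everything here is an assumption made for illustration: the `Token` and `Tokenizer` types, the `collectTokens` driver, and the toy rule that a token is either a run of letters, digits and underscores or one of the characters `(){}=,;`. A real lexer is considerably more involved.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record: the matched text plus a validity flag.
struct Token {
    std::string text;
    bool valid;
};

// Hypothetical tokenizer for a toy language (words and a few punctuation marks).
class Tokenizer {
public:
    explicit Tokenizer(std::string src) : src_(std::move(src)) {}

    // Skips whitespace and reports whether any input is left to scan.
    bool hasToken() {
        while (pos_ < src_.size() && std::isspace(byteAt(pos_))) ++pos_;
        return pos_ < src_.size();
    }

    // Returns the next token; call only after hasToken() returned true.
    Token nextToken() {
        assert(pos_ < src_.size());
        const std::size_t start = pos_;
        if (isWordChar(byteAt(pos_))) {
            while (pos_ < src_.size() && isWordChar(byteAt(pos_))) ++pos_;
            return {src_.substr(start, pos_ - start), true};
        }
        ++pos_;  // a single character that is not part of a word
        const std::string legal = "(){}=,;";
        return {src_.substr(start, 1), legal.find(src_[start]) != std::string::npos};
    }

private:
    unsigned char byteAt(std::size_t i) const { return static_cast<unsigned char>(src_[i]); }
    static bool isWordChar(unsigned char c) { return std::isalnum(c) != 0 || c == '_'; }

    std::string src_;
    std::size_t pos_ = 0;
};

// Runs the basic loop: collects valid tokens, throws on the first invalid one.
std::vector<std::string> collectTokens(const std::string& src) {
    Tokenizer tokenizer(src);
    std::vector<std::string> out;
    while (tokenizer.hasToken()) {
        Token currToken = tokenizer.nextToken();
        if (!currToken.valid) throw std::runtime_error("invalid token: " + currToken.text);
        out.push_back(currToken.text);
    }
    return out;
}
```

Calling `collectTokens("int a = 10;")` yields the five lexemes `int`, `a`, `=`, `10` and `;`, while an input such as `a @ b` throws because `@` is not a legal character in this toy language.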

Basic terms used in a lexer

Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern of a token. A lexeme is nothing but an instance of a token.

Token

A token is a sequence of characters that represents a unit of information in the source program, such as a keyword, an identifier or a numeric constant.

Pattern

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword, the pattern is simply the sequence of characters that forms the keyword.
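
To make the three terms concrete, here is a minimal sketch that writes each pattern as a regular expression and maps a lexeme to the first token whose pattern matches it. The function name `tokenFor`, the three token names and the deliberately tiny keyword list are assumptions chosen for this example, not something clang or any other compiler exposes.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns the name of the first token whose pattern matches the whole lexeme,
// or "unknown" if no pattern matches. Keywords are listed first so that "int"
// is not classified as a plain identifier.
std::string tokenFor(const std::string& lexeme) {
    static const std::vector<std::pair<std::string, std::regex>> patterns = {
        {"keyword",          std::regex("int|return")},
        {"identifier",       std::regex("[A-Za-z_][A-Za-z0-9_]*")},
        {"numeric_constant", std::regex("[0-9]+")},
    };
    for (const auto& entry : patterns) {
        // regex_match succeeds only if the entire lexeme fits the pattern.
        if (std::regex_match(lexeme, entry.second)) return entry.first;
    }
    return "unknown";
}
```

Here `int`, `token` and `10` are lexemes; `keyword`, `identifier` and `numeric_constant` are the tokens they are instances of; and the regular expressions are the patterns. A string such as `1a` fits none of the three patterns and comes back as `unknown`.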

Let's take a basic C program:

int token()

{
		int a=10,b=20,c=30;
		printf("Numbers are a=%d b=%d c=%d",a,b,c);
	       return 0;
}

You can see the tokens generated for this source code by running the following command with the clang compiler (the file is assumed to be saved as a.c, the name that appears in the output below):

clang -fsyntax-only -Xclang -dump-tokens a.c &>out

clang writes the token dump to its error stream, which &> redirects together with standard output into the file out. You can view it with the command below:

>cat out
int 'int' [StartOfLine] Loc=<a.c:1:1>
identifier 'token' [LeadingSpace] Loc=<a.c:1:5>
l_paren '(' Loc=<a.c:1:10>
r_paren ')' Loc=<a.c:1:11>
l_brace '{' [StartOfLine] Loc=<a.c:3:1>
int 'int' [StartOfLine] [LeadingSpace] Loc=<a.c:4:3>
identifier 'a' [LeadingSpace] Loc=<a.c:4:7>
equal '=' Loc=<a.c:4:8>
numeric_constant '10' Loc=<a.c:4:9>
comma ',' Loc=<a.c:4:11>
identifier 'b' Loc=<a.c:4:12>
equal '=' Loc=<a.c:4:13>
numeric_constant '20' Loc=<a.c:4:14>
comma ',' Loc=<a.c:4:16>
identifier 'c' Loc=<a.c:4:17>
equal '=' Loc=<a.c:4:18>
numeric_constant '30' Loc=<a.c:4:19>
semi ';' Loc=<a.c:4:21>
identifier 'printf' [StartOfLine] [LeadingSpace] Loc=<a.c:5:3>
l_paren '(' Loc=<a.c:5:9>
string_literal '"Numbers are a=%d b=%d c=%d"' Loc=<a.c:5:10>
comma ',' Loc=<a.c:5:38>
identifier 'a' Loc=<a.c:5:39>
comma ',' Loc=<a.c:5:40>
identifier 'b' Loc=<a.c:5:41>
comma ',' Loc=<a.c:5:42>
identifier 'c' Loc=<a.c:5:43>
r_paren ')' Loc=<a.c:5:44>
semi ';' Loc=<a.c:5:45>
return 'return' [StartOfLine] [LeadingSpace] Loc=<a.c:6:9>
numeric_constant '0' [LeadingSpace] Loc=<a.c:6:16>
semi ';' Loc=<a.c:6:17>
r_brace '}' [StartOfLine] Loc=<a.c:7:1>
eof '' Loc=<a.c:7:2>

Put simply, the lexemes map to tokens as follows:

Lexeme	Token
int	keyword
token	identifier
(	punctuator
)	punctuator
{	punctuator
int	keyword
a	identifier
=	operator
10	numeric_constant
,	punctuator
b	identifier
=	operator
20	numeric_constant
,	punctuator
c	identifier
=	operator
30	numeric_constant
;	punctuator
printf	identifier
(	punctuator
"Numbers are a=%d b=%d c=%d"	string_literal
,	punctuator
a	identifier
,	punctuator
b	identifier
,	punctuator
c	identifier
)	punctuator
;	punctuator
return	keyword
0	numeric_constant
;	punctuator
}	punctuator

No tokens are generated for the following:

1.  Comments
2.  Macros (preprocessor directives such as #define are consumed by the preprocessor)
3.  Whitespace
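
As a rough illustration of how a scanner can pass over whitespace and comments without emitting tokens, here is a sketch of a helper that advances past them. The name `skipTrivia` is made up for this example, and it only handles C-style `//` and `/* */` comments; macros are left out because they belong to the preprocessor.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Advances past whitespace, // line comments and /* block comments */
// starting at `pos`, and returns the index of the next character that
// could begin a token (or src.size() at end of input).
std::size_t skipTrivia(const std::string& src, std::size_t pos) {
    while (pos < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[pos]))) {
            ++pos;
        } else if (src.compare(pos, 2, "//") == 0) {
            // Line comment: skip up to (not including) the newline.
            while (pos < src.size() && src[pos] != '\n') ++pos;
        } else if (src.compare(pos, 2, "/*") == 0) {
            const std::size_t end = src.find("*/", pos + 2);
            // In this sketch an unterminated block comment swallows the rest of the input.
            pos = (end == std::string::npos) ? src.size() : end + 2;
        } else {
            break;  // start of a real token
        }
    }
    return pos;
}
```

A lexer would typically call a helper like this before scanning each token, so the parser never sees comments or whitespace at all.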

Let's discuss the architecture of the lexical analyzer

The basic job of the lexical analyzer is to read input characters and produce tokens from them.

To do this, the lexical analyzer (lexer) scans the source code of the program and identifies the tokens one by one. Scanners are usually implemented to produce a token only when the parser requests one.

Let's understand it in detail:

The lexical analyzer follows this algorithm to generate tokens:

1.  The parser sends a command asking for the next token. Let's assume this command is called “Get next token”.
2.  Once the lexical analyzer receives this command, it scans the input until it finds the next token.
3.  Once it finds the next token, it returns it to the parser.
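
The request-and-return cycle described above might look like the following sketch. The `Lexer` class, its `getNextToken` and `consumed` members, the `Kind` enum and the `parseAll` stand-in for a parser are all hypothetical names invented here; the point is only that the lexer scans as far as one token and then stops until it is asked again.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Identifier, Number, Punct, Eof };

struct Token {
    Kind kind;
    std::string lexeme;
};

class Lexer {
public:
    explicit Lexer(std::string src) : src_(std::move(src)) {}

    // "Get next token": scans only as far as needed to produce one token.
    Token getNextToken() {
        while (pos_ < src_.size() && isSpace(src_[pos_])) ++pos_;
        if (pos_ == src_.size()) return {Kind::Eof, ""};
        const std::size_t start = pos_;
        if (isDigit(src_[pos_])) {
            while (pos_ < src_.size() && isDigit(src_[pos_])) ++pos_;
            return {Kind::Number, src_.substr(start, pos_ - start)};
        }
        if (isIdentStart(src_[pos_])) {
            while (pos_ < src_.size() && (isIdentStart(src_[pos_]) || isDigit(src_[pos_]))) ++pos_;
            return {Kind::Identifier, src_.substr(start, pos_ - start)};
        }
        ++pos_;  // any other character becomes a one-character token in this sketch
        return {Kind::Punct, src_.substr(start, 1)};
    }

    // Number of input characters consumed so far.
    std::size_t consumed() const { return pos_; }

private:
    static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isIdentStart(char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
    }

    std::string src_;
    std::size_t pos_ = 0;
};

// Stand-in for a parser: keeps requesting tokens until end of input.
std::vector<std::string> parseAll(const std::string& src) {
    Lexer lexer(src);
    std::vector<std::string> lexemes;
    for (Token t = lexer.getNextToken(); t.kind != Kind::Eof; t = lexer.getNextToken()) {
        lexemes.push_back(t.lexeme);
    }
    return lexemes;
}
```

After a single `getNextToken()` call on the input `int a=10;`, `consumed()` reports 3: only `int` has been read, and the rest of the input stays untouched until the parser asks for more.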

The lexical analyzer skips whitespace and comments while creating these tokens. If an error is present, the lexical analyzer correlates that error with the source file and line number.
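
One simple way to correlate an error with its place in the source is to convert a character offset into a line and column. The sketch below does that by counting newlines; `Location`, `locate` and `formatError` are hypothetical helpers, and the `file:line:column` shape is chosen to resemble the `Loc=<a.c:4:3>` entries in the token dump shown earlier. Production lexers usually track the position incrementally instead of rescanning.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// 1-based position in a source file.
struct Location {
    int line;
    int column;
};

// Converts a character offset into a line/column pair by counting newlines.
// Every character, including a tab, advances the column by one.
Location locate(const std::string& src, std::size_t offset) {
    assert(offset <= src.size());
    Location loc{1, 1};
    for (std::size_t i = 0; i < offset; ++i) {
        if (src[i] == '\n') {
            ++loc.line;
            loc.column = 1;
        } else {
            ++loc.column;
        }
    }
    return loc;
}

// Builds a "file:line:column: error: message" string for the given offset.
std::string formatError(const std::string& file, const std::string& src,
                        std::size_t offset, const std::string& message) {
    const Location loc = locate(src, offset);
    return file + ":" + std::to_string(loc.line) + ":" +
           std::to_string(loc.column) + ": error: " + message;
}
```

For the two-line input `int a;` followed by `int b = @;`, the offset of `@` maps to line 2, column 9.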

Roles of the lexical analyzer

The lexical analyzer performs the following tasks:

Identifies tokens and enters them into the symbol table
Removes whitespace and comments from the source program
Correlates error messages with the source program
Expands macros if they are found in the source program
Reads input characters from the source program

Lexical Errors

A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:

Lexical errors are not very common, but they must be handled by the scanner
Misspelled identifiers, operators and keywords are often listed as lexical errors, although a misspelled keyword is usually scanned as a valid identifier and only rejected by a later phase
Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
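
The "illegal character" case can be sketched as a scan for the first character that cannot be part of any token. The function `firstIllegalChar` and the toy alphabet it accepts (letters, digits, underscore, whitespace and `(){}=,;`) are assumptions made for this example only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the offset of the first character that cannot appear in any token
// of this toy language, or std::string::npos if every character is legal.
std::size_t firstIllegalChar(const std::string& src) {
    const std::string punct = "(){}=,;";
    for (std::size_t i = 0; i < src.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isalnum(c) || c == '_' || std::isspace(c)) continue;
        if (punct.find(src[i]) != std::string::npos) continue;
        return i;  // nothing in the language can contain this character
    }
    return std::string::npos;
}
```

For `int a = 1 $ 2;` the function returns 10, the offset of `$`. A misspelling such as `whlie (x)` contains only legal characters, so a scan like this accepts it and the mistake has to be caught elsewhere.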

Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:

Delete one character from the remaining input
In panic mode, successive characters are ignored until a well-formed token is reached
Insert a missing character into the remaining input
Replace a character with another character
Transpose two adjacent characters
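
Each of these techniques is a small edit on the remaining input, which the sketch below spells out as one helper per technique. All five function names are invented for illustration, and `panicSkip` assumes the same toy alphabet of letters, digits, underscore and `(){}=,;`; how a real lexer decides which repair to try is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Delete: removes the character at `pos`.
std::string deleteChar(std::string s, std::size_t pos) {
    assert(pos < s.size());
    s.erase(pos, 1);
    return s;
}

// Insert: adds the missing character `c` before position `pos`.
std::string insertChar(std::string s, std::size_t pos, char c) {
    assert(pos <= s.size());
    s.insert(pos, 1, c);
    return s;
}

// Replace: overwrites the character at `pos` with `c`.
std::string replaceChar(std::string s, std::size_t pos, char c) {
    assert(pos < s.size());
    s[pos] = c;
    return s;
}

// Transpose: swaps the characters at `pos` and `pos + 1`.
std::string transposeChars(std::string s, std::size_t pos) {
    assert(pos + 1 < s.size());
    std::swap(s[pos], s[pos + 1]);
    return s;
}

// Panic mode: starting at `pos`, skips characters until one is found that is
// whitespace or could belong to a token; returns the position to resume from.
std::size_t panicSkip(const std::string& s, std::size_t pos) {
    const std::string punct = "(){}=,;";
    while (pos < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        if (std::isalnum(c) || c == '_' || std::isspace(c)) break;
        if (punct.find(s[pos]) != std::string::npos) break;
        ++pos;
    }
    return pos;
}
```

For example, `transposeChars("itn", 1)` repairs the input to `int`, and `panicSkip("a = $#@ b;", 4)` returns 7, the position just past the run of illegal characters.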
