A High-Level Introduction to the C Programming Language
Dino Cajic
Chief Technology Officer @ Vector Laboratories | Spearheading Digital Innovation
C is a great programming language if you want to work closer with the machine. A couple of decades ago, C was considered a high-level programming language; look at how times have changed. We’ve been babied with programming languages like Java, C#, JavaScript, PHP, Python, you name it. You can pull a car using the frame machine, but we’re going to do it with a hammer and a chain. Why? Just to prove that we can. There are other reasons for studying C, the primary one being the speed of the resulting program. Understanding how everything works at a lower level also makes learning other programming languages a breeze. So, let’s get into it. This post focuses on concepts rather than full programs. You can watch some YouTube tutorials for those.
C is a compiled language, meaning that you need a C compiler to convert your readable code into machine code. Machine code is just a bunch of zeros and ones.
C gained its popularity largely because Unix was written in C. Today, the kernels of most major operating systems are written largely in C, and graphics-heavy video games are commonly written in C or C++.
Since C is a small language, most of the functions you’ll use are defined in external libraries. Their declarations live in header files that you include at the top of the source file. The preprocessor copies the contents of the included file and pastes it into the source code where the include statement was entered. The most common one is the stdio.h header file, which stands for standard input/output. It declares the functions you need to read data from and write data to the terminal. A couple of functions that you can’t live without that are declared in the stdio.h file are scanf and printf.
A few things to clear up before we continue. When compiling your code using, for example, gcc, you may have seen something like the following written: gcc somefile.c -o somefile. You’re using the gcc utility to create an executable file called somefile. You can enter gcc somefile.c and it’ll still work. However, this time it’ll create an a.out file. To run it, you would have to enter ./a.out. If you do include the -o somefile, you can run it by entering ./somefile into your terminal. Also, the order matters slightly. You have to enter the name of the executable right after the -o option. You can, however, move the -o to the beginning, such as gcc -o somenewfile somefile.c. Also, the names of the executable file and the source code file don’t have to match. The -o means output; so, it specifies the output file.
The -o stated previously is different from the capital O as in -O, -O2, -O3 and -Ofast. The capital O is used when you want gcc to optimize your code. -O3 includes all of the optimizations of -O and -O2. -Ofast goes further than -O3 by also enabling optimizations, such as -ffast-math, that can break strict standards compliance. Higher optimization levels generally take longer to compile. Gcc optimization is switched off by default (-O0), which keeps compile times short and makes debugging easier.
If you’re writing a C program and you want to accept command line options (like the -o option), you can read them with the getopt() function. First, include the unistd.h header file; unistd.h is not part of the standard C library, but is instead part of POSIX. To read each of the command line options, place the getopt() function into a while loop, e.g. while ((ch = getopt(argc, argv, "do:")) != -1) {…}. The last argument provided to the getopt() function states that both d and o are command line options and that the o option also needs an argument (that’s what the colon means), which is supplied right after the -o option. To get that argument, you’ll use the optarg variable once the o option is matched. Your command line options can also be combined as long as the option that requires an argument is written last (e.g. -do something). If you want to pass negative numbers as ordinary arguments alongside command line options, separate them from the options with "--" (e.g. gcc_custom -do somefile -- -5 somefile.txt).
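To make the getopt() loop more concrete, here is a minimal sketch. The helper name parse_options and its parameters are made up for illustration, and the feature-test macro on the first line is only an assumption about what a strict compiler mode may need in order to expose getopt().

```c
#define _POSIX_C_SOURCE 200809L /* assumption: needed only in strict modes */
#include <stddef.h>
#include <unistd.h> /* POSIX: getopt, optarg, optind */

/* Hypothetical helper: parses -d (a flag) and -o <file> (needs an argument).
   Returns the index of the first non-option argument, or -1 on a bad option. */
int parse_options(int argc, char *argv[], int *debug, const char **outfile)
{
    int ch;

    *debug = 0;
    *outfile = NULL;
    optind = 1; /* start scanning at the first argument */

    while ((ch = getopt(argc, argv, "do:")) != -1) {
        switch (ch) {
        case 'd':
            *debug = 1;
            break;
        case 'o':
            *outfile = optarg; /* the text that followed -o */
            break;
        default: /* getopt returns '?' for an unknown option */
            return -1;
        }
    }
    return optind; /* everything from here on is a plain argument */
}
```

Calling it from main with main’s own argc and argv leaves the remaining arguments, including anything after "--", starting at the returned index.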
You’re probably wondering why we must use the period forward-slash before the file name. If you were to enter somefile without the ./, the shell would look for the file only in the directories listed in its PATH environment variable. If it’s not there, it’ll tell you that the command can’t be found. To get around that, you’re telling the terminal to look in the current directory to execute the file: ./ specifies the current directory. If you really want to eliminate the ./, you can append the absolute path of the folder that contains your executable to your PATH environment variable.
One last thing, if you’re using gcc on a Linux distribution, the compiler will have created a file named somefile after executing the command gcc somefile.c -o somefile. On Windows, the name of the file will have a .exe appended to it.
Strings
C doesn’t support strings out of the box. Strings are stored as an array of characters. When printing a string to the screen, the printf function looks for the null character '\0' to know where the string ends. Each character is stored in 1 byte of memory. If an array of characters has 10 elements, it occupies 10 bytes of sequential memory. If printf doesn’t encounter a null character, it will keep reading into the memory that follows the array, which is not good for us since we have absolutely no idea what’s stored there. So, when creating a string, make sure that the array’s length is at least the length of the string plus one.
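As a small sketch of the “length plus one” rule, the hypothetical helper below copies a string into a fixed-size buffer and always leaves room for the terminating '\0'. The function name and its return convention are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, which holds capacity bytes,
   and always ends dst with '\0'.
   Returns 1 if the whole string fit, 0 if it had to be truncated. */
int copy_terminated(char *dst, size_t capacity, const char *src)
{
    size_t len = strlen(src); /* counts characters, not the '\0' */

    if (capacity == 0)
        return 0;
    if (len + 1 > capacity) {
        memcpy(dst, src, capacity - 1);
        dst[capacity - 1] = '\0'; /* terminate the truncated copy */
        return 0;
    }
    memcpy(dst, src, len + 1); /* characters plus the terminator */
    return 1;
}
```

A five-letter word such as "hello" fits in a char[6] but not in a char[5], because the sixth byte is needed for the null character.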
This also brings up another frequently asked question: why do array indices start at zero and not at one? To access the first element of an array, the array variable only needs to know the memory address where the array starts. It doesn’t separately store the address of each following element. What we do know is that arrays are stored sequentially. So, if we have a character array, and we know that each character takes up only one byte of memory, then we can say that the next character is 1 byte away from the starting point. The first element is similarly 0 bytes away from the starting point. Each index is an offset which literally translates to the distance from the starting point.
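The offset idea can be written out directly. In this sketch (char_at is a made-up name), reading element i means stepping i elements forward from the start of the array, which is what the square-bracket notation does for you.

```c
#include <stddef.h>

/* Reads element i by moving i elements forward from the start of the
   array and dereferencing; text[i] means exactly the same thing. */
char char_at(const char *text, size_t i)
{
    return *(text + i);
}
```

With text pointing at "cat", an offset of 0 gives 'c' and an offset of 3 lands on the terminating null character.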
You can declare a string several ways, but here are two in C: pointing a char pointer at a string literal (char *s = "text";) or declaring a character array (char s[] = "text";, or declaring the array and populating it element by element). A string literal is a constant that’s typically stored in the read-only data segment, so changing it through the pointer is undefined behavior. A character array declared inside a function, whether it’s initialized from a literal or populated later, is allocated on the stack and each element is mutable. You can also store strings on the heap by using the malloc function. In other programming languages, like Java, the new operator is used to allocate space on the heap.
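Here is a small sketch of why the distinction matters. The helper capitalize is hypothetical; it writes to the string it receives, so it must be given a writable character array and never a pointer to a string literal.

```c
#include <ctype.h>

/* Hypothetical helper: uppercases the first character of a writable
   string. Passing a string literal here would be undefined behavior. */
void capitalize(char *s)
{
    if (s[0] != '\0')
        s[0] = (char)toupper((unsigned char)s[0]);
}
```

Declaring char name[] = "dino"; and calling capitalize(name) is fine because name is a mutable copy of the literal. With char *name = "dino"; the same call would try to write to read-only memory.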
What else can you do with strings in C? You need to check the string.h header file for useful function declarations. As a side note, a .h file is a header file that contains declarations. Libraries are often distributed in compiled form, so you may not be able to read the function implementations, but the .h file lets you review the declarations and any notes on how the functions work. If you’re using a Linux operating system, you can learn more about a function by typing man function_name into the terminal (e.g. man strcmp). The man utility’s name is short for manual.
The string.h file is part of C’s standard library. What’s the standard library? It’s just a collection of code that came pre-installed with the compiler that you downloaded (or had with a Linux distribution). Another extremely useful header file is the stdio.h that you use for your input/output (i.e. printf and scanf). C is a very lightweight language so it relies heavily on the standard library.
Conditional Expressions
C follows the short-circuit evaluation technique to speed up the program. That means that if there are multiple comparisons separated with AND operators, as soon as one fails we know that there’s absolutely no way that the overall expression can be true (thanks discrete math), so the rest are skipped. Sometimes this can be a problem if you’re updating a variable in the second expression with a prefix or postfix increment operator, because that update may never run. Similarly, we know that in an OR expression at least one operand must evaluate to true. If the first expression evaluates to true, the result is true either way, so there’s no point evaluating the second one. C originally had no Boolean type. C99 added the _Bool type and the stdbool.h header, which lets you write bool, true and false, but in the end true and false are just 1 and 0 respectively. This can cause unexpected results if you use the assignment operator (=) instead of the equality operator (==) in your conditional expression, since in C a zero represents false and every non-zero value represents true.
What’s the difference between && and &, and between || and |? The bitwise & (AND) and bitwise | (OR) operate on the individual bits of their operands and always evaluate both sides, so there is no short-circuiting. They aren’t drop-in replacements for && and ||, though: 1 && 2 is 1 (true), while 1 & 2 is 0 because the two numbers have no set bits in common.
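To see short-circuit evaluation at work, the sketch below counts how often the right-hand operand actually runs. All of the names are made up for the example.

```c
/* Counts how many times the right-hand operand was evaluated. */
static int right_side_calls;

static int right_side(void)
{
    right_side_calls++;
    return 1;
}

/* Evaluates (left && right_side()) and reports how often right_side ran. */
int calls_for_and(int left)
{
    right_side_calls = 0;
    int result = left && right_side(); /* skipped when left is 0 */
    (void)result;
    return right_side_calls;
}

/* Evaluates (left || right_side()) and reports how often right_side ran. */
int calls_for_or(int left)
{
    right_side_calls = 0;
    int result = left || right_side(); /* skipped when left is non-zero */
    (void)result;
    return right_side_calls;
}
```

The right-hand side never runs for calls_for_and(0) or calls_for_or(1), which is exactly the case where a hidden increment in the second operand would silently be skipped.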
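A quick sketch of how the logical and bitwise operators differ on the same inputs; the function names are only for illustration.

```c
/* Logical AND: treats each operand as true or false; the result is 1 or 0. */
int logical_and(int a, int b)
{
    return a && b;
}

/* Bitwise AND: keeps only the bits that are set in both operands. */
int bitwise_and(int a, int b)
{
    return a & b;
}

/* Bitwise OR: keeps the bits that are set in either operand. */
int bitwise_or(int a, int b)
{
    return a | b;
}
```

For example, 6 is 110 in binary and 3 is 011, so 6 & 3 is 010 (2) and 6 | 3 is 111 (7), while 6 && 3 is simply 1.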
Loops
There are two general types of loops: pretest and posttest loops. In pretest loops, the control statement is evaluated prior to the statements in the loop body. In a posttest loop, the statements in the body are evaluated first followed by the loop condition. When looking at the operational semantics of counting loops, written either as “while” or as “for” loops, you’ll quickly see that they’re very similar: initialize a loop variable, test against terminal value, evaluate statements in loop body and increment loop variable. The first and third expressions in C’s for loop can each contain multiple expressions separated by commas. C allows for the use of the break statement, which terminates the loop, as well as the continue statement, which skips the remaining statements in the loop body and moves on to the next iteration (in a for loop, the third expression still runs before the condition is tested again).
If loops are still hard to visualize, just take a look at the operational semantics of each one. Let’s start off with the for loop:
for (expression_1; expression_2; expression_3) loop_body
Looking at the operational semantics of a for loop, you can quickly see that expression_1 is evaluated first.
expression_1
loop:
    if expression_2 = 0 goto out
    [loop body]
    expression_3
    goto loop
out:
This is the initialization step. The loop label comes after the initialization of the loop variable normally. After the label, the condition is evaluated. If the condition is false, the unconditional branch (goto) transfers the control to the “out” label location in the program. If the condition is true, the statements contained in the loop body are executed. After the execution of the loop body, expression_3 is evaluated. In the for loop, expression_3 normally serves as the step size. After the execution of the third statement, the goto statement transfers the control to the “loop” label location in the program which if you remember comes after the execution of the first expression.
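The operational semantics can be checked by writing the same loop twice, once as a for loop and once with explicit labels and goto statements. This is only a sketch; the function names are made up.

```c
/* Sums 0 .. n-1 with an ordinary for loop. */
int sum_with_for(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

/* The same loop spelled out the way the operational semantics describe it. */
int sum_with_goto(int n)
{
    int total = 0;
    int i = 0;      /* expression_1: initialization */
loop:
    if (!(i < n))   /* expression_2 is false: leave the loop */
        goto out;
    total += i;     /* loop body */
    i++;            /* expression_3: step */
    goto loop;
out:
    return total;
}
```

Both functions return the same result for every n, including n = 0, where the body never runs because the test comes first.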
In C’s for loop, each of the expressions is optional; the semi-colons are not optional. Missing a second expression is the same as having an expression that’s always true; this can potentially cause an infinite loop unless you have an explicit branch (such as break) in your loop body. The first and third expressions can be a series of expressions separated by a comma. The second expression can be a multi-conditional expression. The loop body in C’s for loop is also optional. If no statements are provided, a semi-colon must be included after the closing parenthesis. Since numerous expressions can be evaluated in the for loop’s control statement, it’s common to see for loops without a loop body.
Counter-controlled loops were created for convenience, since so many logically controlled loops had some sort of counting variable. Every counting loop can be built with a logical loop; the reverse isn’t true. The two most common logically controlled loops are the while and do-while loops. The difference between the two is that the while loop is a pretest loop, but the do-while loop is a posttest loop.
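Two short sketches of these for-loop features, with made-up function names: one loop with an empty body and one that uses commas to drive two loop variables at once.

```c
#include <stddef.h>

/* Counts the characters in s with a body-less for loop. All of the work
   happens in the loop header; the lone ';' is the empty body. */
size_t length_of(const char *s)
{
    size_t n;
    for (n = 0; s[n] != '\0'; n++)
        ;
    return n;
}

/* Reverses the first len characters of s. Two loop variables, separated
   by commas, walk toward each other from both ends. */
void reverse_in_place(char *s, size_t len)
{
    size_t i, j;

    if (len == 0)
        return;
    for (i = 0, j = len - 1; i < j; i++, j--) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```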
Like before, let’s examine the operational semantics of both. The general form of the while logical loop is:
while (control_expression)
loop_body
The operational semantics for the while loop looks like the following:
loop:
    if control_expression is false goto out
    [loop body]
    goto loop
out:
In the pretest loop above, the condition is evaluated first. If false, the goto statement transfers the control to the “out” label, terminating the repetition. If the condition evaluates to true, the statements in the loop body are executed and the unconditional branch redirects the execution of the program to the loop label.
Now, let’s examine the general form and operational semantics of a do-while post-test logical loop.
do loop_body
while (control_expression);
The operational semantics are listed as follows:
loop:
    [loop body]
    if control_expression is true goto loop
Examining the operational semantics of the do-while loop we notice that the statements contained within the loop body are executed at least once and are performed before the condition is evaluated. If the control expression is evaluated to true, the goto branches to the loop label.
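The pretest versus posttest difference shows up when the condition is false from the start. This sketch (the function names are illustrative) counts how many times each kind of loop body runs.

```c
/* Pretest: the condition is checked before the body ever runs. */
int while_runs(int start, int limit)
{
    int runs = 0;
    int i = start;

    while (i < limit) {
        runs++;
        i++;
    }
    return runs;
}

/* Posttest: the body runs once before the condition is checked. */
int do_while_runs(int start, int limit)
{
    int runs = 0;
    int i = start;

    do {
        runs++;
        i++;
    } while (i < limit);
    return runs;
}
```

Starting at 5 with a limit of 3, the while body runs zero times and the do-while body runs once; starting at 0, both run three times.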
Functions
You must specify a return type for each function. If the function is not returning anything, void is used as the return type in the function declaration. Arguments that are passed to a function are always passed by value. To get the effect of “pass by reference,” the programmer passes a pointer to the variable and the function dereferences it.
When returning a pointer, make sure it doesn’t point to a variable that was declared inside the function. A local variable is placed on the stack and only lives until the function ends, so a pointer to it is invalid once the function returns. Something else to be cautious of is passing arrays as parameters. Calling the sizeof operator on the array before the function call gives you the correct size of the array; inside the function, however, the parameter is just a pointer, so sizeof gives the size of the pointer variable, not the array. Make sure to pass another argument to the function that contains the size of the array if you need that information. If you’re passing a pointer argument to a function and you don’t want the data it points to to be accidentally modified, include the keyword const before the pointer’s data type (e.g. const int *num).
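A sketch of the difference between receiving copies and receiving addresses; both function names are invented for the example.

```c
/* Receives copies of the caller's values, so the swap is lost on return. */
void swap_copies(int a, int b)
{
    int tmp = a;
    a = b;
    b = tmp;
}

/* Receives the addresses of the caller's variables and swaps what they
   point to, so the caller sees the change. */
void swap_values(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```

Given int x = 1, y = 2;, calling swap_copies(x, y) leaves both unchanged, while swap_values(&x, &y) leaves x as 2 and y as 1.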
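The sizeof pitfall and the usual fix can be sketched as follows. The function names are hypothetical.

```c
#include <stddef.h>

/* Inside the function, values is only a pointer, so sizeof reports the
   size of a pointer no matter how long the caller's array is. */
size_t size_seen_by_function(const int *values)
{
    return sizeof(values);
}

/* The element count travels as a separate argument. The const keeps the
   function from modifying the caller's data through the pointer. */
long sum_array(const int *values, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

In the caller, sizeof data / sizeof data[0] still gives the element count of an array named data, and that is the value to pass as count.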
Once in a blue moon you’ll write some code that’s mutually recursive (i.e. function one calls function two and function two calls function one). In this case, there’s really no way to arrange the function definitions so that the C compiler will be happy; you have to declare the functions before calling them. Even better than placing the declarations (called prototypes in C) in the same file would be to place them in a header file. Function declarations are necessary since C requires a function to be declared before it’s used; the declarations are what make static type checking possible.
When including your custom header file make sure to wrap it in double quotes to tell the compiler that it’s a local file (searched for relative to the including file) and not in a directory where library code is located. You can place the full pathname in your include statement if you’re including a header file with double quotes. During preprocessing, the header file code is “copied” to the point where the “#include” is specified. The compiler doesn’t actually create a new file, instead it “pipes” the information through the compilation process. If you’re including a header file whose definitions are located in another source file, you’ll have to specify both source files when compiling (e.g. gcc file_a.c file_b.c -o file_a). If your function returns an int value, older compilers will still compile the code even if you don’t declare the function before it’s used. Why? Under the original C89 rules, the compiler assumed that an undeclared function returns an int. C99 removed that rule, so modern compilers warn about it or reject the code outright; always declare your functions before you call them.
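A classic, if impractical, illustration of mutual recursion is deciding whether a number is even or odd. In this sketch the two prototypes at the top are what let each function call the other.

```c
/* Prototypes: each function is declared before the other one calls it. */
int is_even(unsigned int n);
int is_odd(unsigned int n);

int is_even(unsigned int n)
{
    if (n == 0)
        return 1;
    return is_odd(n - 1);
}

int is_odd(unsigned int n)
{
    if (n == 0)
        return 0;
    return is_even(n - 1);
}
```

Strictly speaking only is_odd needs a prototype here, since is_even is the one that calls a function defined later, but declaring both keeps the order of the definitions irrelevant.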
C supports variadic functions, which are functions that accept a variable number of arguments. To create your own variadic function, you’ll first have to include the stdarg.h header. When defining the function, you specify that it’s variadic by including the “...” ellipsis after the fixed parameters. Within the function, you’ll need to create a va_list (variable argument list) that gives you access to the extra arguments that are passed to the function. After you create the va_list, you’ll also have to specify the last fixed parameter with the va_start macro; va_start accepts two parameters: the va_list and the last fixed parameter of the function. To finally read the arguments, you’ll use the va_arg macro. va_arg accepts two parameters: the va_list and the type of the argument to read. Once you’re finished reading the list of arguments, you’ll need to tell C that you’re done with the va_end macro; va_end accepts one parameter: the va_list. To create a variadic function, you’ll need to have at least one fixed parameter.
A function name used in an expression evaluates to a pointer to the function, that is, to the address of the function’s code. If you have a function drive(), then drive and &drive both give you a pointer to the function. The function name itself is a constant; you can’t make it refer to a different function. To create a pointer variable that points to a function you’ll have to specify the return type of the function, the name of the pointer variable preceded by * and wrapped in parentheses, and the parameter types of the function that you’re pointing to (e.g. char **(*var_name)(int, char *)). This is normally done when you’re passing a function as an argument to another function or if you’re creating an array of function pointers. Certain object-oriented languages that are built with C utilize function pointers to create many object-oriented features.
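Put together, the steps look roughly like this. sum_ints is a hypothetical function whose one fixed parameter tells it how many int arguments follow; note that the macro that reads an argument is spelled va_arg.

```c
#include <stdarg.h>

/* Hypothetical example: adds up the count int arguments that follow count. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);          /* count is the last fixed parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int); /* read the next argument as an int */
    va_end(args);                   /* done with the argument list */
    return total;
}
```

The function has no way of checking how many arguments were really passed, so a count that doesn’t match the call leads to undefined behavior.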
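A minimal sketch of passing a function to another function. The functions add, multiply and apply are made up for this example.

```c
int add(int a, int b)
{
    return a + b;
}

int multiply(int a, int b)
{
    return a * b;
}

/* operation can point to any function that takes two ints and returns
   an int. */
int apply(int (*operation)(int, int), int a, int b)
{
    return operation(a, b); /* call through the pointer */
}
```

Both apply(add, 3, 4) and apply(&add, 3, 4) are valid, since the function name and its address give the same pointer. An array of such pointers is declared as int (*ops[2])(int, int) = {add, multiply};.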
Pointers
What is a pointer? A pointer is a variable that stores a memory address as its value. We can use that memory address to find our way to the particular area in memory. If you remember earlier, I mentioned that parameters to a function can be passed by value. If you pass an extremely large amount of data by value, it means that the function must make a copy of that data and store it locally. Local variables (variables declared within the function) are stored on the stack. If the value that you just copied is too large, it can cause the stack to run out of memory. Also, copying such large objects (not to be confused with objects in object-oriented programming) is time consuming. It’s much easier to just pass the address of where the object resides.
As a side note, why do functions store their variables in a different section of memory? One reason is scope of variables for recursive functions.
Imagine the following piece of code being evaluated:
int i = 0, j = 1;
i = 2;
j = i;
In the example above, if the variable i is located on the left-hand side of the expression, the value replaces the contents of i. If i is located on the right-hand side, the value of i is assigned to j. Make sure to understand that basic concept. Once you do, dereferencing a pointer on the left-hand side causes the value of the memory location that the pointer is pointing to, to change. Dereferencing a pointer on the right-hand side of the expression causes the value to be retrieved from the memory location that the pointer points to. So, in other words, the * operator can read the contents of a memory address or set the contents of a memory address that the pointer is pointing to.
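The same left-hand side and right-hand side distinction, sketched with pointers. The two helper names are invented for the example.

```c
/* Dereference on the left-hand side: writes value into the int that
   target points to. */
void store_through(int *target, int value)
{
    *target = value;
}

/* Dereference on the right-hand side: reads the int that source
   points to. */
int load_through(const int *source)
{
    return *source;
}
```

With int i = 0; and int *p = &i;, calling store_through(p, 2) changes i to 2, and load_through(p) then returns 2.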
To assign a memory address of a scalar to a pointer variable, you must use the & operator to get the memory address of the scalar. You also have to make sure that both the pointer and the scalar are of the same data type. Why do pointers have types? In a couple of paragraphs, I’ll describe pointer arithmetic. But generally, if you were to add 1 to a char pointer, or 1 to an int pointer, the arithmetic needs to be different since a char occupies 1 byte and an int typically occupies 4 bytes of memory. If an array stores integers and you want to go from array[0] to array[1], you need to move sizeof(int) bytes (typically 4) away from array[0].
Arrays can be used as pointers. In most expressions, the array name evaluates to the memory address of the first element of the array. If you print the address held by the array name and the address of arrayName[0], you’ll notice that they are identical. Array variables can’t be made to point somewhere else, though. Also, when you apply the sizeof operator to the array name itself, the compiler will tell you the size of the whole array. If you instead use a pointer that points to the first element, for example after assigning the array to a pointer variable or passing it to a function, the information about the array’s length is lost and sizeof only gives you the size of the pointer variable, which is typically 4 bytes on 32-bit machines and 8 bytes on 64-bit machines. This loss of information is called decay.
Since the array address is a number, you can do pointer arithmetic to add integers to the pointer and subtract integers from the pointer. If you create two pointers, one for example pointing to the first element and the other pointing to the third element in the array, you can subtract pointers from each other. In array pointer arithmetic, you cannot add two pointers together.
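A short sketch of pointer subtraction; elements_between is a made-up name. The result is counted in elements rather than bytes, which is where the pointer’s type comes in.

```c
#include <stddef.h>

/* Number of elements (not bytes) between two pointers into the same
   array. ptrdiff_t is the standard type for a pointer difference. */
ptrdiff_t elements_between(const int *from, const int *to)
{
    return to - from;
}
```

For an int array a, the distance from &a[0] to &a[2] is 2 even though the two addresses are 2 * sizeof(int) bytes apart. Subtracting pointers that point into different arrays is undefined.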
Arrays
If you understand arrays in a different programming language, it should be simple to understand arrays in C as well. An array is stored in sequential memory addresses with array element zero acting as the memory address that can be referenced; subsequent elements can be accessed through offset calculations. An array can store any data type as long as all of its elements are of the same type. Arrays can also store other arrays; these types of arrays are called multi-dimensional arrays. There are two types of multi-dimensional arrays: jagged and rectangular. C’s two-dimensional arrays are always rectangular. What does that mean? Let’s say that you wanted to store strings (character arrays) in an array. The length of the second dimension will have to equal the number of characters in the longest string plus one (for the null character). When the array is initialized, the unused space after the shorter strings is filled with null characters. A two-dimensional array is stored contiguously in memory, row after row, so if you have a two-dimensional array[3][3], to access the third element of the second row you write array[1][2]. Because the rows sit back to back, that element is 1 * 3 + 2 = 5 elements away from the start of the array; this is the offset calculation that the compiler performs for you.
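The row-major layout can be checked with a little arithmetic. In this sketch the 3 by 3 dimensions match the example in the text, while the macro and function names are assumptions.

```c
#include <stddef.h>

#define ROWS 3
#define COLS 3

/* Offset, in elements, of grid[row][col] from the start of the grid. */
size_t flat_offset(size_t row, size_t col)
{
    return row * COLS + col;
}

/* Distance, in bytes, of grid[row][col] from the start of the grid,
   measured on the real array. */
size_t byte_offset(int grid[ROWS][COLS], size_t row, size_t col)
{
    return (size_t)((char *)&grid[row][col] - (char *)grid);
}
```

For grid[1][2], flat_offset gives 5 and byte_offset gives 5 * sizeof(int), which confirms that the rows are laid out one after another with no gaps.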
You can also create an array of pointers which is just a list of memory addresses stored in an array. This way, you don’t have to declare a second dimension; each pointer can be stored in a single-dimensional array even though the values that they point to (i.e. strings) may have different lengths. The pointers still have to be of the same type (for pointer arithmetic).
Structs
A struct (structured data type) is like an array; array elements are accessed via indices while struct elements are accessed via field names. Arrays require that the data type of the elements be the same while struct fields can have different data types. The total memory size of a struct is at least the sum of the sizes of its fields; the compiler may insert padding bytes between or after fields to keep them aligned, so use sizeof to get the real figure. Fields are stored in memory in the order that they’ve been declared within the struct. Once a struct is defined, its size is fixed regardless of whether all the fields are used; space is always allocated for every field.
Adding an identifier after the keyword struct creates a new data type that you can use to declare variables. When declaring a variable of that type, you have to include the word struct before the type name (e.g. struct vehicle lambo). When initializing a struct variable, place the values in the order that the fields are declared within the struct (e.g. struct vehicle lambo = {"Murci", "mph", 220};). To access a field within a struct, use the dot (.) operator, both to update and to read values. If you assign the struct to another variable, a copy of the struct is made in that variable’s own memory. When dealing with complex structures, sometimes it’s necessary to nest structs. You reach a field of the nested struct by chaining the dot operator (e.g. lambo.model.name, assuming the nested field is called model). The nested struct can be initialized in a similar fashion as a single struct (e.g. struct vehicle lambo = {{"Murci", "green"}, "mph", 220};). If the variable is a pointer to a struct, you’ll need to dereference it before referencing the field (e.g. (*lambo).speed). The -> operator does both in one step (e.g. lambo->speed). When using the dereferencing symbol * together with the dot operator, make sure that the dereference is wrapped in parentheses, since the dot operator has higher precedence than the dereference operator.
To avoid writing the struct keyword in every variable declaration, you can use the typedef keyword and place the identifier, which will act as an alias for the struct, after the closing brace. Once the data type has been given an alias, you may use only the alias in front of the variable name (e.g. vehicle a = {…};).
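One way to see how a struct is laid out is to compare sizeof with the sum of the field sizes. The struct below is hypothetical, and the exact numbers depend on the compiler and platform, because padding bytes may be inserted to keep fields aligned.

```c
#include <stddef.h>

/* Hypothetical struct with fields of different sizes. */
struct reading {
    char unit;    /* 1 byte */
    double value; /* commonly 8 bytes */
    int count;    /* commonly 4 bytes */
};

/* Sum of the individual field sizes, ignoring any padding. */
size_t sum_of_field_sizes(void)
{
    return sizeof(char) + sizeof(double) + sizeof(int);
}
```

sizeof(struct reading) is never smaller than sum_of_field_sizes() and is often larger. The offsetof macro from stddef.h shows where each field starts, in declaration order.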
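A sketch of the copy and pointer behavior described above. The speed field comes from the examples in the text; the name and units field names, and both function names, are assumptions.

```c
struct vehicle {
    const char *name;  /* assumed field name */
    const char *units; /* assumed field name */
    int speed;
};

/* Receives a pointer, so it changes the caller's struct. */
void set_speed(struct vehicle *v, int speed)
{
    v->speed = speed; /* same as (*v).speed = speed */
}

/* Receives a copy, so the caller's struct is left untouched. */
void set_speed_on_copy(struct vehicle v, int speed)
{
    v.speed = speed;
}
```

After struct vehicle lambo = {"Murci", "mph", 220};, assigning lambo to another variable and changing the copy’s speed leaves lambo.speed at 220, while set_speed(&lambo, 211) updates the original.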
Unions
Unions are used when a variable may contain different data types throughout its lifetime. A struct can be used, however, due to how structs are implemented, memory space will be wasted. When declaring a union, the compiler will allocate enough space for the largest field within it (e.g. if a union contains an int and a double, it will allocate enough space for the double). Regardless of how many fields are defined within a union, every field starts at the same memory address, so only one of them holds a meaningful value at a time.
A union looks like a struct, except that the keyword union is used. Typedef can be used with unions as well to create an alias for the data type. You can use a designated initializer to initialize a union by field name (e.g. height x = {.euroStd = 1.1};). You can also set the value with the dot notation after the variable has been declared (e.g. height x; x.euroStd = 1.1;). You don’t have to name a field when initializing; a plain initializer (e.g. height x = {1.1};) sets the first field declared in the union. Unions can be declared within structs to have a field that can accept different data types and potentially save memory space. You can access union fields with the dot (.) operator or, through a pointer, with the -> operator.
For both structs and unions, when an identifier is placed after the closing brace without using typedef, that identifier is declared as a variable of the struct or union type rather than as an alias.
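A small sketch of a typedef’d union. The height alias and the euroStd field come from the examples in the text; the field’s type, the second field and the helper function are assumptions.

```c
typedef union {
    double euroStd; /* field name from the text; the type is assumed */
    int usStd;      /* hypothetical second field */
} height;

/* Hypothetical helper: builds a height with the euroStd field set. */
height height_from_euro(double value)
{
    height h = {.euroStd = value}; /* designated initializer */
    return h;
}
```

Both fields start at the same address and sizeof(height) is at least sizeof(double), so writing to usStd overwrites part of whatever was stored in euroStd.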
Other things to think about
On larger projects, you don’t want to recompile all the code each time you make a change. First, make sure that you have object files of everything using the command gcc -c *.c. The -c tells gcc that it should create the object files but not link them. After the object files are created, you need to run gcc -o file *.o, which will link all of the generated .o files. Because the object files are already compiled, gcc only has to link them together to form an executable. If changes are made to a single file, you’ll only have to recompile that one file using the -c option outlined above (of course specifying your file name instead of using the * symbol). You will have to link all of the files again to create the executable but it’s a drastic reduction in compilation time. You can automate this process using the “make” build automation tool.
When you need to allocate memory at runtime, you’ll use the malloc function. Malloc takes a single parameter; that parameter tells the malloc function how many bytes to allocate on the heap. Since the number of bytes depends on the data type and the platform, the malloc argument almost always utilizes the sizeof operator. To be able to use the malloc function, you’ll first need to include the stdlib.h header. Once the memory has been allocated on the heap, the malloc function returns a general-purpose pointer (void*) to the newly allocated space, or NULL if the allocation failed. In C, the general-purpose pointer converts to a specific pointer type automatically, so a cast isn’t necessary, although some programmers write one anyway. Programmers should always use the free function to deallocate memory on the heap once they’re done with it. If not, a memory leak occurs. If you suspect a memory leak, you can use a Linux utility like valgrind to locate it. Valgrind has its own version of the malloc and free functions; it will intercept your program’s calls and keep track of the code that asks for heap allocation and deallocation. Valgrind works best if your compiled executable contains debug information (to add debug information, use the -g option with gcc).
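A minimal sketch of the malloc and free pairing; make_range is a hypothetical helper. The size is computed with sizeof and the result is checked before it’s used.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a heap array holding 0, 1, ..., count - 1,
   or NULL if the allocation fails. The caller must free() the result. */
int *make_range(size_t count)
{
    int *values = malloc(count * sizeof *values); /* no cast needed in C */

    if (values == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        values[i] = (int)i;
    return values;
}
```

A caller would write int *r = make_range(4);, use r[0] through r[3], and finish with free(r); so that the block is returned to the heap.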
Fun fact, if you look up the definition of heap, it’s “an untidy collection of things piled up haphazardly.” The heap in memory is called the heap because it stores data in an unorganized way.