---
title: "Introduction to C Programming"
params:
  category: 8
  number: 100
  time: 60
  level: beginner
  tags: "c,gcc,repl.it"
  description: "Introduces basic structure of a C program. Shows
                how to compile, link, and run a C program using gcc
                and online using repl.it."
date: "<small>`r Sys.Date()`</small>"
author: "<small>Martin Schedlbauer</small>"
email: "m.schedlbauer@neu.edu"
affilitation: "Northeastern University"
output: 
  bookdown::html_document2:
    toc: true
    toc_float: true
    collapsed: false
    number_sections: false
    code_download: true
    theme: spacelab
    highlight: tango
---

---
title: "<small>`r params$category`.`r params$number`</small><br/><span style='color: #2E4053; font-size: 0.9em'>`r rmarkdown::metadata$title`</span>"
---

```{r code=xfun::read_utf8(paste0(here::here(),'/R/_insert2DB.R')), include = FALSE}
```

## Introduction

C is a general-purpose programming language developed at Bell Labs in the early 1970s for writing the Unix operating system. Because it was designed for systems programming, it maps closely to the way machines work, and programming in C is much lower-level than working with Java or Python. For example, understanding how computer memory is allocated is an important aspect of programming in C. Although C is often considered "hard to learn", the language itself is small and simple. Even alongside modern languages such as Java, Python, Go, and R, C remains an important skill for systems work because it gives direct access to memory and hardware. It is widely used for writing device drivers, operating systems, and very fast code for data processing.

C is a very common language, and it is the language in which many applications, systems, and even other languages are built. For example, the core of both Windows and macOS is written largely in C, the standard Python interpreter (CPython) and much of the R interpreter are written in C, and so is the Git version control system, to name a few.

C is a compiled language, which means that before a program can run, a compiler (*e.g.*, *gcc*) must translate the source code into machine language and create an executable program file. This file can then be executed from the command line (also called the shell, the CLI, or the terminal).

## Edit-Compile-Run Cycle

Writing a program in C requires several steps, as in other compiled languages but unlike interpreted languages such as Python and R. The steps are:

1.  Write the source code for the program in an editor. On Linux, a simple editor that is almost always available is *vi* or its improved version *vim*. On Windows and macOS, one can also use a development environment such as *Eclipse*.

2.  Compile the source code into object code (machine code that is not yet a complete program) using a compiler such as *gcc* or *clang*. Compilers such as *g++* are C++ compilers; they accept most C code because C++ is nearly, but not strictly, a superset of C, so C programs are best compiled with a C compiler.

3.  After compilation, the code must be transformed into an executable using a linker. The linker is generally invoked by the compiler after compilation is complete.

4.  Once the code is linked with all external libraries, the program is now executable and can be run from the terminal shell.

## Editing with *vi* and *vim*

For more information on how to edit text files with *vi* and *vim*, consult:

-   [Classic SysAdmin: Vim 101: A Beginner's Guide to Vim](https://linuxfoundation.org/blog/classic-sysadmin-vim-101-a-beginners-guide-to-vim)
-   [VI Editor](https://www.guru99.com/the-vi-editor.html)

## Example

Use an editor such as *vim* to type in the program below and save it under the name "hello.c".

```{Rcpp eval=F}
#include <stdio.h>

int main(int argc, char* argv[]) {
  printf("Hello World\n");
  return 0;
}
```

In Linux, compile the program from the shell using the command:

```         
% gcc hello.c -o hello
```

In the shell command above, the *%* is used to indicate the command prompt; it may differ on your system. Do not type it in.

The option *-o hello* specifies the name of the executable program. If it is not specified, it will default to *a.out*.

To run the program, type:

```         
% ./hello
```

The *./* is required because the current directory is generally not in the search path (*PATH*) that the shell uses to find programs on Linux.

## Program Structure

A C program is a collection of functions located in one or more source files. Source files are text files that, by convention, have the *.c* extension. They are compiled separately and then linked together, along with libraries that contain external functions, such as the C standard library or libraries provided by third parties. The names of the functions must be unique and conform to the rules for naming identifiers (start with a letter or underscore, followed by any number of letters, digits, and underscores).

Every C program must have a function called *main*. Execution of the program starts with the first statement in *main*, which is called when the operating system creates the process in which the program runs. As programs are generally run from the command line (shell or terminal), they can take arguments as input, which is why *main* has two parameters: the number of arguments passed on the command line and an array of strings holding the arguments.

Let's take a look at each line in the program below.

```{Rcpp eval=F, attr.source='.numberLines'}
#include <stdio.h>

int main(int argc, char* argv[]) {
  printf("Hello World\n");
  return 0;
}
```

**Line 1**: Includes the contents of another file, called an include or header file, which is then compiled as part of this program. Header files have a *.h* extension (by convention), contain function declarations (prototypes), and are often used for external libraries. In this program, the *stdio.h* header file provides the declarations of the standard input/output functions, including *printf*.

**Line 3**: The definition of the *main* function where execution starts. The body of the function is between *{* and *}*.

**Line 4**: Call to the function *printf* which outputs the text to the console. The *\\n* is a newline character to advance the cursor to the next line on the console device (normally a terminal screen but can also be a file).

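**Line 5**: Explicit return from the function with the value *0*. The value is returned to the shell as the program's exit status, where by convention *0* means success, so that the program can be used within shell scripts. If *main* reaches its closing brace without a *return* statement, it returns 0 (since C99). For any other function with a non-*void* return type, a missing *return* leaves the return value undefined, so the *return* statement should only be omitted in functions with a return type of *void*.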

Note that every statement must be terminated by a semi-colon (;). Statements may span several lines.

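> Never put a semi-colon between the closing parenthesis of an *if*, *for*, or *while* and the start of its block ({): the semi-colon becomes an empty statement that ends the construct, and the block that follows is no longer controlled by it. A semi-colon after the closing } of a function or block is not needed either, although declarations such as *struct* definitions do end with one. This is the source of many sleepless nights for C/C++ (and even Java) programmers.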

## Variables

A variable is a named region of memory that holds a value. Each variable has a specific type which determines what values can be stored in it. All variables must be declared with a type and an identifier, and can optionally be given an initial value in the declaration.

```{Rcpp eval=F}
int counter;
char ioByte = 0;
```

In original C (C89/C90), all variables must be declared at the beginning of a block before any statements. Since the C99 standard, as in C++, declarations can appear anywhere in a block, which is how most modern C programs are written.

### Types

C has several data types and additional user data types can be defined through *struct* definitions.

The table below summarizes commonly used types in C programming. The column for "Format Specifier" is useful for printing or displaying variables using `printf()`.

|          Type          | Size (bytes)                  |  Format Specifier  |
|:----------------------:|:------------------------------|:------------------:|
|          int           | at least 2, usually 4         |       %d, %i       |
|          char          | 1                             |         %c         |
|         float          | usually 4                     |         %f         |
|         double         | usually 8                     | %f (%lf in scanf)  |
|       short int        | at least 2, usually 2         |        %hd         |
|      unsigned int      | at least 2, usually 4         |         %u         |
|        long int        | at least 4, usually 4 or 8    |      %ld, %li      |
|     long long int      | at least 8                    |     %lld, %lli     |
|   unsigned long int    | at least 4                    |        %lu         |
| unsigned long long int | at least 8                    |        %llu        |
|      signed char       | 1                             |      %c, %hhd      |
|     unsigned char      | 1                             |      %c, %hhu      |
|      long double       | usually 8, 12, or 16          |        %Lf         |

You can get the exact number of bytes used to store a variable's value using the `sizeof` operator. Its result has the type `size_t`, which is printed with the `%zu` format specifier.

```{Rcpp eval=F}
int i;

printf("size in bytes of i = %zu\n", sizeof(i));
```

There are modifiers for variables including *extern*, *register*, *static*, and *volatile*. They tell the compiler where a variable is defined, how it is stored, and which optimizations it may (or may not) apply. These are less commonly used unless one works with systems code. Here is what they mean:

-   **extern** -- the variable is defined in a library or some other source file but is used in this source file; variables can only be defined once but must be declared wherever they are used so the compiler can perform type checking; does not result in the allocation of memory, just the declaration of the identifier and its type

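-   **register** -- a hint that the variable should be kept in a CPU register for fast processing; modern compilers make this decision on their own and largely ignore the hint

-   **static** -- the variable is allocated in static memory rather than on the stack and exists for the entire run of the program; a *static* local variable keeps its value between function calls, and a *static* global variable is visible only within its own source file (a value that does not change is declared with *const* instead)

-   **volatile** -- the variable can be modified outside of the program code (*e.g.*, through an interrupt handler), so the compiler should always reload the value before using it and must not cache the value or optimize out the variable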

In addition, there are derived types, such as pointers, arrays, and structures. A Boolean type, *bool*, has been available since C99 through the *stdbool.h* header.

### Default Values

Local variables in C do not have a default value, so they must be initialized prior to use. Do not assume that the default value is 0 as is common in other programming languages; reading an uninitialized local variable yields an unpredictable value. To be more exact, local (automatic) variables are uninitialized, while global and *static* variables are initialized to 0 (or to a null pointer for pointers). However, it is a best practice not to rely on language defaults or compiler-specific initialization and to always initialize variables when they are defined.

### Memory Allocation and Overflows

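C is statically typed, but it converts silently between arithmetic types, so it is the responsibility of the programmer to ensure that a variable's type is large enough to hold the value assigned to it. For example, assigning a 32-bit (4 byte) integer to a variable of type *char* (an 8-bit or 1 byte character) can overflow the *char*. The assignment is still done, but the value is converted to fit: typically only the low-order byte is kept, so the stored value may differ from the original without any error being reported. Neighboring variables are not overwritten by such an assignment; memory is overwritten when a program writes past the end of an array or through an invalid pointer (a buffer overflow), which C does not check either. Both are a common source of headaches for programmers. Modern compilers can point out many such issues when warnings are enabled (for example, with the *gcc* options `-Wall`, `-Wextra`, and `-Wconversion`).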

### Identifiers

An identifier is the name given to entities such as variables, functions, structures, *etc.* Identifiers must be unique within a program block (their *scope*). Identifiers are used to refer to objects in memory and to functions.

For example, in the code fragment below, *amount* and *accountBalance* are identifiers:

```{Rcpp eval=F}
int amount;
double accountBalance;
```

Identifier names must be different from keywords. For example, you cannot use *int* as an identifier because *int* is a keyword. A list of all C keywords is below. Note that C++ has additional keywords, which should also be avoided if the code might be compiled with a C++ compiler. Most editors are syntax-aware and will display keywords in a different color.

#### Identifier Naming Rules

A valid identifier can contain letters (both uppercase and lowercase), digits, and underscores (\_). The first character of an identifier must be either a letter or an underscore. You cannot use keywords like *int*, *for*, *etc.* as identifiers. There is no rule on how long an identifier can be, but the C standard only guarantees that compilers distinguish the first 63 characters of an internal identifier and the first 31 characters of an external one (such as a global function name), so very long names can cause problems with some compilers. You can choose any name as an identifier as long as you follow the above rules. However, give variables and functions meaningful names that represent what they contain or do; it makes programs more readable and easier to understand and debug.

### Keywords

Identifiers cannot be reserved keywords. The list below shows the keywords of C. C++ compilers reserve additional keywords, such as *class* or *virtual*.

![Keywords in C](images/keywords-in-c.jpg){width="75%"}

## Command Line Arguments

C (and C++) programs can be run from the command line (terminal or shell) and can take arguments. The `main` function takes two parameters, `argc` and `argv`, that are used to retrieve the command line arguments.

The example below prints out the values of the arguments, although that is not very practical.

Note that the first argument (`argv[0]`) is the name of the executable program.

```{Rcpp eval=F}
#include <stdio.h>

int main(int argc, char* argv[])
{
  for (int i = 0; i < argc; ++i) {
    printf("%i | %s\n", i, argv[i]);
  }

  return 0;
}
```

### Tutorial

Watch and follow along with the narrated code walkthrough below.

<iframe src="https://player.vimeo.com/video/747621011?h=445f200048" width="320" height="180" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen data-external="1">

</iframe>

## Files & Resources

```{r zipFiles, echo=FALSE}
zipName = sprintf("LessonFiles-%s-%s.zip", 
                 params$category,
                 params$number)

textALink = paste0("All Files for Lesson ", 
               params$category,".",params$number)

# downloadFilesLink() is included from _insert2DB.R
knitr::raw_html(downloadFilesLink(".", zipName, textALink))
```

------------------------------------------------------------------------

## References

Kernighan, B. W., & Ritchie, D. M. (1988). *The C programming language*. Pearson Education

[Ritchie, D. M. (1993). The development of the C language. ACM Sigplan Notices, 28(3), 201-208.](https://www.bell-labs.com/usr/dmr/www/chist.pdf)

## Errata

[Let us know](https://form.jotform.com/212187072784157){target="_blank"}.
