Skip to main content Link Search Menu Expand Document (external link)

Theme: auto

University Simple C Language Reference

Motivation

There are many different languages that have been created for academic settings, and even languages that have been created for an undergraduate compilers curriculum. The most popular compiler-focused language is arguably Classroom Object Oriented Language (COOL), which was developed by Dr. Alex Aiken at Stanford. However, there were three main reasons we developed something different:

  1. We wanted a language that was a valid subset of an actual programming language, and in particular, the C programming language. This makes it interesting because students will have the opportunity to work with a language they are familiar with. It also means that output can be compared against production-quality compilers.
  2. We wanted a language that was amenable enough to recursive descent parsing such that it wouldn’t be unreasonable to ask an undergraduate student to write a portion of the parser within three weeks. The parser uses recursive descent because the most popular C/C++ compilers (including both GCC and Clang) utilize recursive descent for almost all their parsing. This allows for very specific error messages, which are usually not present in academic compilers. Of course, a parser for the language described herein could also (rather trivially) be created using an LALR parser-generator such as Bison.
  3. We wanted a language which did not have too many high-level constructs, as that would make it more manageable to construct the LLVM IR. If the language were object-oriented, it would require more effort to generate the appropriate runtime data structures within LLVM IR.

While University Simple C (abbreviated as USC) is a fairly narrow subset of Standard C, it ultimately is possible to create non-trivial programs within these restrictions. For example, you can implement an in-place quicksort.

Types and Variables

One of the most significant simplifications in USC is how it handles types and variables. The only variable types supported are char and int. It is also possible to create statically-allocated arrays of characters and integers. Type specifiers such as static or volatile are not supported. Pointers are not explicitly supported, but they are implicitly supported in that arrays are passed to functions by address. No user-defined types (typedef, struct, or union) are supported.

With only char and int base types, type conversions are fairly straightforward. All expressions are aggressively converted to 32-bit integer operations, and if they ultimately must be stored in a char, the value is truncated. This is fairly consistent with what occurs in C, though it technically is not standard-compliant. There are no conversions allowed between arrays and base types.

Variable declarations follow the C89 restriction in that they must appear at the beginning of a lexical scope. Scoping follows standard C89 rules. Each function has its own lexical scope, and any compound statements within that function are given child scopes. This means standard variable ghosting/redeclaration rules apply. One restriction is that there are no global variables.

To simplify the syntax of the declarations, USC does not allow multiple declarations in one statement. Each variable declaration must appear in a separate statement. For array declarations, they must either have a fixed size specified or have an assignment at declaration. The latter can only be done with strings, since initializer lists are not supported.

Functions

All functions must be declared and implemented at the same time in source – there is no support for forward declaration. This also means that it is not possible to write code where there is a circular dependency between two functions.

Functions must return either void, int, or char. As for function parameters, int, char, and arrays are supported. As in Standard C, arrays are passed by address while the base types are passed by value. User-defined functions do not support variable arguments.

There must be an entry function named main that returns an int. No parameters are allowed for the main function.

All functions that have a non-void return type must conclude their body with a return statement. This is because parsing and semantic analysis is completed in a single pass, so performing reachability analysis on return statements is not possible.

Operators and Expressions

There is a reasonable complement of mathematical and logical operators available in USC, so most of the commonly-used binary operators are available. Array subscript operators are also supported. Not very many unary operators are supported, however, other than logical not, pre-increment, pre-decrement, and “address of” in the specific case of taking the address of a subscript element of an array.

There are some restrictions on the operators – for example, pre-increment/decrement must be performed directly on an identifier, so code such as --(x) is not allowed. Most other operators can be performed on expressions.

For logical operators, Standard C rules are followed where any non-zero value is considered “true.” However, one restriction is that any such operators on a string/array are not allowed. Logical || and && are short-circuited as they are in Standard C.

One important note is that assignment in USC is not treated as an expression. This greatly simplifies expression evaluation, since expressions such as (x = 5) * 5 are not possible.

Comments

Single line // comments are supported. Multiline comments are not supported.

Statements

There are only a few types of statements supported in USC, but they still allow for fairly complex logic. Compound statements (multiple statements surrounded by braces) are allowed, as are assignment statements, return statements, expression statements, and null statements.

In terms of control flow, if (with or without else) is supported, but the only type of loop that’s supported is while. This isn’t really a problem since all for and do/while loops can easily be changed to while loops. Jumps in flow such as break, continue, or goto are also not allowed. This makes all control flow in USC reducible. Finally, there is no support for switch statements, but these can be emulated with a series of chained else ifs.

Memory

As there are no explicit pointers types, there is no dynamic memory allocation support in USC. This means that all variables are local to the function they are declared in. Of course, arrays are implicitly passed by address. This means that it is possible to create arrays with a single element to pass it by address.

Preprocessor/Multiple Source Files

There is no preprocessor whatsoever, and all code must be contained in a single source file.

Standard Library

In order to allow for basic output, USC allows for calls to the printf function from the Standard C Library. This is the only Standard Library function that is currently supported. The compiler will automatically add a declaration of printf to the IR if it detects use of the function, as there would otherwise be no way to forward declare the function (since USC does not support variable arguments). No validation of the parameters to printf is performed.

Grammar

This section contains the full context-free grammar for USC. Note that the grammar as written is most definitely not LL(1) or LR(1). There is left recursion, which would be problematic for LL(1), and there are common left prefixes, which would cause shift-reduce conflicts in strict LR(1). Of course, the grammar can be transformed to be more amenable to LL(1) or LR(1).

Terminals are shown in \(\texttt{monospace}\). The following terminals are defined by these regular expressions (following Flex syntax):

\[\begin{align} \texttt{id} = & \texttt{[a-zA-Z}\_\texttt{][a-zA-Z0-9}\_\texttt{]*} \\ \texttt{constant} = & \texttt{("\\'"("\\\\t"|"\\\\n"|.)"\\'")|("-"?(0|([1-9][0-9]*)))} \\ \texttt{string} = & \texttt{\\"([} \verb|^| \texttt{\\\\\\"]|\\\\n|\\\\t)*\\"} \end{align}\]

All other terminals are the exact text as written.

\[\begin{align} Program \rightarrow & FunctionList \\ FunctionList \rightarrow & FunctionList \hspace{0.25cm} Function \\ | & Function \\ Function \rightarrow & ReturnType \hspace{0.15cm} \texttt{id} \hspace{0.15cm} \texttt{(} \hspace{0.15cm} ArgumentDecl\hspace{0.15cm}\texttt{)}\hspace{0.15cm}CompoundStmt \\ ReturnType \rightarrow & \texttt{void} \\ | & \texttt{int} \\ | & \texttt{char} \\ ArgumentDecl \rightarrow & ArgDeclList \\ | & \epsilon \\ ArgDeclList \rightarrow & ArgDeclList\hspace{0.15cm}\texttt{,}\hspace{0.15cm}ArgDecl \\ | & ArgDecl \\ ArgDecl \rightarrow & VarType\hspace{0.15cm}\texttt{id}\hspace{0.15cm}ArgDeclArray \\ ArgDeclArray \rightarrow & \texttt{[}\hspace{0.15cm}\texttt{]} \\ | & \epsilon \\ VarType \rightarrow & \texttt{int} \\ | & \texttt{char} \\ CompoundStmt \rightarrow & \texttt{\{}\hspace{0.15cm}Declarations\hspace{0.25cm} Statements\hspace{0.15cm}\texttt{\}} \\ Declarations \rightarrow & DeclList \\ | & \epsilon \\ DeclList \rightarrow & DecList \hspace{0.25cm} Decl \\ | & Decl \\ Decl \rightarrow & VarType\hspace{0.15cm}\texttt{id}\hspace{0.15cm}DeclArray\hspace{0.25cm}DeclAssign\hspace{0.15cm}\texttt{;} \\ DeclArray \rightarrow & \texttt{[}\hspace{0.15cm}\texttt{constant}\hspace{0.15cm}\texttt{]} \\ | & \texttt{[}\hspace{0.15cm}\texttt{]} \\ | & \epsilon \\ DeclAssign \rightarrow & \texttt{=}\hspace{0.15cm}Expr \\ | & \epsilon \\ Statements \rightarrow & StmtList \\ | & \epsilon \\ StmtlList \rightarrow & StmtList\hspace{0.25cm}Stmt \\ | & Stmt \\ Stmt \rightarrow & CompoundStmt \\ | & AssignStmt \\ | & IfStmt \\ | & WhileStmt \\ | & ReturnStmt \\ | & Expr\hspace{0.15cm}\texttt{;} \\ | & \texttt{;} \\ AssignStmt \rightarrow & \texttt{id}\hspace{0.15cm}\texttt{=}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{;} \\ | & \texttt{id}\hspace{0.15cm}\texttt{[}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{]}\hspace{0.15cm}\texttt{=}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{;}\\ IfStmt \rightarrow & \texttt{if}\hspace{0.15cm}\texttt{(}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{)}\hspace{0.15cm}Stmt \\ | & \texttt{if}\hspace{0.15cm}\texttt{(}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{)}\hspace{0.15cm}Stmt\hspace{0.15cm}\texttt{else}\hspace{0.15cm}Stmt \\ WhileStmt \rightarrow & \texttt{while}\hspace{0.15cm}\texttt{(}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{)}\hspace{0.15cm}Stmt \\ ReturnStmt \rightarrow & \texttt{return}\hspace{0.15cm}\texttt{;} \\ | & \texttt{return}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{;} \\ Expr \rightarrow & Expr\hspace{0.15cm}\texttt{||}\hspace{0.15cm}AndTerm \\ | & AndTerm \\ AndTerm \rightarrow & AndTerm\hspace{0.15cm}\texttt{&&}\hspace{0.15cm}RelExpr \\ | & RelExpr \\ RelExpr \rightarrow & RelExpr\hspace{0.15cm}\texttt{==}\hspace{0.15cm}NumExpr \\ | & RelExpr\hspace{0.15cm}\texttt{!=}\hspace{0.15cm}NumExpr \\ | & RelExpr\hspace{0.15cm}\texttt{<}\hspace{0.15cm}NumExpr \\ | & RelExpr\hspace{0.15cm}\texttt{>}\hspace{0.15cm}NumExpr \\ | & NumExpr \\ NumExpr \rightarrow & NumExpr\hspace{0.15cm}\texttt{+}\hspace{0.15cm}Term \\ | & NumExpr\hspace{0.15cm}\texttt{-}\hspace{0.15cm}Term \\ | & Term \\ Term \rightarrow & Term\hspace{0.15cm}\texttt{*}\hspace{0.15cm}Value \\ | & Term\hspace{0.15cm}\texttt{/}\hspace{0.15cm}Value \\ | & Term\hspace{0.15cm}\%\hspace{0.15cm}Value \\ | & Value \\ Value \rightarrow & \texttt{!}\hspace{0.15cm}Factor \\ | & Factor \\ Factor \rightarrow & \texttt{(}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{)} \\ | & \texttt{constant} \\ | & \texttt{string} \\ | & \texttt{id} \\ | & \texttt{id}\hspace{0.15cm}\texttt{[}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{]} \\ | & \texttt{id}\hspace{0.15cm}\texttt{(}\hspace{0.15cm}FuncCallArgs\hspace{0.15cm}\texttt{)} \\ | & \texttt{++}\hspace{0.15cm}\texttt{id} \\ | & \texttt{--}\hspace{0.15cm}\texttt{id} \\ | & \texttt{&}\hspace{0.15cm}\texttt{id}\hspace{0.15cm}\texttt{[}\hspace{0.15cm}Expr\hspace{0.15cm}\texttt{]} \\ FuncCallArgs \rightarrow & ArgList \\ | & \epsilon \\ ArgList \rightarrow & Arglist\hspace{0.15cm}\texttt{,}\hspace{0.15cm}Expr \\ | & Expr \end{align}\]

Code Listing: Quicksort in USC

Although USC is a relatively narrow subset of C, it is still possible to write non-trivial programs, such as the following quicksort implementation:

// quicksort.usc
// Implements in-place quicksort algorithm
// (I just went by the wikipedia pseudocode)
// Expected result:
// abcdeeefghhijklmnoooopqrrsttuuvwxyz
//---------------------------------------------------------
// Copyright (c) 2014-2022, Sanjay Madhav
// All rights reserved.
//
// This file is distributed under the BSD license.
// See LICENSE.TXT for details.
//---------------------------------------------------------

int partition(char array[], int left, int right, int pivotIdx)
{
	char pivotVal = array[pivotIdx];
	int storeIdx = left;
	int i = left;
	char temp;
	
	// Move pivot to end
	temp = array[pivotIdx];
	array[pivotIdx] = array[right];
	array[right] = temp;
	
	while (i < right)
	{
		if (array[i] < pivotVal)
		{
			// Swap array[i] and array[storeIdx]
			temp = array[i];
			array[i] = array[storeIdx];
			array[storeIdx] = temp;
			++storeIdx;
		}
		
		++i;
	}
	
	// Swap array[storeIdx] and array[right]
	temp = array[storeIdx];
	array[storeIdx] = array[right];
	array[right] = temp;
	
	return storeIdx;
}

void quicksort(char array[], int left, int right)
{
	int pivotIdx;
	
	if (left < right)
	{
		// Pick the middle point
		pivotIdx = left + (right - left) / 2;
		
		pivotIdx = partition(array, left, right, pivotIdx);
		quicksort(array, left, pivotIdx - 1);
		quicksort(array, pivotIdx + 1, right);
	}
}

int main()
{
	char letters[] = "thequickbrownfoxjumpsoverthelazydog";
	quicksort(letters, 0, 34);
	
	printf("%s\n", letters);
	
	return 0;
}