Parser's Algorithm
gencpp uses a hand-written recursive descent parser. Both the lexer and parser currently handle a full C/C++ file in a single pass.
Notable implementation background
Lexer
The lex procedure performs the lexical pass over content provided as a StrC type.
The tokens are stored (for now) in gen::parser::Tokens.
Fields:
Array<Token> Arr;
s32 Idx;
The supported token types are listed in ETokType.csv. They can also be found in ETokType.h, which is the enum generated from the csv file.
Tokens are defined with the struct gen::parser::Token:
Fields:
char const* Text;
sptr Length;
TokType Type;
s32 Line;
s32 Column;
u32 Flags;
Flags is a bitfield made up of TokFlags (Token Flags):
- TF_Operator: Any operator token used in expressions
- TF_Assign
  - Using statement assignment
  - Parameter argument default value assignment
  - Variable declaration initialization assignment
- TF_Preprocess: Related to a preprocessing directive
- TF_Preprocess_Cond: A preprocess conditional
- TF_Attribute: An attribute token
- TF_AccessSpecifier: An accessor operation token
- TF_Specifier: One of the specifier tokens
- TF_EndDefinition: Can be interpreted as an end definition for a scope.
- TF_Formatting: Considered a part of the formatting
- TF_Literal: Anything considered a literal by C++.
I plan to replace IsAssign with a general flags field and properly keep track of all operator types instead of abstracting it away to ETokType::Operator.
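A minimal sketch of how a bitfield like Flags can be queried. The bit positions, the SketchToken struct, and the has_flags helper are assumptions made for illustration; the actual TokFlags values and helpers in gencpp may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bit assignments; only the flag names come from the list above.
enum TokFlags : std::uint32_t
{
    TF_Operator        = 1u << 0,
    TF_Assign          = 1u << 1,
    TF_Preprocess      = 1u << 2,
    TF_Preprocess_Cond = 1u << 3,
    TF_Attribute       = 1u << 4,
    TF_AccessSpecifier = 1u << 5,
    TF_Specifier       = 1u << 6,
    TF_EndDefinition   = 1u << 7,
    TF_Formatting      = 1u << 8,
    TF_Literal         = 1u << 9,
};

// Reduced stand-in for gen::parser::Token with only the fields used here.
struct SketchToken
{
    char const*    Text;
    std::ptrdiff_t Length;
    std::uint32_t  Flags;
};

// True when every bit in `mask` is set on the token.
inline bool has_flags( SketchToken const& tok, std::uint32_t mask )
{
    return ( tok.Flags & mask ) == mask;
}
```

Because the flags combine with bitwise OR, a single token can carry several classifications at once. An `=` token, for example, could be tagged as both TF_Operator and TF_Assign and tested for either one.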
Traversing the tokens is done with the following interface macros:
Macro | Description |
---|---|
currtok_noskip | Get the current token without skipping whitespace |
currtok | Get the current token, skip any whitespace tokens |
prevtok | Get the previous token (does not skip whitespace) |
nexttok | Get the next token (does not skip whitespace) |
eat( Token Type ) | Check to see if the current token is of the given type, if so, advance Token's index to the next token |
left | Get the number of tokens left in the token array |
check_noskip | Check to see if the current token is of the given type, without skipping whitespace |
check | Check to see if the current token is of the given type, skip any whitespace tokens |
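The traversal interface above can be approximated with plain member functions. This is a sketch under stated assumptions: SketchTokens, SketchTok, and SketchTokType are hypothetical stand-ins, the real interface is implemented as macros over the token array, and whether eat skips whitespace before matching is assumed here rather than taken from the source.

```cpp
#include <vector>

// Hypothetical token types; the real set lives in ETokType.
enum class SketchTokType { Identifier, Whitespace, Semicolon };

struct SketchTok { SketchTokType Type; };

struct SketchTokens
{
    std::vector<SketchTok> Arr;
    int                    Idx = 0;

    // left: number of tokens not yet consumed.
    int left() const { return static_cast<int>( Arr.size() ) - Idx; }

    // currtok_noskip: current token as-is, or nullptr at the end.
    SketchTok const* current_noskip() const
    {
        return left() > 0 ? &Arr[Idx] : nullptr;
    }

    // currtok: advance past whitespace tokens, then return the current token.
    SketchTok const* current()
    {
        while ( left() > 0 && Arr[Idx].Type == SketchTokType::Whitespace )
            ++Idx;
        return current_noskip();
    }

    // check: compare the type of the current non-whitespace token.
    bool check( SketchTokType type )
    {
        SketchTok const* tok = current();
        return tok != nullptr && tok->Type == type;
    }

    // eat: advance only when the current token has the expected type.
    // Assumption: whitespace is skipped before matching.
    bool eat( SketchTokType type )
    {
        if ( ! check( type ) )
            return false;
        ++Idx;
        return true;
    }
};
```

Returning false from eat on a mismatch lets a caller report the error with context; the actual parser may handle a failed eat differently.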
Parser
The parser has a limited user interface: only specific types of definitions or statements are expected to be provided directly by the user when constructing an AST dynamically (see SOA for an example). It does, however, attempt to provide the capability to parse full C/C++ files from production codebases.
Each public user interface procedure has the following format:
<code type> parse_<definition type>( StrC def )
{
    check_parse_args( def );
    using namespace Parser;

    TokArray toks = lex( def );
    if ( toks.Arr == nullptr )
        return CodeInvalid;

    // Parse the tokens and return a constructed AST using internal procedures
    ...
}
The top-level parsing procedure used for C/C++ file parsing is parse_global_body.
It uses a helper procedure called parse_global_nspace.
Each internal procedure will have the following format:
internal
<code type> parse_<definition_type>( <empty or contextual params> )
{
    push_scope();

    ...

    <code type> result = (<code type>) make_code();
    ...

    Context.pop();
    return result;
}
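The push_scope / Context.pop pairing can be illustrated with a small scope stack. This sketch assumes the stack records the name of each active parsing procedure so that an error can report where the parser was. SketchContext and the sketch_parse_* procedures are hypothetical, not the gencpp definitions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the parser context: a stack of procedure names.
struct SketchContext
{
    std::vector<std::string> Scopes;

    void push( char const* proc_name ) { Scopes.push_back( proc_name ); }
    void pop() { Scopes.pop_back(); }

    // Render the active scopes outermost first, e.g. "outer > inner".
    std::string to_string() const
    {
        std::string out;
        for ( std::size_t i = 0; i < Scopes.size(); ++i )
        {
            if ( i > 0 )
                out += " > ";
            out += Scopes[i];
        }
        return out;
    }
};

static SketchContext Context;

// Innermost procedure: captures the scope trace while it is active.
std::string sketch_parse_inner()
{
    Context.push( "sketch_parse_inner" );
    std::string where = Context.to_string();
    Context.pop();
    return where;
}

// Outer procedure: every push is matched by a pop before returning.
std::string sketch_parse_outer()
{
    Context.push( "sketch_parse_outer" );
    std::string result = sketch_parse_inner();
    Context.pop();
    return result;
}
```

Each procedure pops exactly what it pushed, so the stack is empty again once the outermost call returns.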
Below is an outline of the general algorithm used for these internal procedures. The intention is to provide a basic briefing to aid the user in traversing the actual code definitions. These appear in the same order as they are in the parser.cpp file.
parse_array_decl
- Check if it's an array declaration with no expression.
  - Consume and return empty array declaration
- Opening square bracket
- Consume expression
- Closing square bracket
- If adjacent opening bracket
  - Repeat array declaration parse until no brackets remain
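The steps above can be sketched over a plain string instead of a token stream. SketchArrayDecl and sketch_parse_array_decl are hypothetical names. The sketch treats everything up to the next `]` as the expression, so nested brackets inside an expression are not handled, and it expects `pos` to be no greater than the length of `text`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct SketchArrayDecl
{
    bool                     valid = false;
    std::vector<std::string> dims;  // one entry per bracket pair, "" for `[]`
};

// Parses bracket pairs starting at text[pos] and advances pos past them.
SketchArrayDecl sketch_parse_array_decl( std::string const& text, std::size_t& pos )
{
    SketchArrayDecl result;

    // Array declaration with no expression: consume and return it.
    if ( text.compare( pos, 2, "[]" ) == 0 )
    {
        pos += 2;
        result.valid = true;
        result.dims.push_back( "" );
        return result;
    }

    // Repeat while an opening bracket is adjacent.
    while ( pos < text.size() && text[pos] == '[' )
    {
        ++pos;  // opening square bracket

        std::size_t close = text.find( ']', pos );
        if ( close == std::string::npos )
        {
            result.valid = false;  // missing closing bracket
            return result;
        }

        result.dims.push_back( text.substr( pos, close - pos ) );  // expression
        pos          = close + 1;                                  // closing square bracket
        result.valid = true;
    }
    return result;
}
```

For the input `arr[4][N + 1];` with `pos` starting at the first bracket, the sketch collects the two expressions `4` and `N + 1` and leaves `pos` on the semicolon.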
parse_attributes
- Check for standard attribute
- Check for GNU attribute
- Check for MSVC attribute
- Check for a token registered as an attribute
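The four checks can be sketched as a classifier over the leading text of a declaration. The spellings `[[`, `__attribute__`, and `__declspec` are the standard, GNU, and MSVC attribute introducers. SketchAttrKind, sketch_classify_attribute, and the registered-name set are hypothetical and only illustrate the order of the checks.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

enum class SketchAttrKind { None, Standard, GNU, MSVC, Registered };

static bool sketch_starts_with( std::string const& text, std::string const& prefix )
{
    return text.compare( 0, prefix.size(), prefix ) == 0;
}

// Classifies the attribute (if any) that `text` begins with.
SketchAttrKind sketch_classify_attribute( std::string const&           text,
                                          std::set<std::string> const& registered )
{
    if ( sketch_starts_with( text, "[[" ) )
        return SketchAttrKind::Standard;  // [[nodiscard]], [[deprecated]], ...

    if ( sketch_starts_with( text, "__attribute__" ) )
        return SketchAttrKind::GNU;       // __attribute__((...))

    if ( sketch_starts_with( text, "__declspec" ) )
        return SketchAttrKind::MSVC;      // __declspec(...)

    // Otherwise take the leading identifier and look it up among the
    // names the user registered as attributes (assumed mechanism).
    std::size_t end = 0;
    while ( end < text.size()
            && ( std::isalnum( static_cast<unsigned char>( text[end] ) ) || text[end] == '_' ) )
        ++end;

    if ( registered.count( text.substr( 0, end ) ) > 0 )
        return SketchAttrKind::Registered;

    return SketchAttrKind::None;
}
```

A project-specific export macro is a typical candidate for the registered case: once its name is in the set, it is classified like any compiler attribute.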
parse_class_struct
parse_class_struct_body
parse_comment
parse_complicated_definition
parse_define
parse_forward_or_definition
parse_function_after_name
parse_function_body
parse_global_nspace
- Make sure the type provided to the helper function is a Namespace_Body, Global_Body, Export_Body, or Extern_Linkage_body.
- If it's not a Global_Body, eat the opening brace for the scope.
parse_identifier
parse_include
parse_operator_after_ret_type
parse_operator_function_or_variable
parse_pragma
parse_params
parse_preprocess_cond
parse_simple_preprocess
parse_static_assert
parse_template_args
parse_variable_after_name
parse_variable_declaration_list
parse_class
parse_constructor
parse_destructor
parse_enum
parse_export_body
parse_extern_link_body
parse_extern_link
parse_friend
parse_function
parse_namespace
parse_operator
parse_operator_cast
parse_struct
parse_template
parse_type
parse_typedef
- Check for export module specifier
- typedef keyword
- If it's a preprocess macro: Get the macro name
parse_union
- Check for export module specifier
- union keyword
- parse_attributes
- Check for identifier
- Parse the body (Possible options):
  - Newline
  - Comment
  - Decl_Class
  - Decl_Enum
  - Decl_Struct
  - Decl_Union
  - Preprocess_Define
  - Preprocess_Conditional
  - Preprocess_Macro
  - Preprocess_Pragma
  - Unsupported preprocess directive
  - Variable
- If it's not an inplace definition: End Statement
parse_using
- Check for export module specifier
- using keyword
- Check to see if it's a using namespace
- Get the identifier
- If it's a regular using declaration:
  - parse_attributes
  - parse_type
  - parse_array_decl
- End statement
- Check for inline comment
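A rough sketch of the parse_using outline, operating on pre-split token strings rather than the real token array. SketchUsing and sketch_parse_using are hypothetical. The sketch assumes the regular form is the alias syntax `using Name = Type;`, and it leaves out attributes, array declarations, and the inline comment check.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct SketchUsing
{
    bool        valid        = false;
    bool        is_export    = false;
    bool        is_namespace = false;
    std::string name;
    std::string type;  // only filled for `using Name = Type;`
};

SketchUsing sketch_parse_using( std::vector<std::string> const& toks )
{
    SketchUsing result;
    std::size_t idx = 0;

    // True when the current token exists and matches `text`.
    auto at = [&]( char const* text ) { return idx < toks.size() && toks[idx] == text; };

    // Check for export module specifier.
    if ( at( "export" ) )
    {
        result.is_export = true;
        ++idx;
    }

    // using keyword.
    if ( ! at( "using" ) )
        return result;
    ++idx;

    // Check to see if it's a using namespace.
    if ( at( "namespace" ) )
    {
        result.is_namespace = true;
        ++idx;
    }

    // Get the identifier.
    if ( idx >= toks.size() )
        return result;
    result.name = toks[idx++];

    // Regular using declaration: collect the type tokens after `=`.
    if ( ! result.is_namespace && at( "=" ) )
    {
        ++idx;
        while ( idx < toks.size() && toks[idx] != ";" )
        {
            if ( ! result.type.empty() )
                result.type += ' ';
            result.type += toks[idx++];
        }
    }

    // End statement.
    if ( ! at( ";" ) )
        return result;

    result.valid = true;
    return result;
}
```

The early returns leave `valid` false, so a caller can tell a malformed statement from a successfully parsed one without a separate error channel.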
parse_variable
- Check for export module specifier
- parse_attributes
- parse specifiers
- parse_type
- parse_identifier
- parse_variable_after_name