Proposed C++ Manifold Operator

Proposed C++ Manifold Operator

The following details the design, implementation, and WG21 discussion on proposed manifold operators for the C++ programming language.

It is suggested that four new operators be introduced into ISO/IEC 14882:2020(E) Standard – Programming Language C++.

manifold operator syntax

The stated operators aim to simplify common code patterns, improve clarity when interpreting multifaceted conditionals, and reduce unnecessary evaluation when compared to alternative solutions; see below.

The chosen syntax was selected to align with similar bitwise operations XOR, AND, OR, and NOT. Surrounded by the subscript operator tokens to signify a sequence of values. It is hoped this would lead to a more intuitive interpretation of the operators given a basic understanding of existing programming languages.


?? all-of-operator syntax was altered during testing to avoid lambda conflicts.


The following example was liberated from LLVM's SmartPtrModeling.cpp:460.

Before the Proposal

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (!(OOK == OO_EqualEqual || OOK == OO_ExclaimEqual ||
      OOK == OO_Less || OOK == OO_LessEqual ||
      OOK == OO_Greater || OOK == OO_GreaterEqual ||
      OOK == OO_Spaceship))
    return false;        

After the Proposal

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (OOK [!] OO_EqualEqual, OO_ExclaimEqual, OO_Less, OO_LessEqual,   
            OO_Greater, OO_GreaterEqual, OO_Spaceship)
    return false;        

The post-proposal logic results in a 21.7% decrease in character count and a reduction in cognitive load — no requirement exists to study each comparison operator delimited by subsequent logical operators.

Constraints

Datatypes and expressions valid on the right-hand side (RHS) of a comparison operator remain valid with the manifold operators; where it does not conflict with the following:

Expectations

The suggested manifold operators are solving a niche problem with somewhat subjective reasoning used to justify their introduction. It is highly unlikely that the proposal will make any headway with WG21.


Alternative Solutions

A number of existing solutions can be crafted using standard language constructs, and it is important that we justify any new operators by demonstrating where existing capabilities fall short.


?? Design of the alternative solutions are geared towards readability, they are not intended to be exhaustive, optimal, or complete.


Macro Solution

A variadic macro implementation has all the shortcomings of macros but most importantly presents a difficult solution to debug and requires the addition of new definitions as the argument list expands.

#define NOT_OF0(lhs,rhs,...) lhs != rhs
#define NOT_OF1(lhs,rhs,...) lhs != rhs && NOT_OF0(lhs, __VA_ARGS__)
// Add NOT_OF2, 3, 4, 5... as desired.
#define NONE_OF(lhs, ...) NOT_OF1(lhs, __VA_ARGS__)

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (NONE_OF(OOK, OO_EqualEqual, OO_ExclaimEqual,
                 OO_Less, OO_LessEqual,
                 OO_Greater, OO_GreaterEqual,
                 OO_Spaceship))
    return false;        

Template Solution

A variadic template combined with a C++17 fold expression offers an improved solution. Yet, it fails to avoid unnecessary evaluation of the RHS, as there is no concept of left-to-right or right-to-left evaluation of an expression in C++ — see Order of evaluation for details.

template <typename LHS, typename... Args>
auto none_of(LHS lhs, Args... args)
{
    const auto predicate = std::not_equal_to<LHS>();
    return (predicate(lhs, args) && ...);
}

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (none_of(OOK, OO_EqualEqual, OO_ExclaimEqual,
                 OO_Less, OO_LessEqual,
                 OO_Greater, OO_GreaterEqual,
                 OO_Spaceship))
    return false;        

STL Solution

The Standard Template Library (STL) produces a verbose conditional construction with the pitfalls (order of evaluation) of the variadic template solution.

const std::set<OverloadedOperatorKind> ComparisonOperators = {
    OO_EqualEqual, OO_ExclaimEqual,
    OO_Less, OO_LessEqual,
    OO_Greater, OO_GreaterEqual,
    OO_Spaceship
};

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (std::none_of(std::cbegin(ComparisonOperators),
                 std::cend(ComparisonOperators),
                 [&](auto op) { return op == OOK; }))
    return false;        

Helper Functions

Clean, concise, and readable - a hard case to dismiss.

However, results in the proliferation of many helper functions to address all possible combinations throughout a codebase e.g. is_comparison_operator, is_equality_operator, is_not_equality_operator.

It too suffers from the order of evaluation pitfall, if the argument list grows.

const OverloadedOperatorKind OOK = FD->getOverloadedOperator();
if (is_not_comparison_operator(OOK))
    return false;        

C++ Operators

The language has a number of existing operators that can be classified into one of six categories Arithmetic, Relational, Logical, Bitwise, Assignment, and Conditional.

The manifold operators align with the singular existing conditional operator within the language, Ternary.

conditional-expression:                                [C99 6.5.15]
    logical-OR-expression
    logical-OR-expression '?' expression ':' conditional-expression
[GNU]
    logical-OR-expression '?' ':' conditional-expression        

As with the Ternary operator the manifold operators cannot be overridden, this is to ensure that true/false expressions are only evaluated based on the truth or falseness of the conditional expression.

The order of evaluation is maintained due to operator associativity, overcoming the stated pitfall of the alternative solutions above.

Associativity is the left-to-right or right-to-left order for grouping operands to operators that have the same precedence. An operator's precedence is meaningful only if other operators with higher or lower precedence are present.

Three-letter operators have been a contentious area within the C++ community. This barrier was breached with the release of C++20 introducing the three-way comparison operator (P0515R0); more commonly known as the spaceship operator due to its designated token <=>.

Fortunately, the suggested manifold syntax does not intersect with any pre-existing token sequence.


Implementation

The LLVM compiler infrastructure was selected as the target for operator implementation.

LLVM began as a research project at the University of Illinois, with the goal of providing a modern, SSA-based compilation strategy capable of supporting both static and dynamic compilation of arbitrary programming languages

?? No requirement exists to implement a proposal prior to WG21 submission.


Internals

LLVM is broken up into a number of distinct modules tasked with the primary stages of the compilation process, from lexical analysis to code generation.

Given the new operators can be represented by existing conditionals, we'll focus on three key stages.

Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program.

LLVM's Lexer library implements an interface that consumes a text buffer and generates a stream of symbols, one or more characters, for parsing.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

The Parser library interprets an incoming stream of symbols from the Lexer and performs syntactic analysis and Abstract Syntax Tree (AST) generation.

Semantic analysis or context sensitive analysis is a process in compiler construction, usually after parsing, to gather necessary semantic information from the source code.

The Sema library refines and builds out the AST.

We shouldn't have to perform modifications in this area. However, the Sema library appears to be the default dumping ground for common logic, and can't be completely ruled out.

Additional Reading

For the inquisitive reader, the following links are a great start prior to code-base [LLVM] exploration.


Approach

LLVM's documentation details an approach to adding a new expression or statement. Additionally, the three-way comparison operator's commits will hopefully illuminate a potential implementation.

That being said, we will lean heavily towards treating the problem as a software defect and attempt to get it to perform an operation it was not intended to do.

Development

As stated in the build documentation, compilation will be achieved through the CMake build scripts on a Linux distro (pop-os 6.2.6-76060206-generic).

$ git clone https://github.com/llvm/llvm-project
Cloning into 'llvm-project'...
...
Receiving objects: 100% (5588761/5588761), 1.89 GiB | 4.35 MiB/s, done.
        

?? A debug build will consume ~124GB of storage, not including test cases.

$ cd llvm-project
$ mkdir build && cd build/
$ cmake -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Debug -G "Unix Makefiles" ../llvm
...
$ make
...
$ make check-all
...
$ cd .. && build/bin/clang++ --version
clang version 18.0.0 (https://github.com/llvm-project f8b2544c422404b0c8d1225d9240301f50de1cdb)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/bcrowhurst/Development/llvm-project/build/bin        

Step 1 - Create an Error ??

Starting with the most complex operator let's compile the desired syntax.

Out of the four new operators, the most involved is the one-of-operator. Its equivalent syntax requires two distinct comparison operators (==, !=), whereas the alternatives only require a single comparison operator.

llvm-project/manifold-proposal/one-of-rfc.cpp
$ build/bin/clang++ -fsyntax-only manifold-proposal/one-of-rfc.cpp
one-of-rfc.cpp:7:13: error: type name requires a specifier or qualifier
    7 |   if (var [^] 1,2)
      |             ^
one-of-rfc.cpp:7:13: error: expected expression
2 errors generated.        

Step 2 - Walk the Codebase

Scanning for the rendered messages gives rise to DiagnosticParseKinds.td:201.

$ grep --recursive "expected expression"
...
DiagnosticParseKinds.td:def err_expected_expression : Error<"expected expression">;
...        

Subsequently, scanning for 'err_expected_expression' determines the origin of the error message in the Parse library at one of five possible locations.

$ grep --recursive "err_expected_expression"
...
ParseExpr.cpp: Diag(Tok, diag::err_expected_expression);
ParseExpr.cpp: Diag(Tok, diag::err_expected_expression);
ParseExpr.cpp: Diag(CCLoc, diag::err_expected_expression);
ParseExpr.cpp: Diag(Tok, diag::err_expected_expression);
ParseOpenMP.cpp: Diag(Tok.getLocation(), diag::err_expected_expression);        

Studying the candidates, we see a comment on line 3705, just above one of the reported diagnostics within ParseExpr.cpp:3705.

llvm-project/clang/lib/Parse/ParseExpr.cpp

Breaking down the code line by line.

Tok (class Token) is the current token being parsed. If its kind (enum TokenKind) is not an opening left brace, enter the body of the if-statement.

3704    if (!Tok.is(tok::l_brace)) {        

Report a diagnostic error message indicating malformed syntax, expression expected — see diagnostic documentation for further details.

The ERROR severity is used for diagnostics indicating the program is never acceptable under any circumstances. When an error is emitted, the AST for the input code may not be fully built.
3706    Diag(Tok, diag::err_expected_expression);        

Discard the evaluation context associated with the malformed block of code.

3707    Actions.ActOnBlockError(CaretLoc, getCurScope());        

Inform the caller of ParseBlockLiteralExpression that an error has occurred.

3708    return ExprError();        

Step 3 - Mapping the call stack

I've summarised the program control flow. The reality is the LLVM project has thousands of lines of code and a short article is not the appropriate place to communicate the minutiae.

$ cloc .
  156889 text files.
  147840 unique files.                                          
   32119 files ignored.

cloc v 1.90  T=231.71 s (550.0 files/s, 145368.8 lines/s)
--------------------------------------------------------------------
Language          files          blank        comment           code
--------------------------------------------------------------------
C++               29317         948242        1856402        5487231
LLVM IR           35402         633797        5838837        2977776
C/C++ Header      11741         316619         496652        1593157
D                  2980              2              0        1038679
C                 10383         257156        1566090         997444
YAML               4846          66451          64219         863733
Assembly          10582         470100        1185436         808791
JSON                111             18              0         443909
Python             2315          51776          57258         222274
...                ....           ....           ....           ....
--------------------------------------------------------------------
SUM:             127446        3039636       14947767       15695664
--------------------------------------------------------------------        

?? ParseStmt.cpp

Line 37: Enter ParseStatement.

Line 45: Invoke ParseStatementOrDeclaration.

Line 118: Invoke ParseStatementOrDeclarationAfterAttributes.

Line 303: Invoke ParseIfStatement.

llvm-project/clang/lib/Parse/ParseStmt.cpp

Line 1515: Invoke ParseParenExprOrCondition.

Line 1307: Invoke ParseCXXCondition.

?? ParseExprCXX.cpp

Line 2042: Enter ParseCXXCondition.

Line 2103: Invoke ParseExpression.

?? ParseExpr.cpp

Line 126: Enter ParseExpression — a precedence-based parser.

llvm-project/clang/lib/Parse/ParseExpr.cpp

Line 128: Invoke ParseRHSOfBinaryExpression.

Line 549: Invoke ParseCastExpression.

Line 1766: Finally, ?? invoke ParseBlockLiteralExpression.

llvm-project/clang/lib/Parse/ParseExpr.cpp

Step 4 - Circumventing the error message

At this stage, we know the source of the diagnostic message and the path through the code that leads to the error. What we desire is to avoid the case expression at ParseExpr.cpp:1765.

Recall our new one-of-operator needs to be parsed as a complete token, not processed as the individual characters [, ^, ]. Introducing this token into the Lexer library should lead to new diagnostics and an area of further research.

The first step is to introduce the token definition into TokenKinds.def:244.

PUNCTUATOR(manifoldoneof,       "[^]")        

?? The token was introduced into the C++ Support section. For unknown reasons, the Spaceship operator is listed under C99 6.4.6 Punctuators.


Now at Lexer.cpp:3934 we introduce additional logic for operator tokenisation.

llvm-project/clang/lib/Lex/Lexer.cpp

?? Ah! The leading character of the Spaceship operator is an opening angle bracket, which is tokenised under the C99 6.4.6: Punctuators section in the CPP file — header sections marry up to the CPP file layout based on the first character of each token.


As a stream of bytes is fed into the lexer they are organised into tokens consisting of one or more characters.

When an opening square bracket is encountered (Lexer.cpp:3934), we peek ahead to the next character testing if it equals caret, and peek ahead one more to find a matching closing square bracket. If the stream satisfies the requirement, we consume the two proceeding characters and assign a token kind of manifold-one-of.

Later we'll extend this logic to support the any-of, all-of, and none-of operators.

llvm-project/clang/lib/Lex/Lexer.cpp

Given that we are inside a precedence-based parser (ParseExpression), we need to assign a precedence level to the token by creating an additional case expression at OperatorPrecedence.cpp:53.

llvm-project/clang/lib/Basic/OperatorPrecedence.cpp

A complete diff of the changes can be reviewed in commit a437fd6.

commit a437fd6f74ee49ddef246026ace1ca467bb47a2e

As before, let's (re)build LLVM and attempt to compile the new syntax.

$ build/bin/clang++ -fsyntax-only manifold-proposal/one-of-rfc.cpp
one-of-rfc.cpp:7:18: error: expected ':'
    7 |   if (var [^] 1,2)
      |                  ^
      |                  : 
one-of-rfc.cpp:7:11: note: to match this '?'
    7 |   if (var [^] 1,2)
      |           ^
one-of-rfc.cpp:7:18: error: expected expression
    7 |   if (var [^] 1,2)
      |                  ^
one-of-rfc.cpp:7:15: warning: left operand of comma operator has no effect [-Wunused-value]
    7 |   if (var [^] 1,2)
      |               ^
1 warning and 2 errors generated        

Awesome! ?? This is precisely the kind of feedback we hoped to see.

The new token is being processed as if it were a conditional operator. Additionally, we have a warning that the comma operator is being parsed, indicating the entirety of the expression has been evaluated.

We see the original error 'expected expression' and a new entry 'expected ':''. This looks as if our expression is being parsed as the Ternary operator; as anticipated.

The source of the new diagnostics can be tracked down to ParseRHSOfBinaryExpression at ParseExpr.cpp:522.

llvm-project/clang/lib/Parse/ParseExpr.cpp

Step 5 - Erecting the scaffolding

Reviewing ParseRHSOfBinaryExpression we see a special block of logic for handling the ternary operator, and the source of our diagnostic messages.

llvm-project/clang/lib/Parse/ParseExpr.cpp

The logic is tasked with parsing the inner expression of the Ternary operator...

logical-OR-expression '?' expression ':' conditional-expression        

...if one exists (Elvis operator), and then passing it into ActOnConditionalOp for semantic parsing at ParseExpr.cpp:641.

llvm-project/clang/lib/Parse/ParseExpr.cpp

This [function] seems to be an appropriate location to introduce the new manifold parsing routines.

The following commit (580f147) erects the scaffolding to avoid the special Ternary logic and invoke a distinct path for the manifold operators — resulting in a deliberate expression error at ParseExpr.cpp:3854.

commit 580f14763d143544df1f08aeddeb733aff83d429

?? The ParseManifoldExpression location changed with commit 95e08c3.


Step 6 - Crafting the AST

Clang has a built-in AST-dump mode, which can be enabled with the flags '-ast-dump' and '-ast-dump-xml' — XML generation is only supported in debug builds.

Generating the desired structure is as easy as dumping the equivalent syntax of the one-of-operator.

llvm-project/manifold-operator/one-of-equiv.cpp

?? Minimal implementations produce direct output when dumping the AST.


one-of equivalent syntax AST

Breaking down the output line by line details the required steps for emitting a well-structured AST entry.

  1. IfStmt represents an if, then, else structure.
  2. BinaryOperator (||) node describes a binary operation whose operands will have already undergone promotions or conversions for appropriate type transformation.
  3. ParenExpr represents a parenthesised expression e.g. (var == 1 && var != 2).
  4. BinaryOperator (&&)
  5. BinaryOperator (==)
  6. ImplicitCastExpr (var) In C, implicit casts always produce rvalues. However, in C++, an implicit cast whose result is being bound to a reference will be an lvalue or xvalue.
  7. IntegerLiteral (1) node representing the integer literal. All literals are primary expressions.
  8. BinaryOperator (!=)
  9. ImplicitCastExpr (var)
  10. IntegerLiteral (2)
  11. ParenExpr (var != 1 && var == 2)
  12. BinaryOperator (&&)
  13. BinaryOperator (!=)
  14. ImplicitCastExpr (var)
  15. IntegerLiteral (1)
  16. BinaryOperator (==)
  17. ImplicitCastExpr (var)
  18. IntegerLiteral (2)

Finally, the CompoundStmt at line 19 represents a group of one or more statements making up the if-body.

Step 7 - Parsing the operator

See commit ea2f67d or ParseExprCXX.cpp:4120 for details.

llvm-project/clang/lib/Parse/ParseExprCXX.cpp

?? ParseExprCXX.cpp

Line 4121-4123: Debug assertion; will improve this later with a 'ManifoldExpressions' feature switch.

Line 4125-4140: Parse the first RHS expression and produce an error if we fail to find two or more comma-delimited expressions.

Line 4142-4155: Iterate over the remaining RHS expressions, parse them, and ensure we end with closing parentheses.

Line 4158-4176: Stitch the RHS expressions together while propagating the equality comparison operator further up for each outer iteration — O(n*m).

Line 4178-4187: For each outer iteration wrap the stitched expressions in parentheses to ensure order of evaluation.

Step 8 - Testing

We've updated the final form of the test code to take a single command line argument and to prompt if the input is inside or outside of the acceptable set.

llvm-project/manifold-proposal/one-of-test.cpp
one-of example application output

Additionally, test cases were introduced in commit c832a78.

build/bin/clang++ -fsyntax-only -Xclang -verify clang/test/Parser/manifold.cpp        

And finally, we rerun LLVMs complete test suite.

$ make check-all
...
[100%] Running all regression tests
llvm-lit: /home/bcrowhurst/Development/llvm-project/llvm/utils/lit/lit/llvm/config.py:487: note: using clang: /home/bcrowhurst/Development/llvm-project/build/bin/clang

Testing Time: 1624.06s
  Skipped          :    50
  Unsupported      :  1994
  Passed           : 88521
  Expectedly Failed:   180
[100%] Built target check-all        

Step 9 - Resolved Bugs

  • Altered (39f4189) the all-of-operator syntax to avoid lambda capture collision. This was the path of least resistance, avoiding increased parser complexity.

auto fn = [&]{ }  // error: expected manifold expression        

  • Refactored (3e3bce6) test case pr47804.cpp due to a name conflict.

Step 10 - Round off

  • Implemented (bda18ce) alternative operator representation
  • Implemented (1839e69) manifold feature switch

// Enable
clang++ -fmanifold-expressions test.cpp -o test

// Disable (default)
clang++ -fno-manifold-expressions test.cpp -o test        

  • Implemented (375efe3) all-of-operator syntax
  • Implemented (7d45540) any-of-operator syntax
  • Implemented (39b005a) none-of-operator syntax

  • [Outstanding] Improve diagnostics
  • [Outstanding] Expand test coverage
  • [Outstanding] Update clang-format


?? Warning this implementation is a proof-of-concept; bugs may still reside.


Step 11 - Retrospective

As a first swing at the LLVM project, the resulting implementation is sufficient for a proof-of-concept.

However, refactoring ParseRHSOfBinaryExpression and splitting the parsing logic for conditional operators from that of other expressions, would make it easier to introduce new conditionals, and help confine the 277 lines of existing logic into a more digestible unit.

Additionally, the newly introduced manifold logic would be better positioned if transferred, or split between the Sema library with a specific action, or extension of the existing action ActOnConditionalOp.

The current approach allows us to reuse much of the existing framework, yet it falls short of a mergeable implementation. Refactoring is required given the manifold operators gain any traction during discussions with the WG21 committee.


WG21 Submission

ISOCPP outlines the proposal process as follows:

  1. Float?the idea. Post an initial brief description of your feature on the std-proposals mailing list,?including especially the problem it solves and alternatives considered.
  2. Post an initial draft. Using the proposal template at the bottom of this page, write a draft proposal and circulate a link to it on the std-proposals mailing list to start a second round of discussion.
  3. Iterate and fine-tune. Update the draft proposal with a new revision to incorporate the feedback you’ve received, and post a new link. Rinse and repeat until you converge on a design that gets general support.

Feedback

Over the course of a month, the proposal was discussed and a general disagreement towards introducing new operators was widely received. The preferred approach is a template-based solution with much of the thread discussing the nuances of such an implementation.

Other comments suggested this would lead to even more complex conditionals, countering any arguments on a reduction in cognitive load.

if(Alpha [!] Beta, Gamma, Delta && Alpha[^] Epsilon, Zeta, Eta && Theta [^] ... && ....) I'd argue cognitive load goes up from needing to track the three-character symbols and their relative positions between a bunch of names (especially consider if the names themselves become compounded names or function calls); over a more traditional function call syntax. It adds more characters, yes, but verbosity is not a dirty word if it means making the code clearer. I will agree that the need to insert a wrapping "matcher" template is unideal, but that's how the language works; and adding something as extraordinarily fundamental as a new set of operators to the language requires an extraordinary motivation which in the nicest possible way I don't quite see here.?

One member was concerned with how daunting the addition of new operators would be when wording the change into the standard.

...have you looked at how it would interact with the existing standard in terms of wording; including all the various fringe cases and bits that adding new operators and conditionals will have when pathologically arranged? I dread to think how many places the standard refers to operators and constructs which can be treated as conditionals and predicates, so there may well be wording awkwardness aplenty.

Conclusion

To conclude, we've explored the LLVM Compiler Infrastructure Project, implementing a number of new operators for the C++ language.

The WG21 response was somewhat lacklustre but to be expected due to such a fundamental set of new language features. I acknowledge my naivety when it comes to the impact on the C++ Standard that four new operators would cause; as shown in my initial post to the working group.

Section '7.6 Compound expressions' requires the addition of new grammar and wording detailing the Manifold operator syntax. TBD.

Going into the project I was well aware it would be difficult to get much traction however, my motivation was primarily driven by curiosity with respect to LLVM's compiler design and implementation.

A working copy of the compiler with all the manifold operators implemented can be located in the project's GitHub repository.

For those seeking to scratch a C++ itch, this year's CppCon has concluded with videos now available at https://cppcon.org/. ??



要查看或添加评论,请登录

社区洞察