STATIC ANALYZER IMPLEMENTATION MODEL FOR THE SOLIDITY PROGRAMMING LANGUAGE
Vladislav Stolyarov
graduate student, Tula State University,
Russia, Tula
The initial motivation for implementing a new static analyzer for Solidity is that there is only one reference implementation of the Solidity compiler in the world, and it generates bytecode for the Ethereum virtual machine.
Figure 1. High-level diagram of the Solidity compiler
The proposed idea is to take advantage of LLVM's modular architecture and write a custom frontend for the Solidity programming language. The output of this frontend would be LLVM IR, which would make it possible to generate machine code for every platform LLVM supports. The result would be a Solidity compiler with all architecture-specific optimizations, which would be guaranteed to work faster than the reference implementation. This would be achieved through at least one machine-specific layer of optimizations in the so-called compiler backend (the stage that generates machine code for the target architecture). The execution speed of code built by such a compiler would approach that of programs written in C and C++.
Implementing a static analyzer alongside the proposed compiler makes it possible to write custom optimizations over the generated intermediate representation. The compiler implementation itself then reduces to adding one more pass that emits the LLVM intermediate representation and hands it to the middle-end and back-end for further transformations. It is also worth noting that the proposed static analyzer implementation can potentially provide better analysis quality thanks to the tools built into LLVM, as sketched below.
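For illustration, the following is a minimal sketch of how such a pass could emit LLVM IR through the LLVM C++ API. The module name, the 256-bit integer type, and the trivial add function are assumptions made for the example and do not correspond to the actual implementation.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
    llvm::LLVMContext Ctx;
    llvm::Module M("solidity_module", Ctx);      // illustrative module name
    llvm::IRBuilder<> Builder(Ctx);

    // Emit a trivial function: i256 add(i256 a, i256 b) { return a + b; }
    llvm::Type *I256 = llvm::Type::getIntNTy(Ctx, 256);   // EVM-style 256-bit integer
    auto *FT = llvm::FunctionType::get(I256, {I256, I256}, /*isVarArg=*/false);
    auto *F  = llvm::Function::Create(FT, llvm::Function::ExternalLinkage, "add", M);

    auto *Entry = llvm::BasicBlock::Create(Ctx, "entry", F);
    Builder.SetInsertPoint(Entry);
    llvm::Value *Sum = Builder.CreateAdd(F->getArg(0), F->getArg(1), "sum");
    Builder.CreateRet(Sum);

    llvm::verifyFunction(*F);          // sanity-check the generated IR
    M.print(llvm::outs(), nullptr);    // dump the textual IR
    return 0;
}

The backend can then lower such IR to machine code for any target supported by LLVM.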
Separating the static analyzer from the compiler is necessary to keep compilation as fast as possible and to avoid dragging expensive static analysis checks into every build. The compiler therefore implements source-code parsing, conversion to LLVM IR, and lightweight checks over that intermediate representation. The static analyzer, in turn, implements a large set of varied checks, some of them heuristic.
Using the proposed family of utilities involves running the static analyzer separately, possibly integrating it into a CI/CD pipeline.
The analyzer core can be implemented using standard C++ facilities. At the time of writing, the following components have been implemented: a lexical analyzer, a parser, a semantic analyzer, and a standalone traversal of the abstract syntax tree. In parallel, work is underway to integrate the analyzer with LLVM, implement diagnostic rules, and develop the supporting infrastructure.
The analysis proceeds in several stages, each handled by a different utility. The first of these is the lexical analyzer, or lexer [1, p. 156]. The lexer's task is to break the incoming source-code file into its constituent parts, i.e. tokens. The resulting container of tokens is usually fed to the parser to build a parse tree or some other intermediate representation of the code for analysis and subsequent transformations.
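As an illustration of what such a container could look like, below is a minimal sketch in C++; the type names and the set of token classes are chosen for the example and do not reflect the actual implementation.

#include <string>
#include <vector>

// Possible token classes for a Solidity source file (illustrative, not exhaustive).
enum class TokenKind {
    Identifier,     // user-defined names: contracts, functions, variables
    Keyword,        // contract, function, mapping, returns, ...
    NumberLiteral,
    StringLiteral,
    Punctuator,     // { } ( ) ; , and similar
    EndOfFile
};

struct Token {
    TokenKind   kind;
    std::string lexeme;   // original text of the token
    unsigned    line;     // source position, used later for diagnostics
    unsigned    column;
};

// The lexer fills this container; the parser consumes it.
using TokenStream = std::vector<Token>;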
There is a point of view that a lexer is not needed to implement a compiler or static analyzer, since it is not a mandatory component of either. Formally this is true: all the work of tokenizing a source file can be performed during parsing [1, p. 253], because it is governed by the language syntax. Nevertheless, there are several reasons why a lexer is present in almost every compiler or static analyzer implementation:
- A lexical analyzer makes the work done on the program source during parsing much shorter and simpler, because the parser receives already structured input.
- Lexical analysis is a convenient place to discard insignificant information such as spaces, tabs, etc.
- Splitting a source file into tokens usually relies on a simple set of rules that carries over well to other programming languages, so a working prototype lexer is fairly easy to build.
The principle of operation of the lexical analyzer is simple. It receives a source-code file as input; the file may or may not have been pre-processed (in particular, run through a preprocessor). The lexer traverses the file and classifies the tokens it finds. Depending on whether the source file has been preprocessed, the lexer may also perform additional tasks (for example, recording where comments are located).
If a lexeme does not fall under any of the classifications, it is usually an identifier, i.e. an entity named by the user. This gives rise to a potential problem: the set of tokens may change depending on the language version.
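As a sketch of the classification step described above, the following fragment (reusing the illustrative Token and TokenKind types from the earlier example) looks an alphabetic lexeme up in a keyword table for the chosen language version and falls back to Identifier when it is not found.

#include <cctype>
#include <string>
#include <unordered_set>

// Keyword table for one language version (illustrative subset).
static const std::unordered_set<std::string> kKeywords = {
    "contract", "function", "mapping", "returns", "public", "uint256"
};

TokenKind classifyWord(const std::string &lexeme) {
    return kKeywords.count(lexeme) ? TokenKind::Keyword : TokenKind::Identifier;
}

// Scans one identifier-like lexeme starting at pos and classifies it.
Token lexWord(const std::string &src, size_t &pos, unsigned line, unsigned column) {
    size_t start = pos;
    while (pos < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
        ++pos;
    std::string lexeme = src.substr(start, pos - start);
    return Token{classifyWord(lexeme), lexeme, line, column};
}

Supporting several language versions then amounts to selecting a different keyword table.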
The parser, or syntax analyzer, works on the container of classified tokens rather than on the raw source file. As an optimization, the two utilities work in parallel: the lexer traverses the file lazily and hands data to the parser. The scheme is as follows: the parser asks the lexer for a token; the lexer consults its internal buffer and, if tokens are available, returns them. If the buffer is empty, the lexer processes another piece of the file and refills its cache. The parser itself is an ordinary recursive descent parser that processes the formal grammar of the Solidity language and builds an abstract syntax tree from it.
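A sketch of this lazy interaction between the lexer and the parser is shown below; the grammar fragment and all names are illustrative, the token types are reused from the earlier examples, and the scanning logic itself is elided.

#include <deque>
#include <memory>
#include <stdexcept>
#include <string>

class Lexer {
public:
    explicit Lexer(std::string source) : src_(std::move(source)) {}

    // Called by the parser; refills the internal buffer lazily when it runs dry.
    Token next() {
        if (buffer_.empty())
            refill();
        Token t = buffer_.front();
        buffer_.pop_front();
        return t;
    }

private:
    void refill() {
        // Tokenize the next piece of src_ and push the tokens into buffer_
        // (the scanning itself is elided in this sketch).
        buffer_.push_back(Token{TokenKind::EndOfFile, "", 0, 0});
    }

    std::string src_;
    std::deque<Token> buffer_;
};

struct AstNode { /* fields elided */ };

class Parser {
public:
    explicit Parser(Lexer &lexer) : lexer_(lexer), current_(lexer_.next()) {}

    // One recursive-descent procedure per grammar production, for example:
    // ContractDefinition ::= 'contract' Identifier '{' ... '}'
    std::unique_ptr<AstNode> parseContract() {
        expect(TokenKind::Keyword);      // 'contract'
        expect(TokenKind::Identifier);   // contract name
        // ... parse the contract body and attach child nodes ...
        return std::make_unique<AstNode>();
    }

private:
    void expect(TokenKind kind) {
        if (current_.kind != kind)
            throw std::runtime_error("unexpected token: " + current_.lexeme);
        current_ = lexer_.next();        // advance to the next token
    }

    Lexer &lexer_;
    Token current_;
};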
Semantic analysis in the compiler ensures that program declarations and statements are semantically correct. It is a set of procedures invoked by the parser whenever the grammar requires it. To check the consistency of the code, both the syntax tree built in the previous phase and the symbol table are used. Type checking is an important part of semantic analysis: the compiler verifies that every operator receives operands of the appropriate types. Using the abstract syntax tree and the symbol table, it checks whether the program semantically conforms to the language definition, collects type information, and stores it in the syntax tree or the symbol table. This type information is later used by the compiler during intermediate code generation. It is also useful for the static analyzer, since it enables a whole layer of type and control-flow checks.
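For illustration, a minimal sketch of such a type check over a symbol table is given below; the set of types, the names, and the handling of the addition operator are assumptions made for the example.

#include <stdexcept>
#include <string>
#include <unordered_map>

enum class Type { UInt256, Bool, Address, Unknown };

// Maps declared names to their types within one scope.
class SymbolTable {
public:
    void declare(const std::string &name, Type type) { symbols_[name] = type; }
    Type lookup(const std::string &name) const {
        auto it = symbols_.find(name);
        return it != symbols_.end() ? it->second : Type::Unknown;
    }
private:
    std::unordered_map<std::string, Type> symbols_;
};

// Checks a binary expression such as a + b: both operands must have the same
// arithmetic type. The resulting type is returned so it can be stored in the
// syntax tree or the symbol table for later phases.
Type checkAdd(const SymbolTable &scope, const std::string &lhs, const std::string &rhs) {
    Type l = scope.lookup(lhs);
    Type r = scope.lookup(rhs);
    if (l != Type::UInt256 || r != Type::UInt256)
        throw std::runtime_error("operands of '+' must both be uint256");
    return Type::UInt256;
}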
Thus, the combination of the approaches presented makes it possible to implement a modular, cross-platform static analyzer.
References:
- Aho A. V., Lam M. S., Sethi R., Ullman J. D. Compilers: Principles, Techniques, and Tools, 2nd ed. Moscow: Williams, 2008. pp. 155-251.