Is it possible to add new identifiers to a language? And how would one go about doing it?
The end result I want is an SQL editor that highlights tables, views and stored procedures in different colours.
Finding the names of the tables, views and SPs is quite easy.
But how do I add them to the language of the editor?
Yes, it is possible. You essentially need to do two things: add the new identifiers to the language's grammar, and assign them a custom coloring.
Both tasks are relatively easy. To add the new identifier, you first need to define a new TerminalSymbol capable of recognizing the identifier. This terminal symbol must be added to all lexer states where the name might appear.
If you are not familiar with the lexer or lexer states, here is a brief introduction. The lexer, or tokenizer, reads through the characters of the source file being parsed and breaks it up into "chunks" of logical content, called tokens. So a C# file might have the text "var foo = 3;". The lexer would read that string and produce, as output, a list of tokens, each associated with a terminal symbol. Here it would produce tokens associated with the symbols VarKeyword, WhitespaceToken, IdentifierToken, WhitespaceToken, EqualsToken, WhitespaceToken, IntegerLiteralToken, and SemicolonToken, in that order. The parser then uses those tokens to make sense of the overall document without having to worry about individual characters.
Lexer states are basically different reading modes the lexer enters while creating tokens. Typically, the lexer is in the default state, where tokens for most of the terminal symbols are recognized. But if, say, a double-quote character is encountered, the lexer changes to a state where it only reads the content of a string literal. Once it sees the closing double-quote of the string literal, it goes back to the default state. This is helpful because you don't want tokens for keywords and identifiers to be recognized and parsed when they appear inside a string literal. Instead, a single token representing the content of the string literal should be recognized.
Another thing to keep in mind about the lexer is the priority of terminal symbols. When the lexer is reading characters to form a token, it always creates the longest valid token it can. This is why, in the code "var foo = 3;", the lexer creates one token for an identifier named "foo" and not three tokens for identifiers named "f", "o" and "o". Once the longest token is found, it is associated with the terminal symbol able to match it. But sometimes multiple terminal symbols can match a token. For example, the text "class" in C# is both a valid identifier and a valid class keyword. The lexer decides which terminal symbol to associate with the token by using the one defined earliest in the collection of symbols for the current lexer state. So in the case of C# (as well as all other languages), all reserved keyword terminal symbols must be defined before the identifier terminal symbol in the default lexer state, so that they have priority and always get associated with the tokens they can match.
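
To make the two rules above concrete (longest token first, then the earliest-defined symbol on a tie), here is a small self-contained sketch. It is a toy lexer written for illustration only, not the Infragistics implementation; the class name ToyLexer and the regular expressions are assumptions, and only the symbol names come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Toy lexer (hypothetical, not the Infragistics one): illustrates
// longest-match tokenizing and "earliest-defined symbol wins" priority.
public static class ToyLexer
{
    // Order matters: earlier entries win when several symbols match the same text.
    private static readonly List<KeyValuePair<string, Regex>> Symbols =
        new List<KeyValuePair<string, Regex>>
        {
            Symbol("VarKeyword", "var"),
            Symbol("IdentifierToken", "[A-Za-z_][A-Za-z0-9_]*"),
            Symbol("IntegerLiteralToken", "[0-9]+"),
            Symbol("WhitespaceToken", @"\s+"),
            Symbol("EqualsToken", "="),
            Symbol("SemicolonToken", ";"),
        };

    private static KeyValuePair<string, Regex> Symbol(string name, string pattern)
    {
        // \G anchors the match at the start position passed to Regex.Match.
        return new KeyValuePair<string, Regex>(name, new Regex(@"\G(?:" + pattern + ")"));
    }

    // Returns entries of the form "SymbolName:text".
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int position = 0;
        while (position < text.Length)
        {
            string bestName = null;
            int bestLength = 0;
            foreach (var symbol in Symbols)
            {
                Match match = symbol.Value.Match(text, position);
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier-defined symbol is kept.
                if (match.Success && match.Length > bestLength)
                {
                    bestName = symbol.Key;
                    bestLength = match.Length;
                }
            }
            if (bestName == null)
                throw new FormatException("Unexpected character at position " + position);
            tokens.Add(bestName + ":" + text.Substring(position, bestLength));
            position += bestLength;
        }
        return tokens;
    }
}
```

With this sketch, ToyLexer.Tokenize("var foo = 3;") yields the eight tokens listed above, starting with "VarKeyword:var", while ToyLexer.Tokenize("variable") yields a single "IdentifierToken:variable", because the identifier match is longer than the keyword match.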
In the T-SQL language, identifiers are only recognized in the default lexer state, so the new TerminalSymbol instance only needs to be added to that state. However, simply adding the TerminalSymbol to the lexer state will not work as expected because it is added to the end of the lexer state's symbols collection and therefore has the lowest priority. When a token is matched containing the new symbol's text, the token will be associated with the IdentifierToken symbol, and not the new terminal symbol. So the new terminal symbol needs to be inserted into the beginning of the default lexer state, ensuring it has priority over IdentifierToken.
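
The difference between appending and inserting at the beginning can be sketched with a plain ordered list. This is a toy model, not the Infragistics types: the class SymbolPriorityDemo, the table name "Customers" and the patterns are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Toy model of one lexer state's ordered symbol collection
// (hypothetical; it only mimics the priority rule described above).
public static class SymbolPriorityDemo
{
    public static List<KeyValuePair<string, string>> CreateDefaultState()
    {
        return new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("SelectKeyword", "select"),
            new KeyValuePair<string, string>("IdentifierToken", "[A-Za-z_][A-Za-z0-9_]*"),
        };
    }

    // Returns the name of the first symbol whose pattern matches the whole token.
    public static string Classify(List<KeyValuePair<string, string>> symbols, string tokenText)
    {
        foreach (var symbol in symbols)
        {
            if (Regex.IsMatch(tokenText, "^(?:" + symbol.Value + ")$"))
                return symbol.Key;
        }
        return null;
    }

    // Classifies "Customers" once with the new symbol appended
    // and once with it inserted at index 0.
    public static string[] Run()
    {
        var newSymbol = new KeyValuePair<string, string>("CustomersKeyword", "Customers");

        var appended = CreateDefaultState();
        appended.Add(newSymbol);        // lowest priority: IdentifierToken is checked first

        var inserted = CreateDefaultState();
        inserted.Insert(0, newSymbol);  // highest priority: checked before IdentifierToken

        return new[] { Classify(appended, "Customers"), Classify(inserted, "Customers") };
    }
}
```

Run() returns "IdentifierToken" for the appended case and "CustomersKeyword" for the inserted case, which is the behaviour described in the paragraph above.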
Once the new symbol is added, it must be incorporated into the grammar rules. Luckily, there is a "helper" non-terminal symbol called _identifierResolved in T-SQL used anywhere an identifier is allowed. This allows IdentifierToken as well as contextual keywords to be used as identifiers when appropriate. So adding the new terminal symbol as another alternate kind of _identifierResolved will allow the new symbol to automatically be recognized anywhere an identifier is allowed in the grammar.
The last thing that needs to be done is to add in the custom coloring for the new terminal symbol. This is done by assigning a LanguageElementName to the TerminalSymbol, and using that same name as a key in a custom ClassificationAppearanceMap on the XamSyntaxEditor.
I have attached a sample which shows how to do this. I used the same sample I provided to you on your other post and just region-ed out the code to solve the other problem. This way you can see solutions to both your issues being used in conjunction.
Hi,
This is a great answer. I need to add items that contain % symbols here (in the AddIdentifierSymbol method):
var identifierSymbol = grammar.LexerStates.DefaultLexerState.Symbols.Insert(0, identifier + "Keyword", identifier, TerminalSymbolComparison.Literal);
but when I try to add %error% for example, I get an error stating that the value can only contain underscores, letters, or numbers. I've tried the insert above with Regular expressions and changing TerminalSymbolComparison.Literal to TerminalSymbolComparison.RegularExpression without much luck.
Any help will be appreciated. Please let me know if I need to clarify more.
Andy
After experimenting with this a little more, I figured it out. Here is an example; let me know if it needs clarifying. It applies to the scenario described above.
var identifierSymbol = grammar.LexerStates.DefaultLexerState.Symbols.Insert(0, "error" + "Keyword", "(%error%)", TerminalSymbolComparison.RegularExpression);
The name can only contain underscores, numbers, and letters. The value can be a regular expression, as in the example above.
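
If many such identifiers have to be added, the name and the value could be derived from the raw text with two small helpers. This is only a sketch: SymbolNaming, ToSymbolName and ToSymbolPattern are hypothetical names, not part of the Infragistics API, and it assumes the editor's regular expression dialect accepts the escapes produced by .NET's Regex.Escape, which has not been verified here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helpers (not part of the Infragistics API) for building
// the name and value arguments from a raw identifier such as "%error%".
public static class SymbolNaming
{
    // Keeps only letters, digits and underscores, then appends "Keyword",
    // following the naming used in the posts above.
    public static string ToSymbolName(string identifier)
    {
        string cleaned = Regex.Replace(identifier, "[^A-Za-z0-9_]", "");
        if (cleaned.Length == 0)
            throw new ArgumentException("Identifier has no usable characters.", "identifier");
        return cleaned + "Keyword";
    }

    // Escapes regex metacharacters so the identifier is matched literally,
    // and wraps the result in a group as in the example above.
    public static string ToSymbolPattern(string identifier)
    {
        return "(" + Regex.Escape(identifier) + ")";
    }
}
```

For "%error%" these helpers produce "errorKeyword" and "(%error%)", the same name and value passed to Insert in the example above. Note that two different identifiers can collapse to the same name (for example "%error%" and "error"), so a real implementation would need to keep the names unique.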