gbXML - XML Language File Format
The XML language files created by gbXML consist of the following hierarchy of elements: language > tokenset > validscope|tokens|tokens2. The basic element structure of a language file is shown in the following example:

        <validscope ... />
        <tokens>  ... </tokens>
        <tokens2> ... </tokens2>

A typical language definition file might consist of 5-10 tokenset elements, each with 2-4 validscope elements, 1 tokens element, and 1 tokens2 element. gbXML allows languages to have up to 50 tokensets and up to 10 validscopes.
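As a rough sketch of how these pieces nest, the skeleton below wraps the child elements in the tokenset and language elements named in the hierarchy. Attribute values are elided, and the choice of two tokensets is purely illustrative.

```xml
<language ... >
    <!-- A tokenset with a validscope and a pair of token lists -->
    <tokenset ... >
        <validscope ... />
        <tokens>  ... </tokens>
        <tokens2> ... </tokens2>
    </tokenset>
    <!-- A second tokenset that needs only a single token list -->
    <tokenset ... >
        <tokens>  ... </tokens>
    </tokenset>
</language>
```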

Additional details on each element type are provided below.

Tokens and Tokenset
The text which makes up language source code can be thought of as consisting entirely of 'tokens' - groups of text characters, such as keywords, variables, operators, and other symbols specific to the language. Tokens are often thought of as 'words', but a token can also be multiple words, such as the text string 'End Function', which is used to terminate functions in some languages. The purpose of a language definition file is to identify groups, or lists, of tokens and to specify the formatting options to be applied to those tokens.

Sometimes a token pair needs to be identified, where all text found within the token pair can be given specific formatting instructions. In this case, a second list of tokens may be required in the language definition - one for the starting token of the pair and a second for the ending token of the pair.

The token list (or lists, if token pairs are involved) is placed within a tokenset element. Tokenset attributes may be defined which describe the formatting to be applied to its tokens or to the content between token pairs. Formatting information can also be entered on a token-by-token basis, overriding the formatting instructions entered at the tokenset level.
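A minimal sketch of a token-level override might look like the following. This page states that per-token formatting is possible but does not show the syntax, so the placement of a forecolor attribute directly on a token element is an assumption, as are the language name and the token values.

```xml
<language name="example">
    <!-- Tokenset-level formatting: every token defaults to red -->
    <tokenset name="Common Words" id="keywords" type="list" forecolor="red">
        <tokens>
            <token> if </token>
            <token> while </token>
            <!-- Assumed syntax: this token overrides the tokenset-level color -->
            <token forecolor="blue"> End Function </token>
        </tokens>
    </tokenset>
</language>
```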

The token lists are typically simple multi-line listings of every token to which the formatting will be applied. However, a list of tokens may also be defined through the use of regular expressions. Using a single regular expression to define an entire list of tokens is a powerful simplifying tool for creating language definition files.
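For illustration, the two tokens elements below contrast an explicit listing with a regular-expression definition. The regexp="yes" attribute is the one used in the hyperlink example on this page. The pattern itself and the one-token-element-per-token layout are illustrative assumptions, and the regular expression dialect accepted by gbXML should be checked before relying on this form.

```xml
<!-- Explicit listing: each token written out individually -->
<tokens>
    <token> End If </token>
    <token> End Function </token>
    <token> End Sub </token>
</tokens>

<!-- Regular expression: a single pattern stands in for the whole list -->
<tokens regexp="yes">
    <token> End (If|Function|Sub) </token>
</tokens>
```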

Tokenset Types
There are actually two types of tokensets - list and scope. As the name implies, a list tokenset is simply a listing of all text strings (tokens) which belong to the tokenset. As noted above, a regular expression may also be used to define the list of tokens.

Here's a simple example of a list tokenset with only a few tokens. Assigning properties (attributes) of XML elements will be discussed later.

<language name="java">
    <tokenset name="Common Words" id="keywords" type="list" forecolor="red">

In this example, three tokens (if, while, end) are defined and will be displayed in the color red.
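Written out in full, a list tokenset matching this description might look like the sketch below. It assumes each token is written as a token child element inside tokens, the form used in the hyperlink example later on this page.

```xml
<language name="java">
    <tokenset name="Common Words" id="keywords" type="list" forecolor="red">
        <!-- The three tokens named in the description -->
        <tokens>
            <token> if </token>
            <token> while </token>
            <token> end </token>
        </tokens>
    </tokenset>
</language>
```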

The second kind of tokenset, a scope tokenset, is a list of one or more token pairs. Each pair consists of a starting and ending token, where specific display characteristics are applied to the source code between the two tokens (as well as to the tokens themselves). For example, in most languages double-quotes are used to enclose strings. A scope tokenset which defines a pair of double-quote tokens would be used to apply formatting to all source code between the two double-quote characters.

A scope tokenset can include multiple token pairs. The first token of a pair is placed in a tokens element. The second, corresponding token of a pair is placed in a tokens2 element. The tokens and tokens2 elements can each contain any number of tokens, but both must contain the same number, so that every starting token has a corresponding ending token.

Here's an example of a scope tokenset.

<language name="java">
    <tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">

In this example, two pairs of tokens are defined - a pair of double-quotes and a pair of single quotes - both of which are used to enclose strings in many languages. Source code between either token pair would be colored blue in this example.
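Written out in full, a scope tokenset matching this description might look like the following sketch. The token child-element form and the pairing of tokens by position (first token in tokens with first token in tokens2) are assumptions drawn from the surrounding text.

```xml
<language name="java">
    <tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
        <!-- Starting token of each pair -->
        <tokens>
            <token> " </token>
            <token> ' </token>
        </tokens>
        <!-- Ending tokens, listed in the same order as the starting tokens -->
        <tokens2>
            <token> " </token>
            <token> ' </token>
        </tokens2>
    </tokenset>
</language>
```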

gbXML language definition files also support special, single-token scope definitions - where a single token is used to define the start of a scope and the end of the line of text defines the end of the scope. In such cases, only the tokens element is needed - no tokens2 element is required.

For example, a single quote is used in Visual Basic to begin a comment. The end of the line defines the end of the comment scope.
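A single-token scope for this case might be sketched as follows. The language name, tokenset name, id, and color are hypothetical; only the use of a lone tokens element with no tokens2 element comes from the description above.

```xml
<language name="vb">
    <tokenset name="Comments" id="comments" type="scope" forecolor="green">
        <!-- No tokens2 element: the scope runs from the single quote to the end of the line -->
        <tokens>
            <token> ' </token>
        </tokens>
    </tokenset>
</language>
```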

Sometimes, language elements may be embedded within one another. For example, a comment string may have a hypertext link embedded within it. The language definition files can be written to recognize such occurrences, applying formatting to the embedded elements that is different from the formatting applied to the enclosing element.

The validscope element is used to specify that a tokenset is to be valid (recognized) within another element. A tokenset may contain any number of validscope elements. Typically, only 2-4 validscopes are required to describe most languages. If no validscopes are enclosed in a tokenset element, the tokenset is treated as valid everywhere.

Here's an XML example showing how to indicate that a hyperlink should be valid within a string scope tokenset. In this case, note that the hyperlink token is defined as a regular expression.

<language name="java">
    <tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
    <tokenset name="Active Links" id="hyperlinks" type="scope" forecolor="red">
        <validscope name="String Tokens" />
        <tokens regexp="yes" >
            <token> https?://([\.~:?#=\w]+\w+)(/[\.~:?#=\w]+\w)* </token>

In this example, the hyperlink would be displayed as red text within a blue text string.
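With the elided elements filled in, the complete definition might look like the sketch below. The contents of the String Tokens tokenset and all closing tags are assumptions; the Active Links tokenset is copied from the listing above, including its type="scope" attribute.

```xml
<language name="java">
    <!-- Enclosing scope: text between double-quotes is shown in blue (assumed contents) -->
    <tokenset name="String Tokens" id="strings" type="scope" forecolor="blue">
        <tokens>
            <token> " </token>
        </tokens>
        <tokens2>
            <token> " </token>
        </tokens2>
    </tokenset>
    <!-- Embedded tokenset: recognized only inside the String Tokens scope -->
    <tokenset name="Active Links" id="hyperlinks" type="scope" forecolor="red">
        <validscope name="String Tokens" />
        <tokens regexp="yes" >
            <token> https?://([\.~:?#=\w]+\w+)(/[\.~:?#=\w]+\w)* </token>
        </tokens>
    </tokenset>
</language>
```

Note that validscope refers to the enclosing tokenset by its name attribute ("String Tokens") rather than by its id, as in the original listing.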