This file describes the format of the data files containing trees for the Netgraph server.
This description of the syntax of the FS file format was taken from the Prague Dependency Treebank 1.0 and has been updated to reflect the current state of the format as used in Netgraph.
FS files serve for encoding tree annotation of sentences in natural language. Each FS file contains a sequence of trees, which represent the sentences. Each node is described by a set of attributes.
The names and data types of particular attributes are not part of FS format. Rather, each FS file has a head that defines attributes for its tree nodes locally.
The non-terminal symbols are enclosed in "<" and ">" characters; terminal symbols or strings of terminal symbols are enclosed in double quotes. A C-like notation is used inside the quotes, thus "\t" means the character with code 9, i.e. HTAB. The character "\n" represents the end of line regardless of the platform, i.e. it matches not only a real "\n" in its C sense, but also "\r\n" (DOS-Windows EOL), or even "\r".
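The platform-independent meaning of "\n" can be illustrated with a small sketch. The names `EOL` and `split_lines` are hypothetical and are not part of Netgraph; the sketch only shows one way a reader of FS files might treat all three conventions alike.

```python
import re

# One alternation covering the three end-of-line conventions named above.
# "\r\n" comes first so that a DOS-Windows EOL is consumed as a single break.
EOL = re.compile(r"\r\n|\n|\r")

def split_lines(text):
    """Split text at every end of line, whatever platform wrote the file."""
    return EOL.split(text)
```

For example, `split_lines("a\nb\r\nc\rd")` yields four lines, one per letter.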
The unary postfix operators "*", "+" and "?" mean that the operand appears n times in a row, where n>=0 for *, n>0 for +, and n is 0 or 1 for ?.
In contexts where a non-terminal can be interpreted as a set, the binary operator "-" can be used. It denotes a difference of two sets.
The FS file contains a head with node attribute definitions, and a sequence of trees. Anything following the trees is considered a configuration for an editor and is ignored in Netgraph.
Netgraph only accepts files encoded in UTF-8.
An identifier is one of the main elements of the FS file syntax. It is a string of arbitrary characters that starts at its first character and ends before the first unescaped functional character. Functional characters can be part of an identifier when they are escaped by a backslash (the backslash used for escaping is not itself part of the identifier).
Note: The length of identifiers is limited; the limit depends on the usage. In Netgraph, an attribute name is limited to 30 bytes and an attribute value is limited to 5000 bytes.
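The identifier rule and the Netgraph length limits can be sketched as follows. This is only an illustration under stated assumptions: the set of functional characters is not listed in this section, so the `functional` parameter and its default value are placeholders, and the function names are hypothetical. The sketch also assumes that a backslash escapes whatever character follows it.

```python
NAME_LIMIT = 30      # bytes, Netgraph limit for an attribute name
VALUE_LIMIT = 5000   # bytes, Netgraph limit for an attribute value

def read_identifier(text, start=0, functional="=,|()[]"):
    """Read one identifier from text, beginning at index start.

    Returns (identifier, stop), where stop is the index of the first
    unescaped functional character, or len(text) if there is none.
    The default value of `functional` is a placeholder, not the real set.
    """
    out = []
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # keep the escaped character, drop the backslash
            i += 2
            continue
        if ch in functional:
            break
        out.append(ch)
        i += 1
    return "".join(out), i

def within_limit(identifier, limit):
    """Check a length limit in bytes of the UTF-8 encoding Netgraph requires."""
    return len(identifier.encode("utf-8")) <= limit
```

The limits are counted in bytes, so a name containing non-ASCII characters reaches `NAME_LIMIT` with fewer than 30 characters.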
The beginning of each file contains a head with definitions of the attributes which can appear in tree nodes. Each head line begins with the "@" character. A capital letter follows, denoting properties of the attribute, then a space and the attribute name. For example "@P m/lemma".
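A minimal sketch of recognising such a head line is given below. The names `HEAD_LINE` and `parse_head_line` are hypothetical. The sketch returns everything after the space as one string and makes no attempt to split anything that may follow the attribute name on the line.

```python
import re

# "@", one capital letter, one space, then the rest of the line.
HEAD_LINE = re.compile(r"@([A-Z]) (.+)")

def parse_head_line(line):
    """Return (property_letter, rest_of_line), or None if the line is not a head line."""
    match = HEAD_LINE.fullmatch(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Applied to the example above, `parse_head_line("@P m/lemma")` gives the property letter `P` and the name `m/lemma`.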
<definition-line> ::=
Property |
Description |
---|---|
More than one property can be defined for one attribute. The definition lines with all the properties need not follow each other in the file head. They must however fulfil the following constraints:
Only one @V attribute per file can be defined.
Only one @W attribute per file can be defined.
Only one @N attribute per file can be defined.
The @N property cannot be combined with other properties. Nevertheless, the @N attribute automatically has the properties @P and @O as well.
An attribute cannot be both @V and @L.
@L must be the last property defined for an attribute but it cannot be the only property of that attribute.
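The constraints listed above can be checked mechanically once the head has been read. The following is a sketch, not Netgraph's own validation: the function name `check_head` and its input shape (a list of property letter and attribute name pairs in file order) are assumptions made for illustration.

```python
def check_head(definitions):
    """Check the head constraints.

    definitions: list of (property_letter, attribute_name) pairs in file order.
    Returns a list of messages, one per violated constraint.
    """
    errors = []
    props = {}  # attribute name -> its property letters, in definition order
    for letter, name in definitions:
        props.setdefault(name, []).append(letter)

    # At most one attribute per file may carry each of @V, @W and @N.
    for letter in "VWN":
        owners = [name for name, letters in props.items() if letter in letters]
        if len(owners) > 1:
            errors.append(f"more than one @{letter} attribute: {owners}")

    for name, letters in props.items():
        if "N" in letters and set(letters) != {"N"}:
            errors.append(f"{name}: @N cannot be combined with other properties")
        if "V" in letters and "L" in letters:
            errors.append(f"{name}: cannot be both @V and @L")
        if "L" in letters:
            if set(letters) == {"L"}:
                errors.append(f"{name}: @L cannot be the only property")
            elif letters[-1] != "L":
                errors.append(f"{name}: @L must be the last property defined")
    return errors
```

Because the definition lines of one attribute need not be adjacent, the sketch first groups the letters by attribute name and only then applies the per-attribute rules.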
Trees are described in the usual parentheses notation, i.e. after the description of a node, the parenthesized comma-separated list of its sons (or their subtrees) follows. The order of the brothers is not significant, since the attribute with property @N is used for controlling the order of nodes.
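Since the order in which brothers are written carries no meaning, a reader of the file may want to reorder them by the @N attribute after parsing. The sketch below shows one way to do that. The in-memory node shape (a dict with `attrs` and `children`) and the name `sort_children` are assumptions made for illustration and are not prescribed by the format.

```python
def sort_children(node, order_attr):
    """Recursively reorder the sons of every node by the numeric value of order_attr.

    node: {"attrs": {name: value}, "children": [node, ...]}  (hypothetical shape)
    order_attr: name of the attribute defined with the @N property
    """
    node["children"].sort(key=lambda child: float(child["attrs"][order_attr]))
    for child in node["children"]:
        sort_children(child, order_attr)
    return node
```

The sketch relies on the @N attribute being present in every node, which follows from its automatic @O property.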
Besides the pure syntax, it is also necessary to check the relations between the element <attributes> and the definitions of the respective attributes in the head of the file. The constraints following from these relations are described below.
<node> ::=
The element <attributes> must fulfil the following constraints (based on the particular definition of attributes in the file head):
The attribute name is required for non-positional attributes.
If the attribute name is not present, the attribute the value belongs to must be determined: it is the first positional attribute whose definition in the head follows the definition of the last read attribute (positional or not).
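The rule for unnamed values can be sketched as a short lookup. The function name `resolve_positional` and the input shape are hypothetical. Two assumptions go beyond the text: each attribute occupies a single position in the head list, and when no attribute has been read yet in the node, the search starts at the top of the head.

```python
def resolve_positional(head, last_read):
    """Find the attribute that an unnamed value belongs to.

    head: list of (attribute_name, is_positional) pairs in head order
    last_read: name of the attribute read most recently in this node, or None
    Returns the attribute name, or None if no positional attribute follows.
    """
    names = [name for name, _ in head]
    # Assumption: with nothing read yet, start from the first definition.
    start = names.index(last_read) + 1 if last_read is not None else 0
    for name, positional in head[start:]:
        if positional:
            return name
    return None
```

With a head that defines `form` and `lemma` as positional, followed by a non-positional `afun` and a positional `tag`, an unnamed value read right after `afun` is assigned to `tag`.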
The identifier in the <attribute-name> element must equal the name of an attribute defined in the head.
No attribute can be read more than once.
The identifier representing a value of a numeric attribute can contain only a non-negative real number.
The value of a @L attribute must be one of the predefined values from the definition of the attribute.
Values of all obligatory attributes (with property @O) have to be defined.
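The remaining constraints can be checked per node once every value has been assigned to an attribute name. The following sketch is illustrative only. The name `check_node` and the shape of `head` are assumptions, the numeric attribute is taken to be the one defined with @N (and therefore obligatory), and plain decimal notation is assumed for numbers because the accepted notation is not stated here.

```python
import re

# Assumption: plain decimal notation without sign or exponent.
DECIMAL = re.compile(r"\d+(\.\d+)?")

def check_node(pairs, head):
    """Check one node against the head.

    pairs: list of (attribute_name, value) read from the node, names resolved
    head: {name: {"numeric": bool, "values": list or None, "obligatory": bool}}
          (hypothetical shape; missing keys count as False or None)
    Returns a list of messages, one per violated constraint.
    """
    errors = []
    seen = set()
    for name, value in pairs:
        spec = head.get(name)
        if spec is None:
            errors.append(f"{name}: not defined in the head")
            continue
        if name in seen:
            errors.append(f"{name}: read more than once")
        seen.add(name)
        if spec.get("numeric") and not DECIMAL.fullmatch(value):
            errors.append(f"{name}: {value!r} is not a non-negative number")
        allowed = spec.get("values")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} is not a predefined value")
    for name, spec in head.items():
        # The numeric (@N) attribute is obligatory automatically.
        if (spec.get("obligatory") or spec.get("numeric")) and name not in seen:
            errors.append(f"{name}: obligatory attribute is missing")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every violation in a node at once.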