FS File Format Description

This file describes the format of the data files containing trees for the Netgraph server.

This description of the FS file format syntax is taken from the Prague Dependency Treebank 1.0. It has been updated to reflect the current state of the format as used in Netgraph.

FS files encode tree annotations of natural-language sentences. Each FS file contains a sequence of trees, which represent the sentences. Each node is described by a set of attributes.

The names and data types of particular attributes are not part of the FS format. Rather, each FS file has a head that defines the attributes for its tree nodes locally.

Notes on Metasyntax

Non-terminal symbols are enclosed in "<" and ">" characters; terminal symbols and strings of terminal symbols are enclosed in double quotes. A C-like notation is used inside the quotes, so "\t" means the character with code 9, i.e. HTAB. The character "\n" represents the end of line regardless of the platform, i.e. it matches not only a real "\n" in its C sense, but also "\r\n" (DOS-Windows EOL), or even "\r".
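
As a minimal sketch of this end-of-line convention, a reader could map all three line endings to a single "\n" before applying the grammar. The function name normalize_eol is hypothetical and is not part of Netgraph.

```python
import re

def normalize_eol(text):
    # Replace DOS-Windows (CR LF) and lone CR line endings with LF, so that
    # the grammar's "\n" matches the end of line on every platform.
    # The alternation tries CR LF first, so a CR LF pair becomes one LF.
    return re.sub(r"\r\n|\r", "\n", text)
```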

The unary postfix operators "*", "+" and "?" mean that the operand appears n times in a row, where n>=0 for *, n>0 for +, and n is 0 or 1 for ?.

In contexts where a non-terminal can be interpreted as a set, the binary operator "-" can be used. It denotes a difference of two sets.

The FS File Structure

The FS file contains a head with node attribute definitions, and a sequence of trees. Anything following the trees is considered a configuration for an editor and is ignored in Netgraph.

<fs-file> ::=
<encoding-line>? <definition-line>+ "\n"+ (<tree> "\n")+ <editor-configuration>?
<encoding-line> ::=
"@E " <encoding>
<encoding> ::=
"utf-8"

Netgraph only accepts files encoded in UTF-8.
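
The top-level structure can be sketched as a simple line-based splitter. This is an illustration under stated assumptions, not Netgraph's actual reader: the function name split_fs_file is hypothetical, and the sketch assumes that the editor configuration starts at the first line after the trees that does not begin with "[". Each tree occupies exactly one line, because neither normal nor escaped characters may be "\n".

```python
import re

def split_fs_file(text):
    """Split FS file text into (encoding, definition_lines, tree_lines, editor_config)."""
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final end of line
    i = 0
    encoding = None
    if i < len(lines) and lines[i].startswith("@E "):
        encoding = lines[i][3:]
        i += 1
    definitions = []
    while i < len(lines) and lines[i].startswith("@"):
        definitions.append(lines[i])
        i += 1
    while i < len(lines) and lines[i] == "":
        i += 1  # empty lines separating the head from the trees
    trees = []
    while i < len(lines) and lines[i].startswith("["):
        trees.append(lines[i])
        i += 1
    # Assumption: whatever remains is the editor configuration, which Netgraph ignores.
    return encoding, definitions, trees, lines[i:]
```

The attribute names in any input passed to this function are defined by the file's own head; the sketch does not interpret them.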

Identifiers, Attribute Names and Values

An identifier is one of the main elements of the FS file syntax. It is a string of arbitrary characters that ends before the first unescaped functional character. A functional character can be part of an identifier when it is escaped by a backslash (the backslash used for escaping is not part of the identifier).

Note: The length of identifiers is limited; the limit depends on the usage. In Netgraph, an attribute name is limited to 30 bytes and an attribute value is limited to 5000 bytes.

<attribute-name> ::=
<identifier>
<attribute-value> ::=
<identifier>
<identifier> ::=
<identifier-character>+
<identifier-character> ::=
<normal-character> | <escaped-character>
<functional-character> ::=
"\" | "=" | "," | "[" | "]" | "|" | "<" | ">" | "!"
<normal-character> ::=
<any-character> - <functional-character> - "\n"
<escaped-character> ::=
"\" (<any-character> - "\n")
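
The identifier rules above can be sketched as a small scanner. The name read_identifier is hypothetical. The set of functional characters is copied from the grammar, and the sketch removes the escaping backslash as the text describes.

```python
# Functional characters exactly as listed in <functional-character>.
FUNCTIONAL = set('\\=,[]|<>!')

def read_identifier(text, pos=0):
    """Scan an identifier starting at text[pos].

    Returns (identifier, next_pos). Escaping backslashes are removed from
    the identifier; next_pos is the index of the character that ended it.
    """
    out = []
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # <escaped-character>: a backslash followed by anything except "\n".
            if i + 1 >= len(text) or text[i + 1] == "\n":
                break
            out.append(text[i + 1])
            i += 2
        elif ch in FUNCTIONAL or ch == "\n":
            break
        else:
            out.append(ch)
            i += 1
    return "".join(out), i
```

A caller would still need to reject an empty result, since the grammar requires at least one character, and to enforce the length limits that apply to its usage.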

Node Attributes Definition

The beginning of each file contains a head with definitions of the attributes which can appear in tree nodes. Each head line begins with the "@" character. A capital letter follows, denoting properties of the attribute, then a space and the attribute name. For example "@P m/lemma".

<definition-line> ::=
("@" <property> " " <attribute-name> "\n") |
("@L" " " <attribute-name> "|" <values> "\n")
<property> ::=
"K" | "P" | "O" | "N" | "V" | "W" | "H"
<values> ::=
<attribute-value> ("|" <values>)?
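
A minimal sketch of reading one definition line according to the grammar above follows. The function names are hypothetical, the line is expected without its trailing end of line, and the duplicate check reflects the rule that values of a list attribute cannot be repeated.

```python
def split_unescaped(text, sep):
    """Split text on unescaped occurrences of sep, removing escaping backslashes."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            current.append(text[i + 1])  # keep the escaped character only
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts

def parse_definition_line(line):
    """Return (property, attribute_name, values); values is None unless the property is L."""
    if len(line) < 4 or line[0] != "@" or line[2] != " ":
        raise ValueError(f"not a definition line: {line!r}")
    prop, rest = line[1], line[3:]
    if prop == "L":
        name, *values = split_unescaped(rest, "|")
        if not values:
            raise ValueError("a list attribute needs at least one value")
        if len(set(values)) != len(values):
            raise ValueError("list values must not be repeated")
        return prop, name, values
    if prop not in "KPONVWH":
        raise ValueError(f"unknown property: {prop!r}")
    return prop, rest, None
```

For the example from the text, parse_definition_line("@P m/lemma") returns ("P", "m/lemma", None).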

Properties



Property    Description
K           A key attribute. The word "key" does not really mean anything except "this has no specific properties".
P           A positional attribute. All other attributes require that their name is written before their value in the data (e.g. a/ord=7). Positional attributes do not. The name of a positional attribute is derived from the position of its value relative to the preceding values (see the section "A Node" below).
O           An obligatory attribute. Its value has to be non-empty for every node (the empty string is the default value for all attributes). Thus the value must appear in the data.
L           A list attribute. Such an attribute can only have a value from a predefined list, or be empty. The values cannot be repeated in the definition of the list.
H           A hiding attribute. Nodes that have the value "true" in this attribute are hidden.
N           A numeric attribute (the value is a non-negative real number), specifying the order of the nodes in the tree from left to right. At most one such attribute can be defined per FS file.
W           Another numeric attribute. It denotes the order of words in the sentence. If it is not defined in the head, the attribute with property @N (which is obligatory) is used.
V           A value attribute. The linear form of the sentence is assembled from the values of this attribute, ordered according to the attribute with property @W. At most one such attribute can be defined per FS file.
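
How the @V and @W properties interact can be sketched as follows. This is an illustration only: the function name linear_form is hypothetical, nodes are assumed to be dictionaries mapping attribute names to string values, and joining the words with a single space is an assumption that the format description does not state.

```python
def linear_form(nodes, value_attr, order_attr):
    """Assemble the sentence from value_attr values, ordered by numeric order_attr.

    value_attr stands for the attribute defined with @V; order_attr stands
    for the attribute defined with @W, or with @N when no @W is defined.
    """
    ordered = sorted(nodes, key=lambda node: float(node[order_attr]))
    # Nodes with an empty (default) value contribute nothing to the sentence.
    words = [node[value_attr] for node in ordered if node.get(value_attr, "")]
    return " ".join(words)  # assumption: a single space separates the words
```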



More than one property can be defined for one attribute. The definition lines with all the properties need not follow each other in the file head. They must however fulfil the following constraints:

A Tree

Trees are described in the usual parenthesized notation, i.e. the description of a node is followed by a parenthesized, comma-separated list of its children (or their subtrees). The order of siblings is not significant, since the attribute with property @N controls the order of nodes.

<tree> ::=
<node> ("(" <children> ")")?
<children> ::=
<tree> ("," <children>)?
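
The <tree> and <children> productions can be sketched as a recursive-descent parser. The function names are hypothetical, and the sketch keeps each node as raw text, including any "|"-joined attribute sets, without interpreting its attributes.

```python
def read_node(text, pos):
    """Read one <node>, i.e. "[...]" optionally followed by "|[...]" repeats."""
    start = pos
    while True:
        if pos >= len(text) or text[pos] != "[":
            raise ValueError(f"expected '[' at position {pos}")
        pos += 1
        while pos < len(text) and text[pos] != "]":
            pos += 2 if text[pos] == "\\" else 1  # skip escaped characters whole
        if pos >= len(text):
            raise ValueError("unterminated attribute set")
        pos += 1  # step over the closing "]"
        if pos < len(text) and text[pos] == "|":
            pos += 1  # another attribute set belongs to the same node
            continue
        return text[start:pos], pos

def parse_tree(text, pos=0):
    """Parse a <tree> at text[pos]; return ((node_text, children), next_pos)."""
    node, pos = read_node(text, pos)
    children = []
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = parse_tree(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1
    return (node, children), pos
```

Because sibling order is not significant in the file, a consumer would typically sort each list of children by the attribute defined with @N after parsing.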

A Node

Besides the pure syntax, it is also necessary to check the relations between the element <attributes> and the definitions of the respective attributes in the head of the file. The constraints following from these relations are described below.

<node> ::=
<attribute-set> ("|" <node>)?
<attribute-set> ::=
"[" <attributes>? "]"
<attributes> ::=
<attribute> ("," <attributes>)?
<attribute> ::=
(<attribute-name> "=")? <values>
<values> ::=
<attribute-value> ("|" <values>)?
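
The syntax of a single attribute set can be sketched as shown here. The function names are hypothetical. The sketch covers the pure syntax only: it returns None as the name of an attribute written without one and does not resolve positional attributes or check values against the head.

```python
FUNCTIONAL = set('\\=,[]|<>!')

def read_identifier(text, pos):
    """Scan a non-empty identifier at text[pos]; return (identifier, next_pos)."""
    out, i = [], pos
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text) or text[i + 1] == "\n":
                break
            out.append(text[i + 1])  # the escaping backslash is dropped
            i += 2
        elif ch in FUNCTIONAL or ch == "\n":
            break
        else:
            out.append(ch)
            i += 1
    if not out:
        raise ValueError(f"empty identifier at position {pos}")
    return "".join(out), i

def parse_attribute_set(text):
    """Parse one "[...]" into a list of (name_or_None, [values]) pairs."""
    if len(text) < 2 or text[0] != "[" or text[-1] != "]":
        raise ValueError("an attribute set must be enclosed in '[' and ']'")
    inner = text[1:-1]
    if inner == "":
        return []
    attributes, pos = [], 0
    while True:
        first, pos = read_identifier(inner, pos)
        name = None
        if pos < len(inner) and inner[pos] == "=":
            name = first
            first, pos = read_identifier(inner, pos + 1)
        values = [first]
        while pos < len(inner) and inner[pos] == "|":
            value, pos = read_identifier(inner, pos + 1)
            values.append(value)
        attributes.append((name, values))
        if pos == len(inner):
            return attributes
        if inner[pos] != ",":
            raise ValueError(f"unexpected {inner[pos]!r} at position {pos}")
        pos += 1
```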

The element <attributes> must fulfil the following constraints (based on the particular definition of attributes in the file head):