FS File Format Description

This file describes the format of the data files containing trees for the Netgraph server.

This description of the FS file format syntax is taken from the Prague Dependency Treebank 1.0. It has been updated to reflect the current state of the format as used in Netgraph.

FS files encode tree annotations of sentences in natural language. Each FS file contains a sequence of trees, which represent the sentences. Each node is described by a set of attributes.

The names and data types of particular attributes are not part of the FS format. Rather, each FS file has a head that defines the attributes of its tree nodes locally.

Notes on Metasyntax

Non-terminal symbols are enclosed in "<" and ">" characters; terminal symbols or strings of terminal symbols are enclosed in double quotes. A C-like notation is used inside the quotes, thus "\t" means the character with code 9, i.e. HTAB. The character "\n" represents the end of line regardless of the platform, i.e. it matches not only a real "\n" in its C sense, but also "\r\n" (DOS/Windows EOL), or even "\r".

The unary postfix operators "*", "+" and "?" mean that the operand appears n times in a row, where n>=0 for "*", n>0 for "+", and n is 0 or 1 for "?".

In contexts where a non-terminal can be interpreted as a set, the binary operator "-" can be used. It denotes a difference of two sets.

The FS File Structure

The FS file contains a head with node attribute definitions, and a sequence of trees. Anything following the trees is considered a configuration for an editor and is ignored in Netgraph.

<fs-file> ::= <encoding-line>? <definition-line>+ "\n"+ (<tree> "\n")+ <editor-configuration>?
<encoding-line> ::= "@E " <encoding>
<encoding> ::= "utf-8"

Netgraph only accepts files encoded in UTF-8.
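
The overall layout can be illustrated with a minimal Python sketch that splits the raw bytes of a file into the encoding, the head lines and the tree lines. The function name split_fs_file is hypothetical, and the sketch assumes that every tree occupies a single line starting with "[" and that the first line which does not start with "[" begins the editor configuration; the grammar above does not state the latter explicitly.

```python
import re


def split_fs_file(data):
    """Split raw FS file bytes into (encoding, head_lines, tree_lines).

    Assumption: a tree line starts with "[", and anything after the last
    such line is editor configuration and is dropped.
    """
    # "\n" in the grammar stands for "\r\n", "\r" or "\n".
    lines = re.split(r"\r\n|\r|\n", data.decode("utf-8"))
    i = 0
    encoding = None
    if lines and lines[0].startswith("@E "):
        encoding = lines[0][3:]
        i = 1
    head = []
    while i < len(lines) and lines[i].startswith("@"):
        head.append(lines[i])
        i += 1
    # One or more empty lines separate the head from the trees.
    while i < len(lines) and lines[i] == "":
        i += 1
    trees = []
    while i < len(lines) and lines[i].startswith("["):
        trees.append(lines[i])
        i += 1
    return encoding, head, trees
```

The attribute names used when trying this out (for example "form" and "ord") are illustrative only; real files define their own.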

Identifiers, Attribute Names and Values

An identifier is one of the main elements of the FS file syntax. It is a string of arbitrary characters that starts at its first character and ends before the first unescaped functional character. Functional characters can be part of an identifier when they are escaped by a backslash (the backslash used for escaping is not part of the identifier).

Note: The length of identifiers is limited; the limit depends on the usage. In Netgraph, an attribute name is limited to 30 bytes and an attribute value to 5000 bytes.

<attribute-name> ::= <identifier>
<attribute-value> ::= <identifier>
<identifier> ::= <identifier-character>+
<identifier-character> ::= <normal-character> | <escaped-character>
<functional-character> ::= "\" | "=" | "," | "[" | "]" | "|" | "<" | ">" | "!"
<normal-character> ::= <any-character> - <functional-character> - "\n"
<escaped-character> ::= "\" (<any-character> - "\n")
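
The identifier rules can be sketched in Python as follows. This is only an illustration under the grammar as listed here: the function name read_identifier is hypothetical, the set of functional characters is copied from the rule above, and the length limits mentioned in the note are not checked.

```python
# Functional characters exactly as listed in the grammar above.
FUNCTIONAL = set('\\=,[]|<>!')


def read_identifier(text, pos=0):
    """Read one identifier starting at pos; return (identifier, next_pos).

    The escaping backslash is dropped from the returned identifier.
    """
    chars = []
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == '\\':
            # An escaped character: any character except end of line.
            if i + 1 >= len(text) or text[i + 1] in '\r\n':
                raise ValueError("dangling backslash at offset %d" % i)
            chars.append(text[i + 1])
            i += 2
        elif ch in FUNCTIONAL or ch in '\r\n':
            # The identifier ends before the first functional character.
            break
        else:
            chars.append(ch)
            i += 1
    if not chars:
        raise ValueError("empty identifier at offset %d" % pos)
    return ''.join(chars), i
```

For instance, reading from the text a/ord=7 stops at the "=" character and yields the identifier a/ord.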

Node Attributes Definition

The beginning of each file contains a head with definitions of the attributes which can appear in tree nodes. Each head line begins with the "@" character. A capital letter follows, denoting properties of the attribute, then a space and the attribute name. For example "@P m/lemma".

<definition-line> ::= ("@" <property> " " <attribute-name> "\n") | ("@L" " " <attribute-name> "|" <values> "\n")
<property> ::= "K" | "P" | "O" | "N" | "V" | "W" | "H"
<values> ::= <attribute-value> ("|" <values>)?
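
A head line can be taken apart with a few lines of Python. The sketch below is an illustration only: the names split_unescaped and parse_definition_line are hypothetical, escaped characters are kept in their escaped form rather than decoded, and the attribute name a/afun used in the usage note is made up.

```python
PROPERTIES = set("KPONVWH")


def split_unescaped(text, sep):
    """Split text on sep, ignoring separators escaped by a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == '\\' and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair as is
            i += 2
        elif text[i] == sep:
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append(''.join(current))
    return parts


def parse_definition_line(line):
    """Return (property, attribute_name, values) for one head line."""
    line = line.rstrip('\r\n')
    if len(line) < 4 or line[0] != '@' or line[2] != ' ':
        raise ValueError("not a definition line: %r" % line)
    prop, rest = line[1], line[3:]
    if prop == 'L':
        name, *values = split_unescaped(rest, '|')
        if not values or len(set(values)) != len(values):
            raise ValueError("missing or repeated list values: %r" % line)
        return prop, name, values
    if prop not in PROPERTIES:
        raise ValueError("unknown property %r" % prop)
    return prop, rest, []
```

With this sketch, the line "@P m/lemma" yields ("P", "m/lemma", []), and a list definition such as "@L a/afun|Pred|Sb" yields ("L", "a/afun", ["Pred", "Sb"]).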

Properties

Property	Description
K	A key attribute. The word "key" does not really mean anything except "this has no specific properties".
P	A positional attribute. All other attributes require that their name is written before their value in the data (e.g. a/ord=7). Positional attributes do not. The name of a positional attribute is determined from the relative position of its value with respect to the previous values (see details below in the paragraph "A Node").
O	An obligatory attribute. Its value has to be non-empty for every node (the empty string is the default value for all attributes). Thus the value must appear in the data.
L	A list attribute. Such an attribute can only have a value from a predefined list, or be empty. The values cannot be repeated in the definition of the list.
H	A hiding attribute. Nodes that have value "true" in this attribute are hidden.
N	A numeric attribute (the value is a non-negative real number), specifying the order of the nodes in the tree from left to right. At most one such attribute can be defined per FS file.
W	Another numeric attribute. It denotes the order of words in the sentence. If it is not defined in the head, the attribute with property @N (which is obligatory) is used.
V	A value attribute. The linear form of the sentence is assembled from the values of this attribute; the values are ordered according to the attribute with property @W. At most one such attribute can be defined per FS file.

More than one property can be defined for one attribute. The definition lines with all the properties need not follow each other in the file head. They must however fulfil the following constraints:

Only one @V attribute per file can be defined.
Only one @W attribute per file can be defined.
Only one @N attribute per file can be defined.
The @N property cannot be combined with other properties. Nevertheless, the @N attribute automatically has the properties @P and @O as well.
An attribute cannot be both @V and @L.
@L must be the last property defined for an attribute, but it cannot be the only property of that attribute.
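
The constraints above can be checked mechanically once the head has been read. The following Python sketch is one possible reading of them, not Netgraph's actual validation: the function name check_head is hypothetical, and it takes the head as a list of (property, attribute name) pairs in file order.

```python
def check_head(definitions):
    """Return a list of constraint violations for the given head.

    definitions: (property, attribute_name) pairs in file order.
    """
    errors = []
    props = {}  # attribute name -> its properties in definition order
    for prop, name in definitions:
        props.setdefault(name, []).append(prop)

    # At most one attribute may carry each of @V, @W and @N.
    for unique in "VWN":
        owners = [name for name, plist in props.items() if unique in plist]
        if len(owners) > 1:
            errors.append("more than one @%s attribute: %s"
                          % (unique, ", ".join(owners)))

    for name, plist in props.items():
        if 'N' in plist and len(set(plist)) > 1:
            errors.append("%s: @N combined with other properties" % name)
        if 'V' in plist and 'L' in plist:
            errors.append("%s: both @V and @L" % name)
        if 'L' in plist:
            if plist[-1] != 'L':
                errors.append("%s: @L is not the last property" % name)
            if set(plist) == {'L'}:
                errors.append("%s: @L is the only property" % name)
    return errors
```

An empty result means that none of the listed constraints is violated; the attribute names passed in are whatever the file head defines.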

A Tree

Trees are described in the usual parenthesized notation: the description of a node is followed by a parenthesized, comma-separated list of its children (or their subtrees). The order of siblings is not significant, since the attribute with property @N controls the order of nodes.

<tree> ::= <node> ("(" <children> ")")?
<children> ::= <tree> ("," <children>)?
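
The tree rules lend themselves to a small recursive-descent reader. The Python sketch below is an illustration under simplifying assumptions: the names read_node and read_tree are hypothetical, a node is kept as raw text from "[" to the matching unescaped "]" (including any alternatives chained with "|"), and the attributes inside the node are not interpreted.

```python
def read_node(text, i):
    """Return (raw node text, next index) for the node starting at i."""
    start = i
    while True:
        if i >= len(text) or text[i] != '[':
            raise ValueError("expected '[' at offset %d" % i)
        i += 1
        while i < len(text) and text[i] != ']':
            # Skip an escaped character together with its backslash.
            i += 2 if text[i] == '\\' else 1
        if i >= len(text):
            raise ValueError("unterminated attribute set")
        i += 1  # step over the closing ']'
        if i < len(text) and text[i] == '|':
            i += 1  # another attribute set of the same node follows
            continue
        return text[start:i], i


def read_tree(text, i=0):
    """Return ((node, children), next index) for the tree starting at i."""
    node, i = read_node(text, i)
    children = []
    if i < len(text) and text[i] == '(':
        i += 1
        while True:
            child, i = read_tree(text, i)
            children.append(child)
            if i < len(text) and text[i] == ',':
                i += 1
                continue
            break
        if i >= len(text) or text[i] != ')':
            raise ValueError("expected ')' at offset %d" % i)
        i += 1
    return (node, children), i
```

Each tree is returned as a pair of the raw node text and the list of its child trees, so the nesting of the parentheses is preserved directly in the nesting of the result.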

A Node

Besides the pure syntax, it is also necessary to check the relations between the element <attributes> and the definitions of the respective attributes in the head of the file. The constraints following from these relations are described below.

<node> ::= <attribute-set> ("|" <node>)?
<attribute-set> ::= "[" <attributes>? "]"
<attributes> ::= <attribute> ("," <attributes>)?
<attribute> ::= (<attribute-name> "=")? <values>
<values> ::= <attribute-value> ("|" <values>)?

The element <attributes> must fulfil the following constraints (based on the particular definition of attributes in the file head):

The attribute name is required for non-positional attributes.
If the attribute name is not present, the attribute to which the value belongs must be determined: it is the first positional attribute whose definition in the head follows the definition of the last read attribute (positional or not).
The identifier in the <attribute-name> element must equal the name of an attribute defined in the head.
No attribute can be read more than once.
The identifier representing a value of a numeric attribute must be a non-negative real number.
The value of a @L attribute must be one of the predefined values from the definition of the attribute.
Values of all obligatory attributes (with property @O) have to be defined.
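
The rule for assigning unnamed values to positional attributes can be sketched in Python as follows. This is an illustration under stated assumptions: the names split_unescaped and resolve_attributes are hypothetical, the caller supplies the attribute names in head order and the set of positional names (including the @N attribute, which is positional automatically), values are kept in their escaped form, and alternatives separated by "|" are left inside the value unsplit. For the first value in a node, the sketch assumes the search starts at the beginning of the head.

```python
def split_unescaped(text, sep, limit=None):
    """Split text on unescaped sep, performing at most limit splits."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == '\\' and i + 1 < len(text):
            current.append(text[i:i + 2])
            i += 2
        elif text[i] == sep and (limit is None or len(parts) < limit):
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append(''.join(current))
    return parts


def resolve_attributes(attr_text, head_order, positional):
    """Map the text between '[' and ']' to {attribute name: value}."""
    result = {}
    last = -1  # head index of the last attribute read
    if not attr_text:
        return result
    for item in split_unescaped(attr_text, ','):
        parts = split_unescaped(item, '=', 1)
        if len(parts) == 2:
            name, value = parts
            if name not in head_order:
                raise ValueError("undefined attribute %r" % name)
            idx = head_order.index(name)
        else:
            value = parts[0]
            # First positional attribute defined after the last one read.
            idx = next((k for k in range(last + 1, len(head_order))
                        if head_order[k] in positional), None)
            if idx is None:
                raise ValueError("no positional attribute for %r" % value)
            name = head_order[idx]
        if name in result:
            raise ValueError("attribute %r read more than once" % name)
        result[name] = value
        last = idx
    return result
```

Checks for list values, numeric values and obligatory attributes are left out here; they would be applied to the returned mapping against the head definitions.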