CGIF (CG Interchange Format) is intended to be one of the standard notations for exchanging knowledge (click here for the current draft) and is the official textual notation for Conceptual Graphs (CGs).
The original textual notation for CGs is called CGLF (CG linear Form).
It was (and still is) used as input/output of many CG tools but with some
variations or extentions. These variations could have been
integrated into a unique grammar (e.g. the
CG grammar used by WebKB-1 includes
the main variations for representing relations from a concept node).
However, at ICCS'95, John Sowa (the creator of CGs) and other CG researchers
decided that a new notation should be cooperatively designed
by the members of the CG community (via the CG mailing list cg@cs.uah.edu)
in order to involve everyone and define a minimal abstract syntax and
notation that achieves the following
a) has a proper foundation with translations to
KIF and
permits extensions to be built upon;
b) permits fast interchange of CGs between CG tools.
The basis of the new notation was a syntax developed by Gerard Ellis
for a rapid parsing (and hence transfer) of simple CGs.
This notation is closer to KIF (or RDF) than CGLF: it uses a relation-based
lisp-like notation with variables to connect concept nodes.
Thus, except for simple graphs, CGIF is less readable than CGLF
which is graph-oriented and hence reduces the need for variables.
The argued gain of speed over CGLF has been used as a rationale for the adoption of the new notation. However, this gain has not been demonstrated and even if it exists, it is unlikely to be sufficiently significant to be used as a rationale. On our machine, our Flex&Bison parser for CGIF (see below) takes 1.80s in average to parse the 277,666 lines of the WebKB-2 ontology in CGIF and 1.25s in average to parse the 210,382 lines of the same ontology represented in FT, a notation we derived from CGLF to (re)present links between categories. (In these tests, the "parsing" did not include semantic checks nor the construction of objects in the knowledge base: with FT, this takes 14 to 24s depending on the checking level).
By "ontology", we mean the list of
categories (concept types, relation types
and individuals) and links between them.
The current CGIF draft does not have much detail about type declaration.
It specifies that
TypeHierarchy
constructs
may be used for storing definitions and subsumption links between concept types,
that
RelationHierarchy
constructs
may be used for storing definitions and subsumption links between relation types,
and that CatalogOfIndividuals
constructs
may be used for storing delaration/definitions of individuals.
We assume that within these constructs, and only within them, categories
that have not yet been declared (i.e. "forward references") are allowed.
For readability reasons and because categories in WebKB-2 may have many kinds of links (not just subsumption and equality links), we cannot re-use these constructs. The various kinds of links connected from a category have to be presented together, in a concise way, and as "links": separated type definitions for representing links other than subsumption links cannot be used (e.g. type definitions do not permit the presentation of links via clean indented lists and hence increase the complexity of the exploration of links in a large ontology and its design or update; use the "category search" tool to convince yourself of this). Furthermore, in WebKB-2, categories and links have additional information associated to them, e.g. (the identifiers of) their creators.
Thus, CGIF needs to be extended. The best way would be a special syntax for links between categories, and hence an adaptation, extension or replacement of the above cited constructs. FT, the syntax we designed for (re)presenting links between categories in WebKB-2, is an extension of the way that subsumption links can be represented in CGLF. Many people in the CG community would like to keep representing links this way, probably because it is clear and concise. A different notation for links also reflects the difference in the CG data model between the ontology and the (other) graphs, differences which make the exploitation of CGs much easier (click here if you want to see the data model of WebKB-2). Hence, we think the best extension would be to integrate FT into CGIF.
However, since FT is already proposed by WebKB-2 and not yet integrated in CGIF, we must adopt another way to offer an alternative to developers of CGIF parsers. Since we need a graph-like structure to replace the restrictive special constructs, the only alternative we found is to use second-order graphs (i.e. graphs with types in the referent fields) despite the fact that the CGIF current standard does not give details or examples using them.
For the concept types and relation types used in these second-order graphs, we could use the identifiers of second-order types in the ontology of WebKB-2. But then, a CGIF parser would have no (simple) way to know that these graphs are second-order graphs and that yet undeclared types (i.e. "forward references") should be allowed in the referent fields of concept nodes. For this reason, and also to improve readability, we have chosen to introduce reserved types. Here is an example.
//Original version in the FT notation: pm#thing__top_concept_type (^type of things that are not relations^) 29/11/1999 ^ rdfs#class, > {(pm#situation pm#entity)} pm#thing_playing_some_role, ! pm#relation; //Reminder: '^' means "instance of", '!' means "has for subtype", '!' means "exclusive with", // "{(...)}" represents a close partition of exclusive subtypes // "{...}" represents an open partition of exclusive subtypes //Translation in CGIF without reserved types: [pm#type: pm#thing *x ;type of things that are not relations;] (pm#creator ?x spamOnly@phmartin.info) (pm#creation_date ?x 29/11/1999) (pm#name ?x "thing") (pm#name ?x "top_concept_type") (pm#kind ?x rdfs#class) (pm#subtype ?x {pm#situation pm#entity}) (pm#subtype ?x pm#thing_playing_some_role) (pm#exclusive_class ?x pm#relation) //Translation in CGIF with reserved types: [TYPE: pm#thing *x ;type of things that are not relations;] (CREATOR ?x spamOnly@phmartin.info) (CREATION_DATE ?x 29/11/1999) (NAME ?x "thing") (NAME ?x "top_concept_type") (KIND ?x rdfs#class) (GT ?x {pm#situation pm#entity}) (GT ?x pm#thing_playing_some_role) (EXCL ?x pm#relation)
GT, LT
and EQ
are already reserved types in the
standard but must be used within the above cited constructs.
TYPE
and KIND
have always been used by
John Sowa to denote the top-level of second-order types and the "instanceOf"
relation. NAME
is also usual. For the signature of relations, we use
DOMAIN
and RANGE
, the terms used in RDF
(DOMAIN
and RANGE
are always used when declaring
relation types and only in that case).
To indicate the category creator and its creation date, we use
CREATOR
and CREATION_DATE
(note: for readability,
WordNet is never specified as creator; it is the default creator).
Here are 4 other reserved types for common links (i.e. second-order relation)
that we need to add: EXCL, REVERSE, COMPLEMENT, SIMILAR
(these links are explained in the
documentation of WebKB-2).
The ontology of WebKB-2 has other kinds of links (most are inherited from
WordNet)
but they are simple shortcuts for defining relations between instances of the
connected types. The semantics of links in WordNet
is not very consistent but generally, a link L from a type X to a type Y means
that for most object x instance of X, x has a relation of type L to
some individual of type Y. For example,
#bird p #wing
means that most birds have wings.
(In FCG, this can be written as [most #bird, pm#part: a #wing]
).
The second-order relation types used in the part of WordNet included in
the ontology of WebKB-2 (i.e. that part relative to categories representing
the meanings of nouns) are:
pm#location, pm#substance, pm#part, pm#wnMember, pm#wnObject
.
(pm#wnMember
is a membership relation that can have anything as its
source, not just a collection; pm#wnObject
can be used to represent
that some objects belong to a domain). WebKB-2 also has a link for
pm#url
, i.e. for connecting a category to a home page about the
kind of object represented by the category. Other such "shortcut" links may be
added in the future. Given their specificity, rather than introducing a reserved
type for each of them, we prefer to introduce only one to cover all of them:
LINK
.
Here are examples:
[TYPE: #table *x] (NAME ?x "table") (LT ?x #array) (GT ?x #correlation_table) (GT ?x #periodic_table) //plus many others in WebKB-2 (LINK pm#wnMember ?x #row/3) (LINK pm#wnMember ?x #column/3) [TYPE: #table/2 *x] (LT ?x #furniture) (GT ?x #card_table) (GT ?x #coffee_table) //plus many others in WebKB-2 (LINK pm#part ?x #leg/3) (LINK pm#part ?x #tabletop) [TYPE: pm#Web-scripting_language *x] //just an example; more details in WebKB-2 (CREATOR ?x spamOnly@phmartin.info) (LT ?x #programming_language) (GT ?x pm#style_sheet_language) (GT ?x pm#Javascript) (LINK pm#url ?x http://www.phmartin.info/int3004/) //In WebKB-2, pm#Javascript is a type (not an individual) and pm#Javascript_1.0 is one of its subtypes //The Javascript parsed by the browser on the machine that you currently use //may be considered as an instance. //However, the WebKB-2 parsers accept graphs that use certain types as if they were individuals //(the condition is that these types must be a subtype of pm#information_entity)
When a link has a creator that is different from the creator of its source
category, there is a need to specify the creator of this link. The only
sufficiently readable way we found was to use ternary relations with reserved
types: NAME_BY, GT_BY, LT_BY, EQ_BY, TYPE_BY,
KIND_BY, EXCL_BY, REVERSE_BY, COMPLEMENT_BY, SIMILAR_BY and LINK_BY
.
(Note: this does not go against the fact that, for knowledge use and re-use
purposes, non-binary relations should not be used in WebKB-2 because these
ternary relations can be translated back into adequate representation in WebKB-2).
For names, in addition to the creator, it is interesting to permit the
specification of the linguistic community (e.g. #French, #English,
#CG_English_terminology
)
since it cannot be guessed with certainty from information attached to
the creator (the default community is #English
). Examples:
[TYPE: sowa#object *x (^see http://users.bestweb.net/~sowa/ontology/toplevel.htm^)] (CREATOR ?x sowa@bestweb.net) (NAME ?x "object") (NAME ?x "physical_independent_continuant") (NAME_BY ?x "independent_spatial_entity" spamOnly@phmartin.info) (LT_BY ?x pm#spatial_entity spamOnly@phmartin.info) (LT ?x sowa#actuality) (LT ?x sowa#continuant) [TYPE: #object *x] (NAME ?x "object") (NAME ?x "physical_object") (NAME_BY_IN ?x "objet" spamOnly@phmartin.info #French) (NAME_BY_IN ?x "chose" spamOnly@phmartin.info #French) (LT ?x #entity) (LT_BY ?x pm#physical_entity_that_cannot_be_alive spamOnly@phmartin.info #French) (GT ?x {#natural_object #artifact}) //plus many others in WebKB-2
Recall that we use a graph syntax for links only to give an alternative to a FT-like syntax. We assume that links are encoded differently from graphs in the data model (as is currently the case in CG workbenches). Thus, the above graphs would not be stored as such; the reserved link names would only have to be known to the CGIF parsers.
As the above examples show, for readability reasons, we have also extended
CGIF in other ways: (i) there are no concept node around individual markers,
strings, dates and partitions because there is no ambiguity about the types
of these objects, (ii) we have kept the FT syntax for dates and for
open/close subtype partitions. CGIF does not have a special format for dates
and therefore no standard way to represent (and hence exchange) them.
For partitions, if we had followed the current syntax of CGIF, we would have
used [TYPE: {#natural_object #artifact}]
instead of
just {#natural_object #artifact}
.
But what would we have done if the partition had been closed:
{(#natural_object #artifact)}
.
In CGIF, there is no standard way to distinguish between:
- open collection, close collection (meaning that all the elements have been given);
- bag, set (the usual set), xor-set, or-set, list (ordered set), ordered bag;
- distributive interpretation, collective interpretation (as with "together" in English).
Given the pervasiveness of collections in knowledge representation, the
absence of distinctions implies that the parsing and exploitation of
knowledge in CGIF is domain-dependant and source-dependant (which is what
an interchange format is intended to avoid).
On the 23/08/2001, I sent an
e-mail to the CG list about this problem
and the lack of reserved keywords for distinguishing various kinds
of quantifiers.
CGIF is intended to be basic but to permit the definition of extensions. However, how these extensions are to be made is not yet defined in the standard. Some extensions could be made via type definitions, rules or second-order graphs but each of these possibilities greatly complicates the parsing, checking, and exploitation of the knowledge, and presents theoretical problems in general cases. Furthermore, since many extensions are needed, we expect that each knowledge provider will define its own extensions and that the extensions defined by different people will be difficult to compare/combine automatically. Standard reserved types/keywords with well defined meanings constitute a simple way to avoid the use of ad-hoc extensions.
We welcome suggestions to improve the above proposed format, i.e. the way WebKB-2 presents links between categories in CGIF.
Though integrating FT in CGIF remains a better solution than the above "patches", the extension of the CGIF grammar to permit the omission of concept nodes around objects when there is no ambuiguity about their types seems important to improve readability (and hence knowledge representation/presentation/update and tool debugging). Objects such as dates and "robust" strings should also be added (by robust strings, we mean strings with unusual delimiters to permit the storage of most textual/binary data without need for escaping it). The syntax of identifiers should also be extended to permit import/export from various languages and ontologies (at least URLs and e-mail addresses should be accepted as identifiers). The distinction between annotations and comments should be made: annotations are stored in the knowledge base with the object they are associated to while comments are discarded by the parsers. Line and multi-line Java/C++ comments should be accepted.
We show the CGIF grammar in EBNF and then with the Lex&Yacc syntax,
with the extensions we propose highlighted in bold.
Note: in WebKB-2, identifiers may begin by '#'. Given our extension,
indexicals may and should be removed (below it is kept in the grammar but the
lexical analyser interpret "#foo" as an identifier, not an indexical).
Extended BNF grammar for CGIF
("?" means 0 or 1 times, "*" means 0 to N times, "+" means 1 to N times)
CG := (Concept | Relation | Actor | SpecialContext)+ SpecialContext:= "~[" CG "]" | "[" SpecialType ":"? CG "]" SpecialType := "if" | "then" | "either" | "or" | "sc" Actor := "<" Type(N) ArcConcept* "|" ArcConcept* Annotation? ">" Concept := "[" ConceptType? (CorefLinks? Referent?)+ Annotation? "]" Relation := "(" RelationType ArcConcept* Annotation? ")" ConceptType := Identifier | CT_expression CT_expression := "(" "lambda" "(" CT_parameter ")" CG ")" | CT_disjuncts CT_parameter := ConceptType DefLabel? CT_disjuncts := "(" CT_conjuncts ("|" CT_conjuncts)* ")" CT_conjuncts := CT_term ("&" CT_term)* CT_term := "~"? ConceptType RelationType := Identifier | RT_expression RT_expression := "(" "lambda" "(" RT_signature ")" CG ")" | RT_disjuncts RT_signature := RT_parameter ("," RT_parameter)* RT_parameter := RelationType DefLabel? RT_disjuncts := "(" RT_conjuncts ("|" RT_conjuncts)* ")" RT_conjuncts := RT_term ("&" RT_term)* RT_term := "~"? RelationType CorefLink := DefLabel | BoundLabel* Referent := ":" (Designator? Descriptor?)+ Descriptor := Structure | CG Structure := StructID? Collection Designator := Literal | Locator | Quantifier Literal := Number | Quoted2_str Quantifier := "@" (UnsignedInt | Identifier Collection?) Locator := Name | IndividualMarker | Indexical | Identifier Collection := "{" ArcConcept* "}" | "{(" ArcConcept* ")}" ArcConcept := Concept | BoundLabel | Collection | Point_in_time | Time | Literal | Locator | "@" (Identifier Collection) //Lexical elements (some LEX syntax is used): StructID := '%' Identifier DefLabel := "*" Identifier BoundLabel := "?" Identifier IndividualMarker:= "#" UnsignedInt Indexical := "#" Identifier? Digit := [0-9] UnsignedInt := Digit+ Letter := [a-zA-Z_] Identifier := IdLetter1 IdLetter* IdLetter1 := Letter | "#"Letter | "#." IdLetter := Letter | Digit | "#" | "_" | "-" | "/" | "?" | "&" | "~" | "."?[a-z0-9?#~] //"." may be within an identifier but not at the end | "://" //a URL may be an identifier Number := ("+"|"-")? Digit+ ("." Digit* )? Point_in_time := Date" "Time? Date := ([0-9][0-9]?"/")?[0-9][0-9]?"/""-"?[0-9][0-9]?[0-9]?[0-9]? Time := [0-9][0-9]?":"[0-9][0-9](":"[0-9][0-9])? Name := DelimitedStr("'") //'...' (as in the standard) Usual_string := DelimitedStr('"') //"..." (as QuotedStr in the standard) Robust_string := QuotedStr := "$("([^\)]|(\)[^$]))*")$" //$(...)$ QuotedStr := Usual_string | Robust_string
Yacc + Lex grammar for CGIF
%token IDENTIFIER NUMBER QUOTED1_STR QUOTED2_STR ROBUST_STR INDIVIDUAL_MARKER %token INDEXICAL DEF_LABEL BOUND_LABEL LAMBDA IF THEN EITHER OR SC %token QUANTIF_INT QUANTIF_ID STRUCT_ID QUANTIF_ID_BRACKET QUANTIF_ID_CCB %token ANNOTATION POINT_IN_TIME PERIOD CCB CCE %start CGIF %% CGIF : CG | CGIF CG; CG : CONCEPT | RELATION | ACTOR | SPECIAL_CTXT ; ACTOR : '<' RTYPE _ARCconcepts '|' _ARCconcepts _ANNOT '>' ; SPECIAL_CTXT: '~' '[' CG ']' | '[' SPECIAL_TYPE ':' CG ']' | '[' SPECIAL_TYPE CG ']'; SPECIAL_TYPE: IF | THEN | EITHER | OR | SC; CONCEPT : '[' _CTYPE _REFS _ANNOT ']'; RELATION : '(' RTYPE _ARCconcepts _ANNOT ')'; _REFS : | REF _REFS; REF : COREF_LINK | REFERENT; _CTYPE : | CTYPE; CTYPE : IDENTIFIER | CT_EXPR; CT_EXPR : '(' LAMBDA '(' CTYPE _DEF_LABEL ')' CG ')' | '(' CT_CONJUNCTS _CT_DISJS ')'; _CT_DISJS : | '|' CT_CONJUNCTS _CT_DISJS; CT_CONJUNCTS: CT_TERM _CT_CONJS; _CT_CONJS : | '&' CT_TERM _CT_CONJS; CT_TERM : CTYPE | '~' CTYPE; RTYPE : IDENTIFIER | RT_EXPR; RT_EXPR : '(' LAMBDA '(' RTYPE _DEF_LABEL _RT_PARAMS ')' CG ')' | '(' RT_CONJUNCTS _RT_DISJS ')'; _RT_PARAMS : | ',' RTYPE _DEF_LABEL _RT_PARAMS; _RT_DISJS : | '|' RT_CONJUNCTS _RT_DISJS; RT_CONJUNCTS: RT_TERM _RT_CONJS; _RT_CONJS : | '&' RT_TERM _RT_CONJS; RT_TERM : RTYPE | '~' RTYPE; COREF_LINK : DEF_LABEL | BOUND_LABEL; _DEF_LABEL : | DEF_LABEL; REFERENT : ':' _DESS; _DESS : | DESIG__DESCR _DESS; DESIG__DESCR: DESIGNATOR | DESCRIPTOR; DESCRIPTOR : STRUCTURE | CG; STRUCTURE : _STRUCT_ID COLLECTION; DESIGNATOR : LITERAL | LOCATOR | QUANTIFIER; LITERAL : NUMBER | QUOTED2_STR | ROBUST_STR; LOCATOR : NAME | INDIVIDUAL_MARKER | INDEXICAL | IDENTIFIER; NAME : QUOTED1_STR; QUANTIFIER : QUANTIF_INT | QUANTIF_ID | QUANTIF_COLL; QUANTIF_COLL: QUANTIF_ID_BRACKET _ARCconcepts '}' | QUANTIF_ID_CCB _ARCconcepts CCE; COLLECTION : '{' _ARCconcepts '}' | CCB _ARCconcepts CCE; _ARCconcepts: | ARCconcept _ARCconcepts; ARCconcept : CONCEPT | BOUND_LABEL | COLLECTION | LITERAL | LOCATOR | QUANTIF_COLL | POINT_IN_TIME | PERIOD; _ANNOT : | ANNOTATION; _STRUCT_ID : | STRUCT_ID; %%