Article published in the proceedings of WebNet 2000.
Abstract. The Resource Description Framework [RDF] provides a basic model to describe relationships between objects. Ultimately, it is intended to permit the representation, combination and processing of most kinds of metadata from Web-accessible documents or databases. However, except for representing simple metadata, its current XML-based syntax [RDF syntax] and the set of basic classes that have been defined [RDF schema] are insufficient. To make extensions, users are required to declare new classes in "schemas" or import schemas from other users. The problem is that similar or identical classes or features will probably be introduced by various users under different names or used in different ways, and this prevents the comparison, reuse and combination of the metadata. To maximize the reuse of metadata, we propose some lexical, structural and semantic conventions, inspired by various knowledge representation projects. These conventions would have to be agreed on and completed by the W3C committee.
Topics/Keywords. Data and Link Management, Metadata Representation/Retrieval/Reuse
The Resource Description Framework [RDF] "can be characterized as a simple frame model" [RDF syntax]. It is sufficiently low-level and general for most other knowledge representation models to be translated into. However, since it is low-level, there are many possible ways for such translations, and there are many ways for users to represent the same fact. These various ways are not comparable and therefore, without additional features and conventions, RDF cannot support metadata exchange and reuse. As noted by [Berners-Lee et al., 1999], "work is needed to define common terms for [extending RDF to the power of usual knowledge representation systems]".
RDF currently has constructs for representing simple existential graphs plus some kinds of sets, contexts and term declaration. [Berners-Lee, 1999] proposes some additional constructs for representing universal quantification. RDF users can declare new classes to introduce additional features, and give some constraints (essentially via relation signatures) but no real class "definition" is as yet possible.
A usual concern is that an increase in expressiveness leads to a model and a language too complex to handle efficiently. In actual fact, like other XML agents, the RDF/XML analyzers will exploit some terms (those representing features they know how to handle) and ignore others. Thus, the level of complexity dealt with is not determined by the language but chosen by the user via the selected analyzer. Some applications require the exploitation of rules, sets, logical negation and contexts, whereas simple structure matching may be sufficient for a search engine.
In Section 2, we propose lexical, structural and semantic general conventions, synthesizing conventions from various knowledge representation communities. In Section 3, we propose ways to apply and extend RDF/XML in various logical cases of knowledge representation.
These conventions apply not only to RDF but also to any knowledge representation language that can be translated into a directed graph model such as RDF. Rather than using RDF terminology, we use the more intuitive terminology of Conceptual Graphs [CGs]. A "concept" refers to a node ("resource" in RDF; it may represent one or several objects). A "relation" ("property" in RDF) refers to a relationship between concepts. A "class" refers to a certain kind of concept or relation. An "ontology" (or "schema") is a set of class declarations.
InterCap style for identifiers. Identifiers in RDF/XML must be legal XML names. The "InterCap style" has been adopted, with a lower case first letter for relation classes [RDF syntax (appendix C)] -- as in rhetoricalRelation and subClassOf -- and an upper case first letter for concept classes [RDF schema (Section 1.2.2)] -- as in TaxiDriver.
High-level lexical facilities. To reduce lexical problems and promote metadata reuse, high-level languages or query interfaces should provide lexical facilities for the user. For instance, language analyzers could automatically normalize identifiers that include uppercase letters, dashes or underscores into the InterCap style, as well as exploit user-defined aliases. Such analyzers should also accept queries or representations that use undeclared class names (e.g. common words) when the relevant class names can be automatically inferred via the structural and semantic constraints in the queries or representations and the ontologies they are based upon. When different interpretations are possible, the user should be asked to make a choice. This last facility, detailed in [Martin & Eklund, 1999], is particularly interesting when the exploited ontologies reuse a natural language lexical database such as WordNet [WN]: it spares the user the complex (and tedious) work of declaring and organizing each term used. This facility (along with high-level notations and interfaces) seems an essential step to encourage (human) Web users to build knowledge representations. Similar ideas for the exploitation of lexical databases such as WordNet are developed in Ontoseek [Guarino et al., 1999].
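The identifier normalization mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name to_intercap, its parameters and the alias table are hypothetical and do not come from any RDF tool.

```python
import re
from typing import Dict, Optional


def to_intercap(identifier: str,
                is_relation: bool = False,
                aliases: Optional[Dict[str, str]] = None) -> str:
    """Normalize an identifier to the InterCap style (hypothetical helper).

    Words are split on dashes, underscores, whitespace and lower-to-upper
    case boundaries. Relation classes get a lower-case first letter,
    concept classes an upper-case one. User-defined aliases, if given,
    are applied to the normalized name.
    """
    # Mark lower/digit-to-upper boundaries: "taxiDriver" -> "taxi Driver".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier)
    words = [w for w in re.split(r"[-_\s]+", spaced) if w]
    if not words:
        raise ValueError("empty identifier")
    name = "".join(w[0].upper() + w[1:].lower() for w in words)
    if is_relation:
        name = name[0].lower() + name[1:]
    return (aliases or {}).get(name, name)


print(to_intercap("taxi_driver"))                     # concept class
print(to_intercap("sub-class-of", is_relation=True))  # relation class
```

Inferring undeclared class names from an ontology is a much harder problem and is not covered by this sketch.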
Nouns for identifiers.
The convention of using nouns, compound nouns or verb nominal forms
whenever possible within representations not only makes them more explicit
but also reduces the number of lexical and structural ways in which they
may be expressed.
Concept classes referred to by adjectives can rarely be organized by
generalization relations but may be decomposed into concept classes
referred to by nouns.
Concept classes referred to by verbs can be organized by generalization relations
but cannot be inserted into the hierarchy of concept classes referred to by nouns
(and therefore cannot be compared with them) unless verb nominal forms are used.
These nominal forms, e.g. Driving, are also a reminder of the
need to represent the time frame or frequency of the processes referred to.
Additionally, they are consistent with the use of various kinds of
quantifiers, e.g. it is possible to speak about "any abstract_entity"
and "at least 3 transformations" but not about "any abstract" or "at least 3 transform".
Most identifiers in current ontologies are nouns (e.g. the
Dublin Core [DC] or the
Upper Cyc Ontology [CYC]),
even in relation class ontologies such as the
Generalized Upper Model relation hierarchy.
Avoiding adverbs for relation names is sometimes difficult, e.g. for
spatial/temporal relations.
What should be avoided is the introduction of relation names such as
isDefinedBy and seeAlso (both proposed in [RDF schema]).
Better names are Definition and AdditionalInformation.
Singular nouns for identifiers. Most identifiers in ontologies are singular nouns. Category names must be in the singular in the Meta Content Framework Using XML [MCF/XML]. For the sake of normalization, it is therefore better to avoid the use of plural identifiers whenever possible, e.g. by using "distributive sets" (that is by using the RDF keyword "aboutEach" instead of "about" whenever possible).
Binary basic relations.
As with most frame-based models, RDF only has binary and unary relations.
Relationships of greater arity may still be represented by using
structured objects or collections, or using more primitive relations.
For instance, "the point A is between the points B and C" may be represented
using the relation between and a collection object grouping B and
C, or using the relations left and right, above and
under, etc. Most often, decomposition makes a representation more
explicit, precise and comparable with other representations.
Thus, relations should refer to simple/primitive relationships because complex
relationships cannot be compared without special rules being added.
As a rule of thumb, relations should not refer to processes and should --
whenever possible -- be named with simple "relational nouns", e.g. part
and instrument. Complex relational nouns such as child
and driver imply additional lexical or structural facilities
(e.g. those of Ontoseek).
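As an illustration of such a decomposition, the following Python sketch represents "A is between B and C" with two primitive binary relations. The Triple type, the helper names and the reading of a triple as "source has for relation destination" (on a horizontal axis) are assumptions made for this example only.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    source: str
    relation: str
    destination: str


def between_as_binary(middle: str, first: str, second: str) -> Set[Triple]:
    """Decompose 'middle is between first and second' into two binary facts."""
    return {
        Triple(middle, "left", first),    # middle has for left: first
        Triple(middle, "right", second),  # middle has for right: second
    }


def is_between(facts: Set[Triple], middle: str, a: str, b: str) -> bool:
    """True if the facts place `middle` between `a` and `b`, in either order."""
    return (between_as_binary(middle, a, b) <= facts
            or between_as_binary(middle, b, a) <= facts)


facts = between_as_binary("A", "B", "C")
print(is_between(facts, "A", "C", "B"))
```

Because the facts are primitive, the same triples can also answer other queries (e.g. what lies to the left of A) without any rule specific to a "between" relation.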
Avoid disjunctions, negations and collections.
Representations including disjunctions, negations or collections are
generally less efficiently exploitable for logical inferencing than
conjunctive existential formulas and IF-THEN rules based on these formulas
[BRML].
It is often possible to avoid
disjunctions and negations without loss of expressivity using IF-THEN rules
or by exploiting class hierarchies. For instance, instead of writing
that an object X is an instance of DirectFlight OR of
IndirectFlight, it is better to declare X as an instance of a
class Flight that has DirectFlight and IndirectFlight as
exclusive subclasses (i.e. classes that cannot have common subclasses or instances).
Exclusion links between classes (or between whole formulas) are
kinds of negations that can be handled efficiently, and are included in
many expressive but efficient logic models, e.g. Courteous logic on which
the Business
Rules Markup Language [BRML] is based.
The introduction of identifiers for collections may also often be avoided
using "distributive collections", i.e. in RDF by using the keyword "aboutEach".
Distributive collections are often easy to handle since
they can be considered as syntactic shortcuts for representing relations
about each of their members. Class definitions describing typical or necessary
relations associated with the class instances are also a way of
representing facts about collections of objects that knowledge representation
systems generally handle more efficiently than if these relations were
directly represented using (real) collections inside other assertions.
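The view of a distributive collection as a syntactic shortcut can be sketched as follows. The function name and the sample data are hypothetical; the sketch only assumes that a statement is a (source, relation, destination) triple.

```python
def expand_about_each(members, relation, destination):
    """Rewrite one statement about each member of a collection as one
    (source, relation, destination) triple per member."""
    return {(member, relation, destination) for member in members}


# Hypothetical data: two pages that share the same creator.
pages = ["page1", "page2"]
distributive = expand_about_each(pages, "creator", "Fred")
individual = {("page1", "creator", "Fred"), ("page2", "creator", "Fred")}
print(distributive == individual)  # prints True
```

After expansion, no identifier for the collection remains, so the result is directly comparable with statements that were asserted member by member.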
Precision, term definitions and constraints.
The more precise the representation, the less chance of conflict with another.
The more primitive its components, the more likely the representation can
be cross-checked and compared with others to respond to queries. Representations
should be contextualized in space, time and author origin. No relevant
concepts should be left implicit.
It is stated in [RDF syntax,
Section 2.3] that for some uses, writing property values without qualifiers
is appropriate, e.g. "the price of that pencil is 75" instead of "the price
of that pencil is 75 U.S. cents". However, a representation of the first
sentence would be ambiguous, not comparable with other prices. This violates
the original purpose of RDF.
To improve precision and allow consistency checks, it is important to
use precise classes and associate constraints about their use.
At the very least, a signature should be associated with each relation class,
and exclusion between classes should be represented.
To improve the retrieval of classes or representations
using them, it is important that these classes specialize commonly used classes.
One way to do this is to specialize classes from a natural language ontology
such as WordNet with domain-oriented classes.
Extending such an ontology is often quicker and safer than creating
an ontology from scratch; it also makes the representations more
reusable and allows automatic comparison with other representations
based on the same ontology. These issues are discussed
and implemented in
[LOOM].
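The kind of consistency check that signatures and exclusion links make possible can be sketched in Python. The small ontology below (class names, the "destination" relation and its signature) is entirely hypothetical, and a single-inheritance hierarchy is assumed to keep the sketch short.

```python
# Hypothetical ontology: class -> direct superclass.
SUBCLASS_OF = {
    "DirectFlight": "Flight",
    "IndirectFlight": "Flight",
    "Flight": "Thing",
    "Airport": "Thing",
}
# Pairs of classes declared exclusive (no common subclass or instance).
EXCLUSIONS = {
    frozenset({"DirectFlight", "IndirectFlight"}),
    frozenset({"Flight", "Airport"}),
}
# Relation signatures: relation -> (source class, destination class).
SIGNATURES = {"destination": ("Flight", "Airport")}


def superclasses(cls):
    """Return cls followed by all its ancestors."""
    result = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result


def is_a(cls, ancestor):
    return ancestor in superclasses(cls)


def are_exclusive(a, b):
    """Exclusion is inherited: it holds if any two ancestors are exclusive."""
    return any(frozenset({x, y}) in EXCLUSIONS
               for x in superclasses(a) for y in superclasses(b))


def respects_signature(relation, source_cls, dest_cls):
    source_type, dest_type = SIGNATURES[relation]
    return is_a(source_cls, source_type) and is_a(dest_cls, dest_type)


print(respects_signature("destination", "DirectFlight", "Airport"))
print(are_exclusive("DirectFlight", "Airport"))
```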
We now propose some extensions to the RDF syntax or basic set of classes. Following our conventions, the classes in our examples have WordNet nouns for names.
RDF supports the representation of typed individuals, existentially quantified variables, and relations between them. The only extension that seems convenient at this stage is a special relation property (named dir for instance) to indicate that the direction of the relation is reversed. Example:
Representing the same information without this property dir would require more (and smaller) graphs or the use of relations such as partOf, homePageOf and employer. To permit comparisons of graphs, these relations would have to be declared as inverses of part, homePage and employer. There is not yet a standard way to do so. If there were one, parsers taking it into account would be less efficient.
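The two options discussed above (a direction flag versus declared inverse relations) can be compared with a small Python sketch. The inverse table, the function name and the boolean reversed_dir flag standing for the proposed dir property are assumptions for illustration; they are not part of any RDF specification.

```python
# Hypothetical table of declared inverses: inverse relation -> basic relation.
INVERSE_OF = {"partOf": "part", "homePageOf": "homePage"}


def normalize(source, relation, destination, reversed_dir=False):
    """Return the triple in its canonical direction.

    `reversed_dir` stands for the proposed `dir` property: the relation is
    to be read from destination to source. Declared inverse relations are
    rewritten to their basic relation with the arguments swapped.
    """
    if reversed_dir:
        source, destination = destination, source
    if relation in INVERSE_OF:
        relation = INVERSE_OF[relation]
        source, destination = destination, source
    return (source, relation, destination)


# Both notations describe the same fact once normalized.
print(normalize("wing1", "part", "plane1", reversed_dir=True))
print(normalize("wing1", "partOf", "plane1"))
```

With the flag, normalization is a local swap; with inverse relations, the parser also needs the table of declarations, which is where the extra cost comes from.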
Syntactically, a context is a concept which embeds other concepts (possibly linked by relations). Semantically, a context represents a situation (i.e. relationships between objects in a real or imaginary world) or a statement (i.e. a description of a situation). Like any other concept, a context may be referred to via an individual identifier or an existentially quantified variable. Thus, it is possible to describe relations from/to contexts and therefore about their content. In RDF/XML, the keyword "aboutEach" must be used to do so.
Some of these relations involve situations, e.g. to situate them in time, while others involve statements, e.g. to state that they have been authored by someone at a certain time. The signatures of such relations are important information for checking or classifying the kinds of contexts connected by such relations. However, explicitly typing contexts with Situation or Statement to comply with the relation signatures is not intuitive, and it leads to lengthy representations and rather arbitrary decisions. For instance, can logical relations be directly connected to Situation concepts or is an intermediary Statement concept always necessary? This problem has not yet been tackled by the RDF specifications [RDF syntax, RDF schema]: only one class of context is used, rdf:Description.
We propose that RDF parsers still accept rdf:Description as a generic class for contexts, but automatically deduce their adequate classes (Situation or Statement) and the implicit intermediate contexts. Even with this facility, the notation for contexts quickly becomes cumbersome and would need to be adapted. The next example shows this need.
In this example, we have represented negation using the relation class
truth. A better convention might be to use the context class
Negated_description or simply Not.
[Berners-Lee, 1999]
also proposes the relation class "truth", plus an IF-THEN construct to allow
the representation of rules. Though we reuse this construct in the next
section, an alternative is to use relations such as
implication or equivalence.
Modalities could be represented via the relation modality and an
instance of an agreed-on set of modality classes.
It is not the purpose of this article to propose an ontology of modalities
but we would like to emphasize that for the sake of knowledge reuse such issues
should be made part of the RDF standard.
[Berners-Lee, 1999] proposes a construct for universal quantification. Here is an extract from his examples.
Additional properties (e.g. "atLeast", "atMost" and "part") would be interesting to specify some restrictions on the quantification. Here is an example.
Such a construct permits the definition of rules on the instances of a class,
or in other words, to associate definitions to that class. Without restricting
properties (e.g. "atLeast", "atMost" and "part"), the definition specifies relations
"necessarily" connected to all instances of that class (that is, necessary conditions
of membership to the class).
Using part="most", typical relations can be defined, but more precision
is achieved with percentages (e.g. part="75%" or atLeast="75%").
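One possible reading of such percentage restrictions can be sketched in Python: a definition with atLeast="75%" is taken to hold if the relation is found on at least 75% of the known instances of the class. This reading, the function names and the sample data are assumptions; the source does not fix these semantics.

```python
def parse_percent(text):
    """Turn a string such as "75%" into the fraction 0.75."""
    if not text.endswith("%"):
        raise ValueError("expected a percentage such as '75%'")
    return float(text[:-1]) / 100.0


def holds_at_least(instances, has_relation, at_least="100%"):
    """Check that the relation holds for at least the given share of instances.

    With the default "100%", this is a necessary condition of membership.
    An empty collection of instances satisfies any restriction vacuously.
    """
    instances = list(instances)
    if not instances:
        return True
    share = sum(1 for i in instances if has_relation(i)) / len(instances)
    return share >= parse_percent(at_least)


# Hypothetical instances of a class and whether each has the relation.
observed = {"i1": True, "i2": True, "i3": False, "i4": True}
print(holds_at_least(observed, observed.get, at_least="75%"))
print(holds_at_least(observed, observed.get))
```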
[RDF schema] also permits one to define some restrictions on the use of a class
by directly connecting classes via relations.
Though this method is convenient for a few well-known special cases (generalization
relations, exclusion relations and relation signatures), the semantics of such connections
is unknown for other cases. Assume for example that two classes Airplane
and Wing are connected by a relation "part". Does this mean that "any airplane
has for part a wing" or "any wing is part of a plane" or "a wing is part of any plane" or
"any airplane has for part all the wings"?
We propose the first interpretation be adopted (i.e. the source of the relation is
universally quantified and the destination existentially quantified).
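The four candidate readings of the Airplane/Wing connection can be written in first-order logic, in the order in which they are listed above. The predicate names are taken from the example; the formalization itself is only one plausible rendering of the English sentences.

```latex
\begin{align*}
(1)\quad & \forall a\,\bigl(\mathit{Airplane}(a) \rightarrow
           \exists w\,(\mathit{Wing}(w) \wedge \mathit{part}(a,w))\bigr)\\
(2)\quad & \forall w\,\bigl(\mathit{Wing}(w) \rightarrow
           \exists a\,(\mathit{Airplane}(a) \wedge \mathit{part}(a,w))\bigr)\\
(3)\quad & \exists w\,\bigl(\mathit{Wing}(w) \wedge
           \forall a\,(\mathit{Airplane}(a) \rightarrow \mathit{part}(a,w))\bigr)\\
(4)\quad & \forall a\,\forall w\,\bigl(\mathit{Airplane}(a) \wedge \mathit{Wing}(w)
           \rightarrow \mathit{part}(a,w)\bigr)
\end{align*}
```

Reading (1) is the one with a universally quantified source and an existentially quantified destination.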
The properties "atLeast", "atMost" and others such as "size" would also be convenient for containers, and the "forall" construct useful for quantifying over the members of a container. Consider for example the sentence "Ten persons, including Fred and Wilma, have each approved a resolution". Since the persons may or may not have approved the same resolution, a universal quantifier (over the persons) must be used with an existential quantifier within it to refer to the resolutions. In the following example, we introduce the class Set to specify that the members cannot be identical.
The properties "atLeast" and "atMost" permit the delimitation of intervals. Here is an example.
This last example could also be represented
using the relations minimalSize and maximalSize, which are
part of the 120 basic relations of our top-level ontology.
However, as with conventions, if such common and basic relations are not adopted as
standards, the comparison of RDF metadata (and therefore their retrieval, merge
and reuse) will remain problematic.
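The interval reading of "atLeast" and "atMost" on the size of a set can be sketched as follows. The function name and parameters are hypothetical, and duplicates are collapsed on the assumption that the members of a Set cannot be identical.

```python
def size_within(members, at_least=None, at_most=None):
    """Check that the number of distinct members lies in [at_least, at_most].

    A bound left as None is not checked.
    """
    size = len(set(members))  # identical members count only once
    return ((at_least is None or size >= at_least)
            and (at_most is None or size <= at_most))


print(size_within(["Fred", "Wilma", "Fred"], at_least=2, at_most=3))
```

The same check could be expressed with relations such as minimalSize and maximalSize; only the vocabulary differs, which is why agreeing on one of the two matters for comparison.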
Information can be represented in a number of different ways, especially with low-level general languages such as RDF/XML. For representations to be automatically comparable, conventions must be followed. We have proposed general lexical, structural and semantic conventions, then examined some issues associated with the most common logical cases and proposed ways to use RDF in those cases. More issues need to be tackled and incorporated into the RDF specifications before this language can support knowledge reuse.
This work is supported by a research grant from the Australian Defence Science and Technology Organisation (DSTO). Many thanks to Dr Olivier Corby for his readings and corrections of this article.
T. Berners-Lee, The Semantic Toolbox: Building Semantics on top of XML-RDF, W3C Note, 24 May 1999. http://www.w3.org/DesignIssues/Toolbox.html; see also the "Semantic Web Road map" at http://www.w3.org/DesignIssues/Semantic.html
T. Berners-Lee, D. Connolly, R. Swick, Web Architecture: Describing and Exchanging Data, W3C Note, 7 June 1999. http://www.w3.org/1999/04/WebData
N. Guarino, C. Masolo and G. Vetere, Ontoseek: Content-based Access to the Web, IEEE Intelligent Systems, Vol. 14, No. 3, pp. 70-80, May/June 1999
Martin Ph. & Eklund P., Embedding Knowledge in Web Documents. Proceedings of WWW8, Eighth International World Wide Web Conference, special issue of The International Journal of Computer and Telecommunications Networking (in press), Toronto, Canada, May 11-14, 1999.
BRML | Business Rules Markup Language. http://xml.coverpages.org/brml.html |
CGs | Conceptual Graphs. http://www.jfsowa.com/cg/cgstand.htm. See also: J.F. Sowa, Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, 1984. |
CYC | http://www.cyc.com/ |
DC | Dublin Core. http://purl.oclc.org/dc/ |
LOOM | The LOOM knowledge representation system. http://www.isi.edu/isd/LOOM/LOOM-HOME.html. See also http://www.isi.edu/isd/OntoLoom/hpkb/OntoLoom.html#RTFToC18 |
MCF/XML | Meta Content Framework Using XML. http://www.w3.org/TR/NOTE-MCF-XML/#secA. |
RDF | Resource Description Framework. http://www.w3.org/RDF/ |
RDF syntax | RDF Model and Syntax Specification. http://www.w3.org/TR/REC-rdf-syntax/ |
RDF schema | RDF Schema Specification. http://www.w3.org/TR/1998/WD-rdf-schema/ |
WN | WordNet. http://wordnet.princeton.edu/ |