The recommendations discussed below may be useful for the design of ontologies
that are intended to be used via various languages and connected to other ontologies.
The main goal of these recommendations is to normalize knowledge representations,
i.e. to reduce the number of incomparable ways in which something can be expressed
(and hence to increase knowledge inferencing possibilities).
Most authors of frame-based or graph-based knowledge representation languages
have given templates to read triplets {node1, binary_relation, node2}, and
these templates are similar to the following ones:
"<node1> HAS FOR <binary_relation> <node2>" or
"<node2> IS THE <binary_relation> OF <node1>".
For example, the FCG statement [Tom, father: some man] can be read
"Tom has for father some man" or "some man is the father of Tom".
Hence, the signature of this relation father must specify animal
as first argument and male_animal as second argument,
i.e. in the FL notation: (animal, male_animal), or, in order to specify that
the relation is functional: (animal -> male_animal).
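The two reading templates above can be sketched in a few lines of Python. This is only an illustration of the convention, not part of FCG or FL; the function name read_triple is hypothetical.

```python
from typing import Tuple

def read_triple(node1: str, relation: str, node2: str) -> Tuple[str, str]:
    """Return the two readings of the triple {node1, relation, node2}."""
    forward = f"{node1} has for {relation} {node2}"
    backward = f"{node2} is the {relation} of {node1}"
    return forward, backward

forward, backward = read_triple("Tom", "father", "some man")
print(forward)   # Tom has for father some man
print(backward)  # some man is the father of Tom
```

A relation name such as "has_father" or "fathers" would produce awkward output with both templates, which is the practical reason for preferring plain nouns.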
Such a reading convention is rarely provided by authors of predicate-based notations.
This often leads their users to create relation names that are awkward under the
common reading convention cited above, or in certain notations. For example, in
controlled languages such as Formalized-English, it is awkward
to use relation names that are prefixed by "has", postfixed by "of", or that are verbs
in the third person of the present tense (e.g. "fathers").
More problematically, the order of the arguments in certain relations may be the opposite
of what is expected from the common reading convention, which makes the use of these
relations error-prone and the understanding of their uses more difficult. This is for
example the case in the SUMO for most binary relations from or to a class
(e.g. instance and subclass, whose counterparts in RDF/RDFS are named type and subClassOf).
However, in the SUMO, many other binary relation types (e.g. those imported from other
ontologies such as the "caseRole" relations) follow the common reading convention.
The functional binary relations (called "unary functions" in the SUMO) also
necessarily follow it. Organizing relations into a specialization hierarchy
is easier (or the result is more understandable) if the relations follow the common
reading convention.
Non-binary relations are supported by a smaller number of languages or systems,
and there is no common reading convention for non-binary relations. However, many of
these relations would be advantageously replaced by binary relations. For example,
instead of being ternary relations, the relations location_between and sum
could be declared with the following signatures in FL:
(spatial_object, {spatial_object}) and ({number} -> number).
Such relations respect the common reading convention and permit statements such as
the following FCGs: [the set {1,2,3,4}, sum: 10] and [sum({1,2,3,4}) = 10].
This use of sets also seems a good alternative to variable-arity relations since
their meaning is more intuitive and there may be more languages able to support sets
than variable-arity relations.
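As a rough illustration of the signature ({number} -> number), the following Python sketch declares sum as a function from one set to one number rather than as a ternary or variable-arity relation. The name sum_of is hypothetical and the example only mirrors the FCG statement [sum({1,2,3,4}) = 10].

```python
from typing import FrozenSet

def sum_of(numbers: FrozenSet[float]) -> float:
    """Functional binary relation: one set of numbers in, one number out."""
    return sum(numbers)

print(sum_of(frozenset({1, 2, 3, 4})))  # 10
```

One caveat of this sketch is that a plain set drops duplicates, so repeated addends would need a bag (multiset) as source argument instead.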
Finally, whenever possible, instead of adding temporal and modal arguments to relations,
it seems more economical and better for knowledge organization and sharing to keep
relations binary and ask knowledge providers to contextualize their statements (i.e.
ask them to use temporal relations on states or processes, and relations on
propositions to express modalities). This is however debatable, if only because
more languages may accept non-binary relations than accept contexts.
The use of the InterCap style for category identifiers, with the first letter capitalized except for relation types ("properties" in RDF), is now widespread and sometimes recommended, e.g. in the Meta Content Framework Using XML.
However, this is not the best convention to adopt because the correct spelling of the words used in the identifiers cannot always be recovered (e.g. in order to generate English or Formalized English); hence this correct spelling must also be specified, which is cumbersome, in practice rarely done, and not possible in all languages. It is also more important to readily distinguish between types and individuals (the latter cannot be specialized and, at least in European languages, are denoted by words with a capitalized first letter) than between relation types and concept types or individuals.
The use of words with their normal case and spelling, separated by an underscore if more than one is used, is a more readable alternative which does not lead to information loss. Thus, category searches via words are eased, and exports in controlled or natural languages are more readable. Automatic conversions to the above cited convention are still possible.
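The asymmetry between the two conventions can be shown with a short Python sketch. The function to_intercap is hypothetical; it assumes that words are separated by single underscores and that relation types start with a lowercase letter.

```python
def to_intercap(identifier: str, is_relation: bool = False) -> str:
    """Convert an underscore-separated identifier to the InterCap style."""
    words = identifier.split("_")
    result = "".join(word[:1].upper() + word[1:] for word in words)
    if is_relation:
        # Relation types ("properties" in RDF) keep a lowercase first letter.
        result = result[:1].lower() + result[1:]
    return result

print(to_intercap("sub_class_of", is_relation=True))  # subClassOf
print(to_intercap("Department_of_Education"))         # DepartmentOfEducation
print(to_intercap("department_of_education"))         # DepartmentOfEducation
```

The last two calls return the same string, so the original case of each word cannot be recovered from the InterCap form alone; the conversion in the other direction is lossy.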
As noted above, the common reading convention for relations leads to the use
of nouns (rather than verbs) for their names. Verbs should also be avoided because
concepts rather than relations should be used for representing states or processes.
Indeed, as opposed to relations, concepts representing meanings of nouns can be quantified
in various ways (as in "at least 2 hits") and connected to relations (e.g. to
represent the agent, object, instrument, reason, time and place of the process(es)).
Hence, if relations were used instead, many relations representing hitting
relationships would have to be declared
(e.g. hits, hits_with_instrument, hits_with_instrument_for_a_reason)
and they would have to be defined with respect to the concept type hit to
permit the statements which use these relations to be comparable (that is, to permit
generalization relations between the statements to be found automatically).
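To make the comparability argument concrete, the following minimal Python sketch represents a hitting process as a concept node connected by binary relations, and checks generalization by simple triple inclusion. The individuals (Tom, a ball, a bat), the node name hit_1 and the function generalizes are hypothetical, and the subset test is a deliberate simplification that ignores type hierarchies and quantifiers.

```python
# Each statement is a set of (node, relation, node) triples about one process.
# "hit_1" is a hypothetical identifier for an instance of the concept type hit.
hit_by_tom = frozenset({
    ("hit_1", "agent", "Tom"),
    ("hit_1", "object", "a ball"),
})
hit_by_tom_with_bat = hit_by_tom | {("hit_1", "instrument", "a bat")}

def generalizes(general, specific):
    """True if every triple of `general` also occurs in `specific`."""
    return general <= specific

print(generalizes(hit_by_tom, hit_by_tom_with_bat))  # True
print(generalizes(hit_by_tom_with_bat, hit_by_tom))  # False
```

With dedicated relations such as hits and hits_with_instrument, no such structural comparison is available unless each relation is separately defined in terms of the concept type hit.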
Whenever possible, it is better for knowledge inferencing not to introduce a relation
with a collection (in its collective interpretation) as destination, hence it is better not
to use plural nouns. For example, the FCG [Tom, parents: the set {John,Mary}]
is not comparable to [Tom, parent: a man]
(i.e. it is neither a specialization nor a generalization).
However, using collections in their distributive interpretation, as in
[Tom, parent: {John,Mary}], is no problem because they are
shortcuts to avoid repeating relations; for example, the last statement is
equivalent to [Tom, parent: John, parent: Mary]
and is a specialization of [Tom, parent: a man].
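The difference between the two interpretations can be sketched in Python as follows. The type assignments in INSTANCE_OF are assumptions made for the example, the function names are hypothetical, and the specialization test is reduced to a single pattern match.

```python
# Assumed type assignments (not stated in the text): John is a man, Mary a woman.
INSTANCE_OF = {"John": "man", "Mary": "woman"}

def expand_distributive(source, relation, destinations):
    """Expand [source, relation: {d1, d2, ...}] into one triple per destination."""
    return [(source, relation, d) for d in sorted(destinations)]

def specializes(triples, source, relation, destination_type):
    """True if some triple matches [source, relation: a destination_type]."""
    return any(
        s == source and r == relation and INSTANCE_OF.get(d) == destination_type
        for s, r, d in triples
    )

distributive = expand_distributive("Tom", "parent", {"John", "Mary"})
collective = [("Tom", "parents", frozenset({"John", "Mary"}))]

print(distributive)  # [('Tom', 'parent', 'John'), ('Tom', 'parent', 'Mary')]
print(specializes(distributive, "Tom", "parent", "man"))  # True
print(specializes(collective, "Tom", "parent", "man"))    # False
```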
A concept type with an adjective as identifier but actually not representing a meaning
of this adjective may be misleading (e.g. "abstract" in the SUMO should rather be named
"abstract_entity").
Concept types for nouns (i.e. representing meanings of nouns) are more common and easier
to organize into a subsumption hierarchy than concept types for adjectives or verbs.
Within statements, each concept type for a verb can be replaced by
its related concept type for a noun, e.g. hitting can be replaced by hit.
To illustrate how introducing concept types for adjectives may be avoided, consider the
expression "a long car"; it can be represented by
[a car, length: an important_length_for_a_car] (this requires the introduction of
important_length_for_a_car as a specialization of size) or by
[a car, length: an important length] (in FCG,
the keywords "important", "small", "big", "great", "good" and "bad" may be used as
qualifiers; this solution is more user-friendly and helps normalize knowledge entering;
what constitutes a big size for a car may be derived automatically if
statements such as [most car, length: 2.5 to 4 meter] are entered too).
Thus, manual knowledge representation can be done with "concept types for nouns" only.
However, when knowledge entering is not manual but done via natural language
parsing, such normalized representations may be difficult to obtain; in that case,
concept types for verbs and adjectives have to be used too.
Most concept types in ontologies are singular nouns (and they have to be in the singular
in the Meta Content Framework Using XML). The possibility of systematically using
quantification or sets (e.g. as in [3 man {Tom,Joe,*}, parent: Mary],
[the set {Tom,Joe,*}, size: 3] and [at least 40% of person, kind: man])
increases statement comparability and reduces the need to use concept types which
express collections.
Ending the identifiers of second order types by "_class" if they have concept types as instances, and by "_relation_type" (or "Property" as is the case in RDFS/OWL) or "_function_type" (or "FunctionalProperty") if they have relation types as instances, helps people recognize the nature of these types and hence eases their understanding of the ontology organization.
Normalizing the annotations (informal descriptions) of categories is also helpful.
The annotation of a first order type usually refers to an anonymous instance
of that type (examples of typical beginnings of such annotations:
"a car is a vehicle...", "vehicle which has ..." and "relation which connects ...";
the plural should rather be avoided as it may lead to ambiguities). The type itself
is not referred to unless this is made explicit via an expression such as "this type ...".
Hence, the annotation of a second order type should refer to a first order type, and not
directly to the instances of this type; a concise way to do that is to begin the
annotation by "class of " or "type of " as in "class of cars which ... "
or "type of relations which ...".
Second order types are useful and commonly used (e.g. in OWL and the SUMO) to permit
the declaration of certain properties of classes and relations, e.g. reflexivity,
transitivity and symmetry.
The OntoClean methodology
encourages the implicit or explicit use of class properties (hence second-order types)
such as rigidity and unity in order to avoid misuses of the subtype link (e.g.
subtyping water by ocean, subtyping a type representing
a role by a type not representing a role, and using subtype links instead of partOf
or instance links).
However, some uses of second order types do not seem warranted.
For example, the TAP ontology categorizes certain
types of magazines or books as instances of a second-order type product_type
which has no other supertype than Class from RDFS. Even if it had, the use of
the first-order type product permits more (or easier) connections and
comparisons with other types, and hence more knowledge retrieval and
checking possibilities. (A more debatable example is the classification of
species as a second order type which has various types of plants or animals as
instances. It seems better to classify it as a subtype of collection which has
various types of plants or animals as members; this is the case in WordNet.)
If first order types such as product are duplicated
into second order types, (i) many relations also have to be duplicated, and
(ii) the ontology becomes more complex to search, understand and use, and harder
for inference engines to exploit.
Furthermore, even when second order types have to be introduced
(e.g. TransitiveProperty in OWL since transitivity is a property of a relation type which
is not necessarily inherited by its subtypes), the existence of first-order types such
as transitive_relation is interesting: the list of instances of
TransitiveProperty can be structured by subtyping transitive_relation,
and this eases the access, understanding and
management of these relation types by people.
Individuals (i.e. instances of first-order types) may also be over-used.
For example, it may be tempting to represent a certain doctrine, language, program
or day of the week as an individual, but then what about their variants and their
occurrences? For example, Monday has a potentially infinite number of
occurrences, and so has Whitmonday (the day after Whitsunday). The
simplest scheme is to represent Whitmonday as a subtype of
Monday and their occurrences as individuals (anonymous or not).
Similarly, an alphabetic character (seen as a symbol) and the content of a book may
also have (existing or potential) variants; for example, the Bible has many versions
in many languages and the character A has "versions" too (uppercase, lowercase, ...).
There are many ways to view, categorize and relate such "versions" but using
subtype relations seems the simplest way.
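The Monday and Whitmonday scheme described above can be mimicked with Python classes, where classes play the role of types and objects the role of individual occurrences. This is only an analogy under that assumption; the attribute name on and the chosen date are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Monday:
    """Type whose instances are individual occurrences of a Monday."""
    on: date

class Whitmonday(Monday):
    """Subtype of Monday: the day after Whitsunday."""

# One occurrence, represented as an individual (20 May 2024 was a Whit Monday).
occurrence = Whitmonday(date(2024, 5, 20))

print(issubclass(Whitmonday, Monday))  # True
print(isinstance(occurrence, Monday))  # True
```

Had Monday been modelled as an individual, neither the subtype Whitmonday nor its occurrences could have been attached to it in this way.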
It is the purpose of "language ontologies" (e.g. the Frame-Ontology, OWL and a good part of the SUMO) to define certain concepts or relations (e.g. subtype links and partitions, relation signatures and cardinalities, and numerical quantifiers) which support and partially normalize the knowledge entering in low-level languages (e.g. RDF and KIF). In high-level languages, syntactic sugar is often provided to ease and encourage the systematic use of these concepts or relations.
A restricted language and language ontology such as RDF+OWL restricts knowledge
entering (or leads knowledge engineers to use ad-hoc or biased knowledge representations).
This may be done purposefully to ensure that all entered knowledge representations
can be exploited efficiently by certain kinds of inference engines. However, for
knowledge re-use purposes, a more scalable and flexible approach is for each inference
engine to filter out or warn about the (parts of the) knowledge representations it
cannot exploit. It is problematic for the checking of users' updates in an
ontology that RDFS and OWL do not propose a "proper subClassOf" relation:
subClassOf cycles are permitted and interpreted as equivalentClasses relations.
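A tool that wants to warn users about such cycles, as a "proper subClassOf" relation would, could run a check along the following lines. This is a minimal sketch using a depth-first search over a plain dictionary; the function name find_subclass_cycle and the class names in the example are hypothetical, and no RDF library is involved.

```python
from typing import Dict, List, Set

def find_subclass_cycle(subclass_of: Dict[str, Set[str]]) -> List[str]:
    """Return one subClassOf cycle as a list of class names, or [] if none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        color[node] = GREY
        path.append(node)
        for parent in sorted(subclass_of.get(node, ())):
            state = color.get(parent, WHITE)
            if state == GREY:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return []

    for node in sorted(subclass_of):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []

print(find_subclass_cycle({"a": {"b"}, "b": {"c"}, "c": {"a"}}))  # ['a', 'b', 'c', 'a']
print(find_subclass_cycle({"a": {"b"}, "b": {"c"}}))              # []
```

Running such a check on each update would let an editor reject, or at least flag, a cycle that was introduced by mistake instead of silently treating the classes as equivalent.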
Expressive high-level languages, or rich language ontologies, ease and
normalize knowledge entering. The following two examples illustrate the interest of
features supporting and encouraging the use of cardinalities, numerical quantifiers and
(if this is acceptable) physical possibility.
"E" is for "English" and "OWLD" is for "RDF+OWL_DL". The category identifiers come
from the Multi-Source Ontology (MSO) of WebKB-2.
In the second example there is no translation in RDF+OWL because the keyword
aboutEach, which permitted common uses of sets and contexts (although
sometimes in a rather ad-hoc way), is no longer part of RDF. In KIF, both examples can
be represented in a uniform way via the relation/quantifier atMostN
(there would be advantages in introducing such relations/quantifiers in SUMO, OWL
or other well-known language ontologies).
Other examples and features for a language or a language ontology can be found in this comparison of FL, FE and FCG with KIF, RDF+OWL and UML. Please e-mail "pm .@. phmartin dot info" if you think some parts of the KIF definitions could be improved.
Section 2.2. suggested that relations should rather be primitive (and hence not represent processes) as this leads to more explicit and comparable statements. This also avoids double declarations (one type in the concept type hierarchy and one in the relation type hierarchy) and hence reduces the workload of users and inference engines. This applies not just to processes but to any concept that could also be represented via a relation, e.g. roles such as parent or employer, and attributes such as mass or color. Some languages only allow a few predefined relations, e.g. knowledge graphs allow only 12 basic relations (however, this is not a common approach and the resulting knowledge representations seem quite cumbersome to read). Without going that far, features that avoid double declarations and lead to more normalized representations seem interesting. To that end, I have recently detailed one possibility on the SUO list: permitting users to associate relation signatures to concept types in such a way that (i) in low-level languages such as KIF, actual relation types are implied or generated, and (ii) in high-level languages such as FCG, these concept types may be used as if they were relation types.
Structuring the relation types into a hierarchy helps their access and use by people.
It also permits the inference engine to perform more checks on the additions of new
relation types. John Sowa proposes a systematic classification of
roles and relations (and in particular, case relations)
according to their ontological nature.
Another way to organize relation/function types is according to their source or
destination arguments. This method is always applicable and quite straightforward.
In the relation type hierarchy of the MSO of WebKB-2,
this method turned out to be scalable and effective in structuring and systematically
grouping semantically related relations, thus offering various ways to access them.
Here is an extract of the MSO; ">" refers to the subtype link, "*" refers
to an unknown number of unknown types, and "pm#collection+" refers to at least
one occurrence of the type pm#collection.
  pm#relation_from_collection (pm#collection, *)
    > pm#member ... pm#relation_to_another_collection pm#relation_from_type;
  pm#relation_to_another_collection (pm#collection, pm#collection+)
    > pm#sub_collection ... pm#relation_to_another_set_or_class;
  pm#relation_to_another_set_or_class (pm#set_or_class, pm#set_or_class+)
    > pm#subclass_of_or_equal ... sumo#power_set_fn pm#relation_to_another_class;
  pm#subclass_of_or_equal (pm#set_or_class, pm#set_or_class)
    > sumo#subclass rdfs#sub_class_of;
  sumo#subclass__subclass_of (sumo#set_or_class, sumo#set_or_class)
    > sumo#immediate_subclass;
  sumo#power_set_fn (sumo#set_or_class -> sumo#set_or_class);
  pm#relation_to_another_class (rdfs#class, rdfs#class+)
    > rdfs#sub_class_of owl#equivalent_class ... sumo#exhaustive_decomposition;
  sumo#exhaustive_decomposition (sumo#class, sumo#class+);
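The grouping principle behind this extract, organizing relation types by the type of their source argument, can be sketched in Python as follows. The signatures are copied from the extract above; the dictionary layout and the function name group_by_source are assumptions made for the example and do not reflect how WebKB-2 stores the MSO.

```python
from collections import defaultdict

# Relation type -> (source argument type, destination argument type(s)).
SIGNATURES = {
    "pm#relation_from_collection": ("pm#collection", "*"),
    "pm#relation_to_another_collection": ("pm#collection", "pm#collection+"),
    "pm#relation_to_another_set_or_class": ("pm#set_or_class", "pm#set_or_class+"),
    "pm#subclass_of_or_equal": ("pm#set_or_class", "pm#set_or_class"),
    "sumo#subclass__subclass_of": ("sumo#set_or_class", "sumo#set_or_class"),
    "sumo#power_set_fn": ("sumo#set_or_class", "sumo#set_or_class"),
    "pm#relation_to_another_class": ("rdfs#class", "rdfs#class+"),
    "sumo#exhaustive_decomposition": ("sumo#class", "sumo#class+"),
}

def group_by_source(signatures):
    """Group relation type names by the type of their first argument."""
    groups = defaultdict(list)
    for relation, (source, _destination) in signatures.items():
        groups[source].append(relation)
    return dict(groups)

for source, relations in group_by_source(SIGNATURES).items():
    print(source, "->", relations)
```

Grouping on the second element of each signature instead would give the complementary access path by destination argument.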
Associating constraints to categories reduces the chances of misuse.
Being precise when representing statements (e.g. by using precise types and
contextualizing statements in space, time and, if appropriate, modalities) reduces the
chances of conflicts between statements. This may sound obvious for knowledge sharing
purposes but is not always followed. For example,
(i) at one stage, the RDF documentation stated that for some uses, representing
"the price of that pencil is 75" was better than representing
"the price of that pencil is 75 US cents",
(ii) when integrating WordNet into the MSO, the string "_USA" had to be added to
category names such as "North" (hence now "North_USA"), and the string "in USA" had to be
added within the annotations of some categories with names such as
"Department_of_Education", and
(iii) statements such as "any bird flies" are more often represented than statements
such as "a study made by Dr Foo found that in 1999,
93% of adult healthy birds could fly" (in this particular case, it may be
appropriate to specialize the type bird with two exclusive subtypes such as
bird_that_can_fly_when_adult_and_healthy and
pm#bird_that_cannot_fly_when_adult_and_healthy).
Connecting the categories of specialized ontologies to a large lexical ontology seems to have several advantages: (i) it saves the ontology creators a lot of work and guides them since many distinctions that they would not have thought about are already recorded and inter-related, (ii) the constraints associated to those distinctions support some semantic checking by inference engines, and (iii) the lexical ontology acts as a large "hat" which documents and structures the specialized categories and the knowledge representations that use them, and makes it possible to compare and retrieve them. These are some of the reasons why WordNet has been used to extend some top-level ontologies, e.g. for the SUMO, for DOLCE in OntoWordNet, or in the MSO. On the WebKB site, some examples of domain ontologies represented by specializing WordNet categories and interconnecting the new categories between themselves are the representation of the ADFP9 glossary and the representation of the CADM model.