Discussion on recommendations to increase knowledge re-use

Dr Philippe A. Martin - March 8th, 2004

1. Order and number of relation arguments
2. Some naming recommendations
3. Some structural recommendations

The recommendations discussed below may be useful for the design of ontologies which are supposed to be used via various languages and connected to other ontologies. Normalizing knowledge representations, i.e. reducing the number of incomparable ways something can be expressed (and hence, increasing knowledge inferencing possibilities) is the main goal of these recommendations.

1. Order and number of relation arguments

Most authors of frame-based or graph-based knowledge representation languages have given templates to read triplets {node1, binary_relation, node2} and these templates are similar to the following ones: <node1> HAS FOR <binary_relation> <node2> or <node2> IS THE <binary_relation> OF <node1>. For example, the FCG statement [Tom, father: some man] can be read "Tom has for father some man" or "some man is the father of Tom". Hence, the signature of this relation father must specify animal as first argument and male_animal as second argument, i.e. in the FL notation: (animal, male_animal), or in order to specify that the relation is functional: (animal -> male_animal).
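
As a small illustration of the two reading templates above, the following Python sketch renders a triplet both ways. The function names read_forward and read_backward are hypothetical and introduced only for this example; they are not part of FCG or FL.

```python
def read_forward(node1: str, relation: str, node2: str) -> str:
    """Template: <node1> HAS FOR <binary_relation> <node2>."""
    return f"{node1} has for {relation} {node2}"


def read_backward(node1: str, relation: str, node2: str) -> str:
    """Template: <node2> IS THE <binary_relation> OF <node1>."""
    return f"{node2} is the {relation} of {node1}"


# The FCG statement [Tom, father: some man] seen as (node1, relation, node2).
triplet = ("Tom", "father", "some man")
print(read_forward(*triplet))   # Tom has for father some man
print(read_backward(*triplet))  # some man is the father of Tom
```

A relation name such as "has_father" or "father_of" would produce awkward output with either template, which is the practical reason for preferring plain nouns.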

Such a reading convention is rarely provided by authors of predicate-based notations. This often leads their users to create relation names that are awkward for the above cited common reading convention or in certain notations. For example, when using controlled languages such as Formalized-English, it is awkward to use relation names that are prefixed by "has", or postfixed by "of", or which are verbs in the present tense and third person (e.g. "fathers"). More problematically, the order of the arguments in certain relations may be the opposite of what is expected from the common reading convention, which makes the use of these relations error-prone and the understanding of their uses more difficult. This is for example the case in the SUMO for most binary relations from or to a class (e.g. instance and subclass whose counterparts in RDF/RDFS are named type and subClassOf). However, in the SUMO, many other binary relation types (e.g. those imported from other ontologies such as the "caseRole" relations) follow the common reading convention. The functional binary relations (called "unary functions" in the SUMO) also necessarily follow it. Organizing relations into a specialization hierarchy is easier (or the result is more understandable) if the relations follow the common reading convention.

Non-binary relations are supported by a smaller number of languages or systems, and there is no common reading convention for non-binary relations. However, many of these relations would be advantageously replaced by binary relations. For example, instead of being ternary relations, the relations location_between and sum could be declared with the following signatures in FL: (spatial_object, {spatial_object}) and ({number} -> number). Such relations respect the common reading convention and permit statements such as the following FCGs: [the set {1,2,3,4}, sum: 10] and [sum({1,2,3,4}) = 10]. This use of sets also seems a good alternative to variable-arity relations since their meaning is more intuitive and there may be more languages able to support sets than variable-arity relations.
Finally, whenever possible, instead of adding temporal and modal arguments to relations, it seems more economic and better for knowledge organization and sharing to keep relations binary and ask knowledge providers to contextualize their statements (i.e. ask them to use temporal relations on states or processes, and relations on propositions to express modalities). This is however debatable, at least because there may be more languages accepting non-binary relations than contexts.
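
The set-based signatures suggested above for sum and location_between can be mimicked in Python as follows. This is only a sketch: the function names, the tuple encoding of statements and the example objects ("the bridge", "bank A", "bank B") are assumptions made for illustration.

```python
from typing import FrozenSet, Tuple


def sum_relation(numbers: FrozenSet[float]) -> float:
    """Functional relation with signature ({number} -> number)."""
    return sum(numbers)


def location_between(
    obj: str, bounds: FrozenSet[str]
) -> Tuple[str, str, FrozenSet[str]]:
    """Binary statement with signature (spatial_object, {spatial_object})."""
    return (obj, "location_between", bounds)


# [the set {1,2,3,4}, sum: 10] encoded as a (source, relation, destination) triplet.
statement = (frozenset({1, 2, 3, 4}), "sum", 10)
print(sum_relation(statement[0]) == statement[2])  # True

# The order of the bounding objects does not matter, since they form a set.
print(location_between("the bridge", frozenset({"bank A", "bank B"})))
```

Because the second argument is a single set, both relations stay binary and keep the common reading convention, whatever the number of members.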

2. Some naming recommendations

2.1. Rationales against the InterCap style

The use of the InterCap style for category identifiers, with the first letter capitalized except for relation types ("properties" in RDF), is now widespread and sometimes a recommendation, e.g. in the Meta Content Framework Using XML.

However, this is not the best convention to adopt because the correct spelling of the words used in the identifiers cannot always be recovered (e.g. in order to generate English or Formalized English) and hence this correct spelling must also be specified, which is cumbersome, in practice rarely done, and not possible in all languages. It is also more important to readily distinguish between types and individuals (these last ones cannot be specialized and, at least in European languages, are denoted by words with a capitalized first letter) than between relation types and concept types or individuals.

The use of words with their normal case and spelling, separated by an underscore if more than one is used, is a more readable alternative which does not lead to information loss. Thus, category searches via words are eased, and exports in controlled or natural languages are more readable. Automatic conversions to the above cited convention are still possible.
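
The asymmetry between the two conventions can be sketched in Python. Converting an underscore-separated identifier to InterCap is straightforward, whereas the reverse direction has to guess word boundaries and the original capitalization. The helper names to_intercap and from_intercap are hypothetical, and from_intercap is deliberately naive.

```python
import re


def to_intercap(identifier: str) -> str:
    """Join the words of an underscore-separated identifier, capitalizing each first letter."""
    return "".join(word[:1].upper() + word[1:] for word in identifier.split("_"))


def from_intercap(identifier: str) -> str:
    """Naive reverse conversion: split before each capital letter, then lowercase."""
    words = re.findall(r"[A-Z][^A-Z]*", identifier)
    return "_".join(word.lower() for word in words)


# Two differently spelled identifiers collapse to the same InterCap form.
print(to_intercap("french_polish"))  # FrenchPolish
print(to_intercap("French_polish"))  # FrenchPolish

# The round trip does not recover the original spelling of an acronym.
print(from_intercap(to_intercap("US_dollar")))  # u_s_dollar
```

A smarter reverse conversion could handle some acronyms, but it would still have to guess which words were originally capitalized.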

2.2. Rationales for the use of singular nouns

As noted above, the common reading convention for relations leads to the use of nouns (rather than verbs) for their names. Verbs should also be avoided because concepts rather than relations should be used for representing states or processes. Indeed, as opposed to relations, concepts representing meanings of nouns can be quantified in various ways (as in "at least 2 hits") and connected to relations (e.g. to represent the agent, object, instrument, reason, time and place of the process(es)). Hence for example, many relations representing hitting relationships would have to be declared (e.g. hits, hits_with_instrument, hits_with_instrument_for_a_reason) and they would have to be defined with respect to the concept type hit to permit the statements which use these relations to be comparable (that is, to permit generalization relations between the statements to be found automatically).

Whenever possible, it is better for knowledge inferencing not to introduce a relation with a collection (in its collective interpretation) as destination, hence it is better not to use plural nouns. For example, the FCG [Tom, parents: the set {John,Mary}] is not comparable to [Tom, parent: a man] (i.e. it is neither a specialization nor a generalization). However, using collections in their distributive interpretation, as in [Tom, parent: {John,Mary}] is no problem because they are shortcuts to avoid repeating relations; for example, the last statement is equivalent to [Tom, parent: John, parent: Mary] and is a specialization of [Tom, parent: a man].
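
The difference between the two interpretations can be sketched in Python. The encoding of statements as tuples, the INSTANCE_OF table and the function names are assumptions made for this example only; a real inference engine would use a proper type hierarchy.

```python
from typing import Dict, Iterable, List, Tuple

Statement = Tuple[str, str, object]

# Assumed facts for the example: the type of each individual.
INSTANCE_OF: Dict[str, str] = {"John": "man", "Mary": "woman"}


def expand_distributive(
    source: str, relation: str, destinations: Iterable[str]
) -> List[Statement]:
    """Expand [source, relation: {d1, d2, ...}] into one binary statement per member."""
    return [(source, relation, d) for d in sorted(destinations)]


def specializes_typed_pattern(
    statements: Iterable[Statement], source: str, relation: str, required_type: str
) -> bool:
    """True if some statement matches [source, relation: a <required_type>]."""
    for s, r, d in statements:
        if s == source and r == relation and isinstance(d, str):
            if INSTANCE_OF.get(d) == required_type:
                return True
    return False


# Distributive: [Tom, parent: {John,Mary}]
distributive = expand_distributive("Tom", "parent", {"John", "Mary"})
# Collective: [Tom, parents: the set {John,Mary}]
collective = [("Tom", "parents", frozenset({"John", "Mary"}))]

print(specializes_typed_pattern(distributive, "Tom", "parent", "man"))  # True
print(specializes_typed_pattern(collective, "Tom", "parent", "man"))    # False
```

The collective statement fails the comparison both because its relation is a different one (parents) and because its destination is a set rather than an individual.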

A concept type with an adjective as identifier but actually not representing a meaning of this adjective may be misleading (e.g. "abstract" in the SUMO should rather be named "abstract_entity").
Concept types for nouns (i.e. representing meanings of nouns) are more common and easier to organize into a subsumption hierarchy than concept types for adjectives or verbs. Within statements, each concept type for a verb can be replaced by its related concept type for a noun, e.g. hitting can be replaced by hit.
To illustrate how introducing concept types for adjectives may be avoided, consider the expression "a long car"; it can be represented by [a car, length: an important_length_for_a_car] (this requires the introduction of important_length_for_a_car as a specialization of size) or by [a car, length: an important length] (in FCG, the keywords "important", "small", "big", "great", "good" and "bad" may be used as qualifiers; this solution is more user-friendly and helps normalize knowledge entering; what constitutes a big size for a car may be derived automatically if statements such as [most car, length: 2.5 to 4 meter] are entered too).
Thus, manual knowledge representation can be done with "concept types for nouns" only. However, when knowledge entering is not manual but done via natural language parsing, such normalized representations may be difficult to obtain; in that case, concept types for verbs and adjectives have to be used too.

Most concept types in ontologies are singular nouns (and they have to be in the singular in the Meta Content Framework Using XML). The possibility of systematically using quantification or sets (e.g. as in [3 man {Tom,Joe,*}, parent: Mary], [the set {Tom,Joe,*}, size: 3] and [at least 40% of person, kind: man]) increases statement comparability and reduces the need to use concept types which express collections.

2.3. Naming and annotating second order types

Ending the identifiers of second order types by "_class" if they have concept types as instances, and by "_relation_type" (or "Property" as is the case in RDFS/OWL) or "_function_type" (or "FunctionalProperty") if they have relation types as instances, helps people recognize the nature of these types and hence eases their understanding of the ontology organization.

Normalizing the annotations (informal descriptions) of categories is also helpful. The annotation of a first order type usually refers to an anonymous instance of that type (examples of typical beginnings of such annotations: "a car is a vehicle...", "vehicle which has ..." and "relation which connects ..."; using the plural should rather be avoided as it may lead to ambiguities). The type itself is not referred to unless this is made explicit via an expression such as "this type ...". Hence, the annotation of a second order type should refer to a first order type, and not directly to the instances of this type; a concise way to do that is to begin the annotation by "class of " or "type of " as in "class of cars which ... " or "type of relations which ...".

3. Some structural recommendations

3.1. Second order types, first order types and individuals

Second order types are useful and commonly used (e.g. in OWL and the SUMO) to permit the declaration of certain properties of classes and relations, e.g. reflexivity, transitivity and symmetry. The OntoClean methodology encourages the implicit or explicit use of class properties (hence second-order types) such as rigidity and unity in order to avoid mis-uses of the subtype link (e.g. subtyping water by ocean, subtyping a type representing a role by a type not representing a role, and using subtype links instead of partOf or instance links).

However, some uses of second order types do not seem warranted. For example, the TAP ontology categorizes certain types of magazines or books as instances of a second-order type product_type which has no other supertype than Class from RDFS. Even if it had, the use of the first-order type product permits more (or easier) connections and comparisons with other types, and hence more knowledge retrieval and checking possibilities. (A more debatable example is the classification of species as a second order type which has various types of plants or animals as instances. It seems better to classify it as a subtype of collection which has various types of plants or animals as members; this is the case in WordNet).
If first order types such as product are duplicated into second order types, (i) many relations also have to be duplicated, and (ii) the ontology becomes more complex to search, understand and use, and for inference engines, harder to exploit.

Furthermore, even when second order types have to be introduced (e.g. TransitiveProperty in OWL since transitivity is a property of a relation type which is not necessarily inherited by its subtypes), the existence of first-order types such as transitive_relation is interesting: the list of instances of TransitiveProperty can be structured by subtyping transitive_relation and this eases the access, understanding and management of these relation types by people.

Individuals (i.e. instances of first-order types) may also be over-used. For example, it may be tempting to represent a certain doctrine, language, program or day of the week as an individual, but then what about their variants and their occurrences? For example, Monday has a potentially infinite number of occurrences, and so has Whitmonday (the day after Whitsunday). The simplest scheme is to represent Whitmonday as a subtype of Monday and their occurrences as individuals (anonymous or not). Similarly, an alphabetic character (seen as a symbol) and the content of a book may also have (existing or potential) variants; for example, the Bible has many versions in many languages and the character A has "versions" too (uppercase, lowercase, ...). There are many ways to view, categorize and relate such "versions" but using subtype relations seems the simplest way.

3.2. Expressive high-level constructs or languages

It is the purpose of "language ontologies" (e.g. the Frame-Ontology, OWL and a good part of the SUMO) to define certain concepts or relations (e.g. subtype links and partitions, relation signatures and cardinalities, and numerical quantifiers) which support and partially normalize the knowledge entering in low-level languages (e.g. RDF and KIF). In high-level languages, syntactic sugar is often provided to ease and encourage the systematic use of these concepts or relations.

A restricted language and language ontology such as RDF+OWL restricts knowledge entering (or leads knowledge engineers to use ad-hoc or biased knowledge representations). This may be done purposefully to ensure that all entered knowledge representations can be exploited efficiently by certain kinds of inference engines. However, for knowledge re-use purposes, a more scalable and flexible approach is for each inference engine to filter out or warn about the (parts of the) knowledge representations it cannot exploit. It is problematic for the checking of users' updates in an ontology that RDFS and OWL do not propose a "proper subClassOf" relation: subClassOf cycles are permitted and interpreted as equivalentClasses relations.
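
To make the last point concrete, here is a minimal Python sketch of the kind of check an ontology editor could run on users' updates if subclass links were meant to be "proper" (i.e. acyclic). The function name, the dictionary encoding of subclass links and the class names are assumptions for illustration; this is not a feature of RDFS or OWL.

```python
from typing import Dict, List, Optional, Set


def find_subclass_cycle(superclasses: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of subclass links as a list of class names, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for parent in sorted(superclasses.get(node, ())):
            if state.get(parent, WHITE) == GREY:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in sorted(superclasses):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


ok = {"car": {"vehicle"}, "vehicle": {"thing"}}
bad = {"car": {"vehicle"}, "vehicle": {"thing"}, "thing": {"car"}}
print(find_subclass_cycle(ok))   # None
print(find_subclass_cycle(bad))  # ['car', 'vehicle', 'thing', 'car']
```

With such a check, an update that closes a cycle can be reported to the user as a probable error instead of being silently read as a set of class equivalences.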

Expressive high-level languages, or rich language ontologies, ease and normalize knowledge entering. The following two examples illustrate the interest of features supporting and encouraging the use of cardinalities, numerical quantifiers and (if this is acceptable) physical possibility. "E" is for "English" and "OWLD" is for "RDF+OWL_DL". The category identifiers come from the Multi-Source Ontology (MSO) of WebKB-2. In the second example there is no translation in RDF+OWL because the keyword aboutEach which permitted common uses of sets and contexts (although sometimes in a rather ad-hoc way) is no longer part of RDF. In KIF, both examples can be represented in a uniform way via the relation/quantifier atMostN (there would be advantages in introducing such relations/quantifiers in SUMO, OWL or other well-known language ontologies).

E:    Any human body has at most 2 arms. Any arm belongs to at most 1 body.
FL:   pm#human_body part wn#arm [0..1,0..2];
KIF:  (forall ((?b pm#human_body)) (atMostN 2 '?a wn#arm (pm#part ?b '?a)))
      (forall ((?a wn#arm)) (atMostN 1 '?b pm#human_body (pm#part '?b ?a)))
OWLD: <rdf:Property rdf:ID="ArmPart"><rdfs:subPropertyOf rdf:resource="&pm;part"/>
        <owl:inverseOf rdf:ID="ArmPartOf"/> <rdfs:range rdf:resource="&wn;Arm"/>
      </rdf:Property>
      <owl:Class rdf:about="&pm;HumanBody"><rdfs:subClassOf>
        <owl:Restriction><owl:onProperty rdf:resource="#ArmPart"/>
          <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">2 </owl:maxCardinality></owl:Restriction>
      </rdfs:subClassOf></owl:Class>
      <owl:Class rdf:about="&wn;Arm"><rdfs:subClassOf>
        <owl:Restriction><owl:onProperty rdf:resource="#ArmPartOf"/>
          <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">1 </owl:maxCardinality></owl:Restriction>
      </rdfs:subClassOf></owl:Class>
With: (defrelation atMostN (?num ?var ?type ?predicate) :=
        (exists ((?s set)(?n))
          (and (size ?s ?n) (=< ?n ?num)
               (truth ^(forall (,?var) (=> (member ,?var ,?s) (and (,?type ,?var) ,?predicate)))))))

E:    At most 300,000 persons speak Tahitian.
FCG:  [at most 300000 #person, can be agent of: (a #speaking, instrument: some #Tahitian)]
KIF:  (atMostN 300000 '?p #person (pm#modality '(exists ((?s #speaking) (?t #Tahitian)) (and (pm#agent ?s ?p) (pm#instrument ?s ?t))) pm#physical_possibility))
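
For readers more familiar with programming than with KIF, the intent of the first example (the two cardinalities [0..1,0..2]) can be sketched as a check over a finite set of part links. This Python sketch is not a translation of the atMostN definition; the function name, the encoding of links as (body, arm) pairs and the individual names are assumptions, and every link is assumed to connect a human body to an arm.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cardinality_violations(
    part_links: Iterable[Tuple[str, str]],
    max_arms_per_body: int = 2,
    max_bodies_per_arm: int = 1,
) -> List[str]:
    """Report bodies with too many arms and arms belonging to too many bodies."""
    links = set(part_links)  # duplicates of the same link count once
    arms_per_body = Counter(body for body, _ in links)
    bodies_per_arm = Counter(arm for _, arm in links)
    problems = [
        f"{body} has {n} arms"
        for body, n in sorted(arms_per_body.items())
        if n > max_arms_per_body
    ]
    problems += [
        f"{arm} belongs to {n} bodies"
        for arm, n in sorted(bodies_per_arm.items())
        if n > max_bodies_per_arm
    ]
    return problems


links = [("body1", "arm1"), ("body1", "arm2")]
print(cardinality_violations(links))                         # []
print(cardinality_violations(links + [("body1", "arm3")]))   # ['body1 has 3 arms']
print(cardinality_violations(links + [("body2", "arm1")]))   # ['arm1 belongs to 2 bodies']
```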

Other examples and features for a language or a language ontology can be found in this comparison of FL, FE and FCG with KIF, RDF+OWL and UML. Please e-mail "pm .@. phmartin dot info" if you think some parts of the KIF definitions could be improved.

3.3. Keeping the relation hierarchy small and organized

Section 2.2. suggested that relations should rather be primitive (and hence not represent processes) as this leads to more explicit and comparable statements. This also avoids double declarations (one type in the concept type hierarchy and one in the relation type hierarchy) and hence reduces the workload of users and inference engines. This applies not just to processes but to any concept that could also be represented via a relation, e.g. roles such as parent or employer, and attributes such as mass or color. Some languages only allow a few predefined relations, e.g. knowledge graphs allow only 12 basic relations (however, this is not a common approach and the resulting knowledge representations seem quite cumbersome to read). Without going that far, features that avoid double declarations and lead to more normalized representations seem interesting. To that end, I have recently detailed one possibility on the SUO list: permitting users to associate relation signatures to concept types in such a way that (i) in low-level languages such as KIF, actual relation types are implied or generated, (ii) in high-level languages such as FCG, these concept types may be allowed to be used as if they were relation types.

Structuring the relation types into a hierarchy helps their access and use by people. It also permits the inference engine to perform more checks on the additions of new relation types. John Sowa proposes a systematic classification of roles and relations (and in particular, case relations) according to their ontological nature. Another way to organize relation/function types is according to their source or destination arguments. This method is always applicable and quite straightforward. In the relation type hierarchy of the MSO of WebKB-2, this method turned out to be scalable and effective in structuring and systematically grouping semantically related relations, thus offering various ways to access them. Here is an extract of the MSO; ">" refers to the subtype link, "*" refers to an unknown number of unknown types, "pm#collection+" refers to at least one occurrence of the type pm#collection.

pm#relation_from_collection (pm#collection, *)
 > pm#member ... pm#relation_to_another_collection pm#relation_from_type;

   pm#relation_to_another_collection (pm#collection, pm#collection+)
    > pm#sub_collection ... pm#relation_to_another_set_or_class;

      pm#relation_to_another_set_or_class (pm#set_or_class, pm#set_or_class+)
       > pm#subclass_of_or_equal ... sumo#power_set_fn pm#relation_to_another_class;

         pm#subclass_of_or_equal (pm#set_or_class, pm#set_or_class)
          > sumo#subclass rdfs#sub_class_of ;

            sumo#subclass__subclass_of (sumo#set_or_class, sumo#set_or_class)
             > sumo#immediate_subclass ;

         sumo#power_set_fn (sumo#set_or_class -> sumo#set_or_class);

         pm#relation_to_another_class (rdfs#class, rdfs#class+)
          > rdfs#sub_class_of owl#equivalent_class ... sumo#exhaustive_decomposition;

            sumo#exhaustive_decomposition (sumo#class, sumo#class+);
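
The grouping of relation types by their source argument, as in the extract above, can be sketched in Python. This is a simplified illustration only: the SUPERTYPE table is an assumed fragment of the concept type hierarchy (suggested by the extract, not quoted from the MSO), the function name is hypothetical, and only the first argument of each signature is considered.

```python
from typing import Dict, Optional

# Assumed fragment of a concept type hierarchy (subtype -> supertype).
SUPERTYPE: Dict[str, str] = {
    "pm#set_or_class": "pm#collection",
    "rdfs#class": "pm#set_or_class",
}

# Grouping relation types mapped to the source type of their signature.
GROUPS: Dict[str, str] = {
    "pm#relation_from_collection": "pm#collection",
    "pm#relation_to_another_set_or_class": "pm#set_or_class",
    "pm#relation_to_another_class": "rdfs#class",
}


def most_specific_group(source_type: str, groups: Dict[str, str]) -> Optional[str]:
    """Return the grouping relation whose source type is the closest
    supertype (or equal) of source_type, or None if there is none."""
    current: Optional[str] = source_type
    while current is not None:
        for group, group_source in groups.items():
            if group_source == current:
                return group
        current = SUPERTYPE.get(current)  # climb one level in the hierarchy
    return None


# A new relation whose first argument is rdfs#class goes under the most specific group.
print(most_specific_group("rdfs#class", GROUPS))       # pm#relation_to_another_class
print(most_specific_group("pm#set_or_class", GROUPS))  # pm#relation_to_another_set_or_class
```

A fuller version would also compare destination arguments, which is what distinguishes, for example, pm#relation_from_collection from pm#relation_to_another_collection in the extract.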

3.4. Being precise and connecting to lexical concepts

Associating constraints to categories reduces the chances of mis-uses. Being precise when representing statements (e.g. by using precise types and contextualizing statements in space, time and, if appropriate, modalities) reduces the chances of conflicts between statements. This may sound obvious for knowledge sharing purposes but is not always followed. For example, (i) at one stage, the RDF documentation stated that for some uses, representing "the price of that pencil is 75" was better than representing "the price of that pencil is 75 US cents", (ii) when integrating WordNet into the MSO, the string "_USA" had to be added to category names such as "North" (hence now "North_USA"), and the string "in USA" had to be added within the annotations of some categories with names such as "Department_of_Education", and (iii) statements such as "any bird flies" are more often represented than statements such as "a study made by Dr Foo found that in 1999, 93% of adult healthy birds could fly" (in this particular case, it may be appropriate to specialize the type bird with two exclusive subtypes such as bird_that_can_fly_when_adult_and_healthy and pm#bird_that_cannot_fly_when_adult_and_healthy).

Connecting the categories of specialized ontologies to a large lexical ontology seems to have several advantages: (i) it saves the ontology creators a lot of work and guides them since many distinctions that they would not have thought about are already recorded and inter-related, (ii) the constraints associated to those distinctions support some semantic checking by inference engines, and (iii) the lexical ontology acts as a large "hat" which documents and structures the specialized categories and the knowledge representations that use them, and makes it possible to compare and retrieve them. These are some of the reasons why WordNet has been used to extend some top-level ontologies, e.g. for the SUMO, for DOLCE in OntoWordNet, or in the MSO. On the WebKB site, some examples of domain ontologies represented by specializing WordNet categories and interconnecting the new categories between themselves are the representation of the ADFP9 glossary and the representation of the CADM model.