WordNet is a lexical database connecting English words/expressions to
categories representing their meanings. It can also be seen as an ontology for
natural language since the categories are connected by various kinds of
semantic links, e.g. generalization, similar, exclusion, member, part and substance.
To ease and guide knowledge re-use/sharing/retrieval/entering, I initialized the knowledge base (KB) of WebKB-2 with the content of WordNet 1.7 related to nouns: 108,000 nouns and 74,500 categories referred to by nouns (in accordance with my lexical recommendations, I ignored information regarding verbs, adverbs and adjectives).
A first problem was that, although WordNet categories have intuitive
names (English nouns or nominal expressions),
they do not have intuitive identifiers
(the WordNet API mainly uses numbers).
Intuitive identifiers are mandatory for permitting people to read, write and
update knowledge statements in text files, i.e. outside the graphical interface
of a particular tool. This is a minimal requirement for knowledge sharing/re-use
and also greatly simplifies the development of knowledge-based tools.
Hence, I designed an algorithm to create intuitive
identifiers for WordNet categories based on their names.
This algorithm combines various heuristics I learnt from many trials. Although
the final version worked quite well, I still had to update a few generated
category identifiers manually. This algorithm is detailed below.
A second problem was that WordNet has a poorly structured top-level
and does not always classify categories according to distinctions that are
important for the use of these categories in knowledge representations.
To permit this use and some semantic checks on it, I have inserted
WordNet top-level categories and some medium-level categories into a top-level
ontology synthesizing and complementing various other top-level ontologies.
This has led me to break a few WordNet generalization links.
Click here for rationales behind my top-level ontology. Click here
to explore the first specializations of my top-level concept type pm#thing,
and here to explore the specializations of my top-level relation type pm#relation.
A third problem was that WordNet conflates subtypeOf and instanceOf links
into generalization links or, in other words, does not distinguish types from
individuals (categories that cannot have subtypes or instances). This distinction
is important for knowledge checking. However, instanceOf links should not be
over-used: to avoid forcing arbitrary choices or to compensate for wrong choices,
WebKB-2 permits the use of certain types without quantifiers (i.e. as if they
were individuals) within statements.
I have isolated 6211 true individuals within WordNet 1.7.
Click here for a list.
A fourth problem was that WordNet 1.7 contains inconsistencies and
redundancies. Conversely, some categories for common English words are missing.
Click here for a list of my
semantic corrections (more than 300) and additions (more than 150).
It should be noted that most of the inconsistencies I corrected were
automatically detected thanks to the exclusive links in my top-level ontology
(and as mentioned above, the generalization of WordNet categories by categories
in my top-level ontology). Two kinds of links, equal ('=') and location ('l'),
had to be introduced to correct certain erroneous uses of the generalization link.
A fifth problem was that some categories did not have explicit enough names, or their ordering was not correct (category names in WordNet are ordered by decreasing frequency of use, but this ordering is generated from a few concordance files and therefore can be misleading). Click here for a list of the lexical modifications that I made to the WordNet ontology.
A sixth problem for knowledge representation is the lack of structure
in WordNet and the fact that many categories have a lexical rather than
semantic nature. Some structure was added via semantic links (the above-cited
161 additions). I also added sub-annotations at the beginning of some
category annotations, e.g. $(value)$ to represent the fact that
the category represents a value, and $(artificial)$ to represent
that it has a lexical nature and/or should not be used for knowledge
representation. Click here for a list of value/artificial sub-annotations.
Finally, it should be noted that the semantics of the links
part, member, substance and object in WordNet is not always clear or consistent.
For instance, does a part link from the category airplane to the category wing
mean that "any airplane has for part at least 1 wing",
"all airplanes have for part the same wing", "any wing is part of a plane",
"a wing is part of any plane", etc.?
For graph matching (and hence inferencing) in WebKB-2, I have assumed the
first interpretation is correct; however, this is just a heuristic.
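The difference between the first three readings can be made concrete on a toy model. The sketch below is only an illustration: the function names, the individuals (`a1`, `w1`, ...) and the `has_part` pairs are hypothetical and are not taken from WebKB-2 or WordNet. The fourth reading is left out because its quantification is itself ambiguous.

```python
# Toy model (hypothetical data): two airplanes, three wings.
# w3 is a spare wing that is not part of any airplane.
airplanes = {"a1", "a2"}
wings = {"w1", "w2", "w3"}
has_part = {("a1", "w1"), ("a2", "w2")}  # (whole, part) pairs


def each_whole_has_some_part(wholes, parts, has_part):
    """Reading 1: any airplane has for part at least 1 wing."""
    return all(any((w, p) in has_part for p in parts) for w in wholes)


def all_wholes_share_one_part(wholes, parts, has_part):
    """Reading 2: all airplanes have for part the same wing."""
    return any(all((w, p) in has_part for w in wholes) for p in parts)


def each_part_is_in_some_whole(wholes, parts, has_part):
    """Reading 3: any wing is part of a plane."""
    return all(any((w, p) in has_part for w in wholes) for p in parts)


print(each_whole_has_some_part(airplanes, wings, has_part))    # True
print(all_wholes_share_one_part(airplanes, wings, has_part))   # False
print(each_part_is_in_some_whole(airplanes, wings, has_part))  # False
```

On this model only the first reading holds, which shows that the readings are not interchangeable when a link is used for matching.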
I integrated WordNet 1.7 in January 2002. When representing knowledge between January and June, I sometimes made some updates to the key names of the WordNet categories, and occasionally corrected some links, but more and more rarely. The WordNet part of the KB (and my top-level ontology) can now be considered quite stable. Hence, the identifiers can be used by people in their own files, and support knowledge sharing.
It is best to explore and filter parts of the ontology of WebKB-2 via my Category Search tool. However, if you do need all or parts of the ontology, see the file that loads all the other input files, including the top-level ontology and the representation of WordNet (10.3Mb file). These files are up-to-date and in the FT format, which makes it easy to get a good understanding of the content of the ontology. Old versions of the whole ontology are also available in other formats.
More top-level ontologies, e.g. from the SUO Library and the DAML Library, will be incorporated into the WebKB-2 knowledge base.
This work has now been published.
It has been done to help principled and manual knowledge
representation. It is insufficient for the inter-operation
of fully automatic software agents, e.g. for e-commerce or database integration
purposes; this article by R. Colomb gives some of the reasons why
general automatic inter-operation
(not pre-programmed business-to-business inter-operation) is not going to
happen anytime soon.
My work is also far from sufficient to help knowledge-based automatic
natural language processing. One step in this direction
is provided by the ThoughtTreasure (TM) project and its downloadable resources.
The Cyc and OpenCyc projects should of course also be cited.
See also the pages about the Open Mind projects and the
Natural Language Processing group at USC/ISI.
WordNet connects words to categories representing the meanings of these words. Each category has at least one name (word) and each name may be shared by several categories (since a word may have several meanings). Category keys (or "key names") need to be chosen for uniquely representing categories. (I use the expression "key name" instead of "category identifier" because in WebKB an identifier for a category is generally composed of a user identifier, a key name, and optionally other names separated by "__").
In the WordNet API and database files, a category is referred to either by the
offset of its description in one of the database files
(e.g. the offset "12558316" for the category with names "Friday" and "Fri"), or
by sense indices, which are the names of the category with
suffixes that make them unique key names (e.g. "friday%1:28:00::" and "fri%1:28:00::";
the "1" after the "%" indicates that the name is a noun; the "28" is the number of
the lexicographer file containing this name; "00" is the order of the category in
the list of the categories sharing this name in this lexicographer file).
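As an illustration of the structure just described, here is a minimal sketch that splits such a string into its first fields. The names `SenseKey` and `parse_sense_key` and the field names are my own choices for this example, not part of the WordNet API; the two trailing fields, which are empty in the example above, are ignored.

```python
from typing import NamedTuple


class SenseKey(NamedTuple):
    name: str          # the (lower-cased) category name, e.g. "friday"
    ss_type: int       # 1 indicates a noun
    lex_filenum: int   # number of the lexicographer file
    lex_id: int        # order among categories sharing the name in that file


def parse_sense_key(key: str) -> SenseKey:
    """Split a string such as 'friday%1:28:00::' into its leading fields."""
    name, _, rest = key.partition("%")
    fields = rest.split(":")
    return SenseKey(name, int(fields[0]), int(fields[1]), int(fields[2]))


print(parse_sense_key("friday%1:28:00::"))
print(parse_sense_key("fri%1:28:00::").lex_filenum)  # 28
```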
Given that WebKB only stores categories representing the meaning of nouns
(i.e. categories having nouns as names), I could have adapted sense indices
to make relatively readable key names, e.g. #Friday-28 and #Fri-28.
However, I found that knowledge is not easy to
read or write when all the category identifiers have such suffixes.
Ideally, the key name of a category should look like one of the English words or expressions most commonly used for referring to what the category represents, and be unambiguous enough for a human reader to distinguish its meaning from the meanings of other categories. In WordNet, the most common name for a category is the first in the list of its names, but less ambiguous names may appear after. When one of the other names is a compound name beginning or ending with the first name (as "Steve_Martin" begins with "Steve" and ends with "Martin"), it constitutes a better choice for a key name than the first name.
Hence, here are the first rules (ordered by decreasing order of priority) that
I chose to generate key names:
1) when the 1st name of a category begins or ends one of the other names, select
this other name as key name (unless it is shared by another category without
generated key name yet);
2) select the 1st name of a category as key (unless it is shared by another
category without generated key name yet);
3) try the first two rules on the 2nd name instead of the 1st;
4) try the first two rules on the 3rd name instead of the 1st;
5) etc.
To respect the decreasing order of priority of these rules, I have scanned
the KB many times (each time, testing all remaining categories without key name),
allowing the test of a lower priority rule only when the application of
rules of greater priority did not lead to any more change. (The order of the rules
was also respected when testing each category). This may not be an efficient
approach but it was efficient enough given that WebKB-2 could scan the whole KB
quite quickly (0.45 seconds on average).
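The rules and the repeated scans can be sketched as follows. This is not the WebKB-2 code but a small reconstruction under stated assumptions: categories are given as a dictionary from a hypothetical category id to its ordered list of names, a compound name is recognized by an underscore separator, a name already used as a key is never reused, and once a lower priority rule has been allowed it stays allowed in later scans.

```python
def compound_of(names, i):
    """Rule 1 candidate: another name that begins or ends with names[i]."""
    base = names[i]
    for other in names:
        if other != base and (other.startswith(base + "_")
                              or other.endswith("_" + base)):
            return other
    return None


def assign_key_names(categories):
    """Map category ids to key names; ids left out could not be keyed."""
    keys = {}
    max_names = max((len(n) for n in categories.values()), default=0)
    # Rules by decreasing priority: (name index, try the compound name first?)
    rules = [(i, compound) for i in range(max_names)
             for compound in (True, False)]

    def usable(name, cat_id):
        if name is None or name in keys.values():
            return False
        # Rejected if shared by another category that has no key name yet.
        return not any(name in names for other, names in categories.items()
                       if other != cat_id and other not in keys)

    allowed = 1  # number of rules currently allowed
    while allowed <= len(rules):
        changed = False
        for cat_id, names in categories.items():
            if cat_id in keys:
                continue
            for i, compound in rules[:allowed]:
                if i >= len(names):
                    continue
                cand = compound_of(names, i) if compound else names[i]
                if usable(cand, cat_id):
                    keys[cat_id] = cand
                    changed = True
                    break
        if not changed:
            allowed += 1  # allow the next rule only once nothing changes
    return keys


example = {
    "c1": ["Martin", "Steve_Martin"],
    "c2": ["Martin", "Dean_Martin"],
    "c3": ["Friday", "Fri"],
    "c4": ["bank"],
    "c5": ["bank"],
}
print(assign_key_names(example))
```

In this example `c4` and `c5` share all their names, so neither receives a key name.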
The application of the first two rules (i.e. trying to use only the 1st name of
each category) permitted the assignment of key names to 75% of categories
(56074 out of 74488). The gradual use of the other category names raised this
to 84% of categories (62873 out of 74488). This means that each
category in the remaining 16% shared all its names with another category
(also in this 16%).
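The percentages follow directly from the counts given above; the short check below only redoes that arithmetic (the variable names are mine).

```python
total = 74488            # noun categories in the KB
after_first_name = 56074  # keyed using only the 1st name
after_all_names = 62873   # keyed after trying the other names too

remaining = total - after_all_names

print(f"{after_first_name / total:.1%}")  # 75.3%
print(f"{after_all_names / total:.1%}")   # 84.4%
print(remaining)                          # 11615
print(f"{remaining / total:.1%}")         # 15.6%
```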
To go further, I had to generate suffixes. I used numbers when I integrated
WordNet 1.6 but, when using categories in knowledge representations, I realized
that this option was not user-friendly enough and that a much clearer option
was to use the key name of the first supertype. Such suffixes often help people
to guess the meaning of a category without having to access all its supertypes.
However, I did not want to give a key name with a suffix to all the remaining
categories without key names. Hence, I added the following rules (by decreasing order
of priority and with a lower priority than the previous rules) to
select the categories to which key names with a suffix would be assigned:
- select the category with a frequency-of-use number far lower than the other
categories sharing all the same names (this number is given by WordNet and
represents the frequency of appearance of the category in a few concordance
documents; it is an indication but not of paramount importance; "far lower"
was first set at 30 and then to decreasing values);
- select the category with a far lower number of subtypes than the other categories
sharing all the same names (actually, in these last two rules, I used combinations
of gradually decreasing values of frequency-of-use and number of subtypes;
I also tried to reduce the assignment of suffixes to subtypes of #action,
as these types are more frequently used than the others
in knowledge representations).
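A rough sketch of these two selection rules is given below. Several points are assumptions made for the example only: "far lower" is read as a difference of at least `gap` from the next lowest value, the same gap is applied to both measures, the gap values after 30 are invented, and the function names and category ids are hypothetical. The actual combinations of thresholds that I used are not reproduced here.

```python
def far_lowest(scores, gap):
    """Return the id whose score is below all others by at least `gap`."""
    if len(scores) < 2:
        return None
    ordered = sorted(scores.items(), key=lambda item: item[1])
    (lowest_id, lowest), (_, runner_up) = ordered[0], ordered[1]
    return lowest_id if runner_up - lowest >= gap else None


def select_for_suffix(frequency, subtype_count, gaps=(30, 20, 10, 5)):
    """Pick, among categories sharing all their names, the one to suffix.

    frequency and subtype_count map category ids to numbers. For each gap
    (largest first), the frequency rule is tried before the subtype rule.
    """
    for gap in gaps:
        chosen = far_lowest(frequency, gap)
        if chosen is None:
            chosen = far_lowest(subtype_count, gap)
        if chosen is not None:
            return chosen
    return None


# c2 is used far less often than c1, so c2 would receive the suffix.
print(select_for_suffix({"c1": 40, "c2": 2}, {"c1": 5, "c2": 0}))  # c2
```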
After several additional scans of the KB with all the rules, there were still
a few dozen categories without key names. To fix this, I added
more precise names to these categories and/or re-ordered their names (some of
my lexical additions to WordNet come
from this phase). I also had to correct some attributions of suffixes and
some choices of key names. For example, #Republic_of_Singapore, instead of
#Singapore, had been selected as key name (in application of the 1st rule) but
#Singapore is a more convenient identifier, while the island of Singapore and
the capital of Singapore are better referred to via #Singapore.island and
#Singapore.capital than #Singapore.
To fix that, before re-running the key name assignment
procedure from scratch, I semi-automatically pre-assigned suffixes to many
categories, especially the specializations of #location.
For example, I added the suffixes ".capital", ".city", ".island", ".country",
and ".colony" to disambiguate many category names. However, instead of using
the generalizing category for creating the suffix, I sometimes followed the
partOf link. For example, in WordNet 1.7, #town has three instances whose only
name is "Bangor" but which are parts of different regions.
Hence, I named them #Bangor.Northern_Ireland, #Bangor.Wales and #Bangor.Maine.
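The construction of such suffixed key names can be sketched as follows. The decision to fall back on the partOf link exactly when several categories of the group have the same supertype is an assumption of this example (the choice was in fact partly manual), and the function name, the dictionary fields and the supertype values are hypothetical.

```python
def suffixed_key_names(name, group):
    """Build '#name.suffix' keys for categories that all share `name`.

    group is a list of dicts with a 'supertype' key name and, optionally,
    a 'part_of' key name.
    """
    supertypes = [c["supertype"] for c in group]
    keys = []
    for c in group:
        if supertypes.count(c["supertype"]) == 1:
            suffix = c["supertype"]      # the supertype alone disambiguates
        elif c.get("part_of"):
            suffix = c["part_of"]        # otherwise follow the partOf link
        else:
            raise ValueError(f"cannot disambiguate {name!r}")
        keys.append(f"#{name}.{suffix}")
    return keys


bangor = [
    {"supertype": "town", "part_of": "Northern_Ireland"},
    {"supertype": "town", "part_of": "Wales"},
    {"supertype": "town", "part_of": "Maine"},
]
print(suffixed_key_names("Bangor", bangor))

singapore = [{"supertype": "island"}, {"supertype": "capital"}]
print(suffixed_key_names("Singapore", singapore))
```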
I have not listed these manual and automatic additions of suffixes in
my lexical additions to WordNet. However,
you can click here for the
current list of 5944 WordNet categories that have been assigned a key name with a
suffix.