WebKB-2 was designed to let users store any kind of knowledge in a large shared knowledge base in a precise, intuitive and normalized way ("normalized" means fewer alternative ways of representing something, and hence more graph-matching possibilities). This requires expressive notations, and hence a rich data model. The notations and model of WebKB-2 are extensions of those of Conceptual Graphs (CGs); grammars and descriptions are available, as is an article comparing CGLF, CGIF, KIF, Frame-CG and Formalized-English. I will keep enriching the notations and data model of WebKB-2 as long as I encounter information (e.g. natural language sentences) that cannot yet be represented in an intuitive and normalized way. However, these notations and this data model are now quite stable. They have been published, are in the public domain, and may be re-used in any way (including commercially).
As opposed to notations, data models/structures are implementation dependent, and much more goal/application dependent than notations. Knowledge exchange and entry require high-level, expressive, normalized notations; knowledge exploitation requires tailored data structures. Hybrid solutions where the distinctions between notations and models are blurred, typically XML-based languages, may be sufficient for "data" but make "knowledge representation, comparison and exploitation" difficult to perform and debug. WebKB-2 is a large-scale KBMS oriented toward knowledge retrieval. As in DBMSs, the model used by WebKB-2 results from compromises between keeping the data structures small and easing the retrieval of the stored knowledge.
In many knowledge representation languages, especially in the terminological
logics of the 1990s, the notations and the model were restricted to permit
the inference engine to exploit all the entered knowledge in a sound
and complete way.
This approach of tailoring a language (notation and model) to the
possibilities of a particular inference engine is quite restrictive and leads
the knowledge providers to bias the knowledge to make it fit the model.
Common knowledge representation tools, even terminological logic tools such as
Loom, have features that do not permit the inference engine to use all the
information and be sound and complete.
With WebKB-2, the users are encouraged to represent knowledge as precisely
as possible, and re-use, "correct" or annotate other users' knowledge.
WebKB-2 does not exploit all the entered information for consistency checks
and information retrieval, just the usual pieces of information: relation
signatures, exclusion and subsumption links, graph structures (cycle detection,
graph matching, etc.).
Applications that require more information to be taken into account
may re-use the same knowledge with more complete or more domain-dependent
inference engines. This is not possible when the knowledge is biased to
fit a restricted model.
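The kinds of checks listed above (relation signatures, exclusion links and subsumption links) can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: `Ontology`, `Signature`, `withAncestors`, `subsumes`, `exclusive` and `accepts` are hypothetical names chosen for this example, not WebKB-2's classes or API, and types are plain strings.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified ontology; none of these names come from WebKB-2.
struct Signature {
    std::string domain;  // most general type allowed for the source node
    std::string range;   // most general type allowed for the destination node
};

struct Ontology {
    std::map<std::string, std::vector<std::string>> parents;   // subsumption links
    std::set<std::pair<std::string, std::string>> exclusions;  // exclusion links
    std::map<std::string, Signature> signatures;               // relation signatures

    // Returns `type` plus all its direct and indirect supertypes.
    // The `seen` set also guards against cycles in the subsumption links.
    std::set<std::string> withAncestors(const std::string& type) const {
        std::set<std::string> seen;
        std::vector<std::string> todo(1, type);
        while (!todo.empty()) {
            std::string current = todo.back();
            todo.pop_back();
            if (!seen.insert(current).second) continue;  // already visited
            auto it = parents.find(current);
            if (it != parents.end())
                todo.insert(todo.end(), it->second.begin(), it->second.end());
        }
        return seen;
    }

    // True if `general` is `specific` itself or one of its supertypes.
    bool subsumes(const std::string& general, const std::string& specific) const {
        return withAncestors(specific).count(general) > 0;
    }

    // Two types are exclusive if any pair of their (super)types is declared exclusive.
    bool exclusive(const std::string& a, const std::string& b) const {
        const std::set<std::string> left = withAncestors(a);
        const std::set<std::string> right = withAncestors(b);
        for (const std::string& x : left)
            for (const std::string& y : right)
                if (exclusions.count(std::make_pair(x, y)) ||
                    exclusions.count(std::make_pair(y, x)))
                    return true;
        return false;
    }

    // Signature check: both end types must specialize the declared ones.
    bool accepts(const std::string& relation, const std::string& sourceType,
                 const std::string& destType) const {
        auto it = signatures.find(relation);
        if (it == signatures.end()) return false;  // unknown relation type
        return subsumes(it->second.domain, sourceType) &&
               subsumes(it->second.range, destType);
    }
};
```

With "cat" below "animal", "table" below "artifact", and "animal" declared exclusive with "artifact", `exclusive("cat", "table")` holds, so a statement typing one object as both could be rejected. A richer inference engine could re-use the same links and add further checks.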
WebKB-2 is written in C++ and re-uses the OODBMS FastDB (or Gigabase if the knowledge base exceeds 1 Gb). The data model (or "database schema") is specified by C++ classes (although Gigabase also has APIs for Java and Perl). Information about relations between the field descriptors or the way they should be indexed is described via macros within the classes (as explained below).
The data model of WebKB-2 is mainly composed of five classes:
Term, TermName, User, Node and Relation.
The first three classes permit the storage of the "ontology" of the knowledge,
i.e. the names, identifiers and inter-relations of the objects (categories)
in the statements. The last two classes permit the storage of the statements.
WebKB-2's notions of category, category name, category identifier, user,
link (between categories or between a category and a name), statement, node
and relation are explained in the
documentation home page of WebKB-2.
A "term" is to be understood as a "formal term"; it is a synonym for a
"category" (i.e. an individual or a
concept/relation type).
Below is the C++/FastDB description of the classes required for the ontology
(without the declaration of the methods associated with the classes).
The classes dbArray, dbDateTime and dbReference are FastDB classes for,
respectively, a dynamic array, a date and an internal reference to an object
in the database.
The macro TYPE_DESCRIPTOR associates database descriptors
with class fields, via the following macros:
FIELD specifies that a field should not be indexed by FastDB.
KEY specifies which indexing technique to use.
RELATION specifies that two fields are in a dependency
relationship: 1-N, N-1, 1-1, or N-N (depending on whether the
source/destination field is a dbArray or not). FastDB exploits
this information to manage reverse links and, in some cases, optimize queries
to the database.
INDEXED_RELATION is a combination of KEY and RELATION.
RDBMSs and most deductive DBMSs have relation-based models (and hence at least one CG system has one too: Bernd Groh's RDBMS-based CG system; see his PhD thesis). Many first-order-logic-based systems have relation-based notations, e.g. KIF or CycL. The notations and models of frame-based systems and WebKB-2 are node-based (i.e. information such as quantifiers and relations is represented within the nodes). Theoretically, the two approaches are equivalent. However, for representing natural language sentences or similar general complex knowledge, I believe the node-based approach leads to much more intuitive linear notations (since, most often, no variable or arc is needed to represent relations between nodes) and to models that are easier to exploit (the information related to a node, e.g. its contexts, relations and quantifier, is directly accessible from it, instead of being distributed over many relations or small graphs). On the other hand, for specifying complex combinations of quantifications, the relation-based approach is easier to use. Hence, as a low-level interchange format (one that permits defining the semantics of higher-level notations), KIF's Lisp-like notation is adequate. However, for knowledge entering and exchange, a high-level format such as Frame-CG (FCG) is required. (More details are given in our comparison of CGLF, CGIF, KIF, Frame-CG and Formalized-English.)
In the data model of WebKB-2, the relations are sub-structures of the concept nodes, and a graph is just a particular concept node. The data structures can be seen as trees (although each relation is stored in both its source and destination nodes), and hence the scope of each quantifier (stored in each concept node) is well delimited (reminder: in FCG and Formalized-English (FE), the order of the concept nodes is used for defining the scope of the quantifiers). If a user provides a graph without specifying a unique identifier, i.e. if s/he does not encapsulate the graph into a concept node representing an individual (graph), WebKB-2 automatically generates this individual and node before storing the graph (in such a case, the "head node" of the provided graph is the first of the sub-nodes of the generated node).
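The node-based organization described above can be sketched in plain C++ as follows. This is a simplified illustration, not WebKB-2's actual schema: the types `Node`, `Relation` and `Store`, their fields, and the generated identifier format `graph#N` are all assumptions made for the example, and standard containers stand in for the FastDB classes.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Illustrative stand-ins only; these are not WebKB-2's actual classes or fields.
struct Node;

struct Relation {
    std::string type;   // relation type, e.g. "on"
    Node* source;
    Node* destination;
};

struct Node {
    std::string type;                  // concept type, or a generated graph identifier
    std::string quantifier;            // kept in the node itself, e.g. "some"
    std::vector<Relation*> relations;  // every relation this node takes part in
    std::vector<Node*> subnodes;       // non-empty when the node encapsulates a graph
};

class Store {
public:
    Node* addNode(const std::string& type, const std::string& quantifier) {
        Node n;
        n.type = type;
        n.quantifier = quantifier;
        nodes_.push_back(n);
        return &nodes_.back();
    }

    // Records one relation and makes it reachable from both of its end nodes.
    Relation* link(Node* source, const std::string& type, Node* destination) {
        assert(source != nullptr && destination != nullptr);
        relations_.push_back(Relation{type, source, destination});
        Relation* r = &relations_.back();
        source->relations.push_back(r);
        if (destination != source) destination->relations.push_back(r);
        return r;
    }

    // Wraps a graph given without an identifier into a generated node
    // whose first sub-node is the head node of that graph.
    Node* wrapGraph(Node* head, const std::vector<Node*>& others) {
        Node* graph = addNode("graph#" + std::to_string(++generated_), "");
        graph->subnodes.push_back(head);
        graph->subnodes.insert(graph->subnodes.end(), others.begin(), others.end());
        return graph;
    }

private:
    std::deque<Node> nodes_;          // deque: push_back keeps earlier pointers valid
    std::deque<Relation> relations_;
    int generated_ = 0;
};
```

Because each relation is reachable from both end nodes, everything known about a node (its quantifier, its relations, its sub-nodes) can be read starting from that node alone. That is the property the node-based approach relies on.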
In WebKB-2, graph retrieval is done by first accessing candidate nodes and then checking that the relations of each candidate node match the relations of the selected query node (if it has relations). The alternative, accessing candidate relations that match a relation in the query (if there are relations) and then checking whether the nodes match, seems less efficient if relation types are much more re-used than concept types (this is likely to be the case, at least for what WebKB-2 is intended for).
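The node-first strategy can be sketched as below. This is a deliberately reduced illustration, not WebKB-2's retrieval code: `KnowledgeBase`, `StoredNode`, `Rel` and `retrieve` are hypothetical names, a relation is reduced to its type and the type of its destination node, and matching is by exact type equality where a real matcher would presumably use subsumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal structures for illustration; not WebKB-2's data model.
struct Rel {
    std::string type;      // relation type
    std::string destType;  // concept type of the destination node
};

struct StoredNode {
    std::string type;            // concept type
    std::vector<Rel> relations;  // relations starting from this node
};

struct KnowledgeBase {
    std::vector<StoredNode> nodes;
    std::multimap<std::string, std::size_t> byType;  // concept type -> node index

    std::size_t add(const StoredNode& node) {
        nodes.push_back(node);
        byType.insert(std::make_pair(node.type, nodes.size() - 1));
        return nodes.size() - 1;
    }

    // Step 1: fetch candidate nodes through the concept-type index.
    // Step 2: keep the candidates that have every relation of the query node.
    std::vector<std::size_t> retrieve(const StoredNode& query) const {
        std::vector<std::size_t> hits;
        auto range = byType.equal_range(query.type);
        for (auto it = range.first; it != range.second; ++it) {
            const StoredNode& candidate = nodes[it->second];
            bool allMatched = true;
            for (const Rel& wanted : query.relations) {
                bool found = false;
                for (const Rel& have : candidate.relations) {
                    if (have.type == wanted.type && have.destType == wanted.destType) {
                        found = true;
                        break;
                    }
                }
                if (!found) { allMatched = false; break; }
            }
            if (allMatched) hits.push_back(it->second);
        }
        return hits;
    }
};
```

A query node without relations returns every node of its concept type, which corresponds to the "(if it has relations)" case above. Indexing by concept type pays off when concept types are more selective than relation types, as the text argues.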
WebKB-2 notations, models and conventions discourage the use of non-binary relations (for precision and normalization reasons). Strictly speaking, FCG and FE do not permit non-binary relations; however, FCG permits entering "functions" (or functional relations) in their usual format and without restriction on their number of parameters; this feature may be used for representing non-binary relations, but doing so is strongly discouraged. The graph-matching mechanisms of WebKB-2 do not yet compare normal relations with "functions".