After noting that neither informal documents nor fully formal knowledge bases are good media for people to share, compare or discuss technical knowledge, we propose mechanisms to support the sharing, re-use and cooperative update of semi-formal semantic networks, and to assign values to contributions and credits to their contributors. We then propose ontological elements to guide and normalize the construction of such knowledge repositories, and an approach permitting the comparison of tools or techniques.
The majority of technical information is currently published in mostly unstructured form within documents such as articles, e-mails and user manuals. Therefore, finding and comparing tools or techniques to solve a problem is a lengthy process (most often with sub-optimal results) that involves reading many documents partly redundant with each other. This process relies heavily on human memory and manual cross-checking, and its outcomes, even if published, are lost to many people with similar goals. Writing documents is also a lengthy process that involves summarizing what has been described elsewhere and making choices and compromises on which ideas or techniques to describe and how: level of detail, order, etc.
To sum up, whatever the field of study, there is currently no well-structured semantic network of techniques or ideas that a Web user could (i) navigate to get a synthetic view of a subject or, as in a decision tree, quickly find their path to relevant information, and (ii) easily update to publish a new idea (or a new way to explain an idea) and link it to other ideas via semantic relations. Such a system is indeed difficult to build and initialize. However, it is part of a vision for a semi-formal "standardized online archive of computer science knowledge" (Smith, 1998) and is dwarfed by the much more ambitious visions of a "Digital Aristotle" which would be capable of teaching much of the world's scientific knowledge by (i) adapting to its students' knowledge and preferences (Hillis, 2004), and (ii) preparing and answering (with explanations) test questions for them (this implies the encoding of the knowledge in a formal way and meta-reasoning on the problem-solving strategies). The current approaches related to the above-cited problems can be divided into three groups.
First, the approaches indexing (parts of) documents by metadata, such as Dublin Core metadata, DocBook metadata, topics (generated, e.g. by Latent Semantic Analysis or other keyword co-occurrence analysis techniques, or manually decided as in the Open Directory Project and the topic hierarchies of Yahoo), categories from ontologies (e.g. WordNet or a specialized lightweight ontology as in the KA2 project and some other Semantic Web projects), or more rarely, formal summaries (e.g. in Conceptual Graphs). These approaches are useful for retrieving or exploiting a large base of documents (e.g. Iridescent helps researchers and companies find keyword relationships within the 13 million abstracts of the MEDLINE database) but do not lead to any browsable/updatable semantic network synthesizing and comparing facts or ideas. The same can be said about most document-related query answering systems (e.g. those exploiting Natural Language Understanding techniques).
Second, the approaches aiming to represent elements of a domain in formal or semi-formal knowledge bases (KBs). Examples are Open GALEN (an ontology of medical knowledge), the KBs of Fact Guru (one on Canadian Animals, one on Astronomy, one on Java, and one on Object-Oriented Software Engineering), the QED Project (which aims to build a formal KB of all important, established mathematical knowledge), and the KBs of the Halo project (the long-term goal of this project is the design of a "Digital Aristotle"; in its first phase, three research teams each represented the content of a 70-page Chemistry textbook and used this KB to design a system that answers questions from an Advanced Placement exam and explains the provided answers). The first two KBs are essentially term definitions (formal in Open GALEN, semi-formal in Fact Guru) and hence are interesting to re-use for representing or indexing parts of documents, but would be insufficient for learning about a domain or finding and comparing techniques to solve a problem. The last two are formal and interesting for automatic inferencing but are not meant to be directly read and browsed.
Third, the hypertext-based Web sites describing and organizing or comparing resources (researchers, discussion lists, journals, concepts, theories, tools, etc.) of a domain, e.g. MathWorld and the American Mathematical Society. Some Web sites permit their users to collaborate or discuss by adding or updating documents, e.g. via wiki systems or annotation systems. Because these systems do not use semantic relations, the resulting information is often as poorly structured as in mailing lists and hence includes many redundancies, and arguments and counter-arguments for an idea are hard to find and compare. However, Wikipedia, an on-line hypertext encyclopedia which is also a wiki, albeit a very controlled one, has good-quality articles on a wide variety of domains. These articles are well connected and permit their readers to get an overview of a subject and explore it to find information. Yet, Wikipedia's content structure and support for collaboration and IR could be improved. An easy-to-use and easy-to-implement feature would be support for certain semantic relations (e.g. subtypeOf, instanceOf, and partOf) and especially argumentation relations (e.g. proof, example, hypothesis, argument, correction), e.g. as in pre-Web hypertext systems like AAA (Schuler & Smith, 1990), but also allowing the introduction and use of additional relations in an ontology. A semi-formal English-like syntax such as ClearTalk (the notation used in CODE4 and Fact Guru) would support more knowledge processing while still being user-friendly.
It would be utopian to think that even motivated knowledge engineers would be (in the near future) able or willing to represent their research ideas completely in a formal, shared, well-structured, readable semantic network that can be explored like a decision tree: there are too many things to enter, too many ways to describe or represent the same thing, and too many ways to group and compare these things. On the other hand, representing the most important structures in such a semantic network and interconnecting them with informal representations seems achievable and extremely interesting for education and IR purposes. Section 2 proposes some mechanisms to support the sharing, re-use and cooperative update of such semantic networks, including some mechanisms to assign values to the contributions and credits to the contributors. Section 3 proposes some ontological elements to guide and normalize the construction of these knowledge repositories. Section 4 shows an approach to permit the comparison of tools or techniques. The domain of ontology-related tools is used as an example.
Here, we only consider asynchronous cooperation since it both underlies, and is more scalable than, the exchange of information between co-temporal users of a system.
The most decentralized knowledge sharing strategy is the one the W3C envisages for the "Semantic Web": many small ontologies, more or less independently developed and thus partially redundant, competing and very loosely interconnected. Hence, these ontologies have problems similar to those we listed for documents: (i) finding the relevant ontologies, choosing between them and combining them require common sense (and hence are difficult and sub-optimal tasks even for a knowledge engineer, let alone for a machine), (ii) a knowledge provider cannot simply add one concept or statement "at the right place" and is not guided by a large ontology (and a system exploiting it) into providing precise concepts and statements that complement existing ones and are more easily re-used, and (iii) the result is not only more or less lost to others but increases the amount of "data" to search.
A more knowledge-oriented strategy is to have a knowledge server permitting registered users to access and update a single large ontology on a domain and upload files mixing natural language sentences with knowledge representations (e.g. in a controlled language). WebKB-1, WebKB-2, OntoWeb/Ontobroker and Fact Guru are examples of servers allowing this. This was also the strategy used in the well publicized KA2 project (Benjamins & al., 1998) which re-used Ontobroker and aimed to let Knowledge Acquisition researchers index their resources, but (i) the provided ontology was extremely small (more details in Section 3.1) and could not be directly updated by users, and (ii) the formal statements had to be stored within an invented attribute (named "onto") of HTML hyperlink tags via a poorly expressive language. Thus, the approach was limiting, which may be one of the reasons why this project achieved limited success.
We know of only two knowledge servers having special protocols to support cooperation between users: Co4 and WebKB-2 (note: most servers, including WebKB-2, support concurrency control (e.g. via KB locking) and several, like Ontolingua, support users' permissions on files/KBs; cooperation support is not so basic: it is about helping knowledge re-use, conflict prevention and the solving of each conflict once it has been detected by the system or a user). The approach of Co4 is based on peer-reviewing; the result is a hierarchy of KBs, the uppermost ones containing the most consensual knowledge while the lowermost ones are the private KBs of contributing users. We believe the approach of WebKB-2 which has a KB shared by all its users leads to more relations between categories (types or individuals) or statements from the different users and may be easier to handle (by the system and the users) for a large amount of knowledge and large number of users. Details can be found in Martin (2003a) but here is a summary of its principles.
To avoid lexical problems, each category identifier is prefixed by a short
identifier of its creator (who is also represented by a category and thus may have
associated statements). Each statement also has an associated creator and
hence, if it is not a definition, may be considered as a belief. Both this
namespace mechanism and the embedding of statements can be seen as ways to
represent explicit modules, i.e. modules that can be reasoned upon (as opposed
to file-based modules).
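For instance, the creator prefix of a category identifier such as pm#corrective_restriction can be recovered mechanically. The following helper is purely illustrative (it is not WebKB-2 code) and assumes the "creator#name" convention described above:

```python
# Illustrative helper (not actual WebKB-2 code): split a category identifier
# such as "pm#corrective_restriction" into its creator prefix and category name.
def split_identifier(category_id):
    creator, _, name = category_id.partition("#")
    return creator, name
```

Such a split is what allows statements to be attached to the creator's own category, since each creator is also represented in the KB.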
Any object (category or statement) may be re-used by any user within his/her
statements.
The removal of an object can only be done by its creator but a user may
"correct" a belief by connecting it to another belief via a
"corrective relation" (e.g. pm#corrective_restriction).
(Definitions cannot be corrected since they cannot be false.)
If entering a new belief introduces a redundancy or an inconsistency that is
detected by the system, it is rejected. The user may then either correct his/her
belief or re-enter it, connected by a "corrective relation" to each
belief it is redundant or inconsistent with: this allows and makes explicit
the disagreement of one user with (her interpretation of) the belief of another
user. This also technically removes the cause of the problem: a proposition A
may be inconsistent with a proposition B but a belief that
"A is a correction of B" is not technically inconsistent with a belief in B.
(Definitions from different users cannot be inconsistent with each other,
they simply define different categories/meanings;
a system of "category cloning" could be used to handle this situation
automatically but the resulting ontology would be much more complex than
via the manual handling of the situation by each category creator that is
occasionally faced with it; hence, such a system has not been implemented in
WebKB-2).
Choices between beliefs may have to be made by people re-using the KB for an
application but then they can exploit the explicit relations between beliefs,
e.g. by always selecting the most specialized ones. The query engine of WebKB-2
always returns a statement with its meta-statements, hence with the
associated corrective relations. Finally, in order to avoid seeing the objects
of certain creators during browsing or within query results, a user may set
filters on these creators, based on their identifiers, types or descriptions.
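The protocol above can be sketched in a few lines. This is a hypothetical illustration, not WebKB-2's actual implementation: the Belief and SharedKB names, the conflict-detection callback and the corrections dictionary are all assumptions made for the example.

```python
# Sketch (assumed data model) of the cooperation protocol: each object records
# its creator, removal is restricted to the creator, and a belief conflicting
# with existing beliefs is rejected unless it is explicitly connected, via a
# corrective relation, to each belief it conflicts with.
from dataclasses import dataclass, field

@dataclass(eq=False)          # identity-based equality/hash, one object per belief
class Belief:
    text: str
    creator: str              # creator prefix, e.g. "pm"
    # corrective relations to the beliefs this one corrects, e.g.
    # {"corrective_restriction": [other_belief, ...]}
    corrections: dict = field(default_factory=dict)

class SharedKB:
    def __init__(self, detect_conflict):
        self.beliefs = []
        self.detect_conflict = detect_conflict   # (new, old) -> bool

    def add(self, belief):
        conflicts = [b for b in self.beliefs if self.detect_conflict(belief, b)]
        corrected = {b for bs in belief.corrections.values() for b in bs}
        if any(b not in corrected for b in conflicts):
            # the cause of the problem is only removed once the disagreement
            # is made explicit via corrective relations
            raise ValueError("conflict detected: add a corrective relation")
        self.beliefs.append(belief)

    def remove(self, belief, user):
        if belief.creator != user:               # only the creator may remove
            raise PermissionError("only the creator can remove an object")
        self.beliefs.remove(belief)
```

As in the text, a belief "A is a correction of B" can then coexist in the KB with a belief in B, since the corrective relation itself is not inconsistent with B.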
For the construction of knowledge repositories, an interesting aspect of
this approach to encourage re-use, precision and object connectivity is that it
also works for semi-formal KBs. Here, regarding a statement,
semi-formal means that if it is written in a natural language
(whether it uses formal terms or not)
it must at least be related to another statement by a formal relation,
e.g. a generalization relation (pm#corrective_generalization,
pm#summary, etc.) or an argumentation relation.
Thus, to minimize redundancies and to help information retrieval within
information repositories, this minimal semantic structure (which in many cases
is the only one bearable by many persons) could be used to organize ideas that
are otherwise repeated in many documents. For instance, for a Web site that
centralizes and organizes/represents in a formal, semi-formal and
informal way resources (tools, techniques, publications, mailing list, teams,
etc.) related to a domain, it would be very interesting to have some space where
discussions could be conducted in this minimal semi-formal way, and hence
index or partly replace the mailing list: this would help avoid recurring
discussions or presentations of arguments, show the tree of arguments and
counter-arguments for an idea, permit incremental additions, encourage deeper
or more systematic explorations of each idea, and record the various reached
status quos.
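The semi-formality constraint described above can be checked mechanically. The following sketch is illustrative only (the function name and the triple-based representation of relations are assumptions): a repository is semi-formal if every natural-language statement participates in at least one formal relation.

```python
# Illustrative check (not actual WebKB-2 code) that a repository is at least
# "semi-formal": every statement must be linked to some other statement by at
# least one formal relation (generalization, argumentation, etc.).
def unrelated_statements(statements, relations):
    """statements: iterable of statement ids;
    relations: iterable of (source_id, relation_name, target_id) triples.
    Returns the statements violating the semi-formality constraint."""
    linked = set()
    for src, _rel, dst in relations:
        linked.add(src)
        linked.add(dst)
    return [s for s in statements if s not in linked]
```

A system enforcing this constraint would refuse (or flag) any statement returned by such a check until its author connects it to the rest of the repository.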
Below is an extract from
the beginning of a semi-formal discussion about the
statement "a Knowledge Representation Language (KRL) should (also) have an
XML notation to ease knowledge sharing".
This example
shows how three important constructs can be represented:
the relation from a statement, the relation on a relation (or more exactly,
the relation from a statement connecting two statements), and the conjunctive
set of statements. These constructs are important for representing structured
discussions even though few argumentation-oriented hypertext systems offer them
(ArguMed
is one of the exceptions; see also analyses of
Toulmin's argumentation structures).
Notes.
1) The following structures are not expected to be the direct result of a
discussion but they may be the result of a semi-automatic re-organization of
discussions and then they may be refined by further semi-formal discussions,
2) Relations such as "specialization" or "corrective_restriction" may seem
odd to use between informal statements but they are essential for checking the
updates of the argumentation structures and hence guiding or exploiting them;
specialization relations are used in several argumentation systems
(for example, the (counter-)arguments for a statement are valid for its
specializations and the (counter-)arguments of the specializations are
(counter-)examples for their generalizations);
3) The author of each statement (and hence also of each relation between statements)
is not shown below but is recorded (the next section illustrates an exploitation
of this and other meta-information),
4) Each of the statements can be re-used independently in various structures
and hence cannot refer to some other statement implicitly (the keyword
this used below is a shortcut that is automatically
generated by the system when displaying a structure: the actual statements
do not contain such a shortcut),
5) The statements do not systematically begin with a capital letter, in order
not to limit their re-use; for example, if parts of these structures are directly
re-used to generate English sentences, the problem of converting (or not) the
initial uppercase into a lowercase does not have to be solved.
"a KRL (Knowledge Representation Language) can have an XML notation" extended_specialization: "a KRL should have an XML notation" (pm), argument: ("the data model of a KRL can be stored into a tree-based structure" argument: - "a graph-based model can be stored into a tree-based structure" (pm) - "the data model of a KRL has to be graph-based" (pm) )(pm); "a KRL should (also) have an XML notation" specialization: "the Semantic Web KRL should have an XML notation" (pm), argument: "an XML notation permits a KRL to use URIs and Unicode" (fg, objection: ("most syntaxes can easily be adapted to have object identifiers using URIs and Unicode" argument_by_authority: "this was noted by Berners-Lee" (pm) )(pm)), argument: "XML can be used for knowledge exchange or storage" (fg, objection: "XML is useless or detrimental for knowledge representation, exchange or storage" (pm)), argument: "a KRL may have various notations in addition to an XML-based notation" (pm, objection: "the more notations there are the less one of them is going to be commonly adopted for knowledge exchange" (pm)), argument: "not using XML for a notation implies that a plug-in has to be installed for each syntax" (pm, objection: "XML tools need to be complemented for the semantics of the knowledge representation to be handled" (pm), objection: "installing a plug-in is likely to take less time than always loading XML files" (Sowa)); "the data model of a KRL has to be graph-based" argument_by_popularity: "this is acknowledged by about everyone" (pm), argument_by_authority: "this is acknowledged by the W3C" (pm); "XML can be used for knowledge exchange or storage" argument: - "an XML notation permits classic XML tools (parsers, XSLT, ...) 
to be re-used" (pm) - "classic XML tools are usable even if a graph-based model is used" (pm); "classic XML tools are usable even if a graph-based model is used" specialization: "classic XML tools work on RDF/XML" (pm); "XML is useless or detrimental for knowledge representation, exchange or storage" argument: ("using XML tools for KBSs is a useless additional task" argument: "KBSs do not use XML internally" (pm, objection: "XML can be used for knowledge exchange or storage" (fg, objection: "it is as easy to use other formats for knowledge exchange or storage" (pm), objection: "a KBS (also) have to use other formats for knowledge exchange or storage" (pm))) )(pm), argument: "XML is not a good format for knowledge exchange or storage" (pm); "XML is not a good format for knowledge exchange or storage" argument: - ("XML-based knowledge representations are hard to understand" argument_by_popularity: "this is acknowledged by about everyone" (pm), argument_by_authority: "this is acknowledged by the W3C" (pm) )(pm) - "a knowledge interchange format should be easy to read and understand with a simple editor, by trained people" (pm);
My home page for structured discussions gives access to other examples.
We shall have to do many experiments to see if most of the content of mails can be directly organized into an argumentation tree for each idea and thus permit people to compare and evaluate arguments and counter-arguments (this is very difficult when they are spread across many emails), or if the result will still be difficult to follow and useless because the participants have different goals, assumptions or terminologies (e.g. many discussions on the CG and SUO lists occur because some Semantic Webers use words such as "knowledge", "semantic" and "logic inferencing" when, for the same referred concepts, others would only use words such as "data", "structured" and "data exploitation"). Thus, it may be that the above approach requires or leads to deeper discussions (and possibly using some formal terms) and hence that most of the content of mails cannot be directly organized.
The above described knowledge sharing mechanism of WebKB-2 records and exploits annotations by individual users on statements but does not record and exploit any measure of the "usefulness" of each statement, a value representing its "global interest", acceptance, popularity, originality, etc. Yet, this seems interesting for a knowledge repository and especially for semi-formal discussions: statements that are obvious, un-argued, or for which each argument has been counter-argued, should be marked as such (e.g. via darker colors or smaller fonts) in order to make them less visible (or invisible, depending on the selected display options) and discourage the entering of such statements. More generally, the presentation of the combined efforts from the various contributors may then take into account the usefulness of each statement. Furthermore, given that the creator of each statement is recorded, (i) a value of usefulness may also be calculated for each creator (and displayed), and (ii) in return, this value may be taken into account to calculate the usefulness of the creator's contributions; these are two additional refinements to both detect and encourage argued and interesting contributions, and hence regulate them.
Ideally, the system would accept user-defined measures of usefulness for a statement or a creator, and adapt its display of the repository accordingly. Below, we present the default measures that we shall soon implement in WebKB-2 (or more exactly, its successor and open-source version, AnyKB). We may try to support user-defined measures but since each step of the user's browsing would imply dynamically re-calculating the usefulness of all statements (except those from WordNet) and all creators, the result is likely to be very slow. For now, we only consider beliefs: we have not yet defined the usefulness of a definition.
To calculate the usefulness of a belief, we first associate two more basic
attributes to the belief:
1) its "state of confirmation" and 2) its "global interest".
1) The first is equal to 0 if the belief has neither an argument nor a counter-argument
connected to it (examples of counter-argument relation names:
"counter-example", "counter-fact", "corrective-specialization").
It is equal to 1 (i.e. the belief is "confirmed") if
(i) each argument has a state of confirmation of 0 or 1, and
(ii) there exists no confirmed counter-argument.
It is equal to -1 if the belief has at least one confirmed counter-argument.
It is also equal to 0 in the remaining case: no confirmed counter-argument but
each of the arguments has a state of confirmation of -1.
All this is independent of who authored the (counter-)arguments.
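The rules above can be transcribed directly as a recursive function. The Belief data model below is an assumption made for the example (WebKB-2's internal representation is not specified here):

```python
# A transcription of the "state of confirmation" rules: 1 means confirmed,
# -1 means refuted by a confirmed counter-argument, 0 means undetermined.
from dataclasses import dataclass, field

@dataclass
class Belief:
    text: str
    arguments: list = field(default_factory=list)          # supporting beliefs
    counter_arguments: list = field(default_factory=list)  # attacking beliefs

def confirmation(belief):
    if not belief.arguments and not belief.counter_arguments:
        return 0                                   # no (counter-)arguments at all
    if any(confirmation(c) == 1 for c in belief.counter_arguments):
        return -1                                  # at least one confirmed counter-argument
    if all(confirmation(a) >= 0 for a in belief.arguments):
        return 1                                   # no argument is itself refuted
    return 0                                       # no confirmed counter-argument,
                                                   # but every argument is refuted
```

Note that, as in the text, the computation ignores who authored the (counter-)arguments; it only looks at the argumentation structure.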
2) Each user may give a value to the interest of a belief, say between -5 and 5
(the maximum value that the creator of the belief may give is, say, 2).
Multiplied by the usefulness of the valuating user, this gives an "individual
interest" (thus, this may be seen as a particular multi-valued vote).
The "global interest" of a belief is defined as the average of its individual
interests (thus, this is a voting system where more competent people in the
domain of interest are given more weight).
A belief that does not deserve to be visible, e.g. because it is clearly a
particular case of a more general belief, is likely to receive a negative
global interest.
We prefer letting each user explicitly give an interest value rather than
taking into account the way the belief is generalized by or connected to
(or included in) other beliefs because interpreting an interest from such
relations is difficult. For example, a belief that is used as a counter-example
may be a particular case of another belief but is nevertheless very interesting
as a counter-example.
Finally, the usefulness of a belief is equal to the value of the global interest if the state of
confirmation is equal to 1, and otherwise is equal to the value of the
state of confirmation (i.e. -1 or 0: a belief without argument has no
usefulness, whether it is itself an argument or not).
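Putting the two attributes together, the default measure can be sketched as follows. The function and parameter names are assumptions for illustration; the weighting of each vote by the voter's own usefulness and the fallback to the state of confirmation follow the definitions above:

```python
# Sketch of the default usefulness measure for a belief: the global interest
# is the average of the individual interests (each vote weighted by the
# voter's own usefulness), and the belief's usefulness falls back to its
# state of confirmation (-1 or 0) when the belief is not confirmed.
def global_interest(votes, voter_usefulness):
    """votes: {voter: interest value in [-5, 5]};
    voter_usefulness: {voter: usefulness of that user}."""
    if not votes:
        return 0.0
    individual = [interest * voter_usefulness.get(voter, 0.0)
                  for voter, interest in votes.items()]
    return sum(individual) / len(individual)     # weighted voting system

def belief_usefulness(state_of_confirmation, interest):
    # interest only counts once the belief is confirmed
    return interest if state_of_confirmation == 1 else state_of_confirmation
```

In this scheme a belief without arguments (state 0) has no usefulness regardless of how highly it is rated, which is exactly the incentive the text describes.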
In argumentation systems, it is traditional to give a type to each
statement, e.g. fact, hypothesis, question, affirmation, argument, proof.
This could be used in our repositories too (even though the connected relations
often already give that information) and we could have used it as a factor to
calculate the usefulness (e.g. by considering that an affirmation is worth
more than an argument) but we prefer a simpler measure only based on
explicit valuations by the users.
Our formula for a user's usefulness is:
  sum of the usefulness of the beliefs from the user
  + square root of (number of times the user voted on the interest of beliefs).
The second part of this equation acknowledges the participation of the
user in votes while decreasing its weight as the number of votes increases.
(Functions decreasing more rapidly than the square root might better balance
originality and participation effort.)
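The formula transcribes directly (the function name is an assumption for illustration):

```python
import math

# The user-usefulness formula above: the sum of the usefulness of the user's
# beliefs, plus a participation bonus that grows only as the square root of
# the number of interest votes the user has cast.
def user_usefulness(belief_usefulnesses, vote_count):
    return sum(belief_usefulnesses) + math.sqrt(vote_count)
```

The square root makes each additional vote worth less than the previous one, so pure voting activity cannot dominate the usefulness earned by authoring well-argued beliefs.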
These measures are simple but should incite the users to be careful and precise in their contributions (affirmations, arguments, counter-arguments, etc.) and give arguments for them: unlike in traditional discussions or anonymous reviews, careless statements here penalise their authors. Thus, this should lead users not to make statements outside their domain of expertise or without verifying their facts. (Using a different pseudonym when providing low-quality statements does not seem to be a helpful strategy to escape the above approach since this reduces the number of authored statements for the first pseudonym.) On the other hand, the above measures should hopefully not lead "correct but outside-the-main-stream contributions" to be under-rated since counter-arguments must be justified. Finally, when a belief is counter-argued, the usefulness of its author decreases, and hence he/she is incited to deepen the discussion or remove the faulty belief.
In his description of a "Digital Aristotle", Hillis (2004) describes a "Knowledge Web" to which researchers could add ideas or explanations of ideas "at the right place", and suggests that this Knowledge Web could and should "include the mechanisms for credit assignment, usage tracking, and annotation that the [current] Web lacks", thus supporting a much better re-use and evaluation of the work of a researcher than via the current system of article publishing and reviewing. However, Hillis does not give any indication on such mechanisms. Although the mechanisms we proposed in this sub-section and the previous one were intended for one knowledge repository/server, they seem usable for the Knowledge Web too. To complement the approach with respect to the Knowledge Web, the next sub-section proposes a strategy to achieve knowledge sharing between knowledge servers.
Again, an alternative (or, in the long term, complementary) approach is that of Co4 which, via its hierarchy of KBs generated by peer-reviewing of statements from the users' private KBs, supports knowledge sharing and makes explicit various consensuses. However, assuming there are N statements shared by the users of Co4, in the worst case there could be 2^N possible KBs if the protocols accept all groupings. Even though this is surely not the case, which KBs should a person look at for finding relations between statements or evaluating the usefulness of a statement/author? Furthermore, the uppermost KBs only represent consensus, not usefulness.
Although independently developed, our approach appears to be an extension of the version of SYNVIEW designed in 1985. In this hypertext system, statements had to be connected by (predefined or user-invented) relations and each statement was valuated by users (this value, and another one calculated from the values of arguments and counter-arguments for the statement, were simply displayed near the statement so as to "summarize the strengths assigned to the various items of evidence within the given contexts"). In 1986, to ease information entering and thus hopefully permit the collaborative work of a small community to create an information repository large enough to interest other people and lead them to participate and store information too, the authors of SYNVIEW removed the constraint of using explicit relations between statements (the statements must be organized hierarchically but the relations linking them are unknown) and replaced the possibility of grading each statement by the possibility of ranking it within the list of (sibling) statements having a same direct super-statement. A similar move away from structured representations was made by Shum, Motta and Domingue (1999) for the same reason and the idea of making the approach more "scalable". Although such a move clearly makes information entering easier, in our view it makes the system far less likely to scale because the information is far less retrievable and exploitable, and hence of less interest for people to search or complement. Such moves have apparently failed to attract more interest than the original, more structured approaches. Since unstructured approaches have strong inherent limitations, we are opting for a move towards improving the entering and sharing of structured forms.
One knowledge server cannot support the knowledge sharing of all researchers. It has to be specialized or to act as a broker for more specialized servers. If competing servers had an equivalent content (today, Web search engines already have "similar" content), a Web user could query or update any general server and, if necessary, be redirected to use a more specialized server, and so on recursively (at each level, only one of the competing servers has to be tried since they mirror each other). If a Web user directly tried a specialized server, it could redirect him/her to use a more appropriate server or indicate which other servers may provide more information for his/her query (or directly forward this query to these other servers).
To permit this, our idea is that each server periodically checks
related servers (more general servers, competing servers
and slightly more specialized servers) and
1) integrates (and hence mirrors) all the objects (categories and
statements) generalizing the objects in a reference collection that
it uses to define its "domain" (if this is a general server, this collection
is reduced to pm#thing, the uppermost concept type),
2) integrates either all the objects that are more specialized than the
objects in the reference collection, or if a certain depth of specialization
is fixed, associates to its most specialized objects the URLs of the servers
that can provide specializations for these objects (note: classifying servers
according to fields/domains is far too coarse to index/retrieve knowledge
from distributed knowledge servers, e.g. knowledge about "neurons" or "hands"
can be relevant to many domains; thus, a classification by objects is
necessary), and
3) also associates the URLs of more general servers to the direct
specializations of the generalizations of the objects in the reference
collection (this is needed since the specializations of some of these
specializations do not generalize nor specialize the objects in the
reference collection).
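The periodic mirroring policy can be sketched as follows. This is a rough illustration under assumed names (the mirror function, the remote server interface and its generalizations_of/specializations_of methods are not an actual protocol): a server always integrates the generalizations of its reference collection and, beyond a fixed specialization depth, only records which server can provide more specialized objects.

```python
# Illustrative sketch (all names assumed) of steps 1 and 2 of the mirroring
# policy: integrate every object generalizing the reference collection, and
# either integrate specializations too or, past a fixed depth, record the URL
# of the server that can provide them.
def mirror(local_kb, remote, reference, max_depth=None):
    """local_kb: set of objects; reference: objects defining this server's
    "domain"; remote: a related server exposing generalizations_of(obj) and
    specializations_of(obj) -> [(object, depth), ...]."""
    for obj in reference:
        # 1) always mirror all generalizations of the reference collection
        local_kb.update(remote.generalizations_of(obj))
        # 2) mirror specializations down to max_depth (if one is fixed);
        #    beyond that, only record where specializations can be found
        for spec, depth in remote.specializations_of(obj):
            if max_depth is None or depth <= max_depth:
                local_kb.add(spec)
            else:
                local_kb.add((spec, remote.url))   # pointer, not the object
    return local_kb
```

As the text notes, indexing by objects rather than by fields/domains is what makes the redirection precise: knowledge about "neurons" can be reached regardless of which domain-level server a user first queries.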
Integrating knowledge from other servers is certainly not obvious but this is a more scalable and exploitable approach than letting people and machines select and re-use or integrate dozens or hundreds of (semi-)independently designed small ontologies. A more fundamental obstacle to the widespread use of this approach is that many industry-related servers are likely to make it difficult or illegal to mirror their KBs. However, other approaches will likely suffer from that too.
By default, the shared KB of WebKB-2 includes an ontology derived from the noun-related part of WordNet and various top-level ontologies (Martin, 2003b). A large general ontology like this is necessary to ease and normalize the cooperative construction of knowledge repositories but is still insufficient: an initial ontology on the domain the repository will be dedicated to is also necessary. As a proof of concept for our tools to support a cooperatively-built knowledge repository, we initially chose to model two related domains: (i) Conceptual Graphs (CGs), since this domain is the most well known to us, and (ii) ontology-related tools, since Michael Denny's "Ontology editor survey" attracted some interest, or at least the idea did, because the result was frustratingly superficial and poorly structured, and hence did not permit comparing the tools and was probably misleading for non-specialists (indeed, ontology tool authors -- including us, regarding WebKB-2 -- were given a short list of rather vague criteria to use, and the answers, instead of being analysed and used to construct various tables comparing similar tools on the same criteria, were abbreviated and directly put into one big table).
Modelling these two domains implies partially modelling other related domains, and we soon faced the problem of modularizing the information into several files to support readability, search, checking and systematic input. These files are also expected to be updatable by users once our knowledge-oriented wiki is completed. Although the users of WebKB-2 can directly update the KB one statement at a time, the documentation discourages them from doing so because this is not a scalable way to represent a domain (by analogy, a command-line interface is not a scalable way to develop a program). Instead, they are encouraged to create files mixing formal and informal statements, ask WebKB-2 to parse these files and, finally, when the modelling is complete and if they wish to, integrate them into the shared KB. In order to be generic, we have created six files: Fields of study, Systems of logic, Information Sciences, Knowledge Management, Conceptual Graph and Formal Concept Analysis. The last three files specialize the others. Each of the last four files is divided into sections, the uppermost ones being "Domains and Theories", "Tasks and Methodologies", "Structures and Languages", "Tools", "Journals, Conferences, Publishers and Mailing Lists", "Articles, Books and other Documents" and "People: Researchers, Specialists, Teams/Projects, ...". This is a work in progress: the content and number of files will increase but the sections seem stable. We now give examples of their content.
Names used for domains ("fields of study") are very often also names for tasks. Task categories are more convenient than domain categories for representing knowledge because (i) organizing them is easier and less arbitrary, and (ii) many relations (e.g. case relations) can then be used. Since a choice must be made for simplicity and normalization purposes, we have represented tasks instead of domains whenever suitable. When names are shared by domain categories and task categories (in WebKB-2, categories can share names but not identifiers), we advise using the task categories in indexations or representations.
When studying how to represent and relate document subjects/topics
(e.g. technical domains),
Welty & Jenkins (1999)
concluded that representing them as types was not semantically correct but
that mereo-topological relations between individuals were appropriate.
Our own analysis confirmed this and we opted for (i) an interpretation of
theories and fields of study as large "propositions" composed of many
sub-propositions (this seems the simplest, most precise and most
flexible way to represent these notions), and
(ii) a particular part relation that we named ">part" (instead of
"subdomain") for several reasons: to be generic, to signal that it
can be used in WebKB-2 as if it were a specialization relation (e.g. the
destination category need not be already declared), and to make clear that
our replacement of WordNet hyponym relations between synsets about fields of
study by ">part" relations refines WordNet without contradicting it.
Our file on
"Fields of study" details these choices.
Our file on "Systems of logic" illustrates how, for some categories,
the represented field of study is a theory (the category does not merely
refer to it), thus simplifying and normalizing the categorization.
Below is an example (in the FT notation) of relations from the WordNet
category #computer_science, followed by an example about logical
domains/theories.
When introducing general categories in Information Sciences and Knowledge
Management, we used the generic users "is" and "km". In WebKB-2, a
generic user is a special kind of user that has no password: anyone can create
or connect categories in its name but then cannot remove them.
#computer_science__computational_science
    (^branch of engineering science that studies computable processes and structures^)
  >part: #artificial_intelligence,          //according to WordNet, AI is ">part:" of CS
  >part: is#software_engineering_science (is),  //"(is)": relation created by "is"
  >part: is#database_management_science (is),
  >part of: #engineering_science__engineering__applied_science__technology,
  part: #information_theory,    //relation coming from WordNet: "(wn)" is implicit
  part of: #information_science;  //WordNet has some part relations between domains

km#substructural_logic
    (^system of propositional calculus that is weaker than the conventional one^)
  >part of: km#non-classical_logic__intuitionist_logic,
  >part: km#relevance_logic km#linear_logic,
  url: http://en.wikipedia.org/wiki/Intuitionistic_logic;

km#CG_domain__Conceptual_Graphs__Conceptual_Structures
  >part of: km#knowledge_management_science,
  object: km#CG_task km#CG_structure km#CG_tool km#CG_mailing_list,
  url: http://www.cs.uah.edu/~delugach/CG/ http://www.jfsowa.com/cg/;
For guiding the sharing, indexation or representation of techniques in
Knowledge management, hundreds of domains, theories or tasks
need to be represented in a shared ontology which anyone can easily complement.
We have begun this work.
On the other hand, as noted earlier, the ontology of the KA2 project was small
and additions had to be suggested by e-mail. Most of this ontology is shown
below in FT (a lossless translation from the source in Frame_logic):
the whole subtype hierarchy is shown (types in italics are concept types
with no subtype); the relations that can be associated to an instance of
the type organization are shown, but such relations have been omitted for
the other general concept types.
Furthermore, most of this "ontology" is composed of about 36 "domains"
organized by subtype relations.
The names of these domains also represent tasks, structures, methods and
experiments (e.g. "reuse_in_KA > ontologies PSMs;
PSMs > Sysiphus-III_experiment").
Not representing them as objects prevents their use in knowledge representations.
Finally, this domain decomposition is far from being a decision tree and
what some domains refer to is quite ambiguous.
The comments in this example are from us.
root > organization project event person publication product research_topic;
organization > enterprise university department institute research_group,
  name: string, location: string, employs: person, publishes: publication,
  technical_report: technical_report, carries_out: project, develops: product,
  finances: project;
project > research_project development_project;
development_project > software_project;
event > conference workshop activity special_issue meeting;
person > student employee;
student > PhD_student;
employee > academic_staff administrative_staff;  //should be named "..._staff_member"
academic_staff > lecturer researcher;
researcher > PhD_student;
administrative_staff > secretary technical_staff;
publication > book article journal online_publication;
article > technical_report journal_article article_in_book conference_paper
  workshop_paper;
journal > special_issue;
research_topic  //this "specialization" hierarchy is far from being a decision tree
  > KA_through_machine_learning reuse_in_KA KA_methodologies
    specification_languages validation_and_verification
    KA_evaluation  //difference between the two??
    knowledge_management knowledge_elicitation;
KA_through_machine_learning  //machine learning techniques
  > abduction case_base_reasoning__CBR
    cooperative_KA  //what does that refer to?
    knowledge_based_refinement knowledge_discovery_in_datasets data_mining
    learning_apprentice_systems reinforcement_learning;
reuse_in_KA > ontologies PSMs;
ontologies > theoretical_foundations software_applications methodologies;
PSMs > PSM_evaluation PSM_libraries PSM_notations automated_PSM_generation
  Sysiphus-III_experiment Web_mediated_PSM_selection software_reuse;
specification_languages > specification_methodology
  specification_of_control_knowledge support_tools_for_formal_methods
  automated_code_generation_from_specification executable_specification_languages;
validation_and_verification > anomaly_detection
  anomaly_repair_and_knowledge_revision formalisms methodology
  validation_and_verification_of_MAS;
In most model libraries in Knowledge Acquisition, each non-primitive task is linked to the techniques that can be used for achieving it and, conversely, each technique combines the results of more primitive tasks. We tried this organization but, at the level of generality of our current modelling, it turned out to be inadequate: it led either (i) to arbitrary choices between representing something as a task (a kind of process) or as a technique (a kind of process description), or (ii) to representing both notions and thus introducing categories with names such as "KA_by_classification_from_people"; both cases are problematic for readability and normalization. Similarly, instead of representing methodologies directly, i.e. as another kind of process description, it seems better to represent the tasks advocated by a methodology (including their supertask: following that methodology). Furthermore, with tasks, many relations can be used directly: similar relations do not have to be introduced for techniques or methodologies (the relation hierarchy should be kept small, if only for normalization purposes). Hence, we represented all these things as tasks and used multiple inheritance. This considerably simplified the ontology and the source files. Here are some extracts.
km#KM_task__knowledge_management__KM (^a K.M. (sub)task^)
  < is#information_sciences_task,
  > km#knowledge_representation km#knowledge_extraction_and_modelling
    km#knowledge_comparison km#knowledge_retrieval_task km#knowledge_creation
    km#classification km#KB_sharing_management
    km#mapping/merging/federation_of_KBs km#knowledge_translation
    km#knowledge_validation
    {km#monotonic_reasoning km#non_monotonic_reasoning}
    {km#consistent_inferencing km#inconsistent_inferencing}
    {km#complete_inferencing km#incomplete_inferencing}
    {km#structure-only_based_inferencing km#rule_based_inferencing}
    km#teaching_a_KM_related_subject km#language/structure_specific_task
    km#KM_methodology_task,
  object of: km#knowledge_management_science,
  object: km#KM_structure;  //between types, the default cardinality is 0..N
//The general relation "object" has different (more specialized) meanings depending on
//  the connected categories: in the last relation, the meaning is "task object"
//  (object worked on or generated by the task) not "domain object".

km#knowledge_retrieval_task
  < is#IR_task,
  > {km#specialization_retrieval km#generalization_retrieval}
    km#analogy_retrieval km#structure_only_based_retrieval
    {km#complete_knowledge_retrieval km#incomplete_knowledge_retrieval}
    {km#consistent_knowledge_retrieval km#inconsistent_knowledge_retrieval};

km#CG_task
  < km#language/structure_specific_task,
  > km#CG_extraction_by_NLP km#CG-based_KR km#CG_matching
    km#mapping/merging/federation_of_CG-based_KBs
    km#conversion_between_CG_and_other_models_or_notations km#teaching_CGs;

km#conversion_between_CG_and_other_models_or_notations
  > km#conversion_between_RDF_and_CG fca#FCA-based_storage_of_CGs;

km#teaching_CGs object: km#CGs;
In our top-level ontology (Martin, 2003b),
pm#description_medium
(supertype for languages, data structures, ontologies, ...) and
pm#description_content
(supertype for fields of study, theories,
document contents, software, ...) have for supertype
pm#description__information
because
(i) such a general type, grouping both notions, is needed for
the signatures of many basic relations and is actually used in
these signatures much more often than its subtypes, and
(ii) classifying WordNet categories
according to the two notions would have often led to arbitrary choices.
Although we attributed a section to each notion, we represented
the default ontology of WebKB-2 as a part of WebKB-2 (see below) and hence
allowed part relations between any kind of information.
To ease knowledge entering and certain exploitations of it, we allow the
use of generic relations such as "part", "object" and "support" when,
given the types of the connected objects, the relevant relations (e.g.
pm#subtask or pm#physical_part) can automatically be found.
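A minimal illustration of this idea might look as follows. The lookup table is invented for the example and is not WebKB-2's actual mechanism; only "pm#subtask" and "pm#physical_part" come from the text, and "pm#subdescription" is a name we made up.

```python
# (source type, destination type) -> specialized relation identifier.
# Only "pm#subtask" and "pm#physical_part" are taken from the text;
# "pm#subdescription" is a made-up example entry.
PART_SPECIALIZATIONS = {
    ("task", "task"): "pm#subtask",
    ("physical_entity", "physical_entity"): "pm#physical_part",
    ("description_content", "description_content"): "pm#subdescription",
}

def resolve_part_relation(source_type, destination_type):
    """Replace the generic "part" relation by the relevant specialized
    relation, given the types of the two connected objects."""
    relation = PART_SPECIALIZATIONS.get((source_type, destination_type))
    if relation is None:
        raise ValueError("no unambiguous specialization of 'part' for "
                         "(" + source_type + ", " + destination_type + ")")
    return relation
```

For instance, resolve_part_relation("task", "task") yields "pm#subtask". A real implementation would of course consult the type hierarchy rather than require exact type matches.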
For similar reasons, to represent "sub-versions" of ontologies, software
or, more generally, documents, we use types connected by subtype relations.
Thus, for example, km#WebKB-2
is a type and can be used with
quantifiers.
km#KM_structure
  < is#symbolic_structure,
  > {km#base_of_facts/beliefs km#ontology km#KB_category km#statement}
    km#KB km#KA_model km#KR_language km#language_specific_structure;

km#KB__knowledge_base part: km#ontology km#base_of_facts/beliefs;

km#ontology__set_of_category_definitions/constraints
  > km#lexical_ontology km#language_ontology km#domain_ontology
    km#top_level_ontology km#concept_ontology km#relation_ontology
    km#multi_source_ontology,
  part: 1..* km#KB_category 1..* km#category_definition;

km#top_level_ontology > km#DOLCE_light km#SUMO km#top_level_of_ontology_of_John_Sowa;

km#multi_source_ontology
    (^ontology where the creator of each category and statement is recorded
      and represented via a category^)
  > km#default_MSO_of_WebKB-2;

km#default_MSO_of_WebKB-2 (^an ontology provided as default by a version of WebKB-2^)
  part of: km#WebKB-2,
  part: km#DOLCE_light km#top_level_of_ontology_of_John_Sowa;
    //km#DOLCE km#SUMO /*an adaptation of*/km#WordNet;

km#KR_language__KRL__KR_model_or_notation
  > {km#KR_model/structure km#KR_notation}  //not km#semantics: not a structure
    km#predicate_logic_oriented_language km#frame_oriented_language
    km#graph_oriented_language km#KR_language_with_query_commands
    km#KR_language_with_scripting_capabilities,
  attribute: km#semantics;

km#CG_structure
  < km#language_specific_structure,
  > km#CG_statement km#CG_language km#CG_ontology;
We first illustrate some specialization relations between tools, then use the FCG notation to give some details on WebKB-2 and Ontolingua. (The FT notation does not yet permit entering such details. As in FT, relation names may be used in FCG instead of relation identifiers when there is no ambiguity.)
km#CG_related_tool
  < km#language/structure_specific_tool,
  > km#CG-based_KBMS km#CG_graphical_editor km#NL_parser_with_CG_output;

km#CG-based_KBMS
  < km#KBMS,
  > {km#CGWorld km#PROLOG\+CG km#CoGITaNT km#Notio km#WebKB};

km#WebKB > {km#WebKB-1 km#WebKB-2}, url: http://www.webkb.org;

km#input_language (*x,*y) =
  [*x, may be support of: (a km#parsing, input: (a statement, formalism: *y))];

[any pm#WebKB-2,
  part: (a is#user_interface,
         part: {a is#API, a is#HTML_based_interface,
                a is#CGI-accessible_command_interface,
                no is#graph_visualization_interface}),
  part: {a is#FastDB, a km#default_MSO_of_WebKB-2},
  input_language: a km#FCG,
  output_language: {a km#FCG, a km#RDF},
  support of: a is#regular_expression_based_search,
  support of: a km#specialization_structural_retrieval,
  support of: a km#generalization_structural_retrieval,
  support of: (a km#specialization_structural_retrieval,
               kind: {km#complete_inferencing, km#consistent_inferencing},
               input: (a km#query, expressivity: km#PCEF_logic),
               object: (several km#statement, expressivity: km#PCEF_logic))
];  //"PCEF": positive conjunctive existential formula

[any km#Ontolingua,
  part: {a is#HTML_based_interface, no is#graph_visualization_interface},
  input_language: a km#KIF,
  output_language: a km#KIF,
  part: {a km#ontolingua_library, no DBMS},
  support of: a is#lexical_search];
To permit the comparison of tools, many more details should be entered and
similar structures or relations should be used by the various contributors,
for example when expressing what the input languages of a tool can be.
To that end, we re-used basic relations as much as possible (we did not
introduce relations with names such as "re-used_DBMS" or
"default_ontology"). The above examples show that for many features
a simple normalized form can be found.
However, for many other features this is more difficult.
For example, we have not yet found a satisfactory way to represent
(i) that WebKB-2 provides a special support (two attributes plus three
classes, special notations, lots of code) for storing, searching and
exploiting relations between categories and their creators or various names, and
(ii) that Ontolingua supports those relations but the user has to
define them in KIF and then their exploitation in Lisp.
Representing this in detail is time consuming, and representations from
different persons are unlikely to be matchable; they would also be very
difficult to use for comparing the tools via a generated table
(as illustrated in Section 4). Less detailed descriptions using the
same relations should be provided instead or in addition. For our
example, a short representation could be
[any WebKB-2, special_support: a support_for_link_from_category_to_names],
even though this would lead to introducing many categories for such
"supports" in the ontology: from other viewpoints, it would have been
preferable to re-use existing relations such as km#category_name.
Here are just a few examples.
km#CG_mailing_list < km#KM_mailing_list, url: majordomo@cs.uah.edu;

km#ICCS__International_Conference_on_Conceptual_Structures
  instance: km#ICCS_2001 km#ICCS_2002 km#ICCS_2003 km#ICCS_2004 km#ICCS_2005;

is#publisher_in_IS
  < #publishing_house,
  instance: is#Springer_Verlag is#AAAI/MIT_Press is#Cambridge_University_Press,
  object of: #information_science;
The following example shows a simple document indexation using Dublin Core relations (we have done this for all the articles of ICCS 2002). Representing the ideas from an article would be more valuable.
[an #article,
  dc#Coverage: km#knowledge_representation,
  pm#title: "What Is a Knowledge Representation?",
  dc#Creator: "Randall Davis, Howard E. Shrobe and Peter Szolovits",
  pm#object of: (a #publishing, pm#time: 1993,
                 pm#place: (the #object_section "14:1 p17-33",
                            pm#part of: is#AI_Magazine)),
  pm#url: http://medg.lcs.mit.edu/ftp/psz/k-rep.html];
We have not yet dealt with this section in our files. However, this example is a reminder that every introduced domain category is subsumed by a category from WordNet.
is#researcher_in_IS < #researcher; is#team_in_IS < #team;
Fact Guru (a frame-based system) permits the comparison of two
objects by generating a table with the object identifiers as column
headers, the identifiers of all their attributes as row headers,
and, in each cell, either a mark signaling that the attribute does
not exist for this object or a description of the destination object.
The common generalizations of the two objects
are also given. However, this strategy is insufficient for comparing tools.
Even for people, creating detailed tool-comparison tables is often a
presentation challenge that draws on their knowledge of
which features are difficult or important and which are not.
A solution could be to propose predefined tables for easing
the entering of tool features and then compare them.
However, this is restrictive. Instead, or in addition, we think
that a mechanism to generate good comparison tables is necessary
and can be found.
The following query and generated table illustrate an approach that
we propose. The idea is that a specialization hierarchy of features is
generated such that (i) it is rooted at the uppermost relations and
destination types specified in the query, and (ii) only the descriptions
used in at least one of the tools, plus the common generalizations of
these descriptions, are shown.
To that end, some FCG-like descriptions of types can be generated.
In the cells, '+' means "yes" (the tool has the feature), '-' means "no",
and '.' means that the information has not been represented.
We invite the reader to compare the content of this table with the
representations given above; its meaning, and the possibility of
generating it automatically, should then hopefully be clear.
A maximum depth of automatic exploration may be given; past this depth, the
manual exploration of certain branches (like the opening or closing of
sub-folders) should permit the user to give the comparison table
a presentation better suited to his/her interest.
Any number of tools could be compared, not just two.
> compare pm#WebKB-2 km#Ontolingua
    on (support of: a is#IR_task, output_language: a KR_notation,
        part: a is#user_interface), maxdepth 5

                                                      WebKB-2   Ontolingua
support of: is#IR_task                                   +          +
  is#lexical_search                                      +          +
  is#regular_expression_based_search                     +          .
  km#knowledge_retrieval_task                            +          .
    km#specialization_structural_retrieval               +          .
      (kind: {km#complete_inferencing, km#consistent_inferencing},
       input: (a km#query, expressivity: km#PCEF_logic),
       object: (several statement, expressivity: km#PCEF_logic))
                                                         +          .
    km#generalization_structural_retrieval               +          .
output_language: km#KR_notation                          +          +
  (expressivity: km#FOL)                                 +          +
    km#FCG                                               +          .
    km#KIF                                               .          +
  km#XML-based_notation                                  +          .
    km#RDF                                               +          -
part: is#user_interface                                  +          +
  is#HTML_based_interface                                +          +
  is#CGI-accessible_command_interface                    +          .
  is#OKBC_interface                                      .          .
  is#API                                                 +          .
  is#graph_visualization_interface                       -          -
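The selection-and-marking principle behind such a table can be sketched as follows. This is a toy Python rendition with invented data structures, not the actual WebKB-2 mechanism: only features represented for at least one tool are kept, and each cell is marked '+', '-' or '.'.

```python
# Toy specialization hierarchy of features (invented for the example).
FEATURE_CHILDREN = {
    "is#user_interface": ["is#HTML_based_interface",
                          "is#graph_visualization_interface"],
}

def comparison_rows(root_feature, tools, max_depth=5, depth=0):
    """tools: {tool_name: {feature: True/False}}; a missing feature means
    "not represented" ('.').  Returns (indented feature, cell marks) rows,
    keeping only the features represented for at least one tool."""
    def mark(tool, feature):
        value = tools[tool].get(feature)
        return "." if value is None else ("+" if value else "-")
    rows = []
    marks = [mark(tool, root_feature) for tool in tools]
    if any(m != "." for m in marks):  # keep only features asserted somewhere
        rows.append(("  " * depth + root_feature, marks))
        if depth < max_depth:  # explore sub-features up to the given depth
            for child in FEATURE_CHILDREN.get(root_feature, []):
                rows.extend(comparison_rows(child, tools, max_depth, depth + 1))
    return rows
```

With two tools where only one represents the absence of a graph visualization interface, the last row would be marked '-' for that tool and '.' for the other, as in the table above.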
In the general case, the above approach, where the descriptions
are put in the rows and organized in a hierarchy, is likely to be
more readable, more scalable and easier to specify via a command
than approaches where the descriptions are put in the cells, e.g. as
in Fact Guru. However, the latter may be envisaged as a complement
for simple cases, e.g. to display {FCG, KIF} instead of '+' for the
output_language relation.
In addition to generalization relations, "part" relations could also
be used, at least the ">part" relation. For example, if
Cogitant were a third entry in the above table, since it has a
complete and consistent structure-based and rule-based mechanism
to retrieve the specializations of a simple CG in a base of simple CGs
and rules using simple CGs, we would expect the entry ending with
km#PCEF_logic to be specialized by an entry ending with
km#PCEF_with_rules_logic.
Knowledge repositories, as we have presented them, have many of the advantages of the "Knowledge Web" and "Digital Aristotle" but seem much more achievable. To that end, we have described some techniques and ontological elements, and we are: (i) implementing a knowledge-oriented wiki to complement our current interfaces, (ii) experimenting with how best to support and guide semi-formal discussions and, more generally, organize technical ideas into a semantic network, (iii) implementing and refining our measures of statement/user usefulness, (iv) completing the above presented ontology to permit at least the representation of the information collected in Michael Denny's "Ontology editor survey" (we tend to think that our current ontology on knowledge management will only need to be specialized, even though we have not yet explored the categorization of the basic features of multi-user support such as concurrency control, transactions, CVS, file permissions, file importation, etc.), (v) permitting the comparison of tools as indicated above, and (vi) providing forms or tables to help tool creators represent the features of their tools.
Once implemented, the presented techniques, especially those supporting semi-formal discussions, will be applicable to many domains, including PORT. The ontology on knowledge management, in addition to its above-cited applications, might guide work on the automatic extraction and organization of technical information from documents in the Information Sciences.
The first author thanks the members of the LOA and Christopher Welty for the interesting discussions related to some ideas mentioned in this article.
V. R. Benjamins, D. Fensel, A. Gomez-Perez, S. Decker, M. Erdmann, E. Motta and M. Musen (1998).
Knowledge Annotation Initiative of the Knowledge Acquisition Community: (KA)2.
Proc. of the 11th Banff Knowledge Acquisition for Knowledge Based System Workshop (KAW98),
Banff, Canada, April 18-23, 1998.
The ontology is at http://ontobroker.semanticweb.org/ontologies/ka2-onto-2000-11-07.flo.
W.D. Hillis (2004). "Aristotle" (The Knowledge Web). Edge Foundation, Inc., No 138, May 6, 2004.
Ph. Martin (2003a). Knowledge Representation, Sharing and Retrieval on the Web. Chapter of a book titled "Web Intelligence", (Eds.: N. Zhong, J. Liu, Y. Yao), Springer-Verlag, Jan. 2003.
Ph. Martin (2003b). Correction and Extension of WordNet 1.7. Proc. of ICCS 2003 (Dresden, Germany, July 2003), Springer Verlag, LNAI 2746, 160-173.
W. Schuler and J.B. Smith (1990). Author's Argumentation Assistant (AAA): A Hypertext-Based Authoring Tool for Argumentative Texts. Proc. of ECHT'90 (INRIA, France, Nov. 1990), Cambridge University Press, 137-151.
D. Skuce and T.C. Lethbridge (1995).
CODE4: A Unified System for Managing Conceptual Knowledge.
International Journal of Human-Computer Studies, 42, 413-451.
See also the successor / commercial version: Fact Guru.
D.A. Smith (1998). Computerizing computer science. Communications of the ACM, 41(9), 21-23.
C.A. Welty and J. Jenkins (1999). Formal Ontology for Subject. Data & Knowledge Engineering (Sept. 1999), 31(2), 155-182.