Scientific knowledge should be semantically encoded: a graph of nanopublications.

We have failed to store and organize much of the rapidly accumulating scientific information in rigorous, principled ways, so finding what we want and understanding what is already known have become exhausting, frustrating, stressful and increasingly costly experiences. For information to be usable, it must be stored and organized in ways that allow us to access it, analyze it, annotate it and relate it to other information. Only then can we begin to understand what it means. Only by acquiring meaning do we acquire knowledge.

Nanopublications are a Linked Data format for scholarly data publishing that has received considerable uptake in the last few years. We briefly explain here an approach to use and link nanopublications as a unifying framework to represent knowledge in a semantic way.

A nanopublication is a single publishable and citable entity that combines an assertion, the provenance of the assertion, and the provenance of the nanopublication. This information can be about anything, for example a relation between a gene and a disease. Nanopublications are fully expressed in a formal and machine-interpretable way, making scientific communication more effective and user-friendly. Furthermore, because nanopublications can be attributed and cited, they provide incentives for researchers to make their data available in standard formats that drive data accessibility and interoperability.
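The three-part structure described above can be sketched as a small data type. This is only an illustration, not the official nanopublication schema: the class name, the field names, the short identifiers and the example.org address are all assumptions made for readability.

```python
from dataclasses import dataclass

# A triple is (subject, predicate, object); plain strings stand in for URIs.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Nanopublication:
    uri: str                   # address where the nanopublication is stored
    assertion: Triple          # the claim being made
    provenance: Triple         # where the assertion comes from
    publication_info: Triple   # who created the nanopublication


# Hypothetical example: a relation between a gene and a disease.
np1 = Nanopublication(
    uri="http://example.org/np/1",
    assertion=("gene:X", "associated_with", "disease:Y"),
    provenance=("assertion:1", "extracted_from", "paper:Z"),
    publication_info=("http://example.org/np/1", "created_by", "researcher:R"),
)

print(np1.assertion)
```

Making the dataclass frozen reflects the idea that a published nanopublication is a fixed, citable unit rather than a record to be edited in place.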

We illustrate here the uniqueness of the nanopublication-based scheme by representing three instances of the class Nanopublication, two instances each of the classes Provenance and PublicationInfo, and one instance each of the classes Assertion, Subject, Predicate and Object.

Nanopublications are identified by their URIs, that is, by the address where the file is stored. Many nanopublications can share the same Assertion (e.g., the one identified by the statement "Escherichia coli is resistant to ampicillin"). However, two nanopublications with the same Assertion can share only one further instance: either the Provenance of the assertion, e.g., the one identified by the PubMed ID 16562043 (i.e., a research paper published in 1965 in Journal of Bacteriology), or the PublicationInfo, e.g., the one identified by the ORCID 0000-0002-8374-1941 (i.e., the author of the nanopublication).

This means that more than one author (uniquely identified by his/her ORCID) can report the same assertion from the same provenance (e.g., a research paper), and the same author can report the same assertion from different provenances (e.g., distinct research papers). In fact, this is desirable.
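The sharing rule can be checked mechanically. The sketch below assumes that each nanopublication is reduced to its (assertion, provenance, publication info) identifiers; the way the two PubMed IDs and the two ORCIDs mentioned in this article are paired across the three nanopublications is illustrative, and the function names are hypothetical.

```python
from itertools import combinations

ASSERTION = "Escherichia coli is resistant to ampicillin"

# (assertion, provenance, publication info) for three nanopublications.
# The pairing of PubMed IDs and ORCIDs is illustrative only.
nanopubs = {
    "np1": (ASSERTION, "pmid:16562043", "orcid:0000-0002-8374-1941"),
    "np2": (ASSERTION, "pmid:16562043", "orcid:0000-0003-3974-6515"),
    "np3": (ASSERTION, "pmid:336612", "orcid:0000-0003-3974-6515"),
}


def shared_parts(a, b):
    """Count how many of the three parts two nanopublications have in common."""
    return sum(x == y for x, y in zip(a, b))


def is_consistent(collection):
    """True if no two nanopublications share all three parts."""
    return all(
        shared_parts(a, b) < 3
        for a, b in combinations(collection.values(), 2)
    )


print(is_consistent(nanopubs))
```

Here np1 and np2 share the assertion and the provenance, np2 and np3 share the assertion and the author, and np1 and np3 share only the assertion, so the collection passes the check.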

Let's focus now on the instance of the Assertion class to understand how a collection of nanopublications, such as the one shown in the illustration above, becomes a knowledge graph.

The instance of the class Assertion identified by the statement "Escherichia coli is resistant to ampicillin" is an RDF graph (i.e., a triple consisting of a Subject, Predicate and Object).

The instance of the class Assertion takes its identifier (i.e., statement) by combining the three identifiers (id) of the instances of the classes it is linked to. If two instances of the class Assertion have the same identifiers for their Subject, Predicate and Object instances, the assertions will be linked by a synonymous relationship, regardless of the statement identifiers of the assertions (e.g., two researchers, or the same researcher twice, might extract the same claim from the same publication but write it down in slightly different wording).
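A minimal sketch of how synonymous assertions could be detected: group the statements by their (Subject, Predicate, Object) identifiers and ignore the wording. The short identifiers, the second and third statements, and the function name are placeholders invented for this example.

```python
from collections import defaultdict

# Hypothetical short identifiers standing in for the ontology URIs.
ECOLI_AMP = ("taxon:ecoli", "omp:resistant_to", "ncit:ampicillin")
OTHER = ("taxon:A", "omp:resistant_to", "drug:B")  # made up for contrast

# (statement, (subject id, predicate id, object id))
assertions = [
    ("Escherichia coli is resistant to ampicillin", ECOLI_AMP),
    ("E. coli shows resistance to ampicillin", ECOLI_AMP),
    ("Taxon A is resistant to drug B", OTHER),
]


def group_synonyms(items):
    """Map each (S, P, O) key to all statements that express it."""
    groups = defaultdict(list)
    for statement, key in items:
        groups[key].append(statement)
    return dict(groups)


for key, statements in group_synonyms(assertions).items():
    print(key, "->", statements)
```

The first two statements end up in the same group because their three identifiers match, even though their text differs.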

In RDF, the Subject of this triple is the URI of the taxon "Escherichia coli" (defined in the NCBI Taxon ontology), the Predicate is the URI of the term "resistant to" (defined in the Ontology of Microbial Phenotypes), and the Object is the URI of the antibiotic "ampicillin" (defined in the NCI Thesaurus).

The instance of the class PublicationInfo identified by orcid: 0000-0003-3974-6515 is, in fact, another RDF graph (i.e., a triple consisting of a Subject, Predicate and Object).

The Subject of this triple is the URI of the nanopublication, the Predicate is the ORCID, and the Object is the literal containing the value of the ORCID.

All resources in a triple (here, only the Subject and the Predicate) are represented in RDF by their URIs, because they are meant to be read by machines. That is why the Predicate contains the URI http://purl.obolibrary.org/obo/APOLLO_SV_00000496 instead of the word ORCID. The Semantic Web encourages the use of URIs as names of things so that everyone uses a common vocabulary, pointing to places where the meaning of each term is clearly defined. The term ORCID is defined in the Apollo Structured Vocabulary (Apollo-SV) ontology.

When the Object of a triple is not a resource (which means it cannot be the Subject of any triple), the datatype of the literal is indicated. We use here the XML Schema Definition (http://www.w3.org/2001/XMLSchema) to indicate that the ORCID value is a string.
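To make the typed literal concrete, the PublicationInfo triple can be written as one line of N-Triples. The sketch below builds that line with plain string formatting; the nanopublication URI is a hypothetical example.org address, and the helper functions are invented for this example rather than taken from any RDF library.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
ORCID_TERM = "http://purl.obolibrary.org/obo/APOLLO_SV_00000496"
NANOPUB_URI = "http://example.org/nanopub/1"  # hypothetical address


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def typed_literal(value, datatype):
    """Quote a literal, escape backslashes and quotes, append its datatype."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>'


def n_triple(subject, predicate, obj):
    """Join the three terms into a single N-Triples statement."""
    return f"{subject} {predicate} {obj} ."


line = n_triple(
    uri(NANOPUB_URI),
    uri(ORCID_TERM),
    typed_literal("0000-0003-3974-6515", XSD_STRING),
)
print(line)
```

The Subject and Predicate appear as URIs in angle brackets, while the Object is a quoted value followed by its datatype, which is what marks it as a literal rather than a resource.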

In natural language, this PublicationInfo triple reads: "This nanopublication was created by Carlos J. Melián".

Likewise, the instance of the class Provenance identified by pmid:336612 is, in fact, another RDF graph (i.e., a triple consisting of a Subject, Predicate and Object).

The Subject of this triple is the statement of the assertion, the Predicate is the URI of the PubMed ID term (i.e., pmid is defined in the EDAM Ontology), and the Object is the literal containing the value of the pmid (in this case, a research paper published in 1977 in Journal of Bacteriology).

In natural language, this Provenance triple reads: "The statement Escherichia coli is resistant to ampicillin was extracted from a publication having PubMed ID = 336612".
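Once provenance and publication-info triples sit side by side, the collection can be queried like a small knowledge graph. The following is a rough sketch, assuming the triples are stored as plain tuples with abbreviated predicate names ("pmid", "orcid") instead of full URIs; the "np:1" subject and the match function are hypothetical.

```python
ASSERTION = "Escherichia coli is resistant to ampicillin"

# Triples as (subject, predicate, object); predicates abbreviated for brevity.
graph = [
    (ASSERTION, "pmid", "336612"),
    (ASSERTION, "pmid", "16562043"),
    ("np:1", "orcid", "0000-0003-3974-6515"),  # "np:1" is a placeholder URI
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which publications is the assertion reported from?
pmids = sorted(o for _, _, o in match(graph, s=ASSERTION, p="pmid"))
print(pmids)
```

The same pattern-matching idea is what query languages for RDF provide in a far more complete form; this version only shows why shared identifiers make the individual triples connect.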

In summary, we should make the published scientific literature more machine-readable, and hence more dynamic and informative. Nanopublications are a way to semantically annotate the scientific literature, so that the meaning of the published information can be understood and linked to related documents.


Miguel A. Fortuna
