rdf4r library: build your semantic RDF database from R

Transforming data into information and knowledge that machines can understand requires a semantic approach. Unfortunately, most scientists are unaware of the Semantic Web effort. The Resource Description Framework (RDF) is a standard model for data interchange on the Web. The building block of the RDF model is the triple (subject - predicate - object), which implies an atomic decomposition of your data into individual statements.

An RDF graph is a set of triples or statements where both the subject and the predicate are resources (uniquely identified by URIs) and the object can be either another resource or a literal. There are two key characteristics of RDF stores (aka triple stores): the first and by far the most relevant is that they represent, store and query data as a graph. The second is that they are semantic, which means that they can store not only data but also explicit descriptions of the meaning of that data (i.e., ontologies).
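To make the triple model concrete, here is a minimal sketch in base R (it does not use rdf4r). It holds a tiny graph as a data frame, one row per triple, and writes each row as an N-Triples statement. The example.org URIs and the helper names format_term() and to_ntriples() are made up for illustration.

```r
# A tiny RDF graph held as a plain data frame: one row per triple.
# The example.org URIs are invented for this sketch.
triples <- data.frame(
  subject   = c("http://example.org/phage1", "http://example.org/phage1"),
  predicate = c("http://example.org/infects",
                "http://www.w3.org/2000/01/rdf-schema#label"),
  object    = c("http://example.org/host1", "Phage one"),
  object_is_literal = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Resources are written as <URI>, literals as "quoted text".
# By default every value is treated as a resource.
format_term <- function(value, is_literal = rep(FALSE, length(value))) {
  ifelse(is_literal, paste0('"', value, '"'), paste0("<", value, ">"))
}

# Serialize each row as one N-Triples statement.
to_ntriples <- function(df) {
  paste(
    format_term(df$subject),
    format_term(df$predicate),
    format_term(df$object, df$object_is_literal),
    "."
  )
}

nt <- to_ntriples(triples)
cat(nt, sep = "\n")
```

The first row has a resource as its object and the second a literal, which are the two kinds of object described above.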

Making the semantics of your data explicit in an ontology enables data and knowledge exchange and interoperability. In other scenarios, you may want to use your ontology to run generic inferencing on your data to derive new facts from existing ones. A similar use of explicit semantics is to run domain-specific consistency checks on the data.

In previous posts we described a semantic model for scientific publishing (nanopublications) and a property graph database (Neo4j). This time we introduce our work to extend the rdf4r library and accelerate the process of building, storing, and uploading thousands of RDF triples directly from the R programming language. This has been a great advance in developing and setting up our semantic graph databases in remote graphDB repositories.

rdf4r (extended) library for R.

We have extended the functionality of the rdf4r library, originally developed by Viktor Senderov for working with Resource Description Framework (RDF) data in the R programming environment, to improve performance when uploading large numbers of RDF triples into a graphDB triplestore hosted on a remote server. Please see the original readme file for more details about the rdf4r library.

New functions added:

  • add_triples_extended():
  • serialize_to_file():
  • add_trig_file_to_graphdb():
  • ntriples():

add_triples_extended():

The add_triples() function in the original rdf4r package takes a huge amount of time to add triples to its internal dynamic vector. Running time increases exponentially when thousands of triples are added, and resizing the dynamic vector does not help. Our new function add_triples_extended() instead builds a list of dynamic vectors, appending a new vector to the list every time the function is called on the same RDF object. It also skips some data processing that is not required when the data are well curated in the source dataframe (i.e., before adding triples to the RDF object). As a result, add_triples_extended() accelerates the process of adding thousands of triples.
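The difference between the two storage patterns can be illustrated with a short base R sketch. It does not reproduce the internals of rdf4r: the functions grow_vector() and collect_chunks() are hypothetical and only contrast growing one vector element by element with keeping each batch as a separate list element and combining them once.

```r
# Pattern 1: grow a single vector one element at a time.
# Every c() call builds a new, longer vector, so the work per call
# grows with the number of elements already stored.
grow_vector <- function(values) {
  out <- character(0)
  for (v in values) {
    out <- c(out, v)
  }
  out
}

# Pattern 2: keep each batch as its own list element and
# combine everything only once, at the end.
collect_chunks <- function(chunks) {
  store <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    store[[i]] <- chunks[[i]]  # the batch is stored as-is
  }
  unlist(store, use.names = FALSE)
}

# 20,000 placeholder statements, split into batches of 5,000.
values <- paste0("triple_", seq_len(20000))
chunks <- split(values, ceiling(seq_along(values) / 5000))

t_grow  <- system.time(a <- grow_vector(values))[["elapsed"]]
t_chunk <- system.time(b <- collect_chunks(chunks))[["elapsed"]]

# Both patterns hold the same data; compare the two timings on your machine.
identical(a, b)
cat("element by element:", t_grow, "s; list of chunks:", t_chunk, "s\n")
```

Timings depend on the machine and the R version, so no figures are quoted here.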

The function takes a dataframe as input and iterates through its rows to build the triples. It then adds the triples to the RDF object in a single instruction. A progress bar is shown by default while the triples are being added; it can be hidden by setting the parameter progres_bar = FALSE.

The predicate is defined once, as an identifier, before calling add_triples_extended(), and is shared by all the triples that will be added to the RDF object. To build the subject and the object of each triple from a row of the dataframe, add_triples_extended() applies different rules depending on the values of the following parameters:

  • WHEN subject_column_label = "" AND subject_column_name != "", THEN subject = identifier(id = subject_column_name, prefix = subject_rdf_prefix)
  • WHEN subject_column_label != "" AND subject_column_name != "", THEN subject = identifier(id = paste0(subject_column_label, subject_column_name), prefix = subject_rdf_prefix)
  • WHEN subject_column_label != "" AND subject_column_name = "", THEN subject = literal(text_value = subject_column_label)
  • WHEN subject_blank = TRUE THEN subject = identifier(id = paste0(subject_column_label, subject_column_name), prefix = NA, blank = TRUE)

The objects of the triples are built in the same way, replacing subject by object in the above rules.
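The rules above can be sketched as a small base R function. This is not the package code: build_term() is a hypothetical helper that returns a Turtle-like string instead of an rdf4r identifier or literal, it assumes the id is the value found in the named column of the current row, and it assumes the blank-node rule is checked first (the order of precedence is not stated above).

```r
# Hypothetical helper: builds the subject (or object) of one triple
# from a single row, following the four rules listed above.
build_term <- function(row, column_label = "", column_name = "",
                       rdf_prefix = "", blank = FALSE) {
  # value of the named column in this row (empty if no column is given)
  value <- if (column_name != "") as.character(row[[column_name]]) else ""

  if (blank) {
    # rule 4: blank node, no prefix (assumed to take precedence)
    return(paste0("_:", column_label, value))
  }
  if (column_name != "") {
    # rules 1 and 2: identifier built from the optional label plus the value
    return(paste0(rdf_prefix, ":", column_label, value))
  }
  if (column_label != "") {
    # rule 3: the label alone becomes a literal
    return(paste0('"', column_label, '"'))
  }
  stop("either column_label or column_name must be given")
}

# One made-up row of a phage-host table.
row <- list(phage_taxid = "10665", host_taxid = "562")

rule1 <- build_term(row, column_name = "phage_taxid", rdf_prefix = "phageon")
rule2 <- build_term(row, column_label = "NCBITaxon_",
                    column_name = "phage_taxid", rdf_prefix = "phageon")
rule3 <- build_term(row, column_label = "some text")
rule4 <- build_term(row, column_label = "b", column_name = "host_taxid",
                    blank = TRUE)

print(c(rule1, rule2, rule3, rule4))
```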

example:
# load the packages used in this example:
library(rdf4r)
library(readr)

# define an RDF object:
rdf_infects <- ResourceDescriptionFramework$new()
    
# define the prefixes:
prefixes <- c(
   rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
   rdfs = "http://www.w3.org/2000/01/rdf-schema#",
   owl = "http://www.w3.org/2002/07/owl#",
   phageon = "http://owl.fortunalab.org/phageon#",
   obo = "http://purl.obolibrary.org/obo/"
)

# phageon prefix (selected by name rather than by position):
phageon <- prefixes["phageon"]

# read data from a file and store them as a dataframe:
phage_host <- read_csv("phage_host_db.csv", col_types = cols(phage_taxid = col_character(), host_taxid = col_character()))
  
# define the predicate:
infects <- identifier(id = "PHAGEON_0000001", prefix = phageon)
  
# define the subject and object while building and adding the triples from the data frame:
rdf_infects$add_triples_extended(
   data = phage_host,
   subject_column_label = "NCBITaxon_", subject_column_name = "phage_taxid", subject_rdf_prefix = phageon,
   predicate = infects,
   object_column_label = "NCBITaxon_", object_column_name = "host_taxid", object_rdf_prefix = phageon
)

serialize_to_file():

The serialize() function from the original rdf4r package takes a huge amount of time to serialize thousands of triples. Our new function serialize_to_file() iterates through the list of dynamic vectors built previously, extracts the triples, and stores them in a trig output file, skipping some data processing that is not required. As a result, serialize_to_file() accelerates the process of saving the triples into trig files. A progress bar is shown by default during the process; it can be hidden by setting the parameter progres_bar = FALSE.

example:
# set the context (for named graphs):
rdf_infects$set_context(identifier(id = "phageon", prefix = phageon))

# serialize RDF object to file:
rdf_infects$serialize_to_file("rdf_infects.trig")
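For readers unfamiliar with the trig format, the following base R sketch shows the general idea of walking a list of chunks and writing them into a named graph. It is an illustration only, not the implementation of serialize_to_file(): the function write_trig_sketch(), the layout of the chunks, and the NCBITaxon numbers are assumptions.

```r
# Hypothetical input: a list of chunks, each a character vector of
# already formatted "subject predicate object ." statements.
chunks <- list(
  c("phageon:NCBITaxon_1 phageon:PHAGEON_0000001 phageon:NCBITaxon_2 ."),
  c("phageon:NCBITaxon_3 phageon:PHAGEON_0000001 phageon:NCBITaxon_4 .",
    "phageon:NCBITaxon_5 phageon:PHAGEON_0000001 phageon:NCBITaxon_6 .")
)

write_trig_sketch <- function(chunks, prefixes, graph, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  # one @prefix declaration per named prefix
  writeLines(paste0("@prefix ", names(prefixes), ": <", prefixes, "> ."), con)
  # open the named graph, write every chunk, then close the graph
  writeLines(paste0(graph, " {"), con)
  for (chunk in chunks) {
    writeLines(paste0("  ", chunk), con)
  }
  writeLines("}", con)
  invisible(path)
}

out_file <- tempfile(fileext = ".trig")
write_trig_sketch(
  chunks,
  prefixes = c(phageon = "http://owl.fortunalab.org/phageon#"),
  graph = "phageon:phageon",
  path = out_file
)

trig_lines <- readLines(out_file)
cat(trig_lines, sep = "\n")
```

Each chunk is written with a single writeLines() call, so the statements are never merged into one large vector first.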

add_trig_file_to_graphdb():

Our new function add_trig_file_to_graphdb() reads the triples stored in a trig file and inserts each triple into an element of a list. This list is then imported into graphdb using the rdf4r::add_factory() function from the original package.

example:
# set the graphdb access options to the repository:
graphdb = rdf4r::basic_triplestore_access(
   server_url = "http://your_graphdb_url/",
   user = "your_username",
   password = "your_password",
   repository = "graphdb_repository_name"
)
# import the trig file:
rdf4r::add_trig_file_to_graphdb(graphdb_access_options = graphdb, prefixes = prefixes, trig_file = "rdf_infects.trig")
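The first step described above, turning a trig file into a list with one element per triple, can be sketched in base R as follows. This is a simplified illustration and not the package code: read_trig_sketch() is a hypothetical function, and it assumes one statement per line with no spaces inside literals, so it is not a general trig parser.

```r
# Write a small, made-up trig file to work with.
trig_text <- c(
  "@prefix phageon: <http://owl.fortunalab.org/phageon#> .",
  "phageon:phageon {",
  "  phageon:NCBITaxon_1 phageon:PHAGEON_0000001 phageon:NCBITaxon_2 .",
  "  phageon:NCBITaxon_3 phageon:PHAGEON_0000001 phageon:NCBITaxon_4 .",
  "}"
)
trig_file <- tempfile(fileext = ".trig")
writeLines(trig_text, trig_file)

read_trig_sketch <- function(path) {
  lines <- trimws(readLines(path))
  # keep statement lines only: they end with a dot and are not prefix declarations
  is_statement <- grepl("\\.$", lines) & !grepl("^@prefix", lines)
  statements <- sub("[[:space:]]*\\.$", "", lines[is_statement])
  # split each statement into its three terms: one list element per triple
  lapply(strsplit(statements, "[[:space:]]+"), function(parts) {
    list(subject = parts[1], predicate = parts[2], object = parts[3])
  })
}

triple_list <- read_trig_sketch(trig_file)
length(triple_list)
str(triple_list[[1]])
```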

ntriples():

Our new function ntriples() returns the number of triples added to the RDF object using the add_triples_extended() function.

example:
# build the sparql query to retrieve the triples:
query = "
   PREFIX phageon: <http://owl.fortunalab.org/phageon#>
   SELECT ?s ?p ?o
   WHERE {
      GRAPH phageon:phageon {
         ?s phageon:PHAGEON_0000001 ?o .
         ?s ?p ?o
      }
   }
"

# submit the query to the graphdb repository:
n_inserted_triples <- rdf4r::submit_sparql(query = query, access_options = graphdb)

# check and show results:
cat(paste0("Number of triples inserted: ", rdf_infects$ntriples(), "\n"))
cat(paste0("Number of triples retrieved from the triplestore: ", nrow(n_inserted_triples), "\n"))
head(n_inserted_triples)

New data types added:

Some missing XSD data types have been added:

  • xsd_double
  • xsd_boolean
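
As a reminder of what these two data types look like once serialized, here is a base R sketch that formats typed literals by hand. It does not call rdf4r: typed_literal() is a hypothetical helper, and the xsd: prefix is assumed to be declared as http://www.w3.org/2001/XMLSchema# in the output file.

```r
# Hypothetical helper: formats a value as a typed literal,
# e.g. "3.14"^^xsd:double.
typed_literal <- function(value, xsd_type) {
  paste0('"', value, '"^^xsd:', xsd_type)
}

# R prints logical values as TRUE/FALSE, while xsd:boolean
# expects lowercase true/false, hence tolower().
double_literal  <- typed_literal(as.character(3.14), "double")
boolean_literal <- typed_literal(tolower(as.character(TRUE)), "boolean")

cat(double_literal, boolean_literal, sep = "\n")
```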

Installation.

devtools::install_git("https://gitlab.com/fortunalab/rdf4r.git")

Source code.

The extended rdf4r library was developed by Raúl Ortega.
