Jena tip: optimising database load times
Loading lots of data into a persistent Jena model can often take quite a bit of time. There are, however, some tips for speeding things up.
Let’s get the baseline established first. Assume that our data source is encoded in RDF/XML, and the load routine is loadData. I generally use a couple of helper methods to make things a bit smoother in my database code. In particular, I use a short name or alias for each database I’m working with, and store the connection URL, model name, user name etc in a table (usually in code, but it could be loaded from a file). I’m not going to dwell on this pattern in this blog entry, since it’s not the point of the article. Suffice it to say that getDBUrl returns the connection URL for the database (i.e. the JDBC URL), and so on for the other methods.
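For concreteness, the helpers might be no more than lookups into a small in-code table; this is just a sketch, and the alias, URL and credentials below are invented for illustration:

// hypothetical helpers: a small in-code table of connection details, keyed by alias
// (needs java.util.Map and java.util.HashMap)
private static final Map<String, String[]> DB_TABLE = new HashMap<String, String[]>();
static {
    // { JDBC URL, model name, user name, password }
    DB_TABLE.put( "mydb", new String[] {"jdbc:mysql://localhost/jena", "data", "user", "pass"} );
}

protected String getDBUrl( String dbAlias )       { return DB_TABLE.get( dbAlias )[0]; }
protected String getDBModelName( String dbAlias ) { return DB_TABLE.get( dbAlias )[1]; }
protected String getDBUserName( String dbAlias )  { return DB_TABLE.get( dbAlias )[2]; }
protected String getDBPassword( String dbAlias )  { return DB_TABLE.get( dbAlias )[3]; }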
Given that, the primary method here is loadData, which opens the named model from the database, then reads in the contents of a file or URL. The source argument is the file name or URL pointing to the input document:
protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    FileManager.get().readModel( model, source );
}
private ModelMaker getRDBModelMaker( String dbAlias ) {
    return ModelFactory.createModelRDBMaker( getConnection( dbAlias ) );
}

private IDBConnection getConnection( String dbAlias ) {
    try {
        Class.forName( DBDRIVER_CLASS );
    }
    catch (ClassNotFoundException e) {
        throw new RuntimeException( "Failed to load DB driver " + DBDRIVER_CLASS, e );
    }

    return new DBConnection( getDBUrl( dbAlias ),
                             getDBUserName( dbAlias ),
                             getDBPassword( dbAlias ),
                             DB );
}
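A call then looks like this (the alias and file name are made up for illustration):

loadData( "mydb", "file:data/example.owl" );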
This works, but given any significant amount of data to read in, it will usually be very slow. The first tweak is always to do the work inside a transaction. This won’t hurt if the underlying DB engine doesn’t handle transactions, but will greatly help if it does:
protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );

    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}
In practice, there should be a try/catch block there to roll back the transaction if an exception occurs, but I’m leaving out clutter for educational purposes!
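For reference, the defensive version might look something like the sketch below; Model.abort() discards the pending transaction:

protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );

    model.begin();
    try {
        FileManager.get().readModel( model, source );
        model.commit();
    }
    catch (JenaException e) {
        // roll back the partial load, then let the caller see the error
        model.abort();
        throw e;
    }
}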
This probably still isn’t fast enough though. One reason is that, to fulfil the Jena model contract, the database driver checks that there are no duplicate triples as the data is read in. This requires testing for the existence of each statement prior to inserting it into the triple table. Clearly this is going to be a lot of work for large sets of triples. It’s possible to turn off duplicate checking:
protected void loadData( String dbAlias, String source ) {
    ModelMaker maker = getRDBModelMaker( dbAlias );
    ModelRDB model = (ModelRDB) maker.openModel( getDBModelName( dbAlias ) );
    model.setDoDuplicateCheck( false );

    model.begin();
    FileManager.get().readModel( model, source );
    model.commit();
}
The problem with this is that it moves the responsibility for ensuring that there are no duplicates from the DB driver to the calling code. Now, it may well be that this is known from the context: the data may be generated in a way that ensures that it’s free of duplicates. In which case, no problem. But what if that’s not certain? One solution is to scrub the data externally, using commonly available tools on Unix (or Cygwin on Windows).
First we migrate the data to the n-triple format. N-triple is a horrible format for the human reader, but ideal for machine processing: every triple is on one line, and there is no structure to the file. This means, for example, that cat can be used to join multiple documents together, something that can’t be done with the RDF/XML or N3 formats. Jena provides a command line utility for converting between formats: rdfcat.
Let’s take a simple example. Here’s a mini OWL file:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns="http://example.com/a#"
         xml:base="http://example.com/a">
  <owl:Ontology rdf:about="">
    <owl:imports rdf:resource="http://example.com/b" />
  </owl:Ontology>
  <owl:Class rdf:ID="AClass" />
</rdf:RDF>
Which we then convert to n-triples:
$ java jena.rdfcat -out ntriple a.owl
<http://example.com/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Ontology> .
<http://example.com/a> <http://www.w3.org/2002/07/owl#imports> <http://example.com/b> .
<http://example.com/a#AClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
Assume that we have a collection of n-triple files (a.nt, b.nt, etc) and we want to remove all of the duplicate triples. Using common Unix utilities, this can be done as:

cat a.nt b.nt c.nt | sort -k 1 | uniq > nodups.nt

The sort utility sorts the input lines into lexical order, while -k 1 tells it to use the entire line not just the first field (sort splits lines into fields, using whitespace as a separator). uniq condenses adjacent duplicate lines into one, which is where the duplicate triple removal happens.
Finally, what do we need to change in the original program to load n-triples instead of RDF/XML or OWL files? Happily, nothing! The Jena FileManager uses the extension of a file to guess the content encoding. *.nt triggers the n-triple parser, so since we used that convention in naming the file we’re done.
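If for some reason the naming convention isn’t an option, my recollection is that the syntax can also be given to FileManager explicitly, along these lines:

// force the n-triple parser regardless of the file’s extension
FileManager.get().readModel( model, source, "N-TRIPLE" );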
On a recent project, loading a million-triple model into a MySQL 4 database took me just about 10 minutes using these tips, while before optimisation it was taking hours.