SDB/NotesPostgreSQL
From Jena wiki
PostgreSQL specific notes.
Contents |
[edit] Databases must use UTF-8 encoding
Create SDB stores with encoding UTF-8.
International character sets can cause corrupted databases otherwise. The database will not pass the SDB test suite.
Set this when creating the database with pgAdmin or if you use the command line, for example:
CREATE DATABASE "YourStoreName"
WITH OWNER = "user"
ENCODING = 'UTF8'
TABLESPACE = pg_default;
[edit] Improving loading rates
The index layout ("layout2/index") usually loads faster than the hash form.
Existing store
When loading into an existing store, where there is existing data and ANALYZE has been run, the process is:
- Drop indexes
sdbconfig --drop
- Load data
sdbload file
- Redo the indexes
sdbconfig --index
Fresh store
PostgreSQL needs statistics to improve load performance through the use of ANALYSE.
When loading the first time, there are no statistics so, for a large load, it is advisable to load a sample, run ANALYSE and then load the whole data.
- Create the database without indexes (just the primary keys).
sdbconfig --format
- Load a sample of the triples (say, a 100K or a million triples - until
the load rate starts to drop appreciably). The sample must be representative of the data.
sdbload --time sample
- Run
ANALYZEon the database.
- If your sample is one part of a large set of files, this set is not necessary at all. If you are loading one single large file then you might wish to empty the database. This is only needed if the data has bNodes in
it because the load process suppresses duplicates.
sdbconfig --truncate
- Now load the data or rest of the data.
sdbload --time file
- Add the indexes. This only takes a few minutes even on a very large store but calculating them during loading (that is,
--create, not--format) is noticeably slower.
sdbconfig --index
- Run
ANALYZEon the database again.
[edit] Tuning
It is essential to run the PostgreSQL ANALYZE command on a database, either during or after building. This is done via the command line psql or via pgAdmin. The PostgreSQL documentation describes ways to run this as a background daemon.
Various of the PostgreSQL configuration parameters will affect performance, particularly effective_cache_size. The parameter enable_seqscan may help avoid some unexpected slow queries.
