<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Chapter&nbsp;9.&nbsp;DSpace System Documentation: Storage Layer</title><meta content="DocBook XSL Stylesheets V1.75.2" name="generator"><link rel="home" href="index.html" title="DSpace Manual"><link rel="up" href="index.html" title="DSpace Manual"><link rel="prev" href="ch08.html" title="Chapter&nbsp;8.&nbsp;DSpace System Documentation: System Administration"><link rel="next" href="ch10.html" title="Chapter&nbsp;10.&nbsp;DSpace System Documentation: Directories and Files"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF" marginwidth="5m"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">Chapter&nbsp;9.&nbsp;DSpace System Documentation: Storage Layer</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="ch08.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="ch10.html">Next</a></td></tr></table><hr></div><div class="chapter" title="Chapter&nbsp;9.&nbsp;DSpace System Documentation: Storage Layer"><div class="titlepage"><div><div><h2 class="title"><a name="N159EA"></a>Chapter&nbsp;9.&nbsp;<a name="docbook-storage.html"></a>DSpace System Documentation: Storage Layer</h2></div></div><div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="ch09.html#N159F6">9.1. RDBMS</a></span></dt><dd><dl><dt><span class="section"><a href="ch09.html#N15A97">9.1.1. Maintenance and Backup</a></span></dt><dt><span class="section"><a href="ch09.html#N15AD0">9.1.2. Configuring the RDBMS Component</a></span></dt></dl></dd><dt><span class="section"><a href="ch09.html#N15B25">9.2. Bitstream Store</a></span></dt><dd><dl><dt><span class="section"><a href="ch09.html#N15C1C">9.2.1. Backup</a></span></dt><dt><span class="section"><a href="ch09.html#N15C30">9.2.2. 
Configuring the Bitstream Store</a></span></dt><dd><dl><dt><span class="section"><a href="ch09.html#N15C3A">9.2.2.1. Configuring Traditional Storage</a></span></dt><dt><span class="section"><a href="ch09.html#N15C68">9.2.2.2. Configuring SRB Storage</a></span></dt></dl></dd></dl></dd></dl></div><p>
<a class="link" href="ch11.html#docbook-architecture.html">Back to architecture overview</a>
</p><div class="section" title="9.1.&nbsp;RDBMS"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N159F6"></a>9.1.&nbsp;<a name="docbook-storage.html-rdbms"></a>RDBMS</h2></div></div><div></div></div><p>DSpace uses a relational database to store all information about the organization of content, metadata about the content, information about e-people and authorization, and the state of currently-running workflows. The DSpace system also uses the relational database in order to maintain indices that users can browse.</p><p>
<a class="ulink" href="image/db-schema.gif" target="_top">Graphical visualization of the relational database</a>
</p><p>Most of the functionality that DSpace uses can be offered by any standard SQL database that supports transactions. Presently, the browse indices use some features specific to <a class="ulink" href="http://www.postgresql.org/" target="_top">PostgreSQL</a> and <a class="ulink" href="http://www.oracle.com/database/" target="_top">Oracle</a>, so some modification to the code would be needed before DSpace would function fully with an alternative database back-end.</p><p>The <code class="literal">org.dspace.storage.rdbms</code> package provides access to an SQL database in a somewhat simpler form than using JDBC directly. The main class is <code class="literal">DatabaseManager</code>, which executes SQL queries and returns <code class="literal">TableRow</code> or <code class="literal">TableRowIterator</code> objects. The <code class="literal">InitializeDatabase</code> class is used to load SQL into the database via JDBC, for example to set up the schema.</p><p>All calls to the <code class="literal">DatabaseManager</code> require a <a class="link" href="ch13.html#docbook-business.html-core">DSpace <code class="literal">Context</code> object</a>. Examples of using the database manager API are given in the <code class="literal">org.dspace.storage.rdbms</code> package Javadoc.</p><p>The database schema used by DSpace is created by SQL statements stored in a directory specific to each supported RDBMS platform:
<div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>PostgreSQL schemas are in <code class="literal">[dspace-source]/dspace/etc/postgres/</code></p></li><li class="listitem"><p>Oracle schemas are in <code class="literal">[dspace-source]/dspace/etc/oracle/</code></p></li></ul></div>
The SQL (DDL) statements to create the tables for the current release, starting with an empty database, are in <code class="literal">database_schema.sql</code>. The schema SQL file also creates the two e-person groups (<code class="literal">Anonymous</code> and <code class="literal">Administrator</code>) that the system requires to function properly.</p><p>Also in <code class="literal">[dspace-source]/dspace/etc/[database]</code> are various SQL files called <code class="literal">database_schema_1x_1y</code>. These contain the necessary SQL commands to update a live DSpace database from version 1.<code class="literal">x</code> to 1.<code class="literal">y</code>. Note that this might not be the only part of an upgrade process: see <a class="link" href="ch04.html#docbook-update.html">Updating a DSpace Installation</a> for details.</p><p>The DSpace database code uses an SQL function <code class="literal">getnextid</code> to assign primary keys to newly created rows. This SQL function must be safe to use if several JVMs are accessing the database at once; for example, the Web UI might be creating new rows in the database at the same time as the batch item importer. The PostgreSQL-specific implementation of this function uses <code class="literal">SEQUENCES</code> for each table in order to create new IDs. If an alternative database backend were to be used, the implementation of <code class="literal">getnextid</code> could be updated to operate with that specific DBMS.</p><p>The <code class="literal">etc</code> directory in the source distribution contains two further SQL files. <code class="literal">clean-database.sql</code> contains the SQL necessary to completely clean out the database, so use with caution! The Ant target <code class="literal">clean_database</code> can be used to execute this. <code class="literal">update-sequences.sql</code> contains SQL to reset the primary key generation sequences to appropriate values. 
You'd need to do this if, for example, you're restoring a backup database dump which creates rows with specific primary keys already defined. In such a case, the sequences would allocate primary keys that were already used.</p><p>Versions of the <code class="literal">*.sql*</code> files for Oracle are stored in <code class="literal">[dspace-source]/dspace/etc/oracle</code>. These need to be copied over their PostgreSQL counterparts in <code class="literal">[dspace-source]/dspace/etc</code> prior to installation.</p><div class="section" title="9.1.1.&nbsp;Maintenance and Backup"><div class="titlepage"><div><div><h3 class="title"><a name="N15A97"></a>9.1.1.&nbsp;Maintenance and Backup</h3></div></div><div></div></div><p>When using PostgreSQL, it's a good idea to perform regular 'vacuuming' of the database to optimize performance. This is performed by the <code class="literal">vacuumdb</code> command which can be executed via a 'cron' job, for example by putting this in the system <code class="literal">crontab</code>:</p><pre class="screen">
# clean up the database nightly
40 2 * * * /usr/local/pgsql/bin/vacuumdb --analyze dspace &gt; /dev/null 2&gt;&amp;1
</pre><p>The DSpace database can be backed up and restored using the usual methods, for example with <code class="literal">pg_dump</code> and <code class="literal">psql</code>. However, when restoring a database, you will need to perform these additional steps:</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p> The <code class="literal">fresh_install</code> target loads up the initial contents of the Dublin Core type and bitstream format registries, as well as two entries in the <code class="literal">epersongroup</code> table for the system anonymous and administrator groups. Before you restore a raw backup of your database, you will need to remove these, since they will already exist in your backup, possibly having been modified. For example, use:</p><pre class="screen">
DELETE FROM dctyperegistry;
DELETE FROM bitstreamformatregistry;
DELETE FROM epersongroup;
</pre></li><li class="listitem"><p> After restoring a backup, you will need to reset the primary key generation sequences so that they do not produce already-used primary keys. Do this by executing the SQL in <code class="literal">[dspace-source]/dspace/etc/update-sequences.sql</code>, for example with:</p><pre class="screen">
psql -U dspace -f <span class="emphasis"><em>[dspace-source]</em></span>/dspace/etc/update-sequences.sql
</pre></li></ul></div><p>Future updates of DSpace may involve minor changes to the database schema. Specific instructions on how to update the schema whilst keeping live data will be included. The current schema also contains a few currently unused database columns, to be used for extra functionality in future releases. These unused columns have been added in advance to minimize the effort required to upgrade.</p></div><div class="section" title="9.1.2.&nbsp;Configuring the RDBMS Component"><div class="titlepage"><div><div><h3 class="title"><a name="N15AD0"></a>9.1.2.&nbsp;Configuring the RDBMS Component</h3></div></div><div></div></div><p>The database manager is configured with the following properties in <code class="literal">dspace.cfg</code>:</p><div class="informaltable"><table border="0"><colgroup><col><col></colgroup><tbody><tr><td>
<p>
<code class="literal">db.url</code>
</p>
</td><td>
<p>The JDBC URL to use for accessing the database. This should not point to a connection pool, since DSpace already implements a connection pool.</p>
</td></tr><tr><td>
<p>
<code class="literal">db.driver</code>
</p>
</td><td>
<p>JDBC driver class name. For a PostgreSQL database, this should be <code class="literal">org.postgresql.Driver</code>.</p>
</td></tr><tr><td>
<p>
<code class="literal">db.username</code>
</p>
</td><td>
<p>Username to use when accessing the database.</p>
</td></tr><tr><td>
<p>
<code class="literal">db.password</code>
</p>
</td><td>
<p>Corresponding password to use when accessing the database.</p>
</td></tr></tbody></table></div></div></div><div class="section" title="9.2.&nbsp;Bitstream Store"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N15B25"></a>9.2.&nbsp;<a name="docbook-storage.html-bitstreams"></a>Bitstream Store</h2></div></div><div></div></div><p>DSpace offers two means for storing content. The first is in the file system on the server. The second is using <a class="ulink" href="http://www.sdsc.edu/srb" target="_top">SRB (Storage Resource Broker)</a>. Both are achieved using a simple, lightweight API.</p><p>SRB is purely optional, but may be used in lieu of the server's file system or in addition to the file system. Without going into a full description, SRB is a very robust, sophisticated storage manager that offers essentially unlimited storage and straightforward means to replicate (in simple terms, back up) the content on other local or remote storage resources.</p><p>The terms "store", "retrieve", "in the system", "storage", and so forth, used below can refer to storage in the file system on the server ("traditional") or in SRB.</p><p>The <code class="literal">BitstreamStorageManager</code> provides low-level access to bitstreams stored in the system. In general, it should not be used directly; instead, use the <code class="literal">Bitstream</code> object in the <a class="link" href="ch13.html#docbook-business.html-content">content management API</a>, since that encapsulates authorization and other metadata to do with a bitstream that are not maintained by the <code class="literal">BitstreamStorageManager</code>.</p><p>The bitstream storage manager provides three methods that store, retrieve and delete bitstreams. Bitstreams are referred to by their 'ID'; that is, the primary key <code class="literal">bitstream_id</code> column of the corresponding row in the database.</p><p>As of DSpace version 1.1, there can be multiple bitstream stores. 
Each of these bitstream stores can be traditional storage or SRB storage. This means that the potential storage of a DSpace system is not bound by the maximum size of a single disk or file system and also that traditional and SRB storage can be combined in one DSpace installation. Both traditional and SRB storage are specified by <a class="link" href="ch05.html#docbook-configure.html">configuration parameters</a>. Also see Configuring the Bitstream Store below.</p><p>Stores are numbered, starting with zero, then counting upwards. Each bitstream entry in the database has a store number, used to retrieve the bitstream when required.</p><p>At the moment, the store in which new bitstreams are placed is decided using a configuration parameter, and there is no provision for moving bitstreams between stores. Administrative tools for manipulating bitstreams and stores will be provided in future releases. Right now you can move a whole store (e.g. you could move store number 1 from <code class="literal">/localdisk/store</code> to <code class="literal">/fs/anotherdisk/store</code>), but it would still have to be store number 1 and have the exact same contents.</p><p>Bitstreams also have a 38-digit internal ID, different from the primary key ID of the bitstream table row. This is not visible or used outside of the bitstream storage manager. It is used to determine the exact location (relative to the relevant store directory) at which the bitstream is stored in traditional or SRB storage. The first three pairs of digits are the directory path that the bitstream is stored under. The bitstream is stored in a file with the internal ID as the filename.</p><p>For example, a bitstream with the internal ID <code class="literal">12345678901234567890123456789012345678</code> is stored in the file:</p><pre class="screen">
(assetstore dir)/12/34/56/12345678901234567890123456789012345678
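</pre><p>The mapping from internal ID to storage location described above can be illustrated with a short Python sketch. It is not DSpace code: the function name <code class="literal">bitstream_path</code> and the example store directory <code class="literal">/dspace/assetstore</code> are hypothetical, and the only behaviour assumed is what this section states (a 38-digit ID, the first three pairs of digits used as directories, and the internal ID used as the filename).</p><pre class="screen">
```python
import posixpath

ID_LENGTH = 38  # internal IDs are 38 decimal digits, as described in the text


def bitstream_path(store_dir, internal_id):
    """Return store_dir/AA/BB/CC/internal_id, where AA, BB and CC are the
    first three pairs of digits of the internal ID (hypothetical helper)."""
    if len(internal_id) != ID_LENGTH or not internal_id.isdigit():
        raise ValueError("internal ID must be exactly 38 digits")
    # Slice out the digit pairs at offsets 0, 2 and 4.
    pairs = [internal_id[i:i + 2] for i in (0, 2, 4)]
    return posixpath.join(store_dir, *pairs, internal_id)


# The store directory below is an example value, not a DSpace default.
print(bitstream_path("/dspace/assetstore",
                     "12345678901234567890123456789012345678"))
# prints /dspace/assetstore/12/34/56/12345678901234567890123456789012345678
```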
</pre><p>The reasons for storing files this way are:</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p> Using a randomly-generated 38-digit number means that the 'number space' is less cluttered than simply using the primary keys, which are allocated sequentially and are thus close together. This means that the bitstreams in the store are distributed around the directory structure, improving access efficiency.</p></li><li class="listitem"><p> The internal ID is used as the filename partly to avoid requiring an extra lookup of the filename of the bitstream, and partly because bitstreams may be received from a variety of operating systems. The original name of a bitstream may be an illegal UNIX filename.</p></li></ul></div><p>When storing a bitstream, the <code class="literal">BitstreamStorageManager</code> DOES set the following fields in the corresponding database table row:</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p>
<code class="literal">bitstream_id</code>
</p></li><li class="listitem"><p>
<code class="literal">size</code>
</p></li><li class="listitem"><p>
<code class="literal">checksum</code>
</p></li><li class="listitem"><p>
<code class="literal">checksum_algorithm</code>
</p></li><li class="listitem"><p>
<code class="literal">internal_id</code>
</p></li><li class="listitem"><p>
<code class="literal">deleted</code>
</p></li><li class="listitem"><p>
<code class="literal">store_number</code>
</p></li></ul></div><p>The remaining fields are the responsibility of the <code class="literal">Bitstream</code> content management API class.</p><p>The bitstream storage manager is fully transaction-safe. In order to implement transaction-safety, the following algorithm is used to store bitstreams:</p><div class="orderedlist"><ol class="orderedlist" type="1"><li class="listitem"><p> A database connection is created, separately from the currently active connection in the <a class="link" href="ch13.html#docbook-business.html-core">current DSpace context</a>.</p></li><li class="listitem"><p> A unique internal identifier (separate from the database primary key) is generated.</p></li><li class="listitem"><p> The bitstream DB table row is created using this new connection, with the <code class="literal">deleted</code> column set to <code class="literal">true</code>.</p></li><li class="listitem"><p> The new connection is <code class="literal">commit</code>ted, so the 'deleted' bitstream row is written to the database.</p></li><li class="listitem"><p> The bitstream itself is stored in a file in the configured 'asset store directory', with a directory path and filename derived from the internal ID.</p></li><li class="listitem"><p> The <code class="literal">deleted</code> flag in the bitstream row is set to <code class="literal">false</code>. 
This will occur (or not) as part of the current DSpace <code class="literal">Context</code>.</p></li></ol></div><p>This means that should anything go wrong before, during or after the bitstream storage, only one of the following can be true:</p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"><p> No bitstream table row was created, and no file was stored</p></li><li class="listitem"><p> A bitstream table row with <code class="literal">deleted=true</code> was created, no file was stored</p></li><li class="listitem"><p> A bitstream table row with <code class="literal">deleted=true</code> was created, and a file was stored</p></li></ul></div><p>None of these affect the integrity of the data in the database or bitstream store.</p><p>Similarly, when a bitstream is deleted for some reason, its <code class="literal">deleted</code> flag is set to true as part of the overall transaction, and the corresponding file in storage is <span class="emphasis"><em>not</em></span> deleted.</p><p>The above techniques mean that the bitstream storage manager is transaction-safe. Over time, the bitstream database table and file store may contain a number of 'deleted' bitstreams. The <code class="literal">cleanup</code> method of <code class="literal">BitstreamStorageManager</code> goes through these deleted rows, and actually deletes them along with any corresponding files left in the storage. It only removes 'deleted' bitstreams that are more than one hour old, just in case cleanup is happening in the middle of a storage operation.</p><p>This cleanup can be invoked from the command line via the <code class="literal">Cleanup</code> class, which can in turn be easily executed from a shell on the server machine using <code class="literal">/dspace/bin/cleanup</code>. 
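</p><p>The store algorithm and the cleanup rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the <code class="literal">BitstreamStorageManager</code> implementation: the class name <code class="literal">BitstreamStoreSketch</code> is hypothetical, a dictionary stands in for the bitstream table (so the separate connection and commit become a plain dictionary write), the internal ID is supplied by the caller, files are written flat into one directory, and a per-row creation time is assumed as the way to judge whether a 'deleted' row is more than one hour old.</p><pre class="screen">
```python
import os
import tempfile
import time

CLEANUP_AGE_SECONDS = 3600  # purge only 'deleted' rows more than one hour old


class BitstreamStoreSketch:
    """Dictionary-backed stand-in for the bitstream table plus a file store."""

    def __init__(self, store_dir):
        self.store_dir = store_dir
        self.rows = {}  # internal_id -> {"deleted": bool, "created": float}

    def store(self, internal_id, data, now=None):
        created = time.time() if now is None else now
        # Steps 1-4: the row is written first, already flagged as deleted.
        self.rows[internal_id] = {"deleted": True, "created": created}
        # Step 5: write the file; a failure here leaves only a 'deleted' row.
        with open(os.path.join(self.store_dir, internal_id), "wb") as out:
            out.write(data)
        # Step 6: clear the flag (in DSpace, part of the caller's transaction).
        self.rows[internal_id]["deleted"] = False

    def cleanup(self, now=None):
        now = time.time() if now is None else now
        for internal_id, row in list(self.rows.items()):
            if row["deleted"] and now - row["created"] > CLEANUP_AGE_SECONDS:
                path = os.path.join(self.store_dir, internal_id)
                if os.path.exists(path):
                    os.remove(path)
                del self.rows[internal_id]


with tempfile.TemporaryDirectory() as tmp:
    store = BitstreamStoreSketch(tmp)
    store.store("id-ok", b"content", now=0)
    try:
        store.store("id-bad", None, now=0)  # writing None fails mid-store
    except TypeError:
        pass
    store.cleanup(now=CLEANUP_AGE_SECONDS + 1)
    print(sorted(store.rows))  # prints ['id-ok']
```
</pre><p>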
You might like to have the cleanup run regularly by <code class="literal">cron</code>, though since DSpace is read-lots, write-not-so-much, it doesn't need to be run very often.</p><div class="section" title="9.2.1.&nbsp;Backup"><div class="titlepage"><div><div><h3 class="title"><a name="N15C1C"></a>9.2.1.&nbsp;Backup</h3></div></div><div></div></div><p>The bitstreams (files) in traditional storage may be backed up very easily by simply 'tarring' or 'zipping' the <code class="literal">assetstore</code> directory (or whichever directory is configured in <code class="literal">dspace.cfg</code>). Restoring is as simple as extracting the backed-up compressed file in the appropriate location.</p><p>Similar means could be used for SRB, but SRB offers many more options for managing backup.</p><p>Note that because the bitstream storage manager holds the bitstreams in storage and the information about them in the database, a database backup and a backup of the files in the bitstream store must be made at the same time; the bitstream data in the database must correspond to the stored files.</p><p>Of course, it isn't really ideal to 'freeze' the system while backing up to ensure that the database and files match up. Since DSpace uses the bitstream data in the database as the authoritative record, it's best to back up the database before the files. 
This is because it's better to have a bitstream in storage but not the database (effectively non-existent to DSpace) than a bitstream record in the database but not storage, since people would be able to find the bitstream but not actually get the contents.</p></div><div class="section" title="9.2.2.&nbsp;Configuring the Bitstream Store"><div class="titlepage"><div><div><h3 class="title"><a name="N15C30"></a>9.2.2.&nbsp;Configuring the Bitstream Store</h3></div></div><div></div></div><p>Both traditional and SRB bitstream stores are configured in <code class="literal">dspace.cfg</code>.</p><div class="section" title="9.2.2.1.&nbsp;Configuring Traditional Storage"><div class="titlepage"><div><div><h4 class="title"><a name="N15C3A"></a>9.2.2.1.&nbsp;Configuring Traditional Storage</h4></div></div><div></div></div><p>Bitstream stores in the file system on the server are configured like this:</p><pre class="screen">
assetstore.dir = <span class="emphasis"><em> [dspace]</em></span>/assetstore
</pre><p>(Remember that <span class="emphasis"><em>[dspace]</em></span> is a placeholder for the actual name of your DSpace install directory).</p><p>The above example specifies a single asset store.</p><pre class="screen">
assetstore.dir = <span class="emphasis"><em> [dspace]</em></span>/assetstore_0
assetstore.dir.1 = /mnt/other_filesystem/assetstore_1
</pre><p>The above example specifies two asset stores. assetstore.dir specifies the asset store number 0 (zero); after that use assetstore.dir.1, assetstore.dir.2 and so on. The particular asset store a bitstream is stored in is held in the database, so don't move bitstreams between asset stores, and don't renumber them.</p><p>By default, newly created bitstreams are put in asset store 0 (i.e. the one specified by the assetstore.dir property.) This allows backwards compatibility with pre-DSpace 1.1 configurations. To change this, for example when asset store 0 is getting full, add a line to <code class="literal">dspace.cfg</code> like:</p><pre class="screen">
assetstore.incoming = 1
</pre><p>Then restart DSpace (Tomcat). New bitstreams will be written to the asset store specified by <code class="literal">assetstore.dir.1</code>, which is <code class="literal">/mnt/other_filesystem/assetstore_1</code> in the above example.</p></div><div class="section" title="9.2.2.2.&nbsp;Configuring SRB Storage"><div class="titlepage"><div><div><h4 class="title"><a name="N15C68"></a>9.2.2.2.&nbsp;Configuring SRB Storage</h4></div></div><div></div></div><p>The same framework is used to configure SRB storage. That is, the asset store number (0..n) can reference a file system directory as above or it can reference a set of SRB account parameters. However, any particular asset store number can reference only one or the other, not both. This way traditional and SRB storage can both be used but with different asset store numbers. The same cautions mentioned above apply to SRB asset stores as well: the particular asset store a bitstream is stored in is held in the database, so don't move bitstreams between asset stores, and don't renumber them.</p><p>For example, let's say asset store number 1 will refer to SRB. Then there will be a set of SRB account parameters like this:</p><pre class="screen">
srb.host.1 = mysrbmcathost.myu.edu
srb.port.1 = 5544
srb.mcatzone.1 = mysrbzone
srb.mdasdomainname.1 = mysrbdomain
srb.defaultstorageresource.1 = mydefaultsrbresource
srb.username.1 = mysrbuser
srb.password.1 = mysrbpassword
srb.homedirectory.1 = /mysrbzone/home/mysrbuser.mysrbdomain
srb.parentdir.1 = mysrbdspaceassetstore
</pre><p>Several of the terms, such as <code class="literal">mcatzone</code>, have meaning only in the SRB context and will be familiar to SRB users. The last, <code class="literal">srb.parentdir.n</code>, can be used for additional (SRB) upper directory structure within an SRB account. This property value could be blank as well.</p><p>(If asset store 0 were to refer to SRB, it would be <code class="literal">srb.host =</code> ..., <code class="literal">srb.port =</code> ..., and so on (<code class="literal">.0</code> omitted) to be consistent with the traditional storage configuration above.)</p><p>As with traditional storage, <code class="literal">assetstore.incoming</code> selects the asset store for new bitstreams: 0 (the default) or 1..n (set explicitly). New bitstreams are therefore written to traditional or SRB storage depending on whether that asset store number references a file system directory on the server or a set of SRB account parameters.</p><p>There are comments in dspace.cfg that further elaborate the configuration of traditional and SRB storage.</p></div></div></div></div><HR><p class="copyright">Copyright &copy; 2002-2009
<a class="ulink" href="http://www.dspace.org/" target="_top">The DSpace Foundation</a>
</p><div class="legalnotice" title="Legal Notice"><a name="N1001D"></a><p>
<a class="ulink" href="http://creativecommons.org/licenses/by/3.0/us/" target="_top">
<span class="inlinemediaobject"><img src="http://i.creativecommons.org/l/by/3.0/us/88x31.png"></span>
Licensed under a Creative Commons Attribution 3.0 United States License
</a>
</p></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="ch08.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="ch10.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">Chapter&nbsp;8.&nbsp;DSpace System Documentation: System Administration&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;Chapter&nbsp;10.&nbsp;DSpace System Documentation: Directories and Files</td></tr></table></div></body></html>