Technical Setup

The ARCHE repository system is based on the well-established open-source repository software Fedora Commons 4, which provides a sound technological basis for implementing the Open Archival Information System (OAIS) reference model.

The overall architecture is built in a modular service-oriented manner, consisting of multiple interconnected components communicating through well-defined APIs:

  • Fedora Commons 4 - the main software component, used to store resources and their metadata.
  • triplestore - Fedora 4 is coupled with an external triplestore (Blazegraph) exposing a convenient metadata search API. It allows other components as well as external clients to perform complex SPARQL queries across all metadata (see the SPARQL sketch after this list). The synchronisation is done using the official Fedora 4 plugin (fcrepo-camel-toolbox/fcrepo-indexing-triplestore). Triplestore authorisation is provided by the doorkeeper.
  • doorkeeper - the single point of access to Fedora and the triplestore. It enforces the established business rules (transactions, metadata validation, etc.), handles client authentication for both Fedora and the SPARQL endpoint, and provides authorisation for the SPARQL endpoint.
    From the client perspective the doorkeeper acts as a transparent proxy to the Fedora 4 API and the SPARQL endpoint; it is therefore compatible with any Fedora 4 compliant client, but refuses data which does not fulfil the established business rules.
  • repository browser - the user-facing component, allowing users to navigate, search and view the repository content. It is the default dissemination service and is implemented as a Drupal 8 module.
  • OAI-PMH service - an application implementing the OAI-PMH protocol, which is widely used for harvesting repository contents (a harvesting sketch follows this list).
  • dissemination services - an open, growing set of web applications specialised in processing and displaying specific data types or formats. A typical example is the transformation of TEI (or any other XML) into HTML or PDF, but it could also be plotting geographical data on a map or visualising graph-based data as an interactive network. The binding of resources to particular dissemination services is provided by the repo-resolver service.
  • repo-resolver - a service resolving a resource URI to a particular representation of the resource, e.g. its default view in the repository browser, its metadata in RDF, its binary content or a PDF generated from the resource. All representations are provided by the dissemination services. Bindings between resources and dissemination services can be expressed either on the resource level, depending on its type, or as general rules over the resource metadata (e.g. "if a resource has a metadata property X and its value is Y"); see the binding sketch after this list. Bindings are stored in Fedora 4 as dedicated repository resources.
  • repo-file-checker - a command line application used to validate every dataset before it is ingested into the repository. It automatically checks the data structure and formats, and provides an overview of the dataset with respect to size, structure and file formats used.
  • ingest scripts - a set of PHP scripts used to ingest data into the repository, which make use of the methods provided by repo-php-util.
  • repo-php-util - a PHP library providing both a medium- and a high-level API for communicating with Fedora 4 and the SPARQL triplestore. It offers a convenient way to manipulate repository resources consistently, which is quite troublesome when the Fedora 4 API is used directly. The library is used by most other components: the doorkeeper, the repository browser, the OAI-PMH service, the repo-resolver and the ingest scripts.
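To give an impression of the metadata search API, the following is a minimal PHP sketch posting a SPARQL query to the endpoint proxied by the doorkeeper, following the standard SPARQL 1.1 protocol; the endpoint URL and the queried property are illustrative assumptions, not the actual ARCHE configuration.

    <?php
    // Hypothetical SPARQL endpoint (proxied by the doorkeeper).
    $endpoint = 'https://arche.example.org/blazegraph/sparql';

    // Find resources whose title mentions "Vienna"; the Dublin Core
    // property used here is illustrative, not the ARCHE metadata schema.
    $query = 'SELECT ?resource ?title WHERE { '
           . '?resource <http://purl.org/dc/terms/title> ?title . '
           . 'FILTER regex(str(?title), "Vienna", "i") } LIMIT 10';

    // POST the query as form data, as defined by the SPARQL 1.1 protocol.
    $context = stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "Accept: application/sparql-results+json\r\n",
            'content' => http_build_query(['query' => $query]),
        ],
    ]);
    $results = json_decode(file_get_contents($endpoint, false, $context), true);
    foreach ($results['results']['bindings'] as $row) {
        echo $row['resource']['value'], "\t", $row['title']['value'], "\n";
    }

Because the doorkeeper is a transparent proxy, the same request could be issued by any generic SPARQL client.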
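Harvesting via the OAI-PMH service can be sketched in a few lines of PHP; the base URL below is a hypothetical placeholder, while the verb and parameters follow the OAI-PMH 2.0 specification.

    <?php
    // Hypothetical OAI-PMH base URL.
    $baseUrl = 'https://arche.example.org/oaipmh/';
    $params  = [
        'verb'           => 'ListRecords',
        'metadataPrefix' => 'oai_dc', // Dublin Core, mandatory for every OAI-PMH provider
    ];
    $xml = new SimpleXMLElement(file_get_contents($baseUrl . '?' . http_build_query($params)));
    $xml->registerXPathNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');
    // Print the identifier of every harvested record. Large result sets are
    // paged: the harvest continues by resending the request with the
    // resumptionToken found in the previous response.
    foreach ($xml->xpath('//oai:record/oai:header/oai:identifier') as $id) {
        echo (string) $id, "\n";
    }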
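The rule-based binding performed by the repo-resolver can be pictured with the following sketch; it illustrates the idea only and is not the actual repo-resolver code, and all property names and service URLs are hypothetical.

    <?php
    // Illustrative only: map a resource to a dissemination service
    // based on its metadata.
    function resolve(string $uri, array $metadata): string
    {
        // Each rule reads: "if a resource has metadata property X with
        // value Y, dispatch it to dissemination service Z".
        $rules = [
            ['property' => 'ebucore:hasMimeType', 'value' => 'application/tei+xml',
             'service'  => 'https://arche.example.org/dissemination/tei2html?uri=%s'],
            ['property' => 'ebucore:hasMimeType', 'value' => 'application/geo+json',
             'service'  => 'https://arche.example.org/dissemination/map?uri=%s'],
        ];
        foreach ($rules as $rule) {
            if (($metadata[$rule['property']] ?? null) === $rule['value']) {
                return sprintf($rule['service'], urlencode($uri));
            }
        }
        // No rule matched: fall back to the default view in the repository browser.
        return 'https://arche.example.org/browser/' . urlencode($uri);
    }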

The whole custom-built software stack (except for some dissemination services) is implemented in PHP, and most of the source code is published on GitHub.

Resources published via the repository are assigned a Handle-based persistent identifier (PID) issued by the PID service run by the GWDG. By using these PIDs, we ensure that the resources remain referenceable and citable even if their actual location or the underlying repository system should change in the future.
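To illustrate what resolution looks like, the following PHP sketch queries the public Handle System REST proxy; the handle shown is a made-up example under an assumed prefix, not a real ARCHE identifier.

    <?php
    // Hypothetical handle; resolve it through the public Handle proxy.
    $pid  = '21.11115/0000-0000-0000-0000';
    $data = json_decode(file_get_contents('https://hdl.handle.net/api/handles/' . $pid), true);
    // A handle record holds typed values; the URL value points to the
    // current location of the resource.
    foreach ($data['values'] as $value) {
        if ($value['type'] === 'URL') {
            echo 'resolves to: ', $value['data']['value'], "\n";
        }
    }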

The system runs in a Docker environment on one of the virtual machines hosted on dedicated servers maintained by the Computing Centre of the Academy (ARZ). The data is secured against emergencies and data loss by a multi-layered backup strategy, cf. Storage Procedures.

The technological development of the repository infrastructure is a continuous process, driven by the qualified staff of ACDH-OeAW and supported by the Academy's computing centre. Furthermore, the team is embedded in a broad network of data centres via the research infrastructures CLARIN and DARIAH as well as the working group Datenzentren of the DHd association.

Figure: Architecture of ARCHE

Storage Procedures

Data storage and backup procedures are essential parts of our data management system. To avoid data loss due to deterioration of physical storage, malicious threats or other emergencies, redundancy is key for the preservation of data.
The primary server storage is a RAID-6 array, which sustains read and write operations even in the presence of up to two concurrent disk failures (the dual parity costs two disks' worth of capacity, so an array of n disks offers the usable capacity of n − 2 disks).

Our backup policies follow a multi-layered setup: the live data stored on the repository production server is copied every night to the ARZ NetApp production storage, of which numerous snapshots are stored on the ARZ NetApp backup system in a separate location.
Daily snapshots are kept for 28 days, weekly snapshots for 52 weeks. In addition, the data is encrypted and copied to long-term storage at the computing centre run by the Max Planck Computing and Data Facility (MPCDF) in Garching.

Figure: ARCHE storage procedures