ARCHE’s underlying system, initially based on version 4 of the open-source repository software Fedora Commons, was completely reworked in 2020 and is now based on a bespoke software stack. All data, stable identifiers (PIDs), and the functionality of the original application were preserved, both in the user interface and in the APIs that expose data to external applications.
Resources published via the repository are assigned a Handle-based persistent identifier issued by the PID service run by the GWDG. These PIDs ensure that the resources remain referenceable and citable even if their actual location or the underlying repository system changes in the future.
The system runs in a Docker environment on one of the virtual machines hosted on dedicated servers maintained by the Computing Centre of the Academy (ARZ). The data is protected against loss by a multi-layered backup strategy (cf. Storage Procedures).
The technological development of the repository infrastructure is a continuous process, driven by the qualified staff of ACDH-CH and supported by the Academy’s computing centre. Furthermore, the team is embedded in a broad network of data centres via the research infrastructures CLARIN and DARIAH as well as the working group Datenzentren of the DHd alliance.
The system is built in a modular, service-oriented manner, consisting of multiple interconnected components communicating through well-defined APIs. The full code is available on GitHub. Detailed technical documentation can be found under https://acdh-oeaw.github.io/arche-docs.
The main software stack, implemented in PHP, consists of the following components:
- arche-core provides the REST API for CRUD operations and transaction support. Writing to the repository is only possible through this component. Its REST API is documented at https://app.swaggerhub.com/apis/zozlak/arche.
- arche-doorkeeper implements ACDH-CH-specific business logic. It integrates with arche-core using arche-core’s handle system.
- arche-resolver is the service for handling the URI namespace in use (in our case https://id.acdh.oeaw.ac.at). It resolves URIs against identifiers in the repository and provides redirection to proper dissemination methods.
- arche-oaipmh provides an OAI-PMH endpoint for the repository.
- arche-gui provides ARCHE's graphical user interface for browsing repository content as well as an API endpoint for other tools such as the metadata editor. This part of the system is based on Drupal and makes use of some of its features, such as multilingual support and static pages.
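As an illustration of how an external client might address the OAI-PMH endpoint listed above, the following minimal sketch builds a request URL. The verbs and the `oai_dc` metadata prefix are part of the OAI-PMH 2.0 protocol; the base URL and the helper name `build_oai_request` are placeholders chosen for this example and do not reflect the actual address or code of arche-oaipmh.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption); the real endpoint address is not given here.
OAI_BASE_URL = "https://repository.example.org/oaipmh/"

# The six request verbs defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    # urlencode percent-encodes reserved characters in the argument values.
    return OAI_BASE_URL + "?" + urlencode(query)


# Harvest all records in unqualified Dublin Core.
url = build_oai_request("ListRecords", metadataPrefix="oai_dc")
print(url)
```

A harvester would send an HTTP GET request to the resulting URL and parse the XML response.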
A reference deployment of the components is provided by:
- arche-docker, a Docker image providing the runtime environment for the ARCHE software stack.
- arche-docker-config, a set of example configuration settings.
To ease development of ARCHE components and client software, a set of client libraries is provided. Documentation of the libraries is available at https://acdh-oeaw.github.io/arche-docs.
- arche-lib provides a convenient and uniform PHP API for repository search and CRUD operations. The API can communicate with the repository either through the REST API provided by arche-core (with support for both read and write operations) or through direct database access, which is limited to read-only operations. The REST mode is aimed at external clients. Direct database access is used by the internal repository components arche-doorkeeper, arche-resolver, arche-oaipmh and arche-gui.
- arche-lib-schema provides object mappings for the ACDH-CH ontology. It is used by arche-doorkeeper and arche-gui.
- arche-lib-disserv provides a PHP API for handling dissemination services, i.e. matching a repository resource with suitable dissemination services and creating a redirection URL. It is used by arche-resolver and arche-gui.
- arche-lib-ingest provides a high-level PHP API for ingesting data into ARCHE. To this end, RDF graphs with metadata are parsed and ingested, and files from a given directory are indexed and ingested. It is used by curators to ingest data into ARCHE.
The repository hosts a large variety of data types. While it provides a uniform default view on all the collections and digital objects with metadata and description, it also integrates smoothly with a growing set of specialised web applications designed to process and visualise specific data types or formats. These dissemination services run independently of the repository proper and are dynamically registered to be applied to certain types of data in the repository. For example, a TEI (or any other XML) document can be rendered and viewed as HTML, geographical data plotted on a map, or graph-based data visualised as an interactive network.
The binding of resources to dissemination services is configured dynamically, based on characteristics of the resources. It is governed primarily by the format of a resource, but the mechanism allows flexible matching on any metadata property.
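The matching idea can be sketched as follows. This is a simplified illustration, not the arche-lib-disserv implementation: the class, the metadata property names (`format`, `class`), the service names and the URL templates are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import quote


@dataclass
class DisseminationService:
    """A hypothetical dissemination service with its matching rules."""
    name: str
    url_template: str  # "{id}" is replaced with the encoded resource identifier
    required: dict = field(default_factory=dict)  # metadata property -> required value

    def matches(self, metadata: dict) -> bool:
        # Every required property must be present with the expected value.
        # A service with no requirements matches every resource.
        return all(metadata.get(prop) == value for prop, value in self.required.items())

    def redirect_url(self, metadata: dict) -> str:
        return self.url_template.format(id=quote(metadata["id"], safe=""))


def find_services(metadata: dict, services: list) -> list:
    """Return all services whose rules match the resource metadata."""
    return [service for service in services if service.matches(metadata)]


# Hypothetical registry: two services bound by format, one by another property.
SERVICES = [
    DisseminationService("tei2html", "https://tei.example.org/render?id={id}",
                         {"format": "application/tei+xml"}),
    DisseminationService("map-viewer", "https://map.example.org/?id={id}",
                         {"format": "application/geo+json"}),
    DisseminationService("collection-download", "https://dl.example.org/{id}",
                         {"class": "Collection"}),
]

resource = {
    "id": "https://id.acdh.oeaw.ac.at/example-resource",  # made-up identifier
    "format": "application/tei+xml",
    "class": "Resource",
}
matched = find_services(resource, SERVICES)
print([service.name for service in matched])
print(matched[0].redirect_url(resource))
```

Because the rules are plain property/value pairs, a new service can be registered without touching the repository itself, which mirrors the loose coupling described above.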
Data storage and backup procedures are essential parts of our data management system. Redundancy is key to avoiding data loss caused by deterioration of physical storage, malicious threats or other emergencies.
The primary server storage is a RAID-6 configuration, which sustains read and write operations in the presence of up to two concurrent disk failures.
Our backup policies follow a multi-layered setup: the live data stored on the repository production server is copied every night to the ARZ NetApp production storage, numerous snapshots of which are stored on the ARZ NetApp backup system in a separate location.
Daily snapshots are kept for 28 days, weekly snapshots for 52 weeks. In addition, the data is encrypted and copied to long-term storage in the computing centre run by the Max Planck Computing and Data Facility (MPCDF) in Garching.
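The snapshot retention rule can be expressed as a small check. The sketch below is illustrative only and does not describe how the NetApp system is actually configured; the function names are hypothetical, and it assumes that a snapshot is dropped once its age reaches the retention period.

```python
from datetime import date, timedelta

# Retention periods as stated in the text.
RETENTION = {
    "daily": timedelta(days=28),
    "weekly": timedelta(weeks=52),
}


def is_retained(taken_on: date, kind: str, today: date) -> bool:
    """Return True if a snapshot of the given kind is still within its retention period."""
    # Assumption: a snapshot expires as soon as its age equals the retention period.
    return today - taken_on < RETENTION[kind]


def prune(snapshots: list, today: date) -> list:
    """Keep only the (taken_on, kind) pairs that are still retained."""
    return [(taken_on, kind) for taken_on, kind in snapshots if is_retained(taken_on, kind, today)]


today = date(2024, 3, 1)
snapshots = [
    (date(2024, 2, 10), "daily"),   # 20 days old: kept
    (date(2024, 1, 15), "daily"),   # 46 days old: dropped
    (date(2023, 6, 1), "weekly"),   # 274 days old: kept
    (date(2023, 1, 1), "weekly"),   # 425 days old: dropped
]
kept = prune(snapshots, today)
print(kept)
```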