Persistent Data Storage for STAGE

An analysis of design considerations for persistent data storage in STAGE


Introduction

Persistent data storage for a multiplayer online game presents a number of interesting design challenges. WorldForge, in particular, presents further challenges, as designing towards any one specific database technology might limit the scalability or free nature of the project. In this paper, the author will discuss the need for storage pragmatism, ways to minimize the impact of this pragmatism upon server programmers, the project's needs for its storage model, and varying philosophies in storage architecture. The reader is expected to have at least a novice understanding of object-oriented methodology and relational database models. The reader may also wish to read the Entity Hierarchy documentation for the Shepherd module of STAGE.


Storage Pragmatism

In the interests of preserving the free, open-source nature of WorldForge, there is often an inclination towards embracing external technologies with similar licenses when it is deemed necessary to use "off the shelf" (OTS) extensions of the project's systems. This, however, is not always a viable approach, as not all application types have been successfully tackled by the free software community to the degree of quality that might be needed for a project of this nature.

The solution to this problem is to leave the choice of OTS extension up to the game administrator in a seamless and convenient manner, so that he can choose whichever storage and licensing model is most appropriate for his particular game.

While it is certainly possible for the WorldForge project to write its own data storage system, there are very good reasons not to. It would draw developers away from other parts of the project that need them, when there are already fully-featured, extensively tested OTS solutions in popular use today. There are a few free databases (e.g. mSQL, MySQL, PostgreSQL, DBM), but most of these have reduced feature sets, or present speed and scalability problems which may become insurmountable in a sufficiently large world. Some of the larger commercial databases (e.g. Personal Oracle, Sybase) have recently become free for non-commercial use under Linux, and there would certainly be those who prefer to use these for their relational database needs.

In short, there is a strong argument to be made for allowing administrators to choose their preferred data storage application. This leads to one very important question: how do we make this flexibility invisible to other STAGE programmers?


Hiding the Implementation

The choice of database and representation model should be entirely transparent to those working on other STAGE modules. The implementation should be administrator-configurable, either as a compile-time static link or (even better) a runtime dynamic link, based on a configuration setting. The interface to the rest of STAGE should be consistent across each implementation library, and the designers should follow standard object-oriented data-encapsulation techniques.
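
As a minimal sketch of this encapsulation (every name here is hypothetical, not an established STAGE API), the rest of the server would see only an abstract interface, with the concrete backend chosen from the administrator's configuration:

    #include <string>

    class Entity;                      // defined elsewhere in Shepherd
    class Config;                      // administrator's configuration

    typedef std::string EntityId;      // stand-in for the real identifier type

    // The rest of STAGE codes against this interface only; each backend
    // (filesystem, DBM, ODBC, ...) supplies its own implementation.
    class Datastore {
    public:
        virtual ~Datastore() {}

        virtual bool    saveEntity(const Entity& entity) = 0;
        virtual Entity* loadEntity(const EntityId& id) = 0;
        virtual bool    removeEntity(const EntityId& id) = 0;
    };

    // Selected at startup from a configuration setting -- either a
    // statically linked backend or one opened at runtime via dlopen().
    Datastore* createDatastore(const Config& config);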


Project Needs

Whatever technology STAGE uses for its persistent datastore needs to be scalable, portable, and reliable. If it is slow, its latency needs to at least be concealable (see "Hiding Database Latency" for more on this).


Database Technologies

Filesystem Datastore

Entities can be stored in files, individually or in batches, and organized using directories. LP-MUDs use a file-based datastore. The original notes for STAGE's persistent datastore recommended using a file-based storage system, at least in the early stages.
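
As a rough illustration of the idea (the path scheme and function are invented, not part of any STAGE design), a file-per-entity store might look like this:

    #include <fstream>
    #include <sstream>
    #include <string>

    // Hypothetical sketch of a file-per-entity layout. Entities are
    // spread across subdirectories (here by id modulo 100) so that no
    // single directory grows unmanageably large. Assumes the
    // directories already exist.
    bool saveEntityToFile(long id, const std::string& serialized) {
        std::ostringstream path;
        path << "world/entities/" << (id % 100) << "/" << id << ".ent";
        std::ofstream out(path.str().c_str());
        if (!out)
            return false;
        out << serialized;    // write the entity's serialized form
        return out.good();
    }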

Advantages:
  • Fast
  • Highly adaptable to future changes
  • Scalable
  • Requires a product everyone has: an operating system

Disadvantages:
  • Requires writing custom insert, update, and query functionality
  • Potentially cumbersome and huge
  • Not optimized for use as a database
  • Difficult to move to a new system
  • Difficult and slow to query on anything other than the name/index
Indexed Data

There are numerous ways to store data in a hash-table-like indexed format. Examples include DBM, Berkeley DB (which is mostly fancy DBM), and LDAP. TinyMUSHes store their objects in binary format in a DBM database, indexed by object ID.

Advantages:
  • Reliable
  • Has lengthy history of use
  • Put/Get functionality exists
  • Outstanding free solutions available

Disadvantages:
  • Difficult and slow to query on anything other than the name/index
  • Limited data representation architecture options
  • Generally not used for huge databases
SQL-Relational Database

SQL-Relational databases are used for a large variety of data storage needs. Examples include Oracle, Sybase, mSQL, and MySQL. As of TinyMUSH 3.0, a TinyMUSH administrator can eschew the traditional DBM database and use a SQL-Relational database for his MUSH instead.

Advantages:
  • Reliable
  • Full insert, update, and query functionality
  • Extensive tools available
  • Excellent data representation architecture flexibility

Disadvantages:
  • Open-source solutions still need work
  • Setup and configuration can be daunting
  • Some product/system combinations require special partitions
  • Higher system requirements
Other

There are many other data storage methodologies available to those who are willing to search. These include fairly exotic object-oriented databases, and other strange beasts. Many of these are merely of academic interest. However, PostgreSQL, which is an "Object-Relational" database, is one of the better free SQL databases available, and may be well worth some attention.

Discussion

While filesystem datastores and DBM were popular in the MUDs of yore, most of those servers were designed before SQL-Relational databases became available to people in an inexpensive way. Today, there are numerous low-end and high-end solutions available to the public, without robbing Fort Knox. There will certainly be room, in the future, for people to write alternative implementations of our database routines for non-SQL-Relational solutions, but STAGE will get the most flexibility for the least work if it targets SQL-Relational databases from the get-go. This provides flexibility to experiment with different entity representation models, and enables us to target numerous potential database products (including PostgreSQL) with only one implementation, by making good use of ODBC.

ODBC stands for "Open Database Connectivity." It is a basic standard for interacting with numerous database products through one consistent interface. There are a couple of different LGPL ODBC projects for Linux and other Unix flavours; the most noteworthy of these are iODBC and unixODBC. We can save even more time by using the LGPLed ODBC++ library, which provides handy JDBC-like C++ wrappers for ODBC functionality. ODBC is not the fastest solution for interfacing with a database; there is some amount of overhead involved, but for many purposes it is a reasonable tradeoff. If database efficiency ends up being a sticking point later on, it would make sense to write native database interfaces for the more popular products. For the interim, however, ODBC should suffice.
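
As a sketch of what this might look like in practice (the DSN, credentials, and table are made up for illustration, and error handling is minimal), querying a database through ODBC++'s JDBC-like interface goes roughly like so:

    #include <iostream>
    #include <odbc++/drivermanager.h>
    #include <odbc++/connection.h>
    #include <odbc++/statement.h>
    #include <odbc++/resultset.h>

    int main() {
        try {
            // "stage" is a made-up data source name (DSN) that the
            // administrator would configure in the ODBC driver manager.
            odbc::Connection* con =
                odbc::DriverManager::getConnection("stage", "user", "password");
            odbc::Statement* stmt = con->createStatement();

            // The entities table here is hypothetical.
            odbc::ResultSet* rs =
                stmt->executeQuery("SELECT id, name FROM entities");
            while (rs->next())
                std::cout << rs->getInt(1) << ": " << rs->getString(2) << "\n";

            delete rs;
            delete stmt;
            delete con;
        } catch (odbc::SQLException& e) {
            std::cerr << "Database error: " << e.getMessage() << std::endl;
        }
        return 0;
    }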


Entity Representation

Indexed XML

One model for storing our data is to represent all entities as XML blobs, indexed by some sort of identifier (name or number). This could later be refined to storing blobs of binary data instead, dramatically reducing the space taken up by the data.
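
To make this concrete, here is a rough sketch of the put/get pattern using GDBM (the entity ID and XML payload are invented for illustration):

    #include <gdbm.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        GDBM_FILE db = gdbm_open("entities.db", 0, GDBM_WRCREAT, 0644, 0);
        if (!db)
            return 1;

        // A made-up entity, serialized to XML and keyed by its identifier.
        const char* id  = "entity:1042";
        const char* xml = "<entity id='1042'><name>rusty sword</name></entity>";

        datum key   = { (char*)id,  (int)std::strlen(id)  };
        datum value = { (char*)xml, (int)std::strlen(xml) };
        gdbm_store(db, key, value, GDBM_REPLACE);    // "put"

        datum fetched = gdbm_fetch(db, key);         // "get"
        if (fetched.dptr) {
            std::printf("%.*s\n", fetched.dsize, fetched.dptr);
            std::free(fetched.dptr);    // gdbm_fetch returns malloc'd data
        }

        gdbm_close(db);
        return 0;
    }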

Advantages:
  • Extensible on the fly
  • Reflects XML usage elsewhere in WF
  • Can be implemented in a wide variety of databases, including DBM-style

Disadvantages:
  • Web-based tools for entity maintenance would require an XML library
  • Not easily searchable for things other than the index
  • Not efficient for anything other than gets and puts
  • Involves XML encoding overhead
  • Extremely difficult to manually modify parameter values in the database during debugging
Relational Model

Alternatively, we can store our data in a Relational database, in a format which makes sense based on the entity hierarchy already used in STAGE. In fact, that architecture can be mirrored perfectly, if deemed necessary.

Advantages:
  • Minimal overhead
  • Can map directly to entity hierarchy architecture
  • Easy searching and reporting
  • Easy to manually modify parameter values in database during debugging
  • Easy to develop web-based tools for entity maintenance

Disadvantages:
  • Requires a SQL-Relational database
  • Hard to modify in response to server patches and re-architectures
Discussion

The indexed-blob method of storage provides the greatest flexibility in database choice and entity structure. However, the relational model would impose the least space and time overhead. Some of the structural flexibility lost in choosing the relational model is regained if we mirror the architecture already used for entities in memory, which stores extended parameters in parameter tables rather than representing them as table columns.

Searchability may not be of high importance to an administrator who does not wish to do much analysis of the data in the world; however, searchability is very important if entities need to be dynamically loaded according to criteria stored in their properties. This drawback can be reduced by storing lists of identifiers for objects which should be loaded together.
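
As a rough illustration (the table and column names are invented here, not Shepherd's actual schema), such a layout pairs a narrow entity table with a generic parameter table, and property-based loading criteria become simple queries:

    // Hypothetical DDL: fixed, common fields live on the entity table,
    // while extended parameters go into a generic name/value table
    // rather than being added as columns.
    const char* kCreateEntities =
        "CREATE TABLE entities ("
        "  id        INTEGER PRIMARY KEY,"
        "  parent_id INTEGER,"            // place in the entity hierarchy
        "  name      VARCHAR(64))";

    const char* kCreateParameters =
        "CREATE TABLE parameters ("
        "  entity_id INTEGER,"
        "  name      VARCHAR(64),"
        "  value     VARCHAR(255))";

    // Searchability: find the entities to load by a property value.
    // The 'zone' parameter is an invented example of a loading criterion.
    const char* kFindByZone =
        "SELECT entity_id FROM parameters "
        "WHERE name = 'zone' AND value = 'harbour'";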

Searchability can be improved in the blob approach by supplementing it with a parameter table, like the one used by the relational model (and by Shepherd). However, this takes up extra space and requires careful synchronization of data between the blobs and the parameters. It also raises the question of what is gained by storing that data in the blob to begin with. Certainly, once the server knows which entity it wishes to load, it can find those values faster than by looking them up in the parameter table; however, some of that time gained will be lost parsing the XML, and still more may already have been lost finding out which entities needed to be loaded.

Ultimately, neither of these approaches should be ruled out for future development. However, the relational model is not only efficient, but exposes the data in such a way that it is easy to make manual changes to entities during debugging. Editing XML or binary blobs in a database, without tools developed for that purpose, can result in corrupt, unparsable data. For these reasons, the relational model is a good solution for the near future.


Hiding Database Latency

All databases are bottlenecks. The delays created by updating the database every time any parameter on any entity in the world changed would be absolutely crippling to the server. Because of this, updates must happen in batches. Entities which have changed should be marked for update, and updates should happen at regular intervals, which should probably be configurable by the administrator.

The time it would take to do a synchronous update for a batch of entities could become prohibitively long. The server certainly can't afford to take time off from its other duties to wait for a database update to take place. One solution to this problem is to make all database updates entirely asynchronous. The database save functionality could run on a separate thread, or in another process which shares memory with Shepherd. Off-loading this operation minimizes the impact of a slow database or database connection on the regular functioning of the server.
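
A minimal sketch of the combination (all names are invented; a real version would need shutdown-flush logic, error handling, and the actual database write):

    #include <atomic>
    #include <chrono>
    #include <mutex>
    #include <set>
    #include <thread>

    // Hypothetical sketch: changed entities are marked for update, and a
    // background thread flushes the batch at a configurable interval, so
    // the main server loop never waits on the database.
    class SaveQueue {
    public:
        explicit SaveQueue(int intervalSeconds)
            : interval_(intervalSeconds), running_(true),
              worker_(&SaveQueue::run, this) {}

        ~SaveQueue() {
            running_ = false;
            worker_.join();
        }

        // Called whenever any parameter on an entity changes.
        void markDirty(long entityId) {
            std::lock_guard<std::mutex> lock(mutex_);
            dirty_.insert(entityId);
        }

    private:
        void run() {
            while (running_) {
                std::this_thread::sleep_for(std::chrono::seconds(interval_));
                std::set<long> batch;
                {
                    // Take the batch under the lock; entities that change
                    // during the save are simply marked again.
                    std::lock_guard<std::mutex> lock(mutex_);
                    batch.swap(dirty_);
                }
                for (long id : batch) {
                    (void)id;    // saveEntityToDatabase(id) -- backend write
                }
            }
        }

        int interval_;
        std::atomic<bool> running_;
        std::mutex mutex_;
        std::set<long> dirty_;
        std::thread worker_;
    };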

Because the latency is hidden, it is possible to hide further functionality in the save code, if one so chooses. This could include thresholding data to flag cheating, or possibly even gathering usage statistics.

Collision Prevention

Asynchronous database saves present a potential problem with read-write resource collisions. In short, it is possible for an item which is marked for update to change after it is read from memory by the storage module, but before the update mark is cleared, resulting in updated information that never makes it into the database. One method for avoiding this problem would be to perform some sort of resource locking on whatever is currently being saved. This, however, not only creates a non-trivial amount of overhead for every entity change, but is also a little like playing latency Russian roulette, and could create intermittent server lag. Creating an intermediate flat file at regular intervals would not eliminate the need for locking, but might reduce the slowdown which takes place when the server wishes to update a locked piece of data.
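
A tiny sketch of the locking idea (a single name field stands in for the full entity state): the same lock guards both the game-side write and the storage module's read, so an entity cannot change mid-save, at the cost of letting a save momentarily block a game-side change:

    #include <mutex>
    #include <string>

    // Hypothetical sketch: one mutex per entity. The storage module takes
    // a consistent snapshot under the lock, then writes that snapshot to
    // the database without holding the lock any longer than necessary.
    class LockableEntity {
    public:
        // Game-logic path: any change takes the entity's lock.
        void setName(const std::string& name) {
            std::lock_guard<std::mutex> lock(mutex_);
            name_ = name;
        }

        // Storage path: copy the state out under the same lock, so the
        // entity cannot change while it is being read for a save.
        std::string snapshotForSave() {
            std::lock_guard<std::mutex> lock(mutex_);
            return name_;
        }

    private:
        std::mutex mutex_;
        std::string name_;    // stands in for the entity's full state
    };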

This entire collision issue is avoided if one does away with the update-marking idea and dumps all currently loaded entities with every save. Whether this is a desirable tradeoff is open to debate, and may not resolve itself until some practical performance testing has taken place. In the meantime, per the documented STAGE requirements, the designer should choose the best-understood approach to pursue in the near future.


Loading Data

One still needs to address the problem of loading data from the database. In an ideal world, STAGE would load all entities associated with the game world at startup time. However, the larger a world gets (or the smaller the memory on the server platform is), the less feasible this becomes. In short, it is not a highly scalable solution.

This leads to a discussion of dynamic entity loading. When should an entity be loaded? Which entities should be loaded? How many entities should be loaded? When should entities be un-loaded? Many of the answers are best determined in practice, and will often vary from one game to the next, based on platform specifications. Thus, it should be a design goal to make the loading and un-loading rules and thresholds highly configurable.

A simple model for loading is to break the world into buckets (which may correspond with top-level containers, depending on our final world representation decisions). If someone enters a bucket, we want to make sure that the entities in that bucket are all loaded. However, it is less than optimal for the player to have to wait for the entities to load. The system should define thresholds near the borders of the buckets which trigger loading in advance (and could also be used for un-loading, if the bucket no longer has players in it).

This model for predictive loading is not perfect. It is possible for a player to turn around immediately after crossing a loading threshold, making the entire load unnecessary. Also, the system will need to delay un-loads for a few minutes, so a player doesn't overburden the system by running back and forth over a threshold line, repeatedly forcing loads and un-loads to occur. Further, the designers will need to address what to do in the event that an entity load fails to happen in time. Ultimately, loading decisions are probably best postponed until we understand more about our geography and map format.

Note that the server should never, ever load over data which is in memory, unless that data is corrupt. The data in memory is more recent, and thus authoritative, and should not be overwritten.
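
Pulling these rules together, here is a rough sketch (thresholds, identifiers, and the loading calls are all invented; real values would come from the administrator's configuration):

    #include <map>

    // Hypothetical sketch of threshold-based bucket loading.
    class BucketManager {
    public:
        // Called as a player approaches a bucket: load its entities in
        // advance, once the player is within the configured threshold
        // of the border.
        void playerNear(int bucketId, double distanceToBorder) {
            if (distanceToBorder < loadThreshold_ && !isLoaded(bucketId))
                loadBucket(bucketId);
        }

        // Called when a bucket has had no players for a while. The delay
        // keeps a player from forcing repeated loads and un-loads by
        // running back and forth over a threshold line.
        void bucketIdle(int bucketId, double idleSeconds) {
            if (isLoaded(bucketId) && idleSeconds > unloadDelaySeconds_)
                unloadBucket(bucketId);
        }

    private:
        bool isLoaded(int bucketId) const {
            auto it = loaded_.find(bucketId);
            return it != loaded_.end() && it->second;
        }

        void loadBucket(int bucketId) {
            // For each entity in the bucket: skip anything already in
            // memory -- the in-memory copy is more recent, and therefore
            // authoritative, and must never be loaded over.
            loaded_[bucketId] = true;
        }

        void unloadBucket(int bucketId) {
            loaded_[bucketId] = false;
        }

        double loadThreshold_      = 50.0;     // configurable by the admin
        double unloadDelaySeconds_ = 180.0;    // configurable by the admin
        std::map<int, bool> loaded_;
    };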


Summary

It is important for STAGE to support numerous potential database solutions, to best serve the needs of both the developers and future users. This will require a design which includes a genericized interface for database operations. For now, many SQL-Relational database products can be supported by using an ODBC interface. Modelling our data storage architecture after the tables which have been defined for Shepherd is an efficient solution which serves our needs well. We can reduce the effects of database latency on the server by running the routines that back up the in-memory data asynchronously. Until certain problems are addressed, we cannot save only the entities (or parameters) which have changed, but must save all of them. Likewise, we cannot load and unload entities or groups of entities on the fly until there is a clearer understanding of how geography is represented.