IDIS Database Design and Implementation
This document presents how the Integrated Database Information System (IDIS)
database was designed and how it is being implemented. The first part
introduces the design process, its principles and best practices. The second part
presents how these guidelines were applied in designing and implementing the
? Database design principles
1. Introduction This section presents the terms, concepts, approach, standards and best practices associated with the design of a database. It is important to bear these principles in mind when reading the next section which presents the design of the IDIS database. The database design phase is a very important step for all IT projects developing systems that rely on a database to adequately store, query, import & export data and support reporting. For such systems the operation of the database is critical hence its design and implementation must be long lasting, flawless and perfectly tailored to meet the requirements of the system.
2. Definition of terms and concepts used
A database is a collection of data that is organized in a systematic way so
that its contents can easily be accessed, managed and updated. The most
prevalent type of database is the relational database, a tabular database in
which data is defined so that it can be reorganized and accessed in a number
of different ways. A distributed database is one that can be dispersed or
replicated among different points in a network. The software used to manage
and query a database is known as a database management system (DBMS).
2. Database Management System
A Database Management System is a software environment that structures
and manipulates data, and ensures data security, recovery, and integrity. The
Data Platform relies on a database management system (RDBMS) to store
and maintain all of its data as well as execute all the associated queries.
There are two types of RDBMS : the first group consists of single software
packages which support only a single database, with a single user access
and are not scalable (i.e. cannot handle large amounts of data). Typical
examples of this first group are MS Access and FileMaker.
The second group is formed by DBMS composed of one or more programs
and their associated services which support one or many databases for one
or many users in a scalable fashion. For example an enterprise database
server can support the HR database, the accounting database and the stocks database all at the same time. Typical examples of this second group include MySQL, MS SQL Server, Oracle and DB2. The DBMS selected for the Data
Platform is MS SQL Server from the second group.
3. Table A table is set of data elements that has a horizontal dimension (rows) and a vertical dimension (columns) in a relational database system. A table has a specified number of columns but can have any number of rows. Rows stored
in a table are structurally equivalent to records from flat files. Columns are often referred as attributes or fields. In a database managed by a DBMS the format of each attribute is a fixed datatype. For example the attribute date can only contain information in the date time format.
An identifier is an attribute that is used either as a primary key or as a foreign key. The integer datatype is used for identifiers. In cases where the number of records exceed the allowed values by the integer datatype then a biginteger datatype is used.
2. Primary key
A column in a table whose values uniquely identify the rows in the table. A primary key value cannot be NULL.
3. Foreign key
A column in a table that does not uniquely identify rows in that table, but is used as a link to matching columns in other tables.
4. Relationship A relationship is an association between two tables. For example the
relationship between the table "hotel" and "customer" maps the customers to the hotels they have used.
5. Index An index is a data structure which enables a query to run at a sublinear-time. Instead of having to go through all records one by one to identify those which match its criteria the query uses the index to filter out those which don't and focus on those who do.
6. View A view is a virtual or logical table composed of the result set of a pre-compiled query. Unlike ordinary tables in a relational database, a view is not part of the physical schema: it is a dynamic, virtual table computed or collated from data in the database. Changing the data in a view alters the data stored in the database
7. Query A query is a request to retrieve data from a database with the SQL SELECT instruction or to manipulate data stored in tables.
Structured Query Language (SQL), pronounced "sequel", is a language that provides an interface to relational database systems. It was developed by IBM in the 1970s for use in System R. SQL is a de facto standard, as well as an ISO and ANSI standard.
1. Introduction The most difficult part of the database design activity is to propose a design approach that efficiently accommodates storage and querying of data without generating high costs or compromising database integrity. Two database design approaches are used to for this purpose. The first approach is called "Entity-Relationship" and the second approach is called "dimensional model". Both these approaches as well as their respective advantages and
shortcomings are presented in the next sections. A separate section introduces the metamodeling approach and the last section discusses all these approaches.
2. ER The Entity-Relationship (ER) approach is based on a the ER model originally proposed by Peter in 1976 as a way to unify the network and relational database views. Simply stated the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram which is used to visually represents data objects. Since Chen wrote his paper the model has been extended and today it is commonly used for database design For the database specialists, the utility of the ER model is:
? It maps well to the relational model. The constructs used in the ER model
can easily be transformed into relational tables. ? It is simple and easy to understand with a minimum of training. Therefore,
the model can be used by the database team to communicate the design
to the end user.
? In addition, the model can be used as a design plan by the database
developer to implement a data model in a specific database management
ER is a logical design technique that seeks to remove the redundancy in data. The goal of this discipline is to illuminate the microscopic relationships among data elements. The highest art form of ER modeling is to remove all redundancy in the data. This is immensely beneficial to the processing of
transactions because transactions are made very simple and deterministic i.e.
they can be determined easily. Transaction oriented databases are typically
associated with business model dealing with banking, accounting, stock
management, Customer Relationship Management (CRM), etc. For example,
the transaction of updating a customer's address may devolve to a single
record lookup in a customer address master table. This lookup is controlled
by a customer address key, which defines uniqueness of the customer
address record and allows an indexed lookup that is extremely fast. It is safe to say that the success of transaction processing in relational databases is mostly due to the discipline of ER modeling.
Due to the very high and minute decomposition of data into separate tables
the ER approach generates database models which
? End Users cannot understand, remember and navigate.
? cannot be queried by database query software. Software that attempt to
do this are notorious for making the wrong choices, with disastrous
consequences for performance.
? are quite complex to query by humans as usually a large amount of often
complex tables need to be joined to retrieve only a few fields. ? defeats the high-performance retrieval of data.
1. Dimensional Model The dimensional model (DM) approach is a logical design technique that
seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a
discipline that uses the relational database model with some important
restrictions. The DM predates the ER modeling approach as it was already
used in the late 60s. It is probably accurate to say that the DM approach was not invented by any single person. It is an irresistible force in the design of databases that will always appear when the design efforts put
understandability and performance as the highest goals. Due to the
shortcomings of the ER approach a large share of the database specialists
have switched from ER to the more "simpler" designs whose designs all look
similar. In a natural, almost unconscious way, they returned to the roots of the original relational model because they knew a database cannot be effectively queried by end users unless it is packaged simply. The DM approach has a
number of important advantages that the ER approach lacks.
? First the models resulting from the DM approach are predictable e.g. it is
simple to predict how they evolve.
? User interfaces, query tools and report generators can all make strong
assumptions about the model to make the processing more efficient. ? Rather than using a cost based optimizer, a database engine can make
very strong assumptions about constrains hence highly improving
In the DM approach every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic "star-like" structure is often called a star join. The term star join dates back to the earliest days of relational databases. A fact table, because it has a multipart primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or "facts," that occur for the combination of keys that define each record. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up. Dimension tables, by contrast, most often contain descriptive textual information. Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.
1. Metamodeling Metamodeling is a variant of the ER modeling approach that focuses on elegantly solving complex modeling situations that do not seem to comply with the ER modeling approach. Instead of focusing on real world entities a metamodel aims at modeling the language which supports the situation. A typical example of a situation where metamodeling is used is when a database model is asked to establish the translation between multiple standards that have similarities but present significant differences in vocabulary, sections, definitions, etc. According to the ER modeling approach
each standard first needs to be modeled so that it can be operated individually. Once each standard is modeled then the ER approach suggests that mappings between all models are established. When such exercise is carried for more than 4 or 5 models the task becomes extremely complex with hundreds of mappings to be established. Very rapidly it becomes obvious that there is a fundamental flaw in the modeling approach that can only be balanced with huge efforts to maintain the model. Such a situation is an excellent indicator of the need for a metamodeling approach. The above modeling problem can easily and elegantly be solved by a metamodel that focuses not on operating each individual standard and their mappings but instead which focuses on the vocabulary mappings between each model. 2. Discussion For experienced database designers, the design of a new database always starts by considering which of the three approaches, e.g. ER, dimensional
and metamodeling, best applies to meeting its requirements. If the database is targeted at supporting a transaction system then the ER approach is used as it is the best fit approach. Transaction systems are used by banks, retail,
real estate, etc to manage their businesses along their business models. If
the database is targeted at establishing complex linkages between large sets of concepts. Typical examples of metamodeling applications include standards translation, currency translation and language translation. If the database is targeted at supporting end-user queries then the Dimensional Model approach is used. ER modeling defeats end-user delivery and should not be used for this purpose.
It is important to understand that ER based transaction databases usually serve very well the purpose of a specific application or domain but usually fall short of supporting multiple application or domains at the same time. For example an organization uses a database to support its hydrology data and another one to support its environmental data. In this example both databases operate well individually but cannot be directly linked or mapped together as their common founding concepts are implemented in different ways. For example, both database contain a table corresponding to the wetland feature but the definition of this feature varies and its attribute data is different. In this case, both databases can hardly be merged into a single database that would serve both purposes. This example also applies to the more conventional accounting and HR business domains.
2. Normalization 1. Introduction When designing a database the first and most important principle is to efficiently organize data in a database according to a set of principles called the normal forms. This process called ‘normalization’ ensures that the database is consistent (reduces chances of errors and conflicts) and complies with best practices which aim at eliminating data redundancy (for example, storing the same data in more than one table) and ensure data dependencies make sense (only storing related data in a table). The resulting ‘normalized’ database is hence logically stored and uses the minimum amount of storage space. The normal form design principles are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications only the first 3 normal forms are implemented (1NF, 2NF and 3NF) and occasional the fourth normal form (4NF) is used. Fifth normal form is very rarely used. It's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have and account for possible inconsistencies.
2. The first normal form
The first normal form (1NF) sets the very basic rules for an organized database:
? Eliminate duplicative columns from the same table.
? Create separate tables for each group of related data and identify
each row with a unique column or set of columns (the primary key).
An example can be found here.
1. The second normal form The second normal form (2NF) further addresses the concept of removing
• Meet all the requirements of the first normal form.
• Remove subsets of data that apply to multiple rows of a table and place
them in separate tables.
• Create relationships between these new tables and their predecessors
through the use of foreign keys.
• An example can be found here.
2. The third normal form The third normal form (3NF) goes one large step further:
• Meet all the requirements of the second normal form.
• Remove columns that are not dependent upon the primary key.
• An example can be found here.
3. The fourth normal form The fourth normal form (4NF) has one additional requirement:
• Meet all the requirements of the third normal form.
• A relation is in 4NF if it has no multi-valued dependencies.
• An example can be found here.
4. The fifth normal form The fifth normal form (5NF) addresses situations where a relation cannot be
decomposed into two separate relations :
• Meet all the requirements of the third normal form.
• A relation is in 5NF if it provides a lossless decomposition of join-
dependencies that have no multi-valued dependencies.
• An example can be found here.
2. Database Model 1. Introduction The preparation of the database model represents the main activity of the database design activity. The database model not only acts as the repository of the design but also as a very important medium to communicate with stakeholders and all technical specialists involved with the implementation of the database as well as its integration within the target application. The design of a database is usually carried out by database architects and/or database administrators. A Computer Assisted Software Engineering (CASE) tool is used in almost all cases to support the design process by automating all repetitive tasks and store all documentation in a dictionary. Once system requirements are analyzed and understood the database design work goes into a 3 phase cycle which is described in the following sections. In theory the database design will go through a single cycle of those phases. In practice it is very common to iterate through a couple of cycles that will ensure the outputs of all 3 phases are all in line and adequately linked. The following sections present each of these phases.
2. Conceptual Model
The first phase produces a conceptual data model which presents a conceptual plan of how the tables, attributes and relationships will be implemented. This model is very often used as a medium to communicate with stakeholders and illustrate how data will be stored. This conceptual model focuses on establishing the tables that will store the required information (called ‘attribute’) and the relationships between the tables. The
conceptual model is always prepared according to a standard formalism which clearly indicates how tables and their relations have to be presented and documented. The Unified Modeling Language (UML) class diagram is the de facto formalism being used by IT database professionals to deliver the
conceptual model of the target database. In the past two other formalisms have been used. The first one is the Entity-Relationship (E/R) formalism which was very popular in North America. The second one is MERISE which was very popular in France. Once the conceptual data model tables and relationships have been defined the documentation of each one of them is usually carried by filling the required standardized documentation in the conceptual model dictionary. This dictionary is usually completed directly through the CASE tool directly in the background of the conceptual model. Various standards and best practices are used to document the conceptual model dictionary. Several ISO standards can be used to document the conceptual model into its dictionary depending on the database context. In many cases only mandatory fields of the dictionary are being filled with information as they contain information that will be used by the CASE tool to transform the conceptual data model into a logical model. Although the conceptual data model does not explicitly require to be normalized it is a best practice to ensure it does comply with at least the first (1NF) and second
(2NF) normal forms. In practice, experienced IT professionals often already address in the conceptual model important issues which they know will arise in the following logical and physical data models to ensure the smallest number of iterations between the phases. It is important to note that the conceptual data model is a graphical representation of the information requirements of a system, it is not a database.
3. Logical Model
The second phase produces a logical database model that translates the conceptual model into a fully descriptive logical data model. As a result whereas some join tables did not appear in the conceptual model the logical model will include all tables. This model is very often used as a medium to communicate with technical IT staff involved into the programming and implementation of the database driven target system. The logical data model represents the realistic and normalized design of the database that will be implemented regardless of the technology that will be selected to implement it. The logical data model must be normalized and comply with the first 3 normal forms (1NF, 2NF and 3 NF). In some rare cases it must also comply with the fourth and fifth normal forms (4NF & 5NF). CASE tools can usually handle and automate most of the simple work required by the conversion of a conceptual data model into a logical data model. Following normal forms and best practices they usually offer solutions to most of the common issues associated with this conversion. Once these solutions are validated and accepted the more complex cases need to be addressed independently to ensure that the logical model complies with normal forms, is not technology dependant and is in line with the selected approach for the database design. It is important to note that the conceptual data model is a graphical representation of the information requirements of a system, it is not a database.
4. Physical model
The third and last phase produces the physical data model which is in fact the database design that will be implemented. In practice this materializes as an implemented schema which is an empty set of tables linked be relationships that are ready to receive data. This design incorporates all required best practices to adequately implement the database logical model into an operational database schema that can be handled and operated by a database management system (DBMS). As such the physical data model is technology dependant. For example, if an attribute field was called ‘social security number’ in the conceptual and logical model it will be called ‘sosecnum’ in the physical data model as the DBMS does not support long names with space characters. The physical data model is not a graphical representation of the information requirements of a system, it is the empty database ready to receive data.
3. Database Implementation
2. Hardware tuning
4. Best practices
? Design of the IDIS database
1. Approach In accordance with best practices this project has selected the ‘dimensional model’ approach to meet those requirements. This approach is the de facto
solution implemented for databases with similar requirements for over 25 years across industry and academia. It follows a very strict and very clear generic logic that does not depend on the domain being modeled. In short, data is stored in a data table called the ‘fact table’ and the attributes associated with
this data are stored in attribute tables called ‘dimension tables’. A ‘dimension table’ can be linked to more than one fact table. This approach can be equally implemented in a single database or in many databases depending on system specific requirements.
In the data platform instead of storing all data from all basins into a single database this approach is implemented as following: a database is created for each basin and the basin’s data is stored into this database according to the
dimensional model approach. Figure 1 presents Basin tables and Index tables.
Two important motives guide this implementation approach. First, this approach allows for basin specific databases sizes to remain manageable for the backup procedures. The larger the database the more complex the backup procedure. Second, having basin specific databases greatly facilitates the replication procedures which will be required when the data platform moves into a decentralized mode where local mirrors are maintained at the basin locations. If all data tables for all basins are all cluttered into a single database the replication procedures are far more complex to establish and operate.