logo

lunes, 7 de agosto de 2017

Data and Data Models



Data and Data Models

2.1 INTRODUCTION

In this chapter, we introduce more concepts that are essential to our presentation
of the design of a database. We defined a database as a collection
of related facts (related data). In this chapter, we explore database models
and introduce the relational database model. While there are many kinds
of database models, most of the databases in use today are relational; our
focus in this book is to design a good relational database. In the next chapter,
we introduce the concept of functional dependencies to define what is a
good (and a not-so-good) relational database. The purpose of this chapter is
mostly historical; it engenders an appreciation for the simplicity and power
of the relational model.

2.2 FILES, RECORDS, AND DATA ITEMS

Data has to be stored in some fashion in a file to be useful. Suppose there
were no computers—think back to a time when files were paper documents.
A business kept track of its customers and products. A doctor’s
office kept track of patients. A sports team kept statistics on its players. In
all of these cases, data was recorded on paper. The files with data in them
could be referred to as a “database.” A database is most simply a repository
of data about some specific entity. A customer file might be as plain and
minimal as a list of people that did business with a merchant. There are
two aspects to filing: storage and retrieval. Some method of storing data
that will facilitate retrieval is most desirable.
Suppose we have a file of customer records. The whole file might be
called the customer file, whereas the individual customer’s information

12 • Database Design Using Entity-Relationship Diagrams

is kept in a customer record. Files consist of records. More than likely,
more information than a list of just customer’s names would be stored.
Most likely, at the very least a customer’s name, address, and phone number
would constitute a customer record. Each of these components of the
record is called a data item or field. The customer file contains customer
records that consist of fields of data.
Here is an example of some data (you can imagine each line as a 3 × 5
card, with the three cards making up a file):

CUSTOMER File

Record 1: Jones, J 123 4th St Mobile, AL
Record 2: Smith, S 452 Main St Pensacola, FL
Record 3: Harris, H 92 Adams Lane Elberta, AL

This file contains three records. There is one record for each customer.
The records each consist of four fields: record number, name, address,
city. As more customers are added, they also are assigned a 3 × 5 card,
and these records are placed in the file. Several interesting questions and
observations arise:

1. The merchant who keeps this file may well want to add information
in the future. Suppose that the merchant wanted to add a telephone
number. Would the merchant add a number to all 3 × 5 cards, or
would the adding be done “as necessary”? If it were done as necessary,
then some customers would have telephone numbers, and
some would not. If a customer had no phone number, then the phone
number for that customer would be “null” (unknown).

2. How will the file be organized? Imagine not three customers, but 300 or
3,000. Would the 3 × 5 cards be put in alphabetical order? Perhaps, but
what happens if you get another J. Jones or S. Smith? The field on which
the file is organized is called a key. Perhaps the file should be organized
by telephone number, which might not have a duplicate value?

3. Suppose the file were organized by telephone number. What if the
telephone number for some customer was not recorded? If there were
no telephone number, the common terminology is that we say the
telephone field for that record is null. It would make no sense to have
the file organized by telephone number if some values were null.
Clearly, the key of a file cannot be null. Also, if telephone number
were the key, then the person finding a record in the file would have
to know the phone number to find the appropriate record efficiently.

Data and Data Models • 13

4. The format of the file given is

CUSTOMER(record_number, name, address, city)
The format of the file dictates the order of the fields in any record. In this
record, record_number is first, followed by a name, and so on. The file
design could have the fields in some other order, but once defined, the
order of the fields stays constant.
If a telephone number field were added, then the file format could be
CUSTOMER (record_number, name, address, city, telephone)
This shorthand format notation is referred to as the file design. If the file
were set up to find data by name and name were the key, then the name
would be underlined, as follows:
CUSTOMER (record_number, name, address, city, telephone)
5. You might ask, “Why not use the record number to organize the
file?” On one hand, it is unique (which is desirable for a key), but on
the other hand, you would have to know the record number to find
a customer. The example is organized by record number; however,
imagine 300 or more customers. You want to find Smith’s address—
you would have to know the record number. It makes more sense to
organize this file of 3 × 5 cards by name. Taking some of these points
into consideration, here is an enhanced version of the customer file

in which each line represents a 3 × 5 card:

CUSTOMER File

Record 1: Adams, A 77 A St Pensacola
FL
555-5847
Record 2: Charles,
X
365
Broad St
Mobile AL 555-8214
Record 3: Harris,
H
92 Adams
Lane
Elberta,
AL
555-1234
Record 4: Jones, A 22 Pine
Forest
Pensacola
FL
null
Record 5: Jones, J 123 4th
St
Mobile, AL 555-9978
Record 6: Morris, M 932
Dracena
Way
Gulf
Breeze FL
555-1111
Record 7: Smith, S 452 Main
St
Pensacola,
FL
555-0003

14 • Database Design Using Entity-Relationship Diagrams

CHECKPOINT 2.1

1. What does it mean to say a field has unique values?
2. Why is it desirable to have a key be unique?
3. Why does a file have to be organized by a key field?
4. What does null mean?
5. Consider this customer file:

Record 1: 77 A St Adams, A Pensacola
FL
555-5847
Record 2: Charles,
X
365 Broad
St
555-8214 Mobile AL
Record 3: 555-1234 Harris, H 92 Adams
Lane
Elberta,
AL
What is wrong here?

2.3 MOVING FROM 3 × 5 CARDS TO COMPUTERS

Let us return to our example of a merchant who kept his customer file on
3 × 5 cards. As time passed, the customer base grew and the merchant
probably desired to keep more information about customers. From a
data-processing standpoint, we would say that the techniques for storing
and retrieval led to enhanced forms or cards, more fields, and perhaps better
ways to store and find individual records. Were customer records kept
in alphabetical order? Were they stored by telephone number or record
number (which might be a customer number)? What happened if a new
field was required that was not on existing forms or cards? Such was a
data-processing dilemma of the past.
When computers began to be used for businesses, data was stored on
magnetic media. The magnetic media were mostly disks and tapes. The
way data was stored and retrieved on a computer started out like the 3
× 5 cards, but the data was virtual; it did not physically exist where you
could touch it or see it without some kind of software to load and find
records and some kind of display device to see what the “3 × 5 card”
had on it. The most common way data was fed into a computer was via
punched cards. Punched card systems for handling data were in use as
early as the 1930s; sorters were capable of scanning and arranging a pile
of cards. Using punched cards to input data into computers did not blossom
until the 1960s. The output or “display device” was typically a line
printer.

Data and Data Models • 15

As data was placed on a computer, and software could be developed to
handle the data, filing techniques evolved. In the very early days of databases,
the files kept on computers basically replicated the 3 × 5 cards. There
were many problems with computers and databases in the “early days.”
(Generally, early days in terms of computers and databases means roughly
early to mid-1960s.) Some problems involved input (how the data got into
the computer), output (how the data was to be displayed), and file maintenance
(for example, how the data was to be stored and kept up to date; how
records were to be added and deleted; how fields were to be added, deleted,
or changed). A person using a computer for keeping track of data could
buy a computer and hire programmers, operators, and data entry personnel.
In the early days, computers were expensive and large. Most small
businesses did not have the resources to obtain a computer, much less hire
people whose jobs were solely “on the computer.” Early attempts at filing
data and retrieving them were the purview of large businesses and large
organizations. One thing was clear: Many more virtual customer records
than the 300 or so we envisioned were possible.
In the early days, imagine that some company had a computer, and the
departments within the company wanted to keep files on the computer.
Suppose further that the company made some product and had several
departments (e.g., sales, accounting, production). Now, suppose that
each department wanted to keep data about customers. Each department
had a different view of customers. The sales department wanted
to know the name, address, telephone number, and some data related to
the propensity to buy the product. The accounting department wanted
to know roughly the same information, but wanted to keep track of billing
and payments. Production also wanted some of the same information,
but wanted to know what the customer wanted in the product and
how many they should make. Each department wanted about the same
thing, but each approached the problem in a different way. What actually
happened in the early days was that each department shared the
expensive computer but hired its own programming staff to keep “their
database.” While the sense of sharing the expensive computer was there,
the sense of sharing data was not. The idea of a “software package” to
store and retrieve data was not there either. Programmers used computer
languages like COBOL, RPG, ALGOL, PL/1, and FORTRAN to
store and retrieve data. Each department created its own records and its
own storage and retrieval methods, kept its own programs, and had its
own data entry groups.
16 • Database Design Using Entity-Relationship Diagrams

The earliest databases were filing systems maintained by some computer
language. For example, a programmer wrote a COBOL program to gather
input data on punched cards and to store the data in a computer file. Then,
the programmer wrote another set of programs to retrieve the data and
display it in whatever way a user wanted to see it. Early computer filing
systems were simple sequential files. The data on punched cards was read
and stored. Reconsider the customer file we introduced previously:


CUSTOMER File

Record 1: Adams, A 77 A St Pensacola
FL
555-5847
Record 2: Charles,
X
365
Broad St
Mobile AL 555-8214
Record 3: Harris, H 92 Adams
Lane
Elberta, AL 555-1234
Record 4: Jones, A 22 Pine
Forest
Pensacola
FL
null
Record 5: Jones, J 123 4th
St
Mobile, AL 555-9978
Record 6: Morris, M 932
Dracena
Way
Gulf Breeze
FL
555-1111
Record 7: Smith, S 452 Main
St
Pensacola,
FL
555-0003
If you could look at the data on a disk, it might look like this:
Adams, A 77 A St Pensacola
FL
555-5847
Charles, X 365 Broad
St
Mobile AL 555-8214
Harris, H 92 Adams
Lane
Elberta, AL 555-1234
Jones, A 22 Pine
Forest
Pensacola
FL
null
Jones, J 123 4th St Mobile, AL 555-9978
Morris, M 932 Dracena
Way
Gulf Breeze
FL
555-1111
Smith, S 452 Main St Pensacola,
FL
555-0003
Data and Data Models • 17
The records as addressed by COBOL had a structure like this:
01 CUSTOMER
05 NAME CHARACTER(20)
05 ADDRESS CHARACTER(20)
05 CITY-STATE CHARACTER(25)
05 PHONE CHARACTER(7)
The file was referred to as a “sequential file.” If a person wanted to see a
listing of data by address rather than name, the file had to be sorted and
the data redisplayed. If data was added to the file, it had to be put in its
proper place according to the sequential key, which in this example is the
name field.
Two other principal filing systems evolved in the 1960s: indexed and the
direct access filing systems. Indexed systems were based on the maintenance
of indexes for records rather than a strict sequential ordering of the records
themselves. Look again at the customer example in a different light:
CUSTOMER File
Record 1: Adams, A 77 A St. Pensacola
FL
555-5847
Record 2: Charles, X 365 Broad
St
Mobile AL 555-8214
...
In this case, suppose that Record 1 were not a numerical record but
rather a disk address. Then, suppose there was an index to the records
themselves that consisted of names and disk addresses like this:
Name Disk address
Adams, A 1
Charles, X 2
Harris, H 3
and so forth.

Now, suppose you added a customer called Baker, B. In a sequential filing
system, you would have to put Baker between Adams and Charles,
but in an indexed system, you could put Baker anywhere on the disk and
simply change the index to find Baker:




18 • Database Design Using Entity-Relationship Diagrams


Name Disk address
Adams, A 1
Baker, B 8
Charles, X 2
Harris, H 3

Disks were arranged by their geometry. A disk usually consisted of
an arrangement of platters that looked like flat, brown phonograph
records with a spindle in the center. The disk rapidly rotated, and data
was arranged on the plate concentrically and written and read magnetically.
If all of this sounds like an old phonograph system, that is likely
where the idea of stacking disks originated. If you could look at the data
on a specific plate, it would be arranged in a circle, and there would be
many circles, getting larger as you moved out on the plate. Each circle
was called a track. Data on a disk was thought to be located in a relative
position on some track and imaginary overlaid tracks called cylinders.
Here is a simplified example: If the tracks were designated as 1, 2, 3,
… starting from the data circle nearest the spindle, and the cylinders
were designated 1, 2, 3, … starting from the top of the disk, then a data
address would be referenced as follows, for example: cylinder 4, track 3,
record 24. This would mean that the data was on the 4th plate from the
top, on the 3rd track from the spindle, and in the 24th position on that
track.

The indexed systems became quite elaborate. The one illustrated was
called a relative record indexed system. In the relative record system, one
needed to know only which cylinders a file occupied, and then the position
of the record could be calculated relative to the starting position of
the file. Originally, relative record systems were not popular, whereas there
was a common indexed system used primarily on IBM computers that was
based on disk geometry (cylinders, tracks, and records) called an indexedsequential
system.

Whatever the index system, one had to keep indexes to store and find
records. Imagine not 30 or 300 customers, but 300,000. Now, suppose
that someone wanted to find customers by their address rather than
their name. You would have to have another index, or you would have
to look at each record in the filing system until you found the address
you needed. One other less-common way to manage early data access and
retrieval was to base the index on an exact location of data on a disk: the
direct access filing system. The popular indexed-sequential system was

Data and Data Models • 19

probably the best amalgamation of the properties of all filing systems in
that it efficiently handled record additions, deletions, and updating as well
as providing reasonable data access speeds. Early filing systems involved
choices: elaborate indexes or elaborate programs to sort and find data (or
both). Regardless of the choice of index, there were always problems with
maintaining the index (or indexes); adding, updating, or deleting records;
finding and displaying records and reprogramming all of these items for
new applications or hardware.

In the late 1960s, software packages called database systems began to
emerge. Database systems were purchasable programs that stored and
retrieved data as well as performed maintenance (adding, deleting, and
modifying fields and records). With a database system, one did not have
to write COBOL programs to handle data directly but rather relied on the
database program to handle data. With these systems, each department
could share data and resources. Instead of each department having its own
programmers and perhaps its own computer, there could be one central
computer to store data, one programming staff, and one database software
package. Data could be shared, with each department having its own
view of the data. All this sounds great, but in reality it took several years
to break away from the “my data” mold. In addition, there were hybrid
systems that emerged that focused mainly on retrieval of data that delayed
the move to a totally relational environment because of the investment
companies had in software and programmers.

Why was sharing data a good thing? It not only used expensive resources
more efficiently but also reduced redundancy. Redundancy means storing
the same information in different places. If each department stored its
own version of data, its own view of customers, then the customer name,
address, telephone number, and so on were recorded by each department.
Suppose the customer moved. Then, each department changed its data
when it found out the customer moved, and at some point the customer’s
address could be stored inconsistently: The accounting department had
one address, the sales department had another. The root problem here is
the lack of data sharing, and sharing was a central goal of the early database
systems.

The early database software evolved into two main data models: the
hierarchical and the network models. Although the relational model for
a database was recognized as a desirable technique in the early 1970s, the
relational model was treated as a really good theoretical technique for
which computers were not fast enough to implement.


20 • Database Design Using Entity-Relationship Diagrams

The database models (hierarchical, network, and relational) were logical
models—ways of logically perceiving the arrangement of data in a file structure.
One perceived how the data was to be logically stored, and the database
physically implemented the logical model. As we shall see, there is a close relationship
between the logical and physical implementations of the hierarchical
and network models. Since there were no practical relational implementations
on other than what was then supercomputers at research centers, the world of
commercial databases in the 1970s involved choosing between the hierarchical
and network models. The next sections give a little insight into each of
these three main models and an introduction to the relational model.
CHECKPOINT 2.2

1. What is a sequential file?
2. What is COBOL?
3. Why is ordering important in a sequential filing system?
4. What is a database program?
5. In the early days, how was data put into a file?

2.4 DATABASE MODELS

We now take a look back at database models as they were before the relational
database was practical. The look back shows why the “old systems”
are considered obsolete and why the relational model is the de facto standard
in databases today. The old systems were classified as two main types:
hierarchical and network. These two models defined a database before the
1980s. Although they might be considered “old fashioned,” there are systems
still in use today that depend on these models.

2.4.1 The Hierarchical Model

In hierarchical database models, all data are arranged in a top-down
fashion in which some records have one or more “dependent” or “child”
records, and each child record is tied to one and only one “parent.” The
parent-child relationship is not meant to infer a human familial relationship.
The terms parent and child are historical and are meant to conjure
up a picture of one type of data as dependent on another. As is illustrated
here, the child will be sports played by a parent-person. Another terminology
for the parent-child relationship is owner and objects owned, but
parent-child is more common.

Data and Data Models • 21

In this section, we present some versions of the hierarchical model for
several reasons: (a) to illustrate how older models were constructed from
file systems; (b) to show why file-based databases became outdated when
relational databases became practical; and (c) to see the evolution of filebased
systems.

We begin with an example of a hierarchical file situation. Suppose you
have a database of people who play a sport at some location. A sample
record might consist of Brenda who plays tennis at city courts and who
plays golf at the municipal links. The person, Brenda, would be at the top
of the hierarchy, and the sport-location would be the second layer of the
hierarchy. Usually, the connection between the layers of the hierarchy is
referred to as a parent-child relationship.

Each parent person may be related to many child sport-locations, but
each sport-location (each child) is tied back to the one person (one parent)
it “is played by.” You may be thinking, “Suppose the same sport is played
by two people at the same location?” That is a good question, but it is not
a situation that can be handled by hierarchical file systems in a direct way.
If, for example, Richard also plays golf at the municipal links, then golf at
municipal links will be recorded twice as a child for each parent. If the
same sport-location occurs, then it is recorded redundantly.

The one person, many sports tie may be called a one-to-many relationship.
Some more data might look like the following, with a person’s name
followed by the sport and location:

The model:

Person: sport, location; sport, location; sport,
location ...
The data itself:
Brenda: tennis, city courts; golf, municipal links
Richard: golf, municipal links; snorkeling, Pensacola
bay; running, UWF track
Abbie: downhill skiing, Ski Beech
Now this data could be stored in a database in many ways. You could
imagine information on each person on a 3 × 5 card with the person,s
data. You could also store the data on separate cards in a sequential file
like this:
Brenda
tennis, city courts
golf, municipal links

22 • Database Design Using Entity-Relationship Diagrams
Richard

golf, municipal links
snorkeling, Pensacola bay
running, UWF track
Abbie
downhill skiing, Ski Beech

The first thing you notice is that there is a series of text records (or 3 ×
5 cards), but it is hard to distinguish between person and sports without
semantics (you can tell the difference between person and sports by looking
at the entry, but the computer cannot do this unless you give the line of text
a meaning: semantics). For the computer, it is helpful to identify whether a
record is a person (P) or a sport (S). Hence, the sequential file could be

P Brenda
S tennis, city courts
S golf, municipal links
P Richard
S golf, municipal links
S snorkeling, Pensacola bay
S running, UWF track
P Abbie
S downhill skiing, Ski Beech

In actuality, this sequential file would be useful only if a person wanted
to list all the people followed by the sport-locations for that person.
Furthermore, the relationship of person to sport-location is one of position
in the file. The arrangement was sometimes called a header-trailer format;
persons are header records, sports are trailer records. If one wanted to
know all the people who played golf, they would have to read the entire file
and pick out those individuals. A better organization of the data would
be to have two files, one for person, one file for sport-locations. For the
two-file model to make sense (i.e., to have the files “related”), there would
have to be pointers or references of some kind from one file to the other.
One way to implement a pointer scheme would be a pointer from the sport
(child) to the person (parent) like this:

PERSON(record_number, name)
With data:
1. Brenda
2. Richard
3. Abbie
Data and Data Models • 23
SPORTS(sport, location, reference to person)
With data:
tennis, city courts, 1
golf, municipal links,1
golf, municipal links, 2
snorkeling, Pensacola bay, 2
running, UWF track, 2
downhill skiing, Ski Beech, 3

A diagram of this relationship is shown in Figure 2.1.
The actual location of the records on a disk could be arbitrary. The sense
of the data is not lost if the locations are disk addresses and if you imagine
that there are thousands of persons and sports:

PERSON(record address, name)
With data:
A45C. Brenda
C333. Abbie
B259. Richard
SPORTS(sport, location, reference to person)
With data:
golf, municipal links, B259
running, UWF track, B259
downhill skiing, Ski Beech, C333
snorkeling, Pensacola bay, B259
tennis, city courts, A45C
golf, municipal links, A45C
This system has a parent-child link. Here, we assume that the primary
key of the person file is the record_number. The “reference to person” in
the sports file refers to the primary key of the person file and is called a
Person (Parent record)
Sport Sport Sport
(Child Records)




FIGURE 2.1

A hierarchy of persons and sports with parent pointers.
24 • Database Design Using Entity-Relationship Diagrams
foreign key (FK) because it is a primary key of another file. The FK references
a primary key, hence completing the essential relationship of one
file to the other. If there were no relationship, then you would have two
independent files with no connection—the system would make no sense.
While we have established the relationship of child to parent in the
discussion, the database has some drawbacks. To answer a question like,
“Who plays golf at municipal links?” you start looking at the Sports file,
look for “golf at municipal links,” and see what parent records there are. If
your question were, “What sports does Richard play?” you would first find
Richard in the Person file, then look through the Sports file to find links
back to Richard.
If you were actually implementing this model, you could enhance the
system a bit. An improvement to this model would be to reference each
sport from within each parent record (let us go back to simple numbers
for this):
PERSON(record address, name, (sport address))
With data:
1. Brenda, (101, 102)
2. Richard (103, 104, 105)
3. Abbie (106)
SPORTS(sport, location, reference to person)
With data:
101, tennis, city courts, 1
102, golf, municipal links, 1
103, golf, municipal links, 2
104, snorkeling, Pensacola bay, 2
105, running, UWF track, 2
106, downhill skiing, Ski Beech, 3
Figure 2.2 depicts the relationship between parent and child records in
both directions. When viewed from the parent, this link is called a multiple-
child pointer (MCP). When viewed from the child the link is called a
parent pointer.
In this model, the relationship between parent and child records is done
two ways. The “reference to person” in the Sports file is a link (the FK) from
child to parent. The reference to multiple children in the parent records is
called an MCP scheme. While it is true that the MCP is redundant to the
relationship, it does two practical things: (a) It allows questions to be asked
Data and Data Models • 25
of the system that are easier to answer than with just parent pointers; and
(b) it allows the file system to be audited with a computer program. For
example, if you want to ask the system, “How many sports does Richard
play?” you need only look at the person file and count MCP references.
We have illustrated two ways to construct a relationship between two
files: the FK parent pointer and the MCP. Both of the linking techniques
are viable relationship implementations. The second one, the MCP system,
was actually implemented more often than the FK system. This is perhaps
because from a hierarchical standpoint it is easier to see the hierarchy.
You could implement either (or both) relationships to make this a hierarchical
database. Please remember that this illustration is just skeletal. In
real life, there would likely be far more fields in the records for the person
and for the sport or location. Also in real databases, one expects thousands,
if not millions, of records. If you could implement the relationship
between person and sports only one way, which of these systems (MCP or
FK) would work best if there were 1 million person-sport combinations?
The answer to this would likely depend on how many sports each person
played and how variable the number of sports was. For example, if no person
played more than five sports, the MCP system would work quite well.
If, on the other hand, the number of sports played was highly variable
(one person plays 1 or 2 sports, and some play 100), then the FK system
might be better because the FK system only keeps a parent reference, and
the MCP system with this high variability of pointers in the parent might
become cumbersome.

2.4.1.1 The Hierarchical Model with a Linked List
Having used the MCP/FK system to construct a hierarchical set of data,
we now present a second hierarchical model. The MCP system has some
drawbacks regardless of which direction one approaches the relationship
Person (Parent record)
Sport Sport Sport
(Child Records)

FIGURE 2.2

A hierarchy of persons and sports with parent and child pointers.

26 • Database Design Using Entity-Relationship Diagrams

(as MCP, FK, or both). For example, suppose you implemented the system
with the child pointers in the parent record (MCP), and there were more
child records than were planned. Suppose the system was not person and
sports, but students at a school and absences. The student would be the
parent, and the absence records would be the child. Now, suppose you
designed this system with an MCP relationship so that a student could
have up to 20 absences. What happens when the student is absent for the
21st time? One of two things has to happen: (a) The MCP system would
have to be modified to include some kind of overflow; or (b) some other
hierarchical system would have to be used. We could, of course, implement
this as an FK system and ignore the MCP part entirely. Or, we could
implement a different system, and that is what we illustrate now.
The following file system uses a “linked list” or “chain” system to implement
the relationship between parent and child. This system is similar to
both of the schemes discussed. With the same data, the records would set
up like this:

Parent (link to 1st child)
and within the child records:
Child (link to next child)
Here is the data from above with this type of arrangement:

PERSON(record address, name, first sport)
With data:
1. Brenda (101)
2. Richard (103)
3. Abbie (106)
SPORTS(sport, location, link to next sport for that person)
With data:
101, tennis, city courts, 102
102, golf, municipal links, 999
103, golf, municipal links, 104
104, snorkeling, Pensacola bay, 105
105, running, UWF track, 999
106, downhill skiing, Ski Beech, 999
Here, 999 means “no next link.”

Figure 2.3 illustrates a linked list parent-to-child pointing scheme. In
this system, we have a link from parent to child that we did not have with
the FK system alone. Furthermore, the records in both the parent and the

Data and Data Models • 27

child are uniform. Both the parent and the child records contain only one
pointer. Also in this system, it would not matter whether a person played
1 sport or 100—the system works well if the number of child records is
unknown or highly variable. If you would argue that finding a specific
child record among 200,000 sport records might be time consuming, you
are 100% correct. If you argue that the system is fragile and that if one link
is lost the whole thing goes down, you are correct again. While this linked
list system may look somewhat fragile, it formed the basis for several of
the most successful commercial databases. As you might expect, enhancements
to the basic linked list included things like direct links back to the
parent, forward and backward links, links that skipped along the chain in
one direction and not in the other (coral rings). All these enhancements
were the fodder of databases in the 1970s.

2.4.1.2 Relationship Terminology

Having seen how to implement relationships in hierarchical databases,
we need to tighten some language about how relationships are formed.
Relationships in all database models have what are called structural
constraints. A structural constraint consists of two notions: cardinality
and optionality. Cardinality is a description of how many of one record
type relate to the other and vice versa. Suppose we implement a database
about the personnel of a company. We have an employee (parent) and the
employee’s dependents (child). In our company, if an employee can have
multiple dependents and the dependent can have only one employee parent,
we would say the cardinality of the relationship is one to many: One
employee relates to many dependents, abbreviated 1:M. If the company
were such that employees might have multiple dependents and a dependent
might be claimed by more than one employee, then the cardinality
Person
(Parent record)
Sport Sport Sport (Child Records)

FIGURE 2.3

A hierarchy of persons and sports with a parent-to-child linked list.

28 • Database Design Using Entity-Relationship Diagrams

would be many to many: Many employees relate to many dependents,
abbreviated M:N (as in M of one side relates to N on the other side, and M
and N are generally not equal).
Optionality refers to whether one record may or must have a corresponding
record in the other file. If the employee may or may not have dependents,
then the optionality of the employee-to-dependent relationship is
optional or partial. If the dependents must be “related to” employee(s) (and
every employee must have dependents), then the optionality of dependent
to employee is mandatory or full.
Further, relationships are always stated in both directions in a database
description. We would say
Employees may have zero or more dependents,
and
Dependents must be associated with one and only one employee.
Note the employee-to-dependent, one-to-many cardinality and the
optional/mandatory nature of the relationship. We return to this language,
but it is easy to see that the way a database is designed depends on
the description of it. If the description is clear and unambiguous, then the
likely outcome of database design is far more predictable.

2.4.1.3 Drawbacks of the Hierarchical Model

All relationships between records in a hierarchical model have a cardinality
of one to many or one to one, but never many to one or many to many.
So, for a hierarchical model of employee and dependent, we can only have
the employee-to-dependent relationship as one to many or one to one; an
employee may have zero or more dependents. In the hierarchical model,
you could not have dependents with multiple parent-employees (unless we
introduce redundant recording of dependents).
As we illustrated, the original way hierarchical databases were implemented
involved choosing some way of physically “connecting” the parent
and the child records. Imagine you have looked up information on an
employee in an employee filing cabinet and you want to find the dependent
records for that employee in a dependent filing cabinet. One way to implement
the employee-dependent relationship would be to have an employee
record point to a dependent record and have that dependent record point
to the next dependent (a linked list of child records). For example, you
find employee Jones. In Jones’s record there is a notation that Jones’s first

Data and Data Models • 29

dependent is found in the dependent filing cabinet, file drawer 2, record
17. The “file drawer 2, record 17” is called a pointer and is the connection
or relationship between the employee and the dependent. Now, to take
this example further, suppose the record of the dependent in file drawer
2, record 17, points to the next dependent in file drawer 3, record 38, then
that dependent points to the next dependent in file drawer 1, record 82.
As we pointed out in the person-sports model, the linked list approach
to connecting parent and child records has advantages and disadvantages.
For example, some advantages would be that each employee has to maintain
only one pointer, and that the size of the linked list of dependents
is theoretically unbounded. Drawbacks would include the fragility of
the system in that if one dependent record is destroyed, then the chain
is broken. Further, if you wanted information about only one of the child
records, you might have to look through many records before you found
the one you were looking for.

There are, of course, several other ways of implementing the parentchild
link. The point here is that some system has to be chosen to be implemented
in the underlying database software. Once the linking system is
chosen, it is fixed by the software implementation; the way the link is done
has to be used to link all child records to parents regardless of how inefficient
it might be for one situation.

There are three major drawbacks to the hierarchical model:
1. Not all situations fall into the one-to-many, parent-child format.
2. The choice of the way in which the files are linked has an impact on
performance both positively and negatively.
3. The linking of parents and child records is done physically. If the
dependent file were reorganized, then all pointers would have to be
reset.

2.5 THE NETWORK MODEL

Each of the methods presented for the hierarchical database has advantages
and disadvantages. Since the network model allows M:N relationships,
we want to implement a system of pointers for multiple parents for
each child record. How would this be handled? You would most likely
have to have multiple child-forward links with either a linked list or an
MCP system in the parent, parent pointers in the child records (perhaps
30 • Database Design Using Entity-Relationship Diagrams
more than one), and possibly some other enhancement to a pointer
scheme. The network model alleviated a major concern of the hierarchical
model because in the network model, one was not restricted to having
one parent per child; a many-to-many relationship or a many-to-one
relationship was acceptable. To give an example of a network approach,
let us revisit the Person-Sports example but now allow a sports record
to be connected to more than one person. Here is some sample data like
those presented with more persons and more sports and using an MCP
system in both directions:




First, the data:

Brenda: tennis, city courts; golf, municipal links
Richard: golf, municipal links; snorkeling, Pensacola
bay; running, UWF track
Abbie: downhill skiing, Ski Beech
David: snorkeling, Pensacola bay; golf, municipal
links
Kaitlyn: curling, Joe’s skating rink; downhill skiing,
Ski Beech
Chrissy: cheerleading, Mountain Breeze High; running,
UWF track
Now, suppose we diagram this conglomeration of data with pointers (we
use record numbers in each “file”):
PERSON(record_number, name, (sports))
SPORTS(record_number, sport, location, (who plays))
In each case, the part of the file description in parentheses that looks like
(who plays) is called a repeating group, meaning that it can have multiple
values. Our small, albeit perplexing, database looks like this:
PERSON (record_number, name, (sports))

With data:

1. Kaitlyn (107, 106)
2. Abbie (106)
3. Brenda (102, 101)
4. Chrissy (108, 105)
5. Richard (103, 104, 105)
6. David (104, 102)
SPORTS (record_number, sport, location, (who plays))

With data:

Data and Data Models • 31

101 tennis, city courts (3)
102 golf, municipal links (3, 6)
103 golf, municipal links (5)
104 snorkeling, Pensacola bay (5, 6)
105 running, UWF track (4, 5)
106 downhill skiing, Ski Beech (2, 1)
107 curling, Joe’s skating rink (1)
108 cheerleading, Mountain Breeze High (4)

The network with pointers in both directions is illustrated in Figure 2.4.
The complexity of the network database is exponentially greater than
that of the hierarchical one. The database just illustrated could have been
implemented as a series of linked child/parents or some other combination
of links and pointers. The second and third drawbacks of hierarchical
databases spill over to network databases. If one were to design a file-based
database system from scratch, one would have to choose some method of
physically connecting or linking records. This choice of record connection
then locks us into the same problem as before: a hardware-implemented
connection that has an impact on performance both positively and negatively.
Further, as the database becomes more complicated, the paths of
connections and the maintenance problems become exponentially more
difficult to manage.

As a project, you could create the Person-Sports database with a few more
records than given in the example using linked lists. At first, you might
think this to be a daunting exercise, but one of the most popular database
systems in the 1970s used a variant of this system. The parent and child
records were all linked with linked lists going in two directions—forward
and backward. The forward/backward idea was to speed up finding child
records such that one could search for children by going either way. An
enhancement of this system would be to use forward-pointing linked lists
with a backward-pointing chain of links, but with the backward chain

PersonA PersonB
SportW SportX SportY SportZ

FIGURE 2.4
A network of persons and sports with MCP and parent pointers.

32 • Database Design Using Entity-Relationship Diagrams

skipping every n records, where the optimal n turns out to be the square
root of the number of entries in the chain.

2.6 THE RELATIONAL MODEL

Codd (1970) introduced the relational model to describe a database that
did not suffer the drawbacks of the hierarchical and network models (i.e.,
physical links and hardware-bound restrictions). Codd’s premise was that
if we ignore the way data files are connected and arrange our data into simple
two-dimensional, unordered tables, then we can develop a calculus for
queries (questions posed to the database), and we can focus on the data as
data, not as a physical realization of a logical model. Codd’s idea was truly
logical in that one was no longer concerned with how data was physically
stored. Rather, data sets were simply unordered, two-dimensional tables
of data. To arrive at a workable way of deciding which pieces of data went
into which table, Codd proposed “normal forms.” To understand normal
forms, we must first introduce the notion of functional dependencies. After
we understand functional dependences, normal forms follow.
As a historical note, when Codd introduced his relational model, it
was deemed by most people as “theoretical only.” At the time, a typical
mainframe computer might have 64K of internal memory, and a good
set of hard disks might have as much as several megabytes of storage. To
top that off, the computer typically took up a large room that required
separate air handling, special architectural features like raised floors,
and enhanced power grids. Remember that all the “computing power”
was shared by everyone who had to have computing. In a company, users
might include accounting, purchasing, personnel, finance, and so on.
For even one of those units to be able to run the relational model in the
early 1970s would have required vast dedicated resources. As computers
became more ubiquitous, less expensive, smaller, and so on, the amount
of memory available both internally and externally became cheaper and
had far greater capacity. The relational model “grew up” with the evolution
of computers in the 1980s. We expand the notion of relational database
in the next chapter.

Data and Data Models • 33

CHECKPOINT 2.3

1. What are the three main data models we have discussed?
2. Which data model is mostly used today? Why?
3. What are some of the disadvantages of the hierarchical data model?
4. What are some of the disadvantages of the network data model?
5. How are all relationships (mainly the cardinalities) described in the hierarchical
data model? How can these be a disadvantage of the hierarchical
data model?
6. How are all relationships (mainly the cardinalities) described in the
network data model? Would you treat these as advantages or disadvantages
of the network data model?
7. What are structural constraints?
8. Why was Codd’s promise of the relational model better?

2.7 CHAPTER SUMMARY

In this chapter, we covered concepts that are essential to the understanding
and design of a database. We also covered data models from a historical
perspective—the hierarchical and network models and then introduction
of the relational model. This chapter should serve more as background to
the material for the rest of the book.

BIBLIOGRAPHY

Codd, E. F. 1970. A relational model of data for large shared data banks. Communications
of the ACM, 13(6): 377–387.