2 Data and Data Models
2.1 INTRODUCTION
In this chapter, we introduce more concepts
that are essential to our presentation
of the design of a database. We defined a
database as a collection
of related facts (related data). In this
chapter, we explore database models
and introduce the relational database model.
While there are many kinds
of database models, most of the databases in
use today are relational; our
focus in this book is to design a good
relational database. In the next chapter,
we introduce the concept of functional
dependencies to define what is a
good (and a not-so-good) relational database.
The purpose of this chapter is
mostly historical; it engenders an appreciation
for the simplicity and power
of the relational model.
2.2 FILES, RECORDS, AND DATA ITEMS
Data has to be stored in some fashion in a file
to be useful. Suppose there
were no computers—think back to a time when
files were paper documents.
A business kept track of its customers and
products. A doctor’s
office kept track of patients. A sports team
kept statistics on its players. In
all of these cases, data was recorded on paper.
The files with data in them
could be referred to as a “database.” A
database is most simply a repository
of data about some specific entity. A customer
file might be as plain and
minimal as a list of people that did business
with a merchant. There are
two aspects to filing: storage and retrieval.
Some method of storing data
that will facilitate retrieval is most
desirable.
Suppose we have a file of customer records. The
whole file might be
called the customer file, whereas the
individual customer’s information
12 • Database Design Using Entity-Relationship
Diagrams
is kept in a customer record. Files consist of
records. More than likely,
more information than a list of just customer’s
names would be stored.
Most likely, at the very least a customer’s
name, address, and phone number
would constitute a customer record. Each of
these components of the
record is called a data item or field. The
customer file contains customer
records that consist of fields of data.
Here is an example of some data (you can
imagine each line as a 3 × 5
card, with the three cards making up a file):
CUSTOMER File
Record 1: Jones, J 123 4th St Mobile, AL
Record 2: Smith, S 452 Main St Pensacola, FL
Record 3: Harris, H 92 Adams Lane Elberta, AL
This file contains three records. There is one
record for each customer.
The records each consist of four fields: record
number, name, address,
city. As more customers are added, they also
are assigned a 3 × 5 card,
and these records are placed in the file.
Several interesting questions and
observations arise:
1. The merchant who keeps this file may well
want to add information
in the future. Suppose that the merchant wanted
to add a telephone
number. Would the merchant add a number to all
3 × 5 cards, or
would the adding be done “as necessary”? If it
were done as necessary,
then some customers would have telephone
numbers, and
some would not. If a customer had no phone
number, then the phone
number for that customer would be “null”
(unknown).
2. How will the file be organized? Imagine not
three customers, but 300 or
3,000. Would the 3 × 5 cards be put in
alphabetical order? Perhaps, but
what happens if you get another J. Jones or S.
Smith? The field on which
the file is organized is called a key.
Perhaps the file should be organized
by telephone number, which might not have a duplicate
value?
3. Suppose the file were organized by telephone
number. What if the
telephone number for some customer was not
recorded? If there were
no telephone number, the common terminology is
that we say the
telephone field for that record is null. It would
make no sense to have
the file organized by telephone number if some
values were null.
Clearly, the key of a file cannot be null.
Also, if telephone number
were the key, then the person finding a record
in the file would have
to know the phone number to find the
appropriate record efficiently.
Data and Data Models • 13
4. The format of the file given is
CUSTOMER(record_number, name, address,
city)
The format of the file dictates the order of
the fields in any record. In this
record, record_number is first, followed by a
name, and so on. The file
design could have the fields in some other
order, but once defined, the
order of the fields stays constant.
If a telephone number field were added, then
the file format could be
CUSTOMER (record_number, name, address,
city, telephone)
This shorthand format notation is referred to
as the file design. If the file
were set up to find data by name and name were
the key, then the name
would be underlined, as follows:
CUSTOMER (record_number, name, address,
city, telephone)
5. You might ask, “Why not use the record
number to organize the
file?” On one hand, it is unique (which is
desirable for a key), but on
the other hand, you would have to know the
record number to find
a customer. The example is organized by record
number; however,
imagine 300 or more customers. You want to find
Smith’s address—
you would have to know the record number. It
makes more sense to
organize this file of 3 × 5 cards by name.
Taking some of these points
into consideration, here is an enhanced version
of the customer file
in which each line represents a 3 × 5 card:
CUSTOMER File
Record 1: Adams, A 77 A St Pensacola
FL
555-5847
Record 2: Charles,
X
365
Broad St
Mobile AL 555-8214
Record 3: Harris,
H
92 Adams
Lane
Elberta,
AL
555-1234
Record 4: Jones, A 22 Pine
Forest
Pensacola
FL
null
Record 5: Jones, J 123 4th
St
Mobile, AL 555-9978
Record 6: Morris, M 932
Dracena
Way
Gulf
Breeze FL
555-1111
Record 7: Smith, S 452 Main
St
Pensacola,
FL
555-0003
14 • Database Design Using
Entity-Relationship Diagrams
CHECKPOINT 2.1
1. What does it mean to say a field has unique
values?
2. Why is it desirable to have a key be unique?
3. Why does a file have to be organized by a
key field?
4. What does null mean?
5. Consider this customer file:
Record 1: 77 A St Adams, A Pensacola
FL
555-5847
Record 2: Charles,
X
365 Broad
St
555-8214 Mobile AL
Record 3: 555-1234 Harris, H 92 Adams
Lane
Elberta,
AL
What is wrong here?
2.3 MOVING FROM 3 × 5 CARDS TO COMPUTERS
Let us return to our example of a merchant who
kept his customer file on
3 × 5 cards. As time passed, the customer base
grew and the merchant
probably desired to keep more information about
customers. From a
data-processing standpoint, we would say that
the techniques for storing
and retrieval led to enhanced forms or cards,
more fields, and perhaps better
ways to store and find individual records. Were
customer records kept
in alphabetical order? Were they stored by
telephone number or record
number (which might be a customer number)? What
happened if a new
field was required that was not on existing
forms or cards? Such was a
data-processing dilemma of the past.
When computers began to be used for businesses,
data was stored on
magnetic media. The magnetic media were mostly
disks and tapes. The
way data was stored and retrieved on a computer
started out like the 3
× 5 cards, but the data was virtual; it did not
physically exist where you
could touch it or see it without some kind of
software to load and find
records and some kind of display device to see
what the “3 × 5 card”
had on it. The most common way data was fed
into a computer was via
punched cards. Punched card systems for
handling data were in use as
early as the 1930s; sorters were capable of
scanning and arranging a pile
of cards. Using punched cards to input data
into computers did not blossom
until the 1960s. The output or “display device”
was typically a line
printer.
Data and Data Models • 15
As data was placed on a computer, and software
could be developed to
handle the data, filing techniques evolved. In
the very early days of databases,
the files kept on computers basically
replicated the 3 × 5 cards. There
were many problems with computers and databases
in the “early days.”
(Generally, early days in terms of
computers and databases means roughly
early to mid-1960s.) Some problems involved
input (how the data got into
the computer), output (how the data was to be
displayed), and file maintenance
(for example, how the data was to be stored and
kept up to date; how
records were to be added and deleted; how
fields were to be added, deleted,
or changed). A person using a computer for
keeping track of data could
buy a computer and hire programmers, operators,
and data entry personnel.
In the early days, computers were expensive and
large. Most small
businesses did not have the resources to obtain
a computer, much less hire
people whose jobs were solely “on the
computer.” Early attempts at filing
data and retrieving them were the purview of
large businesses and large
organizations. One thing was clear: Many more
virtual customer records
than the 300 or so we envisioned were possible.
In the early days, imagine that some company
had a computer, and the
departments within the company wanted to keep
files on the computer.
Suppose further that the company made some
product and had several
departments (e.g., sales, accounting,
production). Now, suppose that
each department wanted to keep data about customers.
Each department
had a different view of customers. The sales
department wanted
to know the name, address, telephone number,
and some data related to
the propensity to buy the product. The
accounting department wanted
to know roughly the same information, but
wanted to keep track of billing
and payments. Production also wanted some of
the same information,
but wanted to know what the customer wanted in
the product and
how many they should make. Each department
wanted about the same
thing, but each approached the problem in a
different way. What actually
happened in the early days was that each
department shared the
expensive computer but hired its own
programming staff to keep “their
database.” While the sense of sharing the
expensive computer was there,
the sense of sharing data was not. The idea of
a “software package” to
store and retrieve data was not there either.
Programmers used computer
languages like COBOL, RPG, ALGOL, PL/1, and
FORTRAN to
store and retrieve data. Each department
created its own records and its
own storage and retrieval methods, kept its own
programs, and had its
own data entry groups.
16 • Database Design Using
Entity-Relationship Diagrams
The earliest databases were filing systems
maintained by some computer
language. For example, a programmer wrote a
COBOL program to gather
input data on punched cards and to store the
data in a computer file. Then,
the programmer wrote another set of programs to
retrieve the data and
display it in whatever way a user wanted to see
it. Early computer filing
systems were simple sequential files. The data
on punched cards was read
and stored. Reconsider the customer file we
introduced previously:
CUSTOMER File
Record 1: Adams, A 77 A St Pensacola
FL
555-5847
Record 2: Charles,
X
365
Broad St
Mobile AL 555-8214
Record 3: Harris, H 92 Adams
Lane
Elberta, AL 555-1234
Record 4: Jones, A 22 Pine
Forest
Pensacola
FL
null
Record 5: Jones, J 123 4th
St
Mobile, AL 555-9978
Record 6: Morris, M 932
Dracena
Way
Gulf Breeze
FL
555-1111
Record 7: Smith, S 452 Main
St
Pensacola,
FL
555-0003
If you could look at the data on a disk, it
might look like this:
Adams, A 77 A St Pensacola
FL
555-5847
Charles, X 365 Broad
St
Mobile AL 555-8214
Harris, H 92 Adams
Lane
Elberta, AL 555-1234
Jones, A 22 Pine
Forest
Pensacola
FL
null
Jones, J 123 4th St Mobile, AL 555-9978
Morris, M 932 Dracena
Way
Gulf Breeze
FL
555-1111
Smith, S 452 Main St Pensacola,
FL
555-0003
Data and Data Models • 17
The records as addressed by COBOL had a
structure like this:
01 CUSTOMER
05 NAME CHARACTER(20)
05 ADDRESS CHARACTER(20)
05 CITY-STATE CHARACTER(25)
05 PHONE CHARACTER(7)
The file was referred to as a “sequential
file.” If a person wanted to see a
listing of data by address rather than name,
the file had to be sorted and
the data redisplayed. If data was added to the
file, it had to be put in its
proper place according to the sequential key,
which in this example is the
name field.
Two other principal filing systems evolved in
the 1960s: indexed and the
direct access filing systems. Indexed systems
were based on the maintenance
of indexes for records rather than a strict
sequential ordering of the records
themselves. Look again at the customer example
in a different light:
CUSTOMER File
Record 1: Adams, A 77 A St. Pensacola
FL
555-5847
Record 2: Charles, X 365 Broad
St
Mobile AL 555-8214
...
In this case, suppose that Record 1 were not a
numerical record but
rather a disk address. Then, suppose there was
an index to the records
themselves that consisted of names and disk
addresses like this:
Name Disk address
Adams, A 1
Charles, X 2
Harris, H 3
and so forth.
Now, suppose you added a customer called Baker,
B. In a sequential filing
system, you would have to put Baker between
Adams and Charles,
but in an indexed system, you could put Baker
anywhere on the disk and
simply change the index to find Baker:
18 • Database Design Using
Entity-Relationship Diagrams
Name Disk address
Adams, A 1
Baker, B 8
Charles, X 2
Harris, H 3
Disks were arranged by their geometry. A disk
usually consisted of
an arrangement of platters that looked like
flat, brown phonograph
records with a spindle in the center. The disk
rapidly rotated, and data
was arranged on the plate concentrically and
written and read magnetically.
If all of this sounds like an old phonograph
system, that is likely
where the idea of stacking disks originated. If
you could look at the data
on a specific plate, it would be arranged in a
circle, and there would be
many circles, getting larger as you moved out
on the plate. Each circle
was called a track. Data on a disk was
thought to be located in a relative
position on some track and imaginary overlaid
tracks called cylinders.
Here is a simplified example: If the tracks
were designated as 1, 2, 3,
… starting from the data circle nearest the
spindle, and the cylinders
were designated 1, 2, 3, … starting from the
top of the disk, then a data
address would be referenced as follows, for
example: cylinder 4, track 3,
record 24. This would mean that the data was on
the 4th plate from the
top, on the 3rd track from the spindle, and in
the 24th position on that
track.
The indexed systems became quite elaborate. The
one illustrated was
called a relative record indexed system.
In the relative record system, one
needed to know only which cylinders a file
occupied, and then the position
of the record could be calculated relative to
the starting position of
the file. Originally, relative record systems
were not popular, whereas there
was a common indexed system used primarily on
IBM computers that was
based on disk geometry (cylinders, tracks, and
records) called an indexedsequential
system.
Whatever the index system, one had to keep
indexes to store and find
records. Imagine not 30 or 300 customers, but
300,000. Now, suppose
that someone wanted to find customers by their
address rather than
their name. You would have to have another
index, or you would have
to look at each record in the filing system
until you found the address
you needed. One other less-common way to manage
early data access and
retrieval was to base the index on an exact
location of data on a disk: the
direct access filing system. The popular
indexed-sequential system was
Data and Data Models • 19
probably the best amalgamation of the
properties of all filing systems in
that it efficiently handled record additions,
deletions, and updating as well
as providing reasonable data access speeds.
Early filing systems involved
choices: elaborate indexes or elaborate
programs to sort and find data (or
both). Regardless of the choice of index, there
were always problems with
maintaining the index (or indexes); adding,
updating, or deleting records;
finding and displaying records and
reprogramming all of these items for
new applications or hardware.
In the late 1960s, software packages called
database systems began to
emerge. Database systems were
purchasable programs that stored and
retrieved data as well as performed maintenance
(adding, deleting, and
modifying fields and records). With a database
system, one did not have
to write COBOL programs to handle data directly
but rather relied on the
database program to handle data. With these
systems, each department
could share data and resources. Instead of each
department having its own
programmers and perhaps its own computer, there
could be one central
computer to store data, one programming staff,
and one database software
package. Data could be shared, with each
department having its own
view of the data. All this sounds great, but in
reality it took several years
to break away from the “my data” mold. In
addition, there were hybrid
systems that emerged that focused mainly on
retrieval of data that delayed
the move to a totally relational environment
because of the investment
companies had in software and programmers.
Why was sharing data a good thing? It not only
used expensive resources
more efficiently but also reduced redundancy. Redundancy
means storing
the same information in different places. If
each department stored its
own version of data, its own view of customers,
then the customer name,
address, telephone number, and so on were
recorded by each department.
Suppose the customer moved. Then, each
department changed its data
when it found out the customer moved, and at
some point the customer’s
address could be stored inconsistently: The
accounting department had
one address, the sales department had another.
The root problem here is
the lack of data sharing, and sharing was a
central goal of the early database
systems.
The early database software evolved into two
main data models: the
hierarchical and the network models. Although
the relational model for
a database was recognized as a desirable
technique in the early 1970s, the
relational model was treated as a really good
theoretical technique for
which computers were not fast enough to
implement.
20 • Database Design Using
Entity-Relationship Diagrams
The database models (hierarchical, network, and
relational) were logical
models—ways of logically perceiving the arrangement of data in a file
structure.
One perceived how the data was to be logically
stored, and the database
physically implemented the logical model. As we
shall see, there is a close relationship
between the logical and physical
implementations of the hierarchical
and network models. Since there were no
practical relational implementations
on other than what was then supercomputers at
research centers, the world of
commercial databases in the 1970s involved
choosing between the hierarchical
and network models. The next sections give a
little insight into each of
these three main models and an introduction to
the relational model.
CHECKPOINT 2.2
1. What is a sequential file?
2. What is COBOL?
3. Why is ordering important in a sequential
filing system?
4. What is a database program?
5. In the early days, how was data put into a
file?
2.4 DATABASE MODELS
We now take a look back at database models as
they were before the relational
database was practical. The look back shows why
the “old systems”
are considered obsolete and why the relational
model is the de facto standard
in databases today. The old systems were
classified as two main types:
hierarchical and network. These two models
defined a database before the
1980s. Although they might be considered “old
fashioned,” there are systems
still in use today that depend on these models.
2.4.1 The Hierarchical Model
In hierarchical database models, all data are
arranged in a top-down
fashion in which some records have one or more
“dependent” or “child”
records, and each child record is tied to one
and only one “parent.” The
parent-child relationship is not meant to infer
a human familial relationship.
The terms parent and child are
historical and are meant to conjure
up a picture of one type of data as dependent
on another. As is illustrated
here, the child will be sports played by a
parent-person. Another terminology
for the parent-child relationship is owner and
objects owned, but
parent-child is more common.
Data and Data Models • 21
In this section, we present some versions of
the hierarchical model for
several reasons: (a) to illustrate how older
models were constructed from
file systems; (b) to show why file-based
databases became outdated when
relational databases became practical; and (c)
to see the evolution of filebased
systems.
We begin with an example of a hierarchical file
situation. Suppose you
have a database of people who play a sport at
some location. A sample
record might consist of Brenda who plays
tennis at city courts and who
plays golf at the municipal links. The
person, Brenda, would be at the top
of the hierarchy, and the sport-location would
be the second layer of the
hierarchy. Usually, the connection between the
layers of the hierarchy is
referred to as a parent-child relationship.
Each parent person may be related to many child
sport-locations, but
each sport-location (each child) is tied back
to the one person (one parent)
it “is played by.” You may be thinking,
“Suppose the same sport is played
by two people at the same location?” That is a
good question, but it is not
a situation that can be handled by hierarchical
file systems in a direct way.
If, for example, Richard also plays golf at
the municipal links, then golf at
municipal links will be recorded twice as a child for each
parent. If the
same sport-location occurs, then it is recorded
redundantly.
The one person, many sports tie
may be called a one-to-many relationship.
Some more data might look like the following,
with a person’s name
followed by the sport and location:
The model:
Person: sport, location; sport, location;
sport,
location ...
The data itself:
Brenda: tennis, city courts; golf, municipal
links
Richard: golf, municipal links; snorkeling,
Pensacola
bay; running, UWF track
Abbie: downhill skiing, Ski Beech
Now this data could be stored in a database in
many ways. You could
imagine information on each person on a 3 × 5
card with the person,s
data. You could also store the data on separate
cards in a sequential file
like this:
Brenda
tennis, city courts
golf, municipal links
22 • Database Design Using
Entity-Relationship Diagrams
Richard
golf, municipal links
snorkeling, Pensacola bay
running, UWF track
Abbie
downhill skiing, Ski Beech
The first thing you notice is that there is a
series of text records (or 3 ×
5 cards), but it is hard to distinguish between
person and sports without
semantics (you can tell the difference between
person and sports by looking
at the entry, but the computer cannot do this
unless you give the line of text
a meaning: semantics). For the computer, it is
helpful to identify whether a
record is a person (P) or a sport (S). Hence,
the sequential file could be
P Brenda
S tennis, city courts
S golf, municipal links
P Richard
S golf, municipal links
S snorkeling, Pensacola bay
S running, UWF track
P Abbie
S downhill skiing, Ski Beech
In actuality, this sequential file would be
useful only if a person wanted
to list all the people followed by the
sport-locations for that person.
Furthermore, the relationship of person to
sport-location is one of position
in the file. The arrangement was sometimes
called a header-trailer format;
persons are header records, sports are trailer
records. If one wanted to
know all the people who played golf, they would
have to read the entire file
and pick out those individuals. A better
organization of the data would
be to have two files, one for person, one file
for sport-locations. For the
two-file model to make sense (i.e., to have the
files “related”), there would
have to be pointers or references of some kind
from one file to the other.
One way to implement a pointer scheme would be
a pointer from the sport
(child) to the person (parent) like this:
PERSON(record_number, name)
With data:
1. Brenda
2. Richard
3. Abbie
Data and Data Models • 23
SPORTS(sport, location, reference to
person)
With data:
tennis, city courts, 1
golf, municipal links,1
golf, municipal links, 2
snorkeling, Pensacola bay, 2
running, UWF track, 2
downhill skiing, Ski Beech, 3
A diagram of this relationship is shown in
Figure 2.1.
The actual location of the records on a disk
could be arbitrary. The sense
of the data is not lost if the locations are
disk addresses and if you imagine
that there are thousands of persons and sports:
PERSON(record address, name)
With data:
A45C. Brenda
C333. Abbie
B259. Richard
SPORTS(sport, location, reference to
person)
With data:
golf, municipal links, B259
running, UWF track, B259
downhill skiing, Ski Beech, C333
snorkeling, Pensacola bay, B259
tennis, city courts, A45C
golf, municipal links, A45C
This system has a parent-child link. Here, we
assume that the primary
key of the person file is the record_number.
The “reference to person” in
the sports file refers to the primary key of
the person file and is called a
Person (Parent record)
Sport Sport Sport
(Child Records)
FIGURE 2.1
A hierarchy of persons and sports with parent
pointers.
24 • Database Design Using
Entity-Relationship Diagrams
foreign key (FK) because it is a primary key of another file. The FK
references
a primary key, hence completing the essential
relationship of one
file to the other. If there were no
relationship, then you would have two
independent files with no connection—the system
would make no sense.
While we have established the relationship of
child to parent in the
discussion, the database has some drawbacks. To
answer a question like,
“Who plays golf at municipal links?” you start
looking at the Sports file,
look for “golf at municipal links,” and see
what parent records there are. If
your question were, “What sports does Richard
play?” you would first find
Richard in the Person file, then look through
the Sports file to find links
back to Richard.
If you were actually implementing this model,
you could enhance the
system a bit. An improvement to this model
would be to reference each
sport from within each parent record (let us go
back to simple numbers
for this):
PERSON(record address, name, (sport
address))
With data:
1. Brenda, (101, 102)
2. Richard (103, 104, 105)
3. Abbie (106)
SPORTS(sport, location, reference to
person)
With data:
101, tennis, city courts, 1
102, golf, municipal links, 1
103, golf, municipal links, 2
104, snorkeling, Pensacola bay, 2
105, running, UWF track, 2
106, downhill skiing, Ski Beech, 3
Figure 2.2 depicts the relationship between
parent and child records in
both directions. When viewed from the parent,
this link is called a multiple-
child pointer (MCP). When viewed from the child
the link is called a
parent pointer.
In this model, the relationship between parent
and child records is done
two ways. The “reference to person” in the
Sports file is a link (the FK) from
child to parent. The reference to multiple
children in the parent records is
called an MCP scheme. While it is true that the
MCP is redundant to the
relationship, it does two practical things: (a)
It allows questions to be asked
Data and Data Models • 25
of the system that are easier to answer than
with just parent pointers; and
(b) it allows the file system to be audited
with a computer program. For
example, if you want to ask the system, “How
many sports does Richard
play?” you need only look at the person file
and count MCP references.
We have illustrated two ways to construct a
relationship between two
files: the FK parent pointer and the MCP. Both
of the linking techniques
are viable relationship implementations. The
second one, the MCP system,
was actually implemented more often than the FK
system. This is perhaps
because from a hierarchical standpoint it is
easier to see the hierarchy.
You could implement either (or both)
relationships to make this a hierarchical
database. Please remember that this
illustration is just skeletal. In
real life, there would likely be far more
fields in the records for the person
and for the sport or location. Also in real
databases, one expects thousands,
if not millions, of records. If you could
implement the relationship
between person and sports only one way, which
of these systems (MCP or
FK) would work best if there were 1 million
person-sport combinations?
The answer to this would likely depend on how
many sports each person
played and how variable the number of sports
was. For example, if no person
played more than five sports, the MCP system
would work quite well.
If, on the other hand, the number of sports played
was highly variable
(one person plays 1 or 2 sports, and some play
100), then the FK system
might be better because the FK system only
keeps a parent reference, and
the MCP system with this high variability of
pointers in the parent might
become cumbersome.
2.4.1.1 The Hierarchical Model with a
Linked List
Having used the MCP/FK system to construct a
hierarchical set of data,
we now present a second hierarchical model. The
MCP system has some
drawbacks regardless of which direction one
approaches the relationship
Person (Parent record)
Sport Sport Sport
(Child Records)
FIGURE 2.2
A hierarchy of persons and sports with parent
and child pointers.
26 • Database Design Using
Entity-Relationship Diagrams
(as MCP, FK, or both). For example, suppose you
implemented the system
with the child pointers in the parent record
(MCP), and there were more
child records than were planned. Suppose the
system was not person and
sports, but students at a school and absences.
The student would be the
parent, and the absence records would be the
child. Now, suppose you
designed this system with an MCP relationship
so that a student could
have up to 20 absences. What happens when the
student is absent for the
21st time? One of two things has to happen: (a)
The MCP system would
have to be modified to include some kind of
overflow; or (b) some other
hierarchical system would have to be used. We
could, of course, implement
this as an FK system and ignore the MCP part
entirely. Or, we could
implement a different system, and that is what
we illustrate now.
The following file system uses a “linked list”
or “chain” system to implement
the relationship between parent and child. This
system is similar to
both of the schemes discussed. With the same
data, the records would set
up like this:
Parent (link to 1st child)
and within the child records:
Child (link to next child)
Here is the data from above with this type of
arrangement:
PERSON(record address, name, first
sport)
With data:
1. Brenda (101)
2. Richard (103)
3. Abbie (106)
SPORTS(sport, location, link to next
sport for that person)
With data:
101, tennis, city courts, 102
102, golf, municipal links, 999
103, golf, municipal links, 104
104, snorkeling, Pensacola bay, 105
105, running, UWF track, 999
106, downhill skiing, Ski Beech, 999
Here, 999 means “no next link.”
Figure 2.3 illustrates a linked list
parent-to-child pointing scheme. In
this system, we have a link from parent to
child that we did not have with
the FK system alone. Furthermore, the records
in both the parent and the
Data and Data Models • 27
child are uniform. Both the parent and the
child records contain only one
pointer. Also in this system, it would not
matter whether a person played
1 sport or 100—the system works well if the
number of child records is
unknown or highly variable. If you would argue
that finding a specific
child record among 200,000 sport records might
be time consuming, you
are 100% correct. If you argue that the system
is fragile and that if one link
is lost the whole thing goes down, you are
correct again. While this linked
list system may look somewhat fragile, it
formed the basis for several of
the most successful commercial databases. As
you might expect, enhancements
to the basic linked list included things like
direct links back to the
parent, forward and backward links, links that
skipped along the chain in
one direction and not in the other (coral
rings). All these enhancements
were the fodder of databases in the 1970s.
2.4.1.2 Relationship Terminology
Having seen how to implement relationships in
hierarchical databases,
we need to tighten some language about how
relationships are formed.
Relationships in all database models have what
are called structural
constraints. A structural constraint consists of two notions: cardinality
and optionality. Cardinality is a
description of how many of one record
type relate to the other and vice versa.
Suppose we implement a database
about the personnel of a company. We have an
employee (parent) and the
employee’s dependents (child). In our company,
if an employee can have
multiple dependents and the dependent can have
only one employee parent,
we would say the cardinality of the
relationship is one to many: One
employee relates to many dependents,
abbreviated 1:M. If the company
were such that employees might have multiple
dependents and a dependent
might be claimed by more than one employee,
then the cardinality
Person
(Parent record)
Sport Sport Sport (Child Records)
FIGURE 2.3
A hierarchy of persons and sports with a
parent-to-child linked list.
28 • Database Design Using
Entity-Relationship Diagrams
would be many to many: Many employees relate to
many dependents,
abbreviated M:N (as in M of one side relates to
N on the other side, and M
and N are generally not equal).
Optionality refers to whether one record may or must have a
corresponding
record in the other file. If the employee may
or may not have dependents,
then the optionality of the
employee-to-dependent relationship is
optional or partial. If the dependents must be “related to”
employee(s) (and
every employee must have dependents), then the
optionality of dependent
to employee is mandatory or full.
Further, relationships are always stated in
both directions in a database
description. We would say
Employees may have zero or more dependents,
and
Dependents must be associated with one and only
one employee.
Note the employee-to-dependent, one-to-many cardinality
and the
optional/mandatory nature of the relationship.
We return to this language,
but it is easy to see that the way a database
is designed depends on
the description of it. If the description is
clear and unambiguous, then the
likely outcome of database design is far more
predictable.
2.4.1.3 Drawbacks of the Hierarchical Model
All relationships between records in a
hierarchical model have a cardinality
of one to many or one to one, but never many to
one or many to many.
So, for a hierarchical model of employee and
dependent, we can only have
the employee-to-dependent relationship as one
to many or one to one; an
employee may have zero or more dependents. In
the hierarchical model,
you could not have dependents with multiple
parent-employees (unless we
introduce redundant recording of dependents).
As we illustrated, the original way
hierarchical databases were implemented
involved choosing some way of physically
“connecting” the parent
and the child records. Imagine you have looked
up information on an
employee in an employee filing cabinet and you
want to find the dependent
records for that employee in a dependent filing
cabinet. One way to implement
the employee-dependent relationship would be to
have an employee
record point to a dependent record and have
that dependent record point
to the next dependent (a linked list of child
records). For example, you
find employee Jones. In Jones’s record there is
a notation that Jones’s first
Data and Data Models • 29
dependent is found in the dependent filing
cabinet, file drawer 2, record
17. The “file drawer 2, record 17” is called a pointer
and is the connection
or relationship between the employee and the
dependent. Now, to take
this example further, suppose the record of the
dependent in file drawer
2, record 17, points to the next dependent in
file drawer 3, record 38, then
that dependent points to the next dependent in
file drawer 1, record 82.
As we pointed out in the person-sports model,
the linked list approach
to connecting parent and child records has
advantages and disadvantages.
For example, some advantages would be that each
employee has to maintain
only one pointer, and that the size of the
linked list of dependents
is theoretically unbounded. Drawbacks would
include the fragility of
the system in that if one dependent record is
destroyed, then the chain
is broken. Further, if you wanted information
about only one of the child
records, you might have to look through many
records before you found
the one you were looking for.
There are, of course, several other ways of
implementing the parentchild
link. The point here is that some system has to
be chosen to be implemented
in the underlying database software. Once the
linking system is
chosen, it is fixed by the software
implementation; the way the link is done
has to be used to link all child records to
parents regardless of how inefficient
it might be for one situation.
There are three major drawbacks to the
hierarchical model:
1. Not all situations fall into the
one-to-many, parent-child format.
2. The choice of the way in which the files are
linked has an impact on
performance both positively and negatively.
3. The linking of parents and child records is
done physically. If the
dependent file were reorganized, then all
pointers would have to be
reset.
2.5 THE NETWORK MODEL
Each of the methods presented for the
hierarchical database has advantages
and disadvantages. Since the network model
allows M:N relationships,
we want to implement a system of pointers for
multiple parents for
each child record. How would this be handled?
You would most likely
have to have multiple child-forward links with
either a linked list or an
MCP system in the parent, parent pointers in the
child records (perhaps
30 • Database Design Using
Entity-Relationship Diagrams
more than one), and possibly some other
enhancement to a pointer
scheme. The network model alleviated a major
concern of the hierarchical
model because in the network model, one was not
restricted to having
one parent per child; a many-to-many
relationship or a many-to-one
relationship was acceptable. To give an example
of a network approach,
let us revisit the Person-Sports example but
now allow a sports record
to be connected to more than one person. Here
is some sample data like
those presented with more persons and more
sports and using an MCP
system in both directions:
First, the data:
Brenda: tennis, city courts; golf, municipal
links
Richard: golf, municipal links; snorkeling,
Pensacola
bay; running, UWF track
Abbie: downhill skiing, Ski Beech
David: snorkeling, Pensacola bay; golf,
municipal
links
Kaitlyn: curling, Joe’s skating rink; downhill
skiing,
Ski Beech
Chrissy: cheerleading, Mountain Breeze High;
running,
UWF track
Now, suppose we diagram this conglomeration of
data with pointers (we
use record numbers in each “file”):
PERSON(record_number, name, (sports))
SPORTS(record_number, sport, location,
(who plays))
In each case, the part of the file description
in parentheses that looks like
(who plays) is called a repeating group,
meaning that it can have multiple
values. Our small, albeit perplexing, database
looks like this:
PERSON (record_number, name, (sports))
With data:
1. Kaitlyn (107, 106)
2. Abbie (106)
3. Brenda (102, 101)
4. Chrissy (108, 105)
5. Richard (103, 104, 105)
6. David (104, 102)
SPORTS (record_number, sport, location,
(who plays))
With data:
Data and Data Models • 31
101 tennis, city courts (3)
102 golf, municipal links (3, 6)
103 golf, municipal links (5)
104 snorkeling, Pensacola bay (5, 6)
105 running, UWF track (4, 5)
106 downhill skiing, Ski Beech (2, 1)
107 curling, Joe’s skating rink (1)
108 cheerleading, Mountain Breeze High (4)
The network with pointers in both directions is
illustrated in Figure 2.4.
The complexity of the network database is
exponentially greater than
that of the hierarchical one. The database just
illustrated could have been
implemented as a series of linked child/parents
or some other combination
of links and pointers. The second and third
drawbacks of hierarchical
databases spill over to network databases. If
one were to design a file-based
database system from scratch, one would have to
choose some method of
physically connecting or linking records. This
choice of record connection
then locks us into the same problem as before:
a hardware-implemented
connection that has an impact on performance
both positively and negatively.
Further, as the database becomes more
complicated, the paths of
connections and the maintenance problems become
exponentially more
difficult to manage.
As a project, you could create the
Person-Sports database with a few more
records than given in the example using linked
lists. At first, you might
think this to be a daunting exercise, but one
of the most popular database
systems in the 1970s used a variant of this
system. The parent and child
records were all linked with linked lists going
in two directions—forward
and backward. The forward/backward idea was to
speed up finding child
records such that one could search for children
by going either way. An
enhancement of this system would be to use
forward-pointing linked lists
with a backward-pointing chain of links, but
with the backward chain
PersonA PersonB
SportW SportX SportY SportZ
FIGURE 2.4
A network of persons and sports with MCP and
parent pointers.
32 • Database Design Using
Entity-Relationship Diagrams
skipping every n records, where the
optimal n turns out to be the square
root of the number of entries in the chain.
2.6 THE RELATIONAL MODEL
Codd (1970) introduced the relational model to
describe a database that
did not suffer the drawbacks of the
hierarchical and network models (i.e.,
physical links and hardware-bound
restrictions). Codd’s premise was that
if we ignore the way data files are connected
and arrange our data into simple
two-dimensional, unordered tables, then we can
develop a calculus for
queries (questions posed to the database), and we
can focus on the data as
data, not as a physical realization of a
logical model. Codd’s idea was truly
logical in that one was no longer concerned
with how data was physically
stored. Rather, data sets were simply
unordered, two-dimensional tables
of data. To arrive at a workable way of
deciding which pieces of data went
into which table, Codd proposed “normal forms.”
To understand normal
forms, we must first introduce the notion of functional
dependencies. After
we understand functional dependences, normal
forms follow.
As a historical note, when Codd introduced his
relational model, it
was deemed by most people as “theoretical
only.” At the time, a typical
mainframe computer might have 64K of internal
memory, and a good
set of hard disks might have as much as several
megabytes of storage. To
top that off, the computer typically took up a
large room that required
separate air handling, special architectural
features like raised floors,
and enhanced power grids. Remember that all the
“computing power”
was shared by everyone who had to have
computing. In a company, users
might include accounting, purchasing,
personnel, finance, and so on.
For even one of those units to be able to run
the relational model in the
early 1970s would have required vast dedicated resources.
As computers
became more ubiquitous, less expensive,
smaller, and so on, the amount
of memory available both internally and
externally became cheaper and
had far greater capacity. The relational model
“grew up” with the evolution
of computers in the 1980s. We expand the notion
of relational database
in the next chapter.
Data and Data Models • 33
CHECKPOINT 2.3
1. What are the three main data models we have
discussed?
2. Which data model is mostly used today? Why?
3. What are some of the disadvantages of the
hierarchical data model?
4. What are some of the disadvantages of the
network data model?
5. How are all relationships (mainly the
cardinalities) described in the hierarchical
data model? How can these be a disadvantage of
the hierarchical
data model?
6. How are all relationships (mainly the
cardinalities) described in the
network data model? Would you treat these as
advantages or disadvantages
of the network data model?
7. What are structural constraints?
8. Why was Codd’s promise of the relational
model better?
2.7 CHAPTER SUMMARY
In this chapter, we covered concepts that are
essential to the understanding
and design of a database. We also covered data
models from a historical
perspective—the hierarchical and network models
and then introduction
of the relational model. This chapter should
serve more as background to
the material for the rest of the book.
BIBLIOGRAPHY
Codd, E. F. 1970. A relational model of data
for large shared data banks. Communications
of the ACM, 13(6): 377–387.
No hay comentarios.:
Publicar un comentario