MaxiCompu Fagaad: Data and Data Models

2 Data and Data Models

2.1 INTRODUCTION

In this chapter, we introduce more concepts that are essential to our presentation

of the design of a database. We defined a database as a collection

of related facts (related data). In this chapter, we explore database models

and introduce the relational database model. While there are many kinds

of database models, most of the databases in use today are relational; our

focus in this book is to design a good relational database. In the next chapter,

we introduce the concept of functional dependencies to define what is a

good (and a not-so-good) relational database. The purpose of this chapter is

mostly historical; it engenders an appreciation for the simplicity and power

of the relational model.

2.2 FILES, RECORDS, AND DATA ITEMS

Data has to be stored in some fashion in a file to be useful. Suppose there

were no computers—think back to a time when files were paper documents.

A business kept track of its customers and products. A doctor’s

office kept track of patients. A sports team kept statistics on its players. In

all of these cases, data was recorded on paper. The files with data in them

could be referred to as a “database.” A database is most simply a repository

of data about some specific entity. A customer file might be as plain and

minimal as a list of people that did business with a merchant. There are

two aspects to filing: storage and retrieval. Some method of storing data

that will facilitate retrieval is most desirable.

Suppose we have a file of customer records. The whole file might be

called the customer file, whereas the individual customer’s information

12 • Database Design Using Entity-Relationship Diagrams

is kept in a customer record. Files consist of records. More than likely,

more information than a list of just customer’s names would be stored.

Most likely, at the very least a customer’s name, address, and phone number

would constitute a customer record. Each of these components of the

record is called a data item or field. The customer file contains customer

records that consist of fields of data.

Here is an example of some data (you can imagine each line as a 3 × 5

card, with the three cards making up a file):

CUSTOMER File

Record 1: Jones, J 123 4th St Mobile, AL

Record 2: Smith, S 452 Main St Pensacola, FL

Record 3: Harris, H 92 Adams Lane Elberta, AL

This file contains three records. There is one record for each customer.

The records each consist of four fields: record number, name, address,

city. As more customers are added, they also are assigned a 3 × 5 card,

and these records are placed in the file. Several interesting questions and

observations arise:

1. The merchant who keeps this file may well want to add information

in the future. Suppose that the merchant wanted to add a telephone

number. Would the merchant add a number to all 3 × 5 cards, or

would the adding be done “as necessary”? If it were done as necessary,

then some customers would have telephone numbers, and

some would not. If a customer had no phone number, then the phone

number for that customer would be “null” (unknown).

2. How will the file be organized? Imagine not three customers, but 300 or

3,000. Would the 3 × 5 cards be put in alphabetical order? Perhaps, but

what happens if you get another J. Jones or S. Smith? The field on which

the file is organized is called a key. Perhaps the file should be organized

by telephone number, which might not have a duplicate value?

3. Suppose the file were organized by telephone number. What if the

telephone number for some customer was not recorded? If there were

no telephone number, the common terminology is that we say the

telephone field for that record is null. It would make no sense to have

the file organized by telephone number if some values were null.

Clearly, the key of a file cannot be null. Also, if telephone number

were the key, then the person finding a record in the file would have

to know the phone number to find the appropriate record efficiently.

Data and Data Models • 13

4. The format of the file given is

CUSTOMER(record_number, name, address, city)

The format of the file dictates the order of the fields in any record. In this

record, record_number is first, followed by a name, and so on. The file

design could have the fields in some other order, but once defined, the

order of the fields stays constant.

If a telephone number field were added, then the file format could be

CUSTOMER (record_number, name, address, city, telephone)

This shorthand format notation is referred to as the file design. If the file

were set up to find data by name and name were the key, then the name

would be underlined, as follows:

CUSTOMER (record_number, name, address, city, telephone)

5. You might ask, “Why not use the record number to organize the

file?” On one hand, it is unique (which is desirable for a key), but on

the other hand, you would have to know the record number to find

a customer. The example is organized by record number; however,

imagine 300 or more customers. You want to find Smith’s address—

you would have to know the record number. It makes more sense to

organize this file of 3 × 5 cards by name. Taking some of these points

into consideration, here is an enhanced version of the customer file

in which each line represents a 3 × 5 card:

CUSTOMER File

Record 1: Adams, A 77 A St Pensacola

555-5847

Record 2: Charles,

365

Broad St

Mobile AL 555-8214

Record 3: Harris,

92 Adams

Lane

Elberta,

555-1234

Record 4: Jones, A 22 Pine

Forest

Pensacola

null

Record 5: Jones, J 123 4th

Mobile, AL 555-9978

Record 6: Morris, M 932

Dracena

Way

Gulf

Breeze FL

555-1111

Record 7: Smith, S 452 Main

Pensacola,

555-0003

14 • Database Design Using Entity-Relationship Diagrams

CHECKPOINT 2.1

1. What does it mean to say a field has unique values?

2. Why is it desirable to have a key be unique?

3. Why does a file have to be organized by a key field?

4. What does null mean?

5. Consider this customer file:

Record 1: 77 A St Adams, A Pensacola

555-5847

Record 2: Charles,

365 Broad

555-8214 Mobile AL

Record 3: 555-1234 Harris, H 92 Adams

Lane

Elberta,

What is wrong here?

2.3 MOVING FROM 3 × 5 CARDS TO COMPUTERS

Let us return to our example of a merchant who kept his customer file on

3 × 5 cards. As time passed, the customer base grew and the merchant

probably desired to keep more information about customers. From a

data-processing standpoint, we would say that the techniques for storing

and retrieval led to enhanced forms or cards, more fields, and perhaps better

ways to store and find individual records. Were customer records kept

in alphabetical order? Were they stored by telephone number or record

number (which might be a customer number)? What happened if a new

field was required that was not on existing forms or cards? Such was a

data-processing dilemma of the past.

When computers began to be used for businesses, data was stored on

magnetic media. The magnetic media were mostly disks and tapes. The

way data was stored and retrieved on a computer started out like the 3

× 5 cards, but the data was virtual; it did not physically exist where you

could touch it or see it without some kind of software to load and find

records and some kind of display device to see what the “3 × 5 card”

had on it. The most common way data was fed into a computer was via

punched cards. Punched card systems for handling data were in use as

early as the 1930s; sorters were capable of scanning and arranging a pile

of cards. Using punched cards to input data into computers did not blossom

until the 1960s. The output or “display device” was typically a line

printer.

Data and Data Models • 15

As data was placed on a computer, and software could be developed to

handle the data, filing techniques evolved. In the very early days of databases,

the files kept on computers basically replicated the 3 × 5 cards. There

were many problems with computers and databases in the “early days.”

(Generally, early days in terms of computers and databases means roughly

early to mid-1960s.) Some problems involved input (how the data got into

the computer), output (how the data was to be displayed), and file maintenance

(for example, how the data was to be stored and kept up to date; how

records were to be added and deleted; how fields were to be added, deleted,

or changed). A person using a computer for keeping track of data could

buy a computer and hire programmers, operators, and data entry personnel.

In the early days, computers were expensive and large. Most small

businesses did not have the resources to obtain a computer, much less hire

people whose jobs were solely “on the computer.” Early attempts at filing

data and retrieving them were the purview of large businesses and large

organizations. One thing was clear: Many more virtual customer records

than the 300 or so we envisioned were possible.

In the early days, imagine that some company had a computer, and the

departments within the company wanted to keep files on the computer.

Suppose further that the company made some product and had several

departments (e.g., sales, accounting, production). Now, suppose that

each department wanted to keep data about customers. Each department

had a different view of customers. The sales department wanted

to know the name, address, telephone number, and some data related to

the propensity to buy the product. The accounting department wanted

to know roughly the same information, but wanted to keep track of billing

and payments. Production also wanted some of the same information,

but wanted to know what the customer wanted in the product and

how many they should make. Each department wanted about the same

thing, but each approached the problem in a different way. What actually

happened in the early days was that each department shared the

expensive computer but hired its own programming staff to keep “their

database.” While the sense of sharing the expensive computer was there,

the sense of sharing data was not. The idea of a “software package” to

store and retrieve data was not there either. Programmers used computer

languages like COBOL, RPG, ALGOL, PL/1, and FORTRAN to

store and retrieve data. Each department created its own records and its

own storage and retrieval methods, kept its own programs, and had its

own data entry groups.

16 • Database Design Using Entity-Relationship Diagrams

The earliest databases were filing systems maintained by some computer

language. For example, a programmer wrote a COBOL program to gather

input data on punched cards and to store the data in a computer file. Then,

the programmer wrote another set of programs to retrieve the data and

display it in whatever way a user wanted to see it. Early computer filing

systems were simple sequential files. The data on punched cards was read

and stored. Reconsider the customer file we introduced previously:

CUSTOMER File

Record 1: Adams, A 77 A St Pensacola

555-5847

Record 2: Charles,

365

Broad St

Mobile AL 555-8214

Record 3: Harris, H 92 Adams

Lane

Elberta, AL 555-1234

Record 4: Jones, A 22 Pine

Forest

Pensacola

null

Record 5: Jones, J 123 4th

Mobile, AL 555-9978

Record 6: Morris, M 932

Dracena

Way

Gulf Breeze

555-1111

Record 7: Smith, S 452 Main

Pensacola,

555-0003

If you could look at the data on a disk, it might look like this:

Adams, A 77 A St Pensacola

555-5847

Charles, X 365 Broad

Mobile AL 555-8214

Harris, H 92 Adams

Lane

Elberta, AL 555-1234

Jones, A 22 Pine

Forest

Pensacola

null

Jones, J 123 4th St Mobile, AL 555-9978

Morris, M 932 Dracena

Way

Gulf Breeze

555-1111

Smith, S 452 Main St Pensacola,

555-0003

Data and Data Models • 17

The records as addressed by COBOL had a structure like this:

01 CUSTOMER

05 NAME CHARACTER(20)

05 ADDRESS CHARACTER(20)

05 CITY-STATE CHARACTER(25)

05 PHONE CHARACTER(7)

The file was referred to as a “sequential file.” If a person wanted to see a

listing of data by address rather than name, the file had to be sorted and

the data redisplayed. If data was added to the file, it had to be put in its

proper place according to the sequential key, which in this example is the

name field.

Two other principal filing systems evolved in the 1960s: indexed and the

direct access filing systems. Indexed systems were based on the maintenance

of indexes for records rather than a strict sequential ordering of the records

themselves. Look again at the customer example in a different light:

CUSTOMER File

Record 1: Adams, A 77 A St. Pensacola

555-5847

Record 2: Charles, X 365 Broad

Mobile AL 555-8214

...

In this case, suppose that Record 1 were not a numerical record but

rather a disk address. Then, suppose there was an index to the records

themselves that consisted of names and disk addresses like this:

Name Disk address

Adams, A 1

Charles, X 2

Harris, H 3

and so forth.

Now, suppose you added a customer called Baker, B. In a sequential filing

system, you would have to put Baker between Adams and Charles,

but in an indexed system, you could put Baker anywhere on the disk and

simply change the index to find Baker:

18 • Database Design Using Entity-Relationship Diagrams

Name Disk address

Adams, A 1

Baker, B 8

Charles, X 2

Harris, H 3

Disks were arranged by their geometry. A disk usually consisted of

an arrangement of platters that looked like flat, brown phonograph

records with a spindle in the center. The disk rapidly rotated, and data

was arranged on the plate concentrically and written and read magnetically.

If all of this sounds like an old phonograph system, that is likely

where the idea of stacking disks originated. If you could look at the data

on a specific plate, it would be arranged in a circle, and there would be

many circles, getting larger as you moved out on the plate. Each circle

was called a track. Data on a disk was thought to be located in a relative

position on some track and imaginary overlaid tracks called cylinders.

Here is a simplified example: If the tracks were designated as 1, 2, 3,

… starting from the data circle nearest the spindle, and the cylinders

were designated 1, 2, 3, … starting from the top of the disk, then a data

address would be referenced as follows, for example: cylinder 4, track 3,

record 24. This would mean that the data was on the 4th plate from the

top, on the 3rd track from the spindle, and in the 24th position on that

track.

The indexed systems became quite elaborate. The one illustrated was

called a relative record indexed system. In the relative record system, one

needed to know only which cylinders a file occupied, and then the position

of the record could be calculated relative to the starting position of

the file. Originally, relative record systems were not popular, whereas there

was a common indexed system used primarily on IBM computers that was

based on disk geometry (cylinders, tracks, and records) called an indexedsequential

system.

Whatever the index system, one had to keep indexes to store and find

records. Imagine not 30 or 300 customers, but 300,000. Now, suppose

that someone wanted to find customers by their address rather than

their name. You would have to have another index, or you would have

to look at each record in the filing system until you found the address

you needed. One other less-common way to manage early data access and

retrieval was to base the index on an exact location of data on a disk: the

direct access filing system. The popular indexed-sequential system was

Data and Data Models • 19

probably the best amalgamation of the properties of all filing systems in

that it efficiently handled record additions, deletions, and updating as well

as providing reasonable data access speeds. Early filing systems involved

choices: elaborate indexes or elaborate programs to sort and find data (or

both). Regardless of the choice of index, there were always problems with

maintaining the index (or indexes); adding, updating, or deleting records;

finding and displaying records and reprogramming all of these items for

new applications or hardware.

In the late 1960s, software packages called database systems began to

emerge. Database systems were purchasable programs that stored and

retrieved data as well as performed maintenance (adding, deleting, and

modifying fields and records). With a database system, one did not have

to write COBOL programs to handle data directly but rather relied on the

database program to handle data. With these systems, each department

could share data and resources. Instead of each department having its own

programmers and perhaps its own computer, there could be one central

computer to store data, one programming staff, and one database software

package. Data could be shared, with each department having its own

view of the data. All this sounds great, but in reality it took several years

to break away from the “my data” mold. In addition, there were hybrid

systems that emerged that focused mainly on retrieval of data that delayed

the move to a totally relational environment because of the investment

companies had in software and programmers.

Why was sharing data a good thing? It not only used expensive resources

more efficiently but also reduced redundancy. Redundancy means storing

the same information in different places. If each department stored its

own version of data, its own view of customers, then the customer name,

address, telephone number, and so on were recorded by each department.

Suppose the customer moved. Then, each department changed its data

when it found out the customer moved, and at some point the customer’s

address could be stored inconsistently: The accounting department had

one address, the sales department had another. The root problem here is

the lack of data sharing, and sharing was a central goal of the early database

systems.

The early database software evolved into two main data models: the

hierarchical and the network models. Although the relational model for

a database was recognized as a desirable technique in the early 1970s, the

relational model was treated as a really good theoretical technique for

which computers were not fast enough to implement.

20 • Database Design Using Entity-Relationship Diagrams

The database models (hierarchical, network, and relational) were logical

models—ways of logically perceiving the arrangement of data in a file structure.

One perceived how the data was to be logically stored, and the database

physically implemented the logical model. As we shall see, there is a close relationship

between the logical and physical implementations of the hierarchical

and network models. Since there were no practical relational implementations

on other than what was then supercomputers at research centers, the world of

commercial databases in the 1970s involved choosing between the hierarchical

and network models. The next sections give a little insight into each of

these three main models and an introduction to the relational model.

CHECKPOINT 2.2

1. What is a sequential file?

2. What is COBOL?

3. Why is ordering important in a sequential filing system?

4. What is a database program?

5. In the early days, how was data put into a file?

2.4 DATABASE MODELS

We now take a look back at database models as they were before the relational

database was practical. The look back shows why the “old systems”

are considered obsolete and why the relational model is the de facto standard

in databases today. The old systems were classified as two main types:

hierarchical and network. These two models defined a database before the

1980s. Although they might be considered “old fashioned,” there are systems

still in use today that depend on these models.

2.4.1 The Hierarchical Model

In hierarchical database models, all data are arranged in a top-down

fashion in which some records have one or more “dependent” or “child”

records, and each child record is tied to one and only one “parent.” The

parent-child relationship is not meant to infer a human familial relationship.

The terms parent and child are historical and are meant to conjure

up a picture of one type of data as dependent on another. As is illustrated

here, the child will be sports played by a parent-person. Another terminology

for the parent-child relationship is owner and objects owned, but

parent-child is more common.

Data and Data Models • 21

In this section, we present some versions of the hierarchical model for

several reasons: (a) to illustrate how older models were constructed from

file systems; (b) to show why file-based databases became outdated when

relational databases became practical; and (c) to see the evolution of filebased

systems.

We begin with an example of a hierarchical file situation. Suppose you

have a database of people who play a sport at some location. A sample

record might consist of Brenda who plays tennis at city courts and who

plays golf at the municipal links. The person, Brenda, would be at the top

of the hierarchy, and the sport-location would be the second layer of the

hierarchy. Usually, the connection between the layers of the hierarchy is

referred to as a parent-child relationship.

Each parent person may be related to many child sport-locations, but

each sport-location (each child) is tied back to the one person (one parent)

it “is played by.” You may be thinking, “Suppose the same sport is played

by two people at the same location?” That is a good question, but it is not

a situation that can be handled by hierarchical file systems in a direct way.

If, for example, Richard also plays golf at the municipal links, then golf at

municipal links will be recorded twice as a child for each parent. If the

same sport-location occurs, then it is recorded redundantly.

The one person, many sports tie may be called a one-to-many relationship.

Some more data might look like the following, with a person’s name

followed by the sport and location:

The model:

Person: sport, location; sport, location; sport,

location ...

The data itself:

Brenda: tennis, city courts; golf, municipal links

Richard: golf, municipal links; snorkeling, Pensacola

bay; running, UWF track

Abbie: downhill skiing, Ski Beech

Now this data could be stored in a database in many ways. You could

imagine information on each person on a 3 × 5 card with the person,s

data. You could also store the data on separate cards in a sequential file

like this:

Brenda

tennis, city courts

golf, municipal links

22 • Database Design Using Entity-Relationship Diagrams

Richard

golf, municipal links

snorkeling, Pensacola bay

running, UWF track

Abbie

downhill skiing, Ski Beech

The first thing you notice is that there is a series of text records (or 3 ×

5 cards), but it is hard to distinguish between person and sports without

semantics (you can tell the difference between person and sports by looking

at the entry, but the computer cannot do this unless you give the line of text

a meaning: semantics). For the computer, it is helpful to identify whether a

record is a person (P) or a sport (S). Hence, the sequential file could be

P Brenda

S tennis, city courts

S golf, municipal links

P Richard

S golf, municipal links

S snorkeling, Pensacola bay

S running, UWF track

P Abbie

S downhill skiing, Ski Beech

In actuality, this sequential file would be useful only if a person wanted

to list all the people followed by the sport-locations for that person.

Furthermore, the relationship of person to sport-location is one of position

in the file. The arrangement was sometimes called a header-trailer format;

persons are header records, sports are trailer records. If one wanted to

know all the people who played golf, they would have to read the entire file

and pick out those individuals. A better organization of the data would

be to have two files, one for person, one file for sport-locations. For the

two-file model to make sense (i.e., to have the files “related”), there would

have to be pointers or references of some kind from one file to the other.

One way to implement a pointer scheme would be a pointer from the sport

(child) to the person (parent) like this:

PERSON(record_number, name)

With data:

1. Brenda

2. Richard

3. Abbie

Data and Data Models • 23

SPORTS(sport, location, reference to person)

With data:

tennis, city courts, 1

golf, municipal links,1

golf, municipal links, 2

snorkeling, Pensacola bay, 2

running, UWF track, 2

downhill skiing, Ski Beech, 3

A diagram of this relationship is shown in Figure 2.1.

The actual location of the records on a disk could be arbitrary. The sense

of the data is not lost if the locations are disk addresses and if you imagine

that there are thousands of persons and sports:

PERSON(record address, name)

With data:

A45C. Brenda

C333. Abbie

B259. Richard

SPORTS(sport, location, reference to person)

With data:

golf, municipal links, B259

running, UWF track, B259

downhill skiing, Ski Beech, C333

snorkeling, Pensacola bay, B259

tennis, city courts, A45C

golf, municipal links, A45C

This system has a parent-child link. Here, we assume that the primary

key of the person file is the record_number. The “reference to person” in

the sports file refers to the primary key of the person file and is called a

Person (Parent record)

Sport Sport Sport

(Child Records)

FIGURE 2.1

A hierarchy of persons and sports with parent pointers.

24 • Database Design Using Entity-Relationship Diagrams

foreign key (FK) because it is a primary key of another file. The FK references

a primary key, hence completing the essential relationship of one

file to the other. If there were no relationship, then you would have two

independent files with no connection—the system would make no sense.

While we have established the relationship of child to parent in the

discussion, the database has some drawbacks. To answer a question like,

“Who plays golf at municipal links?” you start looking at the Sports file,

look for “golf at municipal links,” and see what parent records there are. If

your question were, “What sports does Richard play?” you would first find

Richard in the Person file, then look through the Sports file to find links

back to Richard.

If you were actually implementing this model, you could enhance the

system a bit. An improvement to this model would be to reference each

sport from within each parent record (let us go back to simple numbers

for this):

PERSON(record address, name, (sport address))

With data:

1. Brenda, (101, 102)

2. Richard (103, 104, 105)

3. Abbie (106)

SPORTS(sport, location, reference to person)

With data:

101, tennis, city courts, 1

102, golf, municipal links, 1

103, golf, municipal links, 2

104, snorkeling, Pensacola bay, 2

105, running, UWF track, 2

106, downhill skiing, Ski Beech, 3

Figure 2.2 depicts the relationship between parent and child records in

both directions. When viewed from the parent, this link is called a multiple-

child pointer (MCP). When viewed from the child the link is called a

parent pointer.

In this model, the relationship between parent and child records is done

two ways. The “reference to person” in the Sports file is a link (the FK) from

child to parent. The reference to multiple children in the parent records is

called an MCP scheme. While it is true that the MCP is redundant to the

relationship, it does two practical things: (a) It allows questions to be asked

Data and Data Models • 25

of the system that are easier to answer than with just parent pointers; and

(b) it allows the file system to be audited with a computer program. For

example, if you want to ask the system, “How many sports does Richard

play?” you need only look at the person file and count MCP references.

We have illustrated two ways to construct a relationship between two

files: the FK parent pointer and the MCP. Both of the linking techniques

are viable relationship implementations. The second one, the MCP system,

was actually implemented more often than the FK system. This is perhaps

because from a hierarchical standpoint it is easier to see the hierarchy.

You could implement either (or both) relationships to make this a hierarchical

database. Please remember that this illustration is just skeletal. In

real life, there would likely be far more fields in the records for the person

and for the sport or location. Also in real databases, one expects thousands,

if not millions, of records. If you could implement the relationship

between person and sports only one way, which of these systems (MCP or

FK) would work best if there were 1 million person-sport combinations?

The answer to this would likely depend on how many sports each person

played and how variable the number of sports was. For example, if no person

played more than five sports, the MCP system would work quite well.

If, on the other hand, the number of sports played was highly variable

(one person plays 1 or 2 sports, and some play 100), then the FK system

might be better because the FK system only keeps a parent reference, and

the MCP system with this high variability of pointers in the parent might

become cumbersome.

2.4.1.1 The Hierarchical Model with a Linked List

Having used the MCP/FK system to construct a hierarchical set of data,

we now present a second hierarchical model. The MCP system has some

drawbacks regardless of which direction one approaches the relationship

Person (Parent record)

Sport Sport Sport

(Child Records)

FIGURE 2.2

A hierarchy of persons and sports with parent and child pointers.

26 • Database Design Using Entity-Relationship Diagrams

(as MCP, FK, or both). For example, suppose you implemented the system

with the child pointers in the parent record (MCP), and there were more

child records than were planned. Suppose the system was not person and

sports, but students at a school and absences. The student would be the

parent, and the absence records would be the child. Now, suppose you

designed this system with an MCP relationship so that a student could

have up to 20 absences. What happens when the student is absent for the

21st time? One of two things has to happen: (a) The MCP system would

have to be modified to include some kind of overflow; or (b) some other

hierarchical system would have to be used. We could, of course, implement

this as an FK system and ignore the MCP part entirely. Or, we could

implement a different system, and that is what we illustrate now.

The following file system uses a “linked list” or “chain” system to implement

the relationship between parent and child. This system is similar to

both of the schemes discussed. With the same data, the records would set

up like this:

Parent (link to 1st child)

and within the child records:

Child (link to next child)

Here is the data from above with this type of arrangement:

PERSON(record address, name, first sport)

With data:

1. Brenda (101)

2. Richard (103)

3. Abbie (106)

SPORTS(sport, location, link to next sport for that person)

With data:

101, tennis, city courts, 102

102, golf, municipal links, 999

103, golf, municipal links, 104

104, snorkeling, Pensacola bay, 105

105, running, UWF track, 999

106, downhill skiing, Ski Beech, 999

Here, 999 means “no next link.”

Figure 2.3 illustrates a linked list parent-to-child pointing scheme. In

this system, we have a link from parent to child that we did not have with

the FK system alone. Furthermore, the records in both the parent and the

Data and Data Models • 27

child are uniform. Both the parent and the child records contain only one

pointer. Also in this system, it would not matter whether a person played

1 sport or 100—the system works well if the number of child records is

unknown or highly variable. If you would argue that finding a specific

child record among 200,000 sport records might be time consuming, you

are 100% correct. If you argue that the system is fragile and that if one link

is lost the whole thing goes down, you are correct again. While this linked

list system may look somewhat fragile, it formed the basis for several of

the most successful commercial databases. As you might expect, enhancements

to the basic linked list included things like direct links back to the

parent, forward and backward links, links that skipped along the chain in

one direction and not in the other (coral rings). All these enhancements

were the fodder of databases in the 1970s.

2.4.1.2 Relationship Terminology

Having seen how to implement relationships in hierarchical databases,

we need to tighten some language about how relationships are formed.

Relationships in all database models have what are called structural

constraints. A structural constraint consists of two notions: cardinality

and optionality. Cardinality is a description of how many of one record

type relate to the other and vice versa. Suppose we implement a database

about the personnel of a company. We have an employee (parent) and the

employee’s dependents (child). In our company, if an employee can have

multiple dependents and the dependent can have only one employee parent,

we would say the cardinality of the relationship is one to many: One

employee relates to many dependents, abbreviated 1:M. If the company

were such that employees might have multiple dependents and a dependent

might be claimed by more than one employee, then the cardinality

Person

(Parent record)

Sport Sport Sport (Child Records)

FIGURE 2.3

A hierarchy of persons and sports with a parent-to-child linked list.

28 • Database Design Using Entity-Relationship Diagrams

would be many to many: Many employees relate to many dependents,

abbreviated M:N (as in M of one side relates to N on the other side, and M

and N are generally not equal).

Optionality refers to whether one record may or must have a corresponding

record in the other file. If the employee may or may not have dependents,

then the optionality of the employee-to-dependent relationship is

optional or partial. If the dependents must be “related to” employee(s) (and

every employee must have dependents), then the optionality of dependent

to employee is mandatory or full.

Further, relationships are always stated in both directions in a database

description. We would say

Employees may have zero or more dependents,

and

Dependents must be associated with one and only one employee.

Note the employee-to-dependent, one-to-many cardinality and the

optional/mandatory nature of the relationship. We return to this language,

but it is easy to see that the way a database is designed depends on

the description of it. If the description is clear and unambiguous, then the

likely outcome of database design is far more predictable.

2.4.1.3 Drawbacks of the Hierarchical Model

All relationships between records in a hierarchical model have a cardinality

of one to many or one to one, but never many to one or many to many.

So, for a hierarchical model of employee and dependent, we can only have

the employee-to-dependent relationship as one to many or one to one; an

employee may have zero or more dependents. In the hierarchical model,

you could not have dependents with multiple parent-employees (unless we

introduce redundant recording of dependents).

As we illustrated, the original way hierarchical databases were implemented

involved choosing some way of physically “connecting” the parent

and the child records. Imagine you have looked up information on an

employee in an employee filing cabinet and you want to find the dependent

records for that employee in a dependent filing cabinet. One way to implement

the employee-dependent relationship would be to have an employee

record point to a dependent record and have that dependent record point

to the next dependent (a linked list of child records). For example, you

find employee Jones. In Jones’s record there is a notation that Jones’s first

Data and Data Models • 29

dependent is found in the dependent filing cabinet, file drawer 2, record

17. The “file drawer 2, record 17” is called a pointer and is the connection

or relationship between the employee and the dependent. Now, to take

this example further, suppose the record of the dependent in file drawer

2, record 17, points to the next dependent in file drawer 3, record 38, then

that dependent points to the next dependent in file drawer 1, record 82.

As we pointed out in the person-sports model, the linked list approach

to connecting parent and child records has advantages and disadvantages.

For example, some advantages would be that each employee has to maintain

only one pointer, and that the size of the linked list of dependents

is theoretically unbounded. Drawbacks would include the fragility of

the system in that if one dependent record is destroyed, then the chain

is broken. Further, if you wanted information about only one of the child

records, you might have to look through many records before you found

the one you were looking for.

There are, of course, several other ways of implementing the parentchild

link. The point here is that some system has to be chosen to be implemented

in the underlying database software. Once the linking system is

chosen, it is fixed by the software implementation; the way the link is done

has to be used to link all child records to parents regardless of how inefficient

it might be for one situation.

There are three major drawbacks to the hierarchical model:

1. Not all situations fall into the one-to-many, parent-child format.

2. The choice of the way in which the files are linked has an impact on

performance both positively and negatively.

3. The linking of parents and child records is done physically. If the

dependent file were reorganized, then all pointers would have to be

reset.

2.5 THE NETWORK MODEL

Each of the methods presented for the hierarchical database has advantages

and disadvantages. Since the network model allows M:N relationships,

we want to implement a system of pointers for multiple parents for

each child record. How would this be handled? You would most likely

have to have multiple child-forward links with either a linked list or an

MCP system in the parent, parent pointers in the child records (perhaps

30 • Database Design Using Entity-Relationship Diagrams

more than one), and possibly some other enhancement to a pointer

scheme. The network model alleviated a major concern of the hierarchical

model because in the network model, one was not restricted to having

one parent per child; a many-to-many relationship or a many-to-one

relationship was acceptable. To give an example of a network approach,

let us revisit the Person-Sports example but now allow a sports record

to be connected to more than one person. Here is some sample data like

those presented with more persons and more sports and using an MCP

system in both directions:

First, the data:

Brenda: tennis, city courts; golf, municipal links

Richard: golf, municipal links; snorkeling, Pensacola

bay; running, UWF track

Abbie: downhill skiing, Ski Beech

David: snorkeling, Pensacola bay; golf, municipal

links

Kaitlyn: curling, Joe’s skating rink; downhill skiing,

Ski Beech

Chrissy: cheerleading, Mountain Breeze High; running,

UWF track

Now, suppose we diagram this conglomeration of data with pointers (we

use record numbers in each “file”):

PERSON(record_number, name, (sports))

SPORTS(record_number, sport, location, (who plays))

In each case, the part of the file description in parentheses that looks like

(who plays) is called a repeating group, meaning that it can have multiple

values. Our small, albeit perplexing, database looks like this:

PERSON (record_number, name, (sports))

With data:

1. Kaitlyn (107, 106)

2. Abbie (106)

3. Brenda (102, 101)

4. Chrissy (108, 105)

5. Richard (103, 104, 105)

6. David (104, 102)

SPORTS (record_number, sport, location, (who plays))

With data:

Data and Data Models • 31

101 tennis, city courts (3)

102 golf, municipal links (3, 6)

103 golf, municipal links (5)

104 snorkeling, Pensacola bay (5, 6)

105 running, UWF track (4, 5)

106 downhill skiing, Ski Beech (2, 1)

107 curling, Joe’s skating rink (1)

108 cheerleading, Mountain Breeze High (4)

The network with pointers in both directions is illustrated in Figure 2.4.

The complexity of the network database is exponentially greater than

that of the hierarchical one. The database just illustrated could have been

implemented as a series of linked child/parents or some other combination

of links and pointers. The second and third drawbacks of hierarchical

databases spill over to network databases. If one were to design a file-based

database system from scratch, one would have to choose some method of

physically connecting or linking records. This choice of record connection

then locks us into the same problem as before: a hardware-implemented

connection that has an impact on performance both positively and negatively.

Further, as the database becomes more complicated, the paths of

connections and the maintenance problems become exponentially more

difficult to manage.

As a project, you could create the Person-Sports database with a few more

records than given in the example using linked lists. At first, you might

think this to be a daunting exercise, but one of the most popular database

systems in the 1970s used a variant of this system. The parent and child

records were all linked with linked lists going in two directions—forward

and backward. The forward/backward idea was to speed up finding child

records such that one could search for children by going either way. An

enhancement of this system would be to use forward-pointing linked lists

with a backward-pointing chain of links, but with the backward chain

PersonA PersonB

SportW SportX SportY SportZ

FIGURE 2.4

A network of persons and sports with MCP and parent pointers.

32 • Database Design Using Entity-Relationship Diagrams

skipping every n records, where the optimal n turns out to be the square

root of the number of entries in the chain.

2.6 THE RELATIONAL MODEL

Codd (1970) introduced the relational model to describe a database that

did not suffer the drawbacks of the hierarchical and network models (i.e.,

physical links and hardware-bound restrictions). Codd’s premise was that

if we ignore the way data files are connected and arrange our data into simple

two-dimensional, unordered tables, then we can develop a calculus for

queries (questions posed to the database), and we can focus on the data as

data, not as a physical realization of a logical model. Codd’s idea was truly

logical in that one was no longer concerned with how data was physically

stored. Rather, data sets were simply unordered, two-dimensional tables

of data. To arrive at a workable way of deciding which pieces of data went

into which table, Codd proposed “normal forms.” To understand normal

forms, we must first introduce the notion of functional dependencies. After

we understand functional dependences, normal forms follow.

As a historical note, when Codd introduced his relational model, it

was deemed by most people as “theoretical only.” At the time, a typical

mainframe computer might have 64K of internal memory, and a good

set of hard disks might have as much as several megabytes of storage. To

top that off, the computer typically took up a large room that required

separate air handling, special architectural features like raised floors,

and enhanced power grids. Remember that all the “computing power”

was shared by everyone who had to have computing. In a company, users

might include accounting, purchasing, personnel, finance, and so on.

For even one of those units to be able to run the relational model in the

early 1970s would have required vast dedicated resources. As computers

became more ubiquitous, less expensive, smaller, and so on, the amount

of memory available both internally and externally became cheaper and

had far greater capacity. The relational model “grew up” with the evolution

of computers in the 1980s. We expand the notion of relational database

in the next chapter.

Data and Data Models • 33

CHECKPOINT 2.3

1. What are the three main data models we have discussed?

2. Which data model is mostly used today? Why?

3. What are some of the disadvantages of the hierarchical data model?

4. What are some of the disadvantages of the network data model?

5. How are all relationships (mainly the cardinalities) described in the hierarchical

data model? How can these be a disadvantage of the hierarchical

data model?

6. How are all relationships (mainly the cardinalities) described in the

network data model? Would you treat these as advantages or disadvantages

of the network data model?

7. What are structural constraints?

8. Why was Codd’s promise of the relational model better?

2.7 CHAPTER SUMMARY

In this chapter, we covered concepts that are essential to the understanding

and design of a database. We also covered data models from a historical

perspective—the hierarchical and network models and then introduction

of the relational model. This chapter should serve more as background to

the material for the rest of the book.

BIBLIOGRAPHY

Codd, E. F. 1970. A relational model of data for large shared data banks. Communications

of the ACM, 13(6): 377–387.

MaxiCompu Fagaad

lunes, 7 de agosto de 2017

Data and Data Models

No hay comentarios.:

Publicar un comentario