Joe Celko's SQL Programming Style (The Morgan Kaufmann Series in Data Management Systems)
Contents:

1. Names and Data Elements
2. Fonts, Punctuation, and Spacing
3. Data Declaration Language
4. Scales and Measurements
5. Data Encoding Schemes
6. Coding Choices
A "Manual of Style" for the SQL programmer, this book is a collection of heuristics, rules, tips, and tricks that will help you improve your SQL programming style and proficiency and write portable, readable, maintainable SQL code. It will help you write Standard SQL without an accent or a dialect carried over from another programming language or a specific flavor of SQL, code that can be maintained and used by other people.
It will enable you to give your group a coding standard for internal use, so that programmers work in a consistent style, and give you the mental tools to approach a new problem with SQL as your tool, rather than another programming language, one that someone else might not know!
Feb 24, Alex French rated it did not like it. I like some of the underlying ideas in this book. I may not be qualified to quibble about some of the ideas that I don't agree with, but I do feel qualified to quibble about some of the topical coverage, and certainly about the presentation of the material.
I found the book frustrating enough that I started making negative notes only a few pages in. On coverage (note the edition I read): even if this is Celko's background, I would appreciate some intelligent discussion acknowledging that this isn't always the case. Maybe this shouldn't be surprising, since the book doesn't even attempt to establish a general progression of ideas or story arc.
This seems irresponsible, even with the general admonition. For a book that is about programming style, which is fundamentally about the organization of information at all scales (typographical, conceptual, and everything in between), this book just sucks. The presentation details seriously impact the ability to absorb the material: simple ideas become difficult to understand, and difficult ideas become impossible.
Especially difficult when the thing being illustrated is indentation! It is not clear whether he asked for permission before calling out some forum posters by name.

For your data integrity, you can:

1. Ignore the problem. This is actually what most newbies do. When the database becomes a mess without any data integrity, they move on to the second solution.

2. Write elaborate ad hoc CHECK constraints with user-defined functions or proprietary bit-level library functions that cannot port and that run like cold glue. Now we add a seventh condition to the vector: which end does it go on? How did you get it in the right place on all the possible hardware that it will ever use? Did the code that references a bit in a word by its position still do it right after the change?

3. Sit down and think about how to design an encoding of the data that is high level, general enough to expand, abstract, and portable. For example, is that loan approval a hierarchical code? A vector code? It is not easy to design such things!
Very, very special circumstances where there is no alternative at the present time might excuse the use of proprietary data types. Next, consider porting a proprietary data type by building a user-defined distinct type that matches it. This is not always possible, so check your product.

Having the key as the first thing you read in a table declaration gives you important information about the nature of the table and how you will find the entities in it.
In the case of a compound primary key, the columns that make up the key might not fit nicely into the next rule. If this is the case, then put a comment by each component of the primary key to make it easier to find.
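A minimal sketch of this convention (the table and column names here are hypothetical):

```sql
CREATE TABLE OrderItems
(order_nbr INTEGER NOT NULL,  -- component of the compound PRIMARY KEY
 sku       CHAR(13) NOT NULL, -- component of the compound PRIMARY KEY
 order_qty INTEGER NOT NULL
   CONSTRAINT positive_order_qty CHECK (order_qty > 0),
 PRIMARY KEY (order_nbr, sku));
```

The comments let a reader scanning the column list spot the key components without hunting for the PRIMARY KEY clause at the bottom of the declaration.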
The physical order of the columns within a table is not supposed to matter in the relational model. Columns are identified by their names, not their ordinal positions, but SQL does have ordinal positions for columns in tables in default situations. For example, the columns for an address are best put in their expected order. Thanks to columns being added after the schema is in place, you might not be able to arrange the table as you would like in your SQL product.
Check to see if your product allows column reordering. If you have a physical implementation that uses the column ordering in some special way, you need to take advantage of it. In some products, if a change does not cause a variable-length row to change size, the engine logs only from the first byte changed to the last byte changed. The DBA can take advantage of this knowledge to optimize performance with careful column placement. Because the log can be a significant bottleneck for performance, this approach is handy.
You can always create the table and then create a view for use by developers that resequences the columns into the logical order, if that is important. The standard does not require that constraints appear together in any particular order. The constraint name will show up in error messages when it is violated, which gives you the ability to create meaningful messages and easily locate the errors. The exception is that Oracle will use the system-generated name when it displays execution plans.
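For example, a constraint name chosen to read like a diagnostic (the table and the salary range are hypothetical):

```sql
CREATE TABLE Personnel
(emp_id     INTEGER NOT NULL PRIMARY KEY,
 emp_name   CHAR(35) NOT NULL,
 salary_amt DECIMAL(12,2) NOT NULL
   CONSTRAINT salary_amt_in_valid_range
   CHECK (salary_amt BETWEEN 0.00 AND 1000000.00));
```

An INSERT with a negative salary is rejected with a message naming salary_amt_in_valid_range, which tells the user both what went wrong and where to look.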
You can leave off constraint names during development work. We want as much information about a column as possible attached to that column. Having to look in several places for the definition of a column can only cost us time and accuracy. Likewise, put multicolumn constraints as near to the columns involved as is reasonable.
Multicolumn constraints on columns that are far apart should be moved to the end of the table declaration. This will give you one place to look for the more complex constraints, rather than trying to look all over the DDL statement. It can also be argued that none of this really matters, because most of the time we should be going to the schema information tables to retrieve the constraint definitions, not the DDL.
Constraints may have been removed or added with subsequent ALTER statements, and the system catalog will have the correct, current state, whereas the DDL may not.
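A sketch of such a lookup against the INFORMATION_SCHEMA views defined by the SQL Standard (actual support and required schema qualifiers vary by product; the table name is hypothetical):

```sql
SELECT tc.constraint_name, cc.check_clause
  FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
  JOIN INFORMATION_SCHEMA.CHECK_CONSTRAINTS AS cc
    ON cc.constraint_name = tc.constraint_name
 WHERE tc.table_name = 'Personnel'
   AND tc.constraint_type = 'CHECK';
```

Unlike a DDL script in version control, this reflects every ALTER that has been applied since the table was created.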
The whole idea of a database is that it is a single trusted repository for all of the data in the enterprise. This is the place where the business rules must be enforced. The most common constraint on numbers in a data model is that they are not less than zero.
Now look at actual DDL and see how often you find that constraint. Programmers are lazy and do not bother with this level of detail. The exception is when the column really can take any value whatsoever. Again, the whole idea of a database is that it is a single trusted repository for all of the data in the enterprise.
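The constraint costs one clause (the table here is hypothetical):

```sql
CREATE TABLE Inventory
(sku         CHAR(13) NOT NULL PRIMARY KEY,
 qty_on_hand INTEGER NOT NULL
   CONSTRAINT qty_on_hand_not_negative
   CHECK (qty_on_hand >= 0));
```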
This is not as portable an option as numeric range checking, and many programmers who did not use UNIX in their youth have problems with regular expressions, but it is still important. You can ask Einstein or go back to the Greek philosopher Zeno and his famous paradoxes. Temporal values have duration, and you need to remember that they have a start and finish time, either explicitly or implicitly, that includes all of the continuum bound by them. The implicit model is a single column and the explicit model uses a pair of temporal values.
For example, when you set a due date for a payment, you usually mean any point from the start of that day up to but not including midnight of the following day. When you say an employee worked on a given date, you usually mean the event occurred during an eight-hour duration within that day. A CHECK constraint can round off time values to the start of the nearest year, month, day, hour, minute, or second as needed.
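A half-open interval predicate captures this; the sketch below assumes a hypothetical Payments table with a timestamp column. The upper bound is excluded, so midnight of the following day does not leak in:

```sql
SELECT payment_id
  FROM Payments
 WHERE paid_at >= TIMESTAMP '2025-01-15 00:00:00'
   AND paid_at <  TIMESTAMP '2025-01-16 00:00:00';
```

Writing the pair of comparisons with >= and < is safer than BETWEEN, which would wrongly include the exact upper endpoint.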
There will be exceptions for scientific and statistical data. Do not make the reader look in multiple physical locations to find all of the columns involved in a constraint. You do not have to indent such a constraint, but it is a good idea to split it over two lines. This is not always physically possible, especially when many columns are involved.
Their predicates involve the entire table as a whole rather than just single rows. This implies that they will involve aggregate functions. Their predicates involve several different tables, not just one table. This implies that they are at a higher level and should be modeled there. The assertion name acts as the constraint name.
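The text is describing assertions; here is a Standard SQL CREATE ASSERTION sketch against a hypothetical Inventory table (note that few products actually implement assertions):

```sql
CREATE ASSERTION stock_within_warehouse_capacity
CHECK (100000 >= (SELECT SUM(qty_on_hand) FROM Inventory));
```

The predicate aggregates over the whole table, which is exactly what a row-level CHECK constraint cannot do.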
Put simple CHECK constraints in their own clauses rather than writing one long constraint with multiple tests. When you give a constraint a name, that name will appear in error messages and can help the user to correct data. For example, imagine a single validation for a name that looks for correct capitalization, extra spaces, and a length over five characters. However, if there were separate checks for capitalization, extra spaces, and a length over five characters, then those constraint names would be obvious and give the user a clue as to the actual problem.
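A sketch of the difference, using a hypothetical customer_name column:

```sql
-- One opaque test under a single name:
CONSTRAINT valid_customer_name
CHECK (SUBSTRING(customer_name FROM 1 FOR 1) BETWEEN 'A' AND 'Z'
   AND customer_name NOT LIKE '%  %'
   AND CHAR_LENGTH(customer_name) > 5)

-- versus separate named tests that point the user at the actual problem:
CONSTRAINT customer_name_capitalized
  CHECK (SUBSTRING(customer_name FROM 1 FOR 1) BETWEEN 'A' AND 'Z'),
CONSTRAINT customer_name_has_no_extra_spaces
  CHECK (customer_name NOT LIKE '%  %'),
CONSTRAINT customer_name_long_enough
  CHECK (CHAR_LENGTH(customer_name) > 5)
```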
If you do not want to give details about errors to users for security reasons, then you can use a single constraint with a vague name.
This would be a strange situation. This is the very definition of a table. The problem is that many newbies do not understand what a key really is. A key must be a subset of the attributes (columns) in the table. There is no such thing as a universal, one-size-fits-all key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. God did not put a Hebrew letter-number on the bottom of everything in creation.
Here is my classification of the types of keys (Table 3). A natural key is a subset of attributes that occurs in a table and acts as a unique identifier.
The user sees them. You can go to the external reality and verify them. You would also like to have some validation rule. An artificial key is an extra attribute added to the table that is seen by the user. The open codes in the UPC scheme that a user can assign to his or her own products. The check digit still works the same way, but you have to verify the codes inside your own enterprise.
If you have to construct a key yourself, it takes time to design it, to invent a validation rule, and so forth. There is a chapter on that topic in this book. Chapter 5 discusses the design of encoding schemes. An exposed physical locator is not based on attributes in the data model and is exposed to the user. There is no way to predict it or verify it.
The system obtains a value through some physical process in the storage hardware that is totally unrelated to the logical data model. This is the worst way to program in SQL. A surrogate key is system generated to replace the actual key, behind the covers where the user never sees it. It is based on attributes in the table; examples are Teradata hashing algorithms and pointer chains. When users can get to them, they will screw up the data integrity by getting the real keys and these physical locators out of sync.
The system must maintain them. Notice that people get exposed physical locator and surrogate mixed up; they are totally different concepts.
You put tables together from attributes, with the help of a data dictionary to model entities in SQL. Fields and subfields had to be completely specified to locate the data. There are important differences between a file system and a database, a table and a file, a row and a record, and a column and a field. If you do not have a good conceptual model, you hit a ceiling and cannot get past a certain level of competency. A file system is a loose collection of files, which have a lot of redundant data in them.
A database system is a single unit that models the entire enterprise as tables, constraints, and so forth. You open an entire database, not single tables within it, but you do open individual files. An action on one file cannot affect another file unless they are in the same application program; tables can interact without your knowledge via DRI actions, triggers, and so on. The original idea of a database was to collect data in a way that avoided redundant data in too many files and not have it depend on a particular programming language.
A file is made up of records, and records are made up of fields. A file is ordered and can be accessed by a physical location, whereas a table is not.
A database is language independent; the internal SQL data types are converted into host language data types. A field exists only because of the program reading it; a column exists because it is in a table in a database. A column is independent of any host language application program that might use it. You have no idea whatsoever how a column is physically represented internally because you never see it directly.
Consider temporal data types: when you have a field, you have to worry about its physical representation. SQL says not to worry about the bits; you think of data in the abstract. Rows and columns have constraints. Records and fields can have anything in them, and often do! Talk to anyone who has tried to build a data warehouse about that problem.
Codd defined a row as a representation of a single simple fact. A record is usually a combination of a lot of facts. When the system needs new data, you add fields to the end of the records. That is how we got records that were measured in Kbytes. A checklist of desirable properties for a key is a good way to do a design inspection. There is no need to be negative all the time.
The first property is that the key be unique. This is the most basic property it can have because without uniqueness it cannot be a key by definition. Uniqueness is necessary, but not sufficient.
Uniqueness has a context. An identifier can be unique in the local database, unique in the enterprise across databases, or unique universally. We would prefer the last of those three options. We can often get universal uniqueness with industry-standard codes, and enterprise uniqueness with internal ones. An identifier that is unique only in a single database is workable but pretty much useless because it will lack the other desired properties.
The second property we want is stability or invariance. The first kind of stability is within the schema, and this applies to both key and nonkey columns. The same data element should have the same representation wherever it appears in the schema.
The same basic set of constraints should apply to it. That is, if we use the VIN as an identifier, then we can constrain it to be only for vehicles from Ford Motors; we cannot change the format of the VIN in one table and not in all others.
The next kind of stability is over time. You do not want keys changing frequently or in unpredictable ways. Contrary to a popular myth, this does not mean that keys can never change; as the scope of their context grows, they should be able to change. Consider the move from the UPC to the EAN: the reason was globalization and the erosion of American industrial domination. The EAN uses 13 digits, whereas the UPC has 12 digits, of which you see 10, broken into two groups of 5, on a label.
The Uniform Code Council, which sets the standards in North America, has the details of the conversion worked out. More than 5 billion bar-coded products are scanned every day on earth. The bar code has made data mining in retail possible and saved millions of hours of labor. Why would you make up your own code and stick labels on everything? The neo-Luddites have been with us a long time; the grocery industry works on 1 percent to 3 percent profit margins, the smallest margins of any industry that is not taking a loss.
It helps if the users know something about the data. This is not quite the same as validation, but it is related. Validation can tell you if the code is properly formed via some process; familiarity can tell you if it feels right because you know something about the context. Thus, ICD codes for disease would confuse a patient but not a medical records clerk. Can you look at the data value and tell that it is wrong, without using an external source?
Check digits and fixed format codes are one way of obtaining this validation. How do I verify a key? This also comes in context and in levels of trust. Or rather, the clerk used to believe it was me; the Kroger grocery store chain is now putting an inkless fingerprinting system in place, just like many banks have done.
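For example, the UPC-A check digit can be verified in the DDL itself. This is a sketch in Standard SQL string functions (the Products table is hypothetical, and the CAST expressions assume the all-digits test holds):

```sql
CREATE TABLE Products
(upc CHAR(12) NOT NULL PRIMARY KEY,
 CONSTRAINT upc_all_digits
   CHECK (upc NOT SIMILAR TO '%[^0-9]%'),
 -- UPC-A rule: 3 * (sum of odd positions) + (sum of even positions),
 -- and the 12th digit makes the total a multiple of 10:
 CONSTRAINT upc_check_digit
   CHECK (CAST(SUBSTRING(upc FROM 12 FOR 1) AS INTEGER)
          = MOD(10 - MOD(3 * (CAST(SUBSTRING(upc FROM 1 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 3 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 5 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 7 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 9 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 11 FOR 1) AS INTEGER))
                         + (CAST(SUBSTRING(upc FROM 2 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 4 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 6 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 8 FOR 1) AS INTEGER)
                            + CAST(SUBSTRING(upc FROM 10 FOR 1) AS INTEGER)),
                         10), 10)));
```

A mistyped digit fails the constraint at the gate, before the bad code ever reaches the data.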
There is a little less trust here. When I get a security clearance, I also need to be investigated. There is a lot less trust. A key without a verification method has no data integrity and will lead to the accumulation of bad data. A key should be as simple as possible, but no simpler. People, reports, and other systems will use the keys.
Long, complex keys are more subject to error; storing and transmitting them is not an issue anymore, the way it was 40 or 50 years ago. Consider the IBAN: a country code at the start of the string determines how to parse the rest of it, and it can be up to 34 alphanumeric characters in length. Each country has its own account numbering systems, currencies, and laws, and they seldom match. In effect, the IBAN is a local banking code hidden inside an international standard.

More and more programmers who have absolutely no database training are being told to design a database.
This magical, universal, one-size-fits-all numbering is totally nonrelational, depends on the physical state of the hardware at a particular time, and is a poor attempt at mimicking a magnetic tape file system.
They know that they need to verify the data against the reality they are modeling. A trusted external source is a good thing to have.
The reasons given for this poor programming practice are many, so let me go down the list. So what? This is the 21st century, and we have much better computers than we did in the decades when key size was a real physical issue.
What is funny to me is the number of idiots who replace a natural two- or three-integer compound key with a huge GUID, which no human being or other system can possibly understand, because they think it will be faster and easy to program. This is an implementation problem that the SQL engine can handle.
Hashing guarantees that no search requires more than two probes, no matter how large the database. A tree index requires more and more probes as the size of the database increases. A long key is not always a bad thing for performance. For example, if I use (city, state) as my key, I get a free index on just city. I can also add extra columns to the key to make it a super-key when such a superkey gives me a covering index (i.e., an index that can answer the query by itself). Sure, if I want to lose all of the advantages of an abstract data model and SQL set-oriented programming, carry extra data, and destroy the portability of code.
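A sketch of both points, with hypothetical names:

```sql
CREATE TABLE Stores
(city_name   CHAR(25) NOT NULL,
 state_code  CHAR(2)  NOT NULL,
 store_phone CHAR(12) NOT NULL,
 PRIMARY KEY (city_name, state_code));

-- Searches on city_name alone can use the leading column of the
-- primary-key index. Adding store_phone makes a covering index, so the
-- query below can be answered from the index without touching the table:
CREATE INDEX stores_city_state_phone
    ON Stores (city_name, state_code, store_phone);

SELECT state_code, store_phone
  FROM Stores
 WHERE city_name = 'Atlanta';
```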
Look at any of the newsgroups and see how difficult it is to move the various exposed physical locators in the same product. The auto-numbering features are a holdover from the early SQLs, which were based on contiguous storage file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns.
In short, just like a deck of punchcards or a magnetic tape. Most programmers still carry that mental model, too. But physically contiguous storage is only one way of building a relational database, and it is not the best one.
The basic idea of a relational database is that the user is not supposed to know how or where things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product on particular hardware at a particular time.
The first practical consideration is that auto-numbering is proprietary and nonportable, so you know that you will have maintenance problems. Newbies actually think they will never port code!
Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such a disaster that nobody else wants their application.

First, try to create a table with two columns and try to make them both auto-numbered. If you cannot declare more than one column to be of a certain data type, then that thing is not a data type at all, by definition. It is a property that belongs to the physical table, not the logical data in the table. Next, create a table with one column and make it an auto-number.
Now try to insert, update, and delete different numbers from it. If you cannot insert, update, and delete rows, then it is not really a table by definition.
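The experiment, written out in one dialect (IDENTITY as in SQL Server; other products use different syntax, which is itself part of the argument):

```sql
-- Two auto-numbered columns: rejected, so it is not really a data type.
CREATE TABLE Foo
(id1 INTEGER IDENTITY(1,1) NOT NULL,
 id2 INTEGER IDENTITY(1,1) NOT NULL);  -- error: only one IDENTITY column per table

-- One auto-numbered column: you cannot control its values.
CREATE TABLE Bar (id INTEGER IDENTITY(1,1) NOT NULL);
UPDATE Bar SET id = 42;                -- error: IDENTITY columns cannot be updated
```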
Finally, create a simple table with one hidden auto-number column and a few other columns, and use a few INSERT and DELETE statements against it. If you delete a row, the gap in the sequence is not filled in, and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in preallocated sequential files, by the way. A utility program would then pack or compress the records that were flagged as deleted or unused to move the empty space to the physical end of the physical file.
But we now use a statement with a query expression in it. The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are n! possible orderings of the n rows, so which auto-numbers do they get? The answer has been to use whatever the physical order of the result set happened to be. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order.
Can you explain from a logical model why the same rows in the second query get different auto-numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical. Using auto-numbering as a primary key is a sign that there is no data model, only an imitation of a sequential file system.
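A sketch of the scenario (IDENTITY syntax varies by product; Floob is a hypothetical source table):

```sql
CREATE TABLE Foobar
(id  INTEGER IDENTITY(1,1) NOT NULL,
 val CHAR(10) NOT NULL);

-- The whole result set arrives at once; which row gets id = 1 depends
-- on the physical order the execution plan happened to produce:
INSERT INTO Foobar (val)
SELECT val FROM Floob;

-- Re-run the same load after dropping an index or updating statistics,
-- and the same logical rows can get different auto-numbers.
```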
Because this magic, all-purpose, one-size-fits-all pseudo identifier exists only as a result of the physical state of a particular piece of hardware, at a particular time, as read by the current release of a particular database product, how do you verify that an entity has such a number in the reality you are modeling? People run into this problem when they have to rebuild their database from scratch after a disaster.
You will see newbies who design tables with an auto-number as the only key. Your data integrity is trashed: because there is no way to reconcile the auto-number and the natural key, the same real entity can appear under several different auto-numbers, and you have no data integrity.
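A sketch of the anti-pattern, with hypothetical names and a sample VIN; the auto-number is the only key, so nothing stops the same vehicle from being entered twice:

```sql
CREATE TABLE Vehicles
(vehicle_id INTEGER IDENTITY(1,1) PRIMARY KEY,  -- proprietary auto-number
 vin        CHAR(17) NOT NULL,                  -- the natural key, left unconstrained
 owner_name CHAR(35) NOT NULL);

-- Both inserts succeed; one real vehicle now has two "identities":
INSERT INTO Vehicles (vin, owner_name) VALUES ('1HGCM82633A004352', 'Smith');
INSERT INTO Vehicles (vin, owner_name) VALUES ('1HGCM82633A004352', 'Smith');
-- Declaring UNIQUE (vin) would have rejected the duplicate.
```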
Finally, an appeal to authority, with a quote from Dr. Codd. This means that a surrogate ought to act like an index: never used in queries, DRI, or anything else that a user does. There are three difficulties in employing user-controlled keys as permanent surrogates for entities. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them.
Two relations may have user-controlled keys defined on distinct domains. It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one. These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities.
A solution, proposed in part in one of his papers and more fully in a later one, is to introduce entity domains, which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them.
(Codd)

If you are using the table as a staging area for data scrubbing or some other purpose than as a database, then feel free to use any kind of proprietary feature you wish to get the data right. Today, however, you should consider using ETL and other software tools that did not exist even a few years ago.
Attribute splitting consists of taking an attribute and modeling it in more than one place in the schema, which violates Domain-key Normal Form. There are several ways to do this, discussed in the following sections. But if I were to split data by years (temporal values), by location (spatial values), or by department (organizational values), you might not see the same problem. The bad news is that constraints to prevent overlaps among the tables in the collection can be forgotten or wrong.
Do not confuse attribute splitting with a partitioned table, which is maintained by the system and appears to be a whole to the users.
The solution is to pick one scale and keep all measurements in it; see the section on scales. You will also see attempts at formatting long text columns by splitting them into several columns.
When the printout is a different width, though, you are in trouble. Another common version of this is to program dynamic domain changes in a table. That is, one column contains the domain, which is metadata, for another column, which is data. Glenn Carr posted a horrible example of having a column in a table change domain on the fly on September 29, on the SQL Server programming newsgroup. His goal was to keep football statistics; this is a simplification of his original schema design.
I have removed about a dozen other errors in design so that we can concentrate on just the shifting-domain problem. Here is a rewrite. If reusing jersey numbers is a problem, then I am sure that the leagues have some standard in their industry for this, and I am sure that it is not an auto-incremented number that was set by the hardware on Mr. Carr's server.
The hardest part of the code is avoiding a division by zero in a calculation; I leave this as an exercise to the reader. This is not really an exception. You can use a column to change the scale, but not the domain, used in another column. For example, I record temperatures in degrees Absolute, Celsius, or Fahrenheit and put the standard abbreviation code in another column. I also want people to be able to update through views in the units their equipment gives them.
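A sketch of the scale-conversion idea (the table, names, and single-letter unit codes are hypothetical):

```sql
CREATE TABLE Readings
(reading_id INTEGER NOT NULL PRIMARY KEY,
 temp_value DECIMAL(6,2) NOT NULL,
 temp_unit  CHAR(1) NOT NULL
   CONSTRAINT known_temp_unit CHECK (temp_unit IN ('K', 'C', 'F')));

-- One view presents every reading in Celsius, whatever scale it was
-- recorded in; the domain (temperature) never changes, only the scale:
CREATE VIEW CelsiusReadings (reading_id, temp_celsius)
AS SELECT reading_id,
          CASE temp_unit
            WHEN 'C' THEN temp_value
            WHEN 'K' THEN temp_value - 273.15
            WHEN 'F' THEN (temp_value - 32.0) * 5.0 / 9.0
          END
     FROM Readings;
```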
A more complex example would be the use of the ISO currency codes with a decimal amount in a database that keeps international transactions. The domain is constant; the second column is always currency, never shoe size or body temperature.
Euros, Yen, Dollars, or whatever. But now there is a time element, because exchange rates change constantly. This is not an easy problem. The classic example of splitting is temporal, such as a list of events split up by time period. These are simply bad schema designs that often result from confusing the physical representation of the data with the logical model. This tends to be done by older programmers carrying old habits over from file systems.
For example, in the old days of magnetic tape files, the tapes were dated and processing was based on the one-to-one correspondence between time and a physical file. Another source of these errors is mimicking paper forms or input screens directly in the DDL. The most common is an order detail table that includes a line number because the paper form or screen for the order has a line number. Customers buy products that are identified in the inventory database by SKU, UPC, or other codes, not a physical line number on a form on the front of the application.
But the programmer splits the quantity attribute into multiple rows. We had Mount Rushmore and Bjarne Stroustrup as special attractions. Asked about using OO design for databases, his answer was that Bell Labs, with all its talent, had tried four different approaches to this problem and came to the conclusion that you should not do it. OO was great for programming but deadly for data. A table represents a set of entities or a relationship among them. GUIDs, auto-numbering, and all of those proprietary exposed physical locators will not work in the long run.
Every typo becomes a new attribute or class; queries that would have been so easy in a relational model are now multitable monster outer joins; redundancy grows at an exponential rate; constraints are virtually impossible to write, so you can kiss data integrity goodbye; and so on. The amount of gymnastics that I need to go through to do what should be the simplest query is unimaginable. It took six man-hours (me and one of the OO developers for three hours each) to come up with the equivalent of one simple query. The final query was almost a full page long and required the joining of all the various tables for each data element (as each data element is now an object, and each object has its own attributes, so it requires its own table), and of course the monster object-linking tables, so as to obtain the correct instance of each object.
By the way, which instance is the correct one? Why, the latest one, of course, unless it is marked as not being the one to use, in which case look for the one that is so marked. And the marking indicator is not always the same value, as there are several potential values.
These object-linking tables are the biggest in the entire database. Self-joins are needed in some cases; there are two of these monster tables, and a few smaller ones. The idea is that you have one huge table with three columns of metadata: entity name, attribute name, and attribute value. This lets your users invent new entities as they use the database.
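A sketch of that design (names hypothetical); note how everything collapses into strings:

```sql
CREATE TABLE ObjectValues
(entity_name     VARCHAR(100) NOT NULL,
 attribute_name  VARCHAR(100) NOT NULL,
 attribute_value VARCHAR(100) NOT NULL,  -- dates, money, and jersey numbers all land here
 PRIMARY KEY (entity_name, attribute_name));
```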
Now try to put a constraint on the column. There are better tools for collecting free-form data.

Most programmers have never heard of measurement theory or thought about the best way to represent their data. Although this topic is not specifically about SQL style, it gives a foundation for decisions that have to be made in the design of any schema. Measurements are not the same as the attribute being measured. Measurement is not so much assigning numbers to things or their attributes as it is assigning to things a structural property that can be expressed in numbers or other computable symbols.
This structure is the scale used to take the measurement; the numbers or symbols represent units of measure. Strange as it might seem, measurement theory came from psychology, not mathematics or computer science; in particular, S. S. Stevens did pioneering work on the classification of scales. Scales are classified into types by the properties they do or do not have. The properties with which we are concerned are the following: a natural origin point on the scale.
This is sometimes called a zero, but it does not have to be literally a numeric zero. For example, if the measurement is the distance between objects, the natural zero is zero meters—you cannot get any closer than that. If the measurement is the temperature of objects, the natural zero is zero degrees Kelvin—nothing can get any colder than absolute zero.
However, consider time: It goes from an eternal past into an eternal future, so you cannot find a natural origin for it. Meaningful operations can be performed on the units. It makes sense to add weights together to get a new weight. However, adding names or shoe sizes together is absurd. A natural ordering of the units.
It makes sense to speak about an event occurring before or after another event, or a thing being heavier, longer, or hotter than another thing, but the alphabetical order imposed on a list of names is arbitrary, not natural—a foreign language, with different names for the same objects, would impose another ordering.
A natural metric function on the units. Metric functions have the following three properties:

1. The metric between an object and itself is the natural origin of the scale: M(a, a) = 0.
2. The order of the objects in the metric function does not matter: M(a, b) = M(b, a).
3. There is a natural ordering in the metric: M(a, b) + M(b, c) >= M(a, c).

This notation is meant to be more general than just arithmetic. The zero in the first property is the origin of the scale, not just a numeric zero. The third property is defined with a plus and a greater-than-or-equal sign. The greater-than-or-equal sign refers to a natural ordering on the attribute being measured.
The plus sign refers to a meaningful operation in regard to that ordering, not just arithmetic addition. The special case of the third property, where the greater-than-or-equal is always strictly greater-than, is desirable to people because it means that they can use numbers for units and do simple arithmetic with the scales. This is called a strong metric property. For example, human perceptions of sound and light intensity follow a cube root law; that is, if you double the intensity of light, the perception of the intensity increases by only 20 percent (Stevens). Knowing this, designers of stereo equipment use controls that work on a logarithmic scale internally but that show evenly spaced marks on the control panel of the amplifier.
It is possible to have a scale that has any combination of the metric properties. For example, instead of measuring the distance between two places in meters, measure it in units of effort. This is the old Chinese system, which had uphill and downhill units of distance. It takes no effort to get to where you already are located. It takes less effort to go downhill than to go uphill. The amount of effort needed to go directly to a place will always be less than the effort of making another stop along the way.
Because we have to store data in a database within certain limits, these properties are important to a database designer.