Let be sets. An n-ary relation on these sets is a subset of . The sets are called the domains of the relation, and is called its degree.
For example, let bbe the relation consisting of triples , where , , and are integers with . Then , but . The degree of the relation is . Its domains are all equal to the set of natural numbers.
The relational data model is based on the concept of a relation.
A database consists of records, which are n-tuples, made up of fields. The fields are the entries of the n-tuples. The relational data model represents a database of records as an n-ary relation.
For instance, a database of student records may be made up of fields containing the name, student number, major, and gpa of the student. Student records are represented as 4-tuples of the form (Student_name, ID_number, Major, GPA).
Relations used to represent databases are called tables. Each column of the table corresponds to an attribute of the database. A table of student records discussed above is shown below..
A domain of an n-ary relation is called a primary key when the value of the n-tuple from this domain determines the n-tuple. That is, a domain is a primary key when no two n-tuples in the relation have the same value for this domain.
The current collection of n-tuples in a relation is called the extension of the relation. The more permanent part of a database, including the name and attributes of the database, is called its intension.
When selecting a primary key, the goal should be to select a key that can serve as a primary key for all possible extensions of the database. To do this, it is necessary to examine the intension of the database to understand the set of possible n-tuples that can occur in an extension.
Combinations of domains can also uniquely identify n-tuples in an n-ary relation. When the values of a set of domains determine an n-tuple in a relation, the Cartesian product of these domains is called a composite key.
Let be an n-ary relation and a condition that elements in must satisfy. Then the selection operator maps the n-ary relation to the n-ary relation of all n-tuples from that satisfy the condition .
Using our student records table above, we could use the operator where is the condition Major="Computer Science". The result is the two 4 tuples containing data on the students Ackermann and Chou.
The projection where , maps the n-tuple to the m-tuple , where .
In other words, the projection deletes of the components of an n-tuple, leaving the th, th, , and th components.
For example, taking the projection on the 4-tuples and (Jane Doe, 234111001, Geography, 3.14) results in and (Jane Doe, Geography) respectively.
Used to combine two tables into one when these tables share some identical fields. For instance, a table containing fields for airline, flight number, and gate, and another table containing fields for flight number, gate, and departure time can be combined into a table containing fields for airline, flight number, gate, and departure time.
Let be a relation of degree and a relation of degree . The join where and , is a relation of degree that consists of all -tuples , where the -tuple belongs to and the n-tuple belongs to .
In other words, the join operator produces a new relation from two relations by combining all m-tuples of the first relation with all n-tuples of the second relation, where the last components of the agree with the first components of the n-tuples.
We will use supermarket transactions as an example domain for discussing concepts in data mining. We will use records from a database to address the question: How likely is it that a customer buys a product given that they also buy a collection of one or more specified products?
A transaction is a set of items bought by a customer during a visit to the store, such as { milk, eggs, bread } or {orange juice, bananas, yogurt, cream}.
Each product in the store is an item. A collection of items is an itemset (or basket or transaction in this context). A k-itemset is an itemset that contains exactly items. When a store has items for sale , each transaction can be represented by an n-tuple , where is a binary variable that tells us whether occurs in this transaction. We can represent a transaction by an -tuple of the form (transaction number, . The collection of these -tuples forms a database of transactions.
The count of an itemset , denoted by , in a set of transactions , where is a positive integer, is the number of transactions that contain this itemset. That is,
The support of an itemset is the probabilty that is included ina randomly selected transaction from . That is,
The support threshhold is specified for a particular application. Frequent itemset mining is the process of finding itemsets with support greater than or equal to . Such itemsets are said to be frequent.
If is a set of items and is a set of transactions, an association rule is an implication of the form , where and are disjoint subsets of . Note that the statement is not the statement that whenever is a subset of a transaction, then must also be one. Instead, the strength of an association rule is measured in terms of its support, the frequency of transactions that contain both and , and its confidence, the frequency of transactions that contain when they also contain .
If and are subsets of a set of transactions, then
and
The support of the association rule , the fraction of transactions containing both and , is a useful measure because a low support value tells us a transaction containing all items in and all items in is seldom purchased. A high value would tell us they are purchased together in a large fraction of transactions. The confidence of is the conditional probability that a transaction will contain all the items in and in given that it contains all the items in . So the larger confidence in an association rule , the more likely it is for to be a subset of a transaction that contains .
An important problem in data mining is to find strong association rules, which have support and confidence greater than that of a minimum confidence and support levels.