
Thursday, April 14, 2011

Data Mining - Supervised and Unsupervised Learning

Data and Knowledge Mining is learning from data: the data are allowed to speak for themselves and no prior assumptions are made. This learning from data comes in two flavors: supervised learning and unsupervised learning. In supervised learning (often also called directed data mining) the variables under investigation can be split into two groups: explanatory variables and one (or more) dependent variables. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable, as is done in regression analysis. To apply directed data mining techniques, the values of the dependent variable must be known for a sufficiently large part of the data set.
Unsupervised learning is closer to the exploratory spirit of Data Mining, as stressed in the definitions of KDD given below. In unsupervised learning all variables are treated in the same way; there is no distinction between explanatory and dependent variables. However, in contrast to what the name undirected data mining suggests, there is still a target to achieve. This target might be as general as data reduction or as specific as clustering. The dividing line between supervised and unsupervised learning is the same one that distinguishes discriminant analysis from cluster analysis. Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. In unsupervised learning, typically, the target variable is either unknown or has been recorded for too small a number of cases.
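To make the distinction concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn are available; the data and the particular model choices are invented purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Toy data: 100 cases with two explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Supervised / directed: the dependent variable y is known and modeled.
model = LinearRegression().fit(X, y)

# Unsupervised / undirected: no dependent variable, only structure in X.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)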
The large amount of data usually present in Data Mining tasks allows the data to be split into three groups: training cases, validation cases, and test cases. Training cases are used to build a model and estimate its parameters. The validation data help to check whether the model built on one sample generalizes to other data; in particular, they help avoid overfitting, the phenomenon in which iterative methods describe the data at hand perfectly but generalize poorly to other data. Not only might different estimates yield different models; usually several statistical methods or techniques are available for a given task, and the choice among them is open to the user. Test data can be used to assess the competing methods and to pick the one that does the best job in the long run.
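A minimal sketch of such a three-way split, reusing X and y from the example above (the 60/20/20 proportions and scikit-learn's train_test_split are illustrative choices, not a prescription):

from sklearn.model_selection import train_test_split

# Carve off 20% as the test set, then split the remainder into
# training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)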
Although we are dealing with large data sets and typically have abundant cases, partially missing values and other data peculiarities can make data a scarce resource, and it may not be possible to split the data into as many subsets as necessary. Resampling and cross-validation techniques are therefore often used in combination with the data- and computer-intensive methods of Data Mining.
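For instance, k-fold cross-validation lets every case serve in a validation role exactly once, so no separate validation file has to be carved out of scarce data. A sketch with scikit-learn, again reusing X and y from above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the data is split into five folds, and each
# fold is held out once while the model is fit on the other four.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean(), scores.std())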

KDD - Knowledge Discovery in Databases

There are almost as many differing definitions of the term "Data Mining" as there are authors who have written about it. Since Data Mining sits at the interface of a variety of fields, e.g. computer science, statistics, artificial intelligence, business information systems, and machine learning, its definition changes with the field's perspective.
To get a flavor of both the variation and the common core of data and knowledge mining, we cite some of the definitions used in the literature.
KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

([8])


Knowledge discovery is a knowledge-intensive task consisting of complex interactions, protracted over time, between a human and a (large) database, possibly supported by a heterogeneous suite of tools.

([4])


[Data Mining is] a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular enumeration of patterns.

([8])


[Data Mining is] a folklore term which indicates application, under human control, of low-level data mining methods. Large scale automated search and interpretation of discovered regularities belong to KDD, but are typically not considered part of data mining.

([24])


[Data Mining is] used to discover patterns and relationships in data, with an emphasis on large observational data bases. It sits at the common frontiers of several fields including Data Base Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization.

([10])


[Data Mining is] the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owners.

([12])
Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

([13])
From these definitions the essence is that we are talking about exploratory analysis of large data sets. Two further aspects are the use of computer-based methods and the notion of secondary and observational data. The latter means that the data do not come from experimental studies and were originally collected for some other purpose, either for a study with different goals or for record-keeping reasons. These four characteristics in combination distinguish the field of Data Mining from traditional statistics. The exploratory approach in Data Mining clearly defines the goal as finding patterns and generating hypotheses, which might later become the subject of designed experiments and statistical tests. Data sets can be large in at least two different respects. The most common one is a large number of observations (cases). Real-world applications are usually also large with respect to the number of variables (dimensions) represented in the data set, and Data Mining is concerned with this side of largeness as well. Especially in bioinformatics, many data sets comprise only a small number of cases but a large number of variables. Secondary analysis implies that the data can rarely be regarded as a random sample from the population of interest and may carry quite large selection biases. The primary focus in investigating large data sets tends not to be the standard statistical approach of inference from a small sample to a large universe, but more likely the partitioning of the large sample into homogeneous subsets.
The ultimate goal of Data Mining methods is not to find patterns and relationships as such; the focus is on extracting knowledge, on making the patterns understandable and usable for decision purposes. Thus, Data Mining is the component in the KDD process that is mainly concerned with extracting patterns, while Knowledge Mining involves evaluating and interpreting those patterns. This requires at least that patterns found with Data Mining techniques can be described in a way that is meaningful to the database owner. In many instances this description alone is not enough, and a sophisticated model of the data has to be constructed.
Data pre-processing and data cleansing are an essential part of the Data and Knowledge Mining process. Since data mining means taking data from different sources, collected at different time points and at different places, integrating such data as input for data mining algorithms is an easily recognized but not easily accomplished task. Moreover, there will be missing values, changing scales of measurement, and outlying or erroneous observations. Assessing data quality is a first and important step in any scientific investigation. Simple tables and statistical graphics give a quick and concise overview of the data, helping to spot data errors and inconsistencies as well as to confirm already known features. Besides detecting uni- or bivariate outliers, graphics and simple statistics help in assessing the quality of the data in general and in summarizing its general behavior. It is worth noting that many organizations still report that as much as 80% of their effort in Data and Knowledge Mining goes into the data cleansing and transformation process.
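As a sketch of such a first quality check, here is what a pandas version might look like (the file name data.csv and the three-standard-deviation outlier rule are illustrative assumptions, not part of any standard):

import pandas as pd

df = pd.read_csv("data.csv")           # hypothetical input file
print(df.describe(include="all"))      # concise per-column summary table
print(df.isna().sum())                 # missing values per column

# Flag univariate outliers: values more than three standard
# deviations from their column mean (a crude but common rule).
num = df.select_dtypes("number")
print(((num - num.mean()).abs() > 3 * num.std()).sum())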

MOLAP, ROLAP, HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
  • Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
  • Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
  • Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in that case, only summary-level information will be included in the cube itself.
  • Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as sketched below.
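For illustration, a slice on year combined with a dice on region might be translated into SQL roughly as follows; the sales table and its columns are invented for the example:

# The cube operation "slice year = 2011, dice region in (East, West)"
# becomes a WHERE clause against a hypothetical relational sales table.
query = """
SELECT region, product, SUM(amount) AS total
FROM sales
WHERE year = 2011
  AND region IN ('East', 'West')
GROUP BY region, product
"""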
Advantages:
  • Can handle large amounts of data: The data size limitation of ROLAP technology is that of the underlying relational database; in other words, ROLAP itself places no limitation on the amount of data.
  • Can leverage functionalities inherent in the relational database: Relational databases already come with a host of functionalities, and ROLAP technologies, since they sit on top of the relational database, can leverage them.
Disadvantages:
  • Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
  • Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations in SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building complex out-of-the-box functions into their tools, as well as the ability for users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Business Intelligence Tools



The most common tools used for business intelligence are listed below, in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.
Excel
Take a guess: what is the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:
1. It's relatively cheap.
2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers.
3. It has most of the functionalities users need to display data.
In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built in.
Excel is best used for business operations reporting and goals tracking.
Reporting tool
In this discussion, I am including both custom-built and commercial reporting tools. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.
Business operations reporting and dashboards are the most common applications for a reporting tool.
OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.
OLAP tools are used for multidimensional analysis.
Data mining tool
Data mining tools are usually used only by very specialized users; even in large organizations, there are usually only a handful of people working with data mining tools.
Data mining tools are used for finding correlations among different factors.
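As a minimal illustration (not a full data mining workflow), the simplest version of this is a correlation matrix over the numeric columns of a data set; customers.csv is a hypothetical file name:

import pandas as pd

df = pd.read_csv("customers.csv")         # hypothetical customer data set
print(df.select_dtypes("number").corr())  # pairwise correlations between factors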

DSS

Decision Support Systems (DSS) help executives make better decisions by using historical and current data from internal Information Systems and external sources. By combining massive amounts of data with sophisticated analytical models and tools, and by making the system easy to use, they provide a much better source of information to use in the decision-making process.
Decision Support Systems (DSS) are a class of computerized information systems that support decision-making activities. DSS are interactive computer-based systems and subsystems intended to help decision makers use communications technologies, data, documents, knowledge and/or models to successfully complete decision process tasks.
While many people think of decision support systems as a specialized part of a business, most companies have actually integrated this system into their day to day operating activities. For instance, many companies constantly download and analyze sales data, budget sheets and forecasts and they update their strategy once they analyze and evaluate the current results. Decision support systems have a definite structure in businesses, but in reality, the data and decisions that are based on it are fluid and constantly changing.
Types of Decision Support Systems (DSS)
  1. Data-Driven DSS take the massive amounts of data available through the company’s TPS and MIS systems and cull from it useful information which executives can use to make more informed decisions. They don’t have to have a theory or model but can “free-flow” the data. The first generic type of Decision Support System is a Data-Driven DSS. These systems include file drawer and management reporting systems, data warehousing and analysis systems, Executive Information Systems (EIS) and Spatial Decision Support Systems. Business Intelligence Systems are also examples of Data-Driven DSS. Data-Driven DSS emphasize access to and manipulation of large databases of structured data and especially a time-series of internal company data and sometimes external data. Simple file systems accessed by query and retrieval tools provide the most elementary level of functionality. Data warehouse systems that allow the manipulation of data by computerized tools tailored to a specific task and setting or by more general tools and operators provide additional functionality. Data-Driven DSS with Online Analytical Processing (OLAP) provide the highest level of functionality and decision support that is linked to analysis of large collections of historical data.
  2. Model-Driven DSS: A second category, Model-Driven DSS, includes systems that use accounting and financial models, representational models, and optimization models. Model-Driven DSS emphasize access to and manipulation of a model. Simple statistical and analytical tools provide the most elementary level of functionality. Some OLAP systems that allow complex analysis of data may be classified as hybrid DSS, providing modeling, data retrieval, and data summarization functionality. Model-Driven DSS use data and parameters provided by decision-makers to aid them in analyzing a situation, but they are not usually data intensive, and very large databases are usually not needed. Early Model-Driven DSS were isolated from the main Information Systems of the organization and were primarily used for typical "what-if" analysis: "What if we increase production of our products and decrease the shipment time?" (a toy sketch of such an analysis follows this list). These systems rely heavily on models to help executives understand the impact of their decisions on the organization, its suppliers, and its customers.
  3. Knowledge-Driven DSS: The terminology for this third generic type of DSS is still evolving. Currently, the best term seems to be Knowledge-Driven DSS. Adding the modifier “driven” to the word knowledge maintains a parallelism in the framework and focuses on the dominant knowledge base component. Knowledge-Driven DSS can suggest or recommend actions to managers. These DSS are personal computer systems with specialized problem-solving expertise. The “expertise” consists of knowledge about a particular domain, understanding of problems within that domain, and “skill” at solving some of these problems. A related concept is Data Mining. It refers to a class of analytical applications that search for hidden patterns in a database. Data mining is the process of sifting through large amounts of data to produce data content relationships.
  4. Document-Driven DSS: A new type of DSS, a Document-Driven DSS or Knowledge Management System, is evolving to help managers retrieve and manage unstructured documents and Web pages. A Document-Driven DSS integrates a variety of storage and processing technologies to provide complete document retrieval and analysis. The Web provides access to large document databases including databases of hypertext documents, images, sounds and video. Examples of documents that would be accessed by a Document-Driven DSS are policies and procedures, product specifications, catalogs, and corporate historical documents, including minutes of meetings, corporate records, and important correspondence. A search engine is a powerful decision aiding tool associated with a Document-Driven DSS.
  5. Communications-Driven and Group DSS: Group Decision Support Systems (GDSS) came first, but a broader category of Communications-Driven DSS or groupware can now be identified. This fifth generic type of Decision Support System includes communication, collaboration, and decision support technologies that do not fit within the other DSS types identified; therefore, we need to identify these systems as a specific category of DSS. A Group DSS is a hybrid Decision Support System that emphasizes both the use of communications and decision models. A Group Decision Support System is an interactive computer-based system intended to facilitate the solution of problems by decision-makers working together as a group. Groupware supports electronic communication, scheduling, document sharing, and other group productivity and decision support enhancing activities. We have a number of technologies and capabilities in this category in the framework: Group DSS, two-way interactive video, white boards, bulletin boards, and email.
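As promised above, a toy sketch of the kind of "what-if" analysis a Model-Driven DSS supports; every parameter and number here is invented purely for illustration:

# Toy what-if model: profit as a function of production volume and
# shipment time; a delivery-time penalty depresses the effective price.
def profit(units, ship_days, price=20.0, unit_cost=12.0, penalty_per_day=0.05):
    revenue = units * price * (1 - penalty_per_day * ship_days)
    return revenue - units * unit_cost

baseline = profit(units=1000, ship_days=5)
scenario = profit(units=1200, ship_days=3)  # "what if" we change both?
print(baseline, scenario)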
Components of DSS
Traditionally, academics and MIS staffs have discussed building Decision Support Systems in terms of four major components:
  • The user interface
  • The database
  • The models and analytical tools, and
  • The DSS architecture and network
This traditional list of components remains useful because it identifies similarities and differences between categories or types of DSS. The DSS framework is primarily based on the different emphases placed on DSS components when systems are actually constructed.
Data-Driven, Document-Driven and Knowledge-Driven DSS need specialized database components. A Model-Driven DSS may use a simple flat-file database with fewer than 1,000 records, but the model component is very important. Experience and some empirical evidence indicate that design and implementation issues vary for Data-Driven, Document-Driven, Model-Driven and Knowledge-Driven DSS.
Multi-participant systems like Group and Inter-Organizational DSS also create complex implementation issues. For instance, when implementing a Data-Driven DSS a designer should be especially concerned about the user’s interest in applying the DSS in unanticipated or novel situations. Despite the significant differences created by the specific task and scope of a DSS, all Decision Support Systems have similar technical components and share a common purpose, supporting decision-making.
A Data-Driven DSS database is a collection of current and historical structured data from a number of sources that have been organized for easy access and analysis. We are expanding the data component to include unstructured documents in Document-Driven DSS and “knowledge” in the form of rules or frames in Knowledge-Driven DSS. Supporting management decision-making means that computerized tools are used to make sense of the structured data or documents in a database.
Mathematical and analytical models are the major component of a Model-Driven DSS. Each Model-Driven DSS has a specific set of purposes and hence different models are needed and used. Choosing appropriate models is a key design issue. Also, the software used for creating specific models needs to manage needed data and the user interface. In Model-Driven DSS the values of key variables or parameters are changed, often repeatedly, to reflect potential changes in supply, production, the economy, sales, the marketplace, costs, and/or other environmental and internal factors. Information from the models is then analyzed and evaluated by the decision-maker.
Knowledge-Driven DSS use special models for processing rules or identifying relationships in data. The DSS architecture and networking design component refers to how hardware is organized, how software and data are distributed in the system, and how the components of the system are integrated and connected. A major issue today is whether DSS should be available through a Web browser on a company intranet and also on the global Internet. Networking is the key driver of Communications-Driven DSS.

Wednesday, April 13, 2011

Questions for each unit

IMPORTANT QUESTIONS
UNIT 1
Q1) What are the major components of BI? What do you understand by Business Intelligence? Explain its major components.
Q2) What are the major similarities and differences between DSS and BI?
Q3) What are structured, semi-structured, and unstructured decisions? Provide two examples of each.
Q4) How can computers provide support to semi-structured and unstructured decisions?
Q5) What are some of the drivers and benefits of computerized decision support systems?
Q6) Explain the decision modeling process.
Q7) What are the components of DSS?
Q8) Differentiate between DSS and GDSS?
Q9) What are the major components of Business Intelligence?
Q10) Explain different categories or classification of DSS
Q11) List key characteristics or capabilities of DSS
Q12) What is document driven DSS
Q13) How is data-driven DSS related to EIS?
Q14) Why is it important to include a model in a DSS?
Q15) Define Groupware?
Q16) List the major groupware tools and divide them into synchronous and asynchronous types.
Q17) Describe online workspace
Q18) Define GSS and list its benefits.
Q19) How does GSS improve group work?
Q20) Explain three options for deploying GDSS.


UNIT 2
Q1) Explain the following:
a) EIS
b) Expert System
c) OLAP
d) OLTP
e) AI
f) ETL Process
g) Snowflake Schema
h) Star Schema
i) Virtual Warehouse
j) Hypothesis-driven Exploration
k) Discovery-driven Exploration
l) ROLAP
m) MOLAP
n) Drill through and drill across
o) Fact table
p) HOLAP

Differentiate between a) & b), c) & d), g) & h), j) & k), l) & m)

Q2) What is data warehouse? Explain various characteristics of data warehouse?
Q3) List the different types of data warehouse architecture
Q4) List the benefits of data warehouses
Q5) Explain various tools for data warehousing?
Q6) What is data cube? What is multidimensional model?
Q7) Why is summary level data required to be kept in Data warehouse?
Q8) What are the various components of data warehouse?
Q9) Explain different types of OLAP servers?
Q10) Explain various types of OLAP operations?
Q11) Differentiate between dependent data mart and independent data mart?
Q12) Differentiate between apex cuboids and base cuboids?
Q13) Explain the architecture for OLAM
Q14) What do you understand by concept hierarchy?
Q15) Suppose that a data warehouse contains three dimensions: date, doctor, and patient. There is only one measure, charge, where charge is the fee that a doctor charges a patient for a visit.
a) Draw a star schema for the above data warehouse.
b) Starting with the base cuboid [date, doctor, patient], which sequence of OLAP operations do you need to list the total fee collected by each doctor in the year 2004?

ANS a) [star schema diagram not reproduced]

b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
1. roll up from day to month to year
2. slice for year = “2004”
3. roll up on patient from individual patient to all
4. slice for patient = “all”
5. get the list of total fees collected by each doctor in 2004
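Expressed against a relational fact table instead of a cube, the same answer is just a filter plus an aggregation. A minimal pandas sketch (fees.csv and its column names are hypothetical stand-ins for the fact table above):

import pandas as pd

# Hypothetical fact table with columns: day, doctor, patient, charge
fact = pd.read_csv("fees.csv", parse_dates=["day"])

# Slice year = 2004, then roll up over days and patients per doctor.
total_2004 = fact[fact["day"].dt.year == 2004].groupby("doctor")["charge"].sum()
print(total_2004)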

Q16) Discuss whether or not each of the following activities is a data mining task:

Dividing the customers of a company according to their gender.
ANS: No. This is a simple database query.

Dividing the customers of a company according to their profitability.

ANS: No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

Computing the total sales of a company.

ANS: No. Again, this is simple accounting.

Sorting a student database based on student identification numbers.
ANS: No. Again, this is a simple database query.

Predicting the outcomes of tossing a (fair) pair of dice.
ANS: No. Since the die is fair, this is a probability calculation. If the die were not fair, and we needed to estimate the probabilities of each outcome from the data, then this would be more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus we wouldn’t consider it to be data mining.

Predicting the future stock price of a company using historical records.
ANS: Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modeling. We could use regression for this modeling, although researchers in many fields have developed a wide variety of techniques for predicting time series.

Q17) Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
Draw a snowflake schema diagram for the data warehouse.
Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student.
If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Solution:
(a) [snowflake schema diagram not reproduced]

(b)
Starting with the base cuboid [student, course, semester, instructor]:
1. roll-up on course from course_key to major
2. roll-up on student from student_key to university
3. dice on course and student with department = “CS” and university = “Big University”
4. drill-down on student from university to student name
(c) Each of the four dimensions has five levels (including all), so the cube will contain 5^4 = 625 cuboids (including the base and apex cuboids).

Q18) What is snowflaking? How does it affect the performance of the database?
Q19) Define metadata and give reasons why it can be useful in DW?
Q20) What are the advantages of multidimensional database structure over relational data base structure for DW applications?

UNIT 3
Q1) What is data mining? Explain KDD?
Q2) Describe various data mining functionalities: characterization, discrimination, association, classification, and clustering.
Q3) Differentiate between OLAP and data mining?
Q4) What are the major issues in data mining?
Q5) Explain various data mining techniques and tools?
Q6) What are some major characteristics of Data mining?
Q7) Identify and explain at least 5 applications of data mining?
Q8) Differentiate between classification and clustering
Q9) What are the three main areas of web mining?
Q10) Differentiate between KDD and data mining?
Q11) Describe the important predictive tools of data mining
Q12) differentiate between descriptive and predictive data mining?
Q13) What kind of data mining can be performed on Spatial databases?
Q14) Explain why data preprocessing is necessary before feeding data into DW?
Q15) What is data cleansing and data integration and why are they important?
Q16) Explain how evolution of database technology led to data mining?
Q17) How is data mining classified? Explain the various database systems on which data mining can be performed?
Q18) Explain the process of integrating data mining with database.
Integration of a Data Mining System with a Database or Data Warehouse System
When integrating a Data Mining (DM) system with a database (DB) or data warehouse (DW) system, possible integration schemes include:
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
Semitight coupling: Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data mining functions) can be provided in the DB/DW system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
Q19) Discuss the issues to be considered during data integration.
Q20)  What is data preprocessing?

UNIT 4
Q1) What is knowledge management system?
Q2) Explain the terms ‘ Knowledge generation’,’ knowledge storage’, and ‘knowledge utilization ‘ related to knowledge management.
Q3) What are the basic steps in the KMSLC and what are the activities in each?
Q4) What is meant by the dimensions of knowledge?
Q5) Differentiate between knowledge base and database?
Q6) How is knowledge discovered from a data warehouse?
Q7) Discuss the various methods of knowledge representations?
Q8) Why is information technology a must for the implementation of knowledge management in an organization?
Q9) How is CRM linked with Knowledge management?
Q10) How is knowledge management beneficial in developing business strategy? Explain.
Q11) What is knowledge capture and what tools are used?
Q12) What are the different approaches to knowledge management?
Q13) Compare and contrast information management and knowledge management?
Q14) Define the term fuzzy logic in the context of knowledge management?
Q15) Discuss the various methods of knowledge representation?
Q16) Differentiate between data, information, and knowledge?
Q17) Describe the role of CKO?
Q18) What is meant by a culture of sharing knowledge?
Q19) Why is knowledge known as the new economy of an organization?
Q20) Explain the following:
a) Knowware
b) Tacit and explicit knowledge
c) Components of knowledge management
d) Technologies that support knowledge management
e) Why is it important to manage knowledge?
f) Knowledge edge
g) Importance and limitations of knowledge management
h) List several trends that highlight the need for businesses to manage knowledge for competitive advantage.
i) Knowledge codification: techniques
j) Knowledge map