Tuesday, May 3, 2011

Software Engineering - Halstead Software Science


Halstead complexity metrics were developed by the late Maurice Halstead as a means of determining a quantitative measure of a program module's complexity directly from the operators and operands in its source code. Among the earliest software metrics, they are strong indicators of code complexity. Halstead's Software Science is also an estimation technique for finding the size, time, and effort of a software product.
Halstead's metrics are based on interpreting the source code as a sequence of tokens and classifying each token as either an operator or an operand.
First we need to compute the following numbers, given the program:
  • n1 = the number of distinct operators
  • n2 = the number of distinct operands
  • N1 = the total number of operators
  • N2 = the total number of operands
The number of unique operators and operands (n1 and n2) as well as the total number of operators and operands (N1 and N2) are calculated by collecting the frequencies of each operator and operand token of the source program.
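As a rough illustration of this counting step, here is a minimal Python sketch, assuming a toy assignment statement that has already been split into tokens and a greatly simplified operator/operand classification (a real tool would use a full lexer and language-specific rules):

        from collections import Counter

        # tokens of the toy statement:  a = b + c * b;
        # the operator/operand classification is assumed for illustration only
        tokens = [("a", "operand"), ("=", "operator"), ("b", "operand"),
                  ("+", "operator"), ("c", "operand"), ("*", "operator"),
                  ("b", "operand"), (";", "operator")]

        operators, operands = Counter(), Counter()
        for token, kind in tokens:
            (operators if kind == "operator" else operands)[token] += 1

        n1, n2 = len(operators), len(operands)                     # distinct operators, operands
        N1, N2 = sum(operators.values()), sum(operands.values())   # total operators, operands
        print(n1, n2, N1, N2)                                      # 4 3 4 4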
From these numbers, the following measures can be calculated:

 Program length (N)

The program length (N) is the sum of the total number of operators and operands in the program:
        N = N1 + N2

Vocabulary size (n)

The vocabulary size (n) is the sum of the number of unique operators and operands:
        n = n1 + n2

Program volume (V)

The program volume (V) is the information content of the program, measured in mathematical bits. It is calculated as the program length times the base-2 logarithm of the vocabulary size (n):
        V = N * log2(n)
Halstead's volume (V) describes the size of the implementation of an algorithm. The computation of V is based on the number of operations performed and operands handled in the algorithm. Therefore V is less sensitive to code layout than the lines-of-code measures.
The volume of a function should be at least 20 and at most 1000. The volume of a parameterless, non-empty one-line function is about 20. A volume greater than 1000 suggests that the function probably does too many things.
The volume of a file should be at least 100 and at most 8000. These limits are based on volumes measured for files whose LOCpro and v(G) are near their recommended limits. The volume limits can thus be used for double-checking.

Difficulty level (D)

The difficulty level or error proneness (D) of the program is proportional to the number of unique operators in the program. D is also proportional to the ratio between the total number of operands and the number of unique operands (i.e., if the same operands are used many times in the program, it is more prone to errors).
        D = ( n1 / 2 ) * ( N2 / n2 )

Program level (L)

The program level (L) is the inverse of the error proneness of the program.
That is, a low-level program is more prone to errors than a high-level program.
        L = 1 / D

Effort to implement (E)

The effort to implement (E) or understand a program is proportional to the volume and to the difficulty level of the program.
        E = V * D
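Putting the formulas above together, a minimal Python sketch (reusing the illustrative counts n1 = 4, n2 = 3, N1 = 4, N2 = 4 from the earlier tokenizing example) might look like this:

        import math

        n1, n2, N1, N2 = 4, 3, 4, 4        # illustrative counts

        N = N1 + N2                        # program length
        n = n1 + n2                        # vocabulary size
        V = N * math.log2(n)               # program volume (bits)
        D = (n1 / 2) * (N2 / n2)           # difficulty level
        L = 1 / D                          # program level
        E = V * D                          # effort to implement

        print(round(V, 2), round(D, 2), round(E, 2))   # roughly 22.46 2.67 59.89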

Estimated Program Length (N^)

According to Halstead, the first hypothesis of software science is that the length of a well-structured program is a function only of the number of unique operators and operands. The estimated length is denoted by N^:
        N^ = n1 * log2(n1) + n2 * log2(n2)

Potential Volume:
Amongst all the programs, the one that has minimal size is said to have the potential volume, V*. Halstead argued that the minimal implementation of any algorithm was through a reference to a procedure that had been previously written. The implementation of this algorithm would then require nothing more than invoking the procedure and supplying the operands for its inputs and output parameters.
        V* = (2 + n2*) * log2(2 + n2*)

Estimated Program Level/Difficulty (L^)

Halstead offered an alternate formula that estimates the program level:
        L^ = (2 * n2) / (n1 * N2)
Hence, D = 1 / L^.

Effort:
Halstead hypothesized that the effort required to implement a program increases as the size of the program increases. It takes more effort to implement a program at a lower level than at a higher level.
        E = V / L^

Time to implement (T)

John Stroud suggested that the human mind is capable of making a limited number of elementary discriminations per second. Stroud claimed that β (the Stroud number) ranges between 5 and 20. Since effort uses elementary discriminations as its unit of measure, the programming time is:
        T = E / β
β is normally set to 18, since this seemed to give the best results in Halstead's earliest experiments.
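The estimators above can be computed in the same way. The following sketch continues the illustrative counts from the earlier examples and assumes β = 18, so T comes out in seconds:

        import math

        n1, n2, N2 = 4, 3, 4               # illustrative counts
        V = 22.46                          # volume from the earlier sketch

        N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)   # estimated program length
        L_hat = (2 * n2) / (n1 * N2)                      # estimated program level
        E = V / L_hat                                     # effort
        T = E / 18                                        # time to implement, beta = 18

        print(round(N_hat, 2), L_hat, round(E, 1), round(T, 1))   # roughly 12.75 0.375 59.9 3.3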

Monday, May 2, 2011

Cyclomatic complexity - Connection matrix

Each node in the flow graph is identified by a number, while each edge is identified by a letter. A letter entry is made in the matrix to correspond to a connection between two nodes; for example, node 3 is connected to node 4 by edge b. To this point, the graph matrix is nothing more than a tabular representation of a flow graph. However, by adding a link weight to each matrix entry, the graph matrix can become a powerful tool for evaluating program control structure during testing. The link weight provides additional information about control flow. In its simplest form, the link weight is 1 (a connection exists) or 0 (a connection does not exist). To illustrate, we use this simplest weighting to indicate connections (0 or 1).
Each letter has been replaced with a 1, indicating that a connection exists (zeros have been excluded for clarity). Represented in this form, the graph matrix is called a connection matrix.
In the connection matrix, each row with two or more entries represents a predicate node. Performing simple arithmetic on the row entries of the connection matrix therefore provides us with a method for determining Cyclomatic complexity.
Implementation
The control flow graph can be implemented using the adjacency matrix data structure. The procedure for creating the connection matrix from a control flow graph is as follows: declare a two-dimensional array whose size is equal to the number of nodes in the flow graph, then construct the adjacency matrix with
Aij = 1 if there is an edge from node i to node j,
Aij = 0 if there is no edge.
The complexity can be found from this adjacency matrix using the procedure below.
1. For each row that has at least one entry, count the number of 1's in it and subtract 1 from this. Save the result.
2. After doing step 1 on all rows, sum the results and add 1. This gives the Cyclomatic complexity.
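Here is a minimal Python sketch of this procedure, assuming the flow graph is already available as an adjacency (connection) matrix; the example matrix is a hypothetical if-then-else graph in which node 0 branches to nodes 1 and 2, which both rejoin at node 3:

        def cyclomatic_complexity(adjacency):
            """Sum (number of 1's - 1) over every row with at least one entry, then add 1."""
            total = 0
            for row in adjacency:
                ones = sum(row)
                if ones >= 1:
                    total += ones - 1
            return total + 1

        A = [
            [0, 1, 1, 0],   # node 0 is a predicate node: edges to nodes 1 and 2
            [0, 0, 0, 1],   # node 1 -> node 3
            [0, 0, 0, 1],   # node 2 -> node 3
            [0, 0, 0, 0],   # node 3 is the exit node
        ]
        print(cyclomatic_complexity(A))   # 2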

Software Engineering - McCabe's Cyclomatic metric

McCabe's Cyclomatic metric, V(G), of a graph G with n vertices and e edges is given by the following formula:
V(G) = e - n + 2
Let us take an example of a program; wherein we associate it with a directed graph that has unique entry and exit nodes. Each node in the graph, G, corresponds to a block of statements in the program where flow is sequential and the arcs correspond to branches taken in the program. This graph is known as a "Flow Graph".
The Cyclomatic Complexity, V(G), provides us with two things:
1) The number of independent paths through the program.
2) An upper bound for the number of test cases that must be executed in order to test the program thoroughly. The complexity measure is defined in terms of independent paths, where an independent path is any path through the program that introduces at least one new set of processing statements or a new condition.
The following steps are usually followed
Step – 1: Construction of flow graph from the source code or flow charts.
Step – 2: Identification of independent paths.
Step – 3: Computation of Cyclomatic Complexity.
Step – 4: Test cases are designed.
Using the flow graph, an independent path can be defined as a path in the flow graph that has at least one edge that has not been traversed before in other paths. A set of independent paths that covers all the edges is a basis set. Once the basis set is formed, test cases should be written to execute all the paths in the basis set.

Properties of Cyclomatic Complexity
The following are the properties of Cyclomatic Complexity represented as V(G)

1) V(G) >= 1.

2) V(G) is the maximum number of independent paths in graph, G.
3) Inserting or deleting functional statements in G does not affect V(G).
4) G has only one path if and only if V(G) = 1.

5) Inserting a new edge in G increases V(G) by unity.

Let us see the meaning of V(G)

For small programs Cyclomatic Complexity can be calculated manually, but automated tools are essential as several thousands of lines of code are possible in each program in a project. It will be very difficult to manually create flow graphs for large programs.
There are several tools available in the market that can compute Cyclomatic Complexity. Note that calculating the complexity of a module after it has been built and tested may be too late; it may not be possible to redesign a complex module after it has been tested. Thus, some basic complexity checks must be performed on the modules before embarking upon the testing (or even coding) phase. Based on the complexity number that emerges from using the tool, one can conclude what actions need to be taken, using the table given below.
Complexity No.   Corresponding Meaning of V(G)
1-10             1) Well-written code, 2) Testability is high, 3) Cost / effort to maintain is low
10-20            1) Moderately complex code, 2) Testability is medium, 3) Cost / effort to maintain is medium
20-40            1) Very complex code, 2) Testability is low, 3) Cost / effort to maintain is high
> 40             1) Not testable, 2) Any amount of money / effort to maintain may not be enough

Software Engineering - Cyclomatic Complexity

Basis Path Testing: Basis path testing is a white-box testing technique first proposed by Tom McCabe. The basis path method enables the test-case designer to derive a logical complexity measure of a procedural design and to use this measure as a guide for defining a basis set of execution paths. Test cases derived to exercise the basis set are guaranteed to execute every statement in the program at least one time during testing.

Flow Graph Notation: The flow graph depicts logical control flow using a diagrammatic notation. Each structured construct has a corresponding flow graph symbol.

Cyclomatic Complexity: Cyclomatic complexity is a software metric that provides a quantitative measure of the logical complexity of a program. When used in the context of the basis path testing method, the value computed for Cyclomatic complexity defines the number of independent paths in the basis set of a program and provides us with an upper bound for the number of tests that must be conducted to ensure that all statements have been executed at least once.

An independent path is any path through the program that introduces at least one new set of processing statements or a new condition.

Computing Cyclomatic Complexity: Cyclomatic complexity has a foundation in graph theory and provides us with an extremely useful software metric. Complexity is computed in one of three ways:

1. The number of regions of the flow graph corresponds to the Cyclomatic complexity.

2. Cyclomatic complexity, V(G), for a flow graph, G, is defined as
V(G) = E - N + 2P, where E is the number of flow graph edges, N is the number of flow graph nodes, and P is the number of connected components (P = 1 for a single connected flow graph).

3. Cyclomatic complexity, V(G), for a flow graph, G, is also defined as:
V(G) = π + 1, where π is the number of predicate nodes contained in the flow graph G.
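As a quick cross-check of these formulas, here is a minimal Python sketch for a hypothetical if-then-else flow graph (one predicate node, one connected component); drawn in the plane it has 2 regions, and both computations below also give 2:

        edges = [(1, 2), (1, 3), (2, 4), (3, 4)]        # if-then-else flow graph
        nodes = {n for edge in edges for n in edge}

        E, N, P = len(edges), len(nodes), 1             # P = 1 connected component
        v_by_edges = E - N + 2 * P                      # method 2: V(G) = E - N + 2P

        out_degree = {}
        for src, _ in edges:
            out_degree[src] = out_degree.get(src, 0) + 1
        predicates = sum(1 for d in out_degree.values() if d >= 2)
        v_by_predicates = predicates + 1                # method 3: V(G) = pi + 1

        print(v_by_edges, v_by_predicates)              # 2 2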
Graph Matrices: The procedure for deriving the flow graph and even determining a set of basis paths is amenable to mechanization. To develop a software tool that assists in basis path testing, a data structure called a graph matrix can be quite useful.
A graph matrix is a square matrix whose size is equal to the number of nodes in the flow graph. Each row and column corresponds to an identified node, and matrix entries correspond to connections between nodes. The connection matrix can also be used to find the cyclomatic complexity, as described in the connection matrix post above.

Thursday, April 14, 2011

Data Mining - Supervised and Unsupervised Learning

Supervised and Unsupervised Learning
Data and Knowledge Mining is learning from data. In this context, data are allowed to speak for themselves and no prior assumptions are made. This learning from data comes in two flavors: supervised learning and unsupervised learning. In supervised learning (often also called directed data mining) the variables under investigation can be split into two groups: explanatory variables and one (or more) dependent variables. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable as it is done in regression analysis. To apply directed data mining techniques the values of the dependent variable must be known for a sufficiently large part of the data set.
Unsupervised learning is closer to the exploratory spirit of Data Mining as stressed in the definitions given above. In unsupervised learning situations all variables are treated in the same way; there is no distinction between explanatory and dependent variables. However, in contrast to the name undirected data mining, there is still some target to achieve. This target might be as general as data reduction or more specific, like clustering. The dividing line between supervised learning and unsupervised learning is the same one that distinguishes discriminant analysis from cluster analysis. Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. For unsupervised learning, typically either the target variable is unknown or it has been recorded for too small a number of cases.
The large amount of data that is usually present in Data Mining tasks allows us to split the data into three groups: training cases, validation cases and test cases. Training cases are used to build a model and estimate the necessary parameters. The validation data helps to see whether the model obtained with one chosen sample can be generalized to other data. In particular, it helps to avoid the phenomenon of overfitting: iterative methods tend to result in models that try to do too well, so that the data at hand are perfectly described but generalization to other data yields unsatisfactory outcomes. Not only might different estimates yield different models; usually different statistical methods or techniques are available for a certain statistical task, and the choice of a method is open to the user. Test data can be used to assess the various methods and to pick the one that does the best job in the long run.
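As a small illustration of such a split, here is a minimal sketch using scikit-learn on placeholder data (the 60/20/20 proportions and variable names are assumptions for illustration, not prescriptions from the text):

        import numpy as np
        from sklearn.model_selection import train_test_split

        # placeholder data: 1000 cases, 5 explanatory variables, one binary target
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 5))
        y = rng.integers(0, 2, size=1000)

        # 60% training, 20% validation, 20% test
        X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
        X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

        # fit candidate models on the training cases, compare them on the validation
        # cases, and report the chosen model's performance once on the test cases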
Although we are dealing with large data sets and typically have abundant cases, partially missing values and other data peculiarities can make data a scarce resource, and it might not be feasible to split the data into as many subsets as are needed. Resampling and cross-validation techniques are therefore often used in combination with data- and computer-intensive methods in Data Mining.

KDD-Knowledge Discovery in Database

Knowledge Discovery in Databases
There are almost as many differing definitions of the term ''Data Mining'' as there are authors who have written about it. Since Data Mining sits at the interface of a variety of fields, e.g. computer science, statistics, artificial intelligence, business information systems, and machine learning, its definition changes with the field's perspective. Computer scientists, typically, refer to Data Mining as a clearly defined part of the Knowledge Discovery in Databases (KDD) process, while many statisticians use Data Mining as a synonym for the whole KDD process.
To get a flavor of both the variation as well as the common core of data and knowledge mining, we cite some of the definitions used in the literature.
KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. ([8])

Knowledge discovery is a knowledge-intensive task consisting of complex interactions, protracted over time, between a human and a (large) database, possibly supported by a heterogeneous suite of tools. ([4])

[Data Mining is] a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular enumeration of patterns. ([8])

[Data Mining is] a folklore term which indicates application, under human control, of low-level data mining methods. Large scale automated search and interpretation of discovered regularities belong to KDD, but are typically not considered part of data mining. ([24])

[Data Mining is] used to discover patterns and relationships in data, with an emphasis on large observational data bases. It sits at the common frontiers of several fields including Data Base Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization. ([10])

[Data Mining is] the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owners. ([12])

Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. ([13])
From these definitions the essence is that we are talking about exploratory analysis of large data sets. Two further aspects are the use of computer-based methods and the notion of secondary and observational data. The latter means that the data do not come from experimental studies and that the data were originally collected for some other purpose, either for a study with different goals or for record-keeping reasons. These four characteristics in combination distinguish the field of Data Mining from traditional statistics. The exploratory approach in Data Mining clearly defines the goal of finding patterns and generating hypotheses, which might later be the subject of designed experiments and statistical tests. Data sets can be large in at least two different respects. The most common one is a large number of observations (cases). Real-world applications are usually also large with respect to the number of variables (dimensions) represented in the data set, and Data Mining is also concerned with this side of largeness. Especially in the field of bioinformatics, many data sets comprise only a small number of cases but a large number of variables. Secondary analysis implies that the data can rarely be regarded as a random sample from the population of interest and may have quite large selection biases. The primary focus in investigating large data sets tends not to be the standard statistical approach of inference from a small sample to a large universe, but more likely partitioning the large sample into homogeneous subsets.
The ultimate goal of Data Mining methods is not to find patterns and relationships as such; the focus is on extracting knowledge, on making the patterns understandable and usable for decision purposes. Thus, Data Mining is the component in the KDD process that is mainly concerned with extracting patterns, while Knowledge Mining involves evaluating and interpreting these patterns. This requires at least that patterns found with Data Mining techniques can be described in a way that is meaningful to the database owner. In many instances this description alone is not enough; instead, a sophisticated model of the data has to be constructed.
Data pre-processing and data cleansing are an essential part of the Data and Knowledge Mining process. Since data mining means taking data from different sources, collected at different time points and at different places, integrating such data as input for data mining algorithms is an easily recognized task, but not easily done. Moreover, there will be missing values, changing scales of measurement, as well as outlying and erroneous observations. Assessing the data quality is a first and important step in any scientific investigation. Simple tables and statistical graphics give a quick and concise overview of the data, help to spot data errors and inconsistencies, and confirm already known features. Besides the detection of uni- or bivariate outliers, graphics and simple statistics help in assessing the quality of the data in general and in summarizing its general behavior. It is worth noting that many organizations still report that as much as 80% of their effort for Data and Knowledge Mining goes into supporting the data cleansing and transformation process.

MOLAP, ROLAP, HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
  • Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
  • Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
  • Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
  • Requires additional investment: Cube technologies are often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources will be needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
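A minimal sketch of that idea, assuming a hypothetical fact table named sales with dimension columns region and year and a measure amount (none of these names come from the text):

        def rolap_query(filters):
            """Translate a slice/dice request into a SQL string with a WHERE clause."""
            where = " AND ".join(f"{column} = '{value}'" for column, value in filters.items())
            return ("SELECT region, year, SUM(amount) "
                    "FROM sales WHERE " + where + " GROUP BY region, year")

        # slicing on region and dicing on year simply add conditions to the WHERE clause
        print(rolap_query({"region": "Europe", "year": "2010"}))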
Advantages:
  • Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
  • Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
  • Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
  • Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.