Disclaimer: This work has been submitted by a student. This is not an example of the work produced by our Essay Writing Service. You can view samples of our professional work here.
Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.
Although the project was carried out personally, the guidance, contribution and support of several individuals had a great encouraging and positive impact on the project.. I warmly thank my supervisor Phil Molyneux for his support and guidance through out the project and whose supervision kept the project within its scope and deadlines.
During this work I have collaborated with many colleagues for whom I have great regard, and I wish to extend my warmest thanks to all those who have helped me with my work.
I owe my most sincere gratitude to my friend Mr. Ashwani Roy for introducing me to the world of Business Intelligence and helping me out to overcome the problems.
I would like to express my warm and sincere thanks to my parents and my brother for all their love, support and encouragement over all these years.
Most of all, to GOD for this unconditional love, for making me capable of overcoming the obstacles of life.
Thank you so much for all your support and help!
Table of Contents
It is the process of transforming related business data into information, information into knowledge and with repetitive identification turning knowledge into Intelligence.
Data: It is a raw data i.e. it represents reality, facts and figures.
Information: It is a data that is processed and interpreted.
Knowledge: After the information is processed, it becomes more sensible containing some meaning and understanding.
Business Intelligence (BI) is a wide category of applications and technologies for collecting, storing, analyzing, and providing access to data to help the enterprise users make better decision making. BI applications supports the activities decision support systems, querying and reporting, online analytical processing (OLAP), statistical data analysis, forecasting and data mining.
Forrester Research often defines Business Intelligence in one of two ways. Typically, Forrester uses the following broad definition (Forrester,2008) :
Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.
A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery.
“Business Intelligence is neither a product nor a system. It is an architecture and a collection of integrated operational as well as decision-support applications and databases that provide the business community easy access to business data” (Moss, 2003).
BI is the delivery of accurate, useful information to the decision makers within the needed timeframe to enhance decision making.
BI is not just some facts and figures on a printed report or computer screen. BI provides foundational information on which to base a decision. BI also provides feedback information that can be used to evaluate a decision.
BI systems enable organization to provide the users an insight about the company’s information assets.
The five key stages of Business Intelligence:
Data sourcing :
Business Intelligence is about extracting information from multiple sources of data. The data might be: text documents – e.g. memos or reports or email messages; photographs and images; sounds; formatted tables; web pages and URL lists. The key to data sourcing is to obtain the information in electronic form. So typical sources of data might include: scanners; digital cameras; database queries; web searches; computer file access etc.
Business Intelligence is about synthesizing useful knowledge from collections of data. It is about estimating current trends, integrating and summarizing disparate information, validating models of understanding and predicting missing information or future trends. This process of data analysis is also called data mining or knowledge discovery.
Situation awareness :
Business Intelligence is about filtering out irrelevant information, and setting the remaining information in the context of the business and its environment. The user needs the key items of information relevant to his or her needs, and summaries that are syntheses of all the relevant data (market forces, government policy etc.). Situation awareness is the grasp of the context in which to understand and make decisions. Algorithms for situation assessment provide such syntheses automatically.
Risk assessment :
Business Intelligence is about discovering what plausible actions might be taken, or decisions made, at different times. It is about helping you weigh up the current and future risk, cost or benefit of taking one action over another, or making one decision versus another. It is about inferring and summarizing your best options or choices.
Business Intelligence is about using information wisely. It aims to provide warning you of important events, such as takeovers, market changes, and poor staff performance, so that you can take preventative steps. It seeks to help you analyze and make better business decisions, to improve sales or customer satisfaction or staff morale. It presents the information you need, when you need it.
Transactional data is the info stored to track the interactions, or business transactions, carried out by an organization. (Brian Larson, 2008)
Online transaction processing (OLTP) systems record business interactions as they happen. They support the day to day operation of an organization. (Brian Larson, 2008)
A data mart is a body of historical data in an electronic repository that does not participate in the daily operations of the organization. Instead, this data is used to create business intelligence. The data in the data mart usually applies to a specific area of the organization. (Brian Larson, 2008)
While using organizational OLTP systems as a source for BI, number of problems can result. For that what we need to do is take the information stored in these OLTP systems and move it into a different data store. We need to store the data so it is available for BI needs outside of OLTP systems. When data is stored in this manner, it is referred to as a Data Mart. (data copied from OLTP systems periodically and written to data mart is known as data load)
When designing a data mart, the rules of normalization are replaced by a different method of design organized around “facts’. These new design approaches are called stars and snowflakes.
Data mart structure:
The data used for BI can be divided into four categories: measures, dimensions, attributes, and hierarchies. These four types of data help us to define the structure of data mart.
Measure forms the basis of Business Intelligence. They are the basic building blocks of for effective decision making.
A measure is a numeric quantity expressing some aspect of the organization’s performance. The information represented by this quantity is used to evaluate the decision making and performance of the organization. A measure can also be called a fact. (Brian Larson, 2008)
Measures are the facts used for information. Therefore, the tables that have measure information are known as fact tables.
Measures: Three categories:
Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
E.g., count (), sum (), min(), max().
Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg (), min_N (), standard deviation().
Holistic: if there is no constant bound on the storage size needed to describe a sub aggregate.
E.g., median (), mode (), rank ().
A dimension is a categorization used to spread out an aggregate measure to reveal its constituent parts.
Dimensions are used to facilitate slicing and dicing.
Measures and dimensions are stored in a data mart in one of the two layouts/schemas i.e. Star schema and Snowflake schema.
A star schema is a relational database schema used to hold measures and dimensions in a data mart. The measures are stored in a fact table and dimensions are stored in a dimension tables. (Brian Larson, 2008)
Star schema uses two types of tables: fact tables and dimension tables.
Attributes are the additional information about the dimension members in a data mart. They are also used to store the information that may be used to limit or filter the records selected from the data mart during the data analysis.
An attribute is an additional piece of information related to a dimension member that is not the unique identifier or the description of the member. (Brian Larson, 2008)
A dimension is part of the larger structure with many levels. This structure is known as a hierarchy.
A hierarchy is a structure made up of two or more levels of related dimensions. A dimension in the upper level of the hierarchy contains one or more dimensions from the next lower level of the hierarchy. (Brian Larson, 2008)
Snowflake schema is an alternative to star schema. In a snowflake schema, each level of a hierarchy is stored in a separate dimension table. Snowflake schema is more complex than star schema because the tables describing the dimensions are normalized.
Snowflake schema contains all the advantages of a good relational design. It doesn’t result in duplicate data and is, therefore, easier to maintain. The only disadvantage of snowflake schema is that it requires a number of table joins when aggregating measures at upper levels of the hierarchy. Hence in larger data marts this can lead to performance problems.
Ralph Kimball defines the Data Warehouse as” the conglomeration of an organization’s data
warehouse staging and presentation areas, where operational data is specifically structured for
query and analysis performance and easeâ€ofâ€use” (Kimball, 2002).
In simple words, R. Kimball said “A data warehouse is a central repository for all or significant parts of the data that an enterprise’s various business systems collect”.
He also provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis
Bill Inmon’s defines the Data Warehouse in more detail:
“A data warehouse is a subjectâ€oriented, integrated, nonvolatile, and timeâ€variant collection of data
in support of management’s decisions” (Inmon, 2002).
â€¢ It is subject oriented since all data should be related to a specific subject instead
of the company’s operations.
â€¢ A Data Warehouse is defined as integrated, since the data is being fed from
multiple disparate data sources into the data warehouse.
â€¢ Nonvolatile since it stores historical data.
â€¢ Time variant due to the fact that every record stored has been accurate at one
moment in time.
There are three data warehouse models(R. Kimball,2002) :
collects all of the information asset about the entire organization
a subset of data warehouse that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
A set of views over operational databases
Only some of the possible summary views may be materialized
Data Warehouse Usage (R.Kimball, 2002):
There are three kinds of data warehouse applications:
This supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
In this, multidimensional analysis of data warehouse data is done
It supports basic OLAP operations, slicing-dicing, drilling, pivoting
Data mining is the analysis of data with the intent to discover gems of hidden information in the vast quantity of data that has been captured in the normal course of running the business (Moss, 2003).
Hence, from the above definition, we can say that it discovers knowledge from the hidden parts
Moreover, it supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Goals of a Data warehouse:
Data warehouse must make an organization’s information easily accessible and consistent.
Data warehouse must assist in decision making process.
Data warehouse must be adaptive and flexible to the business changes.
Data warehouse must meet the business requirements.
Data warehouse development approaches:
Ralph Kimball and Bill Inmon formed the two different approaches to data warehouse design. These two approaches are: Top-down, bottom-up approaches or a combination of both.
Top-down approach: (Bill Inmon approach)
In top-down approach , first data warehouse is build and then the data marts.
Here the data warehouse is designed based on the normalized enterprise wide data model
In this approach DW acts as a single repository that feeds data into data marts
The top-down design approach exhibits highly consistent dimensional views of data across data marts as the data marts are loaded from the centralized repository.
This approach is robust against the business changes.
It is easy to create a data mart against the data stored in the data warehouse.
The top-down approach consumes more time in its implementation process.
In addition, top-down methodology sometimes become inflexible and unresponsive.
Bottom-up: (Ralph Kimball approach)
In bottom-up approach, first data marts are created and then data warehouse.
It starts with one data mart but later on more data marts can be added.
The bottom-up approach has a quick turn-around.
This approach is easier and faster to implement as one needs to deal just with smaller subject areas in the beginning.
In this approach, there is a long term risk of inconsistencies due to the use of multiple data marts.
Hence, needs to ensure the consistency of metadata.
Hybrid approach: (combination of top-down and bottom-up approaches)
This approach tried to mix the best of both top-down as well as bottom-up approaches.
There are some salient steps for applying this approach.
First step is to start the design by creating the data warehouse and data marts synchronously.
Build out first few data marts that are mutually exclusive and critical.
After this steps, backfill the data warehouse.
Then build the enterprise model and move the data to data warehouse.
Data warehouse Dimensional Model Phases
There are four dimensional model phases as follows:
Indentify the ‘Business Process’
Determine the ‘Grain’ (level of detail needed)
Identify the ‘Facts’
Identify the ‘Dimensions’
The most important thing in the business process is to identify the business requirements of a company and to analyze them thoroughly.
Grain of data – Granularity:
Granularity is the detailed data analysis captured from the data warehouse. More the detail, higher the granularity and vice-versa.
It is similar to the transaction table in an OLTP system. It stores the facts or measures of the business. E.g.: SALES, ORDERS
Hence it contains the metrics resulting from a business process or measurement event, such as the sales ordering process or service call event.
In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables.
It is similar to the master table in an OLTP system. It stores the textual descriptors of the business. E.g.: CUSTOMER, PRODUCT
Hence it contains the descriptive attributes and characteristics associated with speci¬c events, such as the customer, product, or sales representative associated with an order being placed.
Hierarchical many-to-one relationships are de-normalized into single dimension tables.
Data Integration Process Design ( still more to write about dimension etl and facts etl)
Extract, transform and load (ETL) is the core process of data integration and is typically associated with data warehousing. ETL tools are employed in order to populate data warehouse with up-to-date records extracted from source system, therefore, they are useful in organizing the steps of the whole process as a work flow. There are some prominent tasks carried out under this work flow which include: (i) the identification of the relevant information at the source side; (ii) the extraction of this information; (iii) the transportation of this information to the Data Staging Area (DSA), where most of the transformation takes place usually; (iv) the transformation (i.e., customization and integration) of the information extracted from the multiple sources into a common format; (v) the cleansing of the resulting data set, on the basis of the database and business rules; and (vi) the propagation and loading of the data to the data warehouse and the refreshment of data marts.
One important function of ETL is “cleansing” data. The ETL consolidation protocols also include the elimination of duplicate or fragmentary data, so that what passes from the E portion of the process to the L portion is easier to assimilate and/or store. Such cleansing operations can also include eliminating certain kinds of data from the process.
Data cleansing removes inconsistencies and errors from transactional data so it has the consistency necessary for use in data mart. (Brian Larson, 2008)
Data cleansing transforms data into a format that doesn’t cause problems in the data mart environment. It converts inconsistent data types into a single type. Data cleansing translates dissimilar identifiers to a standard set of codes for the data mart. In addition, it repairs or removes any data that does not meet the business rules required by the measures calculated from his data mart.
Data cleansing is usually done as a part of a larger process. This process extracts the data from the OLTP systems and loads it into a data mart. Thus, the entire procedure is known as extract, transform , and load – or ETL.
The Extract, Transform, and Load (ETL) process extracts data to copy from one or more OLTP systems, performs any required data cleansing to transform the data into a consistent format, and loads the cleansed data by inserting it into the data mart. (Brian Larson, 2008)
ETL Architecture (from Vincent’s book)
There are several ways of implementing ETL. The most prominent way is to pull the data from source systems, put it in a staging area it, and then transform it and load it into the data warehouse, as per the top diagram of the figure. Alternatively, instead of putting the data in a staging area, sometimes the ETL server directly with no staging does the transformation and then updates the data warehouse, as shown in the bottom diagram of the figure. The staging area is a physical database or files. Putting the data into the staging area means inserting it into the database or writing it in files.
SQL Server 2008 Business Intelligence Application Development (SSIS, SSAS, SSRS)
SSIS – SQL Server Integration Services, It is a Data Warehousing Tool, Developed by Microsoft.
Microsoft says that SQL Server Integration Services (SSIS) “is a platform for building high performance data integration solutions, including extraction, transformation, and load (ETL) packages for data warehousing.”
SSIS provides the ability to:
retrieve data from any source
Perform various transformations on the data; e.g. convert from one type to another, convert to uppercase or lowercase, perform calculations, etc.
load data into any source
define a workflow
The first version of SSIS was released with SQL Server 2005. SSIS is a replacement for Data Transformation Services (DTS) which was available with SQL Server 7.0 and SQL Server 2000. SSIS builds on the capabilities introduced with DTS.
SSAS – SQL Server Analysis Services:
Microsoft SQL Server 2005 Analysis Services (SSAS) delivers online analytical processing (OLAP) and data mining functionality for business intelligence applications. Analysis Services supports OLAP by letting you design, create, and manage multidimensional structures that contain data aggregated from other data sources, such as relational databases. For data mining applications, Analysis Services lets you design, create, and visualize data mining models that are constructed from other data sources by using a wide variety of industry-standard data mining algorithms.
Microsoft SQL Server 2005 Reporting Services (SSRS) delivers enterprise, Web-enabled reporting functionality so you can create reports that draw content from a variety of data sources, publish reports in various formats, and centrally manage security and subscriptions. For information about other SQL Server2005 components, tools, and resources. Microsoft SQL Server Reporting Services enables organizations to transform valuable enterprise data into shared information for insightful, timely decisions at a lower total cost of ownership.
SQL Server Reporting Services is a comprehensive, server-based solution that enables the creation, management, and delivery of both traditional, paper-oriented reports and interactive, Web-based reports. An integrated part of the Microsoft Business Intelligence framework, Reporting Services combines the data management capabilities of SQL Server and Microsoft Windows Server with familiar and powerful Microsoft Office System applications to deliver real-time information to support daily operations and drive decisions.
With the help of SSIS (SQL Server Integration Services), we can load the data from trade capture system into the database. Here we can extract, transform and load (ETL) the packages for data warehousing. Since behind this system there is no organization supporting it all of the data was created by collecting some information from external sources.
Following are the steps to how to create the SSIS packages using Microsoft Visual Studio for this system.
Create a new flat file connection manager pointing to the source file you want to load. Also create a new OLEDB connection manager.
For the untyped load (as for in this system),
Drag and drop the Data Flow Task in control flow.
Double click on it and open it in Data Flow tab.
Drag a new flat file source. Point this to flat file connection you created before.
Verify that you can see the data.
Inside the Data Flow Task, drag a new OLEDB destination. Point it to the FRM db.
Create a new table and join it to the source. Check the mappings.
Run and verify.
For the typed load (as for in this system),
Again drag and drop the Data Flow Task in control flow.
Double click on it and open it in Data Flow tab.
Drag a new OLEDB source. Double click on it and point it to the FRM db. Create a new table and join it to the source. Also check the mappings. This OLEDB source will be of untyped load.
Now for the data conversion, drag a data conversion toolbox. Double click on it and select the column whose data type has to be changed.
Then again drag two OLEDB destination toolbox. (One will be the final successfully accomplished typed load destination and another will be the error generated destination)
Now double click on typed load destination, point it to the FRM db and create a new table and join it to the source. Also check the mappings.
After this double click on the error generated destination, point it to the FRM db and enter the table name created before and check the mappings.
Run and verify.
Above discussed are the overall simple steps to describe how SSIS packages are created. Now will see in detail how all these steps go in actual creation.
For this system, initially the data created was analyzed using Excel with .csv file extension. There are two .csv files in this system containing quite a large data. And these are the flat file source for this system.
Below are the images of these two .csv files.
Figure: bucketed .csv file
Figure: src_trade_1_risk file .csv file
Following figure shows the total number of created tables in FRM named database for this system.
Figure: tables created in Microsoft SQL Server Management Studio
DBMS – Database management system.
Normalized – Elimination of redundancy in databases so that all columns depend on a primary key.
RDBMS – Relational database management system.
SQL – Structured Query Language is a standard language for communication with a relational database management system (RDBMS).
Schema. Consists of a library, a journal, a journal receiver, an SQL catalog, and optionally a data dictionary. A schema groups related objects and allows you to find the objects by name.
Table. A set of columns and rows
Row. The horizontal part of a table containing a serial set of columns
Column. The vertical part of a table of one data type.
View. A subset of columns and rows of one or more tables.
Package. An object type that is used to run SQL statements.
The most common statements used in this report for query writing are as follows, whereas, other statements are just listed as per the category of statements under which they fall.
Creates a new database
Creates a new index on a table column
Creates a new sequence in an existing database
Creates a new table in an existing database
Creates a new trigger definition
Creates a new view on an existing table
Retrieves records from a table
Adds one or more new records into a table
Modifies the data in existing table records
Removes existing records from a table
Destroys an existing database
Removes a column index from an existing table
Destroys an existing sequence generator
Destroys an existing table
Destroys an existing trigger definition
Destroys an existing table view
Adds a new PostgreSQL user account to the system
Modifies an existing PostgreSQL user account
Removes an existing PostgreSQL user account
Grant rights on a database object to a user
Deny rights on a database object from a user
Creates a new SQL function within a database
Creates a new language definition within a database
Creates a new SQL operator within a database
Creates a new SQL data type within a database
There are four basic types of SQL statements:
Data definition language (DDL) statements
Data manipulation language (DML) statements
Dynamic SQL statements
Following are the most common and basic SQL statements in detail as per the above written statements:
Data definition language statements (DDL)
Data manipulation language (DML) statements
Dynamic SQL statements
BEGIN DECLARE SECTION
END DECLARE SECTION
SET ENCRYPTION PASSWORD
SET RESULT SETS
For the creating the tables
For table name stg_typed_trade
CREATE TABLE [dbo].[stg_typed_trade](
[trade_ref] [varchar](50) NULL,
[value_date] [int] NULL,
[currency] [varchar](50) NULL,
[curve] [varchar](50) NULL,
[funding_currency] [varchar](50) NULL,
[pv] [numeric](18, 4) NULL,
[pvxt] [numeric](18, 4) NULL,
[notional] [numeric](18, 4) NULL,
[theta] [numeric](18, 4) NULL
) ON [PRIMARY]
For table name stg_typed_bucketed
CREATE TABLE [dbo].[stg_typed_bucketed](
[trade_ref] [varchar](50) NULL,
[currency] [varchar](50) NULL,
[funding_currency] [varchar](50) NULL,
[value_date] [varchar](50) NULL,
[value_time] [varchar](50) NULL,
[tenor] [varchar](50) NULL,
[curve] [varchar](50) NULL,
[gamma] [numeric](18, 8) NULL
Get Help With Your Essay
If you need assistance with writing your essay, our professional essay writing service is here to help!
Find out more