杯子茶室

关注有趣的事物

數據倉庫基礎

网络 0 评

Introduction

What is Data Warehouse

"A data warehouse is a subject-oriented, integrated , volatile collection of data in support of management’s decision-making process." —W. H. Inmon
“数据仓库是面向主题的、集成的、易变的数据集合,用于支持管理层的决策过程。” —W. H. Inmon
  • A decision support database 决策支持数据库

    • separately with organization’s operational database 与运营数据库分开维护
  • Support information processing 支持信息处理

    • providing a solid platform of consolidated, historical data for analysis 通过提供用于分析的综合历史数据的可靠平台
  • Data warehousing is the process of constructing and using a data warehouse. Data warehousing是构建和使用数据仓库的过程。
  • A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. 数据仓库是通过集成来自多个异构源的数据构建的,支持分析报告、结构化和/或即席查询和决策。
  • Data warehousing involves 数据仓库涉及

    • data cleaning, 数据清理
    • data integration 数据集成
    • data consolidations. 数据合并。

What is Data Mining?

  • Extraction of interesting patterns or knowledge from huge amount of data 从大量数据中提取有趣的模式或知识

    • non-trivial 非平凡的
    • implicit 隐含的
    • previously unknown 以前未知的
    • potentially useful 可能有用
  • Alternative names

    • Knowledge discovery (mining) in databases (KDD) 数据库中的知识发现(挖掘)(KDD)
    • knowledge extraction 知识提取
    • data/pattern analysis 数据/模式分析
    • data archeology 数据考古
    • data dredging 数据挖掘
    • information harvesting 信息收集
    • business intelligence 商业智能

What is not data mining?

  • Simple search and query processing 简单搜索和查询处理
  • (Deductive) expert systems 专家系统

Knowledge Discovery Process

  • Data Cleaning
  • Data Warehouse
  • Task-relevant Data
  • Data Mining
  • Pattern Evaluation
  • Knowledge

Features

Subject-Oriented 主题导向

  • Organized around major subjects 围绕主要主题组织

    • customer 客户
    • product 产品
    • sales 销售

Integrated 集成/交互

  • Constructed by integrating multiple, heterogeneous data sources– 通过整合多种异构数据源构建而成

    - relational databases, 
    - flat files, 
    - on-line transaction records

Time Variant

Non-Volatile 非易失性

  • A physically separate store of data transformed from the operational environment 从操作环境转换而来的物理上独立的数据存储
  • Operational update of data does not occur in the data warehouse environment 数据仓库环境中不会发生数据的操作更新

    • Does not require transaction processing, recovery, and concurrency control mechanisms 不需要事务处理、恢复和并发控制机制
    • Requires only two operations in data accessing: 数据访问中仅需要两个操作:

      • initial loading of data and access of data 数据的初始加载和数据访问

Why a Separate Data Warehouse? 为什么要有独立的数据仓库

  • High performance for both systems 两种系统均具有高性能

    • DBMS

      • tuned for OLTP: 针对 OLTP(Online transactional processin) 进行调整

        • access methods, 访问方法
        • indexing, 索引
        • concurrency control, 并发控制
        • recovery 恢复
    • Warehouse 仓库

      • tuned for OLAP: 针对 OLAP(online analytical processing) 进行调整

        • complex OLAP queries, 复杂 OLAP 查询
        • multidimensional view, 多维视图
        • consolidation 合并

Data Warehouse Models

  • Enterprise Warehouse 企业仓库

    • collects all of the information about subjects spanning the entire organization 收集有关整个组织的所有主题的信息
  • Data Mart (數據市集)

    • a subset of corporate-wide data that is of value to a specific groups of users. 对特定用户组有价值的公司范围数据子集。
    • Its scope is confined to specific, selected groups, such as marketing data mart 其范围仅限于特定的选定组,例如营销数据集市
  • Independent vs. dependent (directly from warehouse) data mart 独立与依赖(直接来自仓库)数据集市
  • Virtual Warehouse虚拟仓库

    • A set of views over operational databases 一组操作数据库视图
    • Only some of the possible summary views may be materialized 只有一些可能的摘要视图可以实现

Functions of Data Warehouse Tools and Utilities – ETL Process

Extraction, Transformation and Load
  • Data Extraction - get data from multiple, heterogeneous, and external sources 数据提取 - 从多个异构外部来源获取数据

    • Data Cleaning - detect errors in the data and rectify them when possible 数据清理 - 检测数据中的错误并在可能时纠正错误
    • Data Transformation - convert data from legacy or host format to warehouse format 数据转换 - 将数据从旧格式或主机格式转换为仓库格式
    • Load - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions 加载 - 排序、汇总、合并、计算视图、检查完整性以及构建索引和分区
  • Refresh - propagate the updates from the data sources to the warehouse 刷新 - 将更新从数据源传播到仓库

Terminology

Data Mart

  • Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. 数据集市包含组织范围内的数据子集,这些数据对组织中的特定人群很有价值。

    • a data mart contains only those data that is specific to a particular group. 数据集市仅包含特定于特定群体的数据。
    • For example, the marketing data mart may contain only data related to items, customers, and sales. 例如,营销数据集市可能仅包含与项目、客户和销售相关的数据。
    • Data marts are confined to subjects. 数据集市仅限于主题。

Metadata - data about data

  • The data that are used to represent other data is known as metadata. 用于表示其他数据的数据称为元数据。
  • For example, the index of a book serves as a metadata for the contents in the book 例如,一本书的索引是书中内容的元数据
  • In other words, we can say that metadata is the summarized data that leads us to the detailed data. 换句话说,我们可以说元数据是引导我们找到详细数据的汇总数据。

Metadata Repository 元数据仓库

  • Business metadata 业务元数据

    • It contains the data ownership information, business definition, and changing policies. 它包含数据所有权信息、业务定义和变化政策。
  • Operational metadata 运营元数据

    • Currency of data refers to the data being active, archived, or purged. 数据的时效性是指数据处于活动状态、已存档或已清除状态。
    • Lineage of data means history of data migrated and transformation applied on it. 数据沿袭是指数据迁移和转换的历史记录。
  • Data for mapping from operational environment to data warehouse 从操作环境映射到数据仓库的数据

    • source databases and their contents, 源数据库及其内容
    • data extraction, 数据提取
    • data partition, 数据分区
    • cleaning, 清理
    • transformation rules, 转换规则
    • data refresh 数据刷新
    • purging rules 清除规则
  • The algorithms for summarization 汇总算法

    • dimension algorithms, 维度算法
    • data on granularity, 粒度数据
    • aggregation, 聚合
    • summarizing 汇总

Data Cube

  • A data warehouse is based on a multidimensional data model which views data in the form of a data cube 数据仓库基于多维数据模型,该模型以数据立方体的形式查看数据
  • A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions 数据立方体(例如销售)允许以多个维度对数据进行建模和查看
  • It is defined by dimensions and facts. 它由维度和事实定义。
  • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year). 维度表,例如项目(item_name、brand、type)或时间(day、week、month、quarter、year)。
  • Fact table contains measures (such as sales_amount, units_sold) and keys to each of the related dimension tables. 事实表包含度量(例如 sales_amount、units_sold)和每个相关维度表的键。
  • 我的简易理解是他其实就是Group By产生的视图(View),但Data Cube是一个可以用于操作的对象,生成的时候会根据多个汇总维度生成所有的汇总信息。例如:
-- 假设原始数据表为 sales_data
CREATE TABLE sales_data (
    time DATE,
    location VARCHAR(50),
    product VARCHAR(50),
    sales DECIMAL(10, 2)
);

-- 插入示例数据
INSERT INTO sales_data (time, location, product, sales) VALUES
('2023-01-01', 'New York', 'A', 100),
('2023-01-01', 'New York', 'B', 200),
('2023-01-01', 'Los Angeles', 'A', 150),
('2023-01-02', 'New York', 'A', 120),
('2023-01-02', 'Los Angeles', 'B', 180);

-- 构建数据立方体
SELECT 
    time,
    location,
    product,
    SUM(sales) AS total_sales
FROM 
    sales_data
GROUP BY 
    CUBE(time, location, product);

输出示例

timelocationproducttotal_sales
2023-01-01New YorkA100
2023-01-01New YorkB200
2023-01-01Los AngelesA150
2023-01-01NULLNULL450
NULLNULLNULL950

可以看到这个Data Cube其实就是将所有time、location、product做了排列组合,然后筛选所有这些条件生成total_sales。
我理解这其实就是一个用于快速查询的子视图。
至于Fact Table其实就是原本的数据表格,原始数据。

Business Intelligence 商业智能

  • BI (Business Intelligence) is a set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions. BI(商业智能)是一套流程、架构和技术,可将原始数据转换为有意义的信息,从而推动有利可图的商业行动。
  • It is a suite of software and services to transform data into actionable intelligence and knowledge. 它是一套软件和服务,可将数据转换为可操作的情报和知识。
  • BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business. BI 工具执行数据分析并创建报告、摘要、仪表板、地图、图形和图表,为用户提供有关业务性质的详细情报。

Data Warehouse Design and Usage

A Business Analysis Framework 业务分析框架

  • Four views regarding the design of a data warehouse 数据仓库设计的四种视图

    • Top-down view 自上而下的视图

      • allows selection of the relevant information necessary for the data warehouse 允许选择数据仓库所需的相关信息
    • Data source view 数据源视图

      • exposes the information being captured, stored, and managed by operational systems 公开操作系统捕获、存储和管理的信息
    • Data warehouse view 数据仓库视图

      • consists of fact tables and dimension tables 由事实表和维度表组成
    • Business query view 业务查询视图

      • sees the perspectives of data in the warehouse from the view of end-user 从最终用户的角度查看仓库中的数据

Data Warehouse Design Process

  • Top-down, bottom-up approaches or a combination of both 自上而下、自下而上的方法或两者结合

    • Top-down: Starts with overall design and planning (mature) 自上而下:从总体设计和规划开始(成熟)
    • Bottom-up: Starts with experiments and prototypes (rapid) 自下而上:从实验和原型开始(快速)
  • From software engineering point of view 从软件工程的角度来看

    • Waterfall: structured and systematic analysis at each step before proceeding to the next 瀑布式:在进行下一步之前,对每一步进行结构化和系统化的分析
    • Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around 螺旋式:快速生成功能日益丰富的系统,缩短周转时间,快速周转
  • Typical data warehouse design process 典型的数据仓库设计流程

    • Choose a business process to model, e.g., orders, invoices, etc. 选择要建模的业务流程,例如订单、发票等。
    • Choose the grain (atomic level of data) of the business process 选择业务流程的粒度(数据的原子级别)
    • Choose the dimensions that will apply to each fact table record 选择将应用于每个事实表记录的维度
    • Choose the measure that will populate each fact table record 选择将填充每个事实表记录的度量

Data Warehouse Usage

Information processing

  • querying, 查询
  • basic statistical analysis, 基本统计分析
  • reporting using crosstabs, 使用交叉表进行报告
  • tables, 表格
  • charts 图表
  • graphs 图形

Analytical processing

  • multidimensional analysis of data warehouse data 数据仓库数据的多维分析

    • basic OLAP operations 基本 OLAP 操作
    • slice-dice 切片
    • drilling 下探
    • pivoting 旋转

Data mining

  • knowledge discovery from hidden patterns 从隐藏模式中发现知识

    • associations 关联
    • constructing analytical models 构建分析模型
    • performing classification 执行分类
    • prediction 预测
    • presenting the mining results using visualization tools 使用可视化工具呈现挖掘结果

Data Warehouse Applications

  • A data warehouse helps business executives to organize, analyze, and use their data for decision making. 数据仓库帮助企业高管组织、分析和使用数据进行决策。
  • Data warehouses are widely used in the following fields: 数据仓库广泛应用于以下领域:

    • Financial services 金融服务
    • Banking services 银行服务
    • Consumer goods 消费品
    • Retail sectors 零售业
    • Controlled manufacturing 受控制造
數據挖掘基礎
发表评论
撰写评论