杯子茶室

关注有趣的事物

數據倉庫架構基礎

网络 0 评

Architecture Properties 架构属性

Separation 分离

Analytical and transactional processing should be kept apart as much as possible. 分析和交易处理应尽可能分开。

Scalability 可扩展性

Hardware and software architectures should be easy to upgrade as the data volume, which has to be managed and processed, and the number of users’ requirements, which have to be met, progressively increase. 随着需要管理和处理的数据量以及需要满足的用户需求数量的不断增加,硬件和软件架构应易于升级。

Extensibility 可扩展性

The architecture should be able to host new applications and technologies without redesigning the whole system.
架构应能够承载新的应用程序和技术,而无需重新设计整个系统。

Security 安全性

Monitoring accesses is essential because of the strategic data stored in data warehouses.
由于数据仓库中存储着战略数据,因此监控访问至关重要。

Administerability 可管理性

Data warehouse management should not be overly difficult.
数据仓库管理不应过于困难。

Common Architectures 通常架构

  • Two different classifications are commonly adopted for data warehouse architectures. 数据仓库架构通常采用两种不同的分类。
  • The first classification is a structure-oriented one that depends on the number of layers used by the architecture. 第一种分类是面向结构的分类,取决于架构使用的层数

    • Single-Layer Architecture 单层架构
    • Two-Layer Architecture 双层架构
    • Three-Layer Architecture 三层架构
  • The second classification depends on how the different layers are employed to create enterprise-oriented or department-oriented views of data warehouses. 第二种分类取决于如何使用不同的层来创建面向企业或面向部门的数据仓库视图。

Layered Architectures

There are mainly three types of layered architectures: 分层架构主要有三种类型:

  • Single-Layer Architecture 单层架构

    • The objective of a single layer is to minimize the amount of data stored. This goal is to remove data redundancy. This architecture is not frequently used in practice. 单层架构的目标是尽量减少存储的数据量。此目标是消除数据冗余。这种架构在实践中并不常用。
  • Two-Layer Architecture 双层架构

    • Two-layer architecture separates physically available sources and data warehouse. This architecture is not expandable and also not supporting a large number of end users. It also has connectivity problems because of network limitations. 双层架构将物理可用源和数据仓库分开。这种架构不可扩展,也不支持大量最终用户。由于网络限制,它还存在连接问题。

Single Layer Architecture

            Operational Data        Source Layer
                    ↓
                Middleware        Data warehouse
                    /\
                   /  \
                  /    \
                 /      \
                ↓        ↓
    Reporting Tools   OLAP Tools       Analysis

Data warehouse和Analysis在单层架构中是虚拟层

Two-Layer Architecture

Operational data    External Data                        Source layer
          ↳   ETL Tools      ↵                                Data staging
                  ↓
            Data Warehouse ↤↦ Meta-data                
                  ↓                                     Data warehous layer
              Data marts
    ↓            ↓            ↓            ↓
Reporting    OLAP tools    Data mining    Wath-if analysis    Analysis

Three-Layer (-Tier) Architecture

This is the most widely used architecture. It consists of the Top, Middle and Bottom Tier. 这是最广泛使用的架构。它由顶层、中层和底层组成。

    1. Bottom Tier: The database of the data warehouse servers as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools. 底层:数据仓库服务器的数据库作为底层。它通常是关系数据库系统。使用后端工具清理、转换和加载数据到此层。
    2. Middle Tier: The middle tier in a data warehouse is an OLAP server which is implemented using either ROLAP or MOLAP model*. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database. 中间层:数据仓库中的中间层是 OLAP 服务器,使用 ROLAP 或 MOLAP 模型实现。对于用户,此应用程序层呈现数据库的抽象视图。此层还充当最终用户和数据库之间的中介。
    • ROLAP – Relational OLAP; MOLAP – Multidimensional OLAP
    • ROLAP – 关系 OLAP;MOLAP – 多维 OLAP
    1. Top-Tier: The top tier is a front-end client layer. Top tier is the tools and API that you connect and get data out from the data warehouse. It could be Query tools, reporting tools, managed query tools, Analysis tools and Data mining tools. 顶层是前端客户端层。顶层是您连接并从数据仓库中获取数据的工具和 API。它可以是查询工具、报告工具、托管查询工具、分析工具和数据挖掘工具。

    Factors That Affect Choosing A Data Warehouse Architecture 影响选择数据仓库架构的因素

    • Information Interdependence between Organizational Units 组织单位之间的信息相互依赖性
    • Upper Management’s Information Needs 高层管理人员的信息需求
    • Urgency of Need for a Data Warehouse 数据仓库需求的紧迫性
    • Nature of End-User Tasks 最终用户任务的性质
    • Constraints on Resources 资源限制
    • Strategic View of the Data Warehouse Prior to Implementation 实施前对数据仓库的战略视角
    • Compatibility with Existing Systems 与现有系统的兼容性
    • Perceived Ability of the In-House IT Staff 内部 IT 员工的能力感知
    • Technical Issues 技术问题
    • Social/Political Factors 社会/政治因素

    Data Warehouse Components

    • The data warehouse is based on an RDBMS server which is a central information repository that is surrounded by some key components to make the entire environment functional, manageable and accessible. 数据仓库基于 RDBMS 服务器,它是一个中央信息存储库,由一些关键组件包围,使整个环境功能齐全、易于管理和访问。
    • There are five major components of Data Warehouse: 数据仓库有五个主要组件

      1. Data Warehouse Database 数据仓库数据库
      2. ETL Tools
      3. Metadata 元数据
      4. Query Tools 查询工具
      5. Data Warehouse Bus Architecture 数据仓库总线架构

    Data Warehouse Database

    • The central database is the foundation of the data warehousing environment. 中央数据库是数据仓库环境的基础。
    • This database is implemented on the RDBMS technology. 该数据库基于 RDBMS 技术实现。
    • Although, this kind of implementation is constrained by the fact that traditional RDBMS system is optimized for transactional database processing and not for data warehousing. 但这种实现方式受到传统 RDBMS 系统针对事务数据库处理进行优化而非数据仓库优化的限制。
    • For instance, ad-hoc query, multi-table joins, aggregates are resource intensive and slow down performance 例如,即席查询、多表连接、聚合会占用大量资源并降低性能

    ETL Tools

    • The data sourcing, transformation, and migration tools are used for performing all the conversions, summarizations, and all the changes needed to transform data into a unified format in the data warehouse. 数据采购、转换和迁移工具用于执行所有转换、汇总以及将数据转换为数据仓库中的统一格式所需的所有更改。
    • They are also called Extract, Transform and Load (ETL) Tools. 它们也称为提取、转换和加载 (ETL) 工具。

      • Their functionality includes: 其功能包括:

        • Anonymize data as per regulatory stipulations. 根据监管规定匿名化数据。
        • Eliminating unwanted data in operational databases from loading into Data warehouse.消除操作数据库中不需要的数据,避免将其加载到数据仓库中。
        • Search and replace common names and definitions for data arriving from different sources. 搜索和替换来自不同来源的数据的通用名称和定义。
        • Calculating summaries and derived data. 计算摘要和派生数据。
        • In case of missing data, populate them with defaults. 如果有缺失数据,则使用默认值填充。
        • De-duplicated repeated data arriving from multiple data sources. 对来自多个数据源的重复数据进行去重。
      • These Extract, Transform, and Load tools may generate cron jobs, background jobs, Cobol programs, shell scripts, etc. that regularly update data in data warehouse. 这些提取、转换和加载工具可能会生成 cron 作业、后台作业、Cobol 程序、shell 脚本等,以定期更新数据仓库中的数据。
      • These tools are also helpful to maintain the Metadata. 这些工具也有助于维护元数据。
      • These ETL Tools have to deal with challenges of Database & Data heterogeneity. 这些 ETL 工具必须应对数据库和数据异构性的挑战。

    Metadata

    • The name Metadata suggests some high-level technological concept. 元数据这个名字暗示了一些高级技术概念。
    • Metadata is data about data which defines the data warehouse. 元数据是关于数据的数据,它定义了数据仓库。
    • It is used for building, maintaining and managing the data warehouse. 它用于构建、维护和管理数据仓库。
    • In the Data Warehouse Architecture, metadata plays an important role as it specifies the source, usage, values, and features of data warehouse data. It also defines how data can be changed and processed. It is closely connected to the data warehouse 在数据仓库架构中,元数据起着重要作用,因为它指定了数据仓库数据的来源、用途、值和特征。它还定义了如何更改和处理数据。它与数据仓库紧密相关
    • For example, a line in sales database may contain: 例如,销售数据库中的一行可能包含:

      • 4030 KJ732 299.90
    • This is a meaningless data until we consult the Meta that tell us it was 这是一个毫无意义的数据,直到我们查阅元数据告诉我们它是

      • Model number: 4030
      • Sales Agent ID: KJ732
      • Total sales amount of $299.90
    • Therefore, Metadata are essential ingredients in the transformation of data into knowledge. 因此,元数据是将数据转换为知识的重要组成部分。
    • Metadata helps to answer the following questions: 元数据有助于回答以下问题:

      • What tables, attributes, and keys does the Data Warehouse contain? 数据仓库包含哪些表、属性和键?
      • Where did the data come from? 数据来自哪里?
      • How many times do data get reloaded? 数据重新加载了多少次?
      • What transformations were applied with cleansing? 清理时应用了哪些转换?
    • Metadata can be classified into following categories: 元数据可分为以下几类:

      1. Technical Meta Data: This kind of Metadata contains information about warehouse which is used by Data warehouse designers and administrators. 技术元数据:这种元数据包含有关仓库的信息,供数据仓库设计人员和管理员使用。
      2. Business Meta Data: This kind of Metadata contains detail that gives end-users a way easy to understand information stored in the data warehouse. 业务元数据:这种元数据包含详细信息,使最终用户可以轻松理解存储在数据仓库中的信息。

    Query Tools

    • One of the primary objects of data warehousing is to provide information to businesses to make strategic decisions. 数据仓库的主要目标之一是向企业提供信息,以便他们做出战略决策。
    • Query tools allow users to interact with the data warehouse system. 查询工具允许用户与数据仓库系统进行交互。
    • These tools fall into four different categories: 这些工具分为四个不同的类别:

      1. Query and reporting tools 查询和​​报告工具
      2. Application development tools 应用程序开发工具
      3. Data mining tools 数据挖掘工具
      4. OLAP tools

    Query and Reporting Tools

    1. Reporting tools 报告工具

      • Reporting tools can be further divided into production reporting tools and desktop report writer. 报告工具可进一步分为生产报告工具和桌面报告编写器。
      • Report writers: This kind of reporting tool are tools designed for end-users for their analysis. 报告编写器:此类报告工具是专为最终用户分析而设计的工具。
      • Production reporting: This kind of tools allows organizations to generate regular operational reports. It also supports high volume batch jobs like printing and calculating. Some popular reporting tools are Brio, Business Objects, Oracle, PowerSoft, SAS Institute. 生产报告:此类工具允许组织生成定期运营报告。它还支持大量批处理作业,如打印和计算。一些流行的报告工具是 Brio、Business Objects、Oracle、PowerSoft、SAS Institute。
    2. Managed query tools 托管查询工具

      • This kind of access tools helps end users to resolve snags in database and SQL and database structure by inserting meta-layer between users and database. 此类访问工具通过在用户和数据库之间插入元层,帮助最终用户解决数据库和 SQL 以及数据库结构中的障碍。

    Application Development Tools

    • Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. 有时内置的图形和分析工具不能满足组织的分析需求。
    • In such cases, custom reports are developed using Application Development Tools. 在这种情况下,使用应用程序开发工具开发自定义报告。

    Data Mining Tools

    • Data mining is a process of discovering meaningful new correlation, patterns, and trends by mining large amount data. 数据挖掘是通过挖掘大量数据来发现有意义的新关联、模式和趋势的过程。
    • Data mining tools are used to make this process automatic. 数据挖掘工具用于使该过程自动化。

    OLAP Tools

    • These tools are based on concepts of a multidimensional database 这些工具基于多维数据库的概念
    • It allows users to analyze the data using elaborate and complex multidimensional views. 它允许用户使用精细而复杂的多维视图来分析数据。

    Data Warehouse Bus Architecture 数据仓库总线架构

    • Data warehouse bus determines the flow of data in your warehouse. 数据仓库总线决定了仓库中的数据流。
    • The data flow in a data warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Meta flow. 数据仓库中的数据流可分为流入、上流、下流、流出和元流。
    • While designing a Data Bus, one needs to consider the shared dimensions, facts across data marts. 在设计数据总线时,需要考虑数据集市之间的共享维度和事实。

    Data Marts

    A data mart is an access layer which is used to get data out to the users. 数据集市是一个访问层,用于将数据传输给用户。

    Data Warehouse Architecture Best Practices

    • Use a data model which is optimized for information retrieval which can be the dimensional mode, denormalized or hybrid approach. 使用针对信息检索进行优化的数据模型,可以是维度模式、非规范化或混合方法。
    • Need to assure that Data is processed quickly and accurately. At the same time, you should take an approach which consolidates data into a single version of the truth. 需要确保快速准确地处理数据。同时,您应该采取一种将数据整合为单一事实版本的方法。
    • Carefully design the data acquisition and cleansing process for Data warehouse. 精心设计数据仓库的数据采集和清理过程。
    • Design a Metadata architecture which allows sharing of metadata between components of Data Warehouse. 设计允许在数据仓库组件之间共享元数据的元数据架构。
    • Consider implementing an ODS model when information retrieval need is near the bottom of the data abstraction pyramid or when there are multiple operational sources required to be accessed. 当信息检索需求接近数据抽象金字塔的底部或需要访问多个操作源时,请考虑实施 ODS 模型。
    • One should make sure that the data model is integrated and not just consolidated. In that case, you should consider 3NF data model. It is also ideal for acquiring ETL and Data cleansing tools 应确保数据模型是集成的,而不仅仅是合并的。在这种情况下,您应该考虑 3NF 数据模型。它也是获取 ETL 和数据清理工具的理想选择
    數據倉庫建模
    发表评论
    撰写评论