杯子茶室

Paying attention to interesting things

Data Mining Fundamentals

Example: A Web Mining Framework

  • Data cleaning
  • Data integration from multiple sources
  • Warehousing the data
  • Data cube construction
  • Data selection for data mining
  • Data mining
  • Presentation of the mining results
  • Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence

The layers, from most to least support for business decisions, with the associated role in parentheses:

  • Decision Making (End User)
  • Data Presentation (Business Analyst)
  • Data Mining (Data Analyst)
  • Data Exploration
  • Data Preprocessing/Integration, Data Warehouses
  • Data Sources (DBA)

Mining vs. Data Exploration

  • Business intelligence view

    • Warehouse, data cube, and reporting, but not much mining
  • Business objects vs. data mining tools
  • Supply chain example: tools
  • Data presentation
  • Exploration

KDD Process: A Typical View from ML and Statistics

  • Pattern discovery
  • Association & correlation
  • Classification
  • Clustering
  • Outlier analysis
  • ...

Multi-Dimensional View of Data Mining

  • Data to be mined

    • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs, and social and information networks
  • Knowledge to be mined (or: Data mining functions)

    • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
    • Descriptive vs. predictive data mining
    • Multiple/integrated functions and mining at multiple levels
  • Techniques utilized

    • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
  • Applications adapted

    • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

What Kind of Data Can Be Mined?

  • Database-oriented data sets and applications

    • Relational database, data warehouse, transactional database
  • Advanced data sets and advanced applications

    • Data streams and sensor data
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • ...

Data Mining Functions

Generalization

  • Information integration and data warehouse construction

    • Data cleaning, transformation, integration, and multidimensional data model
  • Data cube technology

    • Scalable methods for computing (i.e., materializing) multidimensional aggregates
    • OLAP (online analytical processing)
  • Multidimensional concept description: Characterization and discrimination

    • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
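
To make "materializing multidimensional aggregates" concrete, here is a minimal sketch that computes every group-by (cuboid) over a tiny in-memory fact table. The table `ROWS`, the function `build_cube`, and the dimension names are hypothetical and chosen only for illustration; real OLAP engines use far more scalable methods than this brute-force loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: each row has two dimensions and one measure.
ROWS = [
    {"region": "dry", "season": "summer", "sales": 10},
    {"region": "dry", "season": "winter", "sales": 4},
    {"region": "wet", "season": "summer", "sales": 7},
    {"region": "wet", "season": "winter", "sales": 9},
    {"region": "wet", "season": "summer", "sales": 3},
]


def build_cube(rows, dimensions, measure):
    """Sum the measure for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


cube = build_cube(ROWS, ("region", "season"), "sales")
print(cube[()])             # grand total (apex cuboid)
print(cube[("region",)])    # totals per region, e.g. dry vs. wet
```

Looking up `cube[("region",)]` gives the dry vs. wet contrast from the bullet above, while `cube[("region", "season")]` is the most detailed (base) cuboid.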

Association and Correlation Analysis

  • Frequent patterns (or frequent itemsets): What items are frequently purchased together in your Walmart?
  • Association, correlation vs. causality

    • A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence)
    • Are strongly associated items also strongly correlated?
  • How to mine such patterns and rules efficiently in large datasets?
  • How to use such patterns for classification, clustering, and other applications?
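
As a rough illustration of the numbers attached to a rule such as Diaper → Beer, the sketch below computes support and confidence over a handful of made-up transactions, plus lift as one common correlation measure. The transaction data and the function names are assumptions for this example, and it does not attempt the efficient mining that the questions above ask about.

```python
# Hypothetical market-basket data: one set of items per transaction.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs); assumes lhs occurs in at least one transaction."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence relative to the baseline frequency of rhs (1.0 = independent)."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


print(support(TRANSACTIONS, {"diaper", "beer"}))
print(confidence(TRANSACTIONS, {"diaper"}, {"beer"}))
print(lift(TRANSACTIONS, {"diaper"}, {"beer"}))
```

A rule can have high confidence simply because its right-hand side is common, which is why a measure like lift helps answer whether strongly associated items are also correlated.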

Classification

  • Classification and label prediction

    • Construct models (functions) based on some training examples
    • Describe and distinguish classes or concepts for future prediction

      • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
    • Predict some unknown class labels
  • Typical methods

    • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
  • Typical applications:

    • Credit card fraud detection, direct marketing, classifying stars, diseases, web pages
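
The "construct a model from training examples, then predict unknown labels" loop can be sketched with one of the listed methods, naïve Bayesian classification. This is a minimal categorical version with add-one (Laplace) smoothing; the car data, feature names, and labels are invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training examples: (features, class label).
TRAIN = [
    ({"mileage": "high", "weight": "light"}, "economy"),
    ({"mileage": "high", "weight": "light"}, "economy"),
    ({"mileage": "high", "weight": "heavy"}, "economy"),
    ({"mileage": "low", "weight": "heavy"}, "guzzler"),
    ({"mileage": "low", "weight": "heavy"}, "guzzler"),
    ({"mileage": "low", "weight": "light"}, "guzzler"),
]


def train(examples):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (label, feature) -> value counts
    feature_values = defaultdict(set)     # feature -> distinct values seen
    for features, label in examples:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return class_counts, value_counts, feature_values


def predict(model, features):
    """Return the label with the highest smoothed log-posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing keeps unseen values from zeroing the score.
            score += math.log((seen + 1) / (count + len(feature_values[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, {"mileage": "high", "weight": "light"}))
```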

Cluster Analysis

  • Unsupervised learning (i.e., class label is unknown)
  • Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  • Principle: Maximizing intra-class similarity & minimizing interclass similarity
  • Many methods and applications
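
One of the many clustering methods is k-means, which is not named in these notes but follows the stated principle: points are repeatedly assigned to the nearest centroid, and centroids are moved to the mean of their points. The sketch below assumes 2-D house coordinates and hand-picked starting centroids, both hypothetical.

```python
import math

# Hypothetical house locations as (x, y) coordinates.
HOUSES = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing moves."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [mean_point(c) if c else old
                   for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


centroids, clusters = kmeans(HOUSES, [HOUSES[0], HOUSES[3]])
print(centroids)
print(clusters)
```

The result depends on the starting centroids; practical implementations usually try several random initializations.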

Outlier Analysis

  • Outlier: A data object that does not comply with the general behavior of the data
  • Noise or exception?

    • One person’s garbage could be another person’s treasure
  • Methods: by-product of clustering or regression analysis, …
  • Useful in fraud detection, rare events analysis
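
As an example of outliers falling out of regression analysis as a by-product, the sketch below fits a least-squares line and flags points whose residual is large compared with the spread of all residuals. The data, the function names, and the threshold of 2 standard deviations are assumptions for illustration, not a recommended setting.

```python
import statistics

# Hypothetical measurements: y is roughly 2x + 1, except at x = 4.
XS = [0, 1, 2, 3, 4, 5, 6, 7]
YS = [1, 3, 5, 7, 30, 11, 13, 15]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_outliers(xs, ys, threshold=2.0):
    """Return points whose residual exceeds threshold * residual std dev."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if abs(r) > threshold * spread]


outliers = residual_outliers(XS, YS)
print(outliers)
```

Whether the flagged point is noise to discard or an exception worth investigating is exactly the "noise or exception?" question above; the code cannot decide that.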

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

  • Sequence, trend and evolution analysis

    • Trend, time-series, and deviation analysis: e.g., regression and value prediction
    • Sequential pattern mining

      • e.g., first buy digital camera, then buy large SD memory cards
    • Periodicity analysis
    • Motifs and biological sequence analysis

      • Approximate and consecutive motifs
    • Similarity-based analysis
  • Mining data streams: ordered, time-varying, potentially infinite data streams
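
The camera-then-SD-card example can be checked with a small support count over purchase histories. This sketch simplifies sequential pattern mining by treating each purchase as a single item (not an itemset) and only tests a given pattern instead of discovering patterns; the data and function names are hypothetical.

```python
# Hypothetical purchase histories, one ordered list per customer.
HISTORIES = [
    ["camera", "sd_card"],
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["laptop"],
]


def contains_in_order(sequence, pattern):
    """True if pattern's items appear in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # Each 'in' consumes the iterator up to the match, enforcing order.
    return all(item in remaining for item in pattern)


def sequence_support(histories, pattern):
    """Fraction of histories that contain the pattern as a subsequence."""
    hits = sum(contains_in_order(h, pattern) for h in histories)
    return hits / len(histories)


print(sequence_support(HISTORIES, ["camera", "sd_card"]))
print(sequence_support(HISTORIES, ["sd_card", "camera"]))
```

Unlike a frequent itemset, the order matters here: the two patterns contain the same items but have different support.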

Structure and Network Analysis

  • Graph mining

    • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
  • Information network analysis

    • Social networks: actors (objects, nodes) and relationships (edges)

      • e.g., author networks in CS, terrorist networks
    • Multiple heterogeneous networks

      • A person could be in multiple information networks: friends, family, classmates, …
    • Links carry a lot of semantic information: Link mining
  • Web mining

    • The Web is a big information network: from PageRank to Google
    • Analysis of Web information networks

      • Web community discovery, opinion mining, usage mining, …
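
Since PageRank is mentioned as the classic example of link analysis, here is a minimal power-iteration sketch on a four-page toy graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration; this is a textbook-style simplification, not how any production search engine computes rankings.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n
                    for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                new_rank[t] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "c" ends up with the highest score because three pages link to it, while "d", which nothing links to, keeps only the baseline share.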

Evaluation of Knowledge

Is all mined knowledge interesting?

  • One can mine a tremendous number of “patterns” and a great deal of knowledge
  • Some may fit only a certain dimension space (time, location, …)
  • Some may not be representative, may be transient, …

Evaluation of mined knowledge → directly mine only interesting knowledge?

  • Descriptive vs. predictive
  • Coverage
  • Typicality vs. novelty
  • Accuracy
  • Timeliness

Technologies Used

  • Pattern Recognition
  • Machine Learning
  • Applications
  • Algorithms
  • Database Technology
  • Visualization
  • High-Performance Computing
  • Statistics

Why Confluence of Multiple Disciplines?

  • Tremendous amount of data
  • High dimensionality of data
  • High complexity of data
  • New and sophisticated applications

What Kind of Applications Are Targeted?

  • Web page analysis
  • Collaborative analysis & recommender systems
  • Basket data analysis to targeted marketing
  • Biological and medical data analysis
  • Data mining and software engineering