Example: A Web Mining Framework
- Data cleaning
- Data integration from multiple sources
- Warehousing the data
- Data cube construction
- Data selection for data mining
- Data mining
- Presentation of the mining results
- Patterns and knowledge to be used or stored in a knowledge base
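The steps above can be read as a pipeline in which each stage feeds the next. The following is a minimal sketch only: the click logs `log_a` and `log_b`, the function names, and the "count visitors per page" mining step are all hypothetical stand-ins, and the warehousing and data cube steps are skipped for brevity.

```python
from collections import Counter

# Hypothetical raw click logs from two sources: (user, page) pairs.
log_a = [("u1", "/home"), ("u1", "/cart"), ("u2", None), ("u2", "/home")]
log_b = [("u3", "/home"), ("u3", "/cart"), ("u1", "/home")]

def clean(records):
    """Data cleaning: drop records with a missing page."""
    return [(u, p) for u, p in records if p is not None]

def integrate(*sources):
    """Data integration: merge sources, dropping exact duplicates, keeping order."""
    seen, merged = set(), []
    for source in sources:
        for rec in source:
            if rec not in seen:
                seen.add(rec)
                merged.append(rec)
    return merged

def select(records, prefix="/"):
    """Data selection: keep only the pages relevant to the mining task."""
    return [(u, p) for u, p in records if p.startswith(prefix)]

def mine(records):
    """Data mining (trivial stand-in): count distinct visitors per page."""
    return Counter(p for _, p in set(records))

def present(counts):
    """Presentation: format the results, most visited page first."""
    return [f"{page}: {n} visitor(s)" for page, n in counts.most_common()]

warehouse = integrate(clean(log_a), clean(log_b))
report = present(mine(select(warehouse)))
```

In a real framework each stage would be far more involved; the point here is only the order of the stages and the fact that mining runs on cleaned, integrated, selected data.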
Data Mining in Business Intelligence
Layer | Typical user | Support for business decisions |
---|---|---|
Decision Making | End User | Most |
Data Presentation | Business Analyst | |
Data Mining | Data Analyst | |
Data Exploration | | |
Data Preprocessing/Integration, Data Warehouses | | |
Data Sources | DBA | Least |
Mining vs. Data Exploration
Business intelligence view
- Warehouse, data cube, reporting, but not much mining
- Business objects vs. data mining tools
- Supply chain example: tools
- Data presentation
- Exploration
KDD Process: A Typical View from ML and Statistics
- Pattern discovery
- Association & correlation
- Classification
- Clustering
- Outlier analysis
- ...
Multi-Dimensional View of Data Mining
Data to be mined
- Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs, and social and information networks
Knowledge to be mined (or: Data mining functions)
- Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
- Descriptive vs. predictive data mining
- Multiple/integrated functions and mining at multiple levels
Techniques utilized
- Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
- Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
What Kind of Data Can Be Mined?
Database-oriented data sets and applications
- Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
- Data streams and sensor data
- Time-series data, temporal data, sequence data (incl. bio-sequences)
- ...
Data Mining Functions
Generalization
Information integration and data warehouse construction
- Data cleaning, transformation, integration, and multidimensional data model
Data cube technology
- Scalable methods for computing (i.e., materializing) multidimensional aggregates
- OLAP (online analytical processing)
Multidimensional concept description: Characterization and discrimination
- Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
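To make "materializing multidimensional aggregates" concrete, here is a deliberately naive sketch that computes the sum of a measure for every subset of dimensions (every cuboid). The fact table, the dimension names, and the function `materialize_cube` are hypothetical, and this brute-force loop is not one of the scalable methods the notes refer to; it only shows what is being computed.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales).
facts = [
    ("dry", "umbrella", "Q1", 5),
    ("dry", "sunscreen", "Q1", 40),
    ("wet", "umbrella", "Q1", 60),
    ("wet", "umbrella", "Q2", 30),
]
dimensions = ("region", "product", "quarter")

def materialize_cube(rows, dims):
    """Compute SUM(sales) for every subset of dimensions (all 2^n cuboids)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                totals[key] += row[-1]  # last column is the measure
            names = tuple(dims[i] for i in group)
            cube[names] = dict(totals)
    return cube

cube = materialize_cube(facts, dimensions)
```

Comparing `cube[("region",)]` entries for `dry` and `wet` is a small instance of contrasting two classes of data, and an OLAP roll-up or drill-down corresponds to moving between cuboids with fewer or more dimensions.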
Association and Correlation Analysis
- Frequent patterns (or frequent itemsets): What items are frequently purchased together in your Walmart?
Association, correlation vs. causality
- A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence)
- Are strongly associated items also strongly correlated?
- How to mine such patterns and rules efficiently in large datasets?
- How to use such patterns for classification, clustering, and other applications?
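The two numbers attached to a rule such as Diaper → Beer can be computed directly from a transaction list. The sketch below uses a small hypothetical set of baskets (so the figures differ from the 0.5% and 75% quoted above) and adds lift, one common correlation measure that is an assumption here and not named in these notes, to illustrate the "associated vs. correlated" question.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Estimated P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

def lift(lhs, rhs, db):
    """Confidence divided by support(rhs); above 1 suggests positive correlation."""
    return confidence(lhs, rhs, db) / support(rhs, db)

lhs, rhs = {"diaper"}, {"beer"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
rule_lift = lift(lhs, rhs, transactions)
```

A rule can have high confidence simply because its right-hand side is popular on its own; dividing by `support(rhs)` is one way to check for that. None of these measures says anything about causality.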
Classification
Classification and label prediction
- Construct models (functions) based on some training examples
- Describe and distinguish classes or concepts for future prediction
- E.g., classify countries based on (climate), or classify cars based on (gas mileage)
- Predict some unknown class labels
Typical methods
- Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications
- Credit card fraud detection, direct marketing, classifying stars, diseases, web pages
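As a small illustration of "construct a model from training examples, then predict unknown labels", the sketch below fits a one-level decision tree (a decision stump) on a single feature. The training data, the labels `low` and `high`, and the function names are hypothetical; a real decision tree learner would consider many features and use an impurity measure, not a raw error count.

```python
# Hypothetical training set: (miles per gallon, class label).
training = [(12, "low"), (15, "low"), (18, "low"),
            (31, "high"), (35, "high"), (40, "high")]

def predict(threshold, value):
    """Predict "high" when the value exceeds the threshold, else "low"."""
    return "high" if value > threshold else "low"

def fit_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, v) != label for v, label in examples)

    return min(candidates, key=errors)

threshold = fit_stump(training)
# Label two cars whose class is unknown.
predictions = [predict(threshold, mpg) for mpg in (14, 38)]
```

The learned `threshold` is the model: it both describes how the two classes differ on the training data and is used to label unseen cars.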
Cluster Analysis
- Unsupervised learning (i.e., class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: Maximizing intra-class similarity & minimizing interclass similarity
- Many methods and applications
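One of the many methods is k-means, which is used here purely as an example and is not singled out by these notes. The sketch clusters one-dimensional values; the house prices and the initial centers are hypothetical, and the initial centers are supplied by the caller to keep the run deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd-style k-means on 1-D data with caller-supplied initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: each point joins its nearest center.
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # A center with an empty cluster stays where it is.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands).
prices = [100, 110, 120, 400, 410, 420]
centers, clusters = kmeans_1d(prices, centers=[100, 120])
```

No class labels are given: the two groups emerge only from the requirement that points in the same cluster be close to each other and far from the other cluster.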
Outlier Analysis
- Outlier: A data object that does not comply with the general behavior of the data
Noise or exception?
- One person’s garbage could be another person’s treasure
- Methods: by-product of clustering or regression analysis, …
- Useful in fraud detection, rare events analysis
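The "by-product of regression" idea can be sketched as follows: fit a line, then flag the points the line explains worst. The data, the function name `regression_outliers`, and the cutoff of two standard deviations of the residuals are all assumptions made for illustration; other cutoffs and more robust fits are equally plausible.

```python
from statistics import mean, pstdev

def regression_outliers(xs, ys, k=2.0):
    """Fit y = a*x + b by least squares; flag points with |residual| > k std devs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    cutoff = k * pstdev(residuals)  # k is an assumed, tunable threshold
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > cutoff]

# Hypothetical data: roughly y = 2x + 1, with one point far off the trend.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 30, 11, 13, 15]
outliers = regression_outliers(xs, ys)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) is a judgment the method itself cannot make.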
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
- Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining
- e.g., first buy digital camera, then buy large SD memory cards
- Periodicity analysis
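What distinguishes a sequential pattern from a frequent itemset is that order matters. The sketch below measures how often one purchase is followed, possibly with other purchases in between, by another. The purchase histories and function names are hypothetical, and this only scores a given pattern; it does not search for frequent patterns.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items appear in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    # Each `item in it` consumes the iterator up to the match,
    # so later items must be found strictly after earlier ones.
    return all(item in it for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "sd_card"],
    ["laptop", "mouse"],
]
forward = sequence_support(histories, ["camera", "sd_card"])
backward = sequence_support(histories, ["sd_card", "camera"])
```

The same two items yield different support depending on their order, which an unordered itemset count would not reveal.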
Motifs and biological sequence analysis
- Approximate and consecutive motifs
- Similarity-based analysis
- Mining data streams: ordered, time-varying, potentially infinite data streams
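Because a stream is potentially infinite, stream algorithms typically see each item once and keep only a small summary. As one illustration, the sketch below maintains a running mean and variance with Welford's method, which is chosen here as an example and is not named in these notes; the class name and the sensor readings are hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new stream item into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
# Hypothetical sensor readings arriving one at a time.
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(reading)
```

The summary never stores the readings themselves, so memory use does not grow with the length of the stream. Handling time-varying behavior would additionally require some form of windowing or decay, which this sketch omits.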
Structure and Network Analysis
Graph mining
- Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
- e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
- A person could be in multiple information networks: friends, family, classmates, …
- Links carry a lot of semantic information: Link mining
Web mining
- The Web is a big information network: from PageRank to Google
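PageRank is a standard example of extracting information from link structure alone. Below is a minimal power-iteration sketch of the textbook formulation; the four-page graph, the damping factor of 0.85, and the fixed iteration count are assumptions, and the code makes no claim about how any production search engine computes rankings.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {page: [pages it links to]}.

    Assumes every page has at least one outgoing link and that every
    link target is itself a key of `links`.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        # Each page splits the rest of its rank evenly among its out-links.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

Page `c` ends up with the highest rank because three pages link to it, while `d`, which nothing links to, keeps only the baseline share.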
Analysis of Web information networks
- Web community discovery, opinion mining, usage mining, …
Evaluation of Knowledge
Is all mined knowledge interesting?
- One can mine a tremendous amount of “patterns” and knowledge
- Some may fit only certain dimension space (time, location, …)
- Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only interesting knowledge?
- Descriptive vs. predictive
- Coverage
- Typicality vs. novelty
- Accuracy
- Timeliness
What Technologies Are Used?
- Pattern Recognition
- Machine Learning
- Applications
- Algorithms
- Database Technology
- Visualization
- High-Performance Computing
- Statistics
Why Confluence of Multiple Disciplines?
- Tremendous amount of data
- High-dimensionality of data
- High complexity of data
- New and sophisticated applications
What Kind of Applications Are Targeted?
- Web page analysis
- Collaborative analysis & recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis
- Data mining and software engineering