Introduction to Data Mining

culing2941 · Published 2020-09-15 21:49:34

Here you will get an introduction to data mining.

We are back with another post in our Machine Learning series. So far we have covered many interrelated ML topics, and today we start on another interdisciplinary subject: Data Mining, or more appropriately, Knowledge Mining. Data mining not only plays a vital role in ML but also has a foothold across the ever-growing technology domain. Data is considered the backbone of any industry, and the same goes for data mining. Having the right kind of data at the right moment acts like magic for a business and can boost it by providing crucial information and demographics about its consumers.

What is Data Mining?

Let’s make the terminology easier to understand. Think of it this way: when gold is mined from rocks and sand, it is called gold mining, not rock or sand mining, right? In exactly the same way, when we dig deeper we realise that the process would be more accurately called “knowledge mining from data”, a term that is too long for day-to-day use. Therefore the shorter term Data Mining prevails.

Several other terms are used in the industry as alternatives to data mining, such as knowledge extraction, data archaeology, data/pattern analysis and data dredging.

Many people use the terms data mining and knowledge discovery from data (KDD) interchangeably, while others view data mining as just one essential step in the process of knowledge discovery.

Now that we have mentioned knowledge discovery from data (KDD), let us take a quick sneak peek at what KDD is and, more precisely, which steps it involves.

Steps in KDD

1. Data cleaning: Noise, inconsistencies and redundancies are removed from the data.
2. Data integration: Data from multiple sources may be combined in this step.
3. Data selection: Only the data relevant to the analysis are retrieved from the database.
4. Data transformation: The data are converted into forms suitable for mining by applying operations such as aggregation.
5. Data mining: One of the most crucial steps, in which intelligent methods are applied to extract patterns from the data.
6. Pattern evaluation: The interesting and valuable patterns that represent knowledge are identified among the several thousand patterns found.
7. Knowledge presentation: Visualization and knowledge representation techniques are used to present the extracted knowledge to the users.
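To see how the seven steps fit together, here is a minimal Python sketch that runs a toy dataset through each of them. Everything in it is an assumption made for illustration: the two store feeds, the function names, and the choice of "items bought together" as the pattern to mine are ours, not part of any standard KDD implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources: (customer, item) purchase records.
store_a = [("c1", "bread"), ("c1", "milk"), ("c2", "bread"), ("c2", "milk"), ("c2", None)]
store_b = [("c3", "bread"), ("c3", "milk"), ("c3", "milk"), ("c4", "eggs")]

def clean(records):
    """Step 1: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        out.append(rec)
    return out

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [rec for src in sources for rec in src]

def select(records, items_of_interest):
    """Step 3: keep only the records relevant to the analysis."""
    return [rec for rec in records if rec[1] in items_of_interest]

def transform(records):
    """Step 4: aggregate records into one basket (set of items) per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how often each pair of items appears in the same basket."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    """Step 6: keep only the pairs whose support reaches the threshold."""
    return {
        pair: n / n_baskets
        for pair, n in pair_counts.items()
        if n / n_baskets >= min_support
    }

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [
        f"{a} + {b}: bought together in {support:.0%} of baskets"
        for (a, b), support in sorted(patterns.items())
    ]

records = integrate(clean(store_a), clean(store_b))
records = select(records, {"bread", "milk", "eggs"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print(report)
```

With this toy data only one pattern survives the evaluation step: bread and milk appear together in three of the four baskets.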

Summary of the Steps Involved:

Steps 1 to 4 are different forms of data preprocessing, whose results are then used for mining. The data mining step may interact with the user or a knowledge base. The relevant and important patterns are presented to the users and can also be stored as new knowledge in the knowledge base.

The remaining steps (5, 6 and 7) also play a crucial role in the overall mining process. Data mining is primarily the technique of discovering interesting patterns and knowledge from huge amounts of data. The data sources can include databases, data warehouses, the web, other information repositories, or data that are streamed into the system dynamically.

Why is Data Mining required?

In our day-to-day lives we come across data very often, and data mining techniques provide the mechanisms and tools to analyse those data. Data mining is also crucial for extracting knowledge from the available data. Moreover, data mining can be seen as a result of the continuous evolution of information technology.

We already know that we are living in the information age, and that huge volumes of data (terabytes and petabytes) flow into our networks, supporting our business requirements and meeting our data needs. Over the past few decades the volume of data has grown tremendously, driven largely by computerization and by the introduction of advanced, capable tools for collecting data and discovering knowledge from it.

How are data generated?

We can understand this by taking any large business or chain such as McDonald’s, KFC or Wal-Mart, which produces gigantic amounts of data: sales records, transactions, sales promotions, company profiles and so on. Such companies handle millions of transactions per week at hundreds of branches across the globe.

Apart from big stores and companies, engineering and scientific practices also generate petabytes of data continuously. Processes like remote sensing and scientific experiments are primarily responsible for this. Another obvious source of humongous data is the telecommunication network, which produces large datasets on a daily basis.

Web searches, social platforms, web communities and blogs also generate endless data. This abundance of data from various sources, together with the desperate need for tools that uncover useful knowledge from it, motivates the exploration of such domains and has given rise to data mining principles and techniques.

Additionally, data mining can be viewed as a result of the evolution of information technology over time. The functionalities that emerged as the database and data management industry evolved are depicted below with the help of an image.

We hope the image above makes things reasonably clear. If any ambiguity or doubt remains, please let us know in the comment section; we will be happy to help.

Moving on, we will now look at what databases and data warehouses are in the context of data mining, and how they differ from each other.

Databases and Data Warehouses

People often treat these two terms as the same or use them interchangeably, but they are not synonymous. Each has a distinct meaning and must be treated accordingly.

What is a Database?

In layman’s terms, a database is a storage space that we can use to store our data.

Moreover, we can understand a database as a repository used by businesses, in science and engineering, in research, and so on, to store huge amounts of data for day-to-day or future use. A database typically stores current data, such as the transactions occurring at thousands of Wal-Mart outlets around the world, the number of users visiting a blog, or even the demographics of people applying for a passport.

The types of data that can be stored in databases vary widely and cover nearly any domain, from buyer information in a local store to results derived from aerospace research. The data stored in a database is generally regarded as dynamic and is meant to be used for day-to-day transactions.

We think we have discussed databases sufficiently, so let’s quickly jump to our next topic: the data warehouse.

What is a Data Warehouse?

The phrase “A data warehouse refers to a database that is maintained separately from an organization’s operational databases” says it all. More formally, according to William H. Inmon’s widely cited definition, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process. Let’s make this easier to grasp.

Suppose there is a big multinational company (say XYZ) with branches spread all across the globe. Each branch has its own set of databases. Now the chairperson of the company asks for a summary of the company’s sales per item type per branch for some specified period. This is a relatively tough task, because the data are physically spread across different sites all over the world.

If XYZ had a data warehouse, the task above would be quite easy. A data warehouse is a repository of information collected from several sources, stored under a unified schema, and usually residing at a single physical location.
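To make the XYZ example concrete, here is a minimal Python sketch of how branch databases with different layouts could be mapped onto one unified schema and then summarized per branch and item type. The branch names, field names and figures are all made up for illustration; a real warehouse would be far more involved.

```python
from collections import defaultdict

# Hypothetical operational records; each branch uses its own field names.
london_db = [
    {"product_type": "laptop", "amount": 1200.0, "month": "2020-01"},
    {"product_type": "phone", "amount": 800.0, "month": "2020-01"},
]
delhi_db = [
    {"item_type": "laptop", "sale_value": 1000.0, "period": "2020-01"},
    {"item_type": "laptop", "sale_value": 900.0, "period": "2020-02"},
]

def to_unified(branch, rows, type_key, amount_key, month_key):
    """Map one branch's rows onto the warehouse's unified schema."""
    return [
        {
            "branch": branch,
            "item_type": r[type_key],
            "amount": r[amount_key],
            "month": r[month_key],
        }
        for r in rows
    ]

# The "warehouse": all branches under one schema, in one place.
warehouse = (
    to_unified("London", london_db, "product_type", "amount", "month")
    + to_unified("Delhi", delhi_db, "item_type", "sale_value", "period")
)

def sales_summary(rows, months):
    """Total sales per (branch, item type) over the requested months."""
    totals = defaultdict(float)
    for r in rows:
        if r["month"] in months:
            totals[(r["branch"], r["item_type"])] += r["amount"]
    return dict(totals)

summary = sales_summary(warehouse, {"2020-01", "2020-02"})
print(summary)
```

Once the data sit under one schema, the chairperson's question becomes a single pass over one collection instead of a separate query against every branch.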

The structure of a typical data warehouse for the XYZ company is given below for the reader’s reference.

Types of Data Warehouse

Enterprise Data Warehouse
Operational Data Store/Virtual Warehouse
Data Mart

We will cover the different types of data warehouses in future posts if required.

Difference between Database and Data Warehouse

Database System	Data Warehouse
Stores current data	Stores historical data
Data is dynamic	Data is largely static
Used for daily transactions	Used for analysis of data
Application-oriented	Subject-oriented

Why should we have a separate data warehouse?

Operational databases already store huge amounts of data, so you may wonder: “Why can’t we perform online analytical processing directly on such databases, instead of spending additional time and resources on building a separate data warehouse?” A major reason for the separation is to promote the high performance of both systems. Operational databases are tuned for many short transactions, whereas analytical queries scan and aggregate large volumes of data and would slow down operational work if run on the same system.
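The difference between the two kinds of workload can be sketched with Python's built-in sqlite3 module. This is only an illustration of the shape of the queries, using a made-up `sales` table; it does not measure performance, and a real warehouse would of course not be an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, branch TEXT, item TEXT, amount REAL)"
)

# Operational (transaction-style) work: many small writes and point lookups.
conn.executemany(
    "INSERT INTO sales (branch, item, amount) VALUES (?, ?, ?)",
    [("A", "bread", 2.0), ("A", "milk", 1.5), ("B", "bread", 2.5), ("B", "bread", 2.0)],
)
one_sale = conn.execute(
    "SELECT item, amount FROM sales WHERE id = ?", (2,)
).fetchone()

# Analytical (OLAP-style) work: a scan and aggregation over the whole table.
totals = conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch"
).fetchall()
conn.close()

print(one_sale)
print(totals)
```

The lookup touches a single row, while the aggregate has to read every row. On a table with millions of rows, queries of the second kind are the ones that compete with day-to-day transactions.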

Multidimensional Data Model

Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model. This model views data in the form of a data cube (often pictured as a three-dimensional structure).

What is a data cube?

A data cube allows data to be modelled and viewed in multiple dimensions. It is defined by dimensions and facts, where facts are the numeric measures being analysed (for example, sales amounts). A typical structure of a data cube is given below for reference:

The figure above shows how data can be represented in multiple dimensions, rather than in only the two dimensions that most people think of.

Dimensions are, in general, the entities or perspectives with respect to which an organization (say XYZ) wants to keep records. Let us return to the XYZ company we talked about. XYZ may create a sales data warehouse to keep records of its sales with respect to dimensions such as time, item, branch and demand. These dimensions help the company keep track of things like the monthly sales of items and the branches at which the items were sold. Each dimension may also have a table associated with it that describes the dimension further, known as a dimension table. Despite its name, a data cube is not limited to 3D; it can be n-dimensional, depending on the requirement. The best-known schemas for representing multidimensional models are the star, snowflake and fact constellation schemas. We are not going to cover these schemas in this post; they are mentioned here for overview purposes only.
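A data cube can be sketched in a few lines of Python. In the sketch below the dimensions are time, item and branch and the fact is a sales count; the rows, the `DIMENSIONS` tuple and the `roll_up` helper are our own illustrative names, not the API of any OLAP tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, branch) dimensions plus a sales measure.
facts = [
    ("Q1", "laptop", "London", 10),
    ("Q1", "phone", "London", 20),
    ("Q1", "laptop", "Delhi", 5),
    ("Q2", "laptop", "London", 7),
]
DIMENSIONS = ("time", "item", "branch")

def roll_up(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[i] for i in idx)] += row[-1]
    return dict(cells)

cube = roll_up(facts, DIMENSIONS)                # full 3-D cube, one cell per combination
by_time_item = roll_up(facts, ("time", "item"))  # 2-D view with branch summed out
by_item = roll_up(facts, ("item",))              # 1-D view

print(by_time_item)
print(by_item)
```

Adding a fourth dimension would only mean one more column in each fact row and one more name in `DIMENSIONS`, which is the sense in which a cube can be n-dimensional.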

Data warehouse systems use a variety of tools and utilities to populate their data and to refresh them periodically. The functions carried out by such tools include:

Data Extraction: Gathers data from multiple heterogeneous sources.

Data Cleaning: Detects errors in the data and rectifies them whenever possible.

Data Transformation: Converts data from its source format into the format used by the data warehouse.

Load: Performs a series of functions such as sorting, summarization, consolidation, computation of views, integrity checking, and building of indices and partitions.

Refresh: Propagates updates from the data sources to the warehouse.
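The five functions above can be sketched as a tiny Python pipeline. The two feeds, the dictionary used as a "warehouse" and all function names are assumptions made for this example; real warehouse tooling is much richer, and a real refresh would normally apply incremental updates rather than rebuild everything.

```python
# Hypothetical raw feeds from two heterogeneous sources.
csv_feed = ["2020-01,bread,2.50", "2020-01,milk,abc", "2020-02,bread,3.00"]
api_feed = [{"month": "2020-01", "item": "Milk ", "price": 1.25}]

def extract():
    """Extraction: gather rows from every source into one raw list."""
    rows = [dict(zip(("month", "item", "price"), line.split(","))) for line in csv_feed]
    return rows + [dict(r) for r in api_feed]

def clean(rows):
    """Cleaning: rectify what can be fixed, drop what cannot."""
    out = []
    for r in rows:
        try:
            price = float(r["price"])
        except ValueError:
            continue  # unparseable price: cannot be rectified here
        out.append({"month": r["month"], "item": r["item"].strip().lower(), "price": price})
    return out

def transform(rows):
    """Transformation: convert rows into the warehouse's (month, item, price) tuples."""
    return [(r["month"], r["item"], r["price"]) for r in rows]

def load(warehouse, rows):
    """Load: append and sort, then rebuild a summary view and a simple index."""
    warehouse["rows"] = sorted(warehouse["rows"] + rows)
    totals, index = {}, {}
    for pos, (month, item, price) in enumerate(warehouse["rows"]):
        totals[month] = totals.get(month, 0.0) + price
        index.setdefault(item, []).append(pos)
    warehouse["monthly_totals"] = totals
    warehouse["item_index"] = index

def refresh(warehouse):
    """Refresh: rerun the pipeline so that source updates reach the warehouse."""
    warehouse["rows"] = []
    load(warehouse, transform(clean(extract())))

warehouse = {"rows": []}
refresh(warehouse)
print(warehouse["monthly_totals"])
print(warehouse["item_index"])
```

Note how the row with the price "abc" is dropped during cleaning, while "Milk " is rectified to "milk" so that it lands in the same index entry as any other milk record.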

Besides these, data warehouse systems usually also provide a set of tools for managing the data warehouse.

Image Sources: All the images presented in this blog are taken from the book Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei, and belong to their respective owners.

We think this is enough for today. We hope our readers did not face any difficulties while learning. Our only aim is to help our readers get a good grasp of the content presented to them. If any issue regarding data mining persists, or if there is something you can’t quite understand, please let us know in the comments below; we are here to resolve your doubts.