统计学习基础:数据挖掘、推理和预测_统计学习导论|读书笔记02|基础概念

统计学习(2.1) - 基础概念1. 预测推断和基础概念「Sales Example」Suppose we are consultants hired by a client to provide advice on how to improve sales of a particular product it is impossible for our clinet to directly in

weixin_39863017

699人浏览 · 2020-11-23 05:37:04

weixin_39863017 · 2020-11-23 05:37:04 发布

统计学习(2.1) - 基础概念

1. 预测推断和基础概念

「Sales Example」
Suppose we are consultants hired by a client to provide advice on how to improve sales of a particular product

it is impossible for our clinet to directly increase sales of the product
Data analysis can help them to control the advertising expenditure in each of those three media

「Goal」
Develop an accurate model to predict sales on the basis of the three media budget

indirectly increase sales by adjusting advertising budgets

「Model and Variables」TV, radio and newspaper are three features, (or input & predictor)

refer to the input vector collectively as

Sales is a response to predict

is a random error term, independent of

and has

zero

❝ In essence, statistical learning refers to a set of approaches for estimating

❞

1.1 为什么估计

There are two main reasons:

「Prediction」
A set of inputs

are readily available while the output

is needed to predict:

represents our estimate for

, and

represents the resulting prediction for
is treated as a
「black box」, which means if it yields accurate predictions for
=> people not concerned with the exact form(inside) of

【月光宝盒】
The accuracy of
(as prediction for

) depends on two quantities
- Reducible Error: can be reduced by using better statistical learning models
- Irreducible Error: variability associated with
  affects the accuracy of our predictions, which is the error cannot be reduced

The average, or expected value of the squared difference between the predicted and actual value of

「Inference」

need to be interpretable
understand the way that
is affected as

change
cannot be treated as a black box

三个推断需要关注的问题
Q1: Which predictors are associated with the response?

only a few predictors are important

Q2: What is the relationship between the response and each predictor?

some may have a positive relationship and other may have a negative relationship with

Q3: Can the relationship between

and each predictor be adequately summarized using a linear equation or more complicated models?

linear model may not provide an accurate representation of the relationship between the input and ouput variables (e.g. 深度神经网络 = 特殊的非线性回归分析方法)

Different methods for estimating

may be applied depending on whether the goal is

「prediction」, 「inference」 or a combination

「Linear Models」 allow for relatively simple and 「interpretable inference」
the highly non-linear approaches can provide quite 「accurate predicitions」 for
- but this comes at the expense of a less interpretable model for which inference is more challenging

1.2 如何估计

「Parametric Methods」
Two-step model-based approach:

make an assumption about the functional form, or shape, of
. E.g. linear model

after a model is selected, uses the training data to fit or train the model to estimate the parameters

the approach to fitting the model is referred to as (ordinary) least squares

Parametric approach reduces the problem of estimating f down to one of estimating a set of parameters

通过估计一组参数来简化估计函数整体（整个黑盒子）
真实的世间万物的关系可不是线性模型
disadvantage: the model we choose will never exactly match the unknown form of
solution: Choosing 「flexible models」 that can fit many different possible functional forms of
- fitting a more flexible model requires estimating a greater number of parameters
- this can lead to a phenomenon known as 「overfitting」 the data, which means they follow the errors, or noise, to closely

Fit a linear model of the form

Comparing Figure 2.3 to Figure 2.4, the true

has some

curvature that is NOT captured in the linear fit, but linear fit still capture the positive relationship between years of education and income

「Non-parametric Methods」
Do not make explicit assumptions about the functional form of

seek an estimate of
that gets
as close to the data points as possible without being too rough or wiggly
advantage: avoiding the assumption of a particular functional form of
disadvantage: a large number of observations is required to obtain an accurate estimate for
Thin-plate splines attempt to produce an estimate for
that is as close as possible to the observed data
level of smoothness needs to be selected: lower level allows for a rougher fit (zero errors on the training data) but causes overfitting, choosing the correct amount of smoothness is in chapter 5.

1.3 预测准确性&模型解释性的权衡

Tradeoff: some models are less flexible (e.g. linear regression), or more restrictive(i.e. they can produce a relatively small range of shapes to estimate

, like a line or a plane )

Thin plate splines are moreflexible since they can generate a much wider range of possible shapes to estimate
(complicated)

Q: Why would we choose restrictive method instead of a flexible one?

when inference is the goal, linear model is easy to understand the relationship between Y and X
however, flexible approaches like splines and the boosting methods(chapter 8) can lead to such complicated estimates of
that it is difficult to understand how any individual predictor is associated with the response

经验之谈(暂不知道原因，只做摘抄）：

Lasso uses a more restrictive procedure tin estimating the coefficients, and sets a number of them to Zero => more interpretable
GAMs is less interpretable because the relationship between each X and Y is modeled using a non-linear curve
Fully Non-Linear methods such as bagging, boosting and SVM with non-linear kernels are highly flexible but harder to interpret
When Goal is inference, use simple and inflexible method
When Goal is prediction, like predict the price of a stock, seems like to choose a flexible model
- even since our sole requirement is prediction accuracy, interpretability is not a concern
- however, still using a less flexible method to avoid overfitting in highly flexible methods

1.4 监督学习 & 无监督学习

「Supervised Learning」
Fit a model that relates the response for

future observations (prediction)
or better understanding between Y and X (inference)
linear regression, logistic regression, GAM, boosting, SVM...

「Unsupervised Learning」
No associated response

cluster analysis
ascertain on the basis of
, whether the observations fall into relatively distinct groups

「Semi-supervised Learning」
For

observations,

have both

and

have predicture measurements but no response measurement

incorporate the
observations for which response measurements are available as well as the

observations for which they are not

1.5 回归问题 & 分类问题

Quantitative (take on numerical values) response
=> Regression Problem

person's age, height, income, price of a stock

Qualitative (categorical, take on values in one of K classes) response => Classification Problem

person's gender, brand of product
whether a person defaults on a debt (yes or no)

Logistic Regression (chapter 4) is used with a qualitative (two-class, or binary) response

classification method to estimates 「class probabilities」, this can be also thought of as a regression method

KNN(2.2) and Boosting (chapter 8) can be used in the case of either quantitative or qualitative responses

Since any qualitative predictors are properly 「coded」 before the analysis is performed

most methods can be applied regardless of the predictor variable type

TOGO: 2.2. 评估模型精度

永洪数据分析社区

永洪科技，致力于打造全球领先的数据技术厂商，具备从数据应用方案咨询、BI、AIGC智能分析、数字孪生、数据资产、数据治理、数据实施的端到端大数据价值服务能力。

更多推荐

java计算机毕业设计教师工作量统计系统基于SpringBoot的高校教师绩效测算与可视化平台教师教学任务与工作量智能汇总系统

永洪数据分析社区

BI是报表？BI是可视化？BI到底是什么？

永洪数据分析社区

AI智能体+BI可视化：1小时极速入门，市场部独力完成分析报告

商务蓝：适合正式报告活力橙：适合创意活动环保绿：适合可持续发展主题自主分析能力：不再依赖IT部门，活动当天即可产出分析结论智能数据处理：AI智能体自动完成80%的数据清洗和分析工作专业可视化：通过PowerBI制作媲美专业数据分析师的报告持续优化：建立可复用的分析模板，后续活动效率更高决策支持：基于数据快速调整营销策略，提升活动ROI现在就可以尝试部署你的第一个智能分析环境，体验从数据到决策的高速