2026-02-23 12:09:47 +08:00
# Data Analysis: Mining Value from Data

::: tip Core question
**How do you find patterns in data?** Put another way: how do you pull useful information out of a pile of messy numbers, judge whether a business is healthy, or forecast where a trend is heading? Data analysis is about getting from data to insight.
:::

---

## 0. First, a question: have you run into any of these?

**Scenario 1: drowning in data**

```
System logs: 10 GB/day
User events: 1 million records/day
Orders:      100,000 records/day
```

The data piles up, but you don't know where to start analyzing it, let alone what it can tell you.

**Scenario 2: looking only at surface metrics**

```
DAU (daily active users): 100,000 → looks good!
Day-1 retention: 15% → danger!
Day-30 retention: 3% → serious danger!
```

Judged by DAU alone, the product looks like a success. The retention numbers say otherwise: users show up once and leave, so the product has no stickiness.

**Scenario 3: not knowing how to analyze data with SQL**

```
Goal: the average order amount per user
All you can write: SELECT * FROM orders;
Then you do the math by hand in Excel...
```

With basic data analysis skills you move from "looking at data" to "making decisions driven by data".

---

**Good data analysis is like detective work**: you spot patterns in small clues and find the truth in the mess.

---

## 1. What is data?

**Data** is a record about anything. It is everywhere in daily life.

### 1.1 Everyday examples of data

**Your personal data**:
- How many steps you walk each day (recorded by your phone)
- How much you spend each month (Alipay/WeChat bills)
- How many hours you sleep (recorded by a health app)
- Which videos you watch (Bilibili/Douyin history)

**A coffee shop's data**:
- Cups of coffee sold per day
- Cups sold of each kind of coffee
- Amount of each order
- Customer waiting time

**A website's data**:
- Number of visitors per day
- Which buttons users click
- How long users stay
- Where users come from (search engines, social media, and so on)

::: tip 💡 Key takeaway
**Data = recorded information**

Anything that can be recorded, stored, and computed on is data.
:::

---

## 2. What is analysis?

**Analysis** means "breaking down + investigating". Like a detective on a case, you look for patterns in a pile of clues.

### 2.1 The detective analogy

**What a detective does**:
1. Collects clues (fingerprints, footprints, surveillance footage)
2. Finds the connections between the clues
3. Reasons out the truth
4. Catches the culprit

**What data analysis does**:
1. Collects data (user behavior, sales records, logs)
2. Finds the connections in the data
3. Discovers patterns and trends
4. Makes a decision

::: tip 💡 Key takeaway
**Analysis = finding patterns in data**

It is not "looking at data" but "understanding the story behind the data".
:::

### 2.2 Everyday examples of analysis

**Example 1: the coffee keeps selling out**
- **Data**: the latte is sold out by 10 a.m. every day
- **Analysis**: why always by 10 a.m.?
  - Check the sales records → 8 to 10 a.m. is the peak period
  - Tally the sales → lattes make up 60% of total sales
  - Look at the customers → most are office workers
- **Conclusion**: office workers like a latte during the morning rush
- **Action**: stock more latte ingredients, or prepare ahead of the rush

**Example 2: users don't use your app**
- **Data**: 10,000 downloads, but only 500 people open it each day
- **Analysis**: why don't users come back?
  - Check user behavior → 80% of users never return after signing up
  - Review the sign-up flow → it asks for 10 fields
  - Compare with other apps → they ask for only 2 fields
- **Conclusion**: the sign-up flow is too complicated and is a likely reason users are scared off
- **Action**: simplify the sign-up flow

---

## 3. Why analyze data?

### 3.1 A realistic scenario

**Your boss asks**: "How is our user growth?"

**Without data analysis skills, you might say**:
- "Pretty good, I think. It feels like we have more users."
- "Not sure, let me check the admin console."
- "We got 100 new users yesterday."

**With data analysis skills, you would answer like this**:

```
Data for the past 30 days:
- New users: 3,000 (100 per day on average)
- Growth trend: up about 15% month over month (2,600 last month)
- User quality: day-1 retention 45%, day-7 retention 25%
- Sources: search engines 40%, social media 35%, direct visits 25%

Conclusions:
1. User growth is healthy: new users are up month over month
2. Users from social media retain best (55%), so increase spend there
3. Users from search engines retain worst (30%), so the landing page needs work
```

**Which answer is more valuable?** Clearly the second.

::: tip 💡 Key takeaway
**The value of data analysis = better decisions**
- Not "I feel", but "the data shows"
- Not "roughly", but "precisely"
- Not "hindsight", but "forecasting ahead"
:::

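
The month-over-month figure in the sample answer can be reproduced in a few lines. This is a minimal sketch; the helper name `month_over_month_growth` is hypothetical and only the two user counts come from the text.

```python
def month_over_month_growth(current: float, previous: float) -> float:
    """Return the relative change from `previous` to `current` as a fraction."""
    if previous == 0:
        raise ValueError("previous period must be non-zero")
    return (current - previous) / previous

new_users_this_month = 3000   # figure from the sample answer
new_users_last_month = 2600   # figure from the sample answer

growth = month_over_month_growth(new_users_this_month, new_users_last_month)
daily_average = new_users_this_month / 30

print(f"MoM growth: {growth:.1%}")             # MoM growth: 15.4%
print(f"Daily average: {daily_average:.0f}")   # Daily average: 100
```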
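### 3.2 What can data analysis do for you?

| Scenario | Problem | What data analysis can do |
| :--- | :--- | :--- |
| **Running a business** | You don't know which products sell well | Aggregate sales data and find the best sellers |
| **Building a product** | You don't know what users like | Analyze user behavior and improve features |
| **Operations** | You don't know how well the ads perform | Compare conversion rates across channels |
| **Investing** | You don't know which stock to buy | Analyze historical data and estimate trends |
| **Personal life** | You don't know where the money went | Track and analyze spending to find waste |

---
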
## 4. The value of data analysis

**Data analysis** is the process of extracting valuable information from data. It is not just "looking at numbers": it uses statistics, aggregation, visualization, and similar methods to uncover the patterns and trends behind the data.

### 4.1 The medical check-up analogy

| Medical check-up | Data analysis | Notes |
| :--- | :--- | :--- |
| Thermometer | Basic metrics | A single value such as body temperature or DAU |
| Blood test | Descriptive statistics | Mean, median, distribution |
| CT scan | Multidimensional analysis | Looking at the data from different angles |
| Trend chart | Time series analysis | Watching how values change over time |
| Diagnosis report | Data insight | Conclusions and recommendations |

### 4.2 The core value of data analysis

| Value | Meaning | Example |
| :--- | :--- | :--- |
| **Describe the present** | Tells you "what happened" | DAU today is 100,000 and sales are 500,000 yuan |
| **Diagnose problems** | Tells you "why it happened" | Retention is low because the sign-up flow is too long |
| **Predict trends** | Tells you "what may happen" | Based on the past 30 days, DAU is projected to grow 10% next month |
| **Guide decisions** | Tells you "what to do" | An A/B test shows the new button lifts conversion by 20% |

---

## 5. Descriptive statistics: distilling information from data

**Descriptive statistics** summarize a large amount of data with a few numbers.

Suppose you want to describe the heights of your classmates to a friend. What would you say?
- ❌ "Zhang San is 170 cm, Li Si is 175 cm, Wang Wu is 168 cm..." (ten minutes later you are still going)
- ✅ "The average height in our class is 172 cm" (one sentence does it)

That is the job of descriptive statistics: **turning complex data into simple metrics**.

### 5.1 Mean: the "average" of the data

#### Scenario: computing an average grade

You have grades for 5 courses: 80, 85, 90, 75, 95

**Steps**:
```
Step 1: add up all the grades
80 + 85 + 90 + 75 + 95 = 425

Step 2: count the courses
5 courses in total

Step 3: divide the sum by the count
425 ÷ 5 = 85

So the average grade is 85.
```

#### As a formula

**Mean = (sum of all values) ÷ (number of values)**

**Notation**:
- The mean is written x̄ (read "x bar")
- The data points are x₁, x₂, x₃, ...
- The number of data points is n

**Formula**: x̄ = (x₁ + x₂ + x₃ + ... + xₙ) ÷ n

**With the grades example**:
```
x̄ = (80 + 85 + 90 + 75 + 95) ÷ 5
  = 425 ÷ 5
  = 85
```

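
The same calculation can be written in Python. This is a minimal sketch using the five grades from the example; it computes the mean by hand and then checks it against the standard library's `statistics.mean`.

```python
from statistics import mean

grades = [80, 85, 90, 75, 95]  # the five grades from the example

# Manual computation: sum of all values divided by the number of values.
manual_mean = sum(grades) / len(grades)

# The standard library should give the same result.
library_mean = mean(grades)

print(manual_mean)                  # 85.0
print(manual_mean == library_mean)  # True
```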
#### When should you use the mean?

**The mean works well when**:
- The data is spread fairly evenly
- You want the "overall level"
- There are no extreme outliers

**Examples**:
- ✅ Average grade of a class (grades usually fall between 60 and 100)
- ✅ Average daily sales of a shop (day-to-day differences are modest)
- ✅ Average age of users (most users are close in age)

#### When should you not use the mean?

**Problem 1: extreme values drag the mean**

**Scenario: a salary survey**

A company has 5 people with these salaries:
```
Employee A: 3000 yuan
Employee B: 4000 yuan
Employee C: 5000 yuan
Employee D: 6000 yuan
Boss:       100000 yuan
```

**Computing the mean**:
```
(3000 + 4000 + 5000 + 6000 + 100000) ÷ 5
= 118000 ÷ 5
= 23600 yuan
```

**Problem**: the mean says "the average salary is 23,600 yuan", yet none of the 4 employees earns more than 6,000 yuan. The boss's salary pulls the mean up.

**In this case, use the median (covered below).**

---

**Problem 2: unevenly distributed data**

**Scenario: e-commerce order amounts**

Today's orders on an e-commerce platform:
```
9.9 yuan  × 1000 orders = 9900 yuan
99 yuan   × 100 orders  = 9900 yuan
999 yuan  × 10 orders   = 9990 yuan
9999 yuan × 1 order     = 9999 yuan
```

**Orders**: 1111

**Total amount**: 39789 yuan

**Mean**: 39789 ÷ 1111 ≈ 35.8 yuan

**Problem**: the mean says "the average order is 35.8 yuan", yet most orders (1000 of 1111) are 9.9 yuan.

**In this case, use the mode (covered below).**

::: tip 💡 Practical advice
- **DAU, GMV, and similar metrics**: the mean is usually fine (with a lot of data, a few extreme values have little effect)
- **Income, housing prices, and similar**: the median is more representative (it is not skewed by extreme values)
- **Best-selling products and similar**: use the mode (the most typical case)
:::

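
To see how far the three measures drift apart on the order data above, the table can be expanded into one amount per order. A minimal sketch using the standard library; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Expand the order table into one amount per order.
order_amounts = [9.9] * 1000 + [99.0] * 100 + [999.0] * 10 + [9999.0]

avg = mean(order_amounts)
mid = median(order_amounts)
most_common = mode(order_amounts)

print(f"orders: {len(order_amounts)}")   # orders: 1111
print(f"mean:   {avg:.1f}")              # mean:   35.8
print(f"median: {mid}")                  # median: 9.9
print(f"mode:   {most_common}")          # mode:   9.9
```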
---

### 5.2 Median: the "middle" value after sorting

#### What is the median?

The **median** is the value in the middle position once the data is sorted from smallest to largest.

#### Scenario: the median salary

**Data: salaries of 5 people at a company**
```
3000, 4000, 5000, 6000, 100000
```

**Steps**:
```
Step 1: sort (ascending)
3000, 4000, 5000, 6000, 100000 ✓ (already sorted)

Step 2: find the middle position
5 values in total, so the middle is the 3rd

Step 3: take the middle value
Median = 5000 yuan
```

**Compared with the mean**:
- Median = 5000 yuan (a better picture of an ordinary employee's salary)
- Mean = 23600 yuan (pulled up by the boss's salary)

#### What if the number of values is even?

**Data: salaries of 6 people**
```
3000, 4000, 5000, 6000, 7000, 100000
```

**Steps**:
```
Step 1: sort
3000, 4000, 5000, 6000, 7000, 100000

Step 2: find the middle position
6 values in total, so the middle falls between the 3rd and the 4th

Step 3: average the two middle values
(5000 + 6000) ÷ 2 = 5500

Median = 5500 yuan
```

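
The odd and even cases follow the same recipe, which is easy to express in code. The function below is a sketch (the name `median_by_hand` is hypothetical) that mirrors the three steps and is checked against `statistics.median`.

```python
from statistics import median

def median_by_hand(values):
    """Sort, then take the middle value (odd n) or average the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

five_salaries = [3000, 4000, 5000, 6000, 100000]
six_salaries = [3000, 4000, 5000, 6000, 7000, 100000]

print(median_by_hand(five_salaries))  # 5000
print(median_by_hand(six_salaries))   # 5500.0
print(median(five_salaries) == median_by_hand(five_salaries))  # True
print(median(six_salaries) == median_by_hand(six_salaries))    # True
```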
#### When should you use the median?

**The median works well when**:
- The data has extreme values (salaries, housing prices)
- You want the "typical case"
- The data is unevenly distributed

**Examples**:
- ✅ Income surveys (not skewed by billionaires)
- ✅ Housing price statistics (not skewed by mansions)
- ✅ Order amount analysis (not skewed by very large orders)

---

### 5.3 Mode: the value that appears most often

#### What is the mode?

The **mode** is the value that occurs most frequently in the data.

#### Scenario: finding the best-selling product

**Data: today's orders at a coffee shop**
```
Latte      × 15 cups
Americano  × 8 cups
Cappuccino × 5 cups
Mocha      × 3 cups
Macchiato  × 2 cups
```

**Steps**:
```
Step 1: count how many times each coffee appears
Latte: 15
Americano: 8
Cappuccino: 5
Mocha: 3
Macchiato: 2

Step 2: find the most frequent one
Latte appears 15 times, the most

Mode = Latte
```

**Conclusion**: the latte is the most popular coffee.

#### Special case: there can be more than one mode

**Data: classmates' shoe sizes**
```
Size 37 × 2 people
Size 38 × 5 people
Size 39 × 5 people
Size 40 × 3 people
Size 41 × 1 person
```

**Modes**: 38 and 39 (each appears 5 times)

**Conclusion**: this class has two dominant shoe sizes.

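
Both mode examples can be checked with the standard library. In this sketch, `collections.Counter` handles the categorical coffee data and `statistics.multimode` (Python 3.8+) returns every value tied for the top count.

```python
from collections import Counter
from statistics import multimode

# One entry per cup sold / per classmate, built from the counts in the examples.
coffee_orders = (["latte"] * 15 + ["americano"] * 8 + ["cappuccino"] * 5
                 + ["mocha"] * 3 + ["macchiato"] * 2)
shoe_sizes = [37] * 2 + [38] * 5 + [39] * 5 + [40] * 3 + [41] * 1

# Counter tallies occurrences; most_common(1) returns the top (value, count) pair.
top_coffee, top_count = Counter(coffee_orders).most_common(1)[0]
print(top_coffee, top_count)   # latte 15

# multimode returns every value tied for the highest count.
print(multimode(shoe_sizes))   # [38, 39]
```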
#### When should you use the mode?

**The mode works well when**:
- The data is categorical (not numeric)
- You want the "most popular" option
- The distribution has more than one peak

**Examples**:
- ✅ Best-selling product (iPhone, bubble tea)
- ✅ Most-used feature (like, comment)
- ✅ Most popular search term (what users search for most)

---

### 5.4 Central tendency: where is the "center" of the data?

You now know three metrics. To summarize:

| Metric | Definition | Best used when | Example |
| :--- | :--- | :--- | :--- |
| **Mean** | The average of all values | The data is evenly distributed | Average user age: 28 |
| **Median** | The middle value after sorting | There are extreme values | Median income: 5000 yuan (not skewed by billionaires) |
| **Mode** | The most frequent value | The data is categorical | Most purchased product: iPhone |

#### Why do we need three metrics?

**Scenario 1: evenly spread data (mean and median agree)**
```
Data: [1, 2, 3, 4, 5]

Mean = (1 + 2 + 3 + 4 + 5) ÷ 5 = 3
Median = middle value after sorting = 3
Mode = none, because no value repeats

→ The data is evenly spread, so the mean and the median coincide
```

**Scenario 2: an extreme value (the median is more representative)**
```
Data: [1, 2, 3, 4, 100]

Mean = (1 + 2 + 3 + 4 + 100) ÷ 5 = 22
Median = middle value after sorting = 3
Mode = none, because no value repeats

→ The extreme value (100) pulls the mean up; the median (3) is more representative
```

**Scenario 3: e-commerce orders (the mode is the most typical)**
```
Data: [9.9, 9.9, 9.9, 999, 9999]

Mean = (9.9 + 9.9 + 9.9 + 999 + 9999) ÷ 5 = 2205.54
Median = middle value after sorting = 9.9
Mode = 9.9 (appears 3 times)

→ Most users buy the 9.9 yuan item, so the mode (9.9) is the most typical
```

---

### 5.5 Dispersion: is the data "spread out" or "concentrated"?

#### Why measure dispersion?

**Scenario: two classes both average 80 points**

**Class A**: [78, 79, 80, 81, 82]
- Mean = 80
- Standard deviation = 1.41 (small)
- **Reading**: the grades are tightly clustered; everyone is at about the same level

**Class B**: [50, 75, 80, 95, 100]
- Mean = 80
- Standard deviation = 17.61 (large)
- **Reading**: the grades are widely spread; some are very good, some very poor

**Conclusion**: the two classes have the same average, but Class A is more "consistent" while Class B shows large differences.

#### Range: the simplest measure

**Range = maximum - minimum**

**Example: exam grades**
```
Grades: [60, 75, 80, 85, 95]

Step 1: find the maximum
Maximum = 95

Step 2: find the minimum
Minimum = 60

Step 3: compute the range
Range = 95 - 60 = 35

So the range of the grades is 35 points.
```

**Pros**: easy to compute

**Cons**: looks only at the maximum and minimum, so extreme values distort it easily

---

#### Variance: how far each value deviates from the mean

**What is variance?**

Variance measures how far each value is from the mean: it is the average of the squared differences.

**Steps**:

**Example: data [2, 4, 6, 8, 10]**

```
Step 1: compute the mean
(2 + 4 + 6 + 8 + 10) ÷ 5 = 6

Step 2: subtract the mean from each value
2 - 6 = -4
4 - 6 = -2
6 - 6 = 0
8 - 6 = 2
10 - 6 = 4

Step 3: square the differences (this removes the signs)
(-4)² = 16
(-2)² = 4
0² = 0
2² = 4
4² = 16

Step 4: average the squares
(16 + 4 + 0 + 4 + 16) ÷ 5 = 8

So the variance = 8
```

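
The four steps translate directly into code. One caveat worth flagging: the walkthrough divides by n (the population variance), while many tools default to the sample variance, which divides by n - 1. The sketch below shows both using Python's `statistics` module.

```python
from statistics import pvariance, variance

data = [2, 4, 6, 8, 10]

mean_value = sum(data) / len(data)               # step 1: mean
deviations = [x - mean_value for x in data]      # step 2: differences from the mean
squared = [d ** 2 for d in deviations]           # step 3: squares
population_variance = sum(squared) / len(data)   # step 4: divide by n

print(population_variance)                       # 8.0
print(pvariance(data) == population_variance)    # True (pvariance divides by n)
print(f"{variance(data):.1f}")                   # 10.0 (variance divides by n - 1)
```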
**Why square?**
- The differences are both positive and negative (-4, 4), so adding them directly cancels them out
- After squaring, every term is non-negative and they can be summed

**Problem**: variance is in "squared" units, which is hard to interpret.
- If the raw data is in yuan, the variance is in yuan²
- If the raw data is in years, the variance is in years²

**Solution**: use the standard deviation!

---

#### Standard deviation: a more intuitive measure of dispersion

**Standard deviation = square root of the variance**

**Example: the variance of 8 from above**

```
Standard deviation = √8 ≈ 2.83
```

**Advantages**:
- Same unit as the raw data (yuan, years, points, and so on)
- Easier to interpret

**How to read a standard deviation**

**Rule of thumb (for normally distributed data)**:
- **About 68% of the data** lies within [mean - 1 standard deviation, mean + 1 standard deviation]
- **About 95% of the data** lies within [mean - 2 standard deviations, mean + 2 standard deviations]

**Example: user age**
```
Mean = 28 years
Standard deviation = 5 years

Reading (if ages are roughly normally distributed):
- About 68% of users are between 23 and 33 (28 ± 5)
- About 95% of users are between 18 and 38 (28 ± 10)
```

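
The 68% and 95% figures are rounded properties of the normal distribution, and they can be checked with `statistics.NormalDist` (Python 3.8+). This sketch assumes user ages are exactly normal with the mean and standard deviation from the example, which real data only approximates; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

ages = NormalDist(mu=28, sigma=5)  # mean and standard deviation from the example

def share_within(dist: NormalDist, k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

print(f"within 1 sd (23 to 33): {share_within(ages, 1):.1%}")  # within 1 sd (23 to 33): 68.3%
print(f"within 2 sd (18 to 38): {share_within(ages, 2):.1%}")  # within 2 sd (18 to 38): 95.4%
```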
#### Where the standard deviation is useful

**Scenario 1: judging whether user behavior is consistent**

**Product A**:
- Average daily usage: 30 minutes
- Standard deviation: 2 minutes (small)
- **Reading**: users behave alike and the product experience is stable

**Product B**:
- Average daily usage: 30 minutes
- Standard deviation: 20 minutes (large)
- **Reading**: users behave very differently; segmenting them may be worthwhile

**Scenario 2: spotting outliers**

**Data**: user logins per day
```
Mean = 10 logins/day
Standard deviation = 2 logins/day

Normal range: [10 - 2×2, 10 + 2×2] = [6, 14]

A user who logs in 50 times/day → outlier!
(possibly inflating the numbers, or a bot)
```

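
The outlier check above amounts to a band around the mean. Below is a minimal sketch: the function name `find_outliers` and the sample login counts are hypothetical, and the mean and standard deviation are treated as a known baseline rather than recomputed from data that already contains the outlier (which would inflate the standard deviation).

```python
def find_outliers(values, mean, std, k=2.0):
    """Return the values that lie more than k standard deviations from the mean."""
    low, high = mean - k * std, mean + k * std
    return [v for v in values if v < low or v > high]

# Baseline from the example: mean 10 logins/day, standard deviation 2.
BASELINE_MEAN = 10
BASELINE_STD = 2

# Hypothetical daily login counts for a handful of users.
logins_today = [9, 11, 8, 12, 10, 50, 13, 7]

print(find_outliers(logins_today, BASELINE_MEAN, BASELINE_STD))        # [50]
print(find_outliers(logins_today, BASELINE_MEAN, BASELINE_STD, k=3))   # [50]
```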
::: tip 💡 Practical advice
- **Small standard deviation**: users behave alike and the product experience is stable
- **Large standard deviation**: the user base is diverse; segmenting it may be worthwhile
- **More than 3 standard deviations from the mean**: usually an outlier that deserves a check (the 2-standard-deviation band in the login example is a stricter screen)
:::

---

### 5.6 Dispersion metrics at a glance

| Metric | Definition | Notes | Computation effort |
| :--- | :--- | :--- | :--- |
| **Range** | Maximum - minimum | Simplest, but sensitive to extreme values | ⭐ |
| **Variance** | Average of the squared differences from the mean | The larger the value, the more spread out the data | ⭐⭐⭐ |
| **Standard deviation** | Square root of the variance | Same unit as the raw data, more intuitive | ⭐⭐⭐ |

---

### 5.7 Interactive demo

👇 **Try it yourself**: enter a set of numbers below and the statistics update in real time:

<DataAnalysisDemo />

---

## 6. Data aggregation: from detail rows to insight

**Data aggregation** rolls up "detail data" (individual rows) into "summary data" (totals, averages, and so on).

### 6.1 Why aggregate?

**Scenario 1: from order details to total sales**

**Detail data (one row per order)**:
```
Order 1: user A, 2024-01-01, 100 yuan
Order 2: user B, 2024-01-01, 150 yuan
Order 3: user A, 2024-01-02, 200 yuan
Order 4: user C, 2024-01-02, 180 yuan
```

**After aggregation (total sales)**:
```
Total sales = 100 + 150 + 200 + 180 = 630 yuan
```

**Scenario 2: from user events to active users**

**Detail data (one row per event record)**:
```
User A clicked 5 times on 2024-01-01
User B clicked 3 times on 2024-01-01
User A clicked 2 times on 2024-01-02
User C clicked 4 times on 2024-01-02
```

**After aggregation (daily active users)**:
```
2024-01-01: 2 active users (A and B)
2024-01-02: 2 active users (A and C)
```

::: tip 💡 Key takeaway
**Aggregation = moving from "individual rows" to "the whole picture"**
- **Detail data**: individual records (each order, each click)
- **Aggregated data**: summary results (total sales, active user count)
:::

---

### 6.2 Common aggregate operations

#### COUNT: how many are there?

**Scenario: counting all orders**

**Detail data**:
```
Order 1: user A, 100 yuan
Order 2: user B, 150 yuan
Order 3: user C, 200 yuan
```

**After aggregation**:
```
Total orders = 3
```

**SQL**:
```sql
SELECT COUNT(*) as total_orders
FROM orders;

-- Result:
-- | total_orders |
-- | :--- |
-- | 3 |
```

---

#### SUM: computing a total

**Scenario: total sales**

**Detail data**:
```
Order 1: 100 yuan
Order 2: 150 yuan
Order 3: 200 yuan
```

**After aggregation**:
```
Total sales = 100 + 150 + 200 = 450 yuan
```

**SQL**:
```sql
SELECT SUM(amount) as total_sales
FROM orders;

-- Result:
-- | total_sales |
-- | :--- |
-- | 450 |
```

**With comments**:
```sql
SELECT
  SUM(amount) as total_sales  -- add up the amount of every order
FROM orders;                  -- from the orders table
```

---

#### AVG: computing an average

**Scenario: average order amount**

**Detail data**:
```
Order 1: 100 yuan
Order 2: 150 yuan
Order 3: 200 yuan
```

**After aggregation**:
```
Average order amount = (100 + 150 + 200) ÷ 3 = 150 yuan
```

**SQL**:
```sql
SELECT AVG(amount) as avg_order_amount
FROM orders;

-- Result:
-- | avg_order_amount |
-- | :--- |
-- | 150 |
```

---

#### MAX: finding the largest

**Scenario: the largest single order**

**Detail data**:
```
Order 1: 100 yuan
Order 2: 150 yuan
Order 3: 200 yuan
```

**After aggregation**:
```
Largest order = 200 yuan
```

**SQL**:
```sql
SELECT MAX(amount) as max_order_amount
FROM orders;

-- Result:
-- | max_order_amount |
-- | :--- |
-- | 200 |
```

---

#### MIN: finding the smallest

**Scenario: the smallest single order**

**Detail data**:
```
Order 1: 100 yuan
Order 2: 150 yuan
Order 3: 200 yuan
```

**After aggregation**:
```
Smallest order = 100 yuan
```

**SQL**:
```sql
SELECT MIN(amount) as min_order_amount
FROM orders;

-- Result:
-- | min_order_amount |
-- | :--- |
-- | 100 |
```

---

### 6.3 Aggregate operations at a glance

| Operation | SQL function | What it does | Example | Everyday analogy |
| :--- | :--- | :--- | :--- | :--- |
| **Count** | COUNT(*) | Counts rows | Total orders | Counting how many apples there are |
| **Sum** | SUM(amount) | Adds up values | Total sales | Adding up the weight of all the apples |
| **Average** | AVG(amount) | Computes the mean | Average order amount | Computing the average weight of an apple |
| **Maximum** | MAX(amount) | Finds the largest value | Largest single order | Finding the heaviest apple |
| **Minimum** | MIN(amount) | Finds the smallest value | Smallest single order | Finding the lightest apple |

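
All five aggregates can be tried without installing a database server. The sketch below loads the three example orders into an in-memory SQLite database through Python's built-in `sqlite3` module; the table layout (an `order_id` and a `REAL` `amount` column) is an assumption made for illustration.

```python
import sqlite3

# In-memory database holding the three example orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
    [(1, 100), (2, 150), (3, 200)],
)

row = conn.execute(
    """
    SELECT COUNT(*), SUM(amount), AVG(amount), MAX(amount), MIN(amount)
    FROM orders
    """
).fetchone()
conn.close()

total_orders, total_sales, avg_amount, max_amount, min_amount = row
print(total_orders, total_sales, avg_amount, max_amount, min_amount)
# 3 450.0 150.0 200.0 100.0
```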
---

### 6.4 Grouped aggregation (GROUP BY): statistics per category

#### What is GROUP BY?

**GROUP BY** splits the data into groups along some dimension and then computes statistics for each group.

**Everyday analogy**:
- You have a pile of fruit (apples, bananas, oranges)
- You want to know how many of each kind there are
- You first sort them into groups (one pile of apples, one of bananas, one of oranges)
- Then you count each pile

That is the idea behind GROUP BY.

#### Scenario 1: order count and total spend per user

**Detail data (orders table)**:

| order_id | user_id | amount |
| :--- | :--- | :--- |
| 1 | U001 | 100 |
| 2 | U002 | 150 |
| 3 | U001 | 200 |
| 4 | U003 | 180 |
| 5 | U002 | 120 |

**Question**: what are the order count and total spend for each user?

**SQL**:
```sql
SELECT
  user_id,                      -- the user ID
  COUNT(*) as order_count,      -- number of orders
  SUM(amount) as total_amount   -- total spend
FROM orders                     -- from the orders table
GROUP BY user_id;               -- group by user ID
```

**How it executes**:

```
Step 1: group by user_id

Group 1 (U001):
  Order 1: U001, 100
  Order 3: U001, 200

Group 2 (U002):
  Order 2: U002, 150
  Order 5: U002, 120

Group 3 (U003):
  Order 4: U003, 180

Step 2: aggregate each group

Group 1 (U001):
  order_count = 2 (2 orders)
  total_amount = 100 + 200 = 300

Group 2 (U002):
  order_count = 2 (2 orders)
  total_amount = 150 + 120 = 270

Group 3 (U003):
  order_count = 1 (1 order)
  total_amount = 180
```

**After aggregation (result)**:

| user_id | order_count | total_amount |
| :--- | :--- | :--- |
| U001 | 2 | 300 |
| U002 | 2 | 270 |
| U003 | 1 | 180 |

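
The per-user query can be run end to end against an in-memory SQLite database. This is a sketch with the five example rows; the column types are assumptions, and an `ORDER BY` is added only so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "U001", 100), (2, "U002", 150), (3, "U001", 200),
     (4, "U003", 180), (5, "U002", 120)],
)

per_user = conn.execute(
    """
    SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY user_id
    ORDER BY user_id   -- added so the row order is deterministic
    """
).fetchall()
conn.close()

for user_id, order_count, total_amount in per_user:
    print(user_id, order_count, total_amount)
# U001 2 300
# U002 2 270
# U003 1 180
```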
---

#### Scenario 2: total sales per product

**Detail data (order_items table)**:

| order_id | product_name | price | quantity |
| :--- | :--- | :--- | :--- |
| 1 | iPhone | 5000 | 1 |
| 1 | Phone case | 50 | 2 |
| 2 | iPad | 3000 | 1 |
| 3 | iPhone | 5000 | 2 |
| 3 | AirPods | 1000 | 1 |

**Question**: what are the total sales for each product?

**SQL**:
```sql
SELECT
  product_name,                         -- the product name
  SUM(price * quantity) as total_sales  -- total sales
FROM order_items                        -- from the order items table
GROUP BY product_name;                  -- group by product name
```

**With comments**:
```sql
SELECT
  product_name,                         -- product name
  SUM(price * quantity) as total_sales  -- line total = unit price × quantity, then summed
FROM order_items
GROUP BY product_name;                  -- group by product
```

**After aggregation (result)**:

| product_name | total_sales |
| :--- | :--- |
| iPhone | 15000 (5000×1 + 5000×2) |
| Phone case | 100 (50×2) |
| iPad | 3000 (3000×1) |
| AirPods | 1000 (1000×1) |

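
The expression inside `SUM()` is evaluated per row before the rows are added up, which the sketch below confirms on the example data. It uses an in-memory SQLite database; the column types and the English product name "Phone case" are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_items "
    "(order_id INTEGER, product_name TEXT, price INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?, ?)",
    [(1, "iPhone", 5000, 1), (1, "Phone case", 50, 2), (2, "iPad", 3000, 1),
     (3, "iPhone", 5000, 2), (3, "AirPods", 1000, 1)],
)

# Each result row is a (product_name, total_sales) pair, so dict() builds a lookup.
sales = dict(conn.execute(
    """
    SELECT product_name, SUM(price * quantity) AS total_sales
    FROM order_items
    GROUP BY product_name
    """
).fetchall())
conn.close()

print(sales["iPhone"], sales["Phone case"], sales["iPad"], sales["AirPods"])
# 15000 100 3000 1000
```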
---

### 6.5 Multidimensional aggregation: statistics per combination of categories

#### Scenario: spend per user per day

**Detail data (orders table)**:

| order_id | user_id | date | amount |
| :--- | :--- | :--- | :--- |
| 1 | U001 | 2024-01-01 | 100 |
| 2 | U002 | 2024-01-01 | 150 |
| 3 | U001 | 2024-01-02 | 200 |
| 4 | U002 | 2024-01-02 | 120 |
| 5 | U001 | 2024-01-02 | 180 |

**Question**: how much did each user spend on each day?

**SQL**:
```sql
SELECT
  user_id,                      -- user ID
  date,                         -- date
  SUM(amount) as daily_amount   -- total spend for the day
FROM orders
GROUP BY user_id, date;         -- group by user and date
```

**How it executes**:

```
Step 1: group by user_id and date

Group 1 (U001, 2024-01-01):
  Order 1: U001, 2024-01-01, 100

Group 2 (U002, 2024-01-01):
  Order 2: U002, 2024-01-01, 150

Group 3 (U001, 2024-01-02):
  Order 3: U001, 2024-01-02, 200
  Order 5: U001, 2024-01-02, 180

Group 4 (U002, 2024-01-02):
  Order 4: U002, 2024-01-02, 120

Step 2: aggregate each group

Group 1 (U001, 2024-01-01):
  daily_amount = 100

Group 2 (U002, 2024-01-01):
  daily_amount = 150

Group 3 (U001, 2024-01-02):
  daily_amount = 200 + 180 = 380

Group 4 (U002, 2024-01-02):
  daily_amount = 120
```

**After aggregation (result)**:

| user_id | date | daily_amount |
| :--- | :--- | :--- |
| U001 | 2024-01-01 | 100 |
| U001 | 2024-01-02 | 380 |
| U002 | 2024-01-01 | 150 |
| U002 | 2024-01-02 | 120 |

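
What the database does for a two-column GROUP BY can be imitated in plain Python, which makes the grouping step explicit. This is only a sketch of the idea, not how a database engine is implemented: the `(user_id, date)` tuple acts as the group key and a running total per key acts as the aggregate.

```python
from collections import defaultdict

orders = [
    (1, "U001", "2024-01-01", 100),
    (2, "U002", "2024-01-01", 150),
    (3, "U001", "2024-01-02", 200),
    (4, "U002", "2024-01-02", 120),
    (5, "U001", "2024-01-02", 180),
]

# Grouping and aggregation in one pass: the (user_id, date) tuple is the group
# key, and each group's running total is the aggregate.
daily_amount = defaultdict(int)
for _order_id, user_id, date, amount in orders:
    daily_amount[(user_id, date)] += amount

for (user_id, date), total in sorted(daily_amount.items()):
    print(user_id, date, total)
# U001 2024-01-01 100
# U001 2024-01-02 380
# U002 2024-01-01 150
# U002 2024-01-02 120
```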
::: tip 💡 The core idea of GROUP BY
Split the detail data into groups along some dimension, then compute statistics for each group.
- **Dimension**: the angle you want to analyze from (user, product, date, and so on)
- **Metric**: the number you want to compute (order count, sales, and so on)
:::

---

### 6.6 Common mistake: every plain column in SELECT must appear in GROUP BY

#### Incorrect example

```sql
-- ❌ Wrong: user_id is selected next to an aggregate, but there is no GROUP BY
SELECT user_id, SUM(amount) as total_amount
FROM orders;
```

**Why is this an error?**

You ask for a user_id on every row but never group by user_id. SUM(amount) collapses the whole table into a single row, so the database cannot tell which user_id to show next to it. Strict databases such as PostgreSQL reject the query; a few, such as SQLite, accept it and return a user_id from an arbitrary row, which is just as wrong.

**Correct version**:

```sql
-- ✅ Correct: every non-aggregated column appears in GROUP BY
SELECT
  user_id,                     -- non-aggregated column: must be in GROUP BY
  SUM(amount) as total_amount  -- aggregate: does not go in GROUP BY
FROM orders
GROUP BY user_id;              -- group by user_id
```

#### Rule to remember

**Every item in SELECT falls into one of two cases**:
1. **Aggregate function**: COUNT(), SUM(), AVG(), MAX(), MIN() → does not go in GROUP BY
2. **Plain column**: must appear in GROUP BY

**Example**:

```sql
-- ✅ Correct
SELECT
  user_id,      -- plain column → in GROUP BY
  date,         -- plain column → in GROUP BY
  SUM(amount)   -- aggregate → not in GROUP BY
FROM orders
GROUP BY user_id, date;  -- every plain column is listed here
```

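
How a missing GROUP BY behaves depends on the database. The sketch below uses SQLite (through Python's `sqlite3`), which is one of the permissive engines: instead of raising an error it silently returns a single row. The table definition and rows are the illustrative orders data used earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "U001", 100), (2, "U002", 150), (3, "U001", 200),
     (4, "U003", 180), (5, "U002", 120)],
)

# Missing GROUP BY: SQLite does not raise, it returns a single row.
bad_rows = conn.execute("SELECT user_id, SUM(amount) FROM orders").fetchall()

# With GROUP BY: one row per user.
good_rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(len(bad_rows), bad_rows[0][1])  # 1 750  (the user_id in that row is arbitrary)
print(good_rows)  # [('U001', 300), ('U002', 270), ('U003', 180)]
```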
---

## 7. Visualization basics: letting the data speak

A good chart lets people see the pattern in the data at a glance.

### 7.1 Common chart types

| Chart type | Purpose | Examples |
| :--- | :--- | :--- |
| **Line chart** | Show a trend | DAU over time, sales growth |
| **Bar chart** | Compare values | Users per channel, sales per category |
| **Pie chart** | Show proportions | User source breakdown, share of each product category |
| **Scatter plot** | Explore relationships | Ad spend vs. sales |

### 7.2 Choosing a chart

| What you want to show | Chart to use |
| :--- | :--- |
| **Change over time** | Line chart |
| **Comparison between categories** | Bar chart |
| **Share of a whole** | Pie chart |
| **Relationship between two variables** | Scatter plot |
| **Distributions of several variables** | Box plot |

::: tip 💡 Visualization principles
1. **Keep it simple**: drop unnecessary decoration (3D effects, gradients, and so on)
2. **Highlight what matters**: use color and size to emphasize the key data
3. **Label clearly**: title, axes, and legend should all be clear
4. **Avoid misleading**: start the Y axis of a bar chart at 0, and do not truncate an axis without saying so
:::

---

## 8. Data cleaning: garbage in, garbage out

**"Garbage In, Garbage Out"**: if the data quality is poor, the analysis cannot be trusted.

### 8.1 Common data problems

| Problem | Example | Impact |
| :--- | :--- | :--- |
| **Missing values** | The age column is NULL | Biased statistics |
| **Duplicates** | The same order appears twice | Double counting |
| **Outliers** | Age = 200 | The mean gets dragged |
| **Inconsistent formats** | Dates as 2024-01-01 and 01/01/2024 | Sorting breaks |

### 8.2 Data cleaning steps

| Step | Action | SQL example |
| :--- | :--- | :--- |
| **1. Deduplicate** | Remove duplicate records | `SELECT DISTINCT * FROM orders;` |
| **2. Handle missing values** | Fill or drop | `WHERE age IS NOT NULL;` |
| **3. Handle outliers** | Filter or correct | `WHERE age BETWEEN 0 AND 120;` |
| **4. Standardize formats** | Unify the date format (`TO_DATE` is PostgreSQL/Oracle syntax) | `TO_DATE(date_str, 'YYYY-MM-DD');` |

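
The four steps can also be chained in application code. The sketch below is one possible arrangement in plain Python; the sample rows, the `clean` and `parse_date` helpers, and the assumption that slash-separated dates are MM/DD/YYYY are all hypothetical.

```python
from datetime import datetime

# Hypothetical raw rows: (order_id, age, order_date)
raw_rows = [
    (1, 25, "2024-01-01"),
    (1, 25, "2024-01-01"),    # exact duplicate
    (2, None, "2024-01-02"),  # missing age
    (3, 200, "2024-01-03"),   # impossible age
    (4, 31, "01/04/2024"),    # different date format (MM/DD/YYYY assumed)
]

def parse_date(text):
    """Try each known format and return an ISO date string."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def clean(rows):
    cleaned = []
    seen = set()
    for row in rows:
        if row in seen:              # 1. deduplicate
            continue
        seen.add(row)
        order_id, age, order_date = row
        if age is None:              # 2. drop rows with a missing age
            continue
        if not 0 <= age <= 120:      # 3. filter out-of-range ages
            continue
        cleaned.append((order_id, age, parse_date(order_date)))  # 4. one date format
    return cleaned

print(clean(raw_rows))
# [(1, 25, '2024-01-01'), (4, 31, '2024-01-04')]
```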
---

## 9. Funnel analysis: finding the conversion bottleneck

**Funnel analysis** tracks how users convert through a sequence of steps and finds the step where the largest share of users drops off.

### 9.1 What is funnel analysis?

#### An everyday example

**Scenario: you run a coffee shop**

```
Walk in → Try a sample → Get a membership card → Become a regular
100     → 50           → 20                    → 10
100%    → 50%          → 20%                   → 10%
```

**Question**: why do only 10 people end up as regulars?

**Analysis**:
- **Walk in → sample**: 50 people lost (50% convert)
- **Sample → membership**: 30 people lost (40% convert)
- **Membership → regular**: 10 people lost (50% convert)

**Conclusion**: the lowest step conversion (40%) is at "sample → membership", which suggests the membership card is not attractive enough.

---
|
|
|
|
|
|
|
|
|
|
|
|
#### Funnel analysis of an e-commerce purchase flow

**Scenario: a user shops in an e-commerce app**

```
View product page → Add to cart → Enter checkout → Complete payment
10000             → 6000        → 4000           → 2500
100%              → 60%         → 40%            → 25%
```

**How the numbers are computed** (every rate below is cumulative, measured against the 10000 users who entered the funnel):

**Step 1: view product page**
```
Visitors = 10000
Share = 10000 / 10000 = 100%
```

**Step 2: add to cart**
```
Users who added to cart = 6000
Cumulative conversion = 6000 / 10000 = 60%
Cumulative drop-off = 1 - 60% = 40%
```

**Step 3: enter checkout**
```
Users who entered checkout = 4000
Cumulative conversion = 4000 / 10000 = 40%
Cumulative drop-off = 1 - 40% = 60%
```

**Step 4: complete payment**
```
Users who paid = 2500
Cumulative conversion = 2500 / 10000 = 25%
Cumulative drop-off = 1 - 25% = 75%
```

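
The step-by-step calculation above divides every count by the initial 10000 users. A short Python sketch, using the same four counts, shows that cumulative rate next to the step-to-step rate, which divides by the previous step instead. The `funnel_rates` helper is a name chosen for this illustration.

```python
# Step names and user counts copied from the e-commerce funnel above.
funnel = [
    ("view_product", 10000),
    ("add_to_cart", 6000),
    ("checkout", 4000),
    ("purchase", 2500),
]


def funnel_rates(steps):
    """Return (name, users, cumulative_rate, step_rate) for each funnel step."""
    first_count = steps[0][1]
    previous_count = first_count
    rows = []
    for name, count in steps:
        cumulative_rate = count / first_count   # relative to the top of the funnel
        step_rate = count / previous_count      # relative to the previous step
        rows.append((name, count, cumulative_rate, step_rate))
        previous_count = count
    return rows


rates = funnel_rates(funnel)
for name, count, cumulative_rate, step_rate in rates:
    print(f"{name:<14}{count:>6}  cumulative {cumulative_rate:.1%}  step {step_rate:.1%}")
```

The two views answer different questions: the cumulative rate says how much of the original traffic is left, while the step rate says how well a single page or action performs on the users who actually reached it.
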
#### The funnel as an ASCII diagram

```
View product page (10000 users)
██████████████████████████████████████████████████
    │
    │  4000 users lost (40%)
    ↓
Add to cart (6000 users)
██████████████████████████████
    │
    │  2000 users lost (20%)
    ↓
Enter checkout (4000 users)
████████████████████
    │
    │  1500 users lost (15%)
    ↓
Complete payment (2500 users)
█████████████
```

#### Key metrics

| Metric | Definition | Formula | Example |
| :--- | :--- | :--- | :--- |
| **Step conversion rate** | Share of users at the current step who reach the next step | Users at next step / users at current step | 60% of visitors add to cart |
| **Overall conversion rate** | Share of initial users who complete the final step | Users at final step / users at first step | 25% of visitors complete a purchase |
| **Step drop-off rate** | 1 - step conversion rate | 1 - step conversion rate | 40% of visitors leave before adding to cart |
| **Overall drop-off rate** | 1 - overall conversion rate | 1 - overall conversion rate | 75% of visitors never complete a purchase |

---

### 9.2 How Do You Compute Each Funnel Step?

#### SQL example

Suppose we have a user event table `user_events`:

| event_id | user_id | event_name | timestamp |
| :--- | :--- | :--- | :--- |
| 1 | U001 | view_product | 2024-01-01 10:00:00 |
| 2 | U001 | add_to_cart | 2024-01-01 10:01:00 |
| 3 | U001 | checkout | 2024-01-01 10:02:00 |
| 4 | U001 | purchase | 2024-01-01 10:03:00 |
| 5 | U002 | view_product | 2024-01-01 10:00:00 |
| 6 | U002 | add_to_cart | 2024-01-01 10:01:00 |
| 7 | U003 | view_product | 2024-01-01 10:00:00 |

**Question**: how many users reached each step?

```sql
-- Step 1: users who viewed a product page
SELECT COUNT(DISTINCT user_id) AS view_count
FROM user_events
WHERE event_name = 'view_product';

-- Result (on the full dataset, not the 7-row sample above):
-- | view_count |
-- | :--- |
-- | 10000 |

-- Step 2: users who added to cart
SELECT COUNT(DISTINCT user_id) AS add_to_cart_count
FROM user_events
WHERE event_name = 'add_to_cart';

-- Result:
-- | add_to_cart_count |
-- | :--- |
-- | 6000 |

-- Step 3: users who entered checkout
SELECT COUNT(DISTINCT user_id) AS checkout_count
FROM user_events
WHERE event_name = 'checkout';

-- Result:
-- | checkout_count |
-- | :--- |
-- | 4000 |

-- Step 4: users who completed payment
SELECT COUNT(DISTINCT user_id) AS purchase_count
FROM user_events
WHERE event_name = 'purchase';

-- Result:
-- | purchase_count |
-- | :--- |
-- | 2500 |
```

---

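
The four queries in section 9.2 can be folded into a single pass with conditional aggregation. The sketch below runs that query against the seven sample rows using Python's built-in `sqlite3` module and an in-memory database. The choice of SQLite and the `funnel_counts` helper are assumptions made so the example is runnable; the original text does not say which database is used.

```python
import sqlite3

# The seven sample rows from the user_events table in section 9.2.
SAMPLE_EVENTS = [
    (1, "U001", "view_product", "2024-01-01 10:00:00"),
    (2, "U001", "add_to_cart", "2024-01-01 10:01:00"),
    (3, "U001", "checkout", "2024-01-01 10:02:00"),
    (4, "U001", "purchase", "2024-01-01 10:03:00"),
    (5, "U002", "view_product", "2024-01-01 10:00:00"),
    (6, "U002", "add_to_cart", "2024-01-01 10:01:00"),
    (7, "U003", "view_product", "2024-01-01 10:00:00"),
]

# One scan of the table; CASE yields NULL for other events, and COUNT ignores NULL.
FUNNEL_QUERY = """
SELECT
    COUNT(DISTINCT CASE WHEN event_name = 'view_product' THEN user_id END) AS view_count,
    COUNT(DISTINCT CASE WHEN event_name = 'add_to_cart'  THEN user_id END) AS add_to_cart_count,
    COUNT(DISTINCT CASE WHEN event_name = 'checkout'     THEN user_id END) AS checkout_count,
    COUNT(DISTINCT CASE WHEN event_name = 'purchase'     THEN user_id END) AS purchase_count
FROM user_events;
"""


def funnel_counts(events):
    """Load events into an in-memory SQLite table and count distinct users per step."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_events ("
        "event_id INTEGER, user_id TEXT, event_name TEXT, timestamp TEXT)"
    )
    conn.executemany("INSERT INTO user_events VALUES (?, ?, ?, ?)", events)
    counts = conn.execute(FUNNEL_QUERY).fetchone()
    conn.close()
    return counts


sample_counts = funnel_counts(SAMPLE_EVENTS)
print(sample_counts)
```

Like the four separate queries, this counts everyone who ever fired an event, regardless of order. A stricter funnel would also require that each user's events happen in sequence, which this sketch does not check.
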
### 9.3 How Do You Optimize a Funnel?

#### Step 1: find the weakest link

**Analyze the funnel data**:

```
Visit → Add to cart → Checkout → Payment
100%  → 60%         → 40%      → 25%
      -40%          -20%       -15%
```

**Drop-off analysis** (in percentage points of the initial visitors):
- **Visit → Add to cart**: 40 points lost (the largest!)
- **Add to cart → Checkout**: 20 points lost
- **Checkout → Payment**: 15 points lost

**Conclusion**: the biggest drop-off is at "Visit → Add to cart", which suggests **the product page is not compelling**.

#### Step 2: targeted optimization

| Problem step | Drop-off | Possible cause | Fix | Expected step conversion |
| :--- | :--- | :--- | :--- | :--- |
| **Visit → Add to cart** | 40% | Product details are unclear | Improve images, descriptions, and reviews | Above the current 60% |
| **Add to cart → Checkout** | 20% | Shipping cost is not transparent | Show the total price (shipping included) up front | From about 67% to 85% |
| **Checkout → Payment** | 15% | Payment flow is complicated | Fewer form fields, one-click payment | From 62.5% to 90% |

#### Step 3: verify the effect

**Funnel before optimization**:
```
Visit → Add to cart → Checkout → Payment
100%  → 60%         → 40%      → 25%
```
**Overall conversion rate: 25%**

**Funnel after optimization** (assuming the last two steps reach their targets while the first step stays at 60%: 60% × 85% = 51%, 51% × 90% ≈ 46%):
```
Visit → Add to cart → Checkout → Payment
100%  → 60%         → 51%      → 46%
```
**Overall conversion rate: 46%**

**Improvement**: overall conversion rises from 25% to 46%, a relative increase of 84%.

---

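
The before and after figures in section 9.3 follow from two small formulas: overall conversion is the product of the step conversion rates, and relative lift is the change divided by the starting value. The Python below is a sketch of that arithmetic. The helper names are illustrative, and the "after" step rates are the targets from the table, not measured results.

```python
def overall_conversion(step_rates):
    """Multiply step conversion rates to get the end-to-end conversion rate."""
    overall = 1.0
    for rate in step_rates:
        overall *= rate
    return overall


def relative_lift(before, after):
    """Relative change from `before` to `after`; 0.84 means an 84% increase."""
    return (after - before) / before


# Step rates before optimization, derived from 10000 -> 6000 -> 4000 -> 2500.
before = overall_conversion([6000 / 10000, 4000 / 6000, 2500 / 4000])  # about 0.25

# Step rates after optimization: first step unchanged, last two at their targets.
after = overall_conversion([0.60, 0.85, 0.90])  # about 0.459, shown as 46% in the text

# The text compares the rounded figures 25% and 46%.
lift = relative_lift(round(before, 2), round(after, 2))
print(f"before {before:.1%}, after {after:.1%}, lift {lift:.0%}")
```

Because the step rates multiply, improving any single step by a given factor improves the overall rate by the same factor, which is why fixing the weakest step first usually pays off most.
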
### 9.4 Case Study: Optimizing a Registration Flow

#### Background

The registration flow of a social app:
```
Open app → Enter phone number → Enter verification code → Set password → Registered
10000    → 8000               → 6000                    → 3000         → 1000
100%     → 80%                → 60%                     → 30%          → 10%
```

**Problem**: overall conversion is only 10%, which is far too low.

#### Analysis

**Biggest drop-off**: Set password → Registered (67% of the users who reach this step are lost)

**User research**:
- "The password rules are too complicated"
- "I don't want to set a password, I want to log in with WeChat"
- "Having to type the password twice is a hassle"

#### Optimization plan

**Option 1: simplify the password rules**
- ❌ Before: must contain uppercase and lowercase letters, digits, and special characters, at least 8 characters
- ✅ After: 6 to 20 characters, any characters

**Option 2: support third-party login**
- Add one-tap login with WeChat and Apple ID

**Option 3: drop the password confirmation field**
- Enter the password once; a "show password" button replaces the confirmation

#### Funnel after optimization

```
Open app → Enter phone number → Enter verification code → Registered (third-party login)
10000    → 9000               → 8000                    → 4000
100%     → 90%                → 80%                     → 40%
```

**Overall conversion**: up from 10% to 40%, which is 4 times the original rate (a 300% increase).

---

## 10. Retention Analysis: Measuring Product Stickiness

**Retention rate** measures how many users keep using a product after their first use. It is a core indicator of product health.

### 10.1 What Is Retention?

#### In plain language

**Real-life example**: you open a gym

```
Day 1: 100 people buy a membership
Day 2: only 45 people come to work out
Day 7: only 20 people come to work out
Day 30: only 10 people come to work out
```

**What does this mean?**
- **Day 2**: 55 people did not come (churned)
- **Day 7**: 80 people did not come (churned)
- **Day 30**: 90 people did not come (churned)

**Question**: why did so many people stop coming?

**Possible reasons**:
- The gym is too far away
- The price is too high
- There is no personal trainer guidance
- The facilities are poor

---

#### Product retention: do users come back?

**Scenario: a news app**

**User A's story**:
```
Day 1 (Jan 1): downloads the app and reads 3 articles
Day 2 (Jan 2): opens the app again and reads 5 articles ✅ retained!
Day 3 (Jan 3): does not open the app
...
Day 7 (Jan 7): opens the app again ✅ retained!
...
Day 30 (Jan 30): does not open the app ❌ not retained
```

**User A's retention**:
- **Next-day retention**: ✅ retained (opened the app on Jan 2)
- **7-day retention**: ✅ retained (opened the app on Jan 7)
- **30-day retention**: ❌ not retained (did not open the app on Jan 30)

---

### 10.2 Types of Retention Rate

#### Next-day retention (Day 1 Retention)

**Definition**: the share of users who are still active on the day after they register

**Formula**:
```
Next-day retention = users active on the second day / users who registered on the first day
```

**Example**:
```
Users who registered on Jan 1: 1000
Of those, users active on Jan 2: 450

Next-day retention = 450 / 1000 = 45%
```

---

#### 7-day retention (Day 7 Retention)

**Definition**: the share of users who are still active on the 7th day after registering

**Formula**:
```
7-day retention = users active on day 7 / registered users
```

**Example**:
```
Users who registered on Jan 1: 1000
Of those, users active on Jan 7: 320

7-day retention = 320 / 1000 = 32%
```

---

#### 30-day retention (Day 30 Retention)

**Definition**: the share of users who are still active on the 30th day after registering

**Formula**:
```
30-day retention = users active on day 30 / registered users
```

**Example**:
```
Users who registered on Jan 1: 1000
Of those, users active on Jan 30: 180

30-day retention = 180 / 1000 = 18%
```

---

### 10.3 Retention Rate Summary

| Type | Definition | Formula | Healthy benchmark (rule of thumb) |
| :--- | :--- | :--- | :--- |
| **Next-day retention** | Share of users who return the day after registering | Day 1 active / registered users | > 40% |
| **7-day retention** | Share of users who return on day 7 after registering | Day 7 active / registered users | > 20% |
| **30-day retention** | Share of users who return on day 30 after registering | Day 30 active / registered users | > 10% |

---

### 10.4 How Do You Compute Retention?

#### Retention table

**Example: three daily registration cohorts**

| Registration date | Registered users | Next-day retention | 7-day retention | 30-day retention |
| :--- | :--- | :--- | :--- | :--- |
| 2024-01-01 | 1000 | 45% (450 users) | 32% (320 users) | 18% (180 users) |
| 2024-01-02 | 1200 | 42% (504 users) | 28% (336 users) | 15% (180 users) |
| 2024-01-03 | 900 | 48% (432 users) | 35% (315 users) | 20% (180 users) |

**Worked example (the Jan 1 cohort)**:
```
Next-day retention = users active on Jan 2 / users registered on Jan 1
                   = 450 / 1000
                   = 45%

7-day retention = users active on Jan 7 / users registered on Jan 1
                = 320 / 1000
                = 32%

30-day retention = users active on Jan 30 / users registered on Jan 1
                 = 180 / 1000
                 = 18%
```

---

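
The retention formulas in section 10.4 can be sketched in Python for a single registration cohort. The four users, their activity dates, and the `retention_rate` helper are hypothetical. Following the examples in this document, the registration day counts as day 1, so next-day retention is day 2.

```python
from datetime import date, timedelta

# Hypothetical data: registration date per user, and the days each user was active.
REGISTERED = {
    "U001": date(2024, 1, 1),
    "U002": date(2024, 1, 1),
    "U003": date(2024, 1, 1),
    "U004": date(2024, 1, 1),
}
ACTIVE_DAYS = {
    "U001": {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 7)},
    "U002": {date(2024, 1, 1), date(2024, 1, 2)},
    "U003": {date(2024, 1, 1)},
    "U004": {date(2024, 1, 1), date(2024, 1, 7), date(2024, 1, 30)},
}


def retention_rate(registered, active_days, cohort_day, day_number):
    """Share of users registered on cohort_day who are active on day `day_number`.

    The registration day is day 1, so day_number=2 gives next-day retention.
    """
    cohort = [user for user, day in registered.items() if day == cohort_day]
    if not cohort:
        return 0.0
    target_day = cohort_day + timedelta(days=day_number - 1)
    retained = sum(1 for user in cohort if target_day in active_days.get(user, set()))
    return retained / len(cohort)


cohort_day = date(2024, 1, 1)
next_day = retention_rate(REGISTERED, ACTIVE_DAYS, cohort_day, 2)
day_7 = retention_rate(REGISTERED, ACTIVE_DAYS, cohort_day, 7)
day_30 = retention_rate(REGISTERED, ACTIVE_DAYS, cohort_day, 30)
print(next_day, day_7, day_30)
```

Note that U004 counts as retained on day 7 and day 30 even though they skipped day 2. Each retention figure only asks whether the user was active on that specific day.
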
### 10.5 Retention Health Benchmarks

| Next-day retention | Product state | Meaning | Recommendation |
| :--- | :--- | :--- | :--- |
| **High** (>40%) | Healthy growth | Users like the product and keep using it | Keep it up and scale |
| **Medium** (20-40%) | Needs optimization | The product is decent but not compelling enough | Analyze user behavior and improve the core features |
| **Low** (<20%) | Danger | Users come once and leave; something is wrong with the product | Revisit the product positioning and fix the core problem |

#### Retention benchmarks by product type

| Product type | Next-day retention | 7-day retention | 30-day retention |
| :--- | :--- | :--- | :--- |
| **Social apps** | 40-50% | 25-35% | 15-25% |
| **Games** | 35-45% | 15-25% | 5-15% |
| **E-commerce** | 25-35% | 10-20% | 5-10% |
| **Utilities** | 30-40% | 15-25% | 10-20% |

---

### 10.6 Why Retention Matters

#### Scenario 1: high DAU + low retention = "burning money to buy traffic"

**Data**:
```
DAU: 100,000
Next-day retention: 15%
```

**Analysis**:
- DAU is high, but retention is low
- Most users come once and leave
- The product is "burning money to buy traffic", which is not sustainable

**Question**: why don't users come back?

**Possible reasons**:
- The advertising does not match the actual product
- The registration flow is too complicated, so users give up
- The product has no core value

---

#### Scenario 2: low DAU + high retention = "slow-burn product"

**Data**:
```
DAU: 10,000
Next-day retention: 50%
30-day retention: 30%
```

**Analysis**:
- DAU is modest, but retention is high
- The product is good and users like it
- This is a "slow-burn product" that needs time to build up

**Recommendations**:
- Keep improving the product
- Strengthen word of mouth
- Grow the user base step by step

---

#### Scenario 3: high DAU + high retention = healthy growth

**Data**:
```
DAU: 100,000
Next-day retention: 50%
30-day retention: 30%
```

**Analysis**:
- DAU is high and so is retention
- The product is successful and users like it
- This is the sign of healthy growth 🎯

**Recommendations**:
- Keep it up
- Scale
- Explore business models

::: tip 💡 Retention vs. DAU
- **High DAU + low retention** = "burning money to buy traffic", not sustainable
- **Low DAU + high retention** = "slow-burn product", needs time to build up
- **High DAU + high retention** = healthy growth 🎯
:::

---

### 10.7 How Do You Improve Retention?

#### Step 1: find out why users churn

**Method 1: user interviews**
- Contact churned users and ask why they stopped using the product
- Look for common problems

**Method 2: behavior analysis**
- Find where users drop off
- Identify the behavior patterns that precede churn

**Method 3: A/B testing**
- Test different product improvements
- See which one raises retention

---

#### Step 2: targeted optimization

**Problem 1: low next-day retention**

**Possible causes**:
- The registration flow is too complicated
- Users do not know how to use the product
- Users have not found the core value

**Fixes**:
- Simplify registration
- Add onboarding guidance
- Improve the experience of the core features

---

**Problem 2: low 7-day retention**

**Possible causes**:
- The novelty has worn off
- There is no incentive to keep using the product
- Users cannot find a use case

**Fixes**:
- Add personalized recommendations
- Send push notifications (not too many)
- Design "daily tasks" or "check-in rewards"

---

**Problem 3: low 30-day retention**

**Possible causes**:
- Content is updated too slowly
- User needs have changed
- Competitors are better

**Fixes**:
- Update content faster
- Add new features
- Build a user community

---

### 10.8 Case Study: Improving a Game's Retention

#### Background

A casual game:
```
Next-day retention: 25% (target: 40%)
7-day retention: 10% (target: 20%)
30-day retention: 3% (target: 10%)
```

#### Analysis

**User behavior analysis**:
```
Players on their first day:
- 100% completed the tutorial
- 60% reached level 5
- 20% reached level 10
- 5% reached level 20
```

**Conclusion**: the steepest drop is between levels 5 and 10, where two thirds of the players who reached level 5 quit.

**User research**:
- "Level 6 is too hard"
- "I have to start over every time, it's exhausting"
- "There are no rewards, so I stopped playing"

#### Optimization plan

**Option 1: adjust the difficulty curve**
- ❌ Before: level 6 suddenly gets hard
- ✅ After: difficulty increases gradually

**Option 2: add save points**
- ❌ Before: players start from the beginning every time
- ✅ After: automatic save every 5 levels

**Option 3: add a reward system**
- ❌ Before: no reward for clearing a level
- ✅ After: clearing a level grants coins and items

#### Results after optimization

```
Next-day retention: 25% → 45% ✅ (up 80%)
7-day retention: 10% → 25% ✅ (up 150%)
30-day retention: 3% → 12% ✅ (up 300%)
```

---

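
The level checkpoints in section 10.8 cover spans of different length (5, 5, and 10 levels), so comparing raw drops can mislead. One way to compare them fairly is an average per-level survival rate for each span. The Python sketch below does this, under two assumptions that are not in the original data: the tutorial is treated as level 0, and churn is spread evenly within each span.

```python
# Share of day-one players who reached each checkpoint, from the analysis in 10.8.
# Treating the tutorial as level 0 is an assumption made for this sketch.
REACHED = [(0, 1.00), (5, 0.60), (10, 0.20), (20, 0.05)]


def per_level_survival(checkpoints):
    """Return (from_level, to_level, average survival per level) for each span."""
    result = []
    for (level_a, share_a), (level_b, share_b) in zip(checkpoints, checkpoints[1:]):
        span = level_b - level_a
        # Geometric mean: the constant per-level rate that produces the observed drop.
        survival = (share_b / share_a) ** (1 / span)
        result.append((level_a, level_b, survival))
    return result


survival = per_level_survival(REACHED)
worst = min(survival, key=lambda row: row[2])
for level_a, level_b, rate in survival:
    print(f"levels {level_a:>2} to {level_b:>2}: about {rate:.1%} of players survive each level")
```

Under these assumptions the span from level 5 to level 10 has the lowest per-level survival, which is consistent with the players' complaint about level 6.
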
## 11. Hands-On: User Behavior Analysis

Suppose you are responsible for data analysis at an e-commerce app. The complete analysis workflow looks like this.

### 11.1 Define the Problem

**Goal**: increase the order conversion rate

**Current state**: 100,000 users view product pages and 2,000 end up ordering, so the conversion rate is 2%

### 11.2 Collect the Data

| Dimension | Data |
| :--- | :--- |
| **User attributes** | Age, gender, region, registration date |
| **Behavior data** | Browsing history, add to cart, order, payment |
| **Transaction data** | Order amount, product category, coupon usage |

### 11.3 Data Analysis

**Step 1: funnel analysis**

```
Browse products → Add to cart → Checkout → Payment
100,000         → 50,000      → 30,000   → 2,000
100%            → 50%         → 30%      → 2%
```

Finding: the "Checkout → Payment" step loses the most users (from 30% of visitors down to 2%).

**Step 2: segment analysis**

```sql
-- Conversion rate by traffic source.
-- Assumes user_events also carries traffic_source and order_id columns,
-- with order_id NULL unless the event belongs to an order.
SELECT
    traffic_source,
    COUNT(DISTINCT user_id) AS total_users,
    COUNT(DISTINCT CASE WHEN order_id IS NOT NULL THEN user_id END) AS converted_users,
    COUNT(DISTINCT CASE WHEN order_id IS NOT NULL THEN user_id END) * 1.0
        / COUNT(DISTINCT user_id) AS conversion_rate
FROM user_events
GROUP BY traffic_source;
```

**Result**:

| Source | Users | Converted users | Conversion rate |
| :--- | :--- | :--- | :--- |
| Search engine | 50000 | 500 | 1% |
| Social media | 30000 | 900 | 3% |
| Direct | 20000 | 600 | 3% |

**Conclusion**: search engine users convert worst (1%), possibly because users who arrive from search are "just looking".

**Step 3: retention analysis**

```sql
-- Next-day retention by traffic source.
-- Assumes user_retention has one row per user and retention_day1 is 1 (retained) or 0.
SELECT
    traffic_source,
    AVG(retention_day1) AS avg_retention
FROM user_retention
GROUP BY traffic_source;
```

**Result**:

| Source | Next-day retention |
| :--- | :--- |
| Search engine | 25% |
| Social media | 45% |
| Direct | 55% |

**Conclusion**: search engine users also have the lowest retention, which suggests this traffic is of lower quality.

### 11.4 Recommended Actions

| Problem | Cause | Recommendation |
| :--- | :--- | :--- |
| Low conversion rate (2%) | The checkout flow is complicated | Simplify the form and support one-click payment |
| Low search engine conversion | User intent is unclear | Optimize the landing page and highlight product value |
| Low search engine retention | Users leave as soon as they find what they came for | Add "related recommendations" to encourage browsing more products |

---

## 12. Using AI to Assist Data Analysis

AI can help you quickly generate SQL, analyze data, and draft reports.

### 12.1 Prompt Template

```
You are a senior data analyst who is proficient in SQL and data visualization. Please help me analyze the following data.

## Business background
[Describe your business scenario, for example: e-commerce app, social media, etc.]

## Table structure
[Describe the fields of your tables, for example:
- users: user_id, age, gender, register_date
- orders: order_id, user_id, amount, created_at]

## Analysis requirements
[List the questions you want answered, for example:
1. Count each user's orders and total spend
2. Find the top 10 users by spend
3. Analyze the distribution of user spend]

## Output requirements
1. SQL queries
2. Interpretation of the results
3. Data visualization suggestions
4. Business insights
```

### 12.2 Worked Example: User Segmentation

**Input prompt**:

```
You are a senior data analyst. Please help me design a user segmentation scheme.

## Business background
I am a data analyst at an e-commerce app. We have 1 million users and want to segment them by purchasing behavior for targeted marketing.

## Table structure
- orders: order_id, user_id, amount, created_at
- users: user_id, age, gender, register_date

## Analysis requirements
1. Segment users with the RFM model (recency, frequency, monetary value)
2. Compute the number of users and the sales contributed by each segment
3. Design a marketing strategy for each segment

## Output requirements
Please provide:
1. SQL queries
2. An explanation of the segmentation logic
3. The characteristics of each segment and its marketing strategy
```

**Example AI output**:

```sql
-- RFM segmentation (the two-argument DATEDIFF is MySQL syntax; adapt it for other databases)
WITH rfm AS (
    SELECT
        user_id,
        DATEDIFF('2024-01-31', MAX(created_at)) AS recency,  -- days since last order
        COUNT(*) AS frequency,                               -- number of orders
        SUM(amount) AS monetary                              -- total spend
    FROM orders
    WHERE created_at >= '2024-01-01'
    GROUP BY user_id
),
rfm_scores AS (
    SELECT
        user_id,
        recency,
        frequency,
        monetary,
        -- Quartile scores from 1 to 4; ordering recency DESC gives the most recent buyers a 4
        NTILE(4) OVER (ORDER BY recency DESC) AS r_score,
        NTILE(4) OVER (ORDER BY frequency) AS f_score,
        NTILE(4) OVER (ORDER BY monetary) AS m_score
    FROM rfm
)
SELECT
    user_id,
    CASE
        WHEN r_score IN (3, 4) AND f_score IN (3, 4) AND m_score IN (3, 4) THEN 'Key high-value users'
        WHEN r_score IN (1, 2) AND f_score IN (3, 4) AND m_score IN (3, 4) THEN 'Key users to retain'
        WHEN r_score IN (3, 4) AND f_score IN (1, 2) AND m_score IN (3, 4) THEN 'Key users to develop'
        WHEN r_score IN (1, 2) AND f_score IN (1, 2) AND m_score IN (3, 4) THEN 'Key users to win back'
        WHEN r_score IN (3, 4) AND f_score IN (3, 4) AND m_score IN (1, 2) THEN 'General high-value users'
        WHEN r_score IN (1, 2) AND f_score IN (3, 4) AND m_score IN (1, 2) THEN 'General users to retain'
        WHEN r_score IN (3, 4) AND f_score IN (1, 2) AND m_score IN (1, 2) THEN 'General users to develop'
        ELSE 'Churned users'
    END AS user_segment
FROM rfm_scores;
```

**Segment strategies**:

| Segment | Characteristics | Marketing strategy |
| :--- | :--- | :--- |
| **Key high-value users** | Recent purchase, high frequency, high spend | VIP treatment, dedicated support |
| **Key users to retain** | Have not purchased recently, but used to buy often and spend a lot | Reactivation campaigns, coupons to bring them back |
| **Key users to develop** | Recent purchase, low frequency, high spend | Encourage repeat purchases, recommend related products |
| **Key users to win back** | Have not purchased recently, low frequency, high spend | Reach out proactively and find out why |

---

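
To see how the RFM scoring in section 12.2 behaves on concrete data, the same logic can be sketched in plain Python. The four users and their orders are hypothetical, and `ntile` is a hand-written stand-in for SQL's `NTILE`. Values are kept distinct on purpose, because `NTILE` breaks ties in an unspecified order.

```python
from datetime import date

# Hypothetical orders: (user_id, amount, order_date).
ORDERS = [
    ("U001", 500, date(2024, 1, 30)), ("U001", 300, date(2024, 1, 20)),
    ("U001", 200, date(2024, 1, 10)), ("U001", 100, date(2024, 1, 5)),
    ("U002", 600, date(2024, 1, 3)), ("U002", 500, date(2024, 1, 2)),
    ("U002", 400, date(2024, 1, 1)),
    ("U003", 900, date(2024, 1, 25)),
    ("U004", 120, date(2024, 1, 8)), ("U004", 80, date(2024, 1, 6)),
]
SNAPSHOT = date(2024, 1, 31)

SEGMENTS = {  # keyed by (recent?, frequent?, high spend?)
    (True, True, True): "Key high-value users",
    (False, True, True): "Key users to retain",
    (True, False, True): "Key users to develop",
    (False, False, True): "Key users to win back",
    (True, True, False): "General high-value users",
    (False, True, False): "General users to retain",
    (True, False, False): "General users to develop",
    (False, False, False): "Churned users",
}


def rfm_table(orders, snapshot):
    """Aggregate orders into three dicts: recency in days, frequency, monetary."""
    recency, frequency, monetary = {}, {}, {}
    for user_id, amount, order_date in orders:
        days = (snapshot - order_date).days
        recency[user_id] = min(days, recency.get(user_id, days))
        frequency[user_id] = frequency.get(user_id, 0) + 1
        monetary[user_id] = monetary.get(user_id, 0) + amount
    return recency, frequency, monetary


def ntile(values, buckets=4, reverse=False):
    """Mimic SQL NTILE: sort the keys by value and split them into numbered buckets."""
    ordered = sorted(values, key=values.get, reverse=reverse)
    size, extra = divmod(len(ordered), buckets)
    scores, start = {}, 0
    for bucket in range(1, buckets + 1):
        end = start + size + (1 if bucket <= extra else 0)
        for key in ordered[start:end]:
            scores[key] = bucket
        start = end
    return scores


recency, frequency, monetary = rfm_table(ORDERS, SNAPSHOT)
r_score = ntile(recency, reverse=True)  # most recent buyers get the highest score
f_score = ntile(frequency)
m_score = ntile(monetary)
segments = {
    user: SEGMENTS[(r_score[user] >= 3, f_score[user] >= 3, m_score[user] >= 3)]
    for user in recency
}
print(segments)
```

Quartile scoring is relative: a user's segment depends on how they rank against everyone else in the period, so the same user can change segment when the population changes even if their own behavior does not.
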
## Glossary

| Term | Explanation |
| :--- | :--- |
| **Descriptive statistics** | Summarizing data with measures such as the mean, median, and standard deviation |
| **Mean** | The average of all values |
| **Median** | The middle value after sorting |
| **Mode** | The value that occurs most often |
| **Standard deviation** | A measure of how spread out the data is |
| **Variance** | The square of the standard deviation |
| **Aggregation** | Summarizing detail rows (sum, count, and so on) |
| **Group by** | Splitting data into groups along some dimension |
| **Funnel analysis** | Analyzing how users convert through a sequence of steps |
| **Conversion rate** | The share of users who complete the target action |
| **Retention rate** | The share of users who keep using the product |
| **Next-day retention** (Day 1 Retention) | The share of users still active on the day after they register |
| **Data cleaning** | Handling missing values, duplicates, and outliers |
| **Visualization** | Presenting data with charts |
| **Line chart** | Shows how a value changes over time |
| **Bar chart** | Compares values across categories |
| **Pie chart** | Shows each part's share of the whole |
| **Scatter plot** | Shows the relationship between two variables |
| **RFM model** | Segments users by recency, frequency, and monetary value |
| **DAU** (Daily Active Users) | The number of users active on a given day |
| **GMV** (Gross Merchandise Value) | The total value of merchandise transacted |
| **ARPU** (Average Revenue Per User) | The average revenue per user |
| **LTV** (Lifetime Value) | The value a user generates over their whole lifecycle |