大数据——项目实战(使用Zeppelin分析电子商务消费行为分析)
使用Zeppelin分析电子商务消费行为分析任务描述需求概述问题分析Customer表Transaction表Store表Review表数据结构连接至Zeppelin电子商务消费行为分析文件提取码:39r1任务描述需求概述对某零售企业最近一年门店收集的数据进行数据分析潜在客户画像用户消费统计门店的资源利用率消费的特征人群定位数据的可视化展现问题分析Customer表customer_details
·
使用Zeppelin分析电子商务消费行为分析
- 任务描述
- 问题分析
- 连接至Zeppelin
- 1.数据获取
- 2.检查行数和header行
- 3.上传数据到HDFS
- 4.建表查表
- 5.数据清洗
- 6.Customer分析
- 7.Transaction分析
- 7.1计算每月总收入
- 7.2计算每个季度的总收入
- 7.3按年计算总收入
- 7.4按工作日计算总收入
- 7.5按时间段计算总收入(需要清理数据)
- 7.6按时间段计算平均消费
- 7.7按工作日计算平均消费
- 7.8计算年、月、日的交易总数
- 7.9找出交易量最大的10个客户
- 7.10找出消费最多的前10位顾客
- 7.11统计该期间交易数量最少的用户
- 7.12计算每个季度的独立客户总数
- 7.13计算每周的独立客户总数
- 7.14计算整个活动客户平均花费的最大值
- 7.15统计每月花费最多的客户
- 7.16统计每月访问次数最多的客户
- 7.17按总价找出最受欢迎的5种产品
- 7.18根据购买频率找出最畅销的5种产品
- 7.19根据客户数量找出最受欢迎的5种产品
- 7.20验证前5个details
- 8.Store分析
- 9.Review分析
电子商务消费行为分析文件
提取码:39r1
任务描述
需求概述
对某零售企业最近一年门店收集的数据进行数据分析
- 潜在客户画像
- 用户消费统计
- 门店的资源利用率
- 消费的特征人群定位
- 数据的可视化展现
问题分析
Customer表
customer_details | details |
---|---|
customer_id | Int,1-500 |
first_name | string |
last_name | string |
string,such as willddy@gmail.com | |
gender | string,Male or Female |
address | string |
country | string |
language | string |
job | string,job title/position |
credit_type | string,credit card type。such as visa |
credit_no | strin,credit card number |
问题:language字段数据存在错误
Transaction表
transaction_details | details |
---|---|
transaction_id | Int,1-1000 |
customer_id | Int,1-500 |
store_id | Int,1-5 |
price | decimal,such as 5.08 |
product | string,things bought |
date | string,when to purchase |
time | string,what time to purchase |
问题:表中transa_id有重复,但数据有效,需要修复数据
Store表
store_details | details |
---|---|
store_id | Int,1-5 |
store_name | string |
employee_number | Int,在store有多少employee |
Review表
store_review | details |
---|---|
stransaction_id | Int,1-8000 |
store_id | Int,1-5 |
review_store | Int,1-5 |
问题:表中有效的score数据
表中有奖transaction_id映射到错误的store_id
数据结构
Customer表
- customer_details
Transaction表
- transaction_details
Store表
- store_details
Review表
- store_review
连接至Zeppelin
Zeppelin安装和配置
打开Zeppelin网站
导入电子商务消费行为分析数据及模板
1.数据获取
Download Data
%sh
--创建数据库
mkdir /mnt/hivetest
cd /mnt/hivetest
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/customer_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/transaction_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_details.csv
wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_review.csv
2.检查行数和header行
Understand the Data
%sh
## /tmp/data/
cd /mnt/hivetest
head -2 customer_details.csv
#head -2 store_details.csv
#head -2 store_review.csv
#head -2 transaction_details.csv
3.上传数据到HDFS
Upload the file to HDFS
%sh
hdfs dfs -mkdir -p /apps/shopping/customer
hdfs dfs -put /mnt/hivetest/customer_details.csv /apps/shopping/customer
hdfs dfs -mkdir -p /apps/shopping/transaction
hdfs dfs -put /mnt/hivetest/transaction_details.csv /apps/shopping/transaction
hdfs dfs -mkdir -p /apps/shopping/store
hdfs dfs -put /mnt/hivetest/store_details.csv /apps/shopping/store
hdfs dfs -mkdir -p /apps/shopping/review
hdfs dfs -put /mnt/hivetest/store_review.csv /apps/shopping/review
##/tmp/shopping/data/customer/
##/tmp/shopping/data/transaction/
##/tmp/shopping/data/store/
##/tmp/shopping/data/review/
4.建表查表
4.1 Clear all tables if exists
%hive
create database shopping
use shopping
%hive
--创建顾客表
drop table ext_transaction_details
%hive
create external table if not exists ext_customer_details(
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string)
row format SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/customer'
%hive
--创建交易流水表
create external table if not exists ext_transaction_details(
transaction_id string,
customer_id string,
store_id string,
price decimal(10,2),
product string,
purchase_date string,
purchase_time string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"quoteChar"="\"",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/transaction'
%hive
--创建商店详情表
create external table if not exists ext_store_details(
store_id string,
store_name string,
employee_number int)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/store'
tblproperties("skip.head.line.count"="1")
%hive
--创建评价表
create external table if not exists ext_store_review(
stransaction_id string,
store_id string,
review_score int)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored as textfile
location '/apps/shopping/review'
4.2 Verify all Tables are Created
%hive
--select * from ext_customer_details limit 20
--select distinct language from ext_customer_details
--select * from ext_transaction_details limit 20
select transaction_id,count(transaction_id) cnt from ext_transaction_details group by transaction_id order by cnt desc
--select * from ext_store_details limit 20
--select * from ext_store_review limit 20
5.数据清洗
解决以下有问题的数据
- 对transaction_details中的重复数据生成新ID
- 过滤掉store_review中没有评分的数据
- 可以把清洗好的数据放到另一个表或者用View表示
- 找出PII(personal information identification)或PCI(personal confidential information)数据进行加密或hash
- 重新组织transaction数据按照日期YYYY-MM做分区
5.1 Clean and Mask customer_details
%hive
-- 敏感信息加密
-- drop view vm_customer_details
create view if not exists vm_customer_details as
select
customer_id ,
first_name ,
unbase64(last_name) lastname,
unbase64(email) email,
gender ,
unbase64(address) address,
country ,
language,
job ,
credit_type ,
unbase64(credit_no) credit_no
from
ext_customer_details
5.2 Clean transaction_details into partition table
%hive
-- 创建流水详情表
create table if not exists transaction_details
(
transaction_id string,
customer_id string,
store_id string,
price decimal(8,2),
product string,
purchase_date date,
purchase_time string
)
partitioned by(purchase_month string)
-- select transaction_id,count(1) from ext_transaction_details group by transaction_id having count(1)>1
-- select * from ext_transaction_details where transaction_id=8001
set hive.exec.dynamic.partition.mode=nonstrict -- 开启动态分区
-- 重写数据
with base as (
select
transaction_id,
customer_id ,
store_id ,
price ,
product,
purchase_date,
purchase_time,
from_unixtime(unix_timestamp(purchase_date,'yyyy-MM-dd'),'yyyy-MM') as purchase_month,
row_number() over (partition by transaction_id order by store_id) as rn
from ext_transaction_details
)
insert overwrite table transaction_details partition(purchase_month)
select
if(rn=1,transaction_id,concat_ws('-',transaction_id,'_fix')) ,
customer_id ,
store_id ,
price ,
product,
purchase_date ,
purchase_time,
purchase_month
from base
-- 查看修复信息
select * from transaction_details where transaction_id like '%fix%'
5.3 Clean store_review table
%hive
create view if not exists vw_store_review as
select
transaction_id,
review_score
from ext_store_review where review_score <> ''
show tables
最终会有七个表
6.Customer分析
6.1找出顾客最常用的信用卡
%hive
select credit_type,count(credit_type) cnt from customer_details group by credit_type order by cnt desc
6.2找出客户资料中排名前五的职位名称
%hive
select job ,count(1) as pn from customer_details group by job
order by pn desc limit 5
6.3在美国女性最常用的信用卡
%hive
select credit_type,count(credit_type) cnt from customer_details where gender='Female' and country='United States' group by credit_type order by cnt desc limit 10
6.4按性别和国家进行客户统计
%hive
select country,gender,count(1) cnt from customer_details group by country,gender
7.Transaction分析
7.1计算每月总收入
%hive
select month,store_id,sum(price) total from transaction_details group by store_id,month
7.2计算每个季度的总收入
%hive
select year(purchase_date) year,concat
("第",floor((month(purchase_date)-1)/3)+1,"季度") season,sum(price)
from transaction_details group by year(purchase_date) ,
concat("第",floor((month(purchase_date)-1)/3)+1,"季度") order by year desc,season
7.3按年计算总收入
%hive
select year(purchase_date) yy,sum(price) toatal
from transaction_details group by year(purchase_date) order by yy desc
7.4按工作日计算总收入
%hive
select dayofweek(purchase_date) work_date,sum(price) from transaction_details
where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)
7.5按时间段计算总收入(需要清理数据)
%hive
-- 使用正则表达式清理数据然后使用case when 分组查询
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
0,
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
else 'night'
end as timeSplit
from t1)
select t2.timeSplit,sum(price)
from t2
group by t2.timeSplit
7.6按时间段计算平均消费
%hive
-- 使用正则表达式清理数据然后使用case when 分组查询
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
0,
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
else 'night'
end as timeSplit
from t1)
select t2.timeSplit,avg(price)
from t2
group by t2.timeSplit
7.7按工作日计算平均消费
%hive
select dayofweek(purchase_date) work_date,avg(price) from transaction_details
where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)
7.8计算年、月、日的交易总数
%hive
--select purchase_date,count(1) from transaction_details group by purchase_date
--select concat(year(purchase_date),"-",month(purchase_date)),count(1) from transaction_details group by year(purchase_date),month(purchase_date)
--select year(purchase_date),count(1) from transaction_details group by year(purchase_date)\
select count(1) over(partition by year(purchase_date)) year,
count(1) over(partition by year(purchase_date),month(purchase_date)) month,
count(1) over(partition by year(purchase_date),month(purchase_date),day(purchase_date))day from transaction_details
7.9找出交易量最大的10个客户
%hive
select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a desc limit 10
7.10找出消费最多的前10位顾客
%hive
select customer_id,sum(price)a from transaction_details group by customer_id order by a desc limit 10
7.11统计该期间交易数量最少的用户
%hive
select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a limit 1
7.12计算每个季度的独立客户总数
%hive
select year(purchase_date) yy,
concat("季度",floor((month(purchase_date)-1)/3)+1) season,
count(distinct customer_id) from transaction_details group by year(purchase_date),
concat("季度",floor((month(purchase_date)-1)/3)+1)
order by yy desc,season
7.13计算每周的独立客户总数
%hive
select concat(year(purchase_date),'年第',weekofyear(purchase_date),'周'),count(distinct customer_id) from transaction_details group by year(purchase_date),weekofyear(purchase_date)
7.14计算整个活动客户平均花费的最大值
%hive
select a.customer_id,max(a.av) from (select customer_id,avg(price)av from
transaction_details group by customer_id)a group by a.customer_id
7.15统计每月花费最多的客户
%hive
select b.m,b.id,b.s
from(
select a.m,a.id,a.s ,row_number() over(partition by a.m order by a.s desc) as win1
from(
select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id,sum(price) s
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id)a) b
where b.win1=1
7.16统计每月访问次数最多的客户
%hive
select b.m,b.id,b.c
from(
select a.m,a.id,a.c,row_number() over(partition by a.m order by a.c desc) as win1
from(
select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id, count(1) c
from transaction_details
group by year(purchase_date),month(purchase_date),customer_id) a) b
where b.win1=1
7.17按总价找出最受欢迎的5种产品
%hive
select product,sum(price)a from transaction_details group by product order by a desc limit 5
7.18根据购买频率找出最畅销的5种产品
%hive
select product,count(transaction_id)a from transaction_details group by product order by a desc limit 5
7.19根据客户数量找出最受欢迎的5种产品
%hive
select product,count(distinct customer_id)a from transaction_details group by product order by a desc limit 5
7.20验证前5个details
%hive
select * from transaction_details where product in ('Goat - Whole Cut')
8.Store分析
8.1按客流量找出最受欢迎的商店
%hive
select store_id,count(transaction_id) a from transaction_details group by store_id order by a desc limit 1
8.2根据顾客消费价格找出最受欢迎的商店
%hive
select store_id,sum(price)a from transaction_details group by store_id order by a desc limit 1
8.3根据顾客交易情况找出最受欢迎的商店
%hive
select store_id,count(transaction_id)a,sum(price)b from transaction_details group by store_id order by a desc,b desc limit 1
8.4根据商店和唯一的顾客id获取最受欢迎的产品
%hive
select b.store_id,b.product
from (
select a.store_id,a.product,a.c ,row_number() over(partition by store_id order by a.c desc )as win1
from(
select store_id,product,count(distinct customer_id) c
from transaction_details
group by store_id,product) a )b
where b.win1 =1
8.5获取每个商店的员工与顾客比
%hive
select a.store_id,concat_ws(':',cast(ceil(round(s.employee_number/a.c*100))as string),'100')
from(
select t.store_id,count(distinct customer_id) c
from transaction_details t
group by t.store_id)a join ext_store_details s
on a.store_id=s.store_id
8.6按年和月计算每家店的收入
%hive
select store_id,year(purchase_date),month(purchase_date), sum(price) from transaction_details group by store_id,year(purchase_date),month(purchase_date)
8.7按店铺制作总收益饼图
%hive
select store_id,sum(price)a from transaction_details group by store_id
8.8找出每个商店最繁忙的时间段
%hive
with
t1 as(
select *, if(instr(purchase_time,'PM')>0,
if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,
0,
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),
cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans
from transaction_details), t2 as(
select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'
when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'
when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'
when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'
when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'
else 'night'
end as timeSplit
from t1),
t3 as(
select t2.store_id,t2.timeSplit,count(1) c
from t2
group by t2.store_id,t2.timeSplit),
t4 as(
select t3.store_id,t3.timeSplit,row_number() over(partition by store_id order by t3.timeSplit desc)as win1
from t3 )
select t4.store_id,t4.timeSplit
from t4
where t4.win1=1
8.9找出每家店的忠实顾客
%hive
select b.* from (select store_id,customer_id,count(transaction_id)a from transaction_details group by store_id,customer_id)b where b.a>=5
8.10根据每位员工的最高收入找出明星商店
%hive
-- 求总收入与雇员比值的最大值
with
t1 as (
select store_id,sum(price) s
from transaction_details
group by store_id)
select t1.store_id,t1.s/s.employee_number ss
from t1 join ext_store_details s on s.store_id= t1.store_id
order by ss desc
limit 1
9.Review分析
9.1在ext_store_review中找出存在冲突的交易映射关系
%hive
select stransaction_id
from store_review
group by stransaction_id
having count(1)>1
9.2了解客户评价的覆盖率
%hive
-- 求各个店共有多少顾客评价
with
t1 as(
select t2.store_id,t1.transaction_id,t2.customer_id
from store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id)
select t1.store_id,count(distinct t1.customer_id)
from t1
group by t1.store_id
9.3根据评分了解客户的分布情况
%hive
-- 求每家店每个评分有多少个客户给的
with
t1 as(
select t2.store_id ,t1.review_score,t2.customer_id
from store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id)
select t1.store_id,t1.review_score,count(distinct customer_id)
from t1
group by t1.store_id,t1.review_score
9.4根据交易了解客户的分布情况
%hive
-- 求每家店每个客户的订单数
select store_id,customer_id,count(1)
from transaction_details
group by store_id,customer_id
9.5客户给出的最佳评价是否总是同一家门店
%hive
with yes as(select customer_id,count(distinct s.store_id)cnt,"yes" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt=1),
no as(select customer_id,count(distinct s.store_id)cnt,"no" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt>1)
select * from yes union all select * from no
更多推荐
所有评论(0)