影评案例分析能够较为完整的学习Hive,今天在复习Hive的过程中又将这个案例操作了一遍。
一、数据说明和下载
在本案例中总共有三份数据,分别是:
(1)users.dat 数据格式为: 2::M::56::16::70072,
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
(2)movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy,
对应字段为:MovieID BigInt, Title String, Genres String
(3)ratings.dat 数据格式为: 1::1193::5::978300760,
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
数据下载链接:
https://github.com/qianhonglinIT/HiveMovieCaseStudy
<https://github.com/qianhonglinIT/HiveMovieCaseStudy>
二、建表和导入数据
在数据库中创建3张表,t_user,t_movie,t_rating,由于原始数据是使用::进行分割的,所以需要使用能解析多字节分隔符的Serde
create table t_user( userid bigint, sex string, age int, occupation string,
zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with
serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s %4$s %5$s') stored as textfile; use movie; create table t_movie(
movieid bigint, moviename string, movietype string) row format serde
'org.apache.hadoop.hive.serde2.RegexSerDe' with
serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s') stored as textfile; use movie; create table t_rating( userid
bigint, movieid bigint, rate double, times string) row format serde
'org.apache.hadoop.hive.serde2.RegexSerDe' with
serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s %4$s') stored as textfile;
导入数据:
load data local inpath "/home/hive/users.dat" into table t_user; load data
local inpath "/home/hive/movies.dat" into table t_movie; load data local inpath
"/home/hive/ratings.dat" into table t_rating;
三、题目和解答
(1)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
分析:按照电影名进行分组统计,求出每部电影的评分次数并按照评分次数降序排序
create table question1 as select a.moviename as moviename,count(a.moviename)
as total from t_movie a join t_rating b on a.movieid=b.movieid group by
a.moviename order by total desc limit 10;
结果:
(2)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
分析:三表联合查询,按照性别过滤条件,电影名作为分组条件,影评分作为排序条件进行查询
女性:
create table question2_1 as select "F" as sex, c.moviename as name,
avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user
b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="F"
group by c.moviename having total >= 50 order by avgrate desc limit 10;
结果:
男性:
create table question2_2 as select "M" as sex, c.moviename as name,
avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user
b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="M"
group by c.moviename having total >= 50 order by avgrate desc limit 10;
结果:
明天继续。。。。。。。。。。。。。。。
热门工具 换一换