Hive练习之影评案例分析（一） - 好文

影评案例分析能够较为完整的学习Hive，今天在复习Hive的过程中又将这个案例操作了一遍。

一、数据说明和下载

在本案例中总共有三份数据，分别是：
（1）users.dat 数据格式为： 2::M::56::16::70072，
对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String

（2）movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children's|Fantasy，
对应字段为：MovieID BigInt, Title String, Genres String

（3）ratings.dat 数据格式为： 1::1193::5::978300760，
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String

数据下载链接：

https://github.com/qianhonglinIT/HiveMovieCaseStudy
<https://github.com/qianhonglinIT/HiveMovieCaseStudy>

二、建表和导入数据

在数据库中创建3张表，t_user，t_movie，t_rating，由于原始数据是使用::进行分割的，所以需要使用能解析多字节分隔符的Serde
create table t_user( userid bigint, sex string, age int, occupation string,
zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with
serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s %4$s %5$s') stored as textfile; use movie; create table t_movie(
movieid bigint, moviename string, movietype string) row format serde
'org.apache.hadoop.hive.serde2.RegexSerDe' with
serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s') stored as textfile; use movie; create table t_rating( userid
bigint, movieid bigint, rate double, times string) row format serde
'org.apache.hadoop.hive.serde2.RegexSerDe' with
serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s
%2$s %3$s %4$s') stored as textfile;
导入数据：
load data local inpath "/home/hive/users.dat" into table t_user; load data
local inpath "/home/hive/movies.dat" into table t_movie; load data local inpath
"/home/hive/ratings.dat" into table t_rating;
三、题目和解答

（1）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

分析：按照电影名进行分组统计，求出每部电影的评分次数并按照评分次数降序排序
create table question1 as select a.moviename as moviename,count(a.moviename)
as total from t_movie a join t_rating b on a.movieid=b.movieid group by
a.moviename order by total desc limit 10;
结果：

（2）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

分析：三表联合查询，按照性别过滤条件，电影名作为分组条件，影评分作为排序条件进行查询

女性：
create table question2_1 as select "F" as sex, c.moviename as name,
avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user
b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="F"
group by c.moviename having total >= 50 order by avgrate desc limit 10;
结果：

男性：
create table question2_2 as select "M" as sex, c.moviename as name,
avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user
b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="M"
group by c.moviename having total >= 50 order by avgrate desc limit 10;
结果：

明天继续。。。。。。。。。。。。。。。

热门工具换一换