Hadoop生态系统Hive:SQL执行(一)
本文主要是记录一些Hive的SQL语句的特殊之处。
实际问题记录
知识点一:Hive直接访问非GROUP BY字段报错
报错信息提示如下:
Error while compiling statement: FAILED: SemanticException [Error 10025]: line 1:7 Expression not in GROUP BY key 'sex'
原因分析和解决
Hive中不允许直接访问非GROUP BY字段,如果需要访问,可以用 collect_set函数收集这些字段,将该字段的所有值保存在一个数组中,使用下标访问其中的元素[1]。
SELECT collect_set(t_user.sex)[0] AS sex, t_movie.moviename,
AVG(t_rating.rate) AS avgrate, COUNT(*) AS total
FROM hive_sql_test1.t_user AS t_user
INNER JOIN hive_sql_test1.t_rating AS t_rating
ON t_user.userid=t_rating.userid
INNER JOIN hive_sql_test1.t_movie AS t_movie
ON t_movie.movieid=t_rating.movieid
WHERE t_user.sex='M'
GROUP BY t_movie.moviename
HAVING total>50
ORDER BY avgrate DESC
LIMIT 10;
知识点二:Hive不支持IN/EXISTS
Hive不支持IN,可以采用来替代[2]。
CREATE TABLE A (
X varchar(255)
);
CREATE TABLE B (
Y varchar(255)
);
INSERT INTO A VALUES ('Amy'), ('John'), ('Lisa'), ('Marco'), ('Phil');
INSERT INTO B VALUES ('Lisa'), ('Marco'), ('Phil'), ('Tim'),('Vincent');
SELECT B.Y FROM A
LEFT SEMI JOIN B
ON A.X = B.Y
等价于:
SELECT B.Y FROM A
WHERE A.X
IN (SELECT B.Y FROM B);