Bootstrap

Hadoop生态系统Hive:SQL执行(一)

本文主要是记录一些Hive的SQL语句的特殊之处。

实际问题记录

知识点一:Hive直接访问非GROUP BY字段报错

报错信息提示如下:

Error while compiling statement: FAILED: SemanticException [Error 10025]: line 1:7 Expression not in GROUP BY key 'sex'

原因分析和解决

Hive中不允许直接访问非GROUP BY字段,如果需要访问,可以用 collect_set函数收集这些字段,将该字段的所有值保存在一个数组中,使用下标访问其中的元素[1]。

SELECT collect_set(t_user.sex)[0] AS sex, t_movie.moviename,
        AVG(t_rating.rate) AS avgrate, COUNT(*) AS total
    FROM hive_sql_test1.t_user AS t_user
    INNER JOIN hive_sql_test1.t_rating AS t_rating
        ON t_user.userid=t_rating.userid
    INNER JOIN hive_sql_test1.t_movie AS t_movie
        ON t_movie.movieid=t_rating.movieid
    WHERE t_user.sex='M'
    GROUP BY t_movie.moviename
    HAVING total>50
    ORDER BY avgrate DESC
    LIMIT 10;

知识点二:Hive不支持IN/EXISTS

Hive不支持IN,可以采用来替代[2]。

CREATE TABLE A (
    X varchar(255)
);

CREATE TABLE B (
    Y varchar(255)
);

INSERT INTO A VALUES ('Amy'), ('John'), ('Lisa'), ('Marco'), ('Phil');
INSERT INTO B VALUES ('Lisa'), ('Marco'), ('Phil'), ('Tim'),('Vincent');

SELECT B.Y FROM A
    LEFT SEMI JOIN B
    ON A.X = B.Y

等价于:

SELECT B.Y FROM A
	WHERE A.X
  IN (SELECT B.Y FROM B);

参考资料