【HIVE】数据分析HQL的编写方法/思路
阅读原文时间:2023年07月10日阅读:1

SQL编写一般思路:

1)复杂的查询,先划分为小任务,以降低难度。分别实现各个小任务后,再进行汇总;
2)涉及多表时,先进行联表查询;
3)简单分组,一般只需要group by即可;
4)组内TopN问题,使用row_number,rank,dense_rank;
5)熟练掌握常用函数;

1)字符串

split,分割字符串为数组,split(“a|b|c”, “\|”) => 返回数组 [a, b, c]

参数1:待分割到字符串;

参数2:分割字符,因为"|“在Java中是特殊字符,所以需要进行转义,转义使用两个”";

substr,取子字符串,substr(moviename, -5, 4)

参数1:原字符串;

参数2:截取的开始位置,如果是负数,则从右往左计数,如-1表示最后一个字符,-2表示倒数第二个字符;

参数3:截取长度;

示例:
        0: jdbc:hive2://master:10000> select *, substr(moviename, -5, 4) as year from t_movie limit 5;
        +------------------+-------------------------------------+-------------------------------+-------+
        | t_movie.movieid  |          t_movie.moviename          |       t_movie.movietype       | year  |
        +------------------+-------------------------------------+-------------------------------+-------+
        | 1                | Toy Story (1995)                    | Animation|Children's|Comedy   | 1995  |
        | 2                | Jumanji (1995)                      | Adventure|Children's|Fantasy  | 1995  |
        | 3                | Grumpier Old Men (1995)             | Comedy|Romance                | 1995  |
        | 4                | Waiting to Exhale (1995)            | Comedy|Drama                  | 1995  |
        | 5                | Father of the Bride Part II (1995)  | Comedy                        | 1995  |
        +------------------+-------------------------------------+-------------------------------+-------+

2)时间函数

year,获取时间的年份;

month,获取时间的月份;

from_unixtime,将时间戳转换为时间;

unix_timestamp():获取当前时间戳;

unix_timestamp(string date):时间转换为时间戳;

获取当前日期 & 时间:
            当前日期:
                    0: jdbc:hive2://master:10000> select current_date();
                    +-------------+
                    |     _c0     |
                    +-------------+
                    | 2019-09-21  |
                    +-------------+
            当前时间:
                    0: jdbc:hive2://master:10000> select current_timestamp();
                    +--------------------------+
                    |           _c0            |
                    +--------------------------+
                    | 2019-09-21 18:05:27.768  |
                    +--------------------------+
            当前时间戳:
                    0: jdbc:hive2://master:10000> select unix_timestamp();
                    +-------------+
                    |     _c0     |
                    +-------------+
                    | 1569060416  |
                    +-------------+

        从时间中获取年份:
                    0: jdbc:hive2://master:10000> select year("2019-09-21 18:05:27.768 ") as year;
                    +-------+
                    | year  |
                    +-------+
                    | 2019  |
                    +-------+

        从时间戳中获取月份:
                    0: jdbc:hive2://master:10000> select month(from_unixtime(1569060416)) as month;
                    +--------+
                    | month  |
                    +--------+
                    | 9      |
                    +--------+

3)聚合函数

sum、avg等;

4)explode

将数组等拆分为多行

0: jdbc:hive2://master:10000> select m.*, t.type from t_movie m lateral view explode(split(movietype, "\\|")) t as type limit 10;
        +------------+---------------------------+-------------------------------+-------------+
        | m.movieid  |        m.moviename        |          m.movietype          |   t.type    |
        +------------+---------------------------+-------------------------------+-------------+
        | 1          | Toy Story (1995)          | Animation|Children's|Comedy   | Animation   |
        | 1          | Toy Story (1995)          | Animation|Children's|Comedy   | Children's  |
        | 1          | Toy Story (1995)          | Animation|Children's|Comedy   | Comedy      |
        | 2          | Jumanji (1995)            | Adventure|Children's|Fantasy  | Adventure   |
        | 2          | Jumanji (1995)            | Adventure|Children's|Fantasy  | Children's  |
        | 2          | Jumanji (1995)            | Adventure|Children's|Fantasy  | Fantasy     |
        | 3          | Grumpier Old Men (1995)   | Comedy|Romance                | Comedy      |
        | 3          | Grumpier Old Men (1995)   | Comedy|Romance                | Romance     |
        | 4          | Waiting to Exhale (1995)  | Comedy|Drama                  | Comedy      |
        | 4          | Waiting to Exhale (1995)  | Comedy|Drama                  | Drama       |
        +------------+---------------------------+-------------------------------+-------------+

5)collect_set,可以理解为该函数实现了explode相反到功能;

collect_list:可以包含重复数据;collect_set:去重;

将多行某字段到数据,合并为一个数组,需要结合group by进行分组,以确定合并到行到范围。

0: jdbc:hive2://master:10000> select moviename, collect_set(type) as types from (select m.*, t.type from t_movie m lateral view explode(split(movietype, "\\|")) t as type limit 10) t group by moviename;
        +---------------------------+---------------------------------------+
        |         moviename         |                 types                 |
        +---------------------------+---------------------------------------+
        | Jumanji (1995)            | ["Children's","Adventure","Fantasy"]  |
        | Toy Story (1995)          | ["Comedy","Children's","Animation"]   |
        | Grumpier Old Men (1995)   | ["Comedy","Romance"]                  |
        | Waiting to Exhale (1995)  | ["Drama","Comedy"]                    |
        +---------------------------+---------------------------------------+


2. 常见场景及方法
    1)简单条件过滤;
        使用where,显示movieid为1到电影名:
        0: jdbc:hive2://master:10000> select moviename from t_movie where movieid = 1;
        +-------------------+
        |     moviename     |
        +-------------------+
        | Toy Story (1995)  |
        +-------------------+

    2)联表条件过滤;
        使用join on,获取评分为5的电影名:
        select moviename, rate from t_rating r join t_movie m on r.movieid=m.movieid where rate=5 limit 5;
        +-----------------------------------------+-------+
        |                moviename                | rate  |
        +-----------------------------------------+-------+
        | One Flew Over the Cuckoo's Nest (1975)  | 5.0   |
        | Bug's Life, A (1998)                    | 5.0   |
        | Ben-Hur (1959)                          | 5.0   |
        | Christmas Story, A (1983)               | 5.0   |
        | Beauty and the Beast (1991)             | 5.0   |
        +-----------------------------------------+-------+

    3)分组统计;
        使用group by和聚合函数
        获取电影的平均评分:
        select movieid, avg(rate) avg_rate from t_rating group by movieid order by movieid limit 5;
        +----------+---------------------+
        | movieid  |      avg_rate       |
        +----------+---------------------+
        | 1        | 4.146846413095811   |
        | 2        | 3.20114122681883    |
        | 3        | 3.01673640167364    |
        | 4        | 2.7294117647058824  |
        | 5        | 3.0067567567567566  |
        +----------+---------------------+

    4)组内TopN;
        使用row_number,rank,dense_rank;
        获取各部门工资最高的三名员工:
        select * from (select deptid, name, (salary+nvl(bonus, 0)) salary, dense_rank() over(partition by deptid order by salary desc) as rank from emp) t where rank<=3;
        +-----------+---------+-----------+---------+
        | t.deptid  | t.name  | t.salary  | t.rank  |
        +-----------+---------+-----------+---------+
        | 10        | KING    | 5000.0    | 1       |
        | 10        | CLARK   | 2450.0    | 2       |
        | 10        | MILLER  | 1300.0    | 3       |
        | 20        | FORD    | 3000.0    | 1       |
        | 20        | SCOTT   | 3000.0    | 1       |
        | 20        | JONES   | 2975.0    | 2       |
        | 20        | ADAMS   | 1100.0    | 3       |
        | 30        | BLAKE   | 2850.0    | 1       |
        | 30        | ALLEN   | 1900.0    | 2       |
        | 30        | TURNER  | 1500.0    | 3       |
        +-----------+---------+-----------+---------+

    5)累加;
        使用sum() over(partition by order by)
        获取每个人按月累计消费:
        select name, dt, cost, sum(cost) over(partition by name, month(dt) order by cost) as sum from t_order order by name;
        +-------+-------------+-------+------+
        | name  |     dt      | cost  | sum  |
        +-------+-------------+-------+------+
        | jack  | 2015-01-01  | 10    | 10   |
        | jack  | 2015-01-05  | 46    | 56   |
        | jack  | 2015-01-08  | 55    | 111  |
        | jack  | 2015-02-03  | 23    | 23   |
        | jack  | 2015-04-06  | 42    | 42   |
        | mart  | 2015-04-08  | 62    | 62   |
        | mart  | 2015-04-09  | 68    | 130  |
        | mart  | 2015-04-11  | 75    | 205  |
        | mart  | 2015-04-13  | 94    | 299  |

        注意,如果over语句中没有order by,则求取的是该月份的消费总金额,而不会出现累加。
        select name, dt, cost, sum(cost) over(partition by name, month(dt)) as sum from t_order order by name;
        +-------+-------------+-------+------+
        | name  |     dt      | cost  | sum  |
        +-------+-------------+-------+------+
        | jack  | 2015-01-01  | 10    | 111  |
        | jack  | 2015-01-05  | 46    | 111  |
        | jack  | 2015-01-08  | 55    | 111  |
        | jack  | 2015-02-03  | 23    | 23   |
        | jack  | 2015-04-06  | 42    | 42   |
        | mart  | 2015-04-08  | 62    | 299  |
        | mart  | 2015-04-09  | 68    | 299  |
        | mart  | 2015-04-11  | 75    | 299  |
        | mart  | 2015-04-13  | 94    | 299  |

手机扫一扫

移动阅读更方便

阿里云服务器
腾讯云服务器
七牛云服务器

你可能感兴趣的文章