Learning Druid: Query Syntax
Original article read on: 2023-07-10

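Preface

For a while now I have been working on real-time data queries with Druid. This post is a rough translation of the official English documentation combined with my own understanding, plus a list of the problems I ran into and how I solved them, so that I don't forget.

All demo examples in this post come from the official site.

Druid Query Overview

Druid queries are REST-style HTTP requests sent to a queryable node; the client posts a JSON object to the query endpoint. You can run queries from a shell script, or build the POST request with Google's ARC (Advanced REST Client) plugin.

Building a Query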

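curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

Here <queryable_host>:<port> is the IP and service port of the machine running a broker, historical, or realtime process, and <query_json_file> is the path to the JSON query file.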

See the figure below.
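The curl call above can also be sketched in Python. This is a minimal illustration that only builds the request object; the host `localhost`, port `8082`, and the sample query are assumptions chosen for the example, not values from the original post.

```python
import json
import urllib.request


def build_druid_request(host, port, query):
    """Build (but do not send) a POST request for the /druid/v2/ endpoint."""
    url = f"http://{host}:{port}/druid/v2/?pretty"
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port and query, for illustration only.
query = {
    "queryType": "timeseries",
    "dataSource": "my_dataSource",
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "count"}],
    "intervals": ["2000-01-01T00:00Z/3000-01-01T00:00Z"],
}
request = build_druid_request("localhost", 8082, query)
# To actually send it against a running node:
# print(urllib.request.urlopen(request).read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before touching a real cluster.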

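Druid Queries

Different scenarios call for different query types. Druid has many query types, and each is configured through its own query object. They fall into three categories:

1. Aggregation queries: Timeseries, TopN, GroupBy
2. Metadata queries: Time Boundary, Segment Metadata, DataSource Metadata
3. Search queries: Search

Aggregation queries are used most often. The other types have fewer use cases and are simpler, so the demos on the official site are enough to get going; this post focuses on aggregation queries. In general, Timeseries and TopN perform better than GroupBy, which is the most flexible query type but also the most expensive. Timeseries is noticeably faster than GroupBy because it does not group by any dimension; when grouping and sorting on a single dimension, TopN is the better choice over GroupBy.

A Druid query mainly contains the following properties:

1. queryType: the query type, i.e. timeseries, topN, groupBy, and so on;
2. dataSource: the data source, similar to a table in MySQL;
3. granularity: the aggregation granularity, such as none, all, week, day, hour;
4. filter: the filter condition, similar to a WHERE clause in MySQL;
5. aggregations: the aggregators, similar to count, sum and other such operations in MySQL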

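2.1 Introduction to granularity

2.1.1 Simple granularities

The simple granularities are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, year. What a simple granularity returns is bounded by the finest granularity at which the data was stored in Druid: if the data was built at hour granularity and you query with minute granularity, the result is still at hour granularity.

Suppose the data stored in Druid was built at millisecond granularity and looks like this: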

{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}

Submit a groupBy query with hour granularity:

{
"queryType":"groupBy",
"dataSource":"my_dataSource",
"granularity":"hour",
"dimensions":[
"language"
],
"aggregations":[
{
"type":"count",
"name":"count"
}
],
"intervals":[
"2000-01-01T00:00Z/3000-01-01T00:00Z"
]
}

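In the result of an hour-granularity groupBy query, each timestamp is truncated to the hour and every finer field is zero-filled; likewise, with day granularity the hour and all finer fields are zeroed. Timestamps are in UTC. The result is: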

[ {
"version" : "v1",
"timestamp" : "2013-08-31T01:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-01T01:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-02T23:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-03T03:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
} ]

If the granularity is day, data is aggregated per day and the result is:

[ {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-01T00:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-02T00:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-03T00:00:00.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
} ]

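If the granularity is none, data is queried at the finest granularity at which it was built in Druid, i.e. no time bucketing is applied. If the data was built at millisecond granularity, the result is at millisecond granularity too: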

[ {
"version" : "v1",
"timestamp" : "2013-08-31T01:02:33.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-01T01:02:33.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-02T23:32:45.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-03T03:32:45.000Z",
"event" : {
"count" : 1,
"language" : "en"
}
} ]

If the granularity is all, the result has length 1: all data in the queried interval is rolled into a single summary:

[ {
  "version" : "v1",
  "timestamp" : "2000-01-01T00:00:00.000Z",
  "event" : {
    "count" : 4,
    "language" : "en"
  }
} ]
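The bucketing behaviour of the simple granularities above can be reproduced in a few lines of Python. This is only a sketch of the time-truncation idea, not how Druid implements it; the helper names `truncate` and `count_by_bucket` are made up for the example, and only none, hour, day and all are covered.

```python
from collections import Counter
from datetime import datetime, timezone

# The four sample events from this section.
TIMESTAMPS = [
    "2013-08-31T01:02:33Z",
    "2013-09-01T01:02:33Z",
    "2013-09-02T23:32:45Z",
    "2013-09-03T03:32:45Z",
]


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def truncate(dt, granularity, interval_start):
    """Floor a UTC timestamp to the start of its bucket."""
    if granularity == "none":
        return dt
    if granularity == "hour":
        return dt.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return dt.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "all":
        return interval_start  # one bucket for the whole query interval
    raise ValueError(f"unsupported granularity: {granularity}")


def count_by_bucket(timestamps, granularity, interval_start):
    buckets = Counter(
        truncate(parse(ts), granularity, interval_start) for ts in timestamps
    )
    return {
        start.strftime("%Y-%m-%dT%H:%M:%S.000Z"): count
        for start, count in sorted(buckets.items())
    }


interval_start = datetime(2000, 1, 1, tzinfo=timezone.utc)
hourly = count_by_bucket(TIMESTAMPS, "hour", interval_start)
daily = count_by_bucket(TIMESTAMPS, "day", interval_start)
total = count_by_bucket(TIMESTAMPS, "all", interval_start)
```

The resulting dictionaries carry the same timestamps and counts as the hour, day and all results listed above.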

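2.1.2 Duration granularities

Aggregates over buckets of a fixed length, given in milliseconds; timestamps are returned in UTC. The optional origin property sets the point from which buckets are counted; if it is not given, the default start is 1970-01-01T00:00:00Z.

A duration of 2 hours, counted from 1970-01-01T00:00:00: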

{"type": "duration", "duration": 7200000}
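How a duration granularity assigns a timestamp to a bucket can be sketched as follows. The function name `duration_bucket_start` is hypothetical, and the arithmetic is the plain floor-to-multiple rule implied by the description above rather than Druid's actual code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def duration_bucket_start(ts, duration_ms, origin=EPOCH):
    """Return the start of the fixed-length bucket that contains ts."""
    elapsed_ms = (ts - origin) // timedelta(milliseconds=1)
    bucket_index = elapsed_ms // duration_ms  # floor division
    return origin + timedelta(milliseconds=bucket_index * duration_ms)


ts = datetime(2013, 8, 31, 1, 2, 33, tzinfo=timezone.utc)

# 7200000 ms = 2 hours, counted from the default origin.
default_origin_bucket = duration_bucket_start(ts, 7_200_000)

# Shifting the origin by one hour shifts every bucket boundary by one hour.
shifted_origin = EPOCH + timedelta(hours=1)
shifted_origin_bucket = duration_bucket_start(ts, 7_200_000, shifted_origin)
```

With the default origin the event at 01:02:33 falls into the bucket starting at 00:00; with the origin moved to 01:00 it falls into the bucket starting at 01:00.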

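2.1.3 Period granularities

A convenient special case of time-based granularity that uses calendar periods such as year, month, day, or hour, written in ISO 8601 format. Unless specified otherwise, years start in January, months start on the 1st, and weeks start on Monday.

The general format is shown below; timeZone is optional and defaults to UTC, and origin is optional and defaults to 1970-01-01T00:00:00: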

{"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}

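Common ways to write period:

month: P2M uses 2 months as one bucket;
week: P2W uses 2 weeks as one bucket;
day: P1D uses 1 day as one bucket;
hour: PT1H uses 1 hour as one bucket;
sub-second: PT0.750S uses 750 milliseconds as one bucket;

For example, a groupBy query that uses one day as the granularity: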

{
"queryType":"groupBy",
"dataSource":"my_dataSource",
"granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
"dimensions":[
"language"
],
"aggregations":[
{
"type":"count",
"name":"count"
}
],
"intervals":[
"1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00"
]
}

The result is:

[ {
"version" : "v1",
"timestamp" : "2013-08-30T00:00:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000-07:00",
"event" : {
"count" : 1,
"language" : "en"
}
}, {
"version" : "v1",
"timestamp" : "2013-09-02T00:00:00.000-07:00",
"event" : {
"count" : 2,
"language" : "en"
}
} ]
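To see why the four sample events collapse into three daily buckets here, the conversion can be sketched in Python. As a simplifying assumption the sketch uses a fixed UTC offset instead of a full time-zone database: America/Los_Angeles was on daylight time (UTC-07:00) on these dates. The function name `daily_counts` is made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# The four sample events from section 2.1.1.
TIMESTAMPS = [
    "2013-08-31T01:02:33Z",
    "2013-09-01T01:02:33Z",
    "2013-09-02T23:32:45Z",
    "2013-09-03T03:32:45Z",
]


def daily_counts(timestamps, utc_offset_hours):
    """Count events per local calendar day for a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter()
    for ts in timestamps:
        utc = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        local_midnight = utc.astimezone(tz).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        counts[local_midnight.isoformat(timespec="milliseconds")] += 1
    return dict(counts)


# Assumption: fixed -07:00 offset stands in for America/Los_Angeles.
los_angeles = daily_counts(TIMESTAMPS, -7)
# The same events bucketed with a +08:00 offset.
beijing = daily_counts(TIMESTAMPS, 8)
```

The last two events both fall on 2013-09-02 in local Los Angeles time, which is why that bucket has a count of 2.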

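The official example uses the Los Angeles time zone. For China the time zone is typically set as shown below; other time zones can be looked up through the linked reference: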

"granularity": {
"period": "PT1H",
"timeZone": "+08:00",
"type": "period"
}

2.2 Introduction to filter

A filter is a JSON object that represents one filter condition, equivalent to a WHERE condition in MySQL. The main filter types are: Selector filter, Regular expression filter, Logical expression filters (AND, OR, NOT), In filter, Bound filter, Search filter, JavaScript filter, and Extraction filter.

2.2.1 Selector filter

Equivalent to WHERE <dimension_string> = '<dimension_value_string>'

JSON format:

"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }

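2.2.2 Regular expression filter

Similar to the Selector filter, except that it matches with a regular expression; the pattern follows the standard Java regular expression syntax.

"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }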

2.2.3 Logical expression filters

AND

"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }

OR

"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }

NOT

"filter": { "type": "not", "field": <filter> }

IN

The in filter below is equivalent to this SQL:

SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')
{
"type": "in",
"dimension": "outlaw",
"values": ["Good", "Bad", "Ugly"]
}

BOUND

Numeric: 21 <= age <= 31

{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }

Numeric: 21 < age < 31

{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }

String: 'foo' <= name <= 'hoo'

{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }
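The semantics of the filters in this section can be made concrete with a small evaluator that checks a filter object against one row. This is an illustrative sketch, not Druid code: the function names are hypothetical, Python's `re` stands in for Java regular expressions, treating the regex as an anywhere-in-the-value match is an assumption, and rows are assumed to hold single string values.

```python
import re


def matches(flt, row):
    """Evaluate a Druid-style filter dict against one row (dict of dimension values)."""
    kind = flt["type"]
    if kind == "selector":
        return row.get(flt["dimension"]) == flt["value"]
    if kind == "regex":
        value = row.get(flt["dimension"])
        return value is not None and re.search(flt["pattern"], value) is not None
    if kind == "and":
        return all(matches(f, row) for f in flt["fields"])
    if kind == "or":
        return any(matches(f, row) for f in flt["fields"])
    if kind == "not":
        return not matches(flt["field"], row)
    if kind == "in":
        return row.get(flt["dimension"]) in flt["values"]
    if kind == "bound":
        return in_bound(flt, row.get(flt["dimension"]))
    raise ValueError(f"unsupported filter type: {kind}")


def in_bound(flt, value):
    """Check lower/upper bounds; bounds are inclusive unless marked strict."""
    if value is None:
        return False
    # "numeric" ordering compares numbers; otherwise compare as strings.
    cast = float if flt.get("ordering") == "numeric" else str
    value = cast(value)
    if "lower" in flt:
        lower = cast(flt["lower"])
        if value < lower or (flt.get("lowerStrict", False) and value == lower):
            return False
    if "upper" in flt:
        upper = cast(flt["upper"])
        if value > upper or (flt.get("upperStrict", False) and value == upper):
            return False
    return True


# Hypothetical row used to try the filters out.
row = {"age": "21", "name": "goo", "outlaw": "Bad"}
inclusive = {"type": "bound", "dimension": "age", "lower": "21", "upper": "31", "ordering": "numeric"}
strict = dict(inclusive, lowerStrict=True, upperStrict=True)
```

For the sample row, `matches(inclusive, row)` holds because the bounds are inclusive, while `matches(strict, row)` does not because age equals the strict lower bound.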

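2.3 Introduction to aggregations

Aggregations specify how data is summarized, both when it is rolled up at ingestion time before being stored in Druid and at query time. The common aggregator types are: count, sum, min/max, approximate, and miscellaneous.

2.3.1 Count aggregator

Counts the rows that match the query, similar to count in MySQL:

{ "type" : "count", "name" : <output_name> }

Note: the count aggregator counts Druid rows, which is not necessarily the number of raw events that were ingested, because Druid may already have rolled the data up at ingestion time according to the configured aggregations.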
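The difference described in the note can be illustrated with a toy rollup. In this sketch events with identical dimension values collapse into one stored row that carries an ingestion-time count; timestamps are ignored for simplicity, and the function name `rollup` and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def rollup(events, dims):
    """Collapse raw events that share the same dimension values into one stored row."""
    stored = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dims)
        stored[key] += 1  # ingestion-time "count" metric
    return [dict(zip(dims, key), count=n) for key, n in stored.items()]


# Hypothetical raw events: two of them are identical and roll up into one row.
raw_events = [
    {"page": "AAA", "language": "en"},
    {"page": "AAA", "language": "en"},
    {"page": "BBB", "language": "en"},
]
rows = rollup(raw_events, ["page", "language"])

# What a query-time count aggregator sees: the number of stored rows.
stored_row_count = len(rows)
# What a longSum over the ingested count metric returns: the number of raw events.
ingested_event_count = sum(r["count"] for r in rows)
```

Here three raw events become two stored rows, so a query-time count yields 2 while summing the ingested count metric recovers 3.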

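2.3.2 Sum aggregator

Use the sum aggregator that matches the type of the column in the underlying Druid data source.

longSum

{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }

doubleSum

{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }

2.3.3 MIN/MAX aggregator

doubleMin

{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }

doubleMax

{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }

The long variants work the same way and are not repeated here. For other aggregators, see the examples on the official site.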
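As a rough illustration of how the aggregator specs above map onto computations, the following sketch applies a list of specs to a handful of rows. The dispatch table, the function name `aggregate`, and the sample rows are all assumptions for the example; Druid evaluates aggregators per segment and merges partial results, which this sketch does not model.

```python
# Each aggregator type maps to a function of (rows, fieldName).
AGGREGATORS = {
    "count": lambda rows, field: len(rows),
    "longSum": lambda rows, field: sum(int(r[field]) for r in rows),
    "doubleSum": lambda rows, field: sum(float(r[field]) for r in rows),
    "doubleMin": lambda rows, field: min(float(r[field]) for r in rows),
    "doubleMax": lambda rows, field: max(float(r[field]) for r in rows),
}


def aggregate(rows, specs):
    """Apply each aggregator spec and return {output name: value}."""
    return {
        spec["name"]: AGGREGATORS[spec["type"]](rows, spec.get("fieldName"))
        for spec in specs
    }


# Hypothetical rows and metric names.
rows = [
    {"clicks": 3, "latency": 1.5},
    {"clicks": 5, "latency": 0.5},
    {"clicks": 2, "latency": 4.0},
]
specs = [
    {"type": "count", "name": "rows"},
    {"type": "longSum", "name": "total_clicks", "fieldName": "clicks"},
    {"type": "doubleSum", "name": "total_latency", "fieldName": "latency"},
    {"type": "doubleMin", "name": "min_latency", "fieldName": "latency"},
    {"type": "doubleMax", "name": "max_latency", "fieldName": "latency"},
]
summary = aggregate(rows, specs)
```

The `name` of each spec becomes the key in the output, mirroring how the aggregator's output name appears in a Druid result.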

2.4 Aggregation queries

2.4.1 Timeseries query

query

{
"queryType": "timeseries",
"dataSource": "sample_datasource",
"granularity": "day",
"descending": "true",
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
{ "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
{ "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
{ "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

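Query properties

Property | Description | Required
queryType | String; must be "timeseries" |
dataSource | String; the data source |
descending | Whether to sort in descending order; defaults to false |
intervals | Time range to query |
granularity | Aggregation granularity (see 2.1) |
filter | Filter condition (see 2.2) |
aggregations | Aggregations (see 2.3) |
postAggregations | Post-aggregations |
context | Query context | No

The query above returns: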

[
{
"timestamp": "2012-01-01T00:00:00.000Z",
"result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
},
{
"timestamp": "2012-01-02T00:00:00.000Z",
"result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
}
]
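The `sample_divide` value in each result row comes from the arithmetic post-aggregator, which runs over the already-aggregated values. A minimal sketch of that step follows; the function name `apply_arithmetic` and the sample numbers are assumptions, and only `fieldAccess` operands are handled.

```python
def apply_arithmetic(post_agg, aggregated):
    """Evaluate an "arithmetic" post-aggregator over already-aggregated values."""
    operands = [aggregated[f["fieldName"]] for f in post_agg["fields"]]
    fn = post_agg["fn"]
    result = operands[0]
    for operand in operands[1:]:
        if fn == "+":
            result = result + operand
        elif fn == "-":
            result = result - operand
        elif fn == "*":
            result = result * operand
        elif fn == "/":
            # Druid's arithmetic post-aggregator yields 0 when dividing by zero.
            result = result / operand if operand != 0 else 0
        else:
            raise ValueError(f"unsupported fn: {fn}")
    return result


post_agg = {
    "type": "arithmetic",
    "name": "sample_divide",
    "fn": "/",
    "fields": [
        {"type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1"},
        {"type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2"},
    ],
}
# Hypothetical aggregated values for one time bucket.
bucket = {"sample_name1": 10, "sample_name2": 4.0}
bucket[post_agg["name"]] = apply_arithmetic(post_agg, bucket)
```

Note that `fieldName` refers to the output name of an aggregator in the same query, not to a column in the data source.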

2.4.2 TopN query

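A TopN query returns a sorted set of results for the values of a given dimension according to some criterion. Conceptually it can be thought of as an approximate, ordered groupBy over a single dimension, and in some cases it is faster than a groupBy query. The result is returned as a JSON array. TopN is approximate: each node ranks its own top K results and returns only those to the broker, where K defaults to max(1000, threshold). In practice, if you ask for the top 1000 items, the items ranked roughly 1 to 999 are ordered correctly, but the order of the items after that is not guaranteed. You can get more accurate ordering by increasing threshold.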
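The per-node top-K behaviour described above is what makes TopN approximate. The toy simulation below shows how a value that ranks third on every node can still be the true overall winner and be missed when K is small; the node data, function names, and the tiny K are assumptions chosen to make the effect visible.

```python
from collections import Counter


def approximate_top_n(per_node_counts, threshold, k):
    """Each node returns only its local top k; the broker merges and keeps `threshold`."""
    merged = Counter()
    for counts in per_node_counts:
        for value, count in Counter(counts).most_common(k):
            merged[value] += count
    return merged.most_common(threshold)


def exact_top_n(per_node_counts, threshold):
    """Merge the full counts from every node before ranking."""
    merged = Counter()
    for counts in per_node_counts:
        merged.update(counts)
    return merged.most_common(threshold)


# Hypothetical per-node metric totals for one dimension.
node_a = {"x": 10, "y": 9, "z": 8}
node_b = {"w": 10, "v": 9, "z": 8}
nodes = [node_a, node_b]

exact = exact_top_n(nodes, threshold=1)               # "z" wins with 16 overall
approx = approximate_top_n(nodes, threshold=1, k=2)   # "z" is cut on both nodes
generous = approximate_top_n(nodes, threshold=1, k=1000)
```

With K large enough to cover every value on every node, the approximate result coincides with the exact one, which is why raising threshold (and therefore K) improves accuracy.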

query

{
"queryType": "topN",
"dataSource": "sample_data",
"dimension": "sample_dim",
"threshold": 5,
"metric": "count",
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "dim1",
"value": "some_value"
},
{
"type": "selector",
"dimension": "dim2",
"value": "some_other_val"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
},
{
"type": "doubleSum",
"name": "some_metric",
"fieldName": "some_metric"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "some_metric",
"fieldName": "some_metric"
},
{
"type": "fieldAccess",
"name": "count",
"fieldName": "count"
}
]
}
],
"intervals": [
"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
]
}

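Query properties

Property | Description | Required
queryType | String; must be "topN" |
dataSource | String; the data source |
dimension | The dimension to group by |
intervals | Time range to query |
granularity | Aggregation granularity (see 2.1) |
filter | Filter condition (see 2.2) |
aggregations | Aggregations (see 2.3) |
postAggregations | Post-aggregations |
threshold | The N in topN |
metric | String or JSON object naming the metric used to sort the top N results |
context | Query context | No

The result of the query above looks like this: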

[
{
"timestamp": "2013-08-31T00:00:00.000Z",
"result": [
{
"dim1": "dim1_val",
"count": 111,
"some_metrics": 10669,
"average": 96.11711711711712
},
{
"dim1": "another_dim1_val",
"count": 88,
"some_metrics": 28344,
"average": 322.09090909090907
},
{
"dim1": "dim1_val3",
"count": 70,
"some_metrics": 871,
"average": 12.442857142857143
},
{
"dim1": "dim1_val4",
"count": 62,
"some_metrics": 815,
"average": 13.14516129032258
},
{
"dim1": "dim1_val5",
"count": 60,
"some_metrics": 2787,
"average": 46.45
}
]
}
]

2.4.3 GroupBy query

Similar to a GROUP BY query in MySQL.

query

{
"queryType": "groupBy",
"dataSource": "sample_data",
"dimensions": ["sample_dim"],
"limitSpec": { "type": "default", "limit": 5, "columns": [ { "dimension": "count", "direction": "descending" } ] },
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "dim1",
"value": "some_value"
},
{
"type": "selector",
"dimension": "dim2",
"value": "some_other_val"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
},
{
"type": "doubleSum",
"name": "some_metric",
"fieldName": "some_metric"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "some_metric",
"fieldName": "some_metric"
},
{
"type": "fieldAccess",
"name": "count",
"fieldName": "count"
}
]
}
],
"intervals": [
"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
]
}

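Query properties

Property | Description | Required
queryType | String; must be "groupBy" |
dataSource | String; the data source |
dimensions | The dimensions to group by |
intervals | Time range to query |
granularity | Aggregation granularity (see 2.1) |
filter | Filter condition (see 2.2) |
aggregations | Aggregations (see 2.3) |
postAggregations | Post-aggregations |
limitSpec | Returns a limited number of results, similar to the LIMIT clause in MySQL |
having | Similar to the HAVING clause in MySQL |
context | Query context | No

The result of the query above looks like this: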

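[ {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"sample_dim" : "dim1_val",
"count" : 111,
"some_metric" : 10669,
"sample_divide" : 96.11711711711712
}
}, {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"sample_dim" : "another_dim1_val",
"count" : 88,
"some_metric" : 28344,
"sample_divide" : 322.09090909090907
}
}, {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"sample_dim" : "dim1_val3",
"count" : 70,
"some_metric" : 871,
"sample_divide" : 12.442857142857143
}
}, {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"sample_dim" : "dim1_val4",
"count" : 62,
"some_metric" : 815,
"sample_divide" : 13.14516129032258
}
}, {
"version" : "v1",
"timestamp" : "2013-08-31T00:00:00.000Z",
"event" : {
"sample_dim" : "dim1_val5",
"count" : 60,
"some_metric" : 2787,
"sample_divide" : 46.45
}
} ]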

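groupBy on multi-value dimensions

A Druid dimension can hold multiple values per row. When a groupBy on a multi-value dimension has a filter that matches any one of a row's values, the result contains a group for every value in the matching rows, not only for the value that was filtered on. The example below illustrates this.

Underlying data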

{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4

Query:

{
"queryType": "groupBy",
"dataSource": "test",
"intervals": [
"1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
],
"filter": {
"type": "selector",
"dimension": "tags",
"value": "t3"
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "tags",
"outputName": "tags"
}
],
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}

Result (row1 and row2 match):

[
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t1"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t2"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 2,
"tags": "t3"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t4"
}
},
{
"timestamp": "1970-01-01T00:00:00.000Z",
"event": {
"count": 1,
"tags": "t5"
}
}
]
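The result above can be reproduced with a short simulation: a row passes the selector filter if any of its values matches, and every value of a passing row then contributes to its own group. The function name `group_by_multi_value` is made up for this sketch.

```python
from collections import Counter

# The four sample rows from this section (timestamps omitted).
ROWS = [
    {"tags": ["t1", "t2", "t3"]},  # row1
    {"tags": ["t3", "t4", "t5"]},  # row2
    {"tags": ["t5", "t6", "t7"]},  # row3
    {"tags": []},                  # row4
]


def group_by_multi_value(rows, dimension, filter_value):
    """Filter on a multi-value dimension, then group by that same dimension."""
    counts = Counter()
    for row in rows:
        values = row[dimension]
        if filter_value in values:   # selector matches if any value matches
            counts.update(values)    # every value of the row forms a group
    return dict(counts)


result = group_by_multi_value(ROWS, "tags", "t3")
```

Only t3 appears in both matching rows, so it is the only group with a count of 2.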

Problems encountered and solutions

Problem 1: querying in the Beijing time zone

Solution: set timeZone in the granularity

"granularity": {
"period": "PT1H",
"timeZone": "+08:00",
"type": "period"
}

Problem 2: getting the latest build time of a datasource

Solution: set requireMessageWatermark=true in the context and read the value from the header of the HTTP response;

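// Only read the body and the watermark header on HTTP 200.
if (HttpStatus.SC_OK == httpResponse.getStatusLine().getStatusCode()) {
    HttpEntity entity = httpResponse.getEntity();
    result.setResultStr(EntityUtils.toString(entity, "utf-8"));
    // getFirstHeader returns null when the header is absent (Header is org.apache.http.Header).
    Header waterMarkHeader = httpResponse.getFirstHeader(DRUID_WATERMARK_HEADER);
    String waterMark = waterMarkHeader == null ? null : waterMarkHeader.getValue();
    if (StringUtil.isNotEmpty(waterMark)) {
        // The header value is a JSON object; pull the watermark field out of it.
        JSONObject obj = JSONObject.parseObject(waterMark);
        waterMark = obj.getString(DRUID_MESSAGE_WATERMARK);
        result.setDruidWatermarkStr(waterMark);
    }
}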

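Problem 3: setting the priority of different queries

Solution: set the priority property in the context;

Problem 4: setting the timeout of a query

Solution: set the timeout property in the context;

Problem 5: returning only the wanted values in a multi-value groupBy

Solution: use a listFiltered dimension spec to list the wanted values of the multi-value column;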
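Both settings live in the query's `context` object. A minimal sketch of attaching them to an existing query follows; the helper `with_context`, the sample query, and the concrete values (priority 10, timeout 30000 milliseconds) are assumptions for illustration.

```python
import copy


def with_context(query, **context):
    """Return a copy of the query with the given context entries merged in."""
    updated = copy.deepcopy(query)
    updated.setdefault("context", {}).update(context)
    return updated


# Hypothetical base query.
base_query = {
    "queryType": "timeseries",
    "dataSource": "sample_datasource",
    "granularity": "day",
    "aggregations": [{"type": "count", "name": "count"}],
    "intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"],
}
# timeout is given in milliseconds; a higher priority value is served first.
tuned_query = with_context(base_query, priority=10, timeout=30000)
```

Copying the query rather than mutating it keeps one base definition reusable with different priorities or timeouts.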

{
"queryType": "groupBy",
"dataSource": "test",
"intervals": [
"1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
],
"filter": {
"type": "selector",
"dimension": "tags",
"value": "t3"
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "listFiltered",
"delegate": {
"type": "default",
"dimension": "tags",
"outputName": "tags"
},
"values": ["t3"]
}
],
"aggregations": [
{
"type": "count",
"name": "count"
}
]
}
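The effect of the listFiltered dimension spec can be simulated the same way as the plain multi-value groupBy: rows still match if any value equals the filter value, but only the listed values are allowed to form groups. The function name `group_by_list_filtered` is hypothetical.

```python
from collections import Counter

# The four sample rows from the multi-value example (timestamps omitted).
ROWS = [
    {"tags": ["t1", "t2", "t3"]},
    {"tags": ["t3", "t4", "t5"]},
    {"tags": ["t5", "t6", "t7"]},
    {"tags": []},
]


def group_by_list_filtered(rows, dimension, filter_value, allowed_values):
    """Group by a multi-value dimension, keeping only the allowed values as groups."""
    counts = Counter()
    for row in rows:
        values = row[dimension]
        if filter_value in values:
            counts.update(v for v in values if v in allowed_values)
    return dict(counts)


filtered = group_by_list_filtered(ROWS, "tags", "t3", allowed_values={"t3"})
```

Compared with the unfiltered groupBy, the groups for t1, t2, t4 and t5 disappear and only t3 remains with a count of 2.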

Appendix: link to the official Druid site

Official site link