所有分类
  • 所有分类
  • 未分类

ElasticSearch–去重查询的方法

简介

本文介绍如何根据某一个字段进行去重。包括:获取去重后的结果,统计去重后的数量。

在SQL中,我们可以用dinstinct语句进行去重,例如:     

  • 获取去重后的结果:SELECT DISTINCT name, sex FROM person;
  • 统计去重后的数量:SELECT COUNT(DISTINCT name, sex) FROM person;

Elasticsearch也可以做到获取去重后的结果,统计去重后的数量,例如:

  • 获取去重后的结果
    • 方案1:collapse折叠功能(ES5.3之后支持)
      • 推荐。原因:性能高,占内存小
    • 方案2:字段聚合+top_hits聚合
      • 不推荐。原因:性能差,占内存大
  • 统计去重后的数量
    • 聚合+cardinality聚合函数

索引结构及数据

索引结构

http://localhost:9200/
PUT blog

{
    "mappings": {
        "properties": {
            "id":{
                "type":"long"
            },
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text"
            },
            "author":{
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "category":{
                "type": "keyword"
            },
            "createTime": {
                "type": "date",
                "format":"yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss||epoch_millis"
            },
            "updateTime": {
                "type": "date",
                "format":"yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS||yyyy-MM-dd HH:mm:ss||epoch_millis"
            },
            "status":{
                "type":"integer"
            },
            "serialNum": {
                "type": "keyword"
            }
        }
    }
}

数据

  • 每个文档必须独占一行,不能换行。
  • 此命令要放到postman中去执行,如果用head执行会失败

http://localhost:9200/
POST _bulk

{"index":{"_index":"blog","_id":1}}
{"blogId":1,"title":"Spring Data ElasticSearch学习教程1","content":"这是批量添加的文档1","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"1","createTime":"2021-10-10 11:52:01.249","updateTime":null}
{"index":{"_index":"blog","_id":2}}
{"blogId":2,"title":"Spring Data ElasticSearch学习教程2","content":"这是批量添加的文档2","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"2","createTime":"2021-10-10 11:52:02.249","updateTime":null}
{"index":{"_index":"blog","_id":3}}
{"blogId":3,"title":"Spring Data ElasticSearch学习教程3","content":"这是批量添加的文档3","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"3","createTime":"2021-10-10 11:52:03.249","updateTime":null}
{"index":{"_index":"blog","_id":4}}
{"blogId":4,"title":"Spring Data ElasticSearch学习教程4","content":"这是批量添加的文档4","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"4","createTime":"2021-10-10 11:52:04.249","updateTime":null}
{"index":{"_index":"blog","_id":5}}
{"blogId":5,"title":"Spring Data ElasticSearch学习教程5","content":"这是批量添加的文档5","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"5","createTime":"2021-10-10 11:52:05.249","updateTime":null}
{"index":{"_index":"blog","_id":6}}
{"blogId":6,"title":"Java学习教程6","content":"这是批量添加的文档6","author":"Tony","category":"ElasticSearch","status":1,"serialNum":"6","createTime":"2021-10-10 11:52:06.249","updateTime":null}
{"index":{"_index":"blog","_id":7}}
{"blogId":7,"title":"Java学习教程7","content":"这是批量添加的文档7","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"7","createTime":"2021-10-10 11:52:07.249","updateTime":null}
{"index":{"_index":"blog","_id":8}}
{"blogId":8,"title":"Java学习教程8","content":"这是批量添加的文档8","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"8","createTime":"2021-10-10 11:52:08.249","updateTime":null}
{"index":{"_index":"blog","_id":9}}
{"blogId":9,"title":"Java学习教程9","content":"这是批量添加的文档9","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"9","createTime":"2021-10-10 11:52:09.249","updateTime":null}
{"index":{"_index":"blog","_id":10}}
{"blogId":10,"title":"Java学习教程10","content":"这是批量添加的文档10","author":"Pepper","category":"ElasticSearch","status":1,"serialNum":"10","createTime":"2021-10-10 11:52:10.249","updateTime":null}

执行之后的结果

实例:手写DSL

去重的字段不能是text类型。所以,author的mapping要有keyword,且通过author.keyword去重。

如果去重字段是其他可以直接去重的类型,比如:数字类型、keyword、日期等,则直接用字段名就可以。即:如果本处author是keyword,则author.keyword处写成author就行。

collapse获取去重结果

POST /blog/_search

{
    "query": {
        "match": {
            "title":{
                "query": "java"
            }
        }
    },
    "collapse":{
        "field": "author.keyword"
    }
}

结果:

我把结果全部贴出来

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [{
                "_index": "blog",
                "_type": "_doc",
                "_id": "9",
                "_score": 1.0596458,
                "_source": {
                    "blogId": 9,
                    "title": "Java学习教程9",
                    "content": "这是批量添加的文档9",
                    "author": "Pepper",
                    "category": "ElasticSearch",
                    "status": 1,
                    "serialNum": "9",
                    "createTime": "2021-10-10 11:52:09.249",
                    "updateTime": null
                },
                "fields": {
                    "author.keyword": [
                        "Pepper"
                    ]
                }
            }, {
                "_index": "blog",
                "_type": "_doc",
                "_id": "6",
                "_score": 0.7361701,
                "_source": {
                    "blogId": 6,
                    "title": "Java学习教程6",
                    "content": "这是批量添加的文档6",
                    "author": "Tony",
                    "category": "ElasticSearch",
                    "status": 1,
                    "serialNum": "6",
                    "createTime": "2021-10-10 11:52:06.249",
                    "updateTime": null
                },
                "fields": {
                    "author.keyword": [
                        "Tony"
                    ]
                }
            }
        ]
    }
}

聚合获取去重结果

POST /blog/_search

{
    "query": {
        "match": {
            "title": {
                "query": "java"
            }
        }
    },
    "size": 0,
    "aggs": {
        "author_aggs": {
            "terms": {
                "field": "author.keyword",
                "size": 10
            },
            "aggs": {
                "author_top": {
                    "top_hits": {
                        "sort": [{
                                "author.keyword": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "size": 1
                    }
                }
            }
        }
    }
}

结果

我把结果全部贴出来

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "author_aggs": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                    "key": "Pepper",
                    "doc_count": 4,
                    "author_top": {
                        "hits": {
                            "total": {
                                "value": 4,
                                "relation": "eq"
                            },
                            "max_score": null,
                            "hits": [{
                                    "_index": "blog",
                                    "_type": "_doc",
                                    "_id": "9",
                                    "_score": null,
                                    "_source": {
                                        "blogId": 9,
                                        "title": "Java学习教程9",
                                        "content": "这是批量添加的文档9",
                                        "author": "Pepper",
                                        "category": "ElasticSearch",
                                        "status": 1,
                                        "serialNum": "9",
                                        "createTime": "2021-10-10 11:52:09.249",
                                        "updateTime": null
                                    },
                                    "sort": [
                                        "Pepper"
                                    ]
                                }
                            ]
                        }
                    }
                }, {
                    "key": "Tony",
                    "doc_count": 1,
                    "author_top": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": null,
                            "hits": [{
                                    "_index": "blog",
                                    "_type": "_doc",
                                    "_id": "6",
                                    "_score": null,
                                    "_source": {
                                        "blogId": 6,
                                        "title": "Java学习教程6",
                                        "content": "这是批量添加的文档6",
                                        "author": "Tony",
                                        "category": "ElasticSearch",
                                        "status": 1,
                                        "serialNum": "6",
                                        "createTime": "2021-10-10 11:52:06.249",
                                        "updateTime": null
                                    },
                                    "sort": [
                                        "Tony"
                                    ]
                                }
                            ]
                        }
                    }
                }
            ]
        }
    }
}

聚合获取去重数量

POST /blog/_search

{
    "query": {
        "match": {
            "title": {
                "query": "java"
            }
        }
    },
    "size": 0,
    "aggs": {
        "author_aggs": {
            "cardinality": {
                "field": "author.keyword"
            }
        }
    }
}

结果

collapse获取去重结果

见:Spring Data Elasticsearch-ElasticsearchRestTemplate-使用实例 – 自学精灵

0

评论0

请先

显示验证码
没有账号?注册  忘记密码?

社交账号快速登录