%0 Journal Article %T 分布式数据库系统中的并行分组聚合实现 %A 徐石磊 %A 魏星 %A 江红 %A 钱卫宁 %A 周傲英 %J 华东师范大学学报(自然科学版) %D 2018 %R 10.3969/j.issn.1000-5641.2018.05.005 %X 摘要 伴随着新型互联网应用中对数据统计、分析需求的增大,分组、聚合已经成为数据分析应用中出现频率最多的请求之一.本文就类OLAP(on-line transactionprocessing)应用中常见的Aggregation、GroupBy原理进行了分析.针对一般事务型数据库采用排序分组的缺点,提出了两种Hash分组聚合的具体实现方案,并提出一种利用统计信息动态决策Hash桶数、Hash分组聚合方案的策略.根据分布式数据库多副本的特点,本文又提出了一种Hash分组聚合节点级的并行方案.最后,在开源数据库OceanBase进行了具体的实现.通过实验证明,本文提出的利用统计信息动态决策Hash分组聚合方案相比排序分组具有极大的效率提升.</br>Abstract:With the increase in demand for data statistics and analysis in new Internet applications, data grouping and aggregation have become amongst the most common operations in data analysis applications. This paper analyzes the operating principles of the Aggregation and GroupBy functions commonly used in analytical applications. Based on the disadvantages of sort grouping for general-transactional databases, two kinds of Hash GroupBy implementations are proposed; in addition,a strategy for dynamically determining the number of Hash buckets and Hash GroupBy schemes, based on statistical information, is proposed. Based on the characteristics of distributed clusters, implementation of the Hash GroupBy operator push down is proposed. Experiments have shown that the use of statistical information to dynamically determine the Hash group option improves efficiency. %K OceanBase %K GroupBy %K Hash %K 数据分布< %K /br> %K Key words: OceanBase GroupBy Hash Data distribution %U http://xblk.ecnu.edu.cn/CN/abstract/abstract25548.shtml