In our last article, we examined the critical role Distinct Counting (also known as Count Distinct) plays when it comes to working with massive datasets. While Distinct Counting is invaluable for navigating today’s data-driven business landscape, it’s not without its weaknesses.

Distinct Counting with Big Data is a resource-intensive computational process, and performing it in a timely way is a tall order. Effectively utilizing Big Data to generate actionable insights is what sets market leaders ahead of their industry peers, but speed is a factor as well. No matter how much data you have, or how much of it you’re able to use, it’s meaningless if it takes too long to manage and analyze.

Fortunately, two approaches exist that make Distinct Counting quick when it comes to working with Big Data: HyperLogLog (HLL) and Bitmap.

HyperLogLog (HLL) vs. Bitmap

As we discussed previously, HyperLogLog (HLL) and Bitmap are two popular algorithms for optimizing the calculations required when performing Distinct Counting on Big Data. We won’t rehash the argument for or against them here, but here’s a recap of the main points we’ve already covered:

  1. HLL needs very little storage space. Depending on the level of accuracy, one HLL takes up between 1 KB and 64 KB. Bitmap, on the other hand, uses one bit to represent each ID, so the space it requires grows with the size of the dataset. Bitmap requires one or two orders of magnitude more storage space than HLL.
  2. HLL accepts multiple types of data as input, which makes it very easy to use. Bitmap only accepts int/long values, so if the original value is a string, the user needs to map it to an int/long before Bitmap can be used.
  3. HLL can accept multiple data types because it hashes its inputs, but because it relies on hash functions and probabilistic estimation, it cannot guarantee an exact result. Its theoretical margin of error is still over 1%.
  4. Bitmap faithfully uses a bit, (1) or (0), for every ID. So, as long as every unique user is guaranteed a unique ID value, Bitmap’s results are exact.
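To make points 2 and 4 concrete, here is a minimal Python sketch of exact counting with a bitmap. It is an illustration only, not how Kylin implements it: the class name `ExactBitmapCounter`, the `encode` helper, and the sample user names are all hypothetical, and the bitmap is a plain uncompressed byte array.

```python
class ExactBitmapCounter:
    """Exact distinct counter that keeps one bit per possible integer ID.

    Hypothetical illustration: an uncompressed bitmap sized for IDs 0..max_id.
    """

    def __init__(self, max_id):
        # One bit per ID, so the size depends on the ID range, not on
        # how many IDs are actually seen.
        self.bits = bytearray(max_id // 8 + 1)

    def add(self, int_id):
        # Setting a bit that is already set changes nothing,
        # so duplicates are ignored automatically.
        self.bits[int_id // 8] |= 1 << (int_id % 8)

    def count(self):
        # The distinct count is the number of bits set to 1.
        return sum(bin(byte).count("1") for byte in self.bits)


# Bitmap needs integers, so string values are first mapped to integer IDs.
# Assumption: a simple in-memory dictionary assigns IDs 0, 1, 2, ...
dictionary = {}


def encode(value):
    return dictionary.setdefault(value, len(dictionary))


events = ["alice", "bob", "alice", "carol", "bob", "alice"]
counter = ExactBitmapCounter(max_id=1_000_000)
for user in events:
    counter.add(encode(user))

distinct_users = counter.count()
print(distinct_users)      # 3
print(len(counter.bits))   # 125001 bytes reserved for 1,000,001 possible IDs
```

The result is exact because every distinct ID owns its own bit, and the price is visible in the last line: the bitmap is sized by the ID range, however few users actually show up.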

Generally speaking, HLL is compact and easy to use, but it lacks accuracy; Bitmap may take up a lot more space than HLL, but it does guarantee accuracy.
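For contrast, the following is a simplified Python sketch of the HyperLogLog idea: hash each value, use the first bits of the hash to pick a register, and keep only the largest "leading zeros" rank seen per register. It is a teaching sketch under stated assumptions, not Kylin’s implementation: the class and parameter names are hypothetical, SHA-1 is an arbitrary choice of hash, and each register takes a full byte where production implementations pack registers more tightly.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (hypothetical, for illustration only)."""

    def __init__(self, p=12):
        self.p = p                           # number of hash bits used as register index
        self.m = 1 << p                      # number of registers (4096 when p=12)
        self.registers = bytearray(self.m)   # fixed size, whatever the data volume
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, value):
        # Any input type works because only its hash is used.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        # Harmonic-mean estimate over all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")

estimate = hll.count()
print(estimate)              # close to 100000, but generally not exactly 100000
print(len(hll.registers))    # 4096
```

With m registers the typical relative error of this estimator is about 1.04 / sqrt(m), which is roughly 1.6% for m = 4096. Adding more registers shrinks the error but never removes it, which is the trade-off summarized above.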

Speed vs. Accuracy with Distinct Counting

So, which is the correct approach? The truth is that it really depends. For most organizations, HLL has proven to be the approach of choice. As a result, its use has become ubiquitous in the field of Big Data.

This, of course, is a major reason why Apache Kylin has always supported HLL calculations. The rapid increase in dataset size, along with the storage constraints as well as the speed and processing requirements of modern businesses, left few other options. The cost of lower accuracy seemed to be worth it.

[Figure: Apache Kylin Overview]

Had someone asked us about possible errors in our calculations, we would have answered that, among results numbering in the thousands of millions, users would not pay much attention to a 1% error.

However, we came to discover that many users often did not share this belief. In some situations, having even a slight error in the result was unacceptable.

For example, in traffic redirection or advertisement placement, costs are calculated through the summation of the number of channels or clicks from viewers. Having a slight mistake in the values is intolerable for both sides of this business. On one hand, the buyer is worried about paying too much, while on the other, the provider is worried about receiving too little.

As we discussed above, HLL is not 100% accurate. 99% of the time its margin of error is within 1%, with the remaining 1% of the time resulting in even larger margins of error. If the error does happen to be extremely large, it stands to reason that it would lead to extreme problems.

Additionally, if we must multiply or divide our UV (Unique Visitors) results, the error is magnified. For instance, (user growth rate) = (today’s user count) / (yesterday’s user count); if the numerator is slightly overestimated and the denominator is slightly underestimated, the errors compound, the result can be badly wrong, and you won’t be able to determine how large the error is. The absolute numbers matter too: with 100,000,000 users, a 1% error means an error of 1,000,000 users.
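A small Python calculation shows how the two errors compound in a ratio. The figures are invented for illustration (a true count of 100,000,000 yesterday and 101,000,000 today), and the function name `growth_ratio_bounds` is hypothetical; the only assumption taken from the discussion above is that each count may be off by up to 1% in either direction.

```python
def growth_ratio_bounds(today, yesterday, rel_err):
    """Return the lowest and highest ratio today/yesterday that could be
    reported when each count may be off by up to rel_err in either direction."""
    # Worst case low: today underestimated, yesterday overestimated.
    lowest = today * (1 - rel_err) / (yesterday * (1 + rel_err))
    # Worst case high: today overestimated, yesterday underestimated.
    highest = today * (1 + rel_err) / (yesterday * (1 - rel_err))
    return lowest, highest


# Hypothetical true values: the user base really grew by 1%.
true_today, true_yesterday = 101_000_000, 100_000_000
true_ratio = true_today / true_yesterday

low, high = growth_ratio_bounds(true_today, true_yesterday, rel_err=0.01)

print(f"true growth:     {true_ratio - 1:+.2%}")
print(f"reported growth: between {low - 1:+.2%} and {high - 1:+.2%}")
# With these inputs the reported growth ranges from about -1% to about +3%.
```

In other words, a real 1% increase could be reported as anything from a 1% decline to a 3% rise, so the estimate alone cannot even tell you whether the user base grew.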

For websites and apps with a steady flow of visitors, an error of this size is more than enough to completely overshadow the real effect of their operational efforts, meaning the data cannot provide useful feedback or insight for the business.

So, if you do not want to receive a phone call from your boss or business partners in the middle of the night to check on the data, then it is in your best interest to figure out a solution for the accuracy of this calculation (so you can secure a good night’s sleep).

How to Ensure Speed and Accuracy for Analytics on Big Data

When it came to developing Apache Kylin, we very quickly realized that simply having a good estimate was not enough. Kylin needed to support accurate Distinct Counting. If it couldn’t, users would surely lose out on opportunities and significant insights during major calculations.

To address this challenge, the Kylin community came together to develop a new approach. In our next article, we’ll delve into the development process and explain how Kylin was able to find a way to deliver both speed and accuracy when using Distinct Count with Big Data.
