Deduplication is a common requirement in databases; typical cases include deduplicating on a single column, on multiple columns, or on whole rows. In PostgreSQL we can use different methods for each of these requirements.

1. Single-column deduplication

Single-column deduplication is probably the most common case: remove rows whose value in one column is duplicated, keeping either the newest or the oldest record as required.

-- Create test data
bill=# create table test1(id int primary key, c1 int, c2 timestamp);
CREATE TABLE
bill=# insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();
INSERT 0 1000000
bill=# create index idx_test1 on test1(c1,id);
CREATE INDEX

-- Method 1: aggregation (max per group) with NOT IN


bill=# explain delete from test1 where id not in (select max(id) from test1 group by c1);
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Delete on test1  (cost=30609.23..48515.23 rows=500000 width=6)
   ->  Seq Scan on test1  (cost=30609.23..48515.23 rows=500000 width=6)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  GroupAggregate  (cost=0.42..30606.73 rows=1001 width=8)
                 Group Key: test1_1.c1
                 ->  Index Only Scan using idx_test1 on test1 test1_1  (cost=0.42..25596.72 rows=1000000 width=8)
(7 rows)
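The keep-max(id)-per-group semantics of this NOT IN approach can be sketched on a toy dataset. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL (SQLite accepts the same NOT IN / GROUP BY form); the table contents are hypothetical:

```python
import sqlite3

# Toy reproduction of "keep the row with max(id) per c1 value".
# SQLite stands in for PostgreSQL here; data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table test1(id integer primary key, c1 int)")
conn.executemany("insert into test1 values (?, ?)",
                 [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)])

# Delete every row whose id is not the largest id within its c1 group.
conn.execute("""
    delete from test1
    where id not in (select max(id) from test1 group by c1)
""")
rows = conn.execute("select id, c1 from test1 order by c1").fetchall()
print(rows)  # [(2, 10), (4, 20), (5, 30)]
```

Each c1 group keeps only its highest-id (i.e. most recently inserted) row; ordering with min(id) instead would keep the oldest.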

-- Method 2: window query (row_number) with IN

bill=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.42..55596.72 rows=995000 width=4)
   Filter: (t.rn <> 1)
   ->  WindowAgg  (cost=0.42..43096.72 rows=1000000 width=16)
         ->  Index Only Scan using idx_test1 on test1  (cost=0.42..25596.72 rows=1000000 width=8)
(4 rows)
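The same row_number() pattern can be run end to end on toy data. Note that `order by id` ascending keeps the *smallest* id per group (rn = 1); ordering descending would keep the newest instead. Again sqlite3 stands in for PostgreSQL (window functions need SQLite >= 3.25), and the data is hypothetical:

```python
import sqlite3

# Window-function dedup: number rows within each c1 group and delete rn <> 1.
# SQLite >= 3.25 stands in for PostgreSQL; data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table test1(id integer primary key, c1 int)")
conn.executemany("insert into test1 values (?, ?)",
                 [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)])

conn.execute("""
    delete from test1
    where id in (
        select id from (
            select id, row_number() over (partition by c1 order by id) as rn
            from test1
        ) where rn <> 1
    )
""")
rows = conn.execute("select id, c1 from test1 order by c1").fetchall()
print(rows)  # [(1, 10), (3, 20), (5, 30)]
```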

-- Method 3: iterate over the table with a cursor, comparing each record with the previous one.

bill=# do language plpgsql $$
bill$# declare
bill$#     v_rec record;
bill$#     v_c1 int;
bill$#     cur1 cursor for select c1,id from test1 order by c1,id for update;
bill$# begin
bill$#     for v_rec in cur1 loop
bill$#         if v_rec.c1 = v_c1 then
bill$#             delete from test1 where current of cur1;
bill$#         end if;
bill$#         v_c1 := v_rec.c1;
bill$#     end loop;
bill$# end;
bill$# $$;
DO
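The cursor loop above can be sketched procedurally: walk the rows sorted by (c1, id) and flag any row whose c1 equals the previous row's. PostgreSQL's WHERE CURRENT OF has no SQLite equivalent, so this sketch collects the duplicate ids first and deletes them afterwards; the data is hypothetical:

```python
import sqlite3

# Procedural analogue of the cursor-based dedup loop.
# SQLite stands in for PostgreSQL; data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table test1(id integer primary key, c1 int)")
conn.executemany("insert into test1 values (?, ?)",
                 [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)])

prev = object()            # sentinel that never equals a real c1 value
duplicates = []
for id_, c1 in conn.execute("select id, c1 from test1 order by c1, id"):
    if c1 == prev:         # same c1 as the previous row -> duplicate
        duplicates.append(id_)
    prev = c1
conn.executemany("delete from test1 where id = ?",
                 [(i,) for i in duplicates])
rows = conn.execute("select id, c1 from test1 order by c1").fetchall()
print(rows)  # [(1, 10), (3, 20), (5, 30)]
```

Because the walk is sorted ascending, this keeps the smallest id per group, matching the cursor version's behavior.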

Of the three approaches above, method 2 is the most efficient, followed by method 3.


2. Multi-column deduplication

This is similar to the single-column case, except that duplicates are defined over a combination of columns.


-- Create test data
bill=# create table test1(id int primary key, c1 int, c2 int, c3 timestamp);
CREATE TABLE
bill=# insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();
INSERT 0 1000000
bill=# create index idx_test1 on test1(c1,c2,id);
CREATE INDEX

-- Method 1:

bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.test1  (cost=37036.38..55906.38 rows=500000 width=6) (actual time=1924.854..1924.854 rows=0 loops=1)
   Buffers: shared hit=1373911 read=3834
   ->  Seq Scan on public.test1  (cost=37036.38..55906.38 rows=500000 width=6) (actual time=1255.586..1672.129 rows=367700 loops=1)
         Output: test1.ctid
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 632300
         Buffers: shared hit=1006211 read=3834
         SubPlan 1
           ->  GroupAggregate  (cost=0.42..36786.38 rows=100000 width=12) (actual time=0.061..1001.212 rows=632300 loops=1)
                 Output: max(test1_1.id), test1_1.c1, test1_1.c2
                 Group Key: test1_1.c1, test1_1.c2
                 Buffers: shared hit=999841 read=3834
                 ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..28286.38 rows=1000000 width=12) (actual time=0.052..708.625 rows=1000000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.id
                       Heap Fetches: 1000000
                       Buffers: shared hit=999841 read=3834
 Planning Time: 0.345 ms
 Execution Time: 1931.117 ms
(18 rows)

-- Method 2:

bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id in (select id from (select row_number() over(partition by c1,c2 order by id) as rn, id from test1) t where t.rn<>1);
                                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.test1  (cost=47204.90..79033.85 rows=629138 width=34) (actual time=625.967..625.968 rows=0 loops=1)
   Buffers: shared hit=3836
   ->  Hash Semi Join  (cost=47204.90..79033.85 rows=629138 width=34) (actual time=625.966..625.967 rows=0 loops=1)
         Output: test1.ctid, t.*
         Hash Cond: (test1.id = t.id)
         Buffers: shared hit=3836
         ->  Seq Scan on public.test1  (cost=0.00..12693.00 rows=632300 width=10) (actual time=0.007..0.007 rows=1 loops=1)
               Output: test1.ctid, test1.id
               Buffers: shared hit=1
         ->  Hash  (cost=35039.68..35039.68 rows=629138 width=32) (actual time=625.801..625.801 rows=0 loops=1)
               Output: t.*, t.id
               Buckets: 131072  Batches: 8  Memory Usage: 1024kB
               Buffers: shared hit=3835
               ->  Subquery Scan on t  (cost=0.42..35039.68 rows=629138 width=32) (actual time=625.800..625.800 rows=0 loops=1)
                     Output: t.*, t.id
                     Filter: (t.rn <> 1)
                     Rows Removed by Filter: 632300
                     Buffers: shared hit=3835
                     ->  WindowAgg  (cost=0.42..27135.92 rows=632300 width=20) (actual time=0.041..574.119 rows=632300 loops=1)
                           Output: row_number() OVER (?), test1_1.id, test1_1.c1, test1_1.c2
                           Buffers: shared hit=3835
                           ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..14489.92 rows=632300 width=12) (actual time=0.024..89.633 rows=632300 loops=1)
                                 Output: test1_1.c1, test1_1.c2, test1_1.id
                                 Heap Fetches: 0
                                 Buffers: shared hit=3835
 Planning Time: 0.505 ms
 Execution Time: 626.029 ms
(27 rows)

-- Method 3:

bill=# do language plpgsql $$
bill$# declare
bill$#     v_rec record;
bill$#     v_c1 int;
bill$#     v_c2 int;
bill$#     cur1 cursor for select c1,c2 from test1 order by c1,c2,id for update;
bill$# begin
bill$#     for v_rec in cur1 loop
bill$#         if v_rec.c1 = v_c1 and v_rec.c2 = v_c2 then
bill$#             delete from test1 where current of cur1;
bill$#         end if;
bill$#         v_c1 := v_rec.c1;
bill$#         v_c2 := v_rec.c2;
bill$#     end loop;
bill$# end;
bill$# $$;
DO
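The multi-column version of the window-function delete only changes the PARTITION BY list. A minimal sketch on toy data, again with sqlite3 (>= 3.25) standing in for PostgreSQL and hypothetical table contents:

```python
import sqlite3

# Multi-column dedup: partition by (c1, c2) instead of a single column.
# SQLite stands in for PostgreSQL; data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table test1(id integer primary key, c1 int, c2 int)")
conn.executemany("insert into test1 values (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 1), (5, 2, 1)])

conn.execute("""
    delete from test1
    where id in (
        select id from (
            select id,
                   row_number() over (partition by c1, c2 order by id) as rn
            from test1
        ) where rn <> 1
    )
""")
rows = conn.execute("select id, c1, c2 from test1 order by c1, c2").fetchall()
print(rows)  # [(1, 1, 1), (3, 1, 2), (4, 2, 1)]
```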

3. Whole-row deduplication

Since there is no key column to compare, row deduplication generally relies on the system column ctid.

-- Create test data
bill=# create table test1(c1 int, c2 int);
CREATE TABLE
bill=# insert into test1 select random()*1000, random()*1000 from generate_series(1,1000000);
INSERT 0 1000000

-- Method 1:

bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid not in (select max(ctid) from test1 group by c1,c2);
                                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.test1  (cost=135831.29..152756.29 rows=500000 width=6) (actual time=2290.808..2290.808 rows=0 loops=1)
   Buffers: shared hit=376170, temp read=2944 written=2954
   ->  Seq Scan on public.test1  (cost=135831.29..152756.29 rows=500000 width=6) (actual time=1643.262..2040.646 rows=367320 loops=1)
         Output: test1.ctid
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 632680
         Buffers: shared hit=8850, temp read=2944 written=2954
         SubPlan 1
           ->  GroupAggregate  (cost=124581.29..135581.29 rows=100000 width=14) (actual time=732.049..1390.277 rows=632680 loops=1)
                 Output: max(test1_1.ctid), test1_1.c1, test1_1.c2
                 Group Key: test1_1.c1, test1_1.c2
                 Buffers: shared hit=4425, temp read=2944 written=2954
                 ->  Sort  (cost=124581.29..127081.29 rows=1000000 width=14) (actual time=732.035..1015.066 rows=1000000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.ctid
                       Sort Key: test1_1.c1, test1_1.c2
                       Sort Method: external merge  Disk: 23552kB
                       Buffers: shared hit=4425, temp read=2944 written=2954
                       ->  Seq Scan on public.test1 test1_1  (cost=0.00..14425.00 rows=1000000 width=14) (actual time=0.010..138.017 rows=1000000 loops=1)
                             Output: test1_1.c1, test1_1.c2, test1_1.ctid
                             Buffers: shared hit=4425
 Planning Time: 0.176 ms
 Execution Time: 2304.495 ms
(22 rows)
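The ctid trick works because ctid addresses the physical row even when no column is unique. In the sqlite3 stand-in used for the earlier sketches, the implicit rowid plays an analogous role (the analogy is approximate: rowid is a stable integer key, while ctid can change after VACUUM FULL or updates). Data below is hypothetical:

```python
import sqlite3

# Row-level dedup on a table with no primary key: SQLite's implicit rowid
# stands in (approximately) for PostgreSQL's ctid. Data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("create table test1(c1 int, c2 int)")   # no primary key
conn.executemany("insert into test1 values (?, ?)",
                 [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1)])

# Keep one physical row per (c1, c2) combination.
conn.execute("""
    delete from test1
    where rowid not in (select max(rowid) from test1 group by c1, c2)
""")
rows = conn.execute("select c1, c2 from test1 order by c1, c2").fetchall()
print(rows)  # [(1, 1), (1, 2), (2, 1)]
```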

-- Method 2:

bill=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where ctid = any(array( select ctid from (select row_number() over(partition by c1,c2 order by ctid) as rn, ctid from test1) t where t.rn<>1));
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.test1  (cost=100501.36..100514.46 rows=10 width=6) (actual time=1092.431..1092.431 rows=0 loops=1)
   Buffers: shared hit=4430, temp read=2013 written=2019
   InitPlan 1 (returns $0)
     ->  Subquery Scan on t  (cost=78357.55..100501.35 rows=629517 width=6) (actual time=1092.420..1092.420 rows=0 loops=1)
           Output: t.ctid
           Filter: (t.rn <> 1)
           Rows Removed by Filter: 632680
           Buffers: shared hit=4430, temp read=2013 written=2019
           ->  WindowAgg  (cost=78357.55..92592.85 rows=632680 width=22) (actual time=459.611..1042.708 rows=632680 loops=1)
                 Output: row_number() OVER (?), test1_1.ctid, test1_1.c1, test1_1.c2
                 Buffers: shared hit=4430, temp read=2013 written=2019
                 ->  Sort  (cost=78357.55..79939.25 rows=632680 width=14) (actual time=459.598..616.859 rows=632680 loops=1)
                       Output: test1_1.ctid, test1_1.c1, test1_1.c2
                       Sort Key: test1_1.c1, test1_1.c2, test1_1.ctid
                       Sort Method: external merge  Disk: 16104kB
                       Buffers: shared hit=4430, temp read=2013 written=2019
                       ->  Seq Scan on public.test1 test1_1  (cost=0.00..10751.80 rows=632680 width=14) (actual time=0.006..83.917 rows=632680 loops=1)
                             Output: test1_1.ctid, test1_1.c1, test1_1.c2
                             Buffers: shared hit=4425
   ->  Tid Scan on public.test1  (cost=0.01..13.11 rows=10 width=6) (actual time=1092.429..1092.429 rows=0 loops=1)
         Output: test1.ctid
         TID Cond: (test1.ctid = ANY ($0))
         Buffers: shared hit=4430, temp read=2013 written=2019
 Planning Time: 0.204 ms
 Execution Time: 1096.153 ms
(25 rows)

-- Method 3:

bill=# do language plpgsql $$
bill$# declare
bill$#     v_rec record;
bill$#     v_c1 int;
bill$#     v_c2 int;
bill$#     cur1 cursor for select c1,c2 from test1 order by c1,c2,ctid for update;
bill$# begin
bill$#     for v_rec in cur1 loop
bill$#         if v_rec.c1 = v_c1 and v_rec.c2 = v_c2 then
bill$#             delete from test1 where current of cur1;
bill$#         end if;
bill$#         v_c1 := v_rec.c1;
bill$#         v_c2 := v_rec.c2;
bill$#     end loop;
bill$# end;
bill$# $$;
DO
Time: 2320.653 ms (00:02.321)


Copyright notice: this is the blogger's original article; reproduction without permission is prohibited.
