
3.1 Concurrency

ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications

Cicada: Dependably Fast Multi-Core In-Memory Transactions

BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications

Transaction Repair for Multi-Version Concurrency Control

Concerto: A High Concurrency Key-Value Store with Integrity

Fast Failure Recovery for Main-Memory DBMSs on Multicores

Bringing Modular Concurrency Control to the Next Level

3.2 Storage and Distribution

Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics

OctopusFS: A Distributed File System with Tiered Storage Management

Monkey: Optimal Navigable Key-Value Store
navigable (adj.): able to be navigated; steerable; suitable for navigation

Wide Table Layout Optimization based on Column Ordering and Duplication

Query Centric Partitioning and Allocation for Partially Replicated Database Systems

Spanner: Becoming a SQL System

3.3 Streams

Enabling Signal Processing over Data Streams

Complete Event Trend Detection in High-Rate Event Streams

LittleTable: A Time-Series Database and Its Uses

3.4 Versions and Incremental Maintenance

Incremental View Maintenance over Array Data

Incremental Graph Computations: Doable and Undoable

DEX: Query Execution in a Delta-based Storage System

3.5 Parallel and Distributed Query Processing

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study

Distributed Provenance Compression

Network provenance, which records the execution history of network events as meta-data, is becoming increasingly important for network accountability and failure diagnosis. For example, network provenance may be used to trace the path that a message traversed in a network, or to reveal how a particular routing entry was derived and the parties involved in its derivation. A challenge when storing the provenance of a live network is that the large number of arriving messages may incur substantial storage overhead. In this paper, we explore techniques to dynamically compress distributed provenance stored at scale. Logically, compression is achieved by grouping equivalent provenance trees and maintaining only one concrete copy for each equivalence class. To efficiently identify the equivalent provenance, we (1) introduce distributed event-based linear programs (DELPs) to specify distributed network applications, and (2) statically analyze DELPs to allow for quick detection of provenance equivalence at runtime. Our experimental results demonstrate that our approach leads to significant storage reduction and query latency improvement over alternative approaches.
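The grouping idea in the abstract above (keep one concrete copy per equivalence class of provenance trees) can be sketched as follows. This is only an illustration under stated assumptions: the tuple encoding of a tree, the `canonical` function and the `ProvenanceStore` class are hypothetical, and equivalence here is plain structural equality up to child order, not the DELP-based static analysis the paper describes.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: a provenance tree is (rule_name, node_id, children).
Tree = Tuple[str, str, tuple]


def canonical(tree: Tree) -> Tree:
    """Return a canonical form so that equivalent trees compare equal.

    Assumption: two trees are equivalent when they apply the same rules at
    the same nodes, regardless of the order in which children were recorded.
    """
    rule, node, children = tree
    return (rule, node, tuple(sorted(canonical(c) for c in children)))


class ProvenanceStore:
    """Keeps one stored tree per equivalence class plus per-event references."""

    def __init__(self) -> None:
        self.classes: Dict[Tree, int] = {}      # canonical tree -> class id
        self.trees: List[Tree] = []             # class id -> the single stored copy
        self.event_class: Dict[str, int] = {}   # event id -> class id

    def add(self, event_id: str, tree: Tree) -> int:
        """Record the provenance of one event and return its class id."""
        key = canonical(tree)
        class_id = self.classes.get(key)
        if class_id is None:
            # First tree of this class: store the one concrete copy.
            class_id = len(self.trees)
            self.classes[key] = class_id
            self.trees.append(key)
        self.event_class[event_id] = class_id
        return class_id

    def lookup(self, event_id: str) -> Tree:
        """Answer a provenance query by following the event's class reference."""
        return self.trees[self.event_class[event_id]]
```

With this layout, storage grows with the number of distinct classes rather than with the number of messages, which is the effect the abstract attributes to compression.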


ROBUS: Fair Cache Allocation for Data-parallel Workloads

Heterogeneity-aware Distributed Parameter Servers

Distributed Algorithms on Exact Personalized PageRank

Parallelizing Sequential Graph Computations (Best paper award)

3.6 Tree & Graph Processing

Landmark Indexing for Evaluation of Label-Constrained Reachability Queries

Efficient Ad-Hoc Graph Inference and Matching in Biological Databases

DAG Reduction: Fast Answering Reachability Queries

Flexible and Feasible Support Measures for Mining Frequent Patterns in Large Labeled Graphs

Exploiting Common Patterns for Tree-Structured Data

Extracting and Analyzing Hidden Graphs from Relational Databases

TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model

ZipG: A Memory-efficient Graph Store for Interactive Queries

All-in-One: Graph Processing in RDBMSs Revisited

Computing A Near-Maximum Independent Set in Linear Time by Reducing-Peeling

3.7 New Hardware

Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

FPGA-based Data Partitioning

Template Skycube Algorithms for Heterogeneous Parallelism on Multicore and GPU Architectures

3.8 Interactive Data Exploration and AQP

Controlling False Discoveries During Interactive Data Exploration

MacroBase: Prioritizing Attention in Fast Data

Data Canopy: Accelerating Exploratory Statistical Analysis

Two-Level Sampling for Join Size Estimation

A General-Purpose Counting Filter: Making Every Bit Count

BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart

3.9 Beliefs, Conflicts, Knowledge

Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning

Database Learning: Toward a Database that Becomes Smarter Every Time

Staging User Feedback toward Rapid Conflict Resolution in Data Fusion

3.10 Influence in Social Networks

Discovering Your Selling Points: Personalized Social Influential Tags Exploration

Coarsening Massive Influence Networks for Scalable Diffusion Analysis

Debunking the Myths of Influence Maximization: An In-Depth Benchmarking Study

3.11 Mappings, Transformations, Pricing

Interactive Mapping Specification with Exemplar Tuples

Foofah: Transforming Data By Example

QIRANA: A Framework for Scalable Query Pricing

3.12 Optimization and Performance

Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?

Optimization of Disjunctive Predicates for Main Memory Column Stores

A Top-Down Approach to Achieving Performance Predictability in Database Systems

An Experimental Study of Bitmap Compression vs. Inverted List Compression

Automatic Database Management System Tuning Through Large-scale Machine Learning

Solving the Join Ordering Problem via Mixed Integer Linear Programming

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

3.13 User Preferences

Determining the Impact Regions of Competing Options in Preference Space

Efficient Computation of Regret-ratio Minimizing Set: A Compact Maxima Representative

FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems

Feedback-Aware Social Event-Participant Arrangement

3.14 Machine Learning

Schema Independent Relational Learning

Scalable Kernel Density Classification via Threshold-Based Pruning

The BUDS Language for Distributed Bayesian Machine Learning

A Cost-based Optimizer for Gradient Descent Optimization

3.15 Encryption

Fast Searchable Encryption With Tunable Locality

Cryptanalysis of Comparable Encryption in SIGMOD’16

BLOCKBENCH: A Framework for Analyzing Private Blockchains

3.16 Cleaning, Versioning, Fusion

Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Language

Synthesizing Mapping Relationships Using Table Corpus

Waldo: An Adaptive Human Interface for Crowd Entity Resolution

Online Deduplication for Databases

QFix: Diagnosing Errors through Query Histories

UGuide: User-Guided Discovery of FD-Detectable Errors

SLiMFast: Guaranteed Results for Data Fusion and Source Reliability

3.17 Spatial and Multidimensional Data

Utility-Aware Ridesharing on Road Networks

Ridesharing enables drivers to share any empty seats in their vehicles with riders to improve the efficiency of transportation for the benefit of both drivers and riders. Different from existing studies in ridesharing that focus on minimizing the travel costs of vehicles, we consider that the satisfaction of riders (the utility values) is more important nowadays. Thus, we formulate the problem of utility-aware ridesharing on road networks (URR) with the goal of providing the optimal rider schedules for vehicles to maximize the overall utility, subject to spatial-temporal and capacity constraints. To assign a new rider to a given vehicle, we propose an efficient algorithm with a minimum increase in travel cost without reordering the existing schedule of the vehicle. We prove that the URR problem is NP-hard by reducing it from the 0-1 Knapsack problem and it is unlikely to be approximated within any constant factor in polynomial time through a reduction from the DENSE k-SUBGRAPH problem. Therefore, we propose three efficient approximate algorithms, including a bilateral arrangement algorithm, an efficient greedy algorithm and a grouping-based scheduling algorithm, to assign riders to suitable vehicles with a high overall utility. Through extensive experiments, we demonstrate the efficiency and effectiveness of our URR approaches on both real and synthetic data sets.
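The step "assign a new rider to a given vehicle with a minimum increase in travel cost without reordering the existing schedule" can be illustrated with a brute-force sketch. It is not the paper's efficient algorithm: it simply tries every pair of insertion positions. It also assumes straight-line distances in the plane instead of road-network distances and ignores deadlines and seat capacity; `best_insertion` and the other names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Straight-line distance (assumption: stands in for road-network distance)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def route_cost(route: List[Point]) -> float:
    """Total travel cost of visiting the stops in the given order."""
    return sum(dist(route[i], route[i + 1]) for i in range(len(route) - 1))


def best_insertion(route: List[Point], pickup: Point,
                   dropoff: Point) -> Tuple[float, List[Point]]:
    """Insert pickup and dropoff into the route, keeping existing stops in order.

    route[0] is the vehicle's current location, so both new stops are placed
    after it, and the pickup always precedes the dropoff.
    Returns (extra travel cost, new route) for the cheapest placement.
    """
    base = route_cost(route)
    best = None
    for i in range(1, len(route) + 1):          # position of the pickup
        for j in range(i, len(route) + 1):      # position of the dropoff
            candidate = route[:i] + [pickup] + route[i:j] + [dropoff] + route[j:]
            extra = route_cost(candidate) - base
            if best is None or extra < best[0]:
                best = (extra, candidate)
    return best
```

The double loop costs O(n^2) route evaluations for a schedule of n stops, which is acceptable for a sketch but is presumably what an efficient algorithm would avoid.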
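[Vocabulary] ridesharing: sharing a car or a ride;
formulate (vt.): to plan; to express as a formula; to state precisely;
utility (n.): usefulness; utility value; public utility; function; (adj.) practical; general-purpose; multi-purpose;
constraints (n.): [math] restrictions; limiting conditions (plural of constraint);
subsequently (adv.): afterwards; later;
in the sequel: later on; in what follows; as a result;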

Distance Oracle on Terrain Surface

Efficient Computation of Top-k Frequent Terms over Spatio-temporal Ranges

The wide availability of tracking devices has drastically increased the role of geolocation in social networks, resulting in new commercial applications; for example, marketers can identify current trending topics within a region of interest and focus their products accordingly. In this paper we study a basic analytics query on geotagged data, namely: given a spatiotemporal region, find the most frequent terms among the social posts in that region. While there has been prior work on keyword search on spatial data (find the objects nearest to the query point that contain the query keywords), and on group keyword search on spatial data (retrieving groups of objects), our problem is different in that it returns keywords and aggregated frequencies as output, instead of having the keyword as input. Moreover, we differ from works addressing the streamed version of this query in that we operate on large, disk resident data and we provide exact answers. We propose an index structure and algorithms to efficiently answer such top-k spatiotemporal range queries, which we refer to as Top-k Frequent Spatiotemporal Terms (kFST) queries. Our index structure employs an R-tree augmented by top-k sorted term lists (STLs), where a key challenge is to balance the size of the index to achieve faster execution and smaller space requirements. We theoretically study and experimentally validate the ideal length of the stored term lists, and perform detailed experiments to evaluate the performance of the proposed methods compared to baselines on real datasets.
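To make the kFST query concrete, here is a minimal full-scan baseline: it filters posts by the spatiotemporal range and counts terms. It does not implement the R-tree with sorted term lists from the abstract; the `Post` record layout and the function name `kfst_scan` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical post record: (x, y, timestamp, terms in the post)
Post = Tuple[float, float, float, List[str]]
Range = Tuple[float, float]  # inclusive (low, high)


def kfst_scan(posts: Iterable[Post], x_range: Range, y_range: Range,
              t_range: Range, k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent terms among posts inside the query region."""
    counts: Counter = Counter()
    for x, y, t, terms in posts:
        inside = (x_range[0] <= x <= x_range[1]
                  and y_range[0] <= y <= y_range[1]
                  and t_range[0] <= t <= t_range[1])
        if inside:
            counts.update(terms)
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

A scan like this touches every post for every query, which is the cost an index over large, disk-resident data is meant to avoid.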


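novel data: new, previously unseen data;
spatiotemporal (adj.): of space and time; existing in both time and space; user-specified: specified by the user; intersection (n.): crossing; crossroads; intersection (of sets); crossing point;
leverage (n.): means, influence; lever action; (v.) to make use of; to borrow in order to fund operations;
degrade (vt.): to belittle; to disgrace; to demote; to break down; (vi.) to drop in level; to deteriorate;
materialize (vi.): to be realized, to take shape; to appear suddenly; (vt.) to make concrete or tangible; to cause to appear suddenly; to make materialistic;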

Scaling Locally Linear Embedding

Locally Linear Embedding (LLE) is a popular approach to dimensionality reduction as it can effectively represent nonlinear structures of high-dimensional data. For dimensionality reduction, it computes a nearest neighbor graph from a given dataset where edge weights are obtained by applying the Lagrange multiplier method, and it then computes eigenvectors of the LLE kernel where the edge weights are used to obtain the kernel. Although LLE is used in many applications, its computation cost is significantly high. This is because, in obtaining edge weights, its computation cost is cubic in the number of edges to each data point. In addition, the computation cost in obtaining the eigenvectors of the LLE kernel is cubic in the number of data points. Our approach, Ripple, is based on two ideas: (1) it incrementally updates the edge weights by exploiting the Woodbury formula and (2) it efficiently computes eigenvectors of the LLE kernel by exploiting the LU decomposition-based inverse power method. Experiments show that Ripple is significantly faster than the original approach of LLE by guaranteeing the same results of dimensionality reduction.
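For reference, the Woodbury formula named in idea (1) is the standard matrix identity below. The abstract does not say how Ripple applies it to the edge-weight computation, so only the general identity is given; the symbols A, U, C, V are generic and not taken from the paper.

```latex
(A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}
```

Here A is an invertible n×n matrix, C is an invertible k×k matrix, U is n×k and V is k×n. When A^{-1} is already known and k is much smaller than n, the right-hand side only requires inverting a k×k matrix, which is why the identity suits incremental updates.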


eigenvectors (n.): [math] characteristic vectors (plural of eigenvector);
cubic (adj.): cube-shaped; of the third power;
decomposition (n.): breaking down into parts; decay; spoilage;

Dynamic Density Based Clustering

Dynamic clustering—how to efficiently maintain data clusters along with updates in the underlying dataset—is a difficult topic. This is especially true for density-based clustering, where objects are aggregated based on transitivity of proximity, under which deciding the cluster(s) of an object may require the inspection of numerous other objects. The phenomenon is unfortunate, given the popular usage of this clustering approach in many applications demanding data updates. Motivated by the above, we investigate the algorithmic principles for dynamic clustering by DBSCAN, a successful representative of density-based clustering, and ρ-approximate DBSCAN, proposed to bring down the computational hardness of the former on static data. Surprisingly, we prove that the ρ-approximate version suffers from the very same hardness when the dataset is fully dynamic, namely, when both insertions and deletions are allowed. We also show that this issue goes away as soon as tiny further relaxation is applied, yet still ensuring the same quality—known as the “sandwich guarantee”—of ρ-approximate DBSCAN. Our algorithms guarantee near-constant update processing, and outperform existing approaches by a factor over two orders of magnitude.


Extracting Top-K Insights from Multi-dimensional Data

OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper takes the first attempt towards automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but also reduces the manual effort of data analysts. In particular, we propose the concept of insight which captures interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of measure by a dimension). An example insight is: “Brand B’s rank (across brands) falls along the year, in terms of the increase in sales”. Our problem is to compute the top-k insights by a score function. It poses challenges on (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.


QUILTS: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves

Recently, massive data management plays an increasingly important role in data analytics because data access is a major bottleneck. Data skipping is a promising technique to reduce the number of data accesses. Data skipping partitions data into pages and accesses only pages that contain data to be retrieved by a query. Therefore, effective data partitioning is required to minimize the number of page accesses. However, it is an NP-hard problem to obtain optimal data partitioning given query pattern and data distribution.

We propose a framework that involves a multidimensional indexing technique based on a space-filling curve. A space-filling curve is a way to define which portion of data can be stored in the same page. Therefore, the problem can be interpreted as selecting a curve that distributes data to be accessed by a query to minimize the number of page accesses. To solve this problem, we analyzed how different space-filling curves affect the number of page accesses. We found that it is critical for a curve to fit a query pattern and be robust against any data distribution. We propose a cost model for measuring how well a space-filling curve fits a given query pattern and tolerates data skew. Also we propose a method for designing a query-aware and skew-tolerant curve for a given query pattern.

We prototyped our framework using the defined query-aware and skew-tolerant curve. We conducted experiments using a skew data set, and confirmed that our framework can reduce the number of page accesses by an order of magnitude for data warehousing (DWH) and geographic information systems (GIS) applications with real-world data.
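The link between a space-filling curve, page layout and data skipping can be sketched with the classic Z-order (Morton) curve. This is a standard curve used purely for illustration, not the query-aware and skew-tolerant curve the paper designs; the function names and the bounding-box page summary are assumptions.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (Z-order / Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


def build_pages(points: List[Point], page_size: int) -> List[List[Point]]:
    """Sort points along the curve and cut the sequence into fixed-size pages."""
    ordered = sorted(points, key=lambda p: z_value(p[0], p[1]))
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def pages_accessed(pages: List[List[Point]], x_lo: int, x_hi: int,
                   y_lo: int, y_hi: int) -> int:
    """Count pages whose bounding box overlaps the query rectangle.

    Pages that do not overlap can be skipped without being read.
    """
    hits = 0
    for page in pages:
        xs = [p[0] for p in page]
        ys = [p[1] for p in page]
        if (min(xs) <= x_hi and max(xs) >= x_lo
                and min(ys) <= y_hi and max(ys) >= y_lo):
            hits += 1
    return hits
```

On a 4×4 grid with four points per page, a query for the lower-left 2×2 square touches one Z-ordered page but two row-ordered pages, a small instance of how the choice of curve changes the number of page accesses.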

[Vocabulary] skew (n.): oblique crossing; slant; (adj.) oblique; slanted;

3.18 Optimization and Main Memory

Optimizing Iceberg Queries with Complex Joins

Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for combining a number of techniques—a-priori, memoization, and pruning—to optimize iceberg queries with complex joins. A-priori pushes partial GROUP BY and HAVING condition before a join to reduce its input size. Memoization caches and reuses join computation results. Pruning uses cached results to infer that certain tuples cannot contribute to the final query result, and short-circuits join computation. We formally derive conditions for correctly applying these techniques. Our practical rewrite algorithm produces highly efficient SQL that can exploit combinations of optimization opportunities in ways previously not possible. We evaluate our PostgreSQL-based implementation experimentally and show that it outperforms both baseline PostgreSQL and a commercial database system.
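The a-priori idea (apply a partial GROUP BY and HAVING before the join to shrink its input) can be illustrated on the classic market-basket iceberg query: find item pairs that co-occur in at least `threshold` baskets. This is a textbook instance chosen for illustration, not the paper's rewrite algorithm, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def frequent_pairs_naive(baskets: List[List[str]], threshold: int) -> Dict[Pair, int]:
    """Self-join every basket, group by item pair, then apply HAVING."""
    counts: Counter = Counter()
    for items in baskets:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}


def frequent_pairs_apriori(baskets: List[List[str]], threshold: int) -> Dict[Pair, int]:
    """Same query, with a per-item GROUP BY / HAVING pushed below the join.

    A pair can only appear in `threshold` baskets if each of its items does,
    so infrequent items are dropped before pairs are formed.
    """
    item_counts = Counter(item for items in baskets for item in set(items))
    frequent = {item for item, c in item_counts.items() if c >= threshold}
    counts: Counter = Counter()
    for items in baskets:
        kept = sorted(set(items) & frequent)   # reduced join input
        for pair in combinations(kept, 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}
```

Both functions return the same answer; the second forms fewer candidate pairs whenever some items fall below the threshold on their own.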


The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates

Modern computing tasks such as real-time analytics require refresh of query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates. In this paper, we show that the full materialization of results is a barrier for more general optimization strategies. In particular, we present a new approach for evaluating queries under updates. Instead of the materialization of results, we require a data structure that allows: (1) linear time maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We call such a structure a Dynamic Constant-delay Linear Representation (DCLR) for the query. We show that Dyn, a dynamic version of the Yannakakis algorithm, yields DCLRs for the class of free-connex acyclic CQs. We show that this is optimal in the sense that no DCLR can exist for CQs that are not free-connex acyclic. Moreover, we identify a sub-class of queries for which Dyn features constant-time update per tuple and show that this class is maximal. Finally, using the TPC-H and TPC-DS benchmarks, we experimentally compare Dyn and a higher-order IVM (HIVM) engine. Our approach is not only more efficient in terms of memory consumption (as expected), but is also consistently faster in processing updates.


Revisiting Reuse in Main Memory Database Systems

Reusing intermediates in databases to speed-up analytical query processing was studied in prior work. Existing solutions require intermediate results of individual operators to be materialized using materialization operators. However, inserting such materialization operations into a query plan not only incurs additional execution costs but also often eliminates important cache- and register-locality opportunities, resulting in even higher performance penalties. This paper studies a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases to perform join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our prototype implementation, demonstrate performance gains of 2× for typical analytical workloads with no additional overhead for materializing intermediates.
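The reuse model in the abstract (keep the hash table a join had to build anyway and hand it to later queries) can be sketched in a few lines. This is a deliberately simplified illustration: `HashTableCache` and `hash_join` are hypothetical names, rows are plain dictionaries, and the cost model, cache eviction and invalidation on updates are all left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


class HashTableCache:
    """Keeps hash tables built by earlier joins, keyed by (table, join column)."""

    def __init__(self) -> None:
        self._tables: Dict[Tuple[str, str], Dict[object, List[Row]]] = {}
        self.builds = 0  # number of times a hash table was actually built

    def get_or_build(self, table_name: str, rows: List[Row],
                     key: str) -> Dict[object, List[Row]]:
        cache_key = (table_name, key)
        if cache_key not in self._tables:
            table: Dict[object, List[Row]] = defaultdict(list)
            for row in rows:
                table[row[key]].append(row)
            self._tables[cache_key] = table
            self.builds += 1
        return self._tables[cache_key]


def hash_join(cache: HashTableCache, build_name: str, build_rows: List[Row],
              probe_rows: List[Row], key: str) -> List[Row]:
    """Equi-join that reuses a cached build-side hash table when one exists."""
    table = cache.get_or_build(build_name, build_rows, key)
    # .get avoids creating empty entries in the shared table while probing.
    return [{**b, **p} for p in probe_rows for b in table.get(p[key], [])]
```

Two queries that join different probe inputs against the same build side then pay for the build phase only once.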


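materialization (n.): making material; giving concrete form; realization;
internal (adj.): inner; inside; within the body; within (an organization);
prototype (n.): an original model; a standard, an exemplar;

Leveraging Re-costing for Online Optimization of Parameterized Queries with Guarantees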

Handling Environments in a Nested Relational Algebra with Combinators and an Implementation in a Verified Query Compiler

From In-Place Updates to In-Place Appends: Revisiting Out-of-Place Updates on Flash

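Upgrading and patching databases is routine work for us. Under normal circumstances there are two factors we must consider: the downtime window and the rollback plan. With Oracle, even the simplest update can hardly achieve "zero downtime". The rollback plan is there so that, once a problem is found in the new version, we can quickly revert to the original version and keep serving application access.

Oracle currently recommends two approaches for large-scale upgrades: In-Place and Out-of-Place. With the In-Place approach, the upgrade is performed directly in the existing Database Home directory. Out-of-Place instead uses a new Oracle Database Home directory. Compared with the In-Place strategy, Out-of-Place consumes more space.

However, the benefits of Out-of-Place are also fairly clear: first, rollback is more convenient, and it also has a considerable advantage in downtime.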

3.19 Privacy

Pufferfish Privacy Mechanisms for Correlated Data

Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics

Pythia: Data Dependent Differentially Private Algorithm Selection

Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics

3.20 Crowdsourcing

Crowdsourced Top-k Queries by Confidence-Aware Pairwise Judgments

Crowdsourced query processing is an emerging processing technique that tackles computationally challenging problems by human intelligence. The basic idea is to decompose a computationally challenging problem into a set of human friendly microtasks (e.g., pairwise comparisons) that are distributed to and answered by the crowd. The solution of the problem is then computed (e.g., by aggregation) based on the crowdsourced answers to the microtasks. In this work, we attempt to revisit the crowdsourced processing of the top-k queries, aiming at (1) securing the quality of crowdsourced comparisons by a certain confidence level and (2) minimizing the total monetary cost. To secure the quality of each paired comparison, we employ two statistical tools, Student’s t-distribution estimation and Stein’s estimation, to estimate the confidence interval of the underlying mean value, which is then used to draw a conclusion to the comparison. Based on the pairwise comparison process, we attempt to minimize the monetary cost of the top-k processing within a Select-Partition-Rank framework. Our experiments, conducted on four real datasets, demonstrate that our stochastic method outperforms other existing top-k processing techniques by a visible difference.
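The confidence-aware comparison described above can be sketched with a Student's t confidence interval on the mean of the workers' judgments. The setup is an assumption made for illustration: each judgment is a number that is positive when the worker prefers item a and negative when the worker prefers item b, and the caller supplies the t critical value `t_crit` for the chosen confidence level and n − 1 degrees of freedom. Stein's estimation and the Select-Partition-Rank framework are not covered.

```python
import math
import statistics
from typing import List, Optional


def compare_with_confidence(judgments: List[float], t_crit: float) -> Optional[str]:
    """Decide a pairwise comparison once the confidence interval excludes zero.

    Returns "a" or "b" when the interval for the mean preference lies entirely
    on one side of zero, and None when more judgments are needed.
    """
    n = len(judgments)
    if n < 2:
        return None  # a sample standard deviation needs at least two values
    mean = statistics.mean(judgments)
    half_width = t_crit * statistics.stdev(judgments) / math.sqrt(n)
    if mean - half_width > 0:
        return "a"
    if mean + half_width < 0:
        return "b"
    return None
```

For example, with five judgments and a 95% two-sided level, `t_crit` is about 2.776 (four degrees of freedom). Returning `None` is where a cost-aware scheme would decide whether buying more judgments for this pair is worthwhile.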


Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services

CrowdDQS: Dynamic Question Selection in Crowdsourcing Systems

CDB: Optimizing Queries with Crowd-Based Selections and Joins
