Skip to content

vazois/TopK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TopK Algorithms Benchmark

Note: switch to the appropriate branch to run the algorithms (cpu_ks for CPUs, gpu_topk for GPUs)

This repository contains the source code for a collection of efficient parallel in-memory Top-k selection algorithms. CPU-only versions are verified and tested for correctness. GPU-only implementations are still under development.

The following is a short description of each CPU algorithm:

  • cpu/TA.h
    - Scalar implementation of the Threshold Algorithm using standard C++ library.

  • cpu/TPAc.h
    - Scalar, SIMD, SIMD + multi-threaded, and multi-query implementation of Full Table Evaluation (FTE) using column major order.

  • cpu/TPAr.h
    - Scalar, SIMD, and SIMD + multi-threaded implementation of Full Table Evaluation (FTE) using row major order.

  • cpu/VTA.h
    - Scalar, SIMD, SIMD + multi-threaded, and multi-query implementation of Vectorized Threshold Algorithm (VTA).

  • cpu/SLA.h
    - Scalar, SIMD, and SIMD + multi-threaded implementation of Skyline Layered Algorithm (SLA).

  • cpu/PTA.h
    - Scalar, SIMD, SIMD + multi-threaded, and multi-query implementation of Partitioned Threshold Algorithm (PTA).

Behnchmark Instructions

To evaluate the above implementations, you need to configure debug.sh. Following we provide a short description of those parameters and a few example how to configure them.

  • START_N, END_N
    - For synthetic data, choose number of objects to test.

  • DIMS
    - For synthetic data, choose number of attributes for each object. Specified also queries to be tested (i.e. DIMS=8, queries on 2,3,4,5,6,7,8 attributes).

  • KKS, KKE
    - Select range of k to test incrementing by a factor of 2 (i.e. KKS=16 KKE=128, 16,32,64,128).

  • LD
    - Specified synthetic or real data. LD=0 synthetic data on disk, LD=1 synthetic data generated in-memory, LD=3 real data.

  • distr
    - Choose data distribution (i.e. distr=c(orrelated), distr=i(ndependent), distr=a(nticorrelated) to generate on disk or in-memory.

  • script
    - script name used to generate synthetic data on disk.

  • REAL_DATA_PATH
    - Path to real data file. It needs to follow a specific naming convention __

  • device
    - Choose between cpu or gpu evaluation. device=0, cpu-only for now.

  • QM=0
    - Query attributes starting from the last or first attribute, 0 or 1 respectively

  • QD
    - Query interval testing (i.e. QD=2 then 2,4,6,8).

  • IMP
    - Choose what implementation to test (i.e. Scalar=0, SIMD=1, Threads=2, Multi-query random dimension=3, Multi-query fixed dimension=4)

  • ITER
    - Choose number of iterations to benchmark a query.

  • MQTHREADS
    - Multi-query number of threads.

  • STATS_EFF
    - Gather statistics associated with number of objects evaluated.

  • WORKLOAD
    - Number of queries generated for multi-query evaluation.

  • TA_B
    - Enable TA benchmark.

  • TPAc_B
    - Enable FTE column major benchmark.

  • TPAr_B
    - Enable FTE row major benchmark.

  • VTA_B
    - Enable VTA benchmark.

  • PTA_B
    - Enable PTA benchmark.

  • SLA_B
    - Enable SLA benchmark.