Outline


Motivation


A little history


Concept/Terminology


Paradigm



Where to start

A simple example

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  • single cpu | 2.0GHz | 0.5 ns per cycle | 4 pipelines
  • + | C float add | 5 cycles
  • $$k \ | \ 2 \times 10^{8} \ adds \ | \ 1s$$
  • datasize | 1TB | simple add | 12 min (rough estimate below)
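
A rough back-of-the-envelope for the last line (assuming 4-byte floats, which the numbers above do not state):

$$\frac{1TB}{4B} \approx 2.5 \times 10^{11} \ adds \ | \ \frac{2.5 \times 10^{11}}{2 \times 10^{8}/s} \approx 1250\,s \approx 21\,min$$

i.e. the same order of magnitude as the quoted 12 min: even the cheapest operation over 1TB keeps a single core busy for many minutes.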

Case1

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + d_{6} ... + d_{k-1} + d_{k} $$

  • d | float array
  • + | user-defined func | 2000 cycles
  • $$k \ | \ 0.5 \times 10^{7} \ +s$$
  • datasize | 1TB | user-defined func | 8.5 h

Case2

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  • $$+ \ | \ \cap \cup \ like \ op $$
  • memory explosion

Optimization is always confusing

  • optimization vs parallel computing
    • efficiency makes a qualitative difference? | life is limited
    • distributed memory makes a qualitative difference | is the dataset big enough?
    • parallel computing makes things happen


A little history

  • line0
    • human being | neurons in brain
  • line1
    • 1960s | vector machines
    • 1970s | SMP (Symmetric Multiprocessing) | ~10 CPUs
    • 1980s-2000 | NUMA/MPP (Massively Parallel Processing) | 100+ CPUs
    • 2000- | Intel | pipelining in the core | multicore
    • 2005- | hybrid systems | systems accelerated with GPU/MIC | 1000+ CPUs | peak?
  • line2
    • 2000- | google | commodity clusters | distributed systems to make use of servers across the room | paxos
    • fat tree, hypercube, etc. | SC conference | DCN (hot research topic @MSR)
    • mapreduce | lots of computational frameworks
    • douban | single room | all connected


How

A simple example

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  • suppose k is a power of 2
  • Algorithms?
  • solution1: master node
  • solution2: binary tree | node(value) edge(communication)

Move on

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  • partition | map | data distribution | load balance
  • parallel calc | simultaneous
  • sync | communication
  • (all)reduce
  • bug?
  • consistency | small dataset to verify

Cost Model

Amdahl's law

  • r | portion which can be parallelized
  • $$T(p) = (\frac{r}{p} + 1 - r) T(1)$$
  • $$r \rightarrow 1 \ | \ T(p) \rightarrow \frac{T(1)}{p} \ | \ px \ (linear) \ | \ :)$$
  • $$r \rightarrow 0.5 \ | \ T(p) \rightarrow (0.5 + \frac{0.5}{p}) T(1) \ | \ p \rightarrow \infty \ | \ 2x \ | \ :($$
  • r is important | critical path | data/procedure level
  • scalability | data | process unit | time
  • no communication cost | toooooo ideal
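
A quick worked example of the law (the numbers are picked here purely for illustration): take r = 0.95 and p = 16,

$$T(16) = (\frac{0.95}{16} + 0.05)\,T(1) \approx 0.11\,T(1) \ | \ speedup \approx 9x \ | \ p \rightarrow \infty \ | \ speedup \rightarrow \frac{1}{1-r} = 20x$$

so even a 95%-parallel job never gets past 20x, no matter how many processors are added.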

Cost Model (Cont.)

A better model (calc_cost + comm_cost)

  • $$\alpha + M\beta \ | \ 10^{-3}+M10^{-8} \ | \ simplest \ communication \ model$$
  • $$solution1 \ | \ \frac{T(1)}{p} + (\beta\frac{M}{p} + \alpha) (p - 1) \ | \ :($$
  • $$solution2 \ | \ \frac{T(1)}{p} + (\beta\frac{M}{p} + \alpha) \log p \ | \ :)$$
  • $$general \ | \ pT(p) = C + C_{1}(p - 1) + C_{2} p (p - 1)$$
  • $$C \ is \ the \ portion \ that \ can \ be \ parallelized$$
  • $$C_{1} \ is \ the \ bandwidth \ cost$$
  • $$C_{2} \ is \ the \ latency \ cost$$
  • packing messages is important | 64MB is a large message @douban (dpark) | reduces message cost (latency)
  • stagnation point | scalability (worked numbers below)
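
To see why solution2 wins, plug in the model's numbers α = 10^{-3}, β = 10^{-8}, assume the reduce messages are latency dominated (βM/p ≪ α), and take p = 1024:

$$solution1 \ | \ (p-1)\alpha \approx 1\,s \ \ \ \ \ \ solution2 \ | \ \alpha\log_{2}p \approx 0.01\,s$$

two orders of magnitude of communication cost between the master-node and the binary-tree reduce.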

More of Scalability

different partitions lead to different communication

  • 2D data (trick concept) | n x n grid | O(n^2) elements | 1D partition vs 2D partition
  • $$1D \ | \ O(np) \ | \ \frac{n^{2}}{Cnp} \Rightarrow n\ \sim \ cp \ | \ \frac{n^{2}}{p}=\frac{c^{2}p^{2}}{p} \ | \ c^{2}p$$
  • $$2D \ | \ O(\frac{n}{\sqrt{p}}p) \ | \ \frac{n^{2}}{Cn\sqrt{p}} \Rightarrow n\ \sim \ c\sqrt{p} \ | \ \frac{n^{2}}{p}=\frac{c^{2}p}{p} \ | \ c^{2}$$

Implementation

Still the simple example

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  • fork,fork | map
  • SPMD (Single Program Multiple Data) | if (rank == ?) | calc
  • Send/Recv | communication
  • barrier | sync
  • yes, we design a tiny distributed framework | kiss (keep it simple, stupid)
  • that's almost enough, trust me

Pairwise Summation (Solution1)

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$

  // C++-style pseudo code for Solution 1: every rank sums its own chunk,
  // then a master-node reduce collects the partial sums on rank 0.
  type parallel_load(...) {...}    // partition the input and load the local chunk
  type sum_calc(data, comm) {...}  // sum the local chunk
  float reduce(float sum, comm) {
    float tmp = 0.0, result = sum;
    int cnt = comm.Get_size() - 1;       // partial sums the master still has to receive
    if (comm.Get_rank() == 0) {          // master: receive and fold in every partial sum
      while (cnt--) { kiss_Recv(tmp, from = ANY); result += tmp; }
    } else {                             // worker: ship the local partial sum to rank 0
      kiss_Send(sum, to = 0);
    }
    return result;                       // only rank 0 holds the global sum
  }
  int main(int argc, char *argv[]) {
    kiss_Init(...);
    auto data = parallel_load("a.txt", comm);
    auto local_sum = sum_calc(data, comm);
    reduce(local_sum, comm);
    kiss_Finalize();
  }

Pairwise Summation (Solution2)
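
A minimal sketch of the binary-tree reduce, written in the same kiss-style pseudo code as Solution 1 (kiss_Send/kiss_Recv and comm are the same hypothetical primitives): each round pairs up the surviving ranks, so rank 0 holds the global sum after ceil(log2 p) rounds instead of p - 1 sequential receives.

  // Tree reduce: in the round with distance 'step', every rank that is a
  // multiple of 2*step receives from rank+step; the sending rank is then done.
  float tree_reduce(float sum, comm) {
    int rank = comm.Get_rank(), size = comm.Get_size();
    float tmp = 0.0, result = sum;
    for (int step = 1; step < size; step *= 2) {
      if (rank % (2 * step) != 0) {        // my turn to hand over the partial sum
        kiss_Send(result, to = rank - step);
        break;                             // this rank is finished
      }
      if (rank + step < size) {            // partner exists: fold its partial sum in
        kiss_Recv(tmp, from = rank + step);
        result += tmp;
      }
    }
    return result;                         // only rank 0 holds the global sum
  }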


MPI/RPC

simple but low level | no algorithm-level abstraction | general

MPI_Init

MPI_Send/MPI_Recv

MPI_Barrier

MPI_Reduce (we already implemented one) | MPI_Bcast ...

to programmers:

data distribution(partition + map)

calc

communication

sync
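
For comparison, a minimal sketch of the same global sum with real MPI calls; the per-rank partial sum is only a placeholder value here, standing in for the parallel_load + sum_calc steps above:

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float local_sum = (float)rank + 1.0f;  // placeholder for the real per-rank partial sum
    float global_sum = 0.0f;

    // MPI_Reduce performs the tree-style combine we wrote by hand above
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
  }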

Openmp/Pthread

shared memory programming

openmp is usually implemented on top of pthreads

openmp is mostly used for compute-intensive for loops (see the sketch below)
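
A minimal sketch of that case in C++ with OpenMP (the data here is a toy array, assumed purely for illustration); the reduction clause gives every thread a private partial sum and combines them when the loop ends:

  #include <omp.h>
  #include <vector>
  #include <cstdio>

  int main() {
    std::vector<float> d(1 << 20, 1.0f);   // toy data: 2^20 ones
    float sum = 0.0f;

    // the iterations are split across threads; reduction(+:sum) merges the partial sums
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)d.size(); ++i)
      sum += d[i];

    std::printf("sum = %g, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
  }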

Hybrid MPI with OpenMP

clusters of multicore servers

linear speedup is waiting for you

MapReduce

lisp | google | Jeff Dean

parallel 'map' and 'reduce' calc | key-value

structured communication(disk I/O stream)

middle level framework

mapreduce vs MPI:

'map'=linesplit + map

'reduce' = reduce

no shuffle in MPI | partition in MPI is more difficult

easy deploy | pool locality | model transfer | job

MapReduce in Action

# wordcount with douban's mapreduce framework ('io' is provided by the framework)
import re

delimiter = re.compile('[^-a-zA-Z0-9_]')

def wc_map(line):
    # emit a (word, 1) pair for every token on the line
    return [(w, 1) for w in delimiter.split(line) if w]

def wc_reduce(a, b):
    # combine two counts for the same word
    return int(a) + int(b)

# reduced records are emitted as "<count>\t<word>" (see the lambda in init),
# so the sort key and both comparators look at the leading count, descending
def sort_key(a):
    return a.split('\t', 1)[0]

def sort_kcmp(a, b):
    return -cmp(int(a), int(b))

def sort_vcmp(a, b):
    return -cmp(int(a.split('\t', 1)[0]), int(b.split('\t', 1)[0]))

def init(_):
    io.map_reduce(wc_map, wc_reduce, lambda w, c: c + "\t" + w)
    io.sort(sort_key, sort_kcmp, sort_vcmp)

Spark/Dpark

based on mesos | mesos is a better fork,fork(kiss)

memory cached | lineage to make fault-tolerant(hang on)

high level class

serial code style

too many APIs

hard for wide-dependency applications

Why Distributed Computational Framework?

easy code | stay focused on the app

abstraction | code reuse

distributed computational ability

great property from computing models

More

hadoop(mapreduce)

bsp

hama(pregel)

storm(twitter)

piccolo

parasol

graphlab

...

Which one is better

from low level to high level | implementation dependency

commodity cluster | no GPU | no supercomputer

mid-low level with simple api | general

glue | good husband

What we can do

discussion...

Simple Summation Plus

parallelism is sometimes not obvious

fibonacci sequence | n-th term | prefix sums

thinking...

Simple Summation Plus (Cont.)

tree-reduce-like process

segmented scan | [1 2 3 4 5 6 7 8 9 10] with flags [1 0 0 0 1 0 1 1 0 1] : [1 3 6 10 5 11 7 8 17 10]

$$matrix \ inversion \ (Csanky) \ | \ A^{-1} = -\frac{1}{c_{n}}(A^{n-1} + c_{1}A^{n-2} + \cdots + c_{n-1}I)$$

$$summator \ | \ a_{3}a_{2}a_{1}a_{0} + b_{3}b_{2}b_{1}b_{0}$$

fibonacci seqs

...

all recursion-pattern algorithms | seemingly sequential, yet they can be made parallel (a sketch follows)
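
A minimal sketch of that sequential-to-parallel pattern, using the prefix sum as the example: the Hillis-Steele doubling scheme below (plain C++, written here for illustration) runs log2(n) rounds, and inside each round the loop has no dependences, so it is exactly the kind of loop the earlier OpenMP/MPI machinery can take over.

  #include <vector>
  #include <cstdio>

  // Inclusive prefix sum by doubling: in the round with distance 'offset',
  // every element adds the value 'offset' positions to its left (if any).
  std::vector<float> prefix_sum(std::vector<float> x) {
    for (size_t offset = 1; offset < x.size(); offset *= 2) {
      std::vector<float> y = x;                   // double buffering keeps the round data-parallel
      for (size_t i = offset; i < x.size(); ++i)  // no dependences inside this loop
        y[i] = x[i] + x[i - offset];
      x.swap(y);
    }
    return x;                                     // x[i] = d_1 + ... + d_{i+1}
  }

  int main() {
    std::vector<float> d = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    for (float v : prefix_sum(d)) std::printf("%g ", v);  // 1 3 6 10 15 21 28 36 45 55
    std::printf("\n");
  }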

Summarize

$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} \ | \ a \ great \ example$$

optimization problems | + | iterative | serial | a challenge

why (motivation)

a little history

what (is parallel computing)

How (kiss)

what (is distributed computational framework)

seriesII (end of Oct): bsp, pregel (in action)

seriesIII (middle of Nov): parasol (how to build a demo distributed system)

Thank You!

contact info