Motivation
A little history
Concept/Terminology
Paradigm
A simple example
$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + \dots + d_{k-1} + d_{k} $$
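One way to read this example, assuming the k numbers are split evenly across p workers (p and the chunking are illustrative, not from the slides): each worker sums its own chunk independently, and only the p partial sums have to be combined afterwards.

$$ \sum_{i=1}^{k} d_{i} \;=\; \underbrace{(d_{1} + \dots + d_{k/p})}_{\text{worker } 1} \;+\; \underbrace{(d_{k/p+1} + \dots + d_{2k/p})}_{\text{worker } 2} \;+\; \dots \;+\; \underbrace{(\dots + d_{k})}_{\text{worker } p} $$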
Amdahl's law
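For reference, Amdahl's law (p, n, S are the usual textbook symbols, not from the slides): if only a fraction p of the work can be parallelized across n workers, the speedup is bounded by the serial remainder.

$$ S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1-p} $$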
A better model: total cost = calc_cost + comm_cost
different partitions lead to different communication costs
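A concrete instance for the running sum example (k, p, t_a, t_m are illustrative symbols: k numbers, p workers, t_a per addition, t_m per message sent to the combining worker):

$$ T(p) \;\approx\; \underbrace{\tfrac{k}{p}\, t_{a}}_{\text{calc\_cost}} \;+\; \underbrace{(p-1)\, t_{m}}_{\text{comm\_cost}} $$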
Still the simple example
$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + ... + d_{k-1} + d_{k} $$
// C++-style pseudocode: parallel sum with an explicit reduce step.
type parallel_load(...) { ... }   // each rank loads its own slice of the input
type sum_calc(data)     { ... }   // each rank sums its local slice

float reduce(float sum, Comm comm) {
    float tmp = 0.0, result = sum;
    int cnt = comm.Get_size() - 1;
    if (comm.Get_rank() == 0) {
        // rank 0 collects the partial sums from every other rank
        while (cnt--) {
            kiss_Recv(tmp, from = ANY);
            result += tmp;
        }
    } else {
        // every other rank just ships its partial sum to rank 0
        kiss_Send(sum, to = 0);
    }
    return result;   // only meaningful on rank 0
}

int main(int argc, char *argv[]) {
    kiss_Init(...);
    auto data      = parallel_load("a.txt", comm);
    auto local_sum = sum_calc(data);
    auto total     = reduce(local_sum, comm);
    kiss_Finalize();
}
Motivation
A little history
Concept/Terminology
Paradigm
simple but low level | no algorithm-level abstraction | general
MPI_Init
MPI_Send/MPI_Recv
MPI_Barrier
MPI_Reduce (we already implemented one above) | MPI_Bcast | ... (a sketch with the real API follows below)
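A minimal sketch of the running sum written against the real MPI API (data loading is faked with a placeholder value; only the MPI calls themselves are real):

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Placeholder for parallel_load + sum_calc: pretend each rank
    // already summed its own slice of the data.
    double local_sum = rank + 1.0;

    // MPI_Reduce does what our hand-written reduce() did above.
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}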
left to the programmer:
data distribution (partition + map)
calc
communication
sync
shared-memory programming
OpenMP, typically built on top of pthreads
OpenMP is mostly used for compute-intensive for loops (see the sketch below)
multicore servers | clusters
linear speedup is waiting for you
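A minimal OpenMP sketch of the same sum (the array setup is illustrative; the pragma carries the whole idea):

#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int k = 1 << 20;
    std::vector<double> d(k, 1.0);        // illustrative data: k numbers to add up

    double sum = 0.0;
    // Each thread sums its own chunk of the loop; the reduction clause
    // combines the per-thread partial sums at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < k; ++i)
        sum += d[i];

    std::printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}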
Lisp | Google | Jeff Dean
parallel 'map' and 'reduce' calculations | key-value pairs
structured communication (disk I/O streams)
a mid-level framework
MapReduce vs MPI:
'map' = line split + map
'reduce' = reduce
no shuffle in MPI | partitioning in MPI is more difficult
easy deploy | pool locality | model transfer | job
# wordcount with douban's mapreduce framework
# (Python 2 style: `cmp` built-in; `io` is the job object provided by the framework)
import re

delimiter = re.compile('[^-a-zA-Z0-9_]')

def wc_map(line):
    # emit (word, 1) for every word on the line
    return [(w, 1) for w in delimiter.split(line) if w]

def wc_reduce(a, b):
    # add up the counts for one word
    return int(a) + int(b)

def sort_key(a):
    return a.split('\t', 1)[0]

def sort_kcmp(a, b):
    # descending by count
    return -cmp(int(a), int(b))

def sort_vcmp(a, b):
    return -cmp(int(a.split('\t', 1)[0]), int(b.split('\t', 1)[0]))

def init(_):
    io.map_reduce(wc_map, wc_reduce, lambda w, c: c + "\t" + w)
    io.sort(sort_key, sort_kcmp, sort_vcmp)
based on Mesos | Mesos as a better fork() (kiss)
in-memory caching | lineage for fault tolerance (hang on)
a high-level framework
serial-looking code style
too many APIs
hard for applications with wide dependencies
easy coding | stay focused on the app
abstraction | code reuse
distributed computational ability
great properties inherited from the computing models
Hadoop (MapReduce)
BSP
Hama (Pregel)
Storm (Twitter)
Piccolo
Parasol
GraphLab
...
from low level to high level | implementation dependencies
commodity clusters | not GPUs | not supercomputers
mid-to-low level with a simple API | general
glue | good husband
discussion...
parallelism is sometimes not obvious
Fibonacci sequence | n-th term | prefix sums
thinking...
a tree-reduce-like process
segmented scan | values [1 2 3 4 5 6 7 8 9 10], flags [1 0 0 0 1 0 1 1 0 1] → [1 3 6 10 5 11 7 8 17 10] (see the sketch below)
matrix inversion (Csanky), via Cayley-Hamilton with characteristic-polynomial coefficients $c_{i}$:
$$ A^{-1} = -\frac{1}{c_{n}}\left(A^{n-1} + c_{1}A^{n-2} + \dots + c_{n-1}I\right) $$
binary adder: $$ a_{3}a_{2}a_{1}a_{0} + b_{3}b_{2}b_{1}b_{0} $$
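Why the adder parallelizes (standard carry-lookahead identities; the notation is ours, not from the slides): the carry obeys a recurrence over generate/propagate bits, and composing $(g, p)$ pairs is associative, so all carries can be computed with a parallel scan.

$$ g_{i} = a_{i} \wedge b_{i}, \qquad p_{i} = a_{i} \oplus b_{i}, \qquad c_{i+1} = g_{i} \vee (p_{i} \wedge c_{i}) $$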
fibonacci seqs
...
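A small sketch pinning down the segmented-scan semantics from the example above (sequential version; the parallel version follows the same tree-reduce/scan pattern, and the helper name segmented_scan is ours):

#include <vector>
#include <cstdio>

// Inclusive segmented scan: prefix sums that restart wherever flag[i] == 1.
std::vector<int> segmented_scan(const std::vector<int>& v, const std::vector<int>& flag) {
    std::vector<int> out(v.size());
    int run = 0;
    for (size_t i = 0; i < v.size(); ++i) {
        run = flag[i] ? v[i] : run + v[i];   // a set flag starts a new segment
        out[i] = run;
    }
    return out;
}

int main() {
    std::vector<int> v    = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<int> flag = {1, 0, 0, 0, 1, 0, 1, 1, 0, 1};
    for (int x : segmented_scan(v, flag))
        std::printf("%d ", x);               // prints: 1 3 6 10 5 11 7 8 17 10
    std::printf("\n");
    return 0;
}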
all recursion-pattern algorithms | seq-to-parallel seems doable
$$ d_{1} + d_{2} + d_{3} + d_{4} + d_{5} + \dots + d_{k-1} + d_{k} $$ | a great example
optimization problems | + | iterative | serial | a challenge
why (motivation)
a little history
what (is parallel computing)
how (kiss)
what (is a distributed computational framework)
Series II (end of Oct): BSP, Pregel (in action)
Series III (middle of Nov): Parasol (how to build a demo distributed system)
contact info