- --------------- title page ----------------
- -------------- credit page ----------------
- ------------ problem slide 1 --------------
- - why FPGA?
- - CPU? computational power
- - GPU? communication facilities
- - model? human brain
- ----------- the machine slide -------------
- - what we want to build
- ------------ problem slide 2 --------------
- - simple model
- - differential equations
- - more tractable
- - real-time deadline
- - some neurons live on the same FPGA
- ---------- requirements slide --------------
- - latency matters much more than bandwidth
- - inevitable faults (thousands of links at Gbps rates)
- - enable small FPGAs
- - fully exhaust resources
- - transceivers
- - max. links, max. link rates
- - gain bandwidth, reduce hops
- - heterogeneous interop
- --------- standard IP cores slide ----------
- - many standards
- - many difficulties
- ----------- architecture slide -------------
- - difficulties of custom communication
- - additional work
- - serial transceiver layer
- - send/receive 32-bit words
- - physical
- - conversion of words
- - idle symbols/alignment
- - link
- - serialization of flits/words
- - reliability
- - CRC (without header), sequence number, acknowledgement
- - unacknowledged flits kept in replay buffer, resent
- - routing and switching
- - hop-by-hop routing
- - abstraction
- - primitives for applications
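
Sketch for the reliability bullets above (CRC, sequence number, acknowledgement, replay buffer). This is a minimal software model, not the hardware design: the class and function names are hypothetical, the CRC is assumed to cover the payload only (one reading of "without header"), acknowledgements are assumed to be cumulative, and sequence-number wraparound is ignored.

```python
import zlib
from collections import OrderedDict


def make_flit(seq, payload):
    # Header is just the sequence number; CRC covers the payload only (assumption).
    return (seq, payload, zlib.crc32(payload))


class ReliableSender:
    """Hypothetical sender: keeps unacknowledged flits in a replay buffer."""

    def __init__(self):
        self.next_seq = 0
        self.replay = OrderedDict()  # seq -> payload, in send order

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.replay[seq] = payload  # hold until acknowledged
        return make_flit(seq, payload)

    def ack(self, seq):
        # Cumulative acknowledgement (assumption): drop everything up to seq.
        for s in [s for s in self.replay if s <= seq]:
            del self.replay[s]

    def resend(self):
        # Replay every flit that has not been acknowledged yet, oldest first.
        return [make_flit(s, p) for s, p in self.replay.items()]


class ReliableReceiver:
    """Hypothetical receiver: accepts only the expected, uncorrupted flit."""

    def __init__(self):
        self.expected = 0

    def receive(self, flit):
        seq, payload, crc = flit
        if zlib.crc32(payload) != crc or seq != self.expected:
            # Reject; re-acknowledge the last good sequence number (-1 if none).
            return None, self.expected - 1
        self.expected += 1
        return payload, seq
```

A corrupted flit is rejected by the receiver, stays in the sender's replay buffer, and is delivered on the next `resend()`.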
- ----------- abstractions slide -------------
- - packets
- - send/receive buffers
- - polling/interrupts signal packet delivery
- - Bluespec
- - adds 10-20 extra cycles of latency
- - FIFO type abstraction
- - remote DMA
- - direct memory access
- - read/write translation
- - transparency
- - with bursts
- - blocking
- - read/write until successful
- - deadlock risk
- - software pipes
- - Linux pipe semantics
- - testing application on PC
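
Sketch for the "blocking" bullets above (read/write until successful, deadlock risk). A toy model only: two bounded FIFOs stand in for the per-direction buffers, `CAPACITY` and `try_send` are hypothetical names, and a non-blocking send is used so the stuck state can be observed instead of hanging.

```python
import queue

CAPACITY = 2  # hypothetical per-direction buffer depth


def try_send(fifo, word):
    """Non-blocking send: returns False instead of waiting when the FIFO is full."""
    try:
        fifo.put_nowait(word)
        return True
    except queue.Full:
        return False


a_to_b = queue.Queue(maxsize=CAPACITY)
b_to_a = queue.Queue(maxsize=CAPACITY)

# Both nodes try to send three words before either one receives anything.
a_sent = sum(try_send(a_to_b, w) for w in range(3))
b_sent = sum(try_send(b_to_a, w) for w in range(3))

# Each node is stuck on its third word. With blocking writes, both would
# wait forever, since neither ever reaches its receive step.
stuck = a_to_b.full() and b_to_a.full()
```

If either node drains its incoming FIFO before retrying the send, the cycle is broken.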
- ------------- results slide ----------------
- - Altera core
- - many comparisons, 4 key results
- - inherent area/performance trade-off
- - bandwidth
- - utilization when instantiating each system as many times as necessary to use all transceiver resources
- - protocols in black implement reliability
- - latency
- - comparison with 10G BlueLink/Ethernet
- - flits can be accepted in a single cycle
- - lightly loaded case more likely with more transceivers
- - overhead
- - better use up to 256-bit packets
- - area
- - BlueLink compares very favorably
- - 10G BlueLink uses 65% of the LUTs/registers of 10G Ethernet
- - 40G will fit in the same area
- - 15% memory of 10G BlueLink