# alha15 1. FPGA-accelerated hadoop cluster for deep learning computations
- map-reduce
- data parallelism
- common compute step
- deep learning kernels computationally intensive
- speedup: 12.6 times
- energy reduction: 87.5%
- nodes: 6
- fpga: zedboard
"Researchers are currently training deep learning architectures, particularly convolutional neural networks, either by resorting to special hardware such as GPU or FPGA, or by mapping the training process onto a distributed computing cluster such as the hadoop map-reduce framework, but not both at the same time"
"amdahl's law and the assumption that the acceleration speedup of all convolution layers is the same, we computed the overall speedup from FPGA acceleration over sequential execution to be 8.12 times"
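For reference, Amdahl's law as used in that estimate: if a fraction p of the sequential runtime is accelerated by a factor s, the overall speedup is

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \frac{p}{s}}
```

The 8.12x figure follows from plugging the paper's profiled convolution-layer fraction p and per-layer speedup s into this formula; those two inputs are not recorded in these notes.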
# fisc20 2. BNNsplit: binarized neural networks for embedded distributed FPGA-based computing systems
- map-reduce
- CNNs: floating point operations -> reducing weights to binary values for FPGAs (see the XNOR-popcount sketch after this list)
- FINN: state-of-the-art BNN framework
- extension to run BNNs on multi-FPGA systems
- hadoop: input data in the distributed file system is divided between mappers for fault tolerance; the master divides the input into splits and assigns each split to a mapper node, usually based on physical proximity
- map: input key-value pairs are mapped to intermediate key-value pairs, which the framework groups by the intermediate key
- reduce: applied in parallel to each group
"there have been various attempts to accelerate deep learning algorithms using either hadoop or FPGA technology, but not both at the same time"
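A minimal sketch (plain Python, not BNNsplit or FINN code) of the binarization idea referenced above: once weights and activations are constrained to {-1, +1}, a dot product collapses to XNOR plus popcount, which is why BNNs map so well onto FPGA logic.

```python
# Generic XNOR-popcount dot product for binarized networks; illustrative
# only, not the paper's implementation.

def binarize(xs):
    """Map real values to {-1, +1} by sign (0 maps to +1 here)."""
    return [1 if x >= 0 else -1 for x in xs]

def pack(bits):
    """Pack a {-1, +1} vector into an integer bitmask (+1 -> set bit)."""
    mask = 0
    for i, b in enumerate(bits):
        if b == 1:
            mask |= 1 << i
    return mask

def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors of length n packed as bitmasks."""
    agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 wherever signs agree
    matches = bin(agree).count("1")               # popcount
    return 2 * matches - n                        # +1 per match, -1 per mismatch

acts = binarize([0.3, -1.2, 0.7, -0.1])
wts  = binarize([1.0, -0.5, -0.9, 0.2])
n = len(acts)
assert xnor_popcount_dot(pack(acts), pack(wts), n) == sum(a * w for a, w in zip(acts, wts))
```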
# chun95 3. design and implementation of a multicomputer interconnection network using FPGAs
- four-by-four port interconnection network
- cell routing hubs
- traditional FPGA benefits in system development, including hardware/software codesign, architecture trade-off studies and system debugging
- benefits compared to ASIC development
# chun15 4. FPGA-based accelerator platform for big data matrix processing
- hadoop
- cluster of FPGA evaluation boards (EVBs)
- communication via Gigabit Ethernet switch
- 512x512 floating point matrix multiplications with four FPGA EVBs at a 125 MHz clock achieve 4x speedup as compared with an i7-4770 CPU at 3.4 GHz
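For scale (my arithmetic, not figures from the paper): a single 512x512 by 512x512 multiply costs

```latex
2 \cdot 512^{3} \approx 2.7 \times 10^{8} \ \text{floating-point operations},
\qquad
\frac{3.4\ \text{GHz}}{125\ \text{MHz}} \approx 27
```

so the four 125 MHz EVBs reach the reported 4x speedup despite a roughly 27x lower clock rate per board.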
# chun17 5. hadoop cluster with FPGA-based hardware accelerators for K-means clustering algorithm
- hadoop
- 4x speedup compared to a hadoop cluster without FPGA-based accelerators
- compared against the Apache Mahout machine learning libraries
- evaluation boards (EVBs)
- Gigabit ethernet switch
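Why K-means drops naturally into map-reduce, as a minimal Python sketch of one iteration (illustrative only, not chun17's implementation; it also shows the map -> group-by-key -> reduce flow noted under fisc20):

```python
import math

def mapper(point, centroids):
    """Emit (index of nearest centroid, point) for one data point."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists)), point

def reducer(cluster_id, points):
    """Recompute one centroid as the mean of its assigned points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

# toy driver: group mapper outputs by key, then reduce each group
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (4.9, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
groups = {}
for p in points:
    key, value = mapper(p, centroids)
    groups.setdefault(key, []).append(value)
centroids = [reducer(k, vs) for k, vs in sorted(groups.items())]
print(centroids)   # -> [[0.05, 0.1], [4.95, 5.05]]
```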
# du19 6. the library for hadoop deflate compression based on FPGA accelerator
- hadoop: map-reduce
- accelerating a hadoop system with hardware by implementing the compression options with FPGAs
- speedup ratios 6.42x, 6.28x and 3.25x
- "Apache Hadoop is the industry's mainstream big data processing software, running on a cluster, with distributed storage and distributed processing of large-scale data"
- compression and decompression at the same time due to timeline parallelism
- modified zlib, modified zpipe, modified TestDFSIO (IO benchmarking)
- PCI-x4 hardware interface
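The accelerated codec slots into Hadoop's compression options in place of the software DEFLATE path; that baseline is essentially what Python's zlib module (a DEFLATE implementation) does. Illustrative only, not the paper's library:

```python
import zlib

# Software DEFLATE round-trip: the path the FPGA codec replaces.
payload = b"hadoop block contents " * 1000
compressed = zlib.compress(payload, 6)       # level 6 = zlib's default trade-off
assert zlib.decompress(compressed) == payload
print(len(payload), "->", len(compressed), "bytes")
```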
# kalm16 7. clustering and mapping algorithm for application distribution on a scalable FPGA cluster
"the creation of an FPGA cluster introduces two major challenges."
- "how does the communication between the FPGAs take place?"
- "how will the application(s) be distributed among the FPGAs?"
- task interaction graph (TIG) mapped to board connection graph (BCG) (see the sketch after this list)
- "One challenge for FPGA clusters is the topology and interconnection type. The most common approaches connect FPGA boards via Ethernet or connect FPGA cards via PCIe"
- or self-built using BlueLink (lightweight pluggable interconnect library)
- or wifi
- many approaches suffer from poor scalability and become inefficient when it comes to building a large cluster
- "Another challenge for FPGA clusters is the application distribution."
- several publications with a single FPGA
- network-on-chip using a single FPGA
- also called load-balancing
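kalm16's contribution is the clustering and mapping algorithm itself, which these notes don't detail. As a stand-in, here is a hypothetical greedy mapping of a task interaction graph (TIG) onto a board connection graph (BCG) that minimises traffic-weighted hop count under a per-board task capacity; the names and the greedy strategy are my own illustration, not the paper's method.

```python
from collections import deque

def hop_distances(bcg, src):
    """BFS hop counts from board `src` in the board connection graph."""
    dist, frontier = {src: 0}, deque([src])
    while frontier:
        b = frontier.popleft()
        for nb in bcg[b]:
            if nb not in dist:
                dist[nb] = dist[b] + 1
                frontier.append(nb)
    return dist

def greedy_map(tig, bcg, capacity):
    """tig: {task: {neighbour_task: traffic}}; bcg: {board: [adjacent boards]}."""
    hops = {b: hop_distances(bcg, b) for b in bcg}
    load = {b: 0 for b in bcg}
    placement = {}
    # place the most communication-heavy tasks first
    for task in sorted(tig, key=lambda t: -sum(tig[t].values())):
        def cost(board):
            return sum(traffic * hops[board][placement[nb]]
                       for nb, traffic in tig[task].items() if nb in placement)
        board = min((b for b in bcg if load[b] < capacity[b]), key=cost)
        placement[task] = board
        load[board] += 1
    return placement

# toy example: four tasks, two boards joined by a single link
tig = {"a": {"b": 5}, "b": {"a": 5, "c": 1}, "c": {"b": 1, "d": 4}, "d": {"c": 4}}
bcg = {0: [1], 1: [0]}
print(greedy_map(tig, bcg, {0: 2, 1: 2}))   # {'b': 0, 'a': 0, 'c': 1, 'd': 1}
```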
# jone99 8. AUX: implementing an API for distributed adaptive computing systems
- two such application classes are embedded systems in which multiple boards are required to physically interface to different sensors/actuators, and applications whose computational demands require multiple boards
- the cluster computing paradigm is a cost-effective method of constructing small parallel computers using commercial off-the-shelf technology to exploit coarse- and medium-grain parallelism
# knod13 9. integration of a highly scalable multi-FPGA-based hardware accelerator in common cluster infrastructures
- offers simple/scalable integration of FPGAs in a common cluster architecture, permitting easy access to resources
- enables system-wide dynamic partitioning, batch-based administration and monitoring of FPGA resources
- "numerous applications in bio- and neuroinformatics are highly suitable for FPGAs"
- presents an efficiently working cluster architecture with distributed FPGAs
- "If no hardware is available, a simulation of the applications' new runtime environment is required. A testbed provides realistic performance values but causes additional effort, whereas a simulation can have negative effects on the performance values."
- "The currently used connection technologies range from simple streaming solutions realized with Gigabit Ethernet up to complex PCI Express solutions."
- accelerator card developed by Pico Computing, holding up to six FPGAs per card
- simple and user-friendly approach: Convey HC-1 system
- unlike in-socket FPGA co-processors from Nallatech, it uses a mezzanine connector to link the front-side bus to an accelerator board with four user-programmable FPGAs
- accessible with an OpenMP programming model
- emulation of parallel architectures
- "In most heterogeneous clusters FPGAs are used as simple co-processors or accelerators connected over PCIe to the node's processor cores."
- tight coupling to host processor
- "The main characteristic of our concept is the dynamic allocation of FPGA resources and the flexible assignment between the number of host processors and FPGAs"
- "Our approach comprises basic protocol implementations for the FPGA-to-FPGA, FPGA-to-Host and FPGA-to-Cluster communication"
- PCI communication
- "To allow an efficient programming of the FPGA resources in the introduced heterogeneous cluster environment, a framework is necessary. The Open Computing Language (OpenCL) framework is commonly used for the development and execution of programs across heterogeneous platforms consisting of CPUs, GPUs and also FPGAs."
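knod13 points to OpenCL as the programming framework for such heterogeneous clusters. As a reminder of what that model looks like from the host side, here is a minimal vector-add using pyopencl; this is generic OpenCL, not knod13's framework, and real FPGA targets normally compile the kernel offline with the vendor SDK rather than at runtime as shown here.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # picks any available OpenCL device
queue = cl.CommandQueue(ctx)

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
assert np.allclose(result, a + b)
```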
# nesh15 10. accelerating machine-learning kernels in hadoop using FPGAs
- hadoop: map-reduce
- comprehensive analysis of communication and computation overheads, such as data I/O movements and calls to several standard libraries that cannot be offloaded to the accelerator
- several data mining algorithms and applications
- K-means clustering, KNN classification, SVM-learn and Naive Bayes classification
- speedup derivation using Amdahl's law (see the overhead-aware form after this list)
- "As the results show, the input data size has a significant effect on the speedup in some applications."
- design space analysis: "Mapping of applications to a heterogeneous architecture to benefit from the diverse cores and accelerators is a complex problem, particularly because different phases of the same application will often prefer different cores or configurations and thus require specific scheduling and mapping to find the best match. Making wrong scheduling decisions can lead to suboptimal performance and negatively impact power and energy consumption as well"
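A hedged sketch of the overhead-aware speedup model this suggests (my notation, not necessarily the paper's): with baseline runtime T, offloadable fraction p, accelerator speedup s, and non-offloadable overhead T_oh (data movement, standard-library calls),

```latex
S_{\text{overall}} = \frac{T}{(1 - p)\,T + \frac{p\,T}{s} + T_{\text{oh}}}
```

The T_oh term caps the achievable speedup no matter how large s gets, which is consistent with the note that input data size strongly affects the measured speedup.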
# theo14 11. interconnect for commodity FPGA clusters: standardized or customized?
- "Whilst soft cores for standard protocols (Ethernet, RapidIO, Infiniband, Interlaken) are a boon for FPGA-to-other-system interconnect, we argue that they are inefficient and unnecessary for FPGA-to-FPGA interconnect"
- makers of BlueLink
- on the idea of single-PCB multi-FPGA systems and their pitfalls
- "This requires complex design and simulation - for professional designers a board takes about one man-year of design effort. FPGAs are typically found in advanced ball grid array packages, which also makes manufacturing difficult. In addition there is the headache of managing the whole process of parts procurement, production, test and debug."
- "Secondly, many such boards (especially commercial products) are not regular - each FPGA is not connected to the same peripherals. This requires a separate synthesis run for each FPGA in a cluster, which makes it difficult to scale to a large number of FPGAs."
- "some applications do not require any communication between FPGAs: loosely coupled"
- "mapreduce fits this model"
- "other applications are tightly coupled"
- gate-level system-on-chip simulation
- interconnect:
- simple approach: GPIO pins using single-ended driving or low-voltage differential signalling (LVDS)
- limited frequency, about 1 GHz in LVDS mode
- signal integrity and skew: short cables (typically centimetres) with careful (expensive) construction
- limits size of cluster
- proposes commodity FPGA boards (reduce costs, development time), serial interconnect using FPGA transceivers, low-cost commodity passive copper cabling between boards (optical for longer distances if necessary), multi-hop routing such that a fully-connected network is not required
- compared against the Altera 10G Ethernet MAC on a Stratix V platform
- small message sizes, low latency, reliable, hardware-only, lightweight, ubiquitous and interoperable
# asse21 12. accelerating deep neuroevolution on distributed FPGAs for reinforcement learning problems
- sequential nature of problems poses a fundamental challenge
- "most appealing part of video games for reinforcement learning research is the availability of the game score as a direct reward signal, as well as the low cost of running large amounts of virtual experiments on computers without actual consequences"
- "training neural networks with derivative-free methods opens the door for innovations in hardware beyond GPUs"
- IBM Neural Computer
- "rather than accelerating the optimization algorithm (e.g. RL or GA) we have taken a different approach and addressed the data generation (i.e. the Atari game environment and obtaining frames)"
- "within each node is a zynq-7045 system-on-chip, which integrates a dual-core Cortex A9 ARM processor and an FPGA, alongside 1GB of DRAM used both by the ARM CPU and the FPGA"
- "for example, for game playing, a significant portion of the time is spent during the game itself, which results in a long sequence of inference of game frames and actions. communicating game scores and updating neural network weights are sparse in comparison. therefore, rather than accelerating the genetic algorithm, acceleration of the game environment and the inference can make a big difference as our results have shown."
# prit20 13. overview of the IBM neural computer architecture
- IBM neural computer (INC)
- custom-designed distributed FPGA system developed by IBM Research
- 416 FPGAs
- 832 instances in parallel
- "while INC is a distributed system in that it is composed of distinct processor+memory nodes interconnected by communication links, it has a unique combination of features not available elsewhere. It is not a multi-FPGA 'sea of gates' system, whose structure would need to be defined by the logic resident on the FPGA. it has a very well defined structure of compute nodes with a well defined communications network. Therefore it does not carry the performance compromise associated with the need to support a fully-generic interconnect."
- 3x3x3 topology per card, along with (X, Y, Z) co-ordinates overlaid to indicate the organization of the 3D mesh
- 27 nodes placed on the card in a way that minimizes the connection lengths between logically adjacent nodes
- 27 identical nodes except for ethernet (node 100), the controller (node 000) with a 4-lane PCIe 2.0 connection (possibly to a host PC) and serial (possibly serving as console during boot time), and extra optional PCIe support (node 200) for additional bandwidth
- backplane up to 16 cards in a 12x12x3 mesh (see the arithmetic after this list)
- communication network currently supports directed and broadcast packet routing schemes
- multiple virtual channels can be designed to sit atop the underlying router logic described in the previous section to give the processor and FPGA logic different virtual or logical interfaces to the communication network
- internal ethernet
- emulates regular ethernet to take advantage of existing software
- postmaster direct memory access (DMA)
- bridge FIFO
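Quick consistency check on the topology figures (my arithmetic, not from the paper):

```latex
\underbrace{3 \times 3 \times 3}_{\text{nodes per card}} = 27,
\qquad
16 \ \text{cards} \times 27 \ \text{nodes} = 432 = 12 \times 12 \times 3 \ \text{node positions in the full backplane mesh}
```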
# 14. inference
# 15. the cube