Introduction to OpenCL on TI

  • Introduction to OpenCL on TI Embedded Processors

  • Agenda

    OpenCL Overview

    Why and When to Use OpenCL on TI Embedded Processors

    Processor SDK OpenCL Examples

    TI Design Example Code Walkthrough

  • OpenCL Overview

    Introduction to OpenCL

  • OpenCL: Parallel Language for Heterogeneous Model

    The content of this slide originates from the OpenCL standards body, Khronos. AM57x has the ARM Cortex-A15 as a host, and DSP cores as accelerators. The TI OpenCL implementation is compliant with OpenCL 1.1.

  • Benefits of Using OpenCL on TI Processors

    Easy porting between devices

    No need to understand memory architecture

    No need to worry about MPAX and MMU

    No need to worry about coherency

    No need to build/configure/use IPC between ARM and DSP

    No need to be an expert in DSP code, architecture, or optimization

  • OpenCL Platform Model

    A host is connected to one or more OpenCL compute devices. An OpenCL compute device is a collection of one or more compute units. Each compute unit may have multiple processing elements.

  • OpenCL TI Platform Model

    ARM Cortex-A15 is the host: commands are submitted from the host to the OpenCL devices (execution and memory move).

    All C66x DSP CorePacs are OpenCL compute devices. Each DSP core is a compute unit. An OpenCL device is viewed by the OpenCL programmer as a single virtual processor. This means that the programmer does not need to know how many cores are in the device. The OpenCL runtime efficiently divides the total processing effort across the cores. NOTE: AM57x and 66AK2H12 have the same OpenCL code.

    66AK2H12 KeyStone II Multicore DSP+ARM Processor
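
The statement that the runtime divides the processing effort across the cores can be illustrated with a minimal C++ sketch. It assumes an even, block-wise split of work-groups over compute units. The slides do not describe how the TI runtime actually schedules work, so `split_work_groups` and its policy are hypothetical and serve only to show why the host code does not depend on the core count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TI or OpenCL API): split num_groups work-groups
// as evenly as possible over num_units compute units.
std::vector<std::size_t> split_work_groups(std::size_t num_groups,
                                           std::size_t num_units) {
    assert(num_units > 0 && "a device has at least one compute unit");
    // Every unit gets the same base share of work-groups.
    std::vector<std::size_t> counts(num_units, num_groups / num_units);
    // The first (num_groups % num_units) units take one extra work-group.
    for (std::size_t i = 0; i < num_groups % num_units; ++i) {
        ++counts[i];
    }
    return counts;
}
```

With ten work-groups, `split_work_groups(10, 8)` yields the counts 2, 2, 1, 1, 1, 1, 1, 1 and `split_work_groups(10, 2)` yields 5, 5. The request for ten work-groups is the same in both cases; only the number of compute units differs.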

  • OpenCL Applications Model

    Serial code: Host

    Parallel code: Multiple DSP cores

    Serial code: Host

    Parallel code: Multiple DSP cores

    Execution Model; Memory Model

  • OpenCL Execution Model

    Host

    Context: Define device and state

    Compute Device: One or more Compute Unit(s)

    Processing Algorithm: One or more kernel(s)

    Compute Unit: One or more Compute Element(s)

    Work Group

    Compute (Processing) Element; Work Item

    Work items => Work group
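
The relationship between work items and workgroups can be made concrete with index arithmetic. In a one-dimensional range with no global offset, OpenCL defines a work item's global ID as its group ID times the workgroup size plus its local ID. The following sketch shows that mapping in plain C++. The struct and function names are illustrative and are not part of any OpenCL header.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative type: position of a work item inside a 1-D index space.
struct WorkItemIndex {
    std::size_t group_id;  // which workgroup the item belongs to
    std::size_t local_id;  // position of the item inside that workgroup
};

// Split a global ID into (group ID, local ID) for a given workgroup size.
WorkItemIndex decompose(std::size_t global_id, std::size_t local_size) {
    assert(local_size > 0);
    return {global_id / local_size, global_id % local_size};
}

// Inverse mapping: rebuild the global ID from (group ID, local ID).
std::size_t compose(WorkItemIndex idx, std::size_t local_size) {
    return idx.group_id * local_size + idx.local_id;
}
```

With a workgroup size of 4, global ID 13 belongs to workgroup 3 and has local ID 1, and `compose` maps that pair back to 13.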

  • OpenCL Memory Model

    Private Memory: Per work item

    Local Memory: Shared within a workgroup, local to a compute unit (core)

    Global/Constant Memory: Shared across all compute units (cores) in a compute device

    Host Memory: Attached to the Host CPU

    Can be distinct from global memory: Read/Write buffer model

    Can be same as global memory: Map/Unmap buffer model

    Memory management is explicit; commands move data from host > global > local and back.

    Copyright Khronos Group, 2009

  • OpenCL Execution Model

    Definitions: Context; Device; Command queue; Global buffers

    Build Kernels: Get source from file (or part of the code) and compile it at runtime, OR get binaries, either as a standalone .out or from a library

    Manipulate Memory & Buffers: Move data and define local memory

    Execute: Dispatch all work items

  • Simple Function Code Walkthrough

    The OpenCL include file. CL_ENABLE_EXCEPTIONS enables exception-based error checking in the C++ classes.

    This string defines the kernel. It will be compiled for the DSP and runs on the DSP. The kernel name is set.

    The ary array is defined in the host memory. buf is defined with a pointer to ary, which is buffer data that is already allocated by the application.

  • Simple Function Code Walkthrough

    Definitions: Construct context. CL_DEVICE_TYPE_ACCELERATOR is DSP. This tells OpenCL the architecture of the compute device. getInfo returns information about the device or devices.

    Manipulate Memory & Buffers

    Build Kernels: Identify where the kernels are defined. Associate the program with the kernel. Build the program for the devices using the right code generation tools.

    Execute: Define a queue to the device. Define what kernel is sent to the device and set the list of arguments (only one in this example). The queue is connected to the device and the kernel is compiled and set. Start execution by calling the enqueue function. The NDRange class provides the dimensions.

  • Why and When to Use OpenCL

  • Using OpenCL on TI DSP Devices

    HPC machines with large numbers of computational units: no issue. Use OpenCL or CUDA or similar.

    For devices like 66AK2H12, where there are 4 ARM A15 cores and 8 DSP C66x cores: 8 DSPs process many signal processing algorithms. Some of the ARM cores can be on a separate compute device.

    Not supported currently

    Rule of thumb: Use OpenCL when high processing power is needed. Compare it to the overhead associated with dispatching DSP execution. The example null (from the release examples that are discussed later) provides the overhead that is associated with execution of a null program by the DSP.

  • Using OpenCL on TI Sitara Devices

    For devices like AM57x, where there are 1 to 2 ARM (1.5 GHz) cores and 1 to 2 DSP C66x (600 MHz) cores:

    The ARM Cortex-A15 is a high-performance processor, but it is not as efficient as the DSP for some algorithms.

    Consider the overhead that is associated with building the OpenCL structure and the run-time compiling of the kernel. There is a directive that keeps the previously compiled binaries in cache between calls.

    Rule of thumb: Use OpenCL when the following are true:

    The same kernel runs many (infinite) times (the overhead is negligible) and the ARM can execute other functions at the same time.

    The kernel involves complex processing algorithms, especially if real time is a consideration.

    Benchmark your code with and without OpenCL and compare.
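
The first condition in the rule of thumb is an amortization argument: the one-time cost of building the OpenCL structure and compiling the kernel is spread over every later run. A minimal sketch of that arithmetic follows. The function name is hypothetical, the model ignores the extra benefit of the ARM doing other work in parallel, and the input figures have to come from benchmarking on the target.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of kernel runs N for which
//   setup_us + N * dsp_run_us < N * arm_run_us
// holds, i.e. offloading is faster in total than running on the ARM only.
// Returns -1 when a single DSP run is not faster than an ARM run, because
// the setup cost can then never be recovered.
long long min_runs_to_break_even(long long setup_us,
                                 long long dsp_run_us,
                                 long long arm_run_us) {
    assert(setup_us >= 0 && dsp_run_us >= 0 && arm_run_us >= 0);
    const long long saved_per_run = arm_run_us - dsp_run_us;
    if (saved_per_run <= 0) {
        return -1;
    }
    // Smallest integer strictly greater than setup_us / saved_per_run.
    return setup_us / saved_per_run + 1;
}
```

With made-up figures of 1000 us setup, 100 us per DSP run, and 300 us per ARM run, the call `min_runs_to_break_even(1000, 100, 300)` returns 6: from the sixth run onward the offloaded version is ahead in total time.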

  • Processor SDK OpenCL Examples

  • OpenCL in Processor SDK Linux Release

    OpenCL implementation is part of the Processor SDK Linux perspective.

    TI standard file system has several OpenCL examples: /usr/share/ti/examples/opencl

  • OpenCL Examples in Processor SDK Linux File System

    root@am57xx-evm:/usr/share/ti/examples/opencl# ls -ltr

    -rwxr-xr-x 1 root root 2450 Aug 26 12:30 make.inc

    -rwxr-xr-x 1 root root 548 Aug 26 12:30 Makefile

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 vecadd

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 simple

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 platforms

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 ooo_callback

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 offline_embed

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 offline

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 null

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 matmpy

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 float_compute

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 edmamgr

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 dsplib_fft

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 ccode

    drwxr-xr-x 2 root root 4096 Aug 26 12:55 buffer

  • Executing OpenCL Examples: ccode

    root@am57xx-evm:/usr/share/ti/examples/opencl# cd ccode

    root@am57xx-evm:/usr/share/ti/examples/opencl/ccode# ls -ltr

    -rw-r--r-- 1 root root 2107 Aug 26 12:30 oclwrapper.cl

    -rw-r--r-- 1 root root 377656 Aug 26 12:30 main.o

    -rw-r--r-- 1 root root 6544 Aug 26 12:30 main.cpp

    -rw-r--r-- 1 root root 6376 Aug 26 12:30 ccode.obj

    -rw-r--r-- 1 root root 2036 Aug 26 12:30 ccode.c

    -rw-r--r-- 1 root root 171 Aug 26 12:30 Makefile

    -rwxr-xr-x 1 root root 22524 Aug 26 12:30 ccode

    root@am57xx-evm:/usr/share/ti/examples/opencl/ccode# ./ccode

    [ 540.955345] NET: Registered protocol family 41

    Success!

    root@am57xx-evm:/usr/share/ti/examples/opencl/ccode#

  • Executing OpenCL Examples: vecadd

    root@am57xx-evm:/usr/share/ti/examples/opencl/ccode# cd ../vecadd

    root@am57xx-evm:/usr/share/ti/examples/opencl/vecadd# ls

    Makefile main_map_prof.cpp main_prof.cpp

    main.cpp main_md.cpp vecadd_main.o main_md.o vecadd_md

    root@am57xx-evm:/usr/share/ti/examples/opencl/vecadd# ./vecadd

    DEVICE: TI Multicore C66 DSP

    Offloading vector addition of 8192K elements...

    Kernel Exec : Queue to Submit: 7 us

    Kernel Exec : Submit to Start : 68 us

    Kernel Exec : Start to End : 32176 us

    Success!
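
The three timings printed by vecadd separate the time before the kernel starts from the kernel run itself. The arithmetic below uses only the numbers from the run above and treats the first two intervals as dispatch overhead, which is an interpretation of the labels rather than something the output states. The type and function names are illustrative.

```cpp
#include <cassert>

// Timings in microseconds, as printed by the vecadd run in the transcript.
struct KernelTimings {
    long queue_to_submit_us;
    long submit_to_start_us;
    long start_to_end_us;
};

// Assumption: everything before the kernel starts counts as dispatch overhead.
long dispatch_overhead_us(const KernelTimings& t) {
    return t.queue_to_submit_us + t.submit_to_start_us;
}

// Total time from queueing the kernel to its completion.
long total_us(const KernelTimings& t) {
    return dispatch_overhead_us(t) + t.start_to_end_us;
}

// Share of the total time that is spent before the kernel starts running.
double overhead_fraction(const KernelTimings& t) {
    assert(total_us(t) > 0);
    return static_cast<double>(dispatch_overhead_us(t)) / total_us(t);
}

// The run from the transcript: 7 us, 68 us and 32176 us.
const KernelTimings kVecaddRun{7, 68, 32176};
```

For this run the overhead is 75 us out of 32251 us in total, roughly 0.23 percent, so the dispatch cost is small next to a kernel of this size. A much shorter kernel would shift that balance, which is the comparison the rule of thumb asks for.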

  • Building OpenCL Examples

    Copy the OpenCL examples directory into your home directory.

    Go to the /opencl directory, do make clean and then make. All directories will be built.

    Next, run any of the projects by going to the project directory and running the executable.

  • TI Design Example Code Walkthrough

  • OpenCL TI Design: www.ti.com/tool/TIDEP0046

  • OpenCL TI Design: Resources http://www.ti.com/tool/TIDEP0046

  • The Algorithm

  • Th