DSP3210 Overview
----------------

The DSP3210 is a full 32-bit floating point DSP implemented in 0.9 micron
CMOS. It provides many advantages over fixed point DSPs such as the Motorola
56000. Some of the main features of the DSP3210 include:

  * 32-bit floating point arithmetic.
  * 32-bit addressing.
  * Large (8k) on-chip, zero wait-state memory.
  * Single cycle instructions (for up to 33 Mflops).
  * Shares a bus with a Motorola or Intel style CPU.
  * Serial I/O with DMA transfer counters for up to 25 Mbits/second transfer:
      Serial data transfers occur without processor intervention.
      Cycles are stolen when necessary.
      DMA control for serial in and serial out.
  * Barrel shifter for bit manipulation in graphics or data encryption.
  * Both mu-law and A-law encoding.
  * Bit I/O: general purpose 8-bit I/O port for control of external hardware.
  * Programmable 32-bit timer for interval timing, rate generation, event
    counting or waveform generation.
  * Fully vectored interrupt structure with hardware context save:
      Allows very fast interrupt processing, up to 2 million/second.
  * Low power CMOS design.

No special programming is required on the DSP3210 to implement floating point
algorithms, or to process signals with a much larger dynamic range (in excess
of 1500 dB as opposed to < 300 dB for fixed point).

The DSP3210 is also designed to share a host memory bus with either a Motorola
or Intel style CPU. This greatly reduces system cost by removing the
requirement for expensive fast local memory for the DSP. It also removes any
practical restrictions on program or data size. A large on-chip cache (8k),
combined with software that uses the cache intelligently, allows the DSP3210
to execute complex signal processing algorithms without expensive local
memory.

All instructions execute in a single cycle (four clock periods: 80 ns for a
50 MHz part or 60 ns for a 66 MHz part), and this includes all floating point
normalization (which is performed automatically).
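The timing figures above follow from simple arithmetic, sketched here in
Python as a quick check. The function names are illustrative only, and the
"66 MHz" part is assumed to run at a nominal 66.67 MHz, which is what the
quoted 60 ns cycle implies. The peak Mflops figure assumes two floating point
operations per instruction cycle, matching the 33 Mflops quoted in the feature
list.

```python
# Illustrative helpers (not part of any DSP3210 tool): derive the
# instruction cycle time and peak rates from the clock frequency.

CLOCKS_PER_INSTRUCTION = 4   # one instruction cycle is four clock periods
FLOPS_PER_INSTRUCTION = 2    # assumed: one multiply plus one add per cycle


def instruction_cycle_ns(clock_mhz):
    """Instruction cycle time in nanoseconds for a given clock in MHz."""
    return CLOCKS_PER_INSTRUCTION * 1000.0 / clock_mhz


def peak_mips(clock_mhz):
    """Peak millions of instructions per second."""
    return clock_mhz / CLOCKS_PER_INSTRUCTION


def peak_mflops(clock_mhz):
    """Peak millions of floating point operations per second."""
    return FLOPS_PER_INSTRUCTION * peak_mips(clock_mhz)


if __name__ == "__main__":
    # 200/3 MHz is the assumed nominal clock of the "66 MHz" part.
    for clock in (50.0, 200.0 / 3.0):
        print("%.2f MHz: %.1f ns/instruction, %.1f MIPS, %.1f Mflops" % (
            clock, instruction_cycle_ns(clock),
            peak_mips(clock), peak_mflops(clock)))
```

For the 50 MHz part this gives an 80 ns cycle, 12.5 million instructions per
second and 25 Mflops. The faster part reaches the 60 ns, 16.7 million and
33 Mflops figures.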
A single instruction may have two floating point operations: a floating point
multiplication and a floating point addition. The DSP3210 also supports up to
four memory accesses in a single instruction cycle (quad-word transfer).

The DSP3210 architecture features seven functional units:

  * Control Arithmetic Unit (CAU)
  * Data Arithmetic Unit (DAU)
  * On-chip memory (RAM0, RAM1, Boot ROM)
  * Bus Interface
  * Serial I/O (SIO)
  * DMA Controller (DMAC)
  * Timer/Status Control (TSC)


The Control Arithmetic Unit
---------------------------

The CAU is responsible for address calculation, branching control and all 16
and 32-bit integer logic and arithmetic operations. It is a RISC core
consisting of a 32-bit Arithmetic Logic Unit (ALU), a 32-bit Program Counter
(PC), 22 32-bit general purpose registers (r1-r22; r0 is hardwired to 0) and
a 32-bit barrel shifter. This core executes instructions at up to 16.7 million
instructions per second.

There are special register considerations in the CAU:

  r0       hardwired to 0 (always)
  r1-r14   DA instruction memory reference (X,Y,Z) pointer registers
  r15-r19  DA instruction memory reference (X,Y,Z) increment registers
  r20      used by error exception facility to store old pc
  r21      stack pointer (sp)
  r22      pointer to the exception vector table (evtp)

The CAU provides the following branching and control instructions:

  if (COND) goto {N,rB,rB+N}     Conditional branch based on flags
  if (rM-->=0) goto {N,rB,rB+N}  Conditional branch using loop counter
  goto {N,rB,rB+N,M,rB+M}        Unconditional branch
  nop                            No operation
  call {N,rB,rB+N,M}(rM)         Call subroutine
  return {rM}                    Return from subroutine
  do K,{L,rM}                    Do next K+1 instruction(s) L+1 or (rM+1)
                                 times.
                                 K=0,1,2...127; L=rM=0,1,2...2047
  dolock K,{L,rM}                Signals interlocked bus transfer
  doblock {L,rM}                 Signals quad-word transfers
  ireturn                        Return from interrupt
  sftrst                         Soft-reset; changes error level to base
                                 level; encoded as spc=(byte)r0
  waiti                          Wait for interrupt; encoded as spc=(long)r0

where:

  rB   = pc, r0-r22
  rM   = r1-r22
  N    = 16-bit unsigned integer
  M    = 24-bit unsigned integer
  COND = one of the DSP3210 condition codes (refer to DSP3210 manual)


The Data Arithmetic Unit
------------------------

The DAU consists of a 32-bit floating point multiplier, a 40-bit floating
point adder, four 40-bit floating point accumulators (a0-a3), a clip test
register (ctr), and a control register (dauc). The multiplier and adder
operate in parallel to perform up to 16.7 million computations per second
(12.5 million for a 50 MHz part) of the form (a=b+c*d), also known as a
multiply-accumulate. The DAU contains a four stage pipeline which is visible
to the application programmer.

The DAU supports the following floating point formats:

  Single precision (32-bit) in both DSP32 and IEEE format
  Extended single precision (40-bit) (uses 8 mantissa guard bits)

Single instruction data type conversions are done in the DAU hardware:

  DSP32 and IEEE 32-bit floating point
  16/32-bit integer
  8-bit unsigned
  mu-law and A-law

The DAU has a number of special instructions to greatly simplify data type
conversions and other common operations:

  [Z=] aN = ic(Y)       Input conversion mu-law, A-law, 8-bit linear to
                        float.
  [Z=] aN = oc(Y)       Output conversion float to mu-law, A-law, 8-bit
                        linear.
  [Z=] aN = float16(Y)  16-bit integer to float.
  [Z=] aN = float32(Y)  32-bit integer to float.
  [Z=] aN = int16(Y)    Float to 16-bit integer (round or truncate, dauc[4]).
  [Z=] aN = int32(Y)    Float to 32-bit integer (round or truncate, dauc[4]).
  [Z=] aN = round(Y)    Round to nearest, float(40) to float(32).
  [Z=] aN = ifalt(Y)    Conditional assignment/memory write.
  [Z=] aN = ifaeq(Y)    Conditional assignment/memory write.
  [Z=] aN = ifagt(Y)    Conditional assignment/memory write.
  [Z=] aN = dsp(Y)      IEEE to DSP format conversion.
  [Z=] aN = ieee(Y)     DSP to IEEE format conversion.
  [Z=] aN = seed(Y)     32-bit to 32-bit reciprocal seed.

Where [Z=] indicates an optional additional destination for the result (for
example a memory write, as in *r3 = a0 = a0). Note that Y may not be a0-a3
for the dsp() special function.


Addressing Modes
----------------

DSP3210 assembly language has a syntax very similar to C. The notation
conventions are as follows: a0-a3 are the accumulators (DAU), and r0-r22 are
the CAU registers. Instructions take the following appearance:

  r2 = (long)r1        ; CAU register direct: store the contents of r1 in r2
  r1 = (long)*r1       ; load the value pointed to by r1 into r1
  r1 = (long)r1 + 1    ; increment r1 by 1
  *r2++ = (long)r1     ; store r1 in *r2, then postincrement r2
  r3 = (long)r1 + r2   ; add the two numbers in r1 and r2: store result in r3
  r3 = (long)*r1++r2   ; load *r1 into r3, then postincrement r1 by r2
  a2 = a2 + *r2 * a3   ; use that pipeline!

The following table lists the various addressing modes supported by the
DSP3210. It has one column per instruction type: CA Data Move Group (CAU Reg),
CA Data Move Group (I/O Reg), CA Arithmetic/Logic Group, and DA M/A & Special
Func.

------------------------------------------------------------------------------
Addressing Mode
------------------------------------------------------------------------------
Short Immediate                            Yes
24-bit Immediate                           Yes
Memory Indirect                            Yes
CAU Register Direct                        Yes  Yes  Yes
IO Register Direct                         Yes
DAU Register Direct                        Yes  Yes
Register Indirect                          Yes  Yes  Yes
Register Indirect with Postmodification    Yes  Yes  Yes
------------------------------------------------------------------------------


Latency Issues
--------------

The most difficult aspect of programming the DSP3210 is being aware of latency
in the instruction pipeline. There are four cases in which the pipeline
affects latency. The cases are:

1. DA Memory Writes.
   When a DA instruction specifies a write to memory, the value written is
   not available to be read from that location until four instructions later
   (a three instruction latency). For example:

     *r3 = a0 = a0        ; instruction 1
     *r3 = a3 = a3        ; instruction 2
     .                    ; instruction 3
     .                    ; instruction 4
     a1 = *r3             ; instruction 5

   The value read in instruction 5 is the value written in instruction 1, not
   instruction 2. Instructions 2, 3 and 4 are latent instructions for
   instruction 1, and instructions 3, 4 and 5 are latent for instruction 2.

2. Accumulator as Multiplier Input.

   When an accumulator is used as an input to the multiplier, its value is
   established no sooner than three instructions prior to the multiply
   instruction (a two instruction latency). Note that this also applies to an
   accumulator used in the X field of an instruction of the form:

     [Z=] aN = [-]Y {+,-}X

   For example:

     a0 = a0 + *r1 * *r2  ; instruction 1
     a0 = a0 + a1         ; instruction 2
     .                    ; instruction 3
     a2 = a0 * a0         ; instruction 4
     a1 = a2 + a0         ; instruction 5

   The value of a0 used in instruction 4 is calculated in instruction 1. The
   value of a2 used in instruction 5 is calculated in instruction 4, since
   there is no latency effect on accumulators used as inputs to the adder.

3. Branching.

   When a CA Control Group instruction of the form if()goto, call, return or
   goto is executed, the instruction immediately following is also executed
   before the branch occurs. This is commonly referred to as a delayed branch.
   The ireturn instruction is different: execution of the base-level program
   resumes in the following instruction cycle. For example:

     if(eq) goto over     ; instruction 1
     r1 = 3               ; instruction 2
     .                    ; instruction 3

   Instruction 2 is executed even if the condition is true and the branch is
   taken. If this is undesirable, a nop can be placed after the branch
   instruction or, if possible, the instructions can be rearranged.
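   The delayed branch behaviour described above can be illustrated with a
   small Python model. This is only a sketch under stated assumptions: the
   tuple-based instruction encoding and the name run_delayed are invented for
   illustration and are not DSP3210 tools or opcodes, and a taken conditional
   branch is modelled simply as a goto.

```python
# Toy model of a one-slot delayed branch. The instruction encoding
# ("goto", target), ("set", reg, value), ("nop",), ("halt",) is hypothetical.

def run_delayed(program, max_steps=32):
    """Run `program` and return (trace of executed indices, registers)."""
    regs = {}
    trace = []
    pc = 0
    pending = None  # target of a branch issued by the previous instruction
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op = program[pc]
        trace.append(pc)
        # The current instruction always executes. Only afterwards does a
        # branch issued one instruction earlier redirect the program counter.
        next_pc = pc + 1 if pending is None else pending
        pending = None
        if op[0] == "goto":
            pending = op[1]
        elif op[0] == "set":
            regs[op[1]] = op[2]
        elif op[0] == "halt":
            break
        pc = next_pc
    return trace, regs


# Branch with a useful instruction in the delay slot: r1 = 3 still runs.
with_slot = [("goto", 3), ("set", "r1", 3), ("set", "r2", 99), ("halt",)]

# Same branch with a nop in the delay slot: r1 = 3 is skipped.
with_nop = [("goto", 4), ("nop",), ("set", "r1", 3), ("set", "r2", 99),
            ("halt",)]

if __name__ == "__main__":
    print(run_delayed(with_slot))
    print(run_delayed(with_nop))
```

   In the first program the assignment in the delay slot executes before
   control reaches the branch target. In the second, the nop absorbs the
   delay slot and neither assignment executes.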
   Because of this latency, a complex situation arises if successive branch
   instructions are coded in the following manner:

          goto A          ; instruction 1
          goto B          ; instruction 2
     A:   .
     B:   .
     C:   .

   The order of execution is instruction 1, instruction 2, A, B. If the
   instruction at A is not a goto, execution continues from B. If the
   instruction at A is goto C, the order of execution is instruction 1,
   instruction 2, A, B, C, and execution continues from C. Successive branch
   instructions are useful in some applications.

4. Conditional Branching on DAU Conditions.

   The condition tested by a DAU conditional branch or conditional
   arithmetic/logic instruction is established by the last DA instruction
   that affected the DAU flags no sooner than four instructions prior to the
   test (a three instruction latency):

     a0 = a0 + a1         ; instruction 1
     a2 = a0 * a2         ; instruction 2
     .                    ; instruction 3
     .                    ; instruction 4
     if(agt) goto next    ; instruction 5
     .                    ; instruction 6 (latent instruction)

   The condition tested in instruction 5 is established by instruction 1, not
   instruction 2. Because of this latency effect, use the zero-latency
   ifalt(), ifagt() and ifaeq() functions where possible (see DA Special
   Instructions). The DA condition tested by these conditional accumulator
   loads is established by the last DA instruction that affected the DAU
   flags.


Ref.

  "DSP3210 Digital Signal Processor. The Multimedia Solution", Information
  Manual. AT&T, September 1991 printing.

  "VCOS Multimedia Development Kit, Technical Reference". AT&T Release 1.0,
  March 1992 printing.

  DSP3210 Support Software Toolkit Manual, Release 1.3

  DSP3210 Support Software Library Manual, Release 1.3