DSP3210 Overview
----------------

The DSP3210 is a full 32-bit floating point DSP implemented in 0.9 micron CMOS.
It provides many advantages over fixed point DSPs such as the Motorola 56000.
Some of the main features of the DSP3210 include:

  * 32-bit floating point arithmetic.
  * 32-bit addressing.
  * Large (8k) on-chip, zero wait-state memory.
  * Single cycle instructions (for up to 33 Mflops).
  * Shares a bus with a Motorola or Intel style CPU.
  * Serial I/O with DMA transfer counters for up to 25 Mbits/second transfer:
      Serial data transfers occur without processor intervention.
      Cycles are stolen when necessary.
      DMA control for serial in and serial out.
  * Barrel Shifter for bit manipulation in graphics or data encryption.
  * Both mu-law and A-law encoding.
  * Bit I/O: general purpose 8-bit I/O port for control of external hardware.
  * Programmable 32-bit timer for interval timing, rate generation, event
    counting or waveform generation.
  * Fully vectored interrupt structure with hardware context save:
      Allows very fast interrupt processing, up to 2 million/second.
  * Low power CMOS design.

No special programming is required on the DSP3210 to implement floating point
algorithms, or to process signals with a much larger dynamic range (in excess
of 1500 dB as opposed to < 300 dB for fixed point). The DSP3210 is also
designed to share a host memory bus with either a Motorola or Intel style CPU.
This greatly reduces system cost by removing the requirement for expensive
fast local memory for the DSP. This also removes any practical restrictions on
program or data size. A large on-chip cache (8k) combined with software that
intelligently utilizes the cache allows the DSP3210 to execute complex signal
processing algorithms without expensive local memory. All instructions execute
in a single cycle (four clock periods: 80 ns for a 50 MHz part or 60 ns for a
66 MHz part), including all floating point normalization (which is performed
automatically). A single instruction may have two floating point operations:
a floating point multiplication and a floating point addition. The DSP3210
also supports up to four memory accesses in a single instruction cycle (quad-
word transfer). The DSP3210 architecture features seven functional units:

  * Control Arithmetic Unit (CAU)
  * Data Arithmetic Unit (DAU)
  * On-chip memory (RAM0, RAM1, Boot ROM)
  * Bus Interface
  * Serial I/O (SIO)
  * DMA Controller (DMAC)
  * Timer/Status Control (TSC)
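
The cycle time and throughput figures quoted above follow from the four-clock
instruction cycle and the two floating point operations per instruction. As a
rough check, the arithmetic can be sketched in Python. The function names are
illustrative only, and the 66 MHz part is treated as a nominal 66.7 MHz clock,
which is an assumption made here to match the quoted 60 ns cycle.

```python
CLOCKS_PER_INSTRUCTION = 4   # one instruction cycle = four clock periods
FLOPS_PER_INSTRUCTION = 2    # one multiply plus one add per instruction


def cycle_ns(clock_hz):
    """Instruction cycle time in nanoseconds for a given clock frequency."""
    return CLOCKS_PER_INSTRUCTION / clock_hz * 1e9


def peak_mflops(clock_hz):
    """Peak floating point rate in Mflops (two operations per cycle)."""
    instructions_per_second = clock_hz / CLOCKS_PER_INSTRUCTION
    return instructions_per_second * FLOPS_PER_INSTRUCTION / 1e6


# 66.7 MHz is an assumed nominal value for the "66 MHz" part.
for mhz in (50.0, 66.7):
    hz = mhz * 1e6
    print(f"{mhz} MHz: {cycle_ns(hz):.1f} ns/cycle, {peak_mflops(hz):.1f} Mflops")
```

The 50 MHz case gives an 80 ns cycle and 25 Mflops; the faster part lands at
roughly 60 ns and 33 Mflops, in line with the feature list.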


The Control Arithmetic Unit
---------------------------

The CAU is responsible for address calculation, branching control and all
16- and 32-bit integer logic and arithmetic operations. It is a RISC core
consisting of a 32-bit Arithmetic Logic Unit (ALU), a 32-bit Program Counter
(PC), 22 32-bit general purpose registers (r0-r22) and a 32-bit barrel
shifter. This core executes instructions at up to 16.7 million instructions
per second. There are special register considerations in the CAU:

  r0              hardwired to 0 (always)
  r1-r14          DA instruction memory reference (X,Y,Z) pointer registers
  r15-r19         DA instruction memory reference (X,Y,Z) increment registers
  r20             used by error exception facility to store old pc
  r21             stack pointer (sp)
  r22             pointer to the exception vector table (evtp)

The CAU provides the following branching and control instructions:

  if (COND) goto {N,rB,rB+N}       Conditional branch based on flags
  if (rM-->=0) goto {N,rB,rB+N}    Conditional branch using loop counter
  goto {N,rB,rB+N,M,rB+M}          Unconditional branch
  nop                              No operation
  call {N,rB,rB+N,M}(rM)           Call subroutine
  return {rM}                      Return from subroutine
  do K,{L,rM}                      Do next K+1 instruction(s) L+1 or (rM+1)
                                   times. K=0,1,2...127; L=rM=0,1,2...2047
  dolock K,{L,rM}                  Signals interlocked bus transfer
  doblock {L,rM}                   Signals quad-word transfers
  ireturn                          Return from interrupt
  sftrst                           Soft-reset; changes error level to base
                                   level; encoded as spc=(byte)r0
  waiti                            Wait for interrupt; encoded as spc=(long)r0

  where:  rB = pc, r0-r22
          rM = r1-r22
          N = 16-bit unsigned integer
          M = 24-bit unsigned integer
          COND = one of the DSP3210 condition codes (refer to DSP3210 manual)
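
The repeat counts of the do instruction are easy to get wrong by one, since
both fields encode a count minus one. The Python sketch below models only the
semantics stated in the table above (next K+1 instructions, L+1 times). The
helper name do_trace is hypothetical and is not part of any DSP3210 tool.

```python
def do_trace(k, l):
    """Model 'do K,L': run the next K+1 instructions L+1 times.

    Returns the body-instruction offsets (0..K) in the order they
    would execute.
    """
    if not 0 <= k <= 127:
        raise ValueError("K must be in 0..127")
    if not 0 <= l <= 2047:
        raise ValueError("L must be in 0..2047")
    trace = []
    for _ in range(l + 1):            # L+1 passes over the block
        for offset in range(k + 1):   # K+1 instructions per pass
            trace.append(offset)
    return trace


# 'do 1,2' repeats the next two instructions three times.
print(do_trace(1, 2))   # [0, 1, 0, 1, 0, 1]
```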


The Data Arithmetic Unit
------------------------

The DAU consists of a 32-bit floating point multiplier, a 40-bit floating
point adder, four 40-bit floating point accumulators (a0-a3), a clip test
register (ctr), and a control register (dauc). The multiplier and adder
operate in parallel to perform up to 16.7 million computations per second
(12.5 million for a 50 MHz part) of the form (a=b+c*d), also known as a
multiply-accumulate. The DAU contains a four stage pipeline which is visible
to the application programmer. The DAU supports the following floating point
formats:

  Single precision (32-bit) in both DSP32 and IEEE format
  Extended single precision (40-bit) (uses 8 mantissa guard bits)

Single instruction data type conversions are done in the DAU hardware:

  DSP32 and IEEE 32-bit floating point
  16/32-bit integer
  8-bit unsigned
  mu-law and A-law

The DAU has a number of special instructions to greatly simplify data type
conversions and other common operations:

  [Z=] aN = ic(Y)      Input conversion mu-law, A-law, 8-bit linear to float.
  [Z=] aN = oc(Y)      Output conversion float to mu-law, A-law, 8-bit linear.
  [Z=] aN = float16(Y) 16-bit integer to float.
  [Z=] aN = float32(Y) 32-bit integer to float.
  [Z=] aN = int16(Y)   Float to 16-bit integer (round or truncate, dauc[4]).
  [Z=] aN = int32(Y)   Float to 32-bit integer (round or truncate, dauc[4]).
  [Z=] aN = round(Y)   Round to nearest, float(40) to float(32).
  [Z=] aN = ifalt(Y)   Conditional assignment/memory write.
  [Z=] aN = ifaeq(Y)   Conditional assignment/memory write.
  [Z=] aN = ifagt(Y)   Conditional assignment/memory write.
  [Z=] aN = dsp(Y)     IEEE to DSP format conversion.
  [Z=] aN = ieee(Y)    DSP to IEEE format conversion.
  [Z=] aN = seed(Y)    32-bit to 32-bit reciprocal seed.

Where [Z=] denotes an optional write of the result to memory (Z). Note that Y
may not be a0-a3 for the dsp() special function.
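
A reciprocal seed is normally the starting point for an iterative refinement
rather than a final result. The Python sketch below shows one common approach,
Newton-Raphson iteration, in which each step x = x * (2 - d * x) is built from
multiplies and adds of the kind the DAU performs. The seed function here is a
stand-in linear approximation; the accuracy and algorithm of the hardware
seed() are not described in this overview, so both are assumptions.

```python
import math


def seed(d):
    """Stand-in for a hardware reciprocal seed (assumed, not the real seed()).

    Uses the linear estimate 48/17 - 32/17 * m for 1/m with m in [0.5, 1),
    which is accurate to within about 6 percent.
    """
    m, e = math.frexp(d)                  # d = m * 2**e, 0.5 <= |m| < 1
    estimate = 48.0 / 17.0 - 32.0 / 17.0 * abs(m)
    return math.copysign(math.ldexp(estimate, -e), d)


def reciprocal(d, iterations=3):
    """Refine the seed with Newton-Raphson steps: x = x * (2 - d * x)."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    x = seed(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)             # relative error is squared each step
    return x


print(reciprocal(3.0), 1.0 / 3.0)
```

Because the relative error is squared on every step, a seed that is only a
few percent accurate reaches single precision accuracy in about three steps.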


Addressing Modes
----------------

DSP3210 assembler language exhibits a syntax very similar to 'C'. The notation
conventions are as follows: a0-a3 are the accumulators (DAU), and r0-r22 are
the CAU registers. Instructions take the following appearance:

  r2 = (long)r1      ; CAU register direct: store the contents of r1 in r2
  r2 = (long)*r1     ; load the value pointed to by r1 into r2
  r1 = (long)r1 + 1  ; increment r1 by 1
  *r2++ = (long)r1   ; postmodify increment r2 after storing r1 there (in *r2)
  r3 = (long)r1 + r2 ; add two numbers in r1, r2: store result in r3
  r3 = (long)*r1++r2 ; load *r1 into r3, then postincrement r1 by r2
  a2 = a2 + *r2 * a3 ; use that pipeline!
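
To make the pointer semantics concrete, the following Python sketch models a
few of the addressing forms above. Memory is modeled as a list of words
indexed directly by the register value, so ++ advances by one word; how the
real hardware scales the increment for each data size is not covered in this
overview and is left out of the model. The class and method names are
hypothetical.

```python
class Cau:
    """Toy model of CAU registers plus a word-indexed memory."""

    def __init__(self, memory_words):
        self.r = [0] * 23             # r0..r22 (this sketch never writes r0)
        self.mem = [0] * memory_words

    def load(self, dst, ptr):
        """rD = *rP  (register indirect)"""
        self.r[dst] = self.mem[self.r[ptr]]

    def load_postinc(self, dst, ptr, inc):
        """rD = *rP++rI  (read through rP, then add rI to rP)"""
        self.r[dst] = self.mem[self.r[ptr]]
        self.r[ptr] += self.r[inc]

    def store_postinc(self, ptr, src):
        """*rP++ = rS  (write through rP, then advance rP by one word)"""
        self.mem[self.r[ptr]] = self.r[src]
        self.r[ptr] += 1


cau = Cau(16)
cau.mem[4] = 99
cau.r[1], cau.r[2] = 4, 3
cau.load_postinc(3, 1, 2)             # r3 = *r1++r2
print(cau.r[3], cau.r[1])             # 99 7
```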


The following table lists the various addressing modes supported by the
DSP3210:

------------------------------------------------------------------------------
                                       Instruction Type
Addressing Mode         CA Data      CA Data      CA Arithmetic/  DA M/A &
                        Move Group   Move Group   Logic Group     Special Func
                        (CAU Reg)    (I/O Reg)
------------------------------------------------------------------------------
Short Immediate         Yes
24-bit Immediate        Yes
Memory Indirect         Yes
CAU Register Direct     Yes          Yes          Yes
IO Register Direct      Yes
DAU Register Direct     Yes                                       Yes
Register Indirect       Yes          Yes                          Yes
Register Indirect with  Yes          Yes                          Yes
Postmodification
------------------------------------------------------------------------------


Latency Issues
--------------

The most difficult aspect of programming the DSP3210 is keeping track of
latency in the instruction pipeline. There are four cases in which the
pipeline introduces latency:

1. DA Memory Writes. When a DA instruction specifies a write to memory, the
   value written is not available to be read from that location until four
   instructions later (a three instruction latency). For example:

  *r3 = a0 = a0     ; instruction 1
  *r3 = a3 = a3     ; instruction 2
  .                 ; instruction 3
  .                 ; instruction 4
  a1 = *r3          ; instruction 5

The value read in instruction 5 is the value written in instruction 1, not
instruction 2. Instructions 2, 3 and 4 are latent instructions for instruction
1, and instructions 3, 4 and 5 are latent for instruction 2.
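
The write latency in case 1 can be modeled as a queue of pending writes that
only become visible four instructions after they are issued. The Python
sketch below is a simplified model of that rule, not a description of the
actual pipeline hardware; the class name, addresses and values are
illustrative.

```python
WRITE_LATENCY = 3  # latent instructions between a DA write and its visibility


class LatentMemory:
    """Memory whose DA writes become readable four instructions later."""

    def __init__(self):
        self.committed = {}   # address -> value visible to reads
        self.pending = []     # (visible_at, address, value), in issue order
        self.now = 0          # number of the instruction being executed

    def step(self):
        """Advance to the next instruction and commit any writes now due."""
        self.now += 1
        still_pending = []
        for visible_at, address, value in self.pending:
            if visible_at <= self.now:
                self.committed[address] = value
            else:
                still_pending.append((visible_at, address, value))
        self.pending = still_pending

    def write(self, address, value):
        self.pending.append((self.now + WRITE_LATENCY + 1, address, value))

    def read(self, address):
        return self.committed.get(address)


mem = LatentMemory()
mem.step(); mem.write(0x100, "a0")    # instruction 1: *r3 = a0 = a0
mem.step(); mem.write(0x100, "a3")    # instruction 2: *r3 = a3 = a3
mem.step()                            # instruction 3
mem.step()                            # instruction 4
mem.step(); print(mem.read(0x100))    # instruction 5 prints a0
mem.step(); print(mem.read(0x100))    # instruction 6 prints a3
```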

2. Accumulator as Multiplier Input. When an accumulator is used as an input to
   the multiplier, its value is established no sooner than three instructions
   prior to the multiply instruction (a two instruction latency). Note that
   this also applies to an accumulator using the X field of an instruction
   of the form:

     [Z=] aN = [-]Y {+,-}X

For example:

  a0 = a0 + *r1**r2     ; instruction 1
  a0 = a0 + a1          ; instruction 2
  .                     ; instruction 3
  a2 = a0 * a0          ; instruction 4
  a1 = a2 + a0          ; instruction 5

The value of a0 used in instruction 4 is calculated in instruction 1. The
value of a2 used in instruction 5 is calculated in instruction 4 since there
is no latency effect on accumulators used as inputs to the adder.
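
The difference between adder inputs and multiplier inputs in case 2 can be
sketched by recording which instruction wrote each accumulator value. The
Python model below is a simplification under stated assumptions: the adder
sees the result of the previous instruction, the multiplier (including the X
field of a Y+X instruction) sees only results at least three instructions
old, and the operand values are made up for illustration.

```python
MULT_INPUT_DELAY = 3  # multiplier sees values written >= 3 instructions ago


class Accumulator:
    """Accumulator that remembers which instruction wrote each value."""

    def __init__(self, initial):
        # The initial value counts as written long before instruction 1.
        self.history = [(float("-inf"), initial)]

    def write(self, instruction, value):
        self.history.append((instruction, value))

    def _latest(self, newest_writer):
        # Most recent value written at or before 'newest_writer'.
        return [v for i, v in self.history if i <= newest_writer][-1]

    def for_adder(self, instruction):
        """Adder input: sees the result of the previous instruction."""
        return self._latest(instruction - 1)

    def for_multiplier(self, instruction):
        """Multiplier input: sees results at least three instructions old."""
        return self._latest(instruction - MULT_INPUT_DELAY)


a0, a1, a2 = Accumulator(1.0), Accumulator(10.0), Accumulator(0.0)
x, y = 2.0, 3.0                                           # stand-ins for *r1, *r2

a0.write(1, a0.for_adder(1) + x * y)                      # a0 = a0 + *r1 * *r2
a0.write(2, a0.for_adder(2) + a1.for_multiplier(2))       # a0 = a0 + a1
# instruction 3 does not touch the accumulators
a2.write(4, a0.for_multiplier(4) * a0.for_multiplier(4))  # a2 = a0 * a0
a1.write(5, a2.for_adder(5) + a0.for_multiplier(5))       # a1 = a2 + a0

print(a2.for_adder(6), a1.for_adder(6))                   # 49.0 66.0
```

Instruction 4 squares 7.0 (the result of instruction 1) rather than 17.0 (the
result of instruction 2), which is the effect described above.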

3. Branching. When a CA Control Group instruction of the form if()goto, call,
   return or goto is executed, the instruction immediately following is also
   executed before the branch occurs. This is commonly referred to as a
   delayed branch. The ireturn instruction is different: execution of the
   base-level program resumes in the following instruction cycle. For example:

  if(eq) goto over      ; instruction 1
  r1 = 3                ; instruction 2
  .                     ; instruction 3

Instruction 2 is executed even if the condition is true and the branch is
taken. If this is undesirable, a nop can be placed after the branch
instruction, or if possible, the instructions can be rearranged. Because of
this latency, a complex situation arises if successive branch instructions are
coded in the following manner:

  goto A                ; instruction 1
  goto B                ; instruction 2

  A:
  .
  B:
  .
  C:
  .

The order of execution is instruction 1, instruction 2, A, B. If the
instruction at A is not a goto, execution continues from B. If the instruction
at A is goto C, the order of execution is instruction 1, instruction 2, A, B,
C, and execution continues from C. Successive branch instructions are useful
in some applications.
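
The execution orders described for case 3 can be reproduced with a toy
interpreter that keeps both a current and a next program counter, a common
way of modeling a single branch delay slot. This is a sketch of the behavior
described above, not of the DSP3210 sequencer itself; the tuple-based program
encoding and the run function are assumptions made for illustration.

```python
def run(program, max_steps=32):
    """Execute a toy program with one branch delay slot.

    'program' is a list of ("nop",) or ("goto", target_index) tuples.
    Returns the indices of the instructions in the order they execute.
    """
    trace = []
    pc, next_pc = 0, 1
    while pc < len(program) and len(trace) < max_steps:
        op = program[pc]
        trace.append(pc)
        if op[0] == "goto":
            # The instruction at next_pc (the delay slot) still runs;
            # the branch target only takes effect after it.
            pc, next_pc = next_pc, op[1]
        else:
            pc, next_pc = next_pc, next_pc + 1
    return trace


A, B, C = 2, 4, 6
program = [
    ("goto", A),   # 0: instruction 1
    ("goto", B),   # 1: instruction 2
    ("nop",),      # 2: A
    ("nop",),      # 3
    ("nop",),      # 4: B
    ("nop",),      # 5
    ("nop",),      # 6: C
]
print(run(program))            # [0, 1, 2, 4, 5, 6]

program[A] = ("goto", C)       # make the instruction at A a branch to C
print(run(program))            # [0, 1, 2, 4, 6]
```

In the first run the single instruction at A executes and control then
continues from B; in the second, A and B each execute once before control
reaches C.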

4. Conditional Branching on DAU Conditions. The condition tested by a DAU
   conditional branch or conditional arithmetic/logic instruction is
   established by the last DA instruction that affects the DAU flags no
   sooner than four instructions prior to the test (a three instruction
   latency):

  a0 = a0 + a1          ; instruction 1
  a2 = a0 * a2          ; instruction 2
  .                     ; instruction 3
  .                     ; instruction 4
  if(agt) goto next     ; instruction 5
  .                     ; instruction 6 (latent instruction)

The condition tested in instruction 5 is established by instruction 1, not
instruction 2. Because of this latency effect, use the zero-latency ifalt(),
ifagt() and ifaeq() functions where possible (see the special instructions
listed under The Data Arithmetic Unit). The DA condition tested by these
conditional accumulator loads is established by the last DA instruction that
affected the DAU flags.
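
The flag latency in case 4 can be sketched as a lookup that ignores any DA
result produced fewer than four instructions before the branch. The Python
model below assumes that agt means "result greater than zero" and uses
made-up result values; the function names are hypothetical.

```python
FLAG_DELAY = 4  # a branch sees flags set at least four instructions earlier


def flags_seen_by_branch(results, branch_at):
    """Return the DA result whose flags a conditional branch would test.

    'results' maps instruction number -> value produced by a DA instruction
    that affects the DAU flags. Only instructions issued at least
    FLAG_DELAY instructions before 'branch_at' are visible to the branch.
    """
    visible = [n for n in results if n <= branch_at - FLAG_DELAY]
    if not visible:
        return None
    return results[max(visible)]


def branch_taken_agt(results, branch_at):
    """Model 'if(agt) goto ...': taken when the visible result is > 0."""
    value = flags_seen_by_branch(results, branch_at)
    return value is not None and value > 0


# Instruction 1 produces a positive result, instruction 2 a negative one.
da_results = {1: 2.5, 2: -7.0}
print(branch_taken_agt(da_results, branch_at=5))   # True: tests instruction 1
print(branch_taken_agt(da_results, branch_at=6))   # False: tests instruction 2
```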


References
----------

"DSP3210 Digital Signal Processor. The Multimedia Solution", Information
Manual. AT&T, September 1991 printing.

"VCOS Multimedia Development Kit, Technical Reference". AT&T, Release 1.0,
March 1992 printing.

"DSP3210 Support Software Toolkit Manual". Release 1.3.

"DSP3210 Support Software Library Manual". Release 1.3.