### Efficient Binary Field Multiplication on a VLIW DSP #### Christian Sean Tergino Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science In Computer Engineering Patrick Schaumont Wuchun Feng Michael Hsiao June 18, 2009 Blacksburg, VA Keywords: Binary Field, Galois Field, GF, Multiplication, Digital Signal Processor, Very Long Instruction Word, Modular Multiplication, C64x+, Heterogeneous Multiprocessors, Copyright 2009, Christian Sean Tergino #### Efficient Binary Field Multiplication on a VLIW DSP #### Christian Sean Tergino #### **ABSTRACT** Modern public-key cryptography relies extensively on modular multiplication with long operands. We investigate the opportunities to optimize this operation in a heterogeneous multiprocessing platform such as TI OMAP3530. By migrating the long-operand modular multiplication from a general-purpose ARM Cortex A8 to a specialized C64x+ VLIW DSP, we are able to exploit the XOR-Multiply instruction and the inherent parallelism of the DSP. The proposed multiplication utilizes Multi-Precision Binary Polynomial Multiplication with Unbalanced Exponent Modular Reduction. The resulting DSP implementation performs a $GF(2^{233})$ multiplication in less than 1.31us, which is over a seven times speed up when compared with the ARM implementation on the same chip. We present several strategies for different field sizes and field polynomials, and show that a 360MHz DSP easily outperforms the 500MHz ARM. # **ACKNOWLEDGEMENTS** I would like to thank all of my instructors and advisors throughout my time at Virginia Tech. I would specifically like to thank my advisory committee members Michael Hsiao and Wuchun Feng. A special thanks goes to my advisor Patrick Schaumont, who has spent a good deal of time helping guide my research. I also would like to thank my research group and my fellow students for their support. # **FOREWORD** This is an expanded version of a paper submitted to International Symposium on System-on-Chip on May 29th, 2009. It is currently under review. # TABLE OF CONTENTS | CHAPTER 1: INTRODUCTION | 1 | |------------------------------------------------------|----| | CHAPTER 2: BACKGROUND | 3 | | 2.1 Prime Field vs. Binary Field | 3 | | 2.2 Binary Field | 4 | | 2.3 NIST Finite Fields | 4 | | 2.4 Montgomery's Algorithm | 5 | | 2.5 Karatsuba Multiplication | 6 | | 2.6 Unbalanced Exponent Modular Reduction | 6 | | 2.7 Texas Instruments Beagle Board and OMAP3530 | 8 | | 2.8 Texas Instruments TMS320C64x+ DSP | 9 | | 2.9 Related Work | 10 | | CHAPTER 3: PROPOSED METHOD | 11 | | 3.1 Multi-Precision Binary Polynomial Multiplication | 11 | | 3.2 Unbalanced Exponent Modular Reduction | 13 | | CHAPTER 4: METHODOLOGY | 15 | | CHAPTER 5: RESULTS | 24 | | 5.1 Tools and Environment | 24 | | 5.2 Execution Times on the OMAP3530 | 24 | | 5.3 Cycle Count Compared To Other Implementations | 28 | | 5.4 Reduction Results | 29 | | CHAPTER 6: FUTURE WORK | 33 | | 6.1 Improving Binary Field Multiplication | 33 | | 6.2 Implementing a Cryptosystem on the C64x+ | 34 | | 6.3 Finite Field Multiplications on Other Processors | 34 | | CHAPTER 7: CONCLUSIONS | 35 | | REFERENCES | 36 | # LIST OF FIGURES | FIGURE 1: TEXAS INSTRUMENTS BEAGLE BOARD | 7 | |-------------------------------------------------------------------------------------|----| | FIGURE 2: OMAP3530 FUNCTIONAL BLOCK DIAGRAM | 8 | | FIGURE 3: BLOCK DIAGRAM OF THE C64X+ DSP | 9 | | FIGURE 4: MULTI-PRECISION BINARY POLYNOMIAL MULTIPLICATION | 11 | | FIGURE 5: EXECUTION DIAGRAM OF A MULTIPLICATION | 12 | | FIGURE 6: UNBALANCED EXPONENT MODULAR REDUCTION | 14 | | FIGURE 7: DESIGN FLOW FOR DEVELOPING SOFTWARE FOR THE C64x+ | 15 | | FIGURE 8: EXECUTION TIMES OF MULTIPLICATIONS | 25 | | FIGURE 9: MULTIPLICATION EXECUTION CYCLES PER BIT | 26 | | FIGURE 10: MULTIPLICATION EXECUTION CYCLES PER BIT <sup>2</sup> ON C64x+ | 26 | | FIGURE 11: MULTIPLICATION EXECUTION CYCLES PER BIT <sup>2</sup> ON ARM | 27 | | FIGURE 12: MULTIPLICATION EXECUTION CYCLES PER BIT RAISED TO THE LOG <sub>2</sub> 3 | 27 | | FIGURE 13: AN ARM WITH MULGF VERSUS OUR IMPLEMENTATION | 29 | | FIGURE 14: CYCLES FOR UNBALANCED EXPONENT MODULAR REDUCTION | 30 | | FIGURE 15: PERCENTAGE OF EXECUTION TIME NEEDED FOR REDUCTION | 31 | # LIST OF TABLES | TABLE I: NIST PRIME FIELD SIZES | 5 | |-----------------------------------------------------------|----| | TABLE II: NIST BINARY FIELD IRREDUCIBLE POLYNOMIALS | 5 | | TABLE III: MODULAR MULTIPLICATION ON THE TI OMAP3530 | 25 | | TABLE IV: CYCLES FOR MODULAR MULTIPLICATION | 28 | | TABLE V: CYCLES FOR UNBALANCED EXPONENT MODULAR REDUCTION | 29 | | TABLE VI: CYCLE PERCENTAGE FOR REDUCTION | 32 | # CHAPTER 1 # **INTRODUCTION** Modern mobile devices frequently make use of heterogeneous multi-processors, which allows them to execute a mix of multimedia applications and general-purpose information technology. The parallelism, as well as the architectural heterogeneity, ensures that these chips are very energy-efficient. For example, the TI OMAP3530 runs off less than 500mA at 5V [1]. However, this efficiency is only available when applications can optimally exploit the features of the architecture. This means that software applications must be developed with the specialized platform features in mind. Indeed, these features are typically ignored by general-purpose software compilers, and require either a clever, architecture-aware programmer, or else a specialized software library. In this thesis we consider the efficient execution of modular multiplication of long operands. Modular multiplication is a cornerstone of public-key cryptography, and it is used in algorithms such as ECC, RSA and DSA. Most known software optimizations of cryptographic modular multiplications assume standard processor architectures, and focus on the algorithm. Some well-known examples are modular multiplications with Montgomery or Barrett reduction, or multiplication based on Karatsuba decomposition [2]. However, rather than developing new algorithms, we investigate the opportunities offered by a heterogeneous Multi-Processor System on Chip (MPSoC) architecture. We investigate the implementation of modular multiplications on Texas Instrument's OMAP3530, which has an ARM Cortex A8 and a TI C64x+ DSP. We will use the C64x+ DSP in the OMAP3530 as a cryptographic accelerator. There are two arguments for this. The first is that this DSP is a Very Long Instruction Word (VLIW) architecture, which enables parallelism. The second is that the DSP has an XOR-Multiply (XORMPY) instruction, which can accelerate binary field multiplications in $GF(2^m)$ . Neither of these features is available on the ARM. Because cryptographic operands typically are several hundreds of bits long, and because the C64x+ DSP is a 32-bit processor, we have to implement modular multiplications using multi-precision arithmetic, based on combining XORMPYs, XORs and shifts. We call this operation Multi-Precision Binary Polynomial Multiplication (MPBPM). The contribution of our work is an efficient implementation of MPBPM on the C64x+ DSP. We also present two efficient modular reduction techniques based on Unbalanced Exponent Modular Reduction (UEMR) [3]. While these algorithms are known, we are not aware of any Binary Field Multiplication implementations optimized for the C64x+ DSP. Our results show that the resulting binary field multiplication executes six times faster on the DSP when compared with the ARM. This comparison is made against an ARM which runs an optimized modular multiplication algorithm at a higher clock frequency. The remainder of this thesis is organized as follows. Chapter 2 introduces Finite Fields and the TI OMAP3530. Chapter 3 presents our proposed methods for Multi-Precision Binary Polynomial Multiplication and Unbalanced Exponent Modular Reduction. Our methodology is explained in Chapter 4. Results are given in Chapter 5. Possible future work on this topic is discussed in Chapter 6, and conclusions are drawn in Chapter 7. ## CHAPTER 2 # **BACKGROUND** This chapter reviews prime field and binary field arithmetic, Unbalanced Exponent Modular Reduction, Montgomery's algorithm, Karatsuba Multiplication, the TI Beagle Board, OMAP3530 and C64x+. #### 2.1 Prime Field vs. Binary Field ECC, a very popular public-key cryptographic primitive, is implemented over GF(p), a prime field, or $GF(2^m)$ , a binary field. The National Institute of Standard and Technology (NIST) recommends five specific prime fields and five specific binary fields, with curves defined for each. GF(p) defines a finite field of integers, where its elements are $\{0, 1, 2, 3, ..., p-2, p-1\}$ and p is a prime number. Every operation is performed modulo p. $GF(2^m)$ defines a Finite Field of binary polynomials, i.e. polynomials whose coefficients are each 0 or 1. The maximal term in a number in $GF(2^m)$ is $x^{m-1}$ . Every operation in this field is performed modulo an irreducible polynomial, f(x). GF(p) is popularly implemented in software while $GF(2^m)$ is usually implemented in hardware. $GF(2^m)$ multiplications are faster than GF(p) in hardware because there are no carries, which results in a reduced critical path [4]. It is also quicker to compute the inverse in a binary field versus a prime field [2]. On the other hand, GF(p) multiplications are faster than $GF(2^m)$ in software because processors have integer multipliers built into them. $GF(2^m)$ relies on binary polynomial multiplications and a large majority of processors do not have support for this. However, the C64x+ does have support for binary polynomial multiplication for Reed Solomon based error control coding [5]. This special purpose hardware can also be utilized by cryptographic algorithms by implementing MPBPM. #### 2.2 Binary Fields Binary field numbers are within $GF(2^m)$ and are represented in the form $$A(x) = a_0 x^0 + a_1 x^1 + a_2 x^2 + \dots + a_{m-2} x^{m-2} + a_{m-1} x^{m-1},$$ (1) where each coefficient $a_i = \{0, 1\}$ . Registers can easily represent these numbers where each bit in a word is a coefficient. For each binary field, an irreducible polynomial f(x) is defined: $$f(x) = T(x) + x^m, (2)$$ $$T(x) = x^{0} + t_{1}x^{1} + t_{2}x^{2} + \dots + t_{m-2}x^{m-2} + t_{m-1}x^{m-1},$$ (3) where every coefficient, $t_i = \{0, 1\}$ . All operations in $GF(2^m)$ are performed modulo f(x). Addition and subtraction are equivalent to performing a bitwise Exclusive-OR between the two operands. The first step of multiplication, Binary Polynomial Multiplication (BPM), is similar to integer multiplication. In the second step, partial products are added together, which is equivalent to performing bitwise Exclusive-ORs. Performing this BPM, C(x) = A(x)B(x) yields $$C(x) = c_0 x^0 + c_1 x^1 + c_2 x^2 + \dots + c_{2m-2} x^{2m-3} + c_{2m-2} x^{2m-2}.$$ (4) The product must be reduced to remain within $GF(2^m)$ , which is done by performing C(x) mod f(x). #### 2.3 NIST Finite Fields NIST recommends the usage of five prime fields and five binary fields. The NIST prime fields are GF(p192), GF(p224), GF(p256), GF(p384) and GF(p521). The prime fields are referred to in the form of GF(pn), where n is the number of bits needed to represent the prime modulo, p. Therefore, GF(p192) means that the field has p elements, where p is a 192-bit prime number. The p values are shown for each NIST Prime Field in Table I. The NIST Binary Fields are referred to in the form $GF(2^m)$ , where m is the number of bits in the binary field. NIST has recommended an irreducible polynomial for each binary field. These polynomials are chosen to make computations in that field very efficient. Each of these irreducible polynomials has either three or five terms and the biggest term in T(x) is less than $x^{m/2}$ . The NIST binary fields are $GF(2^{163})$ , $GF(2^{233})$ , $GF(2^{283})$ , $GF(2^{409})$ and $GF(2^{571})$ . Table II displays the irreducible polynomial for each NIST Binary Field. TABLE I. NIST PRIME FIELD SIZES | NIST<br>Prime Field | Field Size | |---------------------------|-----------------------------------------------------| | <i>GF</i> ( <i>p</i> 192) | 2 <sup>192</sup> - 2 <sup>64</sup> - 2 <sup>0</sup> | | <i>GF</i> ( <i>p</i> 224) | $2^{224} - 2^{96} + 2^0$ | | <i>GF</i> ( <i>p</i> 256) | $2^{256} - 2^{224} + 2^{192} + 2^{96} - 2^{0}$ | | <i>GF</i> ( <i>p</i> 384) | $2^{384} - 2^{128} - 2^{96} + 2^{32} - 2^{0}$ | | <i>GF</i> ( <i>p</i> 521) | $2^{521}$ - $2^0$ | TABLE II. NIST BINARY FIELD IRREDUCIBLE POLYNOMIALS | NIST<br>Binary Field | Irreducible<br>Polynomial | |----------------------|--------------------------------------| | $GF(2^{163})$ | $x^{163} + x^7 + x^6 + x^3 + x^0$ | | $GF(2^{233})$ | $x^{233} + x^{74} + x^0$ | | $GF(2^{283})$ | $x^{283} + x^{12} + x^7 + x^5 + x^0$ | | $GF(2^{409})$ | $x^{409} + x^{87} + x^0$ | | $GF(2^{571})$ | $x^{571} + x^{10} + x^5 + x^2 + x^0$ | ### 2.4 Montgomery's Algorithm Montgomery's algorithm is commonly used to speed up reduction. In prime field, there exists a k such that $2^{k-1} . Let <math>r$ be $2^k$ and let the p-residue of a number a be $\bar{a} = a \cdot r \pmod{p}$ . The Montgomery product of two p-residues, $\bar{a}$ and $\bar{b}$ , is $\bar{c} = \bar{a} \cdot \bar{b} \cdot r^{-1} \pmod{p} = c \cdot r \pmod{p}$ . Thus, c can be calculated by performing Montgomery Reduction: $c = \bar{c} \cdot r^{-1} \pmod{p}$ [6]. For example, if p = 11 then r = 16 and $r^{-1} = 9$ because $r \cdot r^{-1} \pmod{11} = 1$ . Let a = 5 and b = 4. Then $\bar{a} = 5 \cdot 16 \pmod{11} = 3$ and $\bar{b} = 4 \cdot 16 \pmod{11} = 9$ . Let $c = a \cdot b \pmod{p}$ . Therefore $\bar{c} = 3 \cdot 9 \cdot 9 \pmod{11} = 1$ and $c = 1 \cdot 9 \pmod{11} = 9$ . Performing Montgomery Reduction is faster than regular reduction (dividing by p). Performing more multiplication with the same operands will reduce the total execution time because the p-residue needs to only be calculated once for each number [7]. This algorithm can similarly be applied to binary fields. #### 2.5 Karatsuba Multiplication The Karatsuba method can be used to reduce the execution time for large operand multiplication. This method replaces some multiplications with additions and subtractions. With a base b, if an operand x has two digits, $(x_1x_0)$ , and operand y has two digits, $(y_1y_0)$ , then their product z is $$z = (x_1 y_1)b^2 + (x_0 y_1 + x_1 y_0)b + (x_0 y_0).$$ (8) Karatsuba showed that a multiplication can be eliminated from (8). The middle coefficient is equivalent to $$x_1 y_0 + x_0 y_1 = (x_0 y_0 + x_0 y_1 + x_1 y_0 + x_1 y_1) - x_1 y_1 - x_0 y_0$$ (9) $$x_1 y_0 + x_0 y_1 = (x_0 + x_1)(y_0 + y_1) - x_1 y_1 - x_0 y_0.$$ (10) Therefore (8) can be rewritten as $$z = (x_1 y_1)b^2 + ((x_0 + x_1)(y_0 + y_1) - x_1 y_1 - x_0 y_0)b + (x_0 y_0).$$ (11) For examples, if $b = 10^2$ and x = 2345 and y = 6789, then $x_1 = 23$ , $x_0 = 45$ , $y_1 = 67$ and $y_0 = 89$ . Traditionally, to calculate $z = x \cdot y$ , one would calculate $(45 \cdot 89) + (45 \cdot 67 + 23 \cdot 67)10^2 + (23 \cdot 67)10^4$ , which has four multiplications. Using Karatsuba Multiplication, one would calculate $z_0 = 45 \cdot 89 = 4005$ and $z_2 = 23 \cdot 67 = 1541$ first. Next, $z_1$ would be calculated as follows: $z_1 = (45 + 23) \cdot (89 + 67) - z_2 - z_0 = 5062$ . Hence, $z = z_0 + z_1b + z_2b^2 = 4005 + 5062 \cdot 10^2 + 1541 \cdot 10^4 = 15920205$ , which is equal to $2345 \cdot 6789$ . (11) has two more additions and subtractions but one less multiplication than (8). Traditional multiplication is performed in $O(n^2)$ because every word must be multiplied by every other word. Karatsuba Multiplication can be performed in $O(n^{\log_2 3})$ [8]. ### 2.6 Unbalanced Exponent Modular Reduction Shen, Jin and You proposed using Unbalanced Exponent Modular Reduction over binary fields in [3]. C(x) can be divided into two parts as shown in (5). Since $$C(x) = C_l(x) + C_h(x)x^m$$ (5) and $$T(x) \equiv x^m \mod f(x)$$ (6) then $$C(x) \equiv C_l(x) + C_h(x)T(x) \mod f(x)$$ (7) Both $C_l(x)$ and $C_h(x)$ are within $GF(2^m)$ . Because of (6), (7) can be performed and repeated until $C_h(x)$ is zero, in which case C(x) is completely reduced. When the largest nonzero term in T(x) is $x^k$ and $k < \frac{m}{2}$ , then (7) need only be performed twice. This is the case for all NIST Binary Fields. For example in $GF(2^4)$ , $f(x) = x^4 + T(x)$ and T(x) = x + 1. Multiplying $(x^3 + x + 1)$ $(x^2 + x)$ yields $C(x) = x^5 + x^4 + x^3 + x^2 + x^2 + x = x^5 + x^4 + x^3 + x$ . Therefore $C_l(x) = x^3 + x$ and $C_h(x) = x + 1$ . To reduce this product using UEMR, we multiply $C_h(x)$ by T(x), which gives $(x + 1)(x + 1) = x^2 + x + x + 1 = x^2 + 1$ . We take this and add it to $C_l(x)$ , which yields $x^3 + x^2 + x + 1$ . Since now $C_h(x)$ is zero, the product is completely reduced. If the $C_h(x)$ was not zero, this process would be repeated. **Figure 1.** Texas Instruments Beagle Board [1]. Photo by Gerald Coley of BeagleBoard.org. #### 2.7 Texas Instruments Beagle Board and OMAP3530 The Beagle Board, shown in Figure 1, is a development board manufactured by Texas Instruments and available for purchase through DigiKey for \$149. This board contains an OMAP3530, 128MB DDR RAM, 256 MB Flash, an SD Card Reader and a variety of I/O ports. The Beagle Board consumes less than 2500mW and can be powered by a USB port [1]. The TI OMAP3530 is a multiprocessor system in a Package-On-Package implementation built using 65nm technology [9]. It contains an ARM Cortex A8 and a TMS320C64x+ DSP [1]. Figure 2 depicts its Functional Block Diagram. The ARM Cortex A8 is an applications processor based on the ARMv7 architecture and consumes less than 300mW [10]. The Cortex A8 supports Thumb-2 technology and is implemented with 16KB Instruction L1 Cache, 16KB Data L1 Cache and 256KB L2 Cache [9]. **Figure 2.** OMAP3530 Functional Block Diagram. #### 2.8 Texas Instruments TMS320C64x+ DSP The C64x+ is a VLIW processor with 32KB of L1 Program Cache, 80KB of L1 Data Cache, and 96KB of L2 Cache [9]. It contains two identical data paths, each with four functional units. Each instruction is 32-bits and the processor can execute 8 instructions every cycle. Therefore every VLIW instruction fetch is 256-bits. The processor contains two register files, one for each data path. Each register file has 32 32-bit registers. The registers and data paths are shown in Figure 3. Each functional unit can optionally retrieve an operand from a register in the other data path. Shifts can be performed by the functional units .S1 and .S2. XORs, ANDs, and ORs can be performed by the functional units .S1, .S2, .D1, .D2, .L1 and .L2. Multiplications can be performed by .M1 and .M2 [11]. So in one cycle, the C64x+ can perform two shifts, four XORs and two multiplies. Alternatively, the C64x+ can perform six XOR instructions and two multiply instructions every cycle. Therefore an addition in $GF(2^m)$ can take just $[m/(32\cdot6)]$ clock cycles. Thus binary field addition can be performed faster on the C64x+ when compared to the ARM, since an ARM can only execute one 32-bit XOR every cycle. **Figure 3.** Block Diagram of the C64x+ DSP. The C64x+ contains an XOR-Multiply (XORMPY) instruction, which is identical to a normal multiplication except that the partial products are XORed together instead of added. This is ideal for binary polynomial multiplication. The operand sizes for XORMPY are limited to 32-bits and 9-bits and the product size is limited to 32-bits. #### 2.9 Related Work In [12], cryptosystems were implemented on the Texas Instruments TMS320C6201 DSP. The authors used Montgomery's algorithm for prime field multiplication on the DSP. This paper shows that RSA, DSA and ECC can run fast on a VLIW DSP. In [13], the architecture of the TMS320C6201 was considered when enhancing an algorithm to perform modular multiplication. Gastaldo, Parodi and Zunino produced a 45% reduction in cycles for 2048-bit prime field multiplications using Montgomery's algorithm. Like [13], we analyze a DSP's architecture to help speed up modular multiplication. However, we instead use binary fields and Unbalanced Exponent Modular Reduction. [14] shows that binary field multiplication can be sped up greatly by adding an instruction set extension, MULGF, to an ARM. MULGF performs 32-bit by 32-bit binary polynomial multiplication. However, adding this extension requires implementing the processor on an FPGA, which is not lower power, or designing a new ASIC, which takes significant time and resources. We will utilize a similar instruction, XORMPY, which is already part of the C64x+. Therefore our method requires no hardware design or implementation. # **CHAPTER 3** # PROPOSED MULTIPLICATION METHOD In this chapter we present our proposed method for Multi-Precision Binary Polynomial Multiplication and our proposed implementations for Unbalanced Exponent Modular Reduction. #### 3.1 Multi-Precision Binary Polynomial Multiplication We propose using the XORMPY instruction to perform Multi-Precision Binary Polynomial Multiplications. XORMPY puts its result in a 32-bit register. Since its operands are limited to 32-bits and 9-bits, the results extends out by 8-bits (32+9-1). Since **Figure 4.** Multi-Precision Binary Polynomial Multiplication: Partial products from 24-bit by 8-bit XORMPY are XORed into 32-bit products. we need all bits in the result, the sum of the number of bits in the operands must be less than or equal to 33 (not 32 because there are no carries). To keep everything byte-aligned, we propose performing 24-bit by 8-bit XORMPYs. Figure 4 shows an operand, "A", being multiplied by an operand, "B." Operand "A" must be shifted into 24-bit subwords. Each subword can then be multiplied by each byte of the "B" operand. This will produce partial products which can then be XORed together to form a complete product, "P". Figure 4 shows the formation of the partial products and complete product. **Figure 5.** Execution diagram of a multiplication between a 24-bit word of operand A and a byte of operand B in $GF(2^{163})$ Performing the XORMPY operations and then shifting and XORing the result is the most computationally intensive part of this algorithm. Given the C64x+'s parallelized architecture, this series of instructions can be computed quickly. The DSP can perform two XORMPYs, two shifts along with four XOR operations every cycle. Figure 5 shows the execution diagram of the formation of the partial products using the "First Byte" portrayed in Figure 4. Figure 5 details the dependency and ability to take advantage of the parallelized architecture of the C64x+. The rows represent cycles and the columns show concurrent operations. Even though only two XORMPYs can be dispatched every cycle, more than two XORMPYs can be executed concurrently. For example in Cycle 5, one XORMPY is starting to execute, but six other XORMPY instructions are also being executed. In Cycle 8, three XOR instructions, two Shift instructions and one Store instruction are all executed in parallel. #### 3.2 Unbalanced Exponent Modular Reduction After the full polynomial product is formed, it must be reduced into $GF(2^m)$ . We propose using Unbalanced Exponent Modular Reduction [3]. The irreducible polynomial is usually known at compile time, and based on this we propose using one of two methods for Unbalanced Exponent Modular Reduction. If three of four or more terms in T(x) are within 9 bits of each other, then it is beneficial to use XORMPYs, shifts and XORs to reduce the product. Else, it is more efficient to shift and add for reduction. If most of terms in T(x) are within a 9-bit window, then these terms can fit in the 9-bit operand of XORMPY. Therefore, XORMPY can be used by multiplying T(x) by $C_h(x)$ using 24-bits at a time. This is similar in time and resources to multiplying one byte of operand B by all 24-bit words of operand A. If there is a term that is not within this window, shifts and adds can be used implement that bit's multiplication. For example, in the case of $GF(2^{163})$ , the irreducible polynomial is: $f(x) = x^{163} + x^7 + x^6 + x^3 + x^0$ . So $T(x) = x^7 + x^6 + x^3 + x^0$ (0xC9). Performing the first iteration of reduction requires T(x) be multiplied with seven 24-bit words. The next and final iteration requires T(x) be multiplied with one 24-bit word. The left diagram of Figure 6 shows a 325-bit product being reduced to $GF(2^{163})$ . When most of the set bits are not in a 9-bit window of the irreducible polynomial, it is more effective to manually multiply the upper bits. Instead of calling XORMPY, for coefficient $t_i$ which is 1 in T(x), shift all bits m and greater to the right m-i bits. For each term produced, XOR them together and with the original bits below m. **Figure 6.** Unbalanced Exponent Modular Reduction. Left: Reduction over $GF(2^{163})$ using MPBPM. Right: Reduction over $GF(2^{233})$ using shifts and adds. For $GF(2^{233})$ , the irreducible polynomial, f(x), is $x^{233} + x^{74} + x^0$ . To reduce, we shift all bits above bit 232 to the right 233, then XOR them with the bits occupying bits 232-0. We XOR that result with $C_h(x)$ shifted to the right 233-74 = 159 bits. Since the result extends out over 233 bits, these steps must be repeated once more. This is shown in the right diagram of Figure 6. # **CHAPTER 4** ## **METHODOLOGY** We used the TI C6x C Compiler to build our program. All of our coding was in C. The compiler has built-in functionality which recognizes when the function "\_xormpy()" is written in a C program it refers to the XORMPY instruction. Given the inherent parallelism of the DSP, the compiler attempts to make the best use of its hardware when compiling the source code. Optimizing code for a processor with eight parallel functional units that can all access the same registers is complicated. The compiler can produce very different binary files by just switching the order of a few C statements, even if the resulting executable will have the exact same functionality. Therefore our design cycle, depicted in Figure 7, had some added steps to try to get the most out of the compiler. **Figure 7.** Design flow for developing software for the C64x+. First, we developed the algorithm and pseudo code. Next, we wrote the C code and compiled it. Then we went through a few cycles of debugging and fixing the C Code. Once the program was functioning properly, we worked on optimizing the resulting binary, by optimizing the C code. Some techniques we used to help reduce execution time are listed below: - Put C statements which could be executed in parallel, adjacent to each other - Unroll and roll loops - Adjust the number of temporary variables - Do not access registers modified by instructions with delay slots soon after dispatched After making one of the above changes, we would compile the modified program and compare its assembly listing file to the previous iteration's assembly listing. If the new listing file looked more optimized, we would work on further optimizing this new C file. Else, we would go back to the previous iteration's C file and try different optimizations. We considered an assembly listing more optimized if the number of cycles needed decreased from the previous listing file. The following C Code displays the main loop for a $GF(2^{163})$ multiplication. The compiled assembly listing of this code is listed immediately following the C Code. Each line in the assembly listing starting with "I" indicates it is being dispatched in parallel with the preceding instruction. Counting every instruction not preceded by "II", gives a good estimate for the number of cycles needed to perform that group of instructions. Therefore, assuming there are no cache misses, each iteration in this loop will take 96 cycles. There will be a very limited number of cache misses because the L1 Data Cache is 80KB and the data required for this multiplication is much less than 1KB. There are 227 total instructions being executed in these 96 cycles, including 23 cycles of NOPs (No-Operation instructions). Therefore, each cycle an average of 2.13 instructions are being dispatched in parallel, excluding NOPs. This loop will run for 5 iterations, so it takes a total of $96 \cdot 5 = 480$ cycles. The subsequent listing shows the same C Code with only two modifications. First, instead of using one variable to compute the partial products, seven variables are used (one for each partial product computed for each byte of operand "b"). Second, all XORMPY instruction calls for the same byte of operand "b", are written on consecutive lines, and all XOR and shift instruction calls for that byte are also written consecutively. Even though this has the exact same functionality as the previous C code, compiling this gives much different results. The resulting assembly listing is given after the optimized C code. If there are no cache misses, each iteration of this loop will take 43 cycles. Since there are a total of 165 instructions (with no NOPs) being executed in these 43 cycles, each cycle an average of 3.84 instructions are being dispatched. Since this is executed for five iterations, this section of the code requires $43 \cdot 5 = 215$ cycles. This is a speed up of 2.23 when compared with the unoptimized C code. This drastic difference in cycles can be surprising since there was no algorithmic change. However, this shows the importance of being architecturally-aware when writing code for a VLIW processor. ### Unoptimized C Code Snippet from $GF(2^{163})$ Multi-Precision Binary Polynomial Multiplication ``` 99 for(i=0; i<5; i++) 162 100 { 163 b = (0xFF0000&opb[i]) >> 16; 101 164 b = 0xFF&opb[i]; 102 165 103 166 p = xormpy(shifted[0], b); final[i] ^= (p << 16); p = xormpy(shifted[0], b); 167 104 final[i+1] ^= (p >> 16); 105 final[i] ^= p; 168 106 169 107 p = xormpy(shifted[1], b); 170 p = xormpy(shifted[1], b); 108 final[i] ^= (p << 24); 171 final[i+1] ^= (p << 8); final[i+2] ^= (p >> 24); final[i+1] ^= (p >> 8); 109 172 173 110 174 p = xormpy(shifted[2], b); 111 p = \_xormpy(shifted[2], b); 112 final[i+1] ^= (p << 16); 175 final[i+2] ^= (p); final[i+2] ^= (p >> 16); 176 113 114 177 p = xormpy(shifted[3], b); 115 178 final[i+2] ^= (p << 24); final[i+3] ^= (p >> 8); 116 p = xormpy(shifted[3], b); 179 final[i+2] ^= (p << 8); final[i+3] ^= (p >> 24); 117 180 181 p = xormpy(shifted[4], b); 118 119 182 final[i+3] ^= (p << 16); p = _xormpy(shifted[4], b); final[i+4] ^= (p >> 16); 120 183 121 final[i+3] ^= p; 184 122 185 p = xormpy(shifted[5], b); p = xormpy(shifted[5], b); final[i+4] ^= (p << 8); final[i+5] ^= (p >> 24); 123 186 final[i+3] ^= (p << 24); final[i+4] ^= (p >> 8); 124 187 125 188 126 189 p = xormpy(shifted[6], b); p = xormpy(shifted[6], b); 190 final[i+5] ^= p; 127 final[i+4] ^= (p << 16); final[i+5] ^= (p >> 16); 128 191 129 192 130 193 b = (0xFF0000000&opb[i])>>24; 131 194 132 195 b = (0xFF00&opb[i]) >> 8; 196 133 p = xormpy(shifted[0], b); 134 197 final[i] ^= (p << 24); final[i+1] ^= (p >> 8); 135 p = xormpy(shifted[0], b); 198 136 final[i] ^= (p << 8); 199 final[i+1] ^{=} (p >> 24); 200 137 p = xormpy(shifted[1], b); final[i+1] ^= (p << 16); final[i+2] ^= (p >> 16); 138 2.01 139 p = xormpy(shifted[1], b); 202 140 final[i+1] ^= (p); 203 2.04 p = xormpy(shifted[2], b); 141 final[i+2] ^= (p << 8); final[i+3] ^= (p >> 24); 142 p = xormpy(shifted[2], b); 2.05 143 final[i+1] ^= (p << 24); 206 final[i+2] ^= (p >> 8); 144 207 145 208 p = xormpy(shifted[3], b); 146 p = xormpy(shifted[3], b); 209 final[i+3] ^= (p); final[i+2] ^= (p << 16); 147 210 final[i+3] ^= (p >> 16); 148 211 p = xormpy(shifted[4], b); final[i+3] ^= (p << 24); 149 212 final[i+4] ^= (p >> 8); 150 p = \_xormpy(shifted[4], b); 213 151 final[i+3] ^= (p << 8); 214 final[i+4] ^= (p >> 24); 152 215 p = xormpy(shifted[5], b); final[i+4] ^= (p << 16); final[i+5] ^= (p >> 16); 153 216 154 p = xormpy(shifted[5], b); 217 155 218 final[i+4] ^= (p); 219 p = xormpy(shifted[6], b); 156 final[i+5] ^= (p << 8); final[i+6] ^= (p >> 24); 157 p = xormpy(shifted[6], b); 220 158 final[i+4] ^= (p << 24); 221 final[i+5] ^= (p >> 8); 159 222 160 223 161 224 } ``` # Assembly Output from Compiling Unoptimized C Code Snippet from $GF(2^{163})$ Multi-Precision Binary Polynomial Multiplication | • * | | | | | * | 1.1 | STNDW | .D2T2 | B23:B22,*+B5(4) | ١. | 11201 | <0 16× | |------------|-----------|----------|-----------------|-------------|--------|-----|--------|--------|-----------------|------------|-------|--------| | ,<br>;* | SOFTWAR | E PIPELI | INE INFORMATION | | | | MV | .D212 | | | | <0,16> | | ; * | | | | | | | | | ., | , | | , | | ; *Lo | op source | e line | | : 99 | | | SHRU | .S1 | A8,8,A3 | ; | 125 | <0,17> | | ; *Lo | op openi | ng brace | e source line | : 100 | | 1.1 | SHL | .S2X | A3,16,B22 | ; | 129 | <0,17> | | ; *Lo | op closi | ng brace | e source line | : 224 | | 11 | XOR | .L2 | B8,B4,B4 | ; | 121 | <0,17> | | ; * | Know | n Minimu | ım Trip Count | : | 5 | 1.1 | STNDW | .D2T1 | A5:A4,*-B5(4) | ; | 111 | <0,17> | | ; * | Know | n Maximu | ım Trip Count | : | 5 | | | | | | | | | ; * | Know | n Max Ti | rip Count Facto | r : | 5 | | SHL | .S2 | B21,16,B4 | ; | 112 | <0,18> | | ; *Lo | op Carri | ed Deper | ndency Bound() | : 69 | | 11 | XOR | .L1 | A4,A3,A3 | ; | 125 | <0,18> | | ; * | Unpa | rtitione | ed Resource Bou | nd : | 34 | 11 | XOR | .L2X | B4,A7,B8 | ; | 124 | <0,18> | | ; * | Part: | itioned | Resource Bound | (*): | 40 | 11 | STW | .D2T2 | B4,*+B5(8) | ; | 123 | <0,18> | | ; * | Reso | urce Par | rtition: | | | | | | | | | | | ; * | | | | A-side | B-side | | XOR | .L2X | A6,B4,B4 | ; | 112 | <0,19> | | ; * | .L u | nits | | 0 | 0 | 11 | STW | .D1T2 | B8, *+A18(8) | ; | 124 | <0,19> | | ; * | .S u | nits | | 21 | 26 | | | | | | | | | <i>;</i> * | .D u | | | 40* | 27 | | MV | .L2X | A3,B4 | | | <0,20> | | <i>;</i> * | .M u | nits | | 11 | 17 | 11 | STW | .D2T2 | B4,*B5 | ; | 112 | <0,20> | | ; * | .X c | ross pat | hs | 28 | 21 | | | | | | | | | ; * | .T a | ddress p | paths | 40* | 39 | | XOR | .L2 | B7,B9,B9 | ; | 129 | <0,21> | | | ng read p | | 0 | 0 | | 11 | XOR | .S2X | | | | <0,21> | | | ng write | | 0 | 0 | | 11 | STW | .D2T2 | B4,*+B5(12) | ; | 127 | <0,21> | | | gical op | | | 0 | | | | | | | | | | <i>;</i> * | Addi | tion ops | s (.LSD) | 29 | 34 | | STNDW | .D2T2 | B9:B8,*+B5(12) | ; | 129 | <0,22> | | <i>;</i> * | | d(.L .S | | 11 | 13 | | LDW | .D2T2 | *B16,B4 | ; | 133 | <0,23> | | ; * | Bound | d(.L .S | .D .LS .LSD) | 30 | 29 | | NOP | | 4 | | | | | ; * | | | | | | | EXTU | .S2 | B4,16,24,B7 | ; | 133 | <0,28> | | | | | | | * | | XORMPY | .M2 | B20,B7,B4 | ; | 137 | <0,29> | | | 1: ; PIP | | | | | | | | | | | | | ; * * | | | | | | | LDW | .D1T1 | *+A18(12),A7 | | | <0,30> | | | LDW | .D2T2 | | ; 102 | <0,0> | 11 | XORMPY | .M2 | B17,B7,B8 | ; | 150 | <0,30> | | | NOP | | 4 | | | | | | | | | | | | EXTU | .S2 | B4,24,24,B8 | ; 102 | <0,5> | | LDW | .D1T2 | *A18,B7 | | | <0,31> | | | | | | | | 11 | XORMPY | .M1X | A19,B7,A4 | ; | 148 | <0,31> | | | LDW | .D2T1 | *+SP(32),A21 | ; 127 | | | | | | | | | | | XORMPY | .M2 | B19,B8,B4 | ; 109 | <0,6> | | LDW | .D2T2 | *-B5(4),B4 | | | <0,32> | | | | | | | | 11 | XORMPY | .M1X | A21,B7,A3 | ; | 159 | <0,32> | | | XORMPY | .M2 | B6,B8,B7 | ; 118 | <0,7> | | | | | | | | | | | | | | | | XORMPY | .M1X | A20,B7,A6 | | | <0,33> | | | LDW | .D1T1 | *++A18,A5 | ; 105 | | 11 | SHL | .S2 | B4,8,B9 | ; | 139 | <0,33> | | | LDW | .D2T2 | *B5++,B7 | ; 105 | | | | | | | | | | | XORMPY | .M2 | B18,B8,B21 | ; 111 | <0,8> | | LDW | .D1T1 | *+A18(4),A6 | | | <0,34> | | | | | | | | 1.1 | XORMPY | .M2 | B18,B7,B9 | | | <0,34> | | | LDW | .D2T1 | *+B5(4),A3 | ; 113 | | 1.1 | SHRU | .S2 | B8,24,B21 | ; | 152 | <0,34> | | | LDW | .D1T2 | *+A18(8),B7 | ; 113 | | | | | | | | | | | XORMPY | .M2 | B20,B8,B9 | ; 105 | <0,9> | | LDW | .D1T1 | *+A18(16),A5 | | | <0,35> | | | | | | | | 1.1 | XORMPY | .M2 | B19,B7,B21 | | | <0,35> | | | NOP | | 1 | | | 1.1 | SHRU | .S2 | B4,24,B4 | | | <0,35> | | | | | | | 0 11 | 1.1 | XOR | .L2X | A7,B21,B26 | ; | 152 | <0,35> | | | XORMPY | .M1X | A21,B8,A3 | | <0,11> | | | | | | | 0.00 | | | SHL | .S2 | B7,8,B22 | ; 120 | <0,11> | | LDW | .D1T1 | *+A18(8),A5 | | | <0,36> | | | | | | | 0 10 | | XOR | .L2 | B7,B4,B7 | | | <0,36> | | | LDW | .D2T2 | *+B5(16),B7 | ; 125 | | | SHL | .S2X | A4,16,B22 | ; | 1150 | <0,36> | | | XORMPY | .M1X | A20,B8,A8 | | <0,12> | | VOD | т О | D4 D0 D04 | | 1120. | ZO 27. | | 11 | SHRU | .S2 | B7,24,B23 | ; 118 | <0,12> | 1.1 | XOR | .L2 | | | | <0,37> | | | T DW | D2m1 | * - DE / 10\ 3/ | . 11051 | <0 12· | | SHL | .S1 | A3,24,A5 | | | <0,37> | | 1.1 | LDW | .D2T1 | *+B5(12),A4 | | <0,13> | | SHRU | .S2X | A3,8,B23 | | | <0,37> | | 11 | XORMPY | .M2 | B17,B8,B4 | | <0,13> | | MV | .D2 | B7,B25 | ; | 1139 | <0,37> | | 11 | XOR | .L2 | B9, B7, B9 | | <0,13> | | VOD | T 1 17 | 37 DO1 30 | | 1150: | -0 20 | | | SHRU | .S2 | B21,16,B8 | ; 113 | <0,13> | 1.7 | XOR | .L1X | A7, B21, A3 | | | <0,38> | | | CTM | D2#2 | DO + DE/41 | . 11071 | <0,14> | | XOR | .L2X | B26, A6, B4 | | | <0,38> | | 1.1 | STW | .D2T2 | B9,*-B5(4) | | | | STNDW | .D2T2 | B25:B24,*-B5(4) | ) <b>;</b> | 1139 | <0,38> | | | XOR | .L2 | B7,B23,B8 | | <0,14> | | CLIDIT | C1 V | 24 0 00 | | 11/// | <0 20· | | | XOR | .S2X | A3, B8, B7 | | <0,14> | 1.1 | SHRU | .S1X | B9,8,A3 | | | <0,39> | | | SHRU | .S1X | B4,8,A4 | ; 109 | <0,14> | | XOR | .L2 | B7, B21, B4 | | | <0,39> | | | CIII | 017 | D4 24 34 | | <0 1E | | XOR | .S2X | B4, A5, B24 | | | <0,39> | | 1.1 | SHL | .S1X | B4,24,A4 | | <0,15> | | STW | .D2T1 | A3,*+B5(12) | | | <0,39> | | | XOR | .L1 | A5, A4, A6 | | <0,15> | | STW | .D1T2 | B4,*+A18(12) | ; | 115/ | <0,39> | | | MV | .L2 | B8, B23 | | <0,15> | | CIII | CO | DO 24 D7 | | 11421 | <0 40· | | | XOR | .S2 | B7, B22, B22 | | <0,15> | 1.1 | SHL | .S2 | B9,24,B7 | | | <0,40> | | 11 | STW | .D2T2 | B7,*+B5(4) | ; 110 | <0,15> | | SHRU | .S1 | A4,16,A4 | | | <0,40> | | | CIII | C1 | 70 24 77 | . 11041 | 20 1C | | XOR | .L2X | A5,B23,B25 | | | <0,40> | | 1.1 | SHL | .S1 | A8,24,A7 | | <0,16> | | STW | .D1T2 | B4,*A18 | | | <0,40> | | | SHRU | .S2X | A3,16,B9 | | <0,16> | 1.1 | SHL | .S2 | B8,8,B7 | | | <0,41> | | 11 | XOR | .L1X | B9, A4, A4 | , 1 1 1 1 | <0,16> | | XOR | .L1 | A5,A4,A3 | ; | 148 | <0,41> | | | | | | | | | | | | | | | ``` .S1 A3.*+A18(16) YOR A6,A3,A4 ; |144| <0,41> STW .D1T1 ; |190| <0,68> XOR .L2 B4,B7,B4 ; |143| <0,41> LDW .D2T2 *B16++,B4 ; |194| <0,69> B25:B24,*+A18(12);|159| <0,41> STNDW D1T2 NOP 4 SHRU .S2 B4,24,B4 ; |194| <0,74> ; |150| <0.42> ΜV . T.1 A3.A5 NOP \Box STW D2T1 A4,*+B5(4) ; |146| <0,42> .M2 XOR .S1X A4,B22,A4 ; |150| <0,42> XORMPY B17, B4, B21 ; |211| <0,76> \Box STW D1T2 B4,*A18 ; |143| <0,42> | \cdot | XORMPY .MlX A21.B4.A16 ; |219| <0,76> ; |151| <0,43> .M2 ; |200| <0,77> XOR . T.1 X A3.B7.A3 XORMPY B19.B4.B22 A5:A4,*+A18(4); |150| <0,43> STNDW .D1T1 XORMPY .MlX A20.B4.A5 ; |217| <0,77> \prod STW D1T1 A3,*+A18(8) ; |151| <0,44> T.DW D1T2 *+A18(4),B7 ; |202| <0,78> LDW .D2T2 *B16,B4 ; |163| <0,45> XORMPY .M2 B18, B4, B9 ; |206| <0,78> NOD 1 EXTU .S2 B4,8,24,B8 ; |163| <0,50> LDW .D1T1 *+A18(8),A3 ; |202| <0,79> .M2 VORMOV . M2 B17,B8,B7 ; |181| <0,51> | \cdot | XORMPY B20.B4.B8 ; |198| <0,79> NOP XORMPY . M1 X A19.B8.A3 ; |179| <0,53> T.DW .D1T1 *A18.A6 ; |198| <0.80> SHL .S1 A16,8,A9 ; |220| <0,80> 11 LDW .D1T1 *+A18(12),A5 ; |183| <0,54> XORMPY .M2 B20,B8,B4 ; |168| <0,54> T.DW .D1T1 *-A18(4),A7 ; |200| <0,81> XORMPY .M1X A20, B8, A4 ; |187| <0,54> | \cdot | XORMPY .M1X A19,B4,A3 ; |209| <0,81> SHL . S1 A5,16,A21 ; |219| <0.81> 11 T.DW .D1T1 *+A18(4),A6 ; |172| <0,55> XORMPY B19,B8,B9 ; |170| <0,55> .D1T1 *+A18(12),A22 \prod .M2 LDW ; |213| <0,82> | \cdot | SHRU .S2 B9,24,B4 ; |206| <0,82> LDW .D1T1 *A18,A17 ; |168| <0,56> SHL .S1X B22,16,A8 ; |201| <0,82> 1.1 \Box B7,16,A8 ; |182| <0,56> SHL .S1X T.DW .D2T1 *+B5(20),A6 ; |221| <0,83> .S2 T.DW D1T1 *-A18(4),A17 ; |170| <0,57> | \cdot | SHRU B22,16,B22 ; |202| <0,83> XORMPY .M2 B18, B8, B21 ; |175| <0,57> SHRU .S1X B21,8,A17 ; |213| <0,83> 1.1 A3,24,A9 ; |181| <0,57> SHL .S1 ; |213| <0,84> LDW .D1T1 *+A18(16),A6 I.DW .D1T1 *+A18(16),A4 ; |183| <0,58> SHL B9,8,B9 ; |208| <0,84> \Box .S2 SHRU .s1 A4,24,A7 ; |187| <0,58> XOR .L2X A3,B4,B7 ; |206| <0,84> \prod XOR .D2 B7, B22, B4 ; |202| <0,84> .D1T2 T.DW *+A18(8),B7 ; |172| <0,59> SHRU .S1X B8,8,A4 ; |198| <0,84> 11 .S2 ; |172| <0,59> SHRU B9,24,B21 ; |200| <0,85> \prod SHRU .S1X B4,16,A16 ; |168| <0,59> SHL .S1X B8,24,A6 XOR .L1 A6, A4, A4 ; |198| <0,85> . L2X A6,B21,B7 ; |172| <0,60> MV .L2 В7,В9 ; |208| <0,85> XOR .S2 \Box B7,16,A6 ; |183| <0,60> XOR B4,B9,B8 ; |208| <0,85> SHRU .S1X STW .D2T2 B4,*+B5(4) ; |204| <0,85> 1.1 B9,8,B7 ; |170| <0,61> II SHL .S1X B4,16,A16 [ B0] SUB B0,1,B0 ; |99| <0,86> ; |183| <0,61> B21,24,B8 SHL .S2 ; |212| <0,86> II XOR .L1 A5,A6,A6 \Box II XOR .D1 A17, A16, A5 ; |168| <0,61> XOR .L1 A7, A6, A6 ; |200| <0,86> ; |175| <0,61> | \cdot | XOR .L2 B7, B21, B4 |\cdot| STNDW .D1T2 B9:B8, *+A18(4) ; |208| <0,86> \prod .D2T2 B7, *+B5(4) ; |174| <0,61> MV A4, A7 ; |200| <0,86> STW .S1 XORMPY .MlX A21,B8,A4 ; |190| <0,62> XOR .L2X B7, A3, B4 ; |209| <0,87> .L1 A17, A16, A16 ; |170| <0,62> B0] .S2 $C$L2 ; |99| <0,87> XOR 11 [ ; |189| <0,62> A16,24,A7 A4,8,B8 SHRU ; |221| <0,87> | \cdot | SHL .S2X |\cdot| .S1 ; |213| <0,87> STW D1T2 B4, *+A18(4) ; |177| <0,62> XOR .L1 A22, A17, A3 ; |185| <0,62> .D2T1 A6, *+B5(12) STNDW A7:A6, *-A18(4) ; |200| <0,87> | \cdot | STW |\cdot| .D1T1 MV .S1 A5, A17 ; |170| <0,62> SHRU .S1 A5,16,A7 XOR .L1 A4, A7, A3 XOR .L1 A7, A6, A5 ; |221| <0,88> .S2X ; |179| <0,63> .L2 B4, B8, B4 ; |212| <0,88> \Box SHRU A3.8.B9 XOR B4, *+A18(8) XOR .S1X A5, B7, A4 ; |171| <0,63> | \cdot | STW .D1T2 ; |211| <0,88> | \cdot | STNDW A17:A16, *-A18(4) ; |170| II .D1T1 <0,63> XOR ; |221| <0,89> 11 STW .D2T1 A5, *+B5(20) ; |189| <0,64> A4, A8, A5 ; |201| <0,89> .L1 |\cdot| XOR .S1 A3,B4 \prod XOR .L2 B7, B9, B7 ; |179| <0,64> MV .L2X ; |213| <0,89> A6,B8,A4 ; |189| <0,64> XOR .S1X | \cdot | STW .D1T2 B4,*+A18(8) ; |212| <0,89> \prod ; |171| <0,64> | \cdot | STW .D1T1 A4,*A18 . T.1 A4,A7 ; |219| <0,90> ΜV A3, A21, A6 ; |219| <0,90> .L2X B4, A9, B8 XOR .S1 .D1T1 .S2 ΜV B7.B9 ; |181| <0,65> STW A5,*A18 ; |201| <0,90> A5:A4, *+A18(12); |189| <0,65> B4, *+B5(12) ; |215| <0,90> STNDW .D1T1 STW .D2T2 11 .L2X ; |182| <0,66> XOR A4,A9,A3 ; |220| <0,91> | \cdot | STNDW .D1T2 B9:B8, *+A18(4) ; |181| <0,66> 11 STNDW .D1T1 A7:A6,*+A18(12); |219| <0,91> ; |190| <0,67> STW A3,*+A18(16) ; |220| <0,92> .D1T2 B4, *+A18(8) ; |182| <0,67> II STW NOP ``` # Optimized C Code Snippet from $GF(2^{163})$ Multi-Precision Binary Polynomial Multiplication ``` 225 for(i=0; i<5; i++) 226 { 279 227 280 p = xormpy(shifted[0], b); 281 p1 = _xormpy(shifted[1], b); 228 b = 0xFF&opb[i]; 282 p2 = xormpy(shifted[2], b); 229 230 p = _xormpy(shifted[0], b); 231 p1 = _xormpy(shifted[1], b); 232 p2 = _xormpy(shifted[1], b); 233 p3 = _xormpy(shifted[2], b); 234 p4 = _xormpy(shifted[3], b); 235 p5 = _xormpy(shifted[3], b); 236 p6 = _xormpy(shifted[4], b); 237 238 final[i] ^= p; 239 final[i] ^= p; 239 final[i] ^= (p1 << 24); 240 final[i+1] ^= (p1 << 24); 241 final[i+1] ^= (p1 >> 8); 242 final[i+2] ^= (p2 << 16); 243 final[i+2] ^= (p2 << 16); 244 final[i+2] ^= (p3 << 8); 245 final[i+3] ^= (p3 << 8); 246 final[i+3] ^= (p3 << 8); 247 final[i+3] ^= (p3 << 24); 248 final[i+3] ^= (p3 >> 24); 249 final[i+3] ^= (p5 << 8); 240 final[i+4] ^= (p5 >> 8); 241 final[i+4] ^= (p5 >> 8); 242 final[i+4] ^= (p5 >> 8); 243 final[i+4] ^= (p5 >> 8); 244 final[i+4] ^= (p5 >> 8); 245 final[i+4] ^= (p5 >> 8); 246 final[i+4] ^= (p5 >> 8); 247 final[i+4] ^= (p5 <> 8); 248 final[i+4] ^= (p6 << 16); 301 283 p3 = xormpy(shifted[3], b); 230 p = xormpy(shifted[0], b); 248 final[i+4] ^= (p6 << 16); 249 final[i+5] ^= (p6 >> 16); 302 250 303 b = (0xFF000000&opb[i])>>24; 250 251 252 b = (0xFF00&opb[i]) >> 8; 253 254 p = _xormpy(shifted[0], b); 255 p1 = _xormpy(shifted[1], b); 256 p2 = _xormpy(shifted[2], b); 257 p3 = _xormpy(shifted[3], b); 258 p4 = _xormpy(shifted[4], b); 260 p6 = _xormpy(shifted[6], b); 261 262 final[i] ^= (p << 8); 263 final[i+1] ^= (p >> 24); 264 final[i+1] ^= (p1); 265 final[i+1] ^= (p2 >> 8); 266 final[i+2] ^= (p2 >> 8); 267 final[i+2] ^= (p3 >> 16); 268 final[i+3] ^= (p3 >> 16); 269 final[i+3] ^= (p4 << 8); 260 final[i+4] ^= (p4 >> 24); 261 322 final[i+4] ^= (p5); 262 final[i+4] ^= (p5); 263 final[i+4] ^= (p5); 264 final[i+3] ^= (p5 >> 16); 265 final[i+3] ^= (p6 << 24); 266 final[i+3] ^= (p6 << 24); 267 final[i+4] ^= (p6 << 24); 268 final[i+4] ^= (p6 << 24); 270 final[i+4] ^= (p6 << 24); 271 final[i+4] ^= (p5); 272 final[i+4] ^= (p6 << 24); 273 final[i+4] ^= (p6 << 24); 274 final[i+4] ^= (p6 << 24); 275 final[i+4] ^= (p6 << 24); 276 final[i+4] ^= (p6 << 24); 277 final[i+4] ^= (p6 << 24); 278 final[i+4] ^= (p6 << 24); 279 final[i+4] ^= (p6 << 24); 270 final[i+4] ^= (p6 << 24); 271 final[i+4] ^= (p6 << 24); 272 final[i+4] ^= (p6 << 24); 273 final[i+6] ^= (p6 >> 24); 274 final[i+4] ^= (p6 << 24); 275 final[i+4] ^= (p6 << 24); 276 final[i+4] ^= (p6 << 24); 277 final[i+4] ^= (p6 << 24); 278 final[i+6] ^= (p6 >> 24); 279 final[i+4] ^= (p6 << 24); 280 final[i+6] ^= (p6 >> 24); 291 final[i+6] ^= (p6 >> 24); 292 final[i+6] ^= (p6 >> 24); 293 final[i+6] ^= (p6 >> 24); 294 final[i+6] ^= (p6 >> 24); 295 final[i+6] ^= (p6 >> 24); 296 final[i+6] ^= (p6 >> 24); 297 final[i+6] ^= (p6 >> 24); 298 final[i+6] ^= (p6 >> 24); 299 final[i+6] ^= (p6 >> 24); 290 final[i+6] ^= (p6 >> 24); 291 final[i+6] ^= (p6 >> 24); 292 final[i+6] ^= (p6 >> 24); 293 final[i+6] ^= (p6 >> 24); 294 final[i+6] ^= (p6 >> 24); 295 final[i+6] ^= (p6 >> 24); 296 final[i+6] ^= (p6 >> 24); 297 final[i+6] ^= (p6 >> 24); 298 final[i+6] ^= (p6 >> 24); 299 final[i+6] ^= (p6 >> 24); 290 final[i+6] ^= (p6 >> 24); 291 final[i+6] ^= (p6 >> 24); 291 final[i+6] ^= (p6 >> 24); 291 final[i+6] ^= (p6 >> 24); 291 fin 251 304 274 327 275 328 } 276 277 b = (0xFF0000&opb[i]) >> 16; ``` # Assembly Output from Compiling Optimized C Code Snippet from $GF(2^{163})$ Multi-Precision Binary Polynomial Multiplication | ; * | | | | | * | | XORMPY | .M2X | B20, A21, B16 | ; | 158 | <0,17> | |------------|-----------|----------|-----------------|----------|----------|-----|-----------|----------|---------------|--------------|-----------|----------| | ; * | SOFTWAR | E PIPELI | NE INFORMATION | | | 11 | XOR | .L1X | A4,B7,A6 | ; | 114 | <0,17> | | ; * | | | | | | 11 | XOR | .L2 | B16, B5, B16 | ; | 118 | <0,17> | | ; * | Loop | source | line | : | 99 | 11 | LDW | .D2T2 | *+B8(12),B6 | ; | 147 | <0,17> | | ; * | Loop | opening | brace source | line : | 100 | Ιİ | XOR | .S2 | B9,B17,B17 | ; | 118 | <0,17> | | ; * | | | brace source | | 202 | | | | , , | • | | , | | ; * | | | m Trip Count | | 5 | | XORMPY | .M1 | A20, A24, A4 | ; | 11371 | <0,18> | | ; * | | | ım Trip Count | | 5 | 11 | XOR | .L1X | B6,A3,A3 | | | <0,18> | | ; * | | | ip Count Facto | | 5 | ii | STNDW | .D2T2 | B17:B16,*+B8( | | | | | ; * | | | d Dependency Bo | | | 1.1 | DINDW | . DZ IZ | BI7.BI0, 1B0( | - / <b>/</b> | 1110 | 1 (0,10) | | ; * | | | ed Resource Bou | | 24 | | XORMPY | .M1 | A19,A24,A5 | | 11301 | <0,19> | | ; * | | | Resource Bound | | 32 | 11 | XOR | .L1 | A5, A6, A5 | | | <0,19> | | ; * | | | rtition: | | 32 | ii | XOR | .S1 | A4, A3, A4 | | | <0,19> | | ; * | 11630 | urce rar | | A-side | B-side | ii | LDW | .D2T1 | *+B8(16),A7 | | | <0,19> | | | T | 01+0 | | 0 | 0 | 1.1 | LDW | . DZ I I | THO (10), A | , | 114/1 | <0,13/ | | ; * | .L u | | | | | | VODMDV | 141 | 317 304 30 | | 11221 | -0 00 | | ; * | .S u | | | 29 | 18 | | XORMPY | .M1 | A17, A24, A3 | | | <0,20> | | <i>;</i> * | .D u | | | 8 | 32* | 11 | SHL | .S1 | A3,24,A6 | | | <0,20> | | <i>;</i> * | .M u | | , | 20 | 8 | | STNDW | .D1T1 | A5:A4,*-A9(4) | ; | 114 | <0,20> | | ; * | | ross pat | | 13 | 28 | | | | | | | | | ; * | | ddress p | | 20 | 32* | | XORMPY | .M2X | B21,A21,B9 | | | <0,21> | | <i>;</i> * | | read pa | | 0 | 0 | | SHRU | .S2 | B16,16,B17 | | | <0,21> | | ; * | | write p | | 0 | 0 | | SHRU | .S1X | B4,24,A5 | | | <0,21> | | ; * | | cal ops | | 0 | 0 | | LDW | .D2T2 | *B8,B7 | ; | 137 | <0,21> | | ; * | | | s (.LSD) | 23 | 27 | | | | | | | | | ; * | Bound | d(.L .S | .LS) | 15 | 9 | | XORMPY | .M2X | B21,A24,B6 | ; | 137 | <0,22> | | ; * | Bound | d(.L .S | .D .LS .LSD) | 20 | 26 | | SHL | .S2 | B4,8,B5 | ; | 142 | <0,22> | | ; * | | | | | | | XOR | .L1 | A6,A5,A6 | ; | 147 | <0,22> | | ; * | | | | | * | | LDW | .D2T2 | *-B8(4),B18 | ; | 137 | <0,22> | | \$C\$L | 3:; PIPE | D LOOP P | PROLOG | | | | | | | | | | | ; * * | | | | | * | | XORMPY | .M1 | A18, A24, A3 | ; | 131 | <0,23> | | | XORMPY | .M1 | A8, A7, A6 | ; 110 | <0,7> | 11 | SHRU | .S1 | A3,8,A6 | ; | 147 | <0,23> | | | | | | | | Ιİ | SHRU | .S2X | A4,24,B4 | ; | 137 | <0,23> | | | LDW | .D1T1 | *++A9,A5 | ; 114 | <0,8> | 11 | LDW | .D2T2 | *+B8(4),B18 | | | <0,23> | | 11 | EXTU | .S1 | A22,8,24,A21 | | | ii | XOR | .L1X | B6, A6, A23 | | | <0,23> | | ii | XORMPY | .M1 | A18,A7,A4 | ; 107 | | | | | , , | , | 1 | , | | | 1101411 1 | • | 1110/11//111 | , 12071 | 10,07 | | SHL | .S2X | A5,24,B6 | | 11371 | <0,24> | | | LDW | .D1T1 | *+A9(16),A27 | • 11231 | <0.95 | 11 | LDW | .D2T2 | *+B8(8),B6 | | | <0,24> | | 11 | EXTU | .S1 | A22,16,24,A24 | | | ii | XOR | .L1 | A7, A6, A7 | | | <0,24> | | 1.1 | LMIO | .01 | 1122,10,21,1121 | , 11301 | (0,3) | ii | XOR | .s1 | A3, A23, A6 | | | <0,24> | | | LDW | .D2T2 | *B8++,B6 | . 111/1 | <0,10> | 1.1 | AOI | .51 | AJ, AZJ, AO | , | 114/1 | (0,24) | | 11 | LDW | .D1T1 | *+A9(12),A23 | | | | XORMPY | .M1 | A20, A21, A4 | | 11631 | <0,25> | | | XORMPY | .M1 | A19, A7, A6 | | <0,10> | 1.1 | SHRU | .S2X | A5,8,B19 | | | <0,25> | | | | | | | | 11 | | | | | | | | 11 | XORMPY | .M2X | B21,A7,B16 | | <0,10> | 11 | XOR | .L2 | B6, B4, B4 | | | <0,25> | | | SHL | .S1 | A4,24,A3 | ; 118 | <0,10> | 11 | STNDW | .D2T1 | A7:A6,*+B8(12 | ); | 14/ | <0,25> | | | | | | | 0 11 | | | | | | | 0.06 | | | LDW | .D2T2 | *+B8(8),B5 | | <0,11> | | XORMPY | .M1 | A17, A21, A3 | | | <0,26> | | 11 | XORMPY | .M2X | B20,A24,B4 | | <0,11> | 11 | SHL | .S2X | A4,8,B7 | | | <0,26> | | 11 | SHRU | .S1 | A4,8,A23 | ; [123] | <0,11> | | XOR | .L2 | B7,B4,B4 | | | <0,26> | | | | | | | | | LDW | .D2T2 | *+B8(12),B4 | ; | 172 | <0,26> | | | LDW | .D2T2 | *+B8(4),B16 | | <0,12> | | | | | | | | | | XORMPY | .M2X | B20,A7,B9 | | <0,12> | | XORMPY | .M1 | A18, A21, A5 | | | <0,27> | | | SHL | .S1 | A6,16,A25 | ; 123 | <0,12> | | SHL | .S1 | A3,16,A5 | | | <0,27> | | | | | | | | 1.1 | XOR | .L2 | B18,B7,B6 | | | <0,27> | | | SHRU | .S1 | | ; 123 | | | LDW | .D2T2 | *+B8(16),B5 | | | <0,27> | | 11 | SHRU | .S2X | A4,24,B4 | ; 118 | <0,13> | 11 | XOR | .S2 | B6,B4,B7 | ; | 137 | <0,27> | | | | | | | | | | | | | | | | | XOR | .L1 | A25, A23, A4 | ; 123 | <0,14> | | SHRU | .S1 | A22,24,A23 | ; | 180 | <0,28> | | | SHL | .S1 | A4,8,A3 | ; 118 | <0,14> | 11 | SHRU | .S2X | A3,16,B4 | ; | 142 | <0,28> | | 11 | SHRU | .S2 | B16,8,B7 | ; 114 | <0,14> | 11 | STNDW | .D2T2 | B7:B6,*-B8(4) | ; | 137 | <0,28> | | 11 | XOR | .L2X | A3,B4,B18 | ; 118 | <0,14> | | | | | | | | | | | | | | | | SHRU | .S1 | A4,16,A7 | ; | 163 | <0,29> | | | XORMPY | .M1 | A20,A7,A4 | ; 114 | <0,15> | 11 | XOR | .L2 | B5,B4,B5 | | | <0,29> | | 11 | SHL | .S1 | A6,16,A4 | | <0,15> | ii | XOR | .S2X | A5,B19,B4 | | | <0,29> | | ii | XOR | .L1 | A27, A26, A7 | | <0,15> | ii | LDW | .D2T2 | *B8,B5 | | | <0,29> | | ii | SHRU | .S2X | A6,16,B17 | | <0,15> | | | | • | • | | , . | | ii | XOR | .D1 | A23, A4, A6 | | <0,15> | | XORMPY | .M1 | A8,A21,A3 | ; | 160 | <0,30> | | | | | | , | | 11 | SHL | .S1 | A3,8,A6 | | | <0,30> | | | XORMPY | .M1 | A8, A24, A3 | ; 134 | <0,16> | ii | XOR | .L2 | B6,B5,B7 | | | <0,30> | | 11 | SHL | .S1X | B16,24,A3 | | <0,16> | ii | XOR | .S2 | B18, B4, B6 | | | <0,30> | | ii | XOR | .L2X | A3,B17,B5 | | <0,16> | ii | LDW | .D2T2 | *-B8(4),B17 | | | <0,30> | | ii | XOR | .S2 | B5,B18,B17 | | <0,16> | | | | \ -/ / | , | 5 0 1 | , | | ii | STNDW | .D1T1 | A7:A6,*+A9(12 | | | | XORMPY | .M1 | A20,A23,A4 | , | 11881 | <0,31> | | 1.1 | O TIND W | | 117.110, 187(12 | ,, 11201 | \U, ± U/ | 11 | XORMPY | .M2X | B21, A23, B6 | | | <0,31> | | | | | | | | 1.1 | 71 OTHE 1 | •1161 | 221,1123,100 | ′ | 1 1 0 0 1 | \U, J±/ | | 11 | SHL | .S2 | B16,16,B7 | | | <0,31> | 1.1 | LDW | .D2T2 | *+B8(8),B6 | ; | 192 | <0,40> | |-----|--------------|---------------|---------------------------|---|-------|--------|------|--------------|--------------|----------------------------|---|-----------|---------| | 11 | SHL<br>STNDW | .S1X<br>.D2T2 | B9,8,A22<br>B7:B6,*+B8(4) | | | <0,31> | | T DW | .D1T1 | *+A9(20),A3 | | 11001 | <0,41> | | 11 | SINDW | .DZ1Z | B/:B0, ^+B8(4) | , | 142 | <0,31> | 1.1 | LDW<br>SHL | .S1X | | | | <0,41> | | | XORMPY | .M1 | A19,A21,A3 | | 11561 | <0,32> | 11 | LDW | .D2T2 | | | | <0,41> | | 11 | SHRU | .MI | A3,24,A3 | | | <0,32> | 1.1 | LDW | •DZ1Z | | ′ | 11001 | VU, 41/ | | | SHL | .S2X | A5,24,A5<br>A5,24,B16 | | | <0,32> | [A0] | CIID | .L1 | A0,1,A0 | | 99 < | 0 125 | | | XOR | .52A | A6,B17,A6 | | | <0,32> | [AU] | SHRU | .S1X | B9,8,A22 | | | <0,42> | | | LDW | .D2T2 | *+B8(4),B5 | | | <0,32> | 11 | LDW | .D2T2 | *B8,B9 | | | <0,42> | | 1 1 | шом | . DZ 1 Z | 100(4),03 | · | | , | 11 | DDW | . DZ 1 Z | B0 <b>,</b> B3 | ′ | 11001 | (0,42) | | | XORMPY | .M1 | A17,A23,A6 | | | <0,33> | | XOR | .L1 | A8,A7,A4 | | | <0,43> | | | XORMPY | .M2X | B20,A23,B9 | | | <0,33> | | SHRU | .S1 | | | | <0,43> | | | SHRU | .S2 | B9,24,B9 | | | <0,33> | | XORMPY | .M1 | | | | <0,43> | | 11 | XOR | .L1 | A22,A7,A21 | | | <0,33> | | SHL | .S2X | | | | <0,43> | | 11 | LDW | .D2T2 | *+B8(8),B7 | | | <0,33> | | LDW | .D2T2 | | | | <0,43> | | 11 | XOR | .S1X | B5,A3,A7 | ; | 172 | <0,33> | 11 | LDW | .D1T1 | *A16++,A22 | ; | 102 | <1,0> | | | XORMPY | .M1 | A19,A23,A5 | | | <0,34> | | XOR | .L1 | | | | <0,44> | | | SHL | .S2X | A4,16,B4 | | | <0,34> | | XOR | .D1X | | | | <0,44> | | | XOR | .L1X | B4,A6,A6 | | | <0,34> | [A | | .S1 | \$C\$L4 | , | 99 < | . , | | 11 | XOR | .S1 | A3,A7,A7 | ; | 172 | <0,34> | 11 | SHRU | .S2 | B6,16,B7 | ; | 192 | <0,44> | | | STNDW | .D2T1 | A7:A6,*+B8(12) | | | | | SHRU | .S1 | | | | <0,45> | | | XOR | .L2 | | | | <0,35> | | SHL | .S2 | | | | <0,45> | | | XOR | .S2X | B5,A21,B5 | ; | 163 | <0,35> | | XOR | .L2X | | | | <0,45> | | | | | | | | | | XOR | .L1 | A3,A21,A7 | | | <0,45> | | | XORMPY | .M1 | A8,A23,A6 | | | <0,36> | 11 | XOR | .D1X | | | | <0,45> | | | SHRU | .S1 | A4,8,A21 | | | <0,36> | | LDW | .D2T1 | *+SP(32),A8 | ; | 110 | <1,2> | | 11 | SHRU | .S2X | A5,8,B16 | | | <0,36> | | | | | | | | | | XOR | .L2 | B16,B9,B4 | | | <0,36> | | XOR | .L1 | | | | <0,46> | | | STNDW | .D2T2 | B5:B4,*-B8(4) | ; | 163 | <0,36> | | XOR | .L2X | | | | <0,46> | | | | | | | | | 11 | STNDW | .D2T1 | A5:A4,*+B8(12) | ; | 197 | <0,46> | | | SHL | .S1 | A6,16,A25 | | | <0,37> | | | | | | | | | 11 | XOR | .L2 | B7, B16, B5 | | | <0,37> | | STW | .D1T1 | | | | <0,47> | | 11 | XOR | .S2 | B5, B4, B4 | | | <0,37> | 11 | XOR | .L2 | | | | <0,47> | | | LDW | .D2T2 | *+B8(12),B4 | ; | 119/1 | <0,37> | 11 | XOR | .S2 | | | | <0,47> | | | 0.117 | 0.1 | 35 0 304 | | 11001 | <0,38> | 11 | XOR | .D2X | B9,A7,B5 | ; | 11881 | <0,47> | | | SHL | .S1<br>.L2 | A5,8,A24 | | | <0,38> | | WOD | T 0 | DC D1C D4 | | 11001 | .0 40. | | 11 | XOR<br>XOR | . LZ<br>.S2X | B7, B5, B5 | | | <0,38> | | XOR<br>STNDW | .L2<br>.D2T2 | B6,B16,B4<br>B5:B4,*-B8(4) | | | <0,48> | | 11 | | | A3, B4, B4 | | | , | 11 | | | | | | | | 1.1 | LDW | .D2T2 | *+B8(16),B7 | , | 119/ | <0,38> | | XOR<br>EXTU | .S2X<br>.S1 | | | 11061 | <0,48> | | | SHRU | .S1 | A6,16,A7 | | 11971 | <0,39> | 1.1 | PVIO | .01 | DCC, 47, 47, A1 | ′ | 11001 | \1,J/ | | 11 | STNDW | .D2T2 | B5:B4,*+B8(4) | | | , | | STNDW | .D2T2 | B5:B4,*+B8(4) | | 11921 | <0.49 | | 1.1 | OTINDM | . D L I L | DO.Da' 100(4) | , | 11001 | 10,000 | 1.1 | XORMPY | .DZ1Z | | | 11091 | | | | SHL | .S1 | A6,8,A8 | ; | 197 | <0,40> | 1.1 | MOININ I | • 111 | MI//A//AT | ′ | 1 + 0 > 1 | 11,0/ | | | | | | | | | | | | | | | | # **CHAPTER 5** ## RESULTS In this chapter, we discuss how we compiled our implementation. We compare our implementation to a standard library and other implementations on an ARM, and we analyze our results. #### 5.1 Tools and Environment We wrote our implementations in C and was able to execute the XORMPY instruction by using the "\_xormpy()" function call. We compiled these implementations using the TI C6x C Compiler using the -O3 option. It was run on the C64x+ at 360MHz on an OMAP3530. We compared our results by running modular multiplication using liberypt on the ARM Cortex A8 on the same OMAP3530. liberypt is the cryptographic library which OpenSSL uses. We compiled liberypt using GCC with the -O3 option. #### 5.2 Execution Times on the OMAP3530 Table III shows our results. We can perform finite field multiplication faster using the DSP despite the DSP running at a slower clock rate. Using our implementation, we can perform binary field multiplication over seven times faster on the DSP when compared to using liberypt on the ARM for $GF(2^{283})$ and smaller. We can perform $GF(2^{233})$ multiplications on the DSP over six times faster than prime field GF(p224) multiplications on the faster ARM. There is a drop off in speed up as the field increases in size, with the largest drop off at $GF(2^{571})$ . Figure 8 is a graph of the execution times of binary field multiplication on the OMAP3530. These comparisons are relevant because these are two TABLE III. MODULAR MULTIPLICATION ON THE TI OMAP3530 | NIST | | M Cortex A8 | 360MHz TI C64x+ DSP<br>Our Implementation | | | | | | |----------------|-------------------|-------------------|-------------------------------------------|--------|----------|--|--|--| | Field | Time (µs) | rypt<br>Cycles | Time (µs) | Cycles | Speed Up | | | | | $GF(2^{163})$ | 6.62 | 3310 | 0.829 | 298 | 7.99 | | | | | $GF(2^{233})$ | 9.98 | 4990 | 1.31 | 472 | 7.62 | | | | | $GF(2^{283})$ | 15.44 | 7720 | 2.20 | 792 | 7.02 | | | | | $GF(2^{409})$ | 27.47 | 13735 | 4.43 | 1595 | 6.20 | | | | | $GF(2^{571})$ | 45.53 | 22765 | 11.60 | 4175 | 3.93 | | | | | $GF(p192)^{a}$ | 6.80 <sup>b</sup> | 3400 <sup>b</sup> | - | - | - | | | | | $GF(p224)^{a}$ | 8.79 <sup>b</sup> | 4395 <sup>b</sup> | - | _ | - | | | | | $GF(p256)^{a}$ | 9.28 <sup>b</sup> | 4640 <sup>b</sup> | - | _ | _ | | | | a. Montgomery Multiplication b. Does not include overhead needed for Montgomery Multiplication # **Execution Time of Binary Field Multiplication on the TI OMAP3530** **Figure 8.** Execution times of multiplications in all NIST Binary Fields on the ARM Cortex A8 using liberypt and C64x+ using our implementation. processors implemented on the same SoC. So our results have factored out differences in technology. Figure 9 depicts a graph of multiplication execution cycles per bit vs. bits in the NIST Field. It shows that as the number of bits in the NIST Field increase, the cycles per bit needed to complete the multiplication also increases. Figure 10 shows the cycles per bit vs. bits<sup>2</sup> in the NIST Field. This is close to a horizontal line. So the multiplication in our implementation is $O(n^2)$ , which is a result of us multiplying each byte of the one operand by each subword of the other operand. Figure 11 shows that the liberypt implementation ### **Execution Cycles Per Bit for NIST Binary Fields** **Figure 9.** Multiplication execution cycles per bit versus the number of bits in the NIST Binary Field on the ARM Cortex A8 (libcrypt) and TI C64x+ (our implementation). ## Cycles on C64x+ Per Bit<sup>2</sup> for NIST Binary Fields **Figure 10.** Multiplication execution cycles per $bit^2$ versus the number of bits in the NIST Binary Field using our implementation on the TI C64x+. running on the ARM Cortex A8 decreases in cycles per bit<sup>2</sup> in a NIST Binary Field. This shows that the libcrypt implementation uses a form of Karatsuba Multiplication, which is $O(n^{\log_2 3})$ . Figure 12 shows the number of cycles for the libcrypt multiplication vs. the # **Execution Cycles on ARM Cortex A8 Per Bit<sup>2</sup> for NIST Binary Fields** **Figure 11.** Multiplication execution cycles per bit<sup>2</sup> versus the number of bits in the NIST Binary Field using our implementation on the ARM Cortex A8. # Execution Cycles on ARM Cortex A8 Per Bit<sup>log</sup><sub>2</sub><sup>3</sup> for NIST Binary Fields **Figure 12.** Multiplication Execution Cycles per bit raised to the log<sub>2</sub>3 power versus the number of bits in the NIST Binary Field using liberypt on the ARM Cortex A8. number of bits raised to the power of log<sub>2</sub> 3. This displays a horizontal line, and shows that liberypt does use a form of Karatsuba Multiplication. This gives us the reason why our speed up decreases as the number of bits in the field increase, i.e. the liberypt implementation uses Karatsuba Multiplication, while our implementation does not. #### 5.3 Cycle Count Compared To Other Implementations Table IV shows our results compared to results shown in other papers. In [14], an ARM instruction extension, MULGF, was implemented. MULGF performs the same operations as XORMPY, except that it takes two 32-bit operands and produces a 63-bit result. Figure 13 shows the execution cycles of our implementation compared with the implementations in [14] with the MULGF instruction implemented. Even though the MULGF is more powerful than XORMPY, i.e. it can produce a 63-bit result as opposed to a 32-bit result, the C64x+ still outperforms the ARM by big margin. This shows that because of the highly parallelized architecture of the C64x+, it can perform modular multiplication much faster than the ARM Cortex A8. TABLE IV. CYCLES FOR MODULAR MULTIPLICATION | Processor | Implementation | NIST<br>Field | Cycles | |-------------------|-----------------------|---------------------------|--------| | TV C(4 | | $GF(2^{163})$ | 298 | | TI C64x+<br>DSP | Our Implementation | $GF(2^{233})$ | 472 | | DSI | | $GF(2^{283})$ | 792 | | | Montgomery [15] | <i>GF</i> ( <i>p</i> 256) | 3384 | | ADM | Pencil and Paper [14] | $GF(2^{163})$ | ~8000 | | ARM | Montgomery [14] | $GF(2^{163})$ | ~24000 | | | Karatsuba [14] | $GF(2^{163})$ | ~5500 | | ARM + | Pencil and Paper [14] | $GF(2^{163})$ | ~2400 | | MULGF instruction | Montgomery [14] | $GF(2^{163})$ | ~7000 | | extension | Karatsuba [14] | $GF(2^{163})$ | ~2500 | ## Cycle Counts of GF(2<sup>163</sup>) Multiplication **Figure 13.** Execution cycles of multiplication implementations on an ARM processor with the MULGF instruction [14] and our implementation on the C64x+ DSP in $GF(2^{163})$ . #### 5.4 Reduction Results We implemented both of our reduction techniques for $GF(2^{163})$ , $GF(2^{233})$ , $GF(2^{283})$ and $GF(2^{409})$ . Our results are listed in Table V. Using the Shift and Add method works better only for $GF(2^{233})$ and $GF(2^{409})$ . This supports our claim that if most of the set bits in the irreducible polynomial are within a 9-bit window, then it is more effective to use Multi-Precision Binary Polynomial Multiplication. Figure 14 displays this data in a graph. TABLE V. CYCLES FOR UNBALANCED EXPONENT MODULAR REDUCTION | NIST Field | Irreducible | Cycles for Reduction<br>Method | | | | | |---------------|--------------------------------------|--------------------------------|-----------------------------------|--|--|--| | TVIST TIEM | Polynomial | Shift and Add | Multi-Precision<br>Multiplication | | | | | $GF(2^{163})$ | $x^{163} + x^7 + x^6 + x^3 + x^0$ | 28 | 19 | | | | | $GF(2^{233})$ | $x^{233} + x^{74} + x^0$ | 33 | 42 | | | | | $GF(2^{283})$ | $x^{283} + x^{12} + x^7 + x^5 + x^0$ | 90 | 64 | | | | | $GF(2^{409})$ | $x^{409} + x^{87} + x^0$ | 52 | 66 | | | | # **Cycle Count of Unbalanced Exponent Modular Reduction of Binary Fields** ■ Shift and Add ■ Multi-Precision Multiplication NIST Binary Field **Figure 14.** Cycles for Unbalanced Exponent Modular Reduction using the shift and add method and MPBPM. One might expect that reduction for $GF(2^{409})$ would take longer than reduction for $GF(2^{283})$ because it is a larger binary field. However, T(x) for $GF(2^{283})$ has more bits set in it when compared with the T(x) for $GF(2^{409})$ , and not all bits are within a 9 bit window. Therefore, the first iteration of UEMR requires the "reduction byte", (0xA1), to be multiplied by $C_h(x)$ , and then $C_h(x)$ must be shifted and added to that result. Since in $GF(2^{283})$ there are 282 bits in $C_h(x)$ , there are [282/24] = 12 subwords that need to be multiplied by the reduction byte in the first iteration. Then $C_h(x)$ must be shifted and added to account for the bit not in the reduction byte, i.e. $x^{12}$ . So there are [282/32] = 9 adds and 9 = 2 = 18 shifts needed in addition to the adds and shifts needed for binary polynomial multiplication. In the first iteration of reduction for $GF(2^{409})$ , there needs to be only [408/32] = 13 adds and 13 = 2 = 26 shifts. Table VI displays the percentage of cycles of multiplications needed for Unbalanced Exponent Modular Reduction. The reduction method which needed the least number of cycles was chosen for each binary field. Figure 15 shows this data in pie graphs. This data shows that reduction for $GF(2^{283})$ takes the highest percentage of execution time when compared with the other NIST Binary Fields. This is because it must perform both binary polynomial multiplication using XOR-Multiply instructions and shift and add. Figure 15 shows that reduction does count for a very small percentage of the total multiplication time, which supports our usage of UEMR. Since the number of bits set in T(x) are either 2 or 4, as the NIST Binary Field becomes larger, the percentage of execution time used for reduction becomes smaller. **Figure 15.** Percentage of execution time needed for reduction for each NIST Binary Field in our implementation on the C64x+ TABLE VI. CYCLE PERCENTAGE FOR REDUCTION | NIST Field | Irreducible Polynomial | Percentage<br>of cycles for<br>MPBPM | Percentage<br>of cycles for<br>Reduction | |---------------|--------------------------------------|--------------------------------------|------------------------------------------| | $GF(2^{163})$ | $x^{163} + x^7 + x^6 + x^3 + x^0$ | 93.62% | 6.38% | | $GF(2^{233})$ | $x^{233} + x^{74} + x^0$ | 93.01% | 6.99% | | $GF(2^{283})$ | $x^{283} + x^{12} + x^7 + x^5 + x^0$ | 91.92% | 8.08% | | $GF(2^{409})$ | $x^{409} + x^{87} + x^0$ | 96.74% | 3.26% | | $GF(2^{571})$ | $x^{571} + x^{10} + x^5 + x^2 + x^0$ | 95.83% | 4.17% | # CHAPTER 6 # **FUTURE WORK** #### 6.1 Improving Binary Field Multiplication While our results were very good, there is room for improvement. Our lowest speed ups came at larger binary fields. Our algorithm can be enhanced by incorporating Karatsuba Multiplication in it so it can perform in less than $O(n^2)$ time. The larger binary fields would benefit most from this. To represent numbers in $GF(2^{571})$ on the C64x+, $\lceil 571/32 \rceil = 18$ registers are needed. Let $x_0$ and $y_0$ represent the least significant 9 registers of x and y, respectively. Let $x_1$ and $y_1$ represent the most significant 9 registers of x and y, respectively. Let the base b be $2^{288}$ ( $2^{(9)(32)}$ ). Instead of calculating $(x_1y_1)b^2 + (x_0y_1 + x_1y_0)b + x_0y_0$ , you can use Karatsuba Multiplication and calculate $(x_1y_1)b^2 + x_0y_0 + ((x_1 + x_0)(y_0 + y_1) - x_0y_0 - x_1y_1)b$ . This increases the number of additions and subtractions, equivalient to Exclusive-ORs, but significantly decreases the number of XOR-Multiplies needed. Six XOR instructions can be executed every cycle and these instructions have no delay slots. Only two XOR-Multiply instructions can be dispatched each cycle and require three delay slots. Using XOR-Multiply also requires executing XOR and shift instructions to create partial products. Using Karatsuba Multiplication in this instance decreases the number of XOR-Multiply instructions needed for MPBPM from 1728 to 1296, while increasing the number 32-bit XOR instructions by 18. This is a significant drop off in the the number of instructions needed. We implemented this on the C64x+ and it produced very good results. The execution time decreased from 11.60 $\mu$ s to 6.81 $\mu$ s. The speed up compared to the liberypt implementation on the ARM increased from 3.93 to 6.68. Karatsuba Multiplication can also be performed recursively. That is, each multiplication required, i.e. $x_1y_1$ , $x_0y_0$ and $(x_1 + x_0)(y_0 + y_1)$ , in this new implementation can be also be performed using Karatsuba Multiplication. This requires each multiplication to be broken up into smaller parts to decrease the total number of XOR-Multiply instructions. This can potentially be done several times to further increase performance. There will be a point where performing another Karatsuba decomposition will make the execution time longer. Therefore further work can be done to further optimize these algorithms for this DSP. Functions to perform binary field squaring can be also be developed based on MPBPM and UEMR. To square x, you get $x^2 = (x_1x_1)b^2 + (x_0x_1 + x_0x_1)b + x_0x_0$ . Since adding terms in a binary field is equivalent to performing a bit-wise XOR operation, the middle term is equal to zero because $x_0x_1 + x_0x_1 = 0$ . Hence $x^2 = (x_1x_1)b^2 + x_0y_0$ . Therefore squaring a binary field polynomial can be performed much faster than even Karatsuba Multiplication. #### 6.2 Implementing a Cryptosystem on the C64x+ Modular multiplication is just one part of a modern cryptosystems. Implementing an entire cryptosystem on the C64x+ can further support our argument that utilizing a VLIW DSP for cryptographic operations can significantly reduce execution time while freeing up the General Purpose Processor. Even though we have showed that the most computationally intensive part of an encryption algorithm can be sped up greatly, implementing a whole cryptographic system might show bottlenecks that might slow down the system. Cache and memory access might limit the C64x+. Additionally, if a full cryptosystem was implemented, there would need to be communication with the ARM or other peripherals. So communication may prove to be a bottleneck in this application. ### 6.3 Finite Field Multiplications on Other Processors Future research can be done to see how finite field multiplication can perform on other processors. One can survey newer DSPs, newer ARM processors or other low power processors. Power analysis of different processors performing modular multiplication would also be interesting. # CHAPTER 7 # **CONCLUSIONS** We have shown that migrating applications from an ARM to a DSP can provide drastic improvements in performance. We implemented binary field multiplication using Multi-Precision Binary Polynomial Multiplication on the TI C64x+. For most fields, this provided a six times speed up when compared with the ARM on the same chip. We have shown that Unbalanced Exponent Modular Reduction can also be efficiently implemented on the C64x+ using both the shift and add method and Multi-Precision Binary Polynomial Multiplication. We have explained how our algorithm can be improved by taking advantage of Karatsuba Multiplication. We also explored other possible future work. We have taken our source code and produced a C library for the C64x+ which can perform binary field addition and multiplication for all NIST Binary Fields. Our source code is posted online and is available for all to view, edit and compile. It is downloadable from: http://rijndael.ece.vt.edu/ctergino ## REFERENCES - [1] "BeagleBoard System Reference Manual Rev C2," beagleboard.org, revision 0.2, March 2009. - [2] D. Hankerson, A. Menezes and S. Vanstone, Guide to Elliptic Curve Cryptography. New York, NY: Springer, 2004, pp. 213-215. - [3] S. Haibin, Y. Jin, and R. You, "Unbalanced Exponent Modular Reduction over Binary Field and Its Implementation," Innovative Computing, Information and Control, 2006. ICICIC '06. First International Conference on, vol. 1, pp. 190-193, Aug-Sept 2006. - [4] M. Knezevic, K. Sakiyama, J. Fan, and I. Verbauwhede, "Modular Reduction in GF(2<sup>n</sup>) Without Pre-Computational Phase," Lecture Notes in Computer Science, Arithmetic of Finite Fields 2nd International Workshop, WAIFI 2008, Proceedings, pp. 77-87, July 2008. - [5] J. Sankaran, "Reed Solomon Decoder: TMS320C64x Implementation," Digital Signal Processing Solutions Application Report, Texas Instruments, December 2000. - [6] P. Montgomery, "Multiplication Without Trial Division," Mathematics of Computation, vol. 44, issue 170, pp. 519-521, April 1985. - [7] C. Koc, T. Acar and B. Kaliski, "Analyzing and Comparing Montgomery Multiplication Algorithms," IEEE Micro, vol. 16, issue 16, pp. 26-33, June 1996. - [8] B. Sunar, "A generalized method for constructing subquadratic complexity GF(2k) multipliers," IEEE Transactions on Computers, vol. 53, issue 9, pp. 1097-1105, Sept. 2004. - [9] "OMAP3530/25 Applications Processor," Texas Instruments, May 2009. - [10] "Architecture and Implementation of the ARM® Cortex<sup>TM</sup>-A8 Microprocessor," ARM, October 2005. - [11] "TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide," Texas Instruments, October 2008. - [12] K. Itoh, M. Takenaka, N. Torii, S. Temma and Y. Kurihara, "Fast Implementation of Public-Key Cryptography on a DSP TMS320C6201," Lecture Notes In Computer Science, vol 1717, pp. 61-72, 1999. - [13] P. Gastaldo, G. Parodi and R. Zunino, "Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems," EURASIP Journal on Embedded Systems, vol. 8, issue 3, pp. 1-9, April 2008. - [14] S. Bartolini, I. Branovic, R. Giogri, and E. Martinelli, "Effects of Instruction-Set Extensions on an Embedded Processor: A Case Study on Elliptic Curve Cryptography over GF(2<sup>m</sup>)," IEEE Transactions on Computers, vol. 57, issue 5, pp. 672-686, May 2008. - [15] A. Tenca and C. Koc, "A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm," IEEE Transactions on Computers, vol. 52, issue 9, pp. 1215-1221, Sept 2003.