Department of Mechanical Engineering, Virginia Tech, 800 Drillfield Dr, Blacksburg, VA 24061, USA

Australian Centre for Field Robotics (ACFR), Rose St, Sydney 2006, Australia

Abstract

This paper presents the performance modeling of real-time grid-based recursive Bayesian estimation (RBE), particularly its parallel computation using a graphics processing unit (GPU). The proposed modeling formulates the data transmission between the central processing unit (CPU) and the GPU, as well as the floating point operations to be carried out on the CPU and the GPU, necessary for one iteration of the real-time grid-based RBE. Given the specifications of the computer hardware, the proposed modeling can thus estimate the total time cost of performing the grid-based RBE in a real-time environment. A new prediction formulation, which adopts separable convolution, is proposed to further accelerate the real-time grid-based RBE. The performance of the proposed modeling was investigated, and parametric studies first demonstrated its validity under various conditions by showing that the average error of the computational performance estimate stays below 6% to 7%. Utilizing the prediction with separable convolution, the grid-based RBE was also found to complete within 1 ms even though the problem size was relatively large.

Introduction

Recursive Bayesian estimation (RBE) allows the estimation of the belief of a dynamically moving target by updating the belief both in time and with observations.

One such technique is the modified ensemble Kalman filter (EnKF). The EnKF allows non-Gaussian estimation by minimizing a cost function, defined by a non-Gaussian observation error, with a pre-conditioned conjugate gradient method.

The grid-based RBE technique is able to maintain good accuracy of the belief since the entire target space is spatially discretized.

This paper presents a performance modeling for the parallel grid-based RBE, particularly the parallel computation using the GPU, which is able to determine the time cost of one iteration of the RBE. The proposed modeling formulates the total amount of data transmission between the CPU and the GPU and the total number of floating point operations to be carried out on the CPU and the GPU for one iteration of the parallel grid-based RBE. Given the specifications of the computer hardware, it is thus possible to estimate the time cost of one iteration of the parallel grid-based RBE. In order to perform the parallel grid-based RBE at maximum speed, the proposed modeling also reformulates and implements the prediction process with separable convolution.

The paper is organized as follows. The following section reviews recursive Bayesian estimation as well as the parallel grid-based RBE. The next section presents the proposed reformulation of the prediction process for the parallel grid-based RBE and its computational performance modeling. The subsequent section demonstrates the validity and efficacy of the proposed modeling through numerical examples, and the conclusion and future work are summarized in the final section.

Parallel grid-based RBE

Problem statement

The motion of an object is modeled as

where x^o represents the state of the object, u^o represents the object control input, w^o represents the system noise, which includes environmental influences on the target, and

where t_{k−1} is the time corresponding to time step k−1.

Recursive Bayesian estimation

Prediction

The prediction process starts with the numerical implementation of the object motion model defined in Equation (2). For simplicity, the numerical integration is carried out with the left Riemann sum. By dividing the time interval
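For illustration, the left Riemann sum update can be sketched as follows. The constant-velocity model `f` and all values here are hypothetical placeholders, not the specific object motion model of this paper.

```python
def propagate_left_riemann(x0, f, u, t0, t1, n_steps):
    """Advance the state x from t0 to t1 by x_{j+1} = x_j + f(x_j, u) * dt,
    evaluating the motion model f at the left endpoint of each sub-interval."""
    dt = (t1 - t0) / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + f(x, u) * dt
    return x

# Hypothetical constant-velocity model dx/dt = u, so x(1) = x(0) + u.
x_final = propagate_left_riemann(0.0, lambda x, u: u, 2.0, 0.0, 1.0, 100)
```

For a constant integrand the left Riemann sum is exact; for a time-varying motion model the step count trades accuracy against computation time.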

Let a sequence of the observations of the object from time step 1 to

where

Correction

The correction process is associated with the definition of the observation model. Let the probability of detection (PoD) be

The observation ^s**z**_k at time step k

where

The correction process then computes the belief

where

where

Parallel grid-based RBE

Representation of target space and belief

The grid-based RBE achieves non-Gaussian belief estimation by first representing the arbitrary target space

and subsequently creating a rectangular space discretized into n_x and n_y grid cells in the two directions, respectively. The dimensions of a grid cell are defined as

where ∀i_x ∈ {1,…,n_x} and ∀i_y ∈ {1,…,n_y}. Each grid cell is defined as

Note that n_g is the number of grid cells approximating the target space.
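The discretization above can be sketched as follows; the space bounds, cell counts, and variable names are illustrative values only, and the uniform belief anticipates the discrete belief representation described next.

```python
# Hypothetical rectangular target space split into n_x x n_y grid cells.
n_x, n_y = 4, 3
x_min, x_max, y_min, y_max = 0.0, 8.0, 0.0, 6.0
d_x = (x_max - x_min) / n_x      # cell width
d_y = (y_max - y_min) / n_y      # cell height
a_c = d_x * d_y                  # area of one grid cell
n_g = n_x * n_y                  # number of grid cells

# A uniform belief density stored per cell; a valid discrete density must
# integrate to 1 over the grid (sum of cell values times cell area).
belief = [[1.0 / (n_g * a_c)] * n_y for _ in range(n_x)]
total = sum(belief[ix][iy] * a_c for ix in range(n_x) for iy in range(n_y))
```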

The belief is usually represented by a probability density function over the target space. Similar to the discretization of the target space, the belief can also be represented discretely by grid cells. The position of each grid cell can be described in the two-dimensional integer space as [i_x, i_y], where i_x ∈ {1,…,n_x} and i_y ∈ {1,…,n_y}. With this integer representation, the belief at the grid cell [i_x, i_y] can be represented as

Prediction

The prediction process requires the numerical evaluation of Equation (4). Given the belief of the previous state at each grid cell [i_x, i_y] and the target motion model discretized into an m_x × m_y matrix as a convolution kernel, the predicted belief of the current state can be numerically computed as

where ⊗ indicates the two-dimensional convolution of the belief of the previous state with the probabilistic target motion model. Therefore, the belief of the current state is given by

The parallelization of the prediction process is straightforward. Since the prediction at each grid cell, given by Equation (12), can be performed independently, parallelizing the prediction corresponds to parallelizing this equation and achieves a parallel efficiency of 100% in an ideal environment. However, the equation also shows that the computation for the prediction process is largely dominated by the size of the convolution kernel. For real-time performance, it is important to use a convolution kernel of appropriate size: large enough to capture the motion of the target, yet small enough to allow fast computation.
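A minimal sketch of the grid-wise prediction, assuming a zero-padded border and a symmetric (e.g., Gaussian) kernel so that the correlation-style sum coincides with the convolution; the function and variable names are illustrative.

```python
def predict(belief, kernel):
    """Predicted belief as a 2-D convolution of the previous belief with a
    discretized motion-model kernel (zero padding outside the grid)."""
    n_x, n_y = len(belief), len(belief[0])
    m_x, m_y = len(kernel), len(kernel[0])
    cx, cy = m_x // 2, m_y // 2
    out = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):          # each output cell is computed independently,
        for iy in range(n_y):      # hence the loop body is trivially parallelizable
            s = 0.0
            for kx in range(m_x):
                for ky in range(m_y):
                    jx, jy = ix + kx - cx, iy + ky - cy
                    if 0 <= jx < n_x and 0 <= jy < n_y:
                        s += belief[jx][jy] * kernel[kx][ky]
            out[ix][iy] = s
    return out

# With the identity kernel the belief is unchanged.
b = [[0.1, 0.2, 0.3], [0.0, 0.3, 0.1]]
predicted = predict(b, [[1.0]])
```

The four nested loops also make the cost per cell explicit: it grows with m_x × m_y, which is exactly why the kernel size dominates the prediction time.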

Correction

The correction process corresponds to the numerical computation of Equation (7). Given the predicted belief at each grid cell [i_x, i_y], the corrected belief is computed by

where a_c is the area of a grid cell, and

The parallelization of the correction process requires breaking the process down to identify which sub-processes are parallelizable. By observing the mathematical operations, the correction process can be broken down into three steps:

1. Calculate the product of the predicted belief and the observation likelihood at each grid cell;

2. Sum the products over all the grid cells, multiplied by the cell area a_c, to obtain the normalization constant;

3. Calculate the corrected belief by dividing the product at each grid cell by the normalization constant.

The breakdown indicates that steps 1 and 3 are grid-wise sub-processes, which can be conducted independently. Therefore, for the correction process, steps 1 and 3 can be computed in parallel, whereas step 2 is not parallelizable.
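The three steps can be sketched as follows; the function and all values are illustrative.

```python
def correct(pred, likelihood, a_c):
    """Correction in the three steps from the text."""
    # Step 1 (grid-wise, parallelizable): cell products of belief and likelihood.
    prod = [[p * l for p, l in zip(pr, lr)] for pr, lr in zip(pred, likelihood)]
    # Step 2 (a reduction, serial in the breakdown): normalization constant
    # as the sum of the products times the cell area.
    norm = sum(v for row in prod for v in row) * a_c
    # Step 3 (grid-wise, parallelizable): divide every cell by the constant.
    return [[v / norm for v in row] for row in prod]

post = correct([[0.2, 0.3], [0.4, 0.1]], [[1.0, 0.5], [0.25, 1.0]], a_c=1.0)
total = sum(v for row in post for v in row)  # normalizes to 1 by construction
```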

Target state evaluation

In the parallel grid-based RBE, the state of the target is evaluated by Equation (2) in the integral form at each time step. For an accurate evaluation of the target state, the time interval must be chosen appropriately with respect to the time t_c required to perform the computation, including both the prediction and correction processes. As shown in Figure, when the time interval is no smaller than t_c, the evaluated target states can match the real target states. When it is smaller than t_c, the evaluation of the target states fails and eventually leads to large accumulated errors. The achievable t_c is determined not only by the parallel grid-based RBE itself but also by its computational performance on the specific computer hardware configuration.


**Target states evaluation.**

Computational performance modeling

Acceleration of prediction process

Since an RBE designed to run at high frequency can use a Markovian target motion model well approximated by a Gaussian probability density, the proposed modeling first reformulates the prediction process under the Gaussian assumption as a pre-process and accelerates the parallel grid-based RBE to achieve maximum performance. Under the Gaussian assumption, the convolution kernel, a matrix of size m_x × m_y, can be separated into two vector kernels, known as separable convolution: a column kernel of length m_x and a row kernel of length m_y. The target motion model matrix is therefore separated as

where the number of kernel elements is reduced from m_x m_y to m_x + m_y. Substituting Equation (15) into Equation (11), the predicted belief of the current state can be computed as

which means that the prediction process can be broken down into two steps:

and

These equations show that the prediction at each grid cell is carried out by performing two one-dimensional convolutions, one in the horizontal and one in the vertical direction, instead of the original single two-dimensional convolution, while remaining completely parallelizable. For Equation (17), the number of floating point operations per grid cell is 2m_x, since m_x multiplications and m_x summations are necessary, whereas the number of floating point operations for Equation (18) is 2m_y by the same observation. With a total of n_g grid cells, the total number of floating point operations for the prediction process is thus given by

This is considerably small compared with that of the original formulation, which is derived as 2n_g m_x m_y via Equation (12), since m_x + m_y ≪ m_x m_y for an appropriate prediction process.
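The equivalence of the two formulations can be checked on a small example; the "valid"-region convolution and all values below are illustrative. Note the operation counts: a 31×31 kernel needs 2×961 operations per cell directly, but only 2×(31+31) = 2×62 separably.

```python
def conv1d(seq, k):
    """1-D 'valid' convolution-style sum: 2*len(k) ops per output value."""
    m = len(k)
    return [sum(seq[i + j] * k[j] for j in range(m))
            for i in range(len(seq) - m + 1)]

def conv2d(grid, K):
    """Direct 2-D 'valid' convolution-style sum: 2*m_x*m_y ops per cell."""
    m_x, m_y = len(K), len(K[0])
    return [[sum(grid[i + a][j + b] * K[a][b]
                 for a in range(m_x) for b in range(m_y))
             for j in range(len(grid[0]) - m_y + 1)]
            for i in range(len(grid) - m_x + 1)]

# Separable 3x3 kernel: the outer product of a column and a row kernel.
k_col, k_row = [0.25, 0.5, 0.25], [0.25, 0.5, 0.25]
K = [[c * r for r in k_row] for c in k_col]
grid = [[float(5 * i + j) for j in range(5)] for i in range(5)]

direct = conv2d(grid, K)
# Separable version: column pass, then a row pass, 2*(m_x + m_y) ops per cell.
col_pass = [[sum(grid[i + a][j] * k_col[a] for a in range(3))
             for j in range(5)] for i in range(3)]
separable = [conv1d(row, k_row) for row in col_pass]
```

Both results are identical because the double sum over (a, b) factorizes whenever the kernel is an outer product, which is exactly the Gaussian-separability property used above.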

Parallel computation using GPU

Following Equations (16) and (13) for the prediction and correction process, respectively, Figure


**Procedures in the parallel grid-based RBE using GPU.**

Modeling of computational performance

The computational performance of the accelerated parallel grid-based RBE using GPU is determined not only by the performance of the CPU but also by the performance of the GPU and that of data transmission. As a result, the time cost of one iteration of the accelerated parallel grid-based RBE is given by

where t_trans represents the data transmission time cost between the CPU's memory and the GPU's global memory as well as that between the local and the global memory inside the GPU, t_G represents the time cost of the parallel computation performed on the GPU, and t_C represents the time cost of the computation performed on the CPU.

Data transmission

In order to determine the data transmission time cost t_trans for one iteration of the accelerated parallel grid-based RBE, the data transmitted among the CPU's memory, the GPU's global memory, and the GPU's local memory need to be evaluated for both the prediction and correction processes. Let the amount of data transmitted, in bytes, be defined as

where n_g and m_x + m_y are the numbers of data of the belief and of the two vector kernels transmitted in the prediction process, respectively. The same numbers of data, n_g and m_x + m_y, are transmitted to the GPU's local memory to perform the parallel calculation. In the correction process, the number of data of the likelihood to be transmitted from the CPU's memory to the GPU's local memory through the GPU's global memory is n_g, whereas the number of data of the result is also n_g. The sum computed over the n_g grid cells is transmitted as a single value.

By observing the data transmission for one iteration of the accelerated parallel grid-based RBE, the total number of data transmitted from the CPU’s memory to the GPU’s global memory is given by

and all the data are transmitted continuously from the GPU’s global memory to the GPU’s local memory

The total number of data transmitted from the GPU’s local memory to the GPU’s global memory is

and that from the GPU’s global memory to the CPU’s memory similarly becomes

The data transmission time cost _{trans} for one iteration of the accelerated parallel grid-based RBE is given by

where n_CG and B_CG are the total number of data transmitted and the copy bandwidth, in bytes per second, from the CPU's memory to the GPU's global memory, respectively; n_GC and B_GC are those from the GPU's global memory to the CPU's memory; and n_GG and B_GG represent those between the GPU's global memory and the GPU's local memory. Since the copy bandwidth from the GPU's global memory to the GPU's local memory and that in the opposite direction are the same, the number of data transmitted inside the GPU is given by

Substituting Equations (22), (25), and (27) into Equation (26), the data transmission time cost for one iteration of the accelerated parallel grid-based RBE is given by

It is to be noted here that these copy bandwidths are inherent to a specific computer hardware configuration and can be determined experimentally.
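The transmission model amounts to dividing byte counts by bandwidths and summing over the three paths; a sketch with made-up byte counts and bandwidth values follows.

```python
def t_trans(n_cg, n_gc, n_gg, b_cg, b_gc, b_gg):
    """Transmission time: byte counts n_* over copy bandwidths b_* (bytes/s),
    summed over CPU->GPU global, GPU global->CPU, and global<->local paths."""
    return n_cg / b_cg + n_gc / b_gc + n_gg / b_gg

# 1,000,000 single-precision values in and out (4 bytes each), mirrored once
# through the GPU's local memory; all bandwidths are illustrative placeholders.
t = t_trans(n_cg=4e6, n_gc=4e6, n_gg=8e6,
            b_cg=4e9, b_gc=4e9, b_gg=8e9)   # total transfer time in seconds
```

In practice the bandwidths would be replaced by the experimentally measured values for the target machine.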

Floating point operations

In order to determine the GPU computation time cost t_G and the CPU computation time cost t_C for one iteration of the accelerated parallel grid-based RBE, the number of floating point operations performed on both the CPU and the GPU needs to be evaluated. The number of floating point operations performed on the GPU for the prediction process is 2n_g(m_x + m_y), as Equation (19) indicates. The number of floating point operations performed on the GPU for the correction process is 2n_g in total, since n_g parallel multiplications and n_g parallel divisions are performed for steps 1 and 3 of the correction, respectively. Meanwhile, the number of floating point operations performed on the CPU is n_g, due to the n_g summations in step 2 of the correction. As a consequence, the total numbers of floating point operations performed on the GPU and the CPU for one iteration of the accelerated parallel grid-based RBE are given, respectively, by

The GPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by

where n_G is the number of floating point operations performed on the GPU, and R_G is the computational rate of the GPU with the unit of FLOPS. Substituting Equation (29) into Equation (30), the GPU computation time cost is given by

Similarly, the CPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by

where n_C represents the number of floating point operations performed on the CPU, and R_C is the computational rate of the CPU with the unit of FLOPS. In the same way, by substituting Equation (29) into Equation (31), the CPU computation time cost is given by

It is to be noted here that the computational rates, R_G and R_C, are also inherent to a specific CPU and GPU configuration and can be determined experimentally.
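Putting the pieces together, the full per-iteration estimate t = t_trans + t_G + t_C can be sketched as follows. The FLOP counts follow the formulas above; the byte counts and the hardware rates are illustrative assumptions, not measured values.

```python
def iteration_time(n_g, m_x, m_y, b_cg, b_gc, b_gg, r_g, r_c, bytes_per_value=4):
    """Estimated time of one accelerated parallel grid-based RBE iteration."""
    # Bytes moved (assumed counts): belief + likelihood + the two vector
    # kernels in, the corrected belief out, each mirrored through GPU local memory.
    n_cg = (2 * n_g + m_x + m_y) * bytes_per_value
    n_gc = n_g * bytes_per_value
    n_gg = n_cg + n_gc
    t_trans = n_cg / b_cg + n_gc / b_gc + n_gg / b_gg
    # FLOPs from the text: prediction 2 n_g (m_x + m_y) plus correction 2 n_g
    # on the GPU, and n_g summations on the CPU.
    t_g = (2 * n_g * (m_x + m_y) + 2 * n_g) / r_g
    t_c = n_g / r_c
    return t_trans + t_g + t_c

# Illustrative hardware rates (bytes/s and FLOPS), not measured specifications.
t = iteration_time(n_g=1000 * 1000, m_x=16, m_y=16,
                   b_cg=4e9, b_gc=4e9, b_gg=16e9, r_g=1e11, r_c=5e9)
```

With experimentally determined bandwidths and rates, such a function directly answers whether a given grid resolution and kernel size fit within the real-time budget.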

Experimental validation

Table

| **Setup** | **Processor** | **Memory (GB)** | **GPU** |
| --- | --- | --- | --- |
| 1 | Intel Dual-Core, 2.70 GHz | 4.0 | Nvidia GeForce GT220 |
| 2 | Intel Dual-Core, 2.40 GHz | 4.0 | Nvidia GeForce GT320M |
| 3 | Intel Dual-Core, 2.40 GHz | 4.0 | Nvidia GeForce GS8400 |

Nvidia GeForce (Santa Clara, CA, USA); Intel (Santa Clara, CA, USA).

Improvement in prediction process

The efficiency of the prediction process accelerated by separable convolution was evaluated with a problem having a fixed grid space size of 1,000×1,000, varying the convolution kernel size from 1 to 50, on computer setup 1. The resulting GPU time cost is shown in Figure


**GPU calculation time cost.**

Validation

This set of tests aimed at validating the proposed modeling of computational performance by estimating the total iteration time cost. Each component of the time cost, t_trans, t_G, and t_C, is also compared with the actual performance. All time cost results were measured by averaging over 10,000 iterations. Needless to say, the convolution kernel size m_x + m_y and the grid space size n_g are the two major factors in the proposed modeling. Two tests were thus conducted, one varying the convolution kernel size and the other varying the grid space size.

Test 1

Test 1 was performed by fixing the grid space size of the parallel grid-based RBE to 1,000×1,000 and varying the convolution kernel size with m_x = m_y.

The results of all the components of the time cost for the three computer setups are shown in Figures


**Time cost of all the components for setup 1 with fixed grid space.**


**Time cost of all the components for setup 2 with fixed grid space.**


**Time cost of all the components for setup 3 with fixed grid space.**

| **Time cost** | **Setup 1** | **Setup 2** | **Setup 3** |
| --- | --- | --- | --- |
| *Average relative error* |  |  |  |
| t_trans | 1.159 ms | 1.165 ms | 1.305 ms |
| t_G | 0.216 ms | 0.462 ms | 0.856 ms |
| t_C | 0.402 ms | 0.446 ms | 0.382 ms |
| Total | 1.777 ms (5.88%) | 2.073 ms (6.55%) | 2.543 ms (6.05%) |
| *Maximum relative error* |  |  |  |
| t_trans | 2.351 ms | 2.254 ms | 2.670 ms |
| t_G | 0.716 ms | 1.464 ms | 3.259 ms |
| t_C | 0.779 ms | 0.857 ms | 0.818 ms |
| Total | 3.228 ms (10.63%) | 4.149 ms (11.24%) | 6.081 ms (11.45%) |

Test 2

Test 2 was performed by fixing the convolution kernel size of the parallel grid-based RBE to 16×16 or 32×32 and varying the grid space size with n_x = n_y.

The results for all the components of the time cost for the three computer setups are shown in Figures


**Time cost of all the components for setup 1 with fixed kernel.**


**Time cost of all the components for setup 2 with fixed kernel.**


**Time cost of all the components for setup 3 with fixed kernel.**

| **Total time cost** | **Setup 1** | **Setup 2** | **Setup 3** |
| --- | --- | --- | --- |
| Average relative error | 0.513 ms (5.59%) | 0.530 ms (5.68%) | 0.617 ms (5.90%) |
| Maximum relative error | 2.140 ms (10.08%) | 2.491 ms (10.64%) | 2.835 ms (10.26%) |

Simulated target searching task

The performance of the prediction process dominates the accuracy of the RBE when no valid observations are obtained. The aim of this test is to evaluate how well the proposed modeling helps the prediction process maintain accuracy during periods without observations. A simplified target searching task is described in this subsection. The motion model of the simulated target is given by

where v^t and θ^t are the velocity and direction of the target motion, respectively, each subject to Gaussian noise, and

where the parameters characterize the PoD of each sensor platform s_i, respectively, and d_i is the distance between sensor platform s_i and the target. Table

| **Parameter** | **Value** |
| --- | --- |
| *Sensor platform, s_i* |  |
| Velocity | 0.05 |
| Turn coef. | 0.8 |
| PoD var. | [0.2, 0.2] |
| *Target* |  |
| Velocity | N(0.1, 0.02) |
| Direction | N(0 rad, 0.7 rad) |
| Prior | N([20, 25], diag{200, 200}) |

Figure


**Search and rescue task.**

| **Total time** | **Sensor platform 1** | **Sensor platform 2** | **Sensor platform 3** | **Sensor platform 4** |
| --- | --- | --- | --- | --- |
| Average relative error | 0.618 ms (5.78%) | 0.633 ms (6.21%) | 0.626 ms (5.82%) | 0.618 ms (5.93%) |
| Maximum relative error | 2.856 ms (9.89%) | 2.823 ms (9.56%) | 2.892 ms (9.25%) | 2.854 ms (9.68%) |

Conclusion and future work

The performance modeling for the real-time grid-based RBE, especially its parallel computation using a GPU, has been proposed to identify the best resolution of the RBE for given computer hardware. The modeling allows the estimation of the time costs of the computation on the CPU and the GPU and of the data transmission between the CPU and the GPU for the real-time grid-based RBE. In order to speed up the RBE, the prediction has additionally been reformulated with separable convolution.

The proposed modeling was experimentally investigated by varying its major parameters. The results of the first test, with varying convolution kernel size, show that the average error of the estimation by the proposed modeling stays below 7% regardless of the convolution kernel size, and that a high-performance GPU is necessary if the convolution kernel is large. The second test, with varying grid space size, found that the proposed modeling estimates within an average error of 6% irrespective of the grid space size, and that high-quality memory is necessary if fast RBE is required for a large grid space. Utilizing the prediction with separable convolution, the RBE was also found to complete within 1 ms even though the problem size was relatively large.

The current study is still the first step for achieving high-fidelity RBE in a real-time environment. The project is further planned to utilize the best resolution of the RBE identified by the proposed modeling and investigate its efficacy.