(1)

The rotation also has some effects on reducing the depth range of the FOV. By small-angle approximation, the reduction in this dimension can be approximated as

(2)
The error in the Kinect’s roll component is complex and is therefore approximated as an additional reduction of the horizontal and vertical viewing angles. Error in roll, however, does not reduce the FOV’s depth range.
In the final stage, the translational errors Δ are applied by further trimming the FOV along the x, y, and z axes.
Possible placement locations
In reality, cameras cannot be placed without constraints. A camera could be mounted 2 m above the floor in the center of the room, but it would need a supporting structure, such as a tripod, which can obstruct people and is therefore undesirable.
The possible and suitable locations for the cameras are usually along the edges of the room, which increases coverage and avoids obstructing the space. This also reduces the complexity of optimizing the camera locations: instead of searching the full 3-D xyz space, the search over the xy-plane is reduced to a search along those edges, removing one dimension. Moreover, if the camera height is fixed by the height of the existing structure, z is also restricted, so the 3-D search problem is reduced to only 1-D.
Camera placement optimization algorithm
In order to obtain an optimal, or nearly optimal, solution, cameras are added one by one so as to minimize the uncovered area. After each addition, the configurations of all cameras are adjusted around their current values, again to minimize the uncovered area. These two steps are repeated until the whole area is covered. After that, a final optimization is performed to maximize the coverage: each camera is readjusted one by one to maximize the sum, over all points, of the number of cameras covering each point, while keeping every point covered. All optimization processes are carried out with PSO22 following previous works,19,21 using the Python library pyswarm.23 The process is illustrated as a flowchart in Figure 3.
Particle’s state
PSO requires a definition of the state of each particle. Each variable in the state is a value that can be adjusted to optimize the objective function.
In this coverage optimization problem, the state of each camera represents the camera’s location and orientation. The location is constrained by the possible placement locations, and the orientation consists of the pitch and yaw angles about the fixed y-axis and z-axis, respectively. The roll angle is set to zero, so the state is (xstate, θstate, ψstate).
Coverage test
To evaluate the coverage of the system, we need to know whether a point P(x, y, z) within the region of interest (ROI) is in the FOV of camera Ci, denoted FOVCi, given the position Oi = (Ox,i, Oy,i, Oz,i) and orientation (ϕi, θi, ψi) of Ci, for all cameras in the system. This is done by transforming the point P(x, y, z), expressed in the world frame, into Pi(xi, yi, zi) expressed in camera Ci’s frame. By comparing xi, yi, and zi with the reduced model obtained above, we can judge whether P is in FOVCi.
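As an illustration, a minimal sketch of this coverage test in Python is given below. It assumes ZYX (yaw–pitch–roll) Euler angles and a hypothetical in_reduced_fov() predicate encoding the reduced model; the names and conventions are assumptions for illustration, not the original implementation.

    import numpy as np

    def rotation_matrix(roll, pitch, yaw):
        # Camera-to-world rotation from ZYX (yaw, pitch, roll) Euler angles.
        cr, sr = np.cos(roll), np.sin(roll)
        cp, sp = np.cos(pitch), np.sin(pitch)
        cy, sy = np.cos(yaw), np.sin(yaw)
        Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
        Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
        Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
        return Rz @ Ry @ Rx

    def to_camera_frame(P, O, angles):
        # Express world point P in the frame of a camera at O with orientation angles.
        R = rotation_matrix(*angles)                  # camera-to-world
        return R.T @ (np.asarray(P) - np.asarray(O))  # world-to-camera

    def is_covered(P, cameras, in_reduced_fov):
        # True if any camera (O, angles) in `cameras` sees P under the reduced FOV model.
        return any(in_reduced_fov(to_camera_frame(P, O, ang)) for O, ang in cameras)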
Process of adding a new camera
The first step in each loop is adding a camera to the system. To cover the ROI using the fewest cameras, the placement is chosen by minimizing the number of points in the area that are not covered by any camera. That is, at a point P(x, y, z), the index f(P) is defined as 0 if P is in the FOV of any camera, and 1 otherwise. The objective function fopt to be minimized is
fopt = ∑P∈ROI f(P)    (3)
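A minimal sketch of how this step could be driven by pyswarm is shown below; the helper functions covered() and to_camera(), as well as the swarm size and iteration count, are placeholders rather than the values used in the original implementation.

    from pyswarm import pso

    def f_opt(state, roi, existing, covered, to_camera):
        # Equation (3): number of ROI points not covered by the existing cameras
        # plus a candidate camera with state (x_state, theta_state, psi_state).
        candidate = existing + [to_camera(state)]
        return sum(0 if covered(P, candidate) else 1 for P in roi)

    def add_camera(roi, existing, covered, to_camera, lb, ub):
        # Minimize the uncovered count over the 1-D location and the two angles.
        best_state, best_cost = pso(f_opt, lb, ub,
                                    args=(roi, existing, covered, to_camera),
                                    swarmsize=50, maxiter=100)
        return to_camera(best_state), best_cost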
Process of readjustment
In this step, all configurations, including those of the newly added camera, are varied within a small boundary around their current settings; that is, for Ci with state (xstate,i, θstate,i, ψstate,i), Ci is varied within the ranges [xstate,i − xb, xstate,i + xb], [θstate,i − θb, θstate,i + θb], and [ψstate,i − ψb, ψstate,i + ψb], respectively, with the ranges truncated to the extreme limits. Here xb, θb, and ψb are the sizes of the boundaries of xstate, θstate, and ψstate, respectively.
The objective function used in this process is the same as that used in the adding process, that is, fopt. If the readjustment yields a lower number of uncovered points, the result is kept; otherwise, the process continues with the previous result.
After this process, if all points in the ROI are covered, that is, fopt = 0, the optimization breaks the loop and proceeds to the final readjustment. If some points are still uncovered, the loop continues and a new camera is added again.
Process of final optimization
This final optimization aims to maximize the coverage, while maintaining complete coverage, without adding a new camera. At each point P ∈ ROI, the number of cameras seeing the point is denoted n(P), and the objective function to be maximized is
fopt2 = ∑P∈ROI n(P)    (4)
subject to the constraint that n(P) > 0 for all P ∈ ROI.
First, fopt2 is calculated based on the camera configurations from the loop and stored as fcur and floop. Next, the configuration of each camera is adjusted one by one within the boundary, and fopt2 is recalculated. If fopt2 > fcur, the adjusted result is kept and fcur is set to fopt2.
After adjusting all cameras, fcur is compared with the floop recorded at the beginning of the pass. If fcur > floop, there was some improvement in the previous pass, so floop is set to fcur and the adjustment continues. On the other hand, fcur = floop indicates that there was no improvement in the previous pass; the process is therefore stopped, and the final result is obtained.
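The outer structure of this final readjustment could look like the following sketch, in which count_seen() and adjust_one() (the per-camera PSO step within its boundary) are hypothetical helpers; the constraint n(P) > 0 is enforced here by returning −∞ when any point becomes uncovered, which is one possible way to handle it.

    def f_opt2(cameras, roi, count_seen):
        # Equation (4): total number of camera-point coverings, invalid (-inf)
        # if any ROI point is left uncovered, i.e. the constraint n(P) > 0 is violated.
        counts = [count_seen(P, cameras) for P in roi]
        if min(counts) == 0:
            return float('-inf')
        return sum(counts)

    def final_readjustment(cameras, roi, count_seen, adjust_one):
        # Greedy per-camera readjustment: repeat passes while the objective improves.
        f_loop = f_opt2(cameras, roi, count_seen)
        while True:
            f_cur = f_loop
            for i in range(len(cameras)):
                trial = adjust_one(cameras, i)           # PSO within the local boundary
                f_trial = f_opt2(trial, roi, count_seen)
                if f_trial > f_cur:                      # keep only improving adjustments
                    cameras, f_cur = trial, f_trial
            if f_cur > f_loop:
                f_loop = f_cur                           # improvement: run another pass
            else:
                return cameras                           # no improvement: stop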
Implementation
A space in our laboratory is a good candidate for testing the optimization algorithm, as the possible camera locations form a complex shape. By measuring the dimensions of the space and its structures, a model of the area was obtained.
The experimental area is set as a 3 m × 3.5 m region, sampled at a 0.1-m step, drawn as dots in Figure 4. In order to cover the area so that both people’s and the quadrotor’s positions can be obtained, it must be covered from a height of 0.7 m to 2.5 m, also at a 0.1-m step. These sampled points form our ROI, and the optimization is done to cover all of them.
In the system, Kinect depth sensors were utilized. According to the specification of the Xbox Kinect sensor,24 the horizontal angle of view (2αc,h) is 57° and the vertical angle of view (2αc,v) is 43°. The depth range of the Kinect is set at 0.45–4.5 m. By placing the x-axis along the depth and the y-axis pointing to the left, the simplified model of the Kinect’s FOV can be illustrated as the dashed black lines in Figure 5.
With the reduced model discussed in the section “Placement error prevention”, for each degree of error δ with αc = 28°, ΔDmin ≈ 0.005 m on the minimum depth side (Dmin = 0.45 m) and ΔDmax ≈ 0.05 m on the maximum depth side (Dmax = 4.5 m). Error in the roll angle results in an additional 1° reduction of both the horizontal and vertical angles.
With the angular errors δ = 2° and the translational errors Δ = 0.03 m, the reduced FOV is shown in Figure 5. Point Pi(xi, yi, zi) in Kinect Ki‘s frame is in its FOV if
(5)
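Equation (5) is not reproduced here, but one plausible reading of the reduced model is sketched below: the half viewing angles are narrowed by the 2° angular error plus the 1° roll-related reduction, each face is trimmed by the 0.03-m translational error, and the depth limits are shrunk by the per-degree losses quoted above. The exact inequalities used in the original work may differ.

    import numpy as np

    ALPHA_H = np.radians(57.0 / 2)      # horizontal half viewing angle
    ALPHA_V = np.radians(43.0 / 2)      # vertical half viewing angle
    D_MIN, D_MAX = 0.45, 4.5            # nominal depth range (m)

    DELTA_ANG = np.radians(2.0 + 1.0)   # 2 deg angular error + 1 deg roll-related reduction
    DELTA_TRANS = 0.03                  # translational error (m)
    D_MIN_RED = D_MIN + 2 * 0.005 + DELTA_TRANS   # near-side depth loss for 2 deg of error
    D_MAX_RED = D_MAX - 2 * 0.05 - DELTA_TRANS    # far-side depth loss for 2 deg of error

    def in_reduced_fov(p):
        # p = (x, y, z) in the Kinect frame: x along the depth, y to the left, z up.
        x, y, z = p
        if not (D_MIN_RED <= x <= D_MAX_RED):
            return False
        if abs(y) > x * np.tan(ALPHA_H - DELTA_ANG) - DELTA_TRANS:
            return False
        if abs(z) > x * np.tan(ALPHA_V - DELTA_ANG) - DELTA_TRANS:
            return False
        return True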
- Limitation of possible placement locations
- Conversion of states to position and orientation
The extreme limits of the states are set as xstate ∈ [0 m, 12.73 m], θstate ∈ [10°, 30°], and ψstate ∈ [−70°, 70°]. The boundaries for readjustment are set as xb = 0.5 m for xstate and θb = ψb = 15° for θstate and ψstate, subject to the extreme limits. The other PSO parameters used are provided in Table 1.
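The actual conversion of states to a camera position and orientation depends on the room layout of Figure 6 and is not reproduced here. The sketch below only illustrates the general idea with a made-up edge polyline and mounting height: xstate is treated as an arc length along the allowed edges, and ψstate as a yaw offset from the edge’s inward normal; all of these are assumptions for illustration.

    import numpy as np

    # Hypothetical polyline of allowed mounting edges (xy corners in metres);
    # the real layout and its total length (12.73 m) are not reproduced here.
    EDGE_POINTS = np.array([[0.0, 0.0], [3.5, 0.0], [3.5, 3.0], [0.0, 3.0]])
    MOUNT_HEIGHT = 2.5   # assumed fixed camera height (m)

    def state_to_pose(x_state, theta_state, psi_state):
        # theta_state and psi_state are in degrees, as in the stated limits.
        seg = np.diff(EDGE_POINTS, axis=0)
        lengths = np.linalg.norm(seg, axis=1)
        s = np.clip(x_state, 0.0, lengths.sum() - 1e-9)
        i = int(np.searchsorted(np.cumsum(lengths), s, side='right'))
        s_local = s - np.concatenate(([0.0], np.cumsum(lengths)))[i]
        xy = EDGE_POINTS[i] + (s_local / lengths[i]) * seg[i]
        inward = np.arctan2(seg[i][0], -seg[i][1])    # edge normal pointing into the room
        yaw = inward + np.radians(psi_state)
        position = np.array([xy[0], xy[1], MOUNT_HEIGHT])
        return position, (0.0, np.radians(theta_state), yaw)   # (roll, pitch, yaw)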
Table 1. PSO’s parameters used in the optimization simulations.
Optimization result
The optimization script was run 20 times, with each run resulting in a different number of required cameras, ranging from five to seven. Each run took around 1–3 days to complete, depending on the number of Kinects used.
The run that resulted in the lowest number of cameras, that is, five, was selected. The final readjustment described in the section “Process of final optimization” was then applied to this result. This process took more than a week before converging to the final result. Plots showing the positions of the cameras with their reduced fields of view, and the number of cameras covering each point, are shown in Figure 7(a) and (b). The combination of all five Kinects is drawn in 3-D in Figure 7(c).
Optimization using only environmental cameras
The optimization problem using only environmental cameras was also solved, for comparison with the number of required cameras discussed in the section “System configuration.”
Using only environmental cameras to track the face, the coverage requirement is stricter: instead of merely covering all the points, we need to cover each point from all directions so that the face can be obtained. A camera must look into the face, with the deviation angle from the camera’s principal axis less than or equal to 30°.
As this optimization is for comparison with the proposed method, the simulation conditions and steps were kept as similar as possible. In addition to covering all points in the ROI, the face angles that can be covered by the quadrotor in the proposed system must also be covered. The final readjustment subroutine was removed to save time, as only the number of cameras is needed.
Using only the same possible camera locations obviously cannot cover all angles, as no camera can observe the face in the +y direction. Therefore, additional locations were added along the dashed blue lines in Figure 6, with black dotted arrows showing the central angles.
Figure 8 shows the top view of the person relative to the camera. θp is the angle of the person’s location from the camera’s principal axis, which can be obtained by
θp = atan2(yp − yc, xp − xc) − ψc    (6)
where (xp, yp) and (xc, yc) are the positions of the person and the camera, respectively, and ψc is the rotation angle of the camera from the x-axis. The person is inside the camera’s FOV if |θp| ≤ αc,h, the camera’s horizontal half viewing angle.
If (xp, yp) is not in the FOV, then none of the face angles can be seen by this camera. If it is inside, the angle α can be obtained by
(7)
where ψp is the rotation angle of the person from the x-axis. So
(8)
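Equations (7) and (8) are not reproduced here; the sketch below shows one plausible form of the overall face-visibility check, assuming that α measures the deviation between the person’s facing direction and the person-to-camera direction and that a face angle counts as covered when both |θp| ≤ αc,h and |α| ≤ 30°. The thresholds and the exact definition of α in the original work may differ.

    import numpy as np

    ALPHA_CH = np.radians(57.0 / 2)    # horizontal half viewing angle (Kinect-like)
    MAX_FACE_DEV = np.radians(30.0)    # maximum deviation allowed for face observation

    def wrap(angle):
        # Wrap an angle to (-pi, pi].
        return np.arctan2(np.sin(angle), np.cos(angle))

    def face_visible(person_xy, psi_p, cam_xy, psi_c):
        # True if a camera at cam_xy with yaw psi_c can observe the face of a
        # person at person_xy whose body faces psi_p.
        dx, dy = person_xy[0] - cam_xy[0], person_xy[1] - cam_xy[1]
        theta_p = wrap(np.arctan2(dy, dx) - psi_c)     # equation (6)
        if abs(theta_p) > ALPHA_CH:
            return False                               # person outside the camera FOV
        alpha = wrap(np.arctan2(-dy, -dx) - psi_p)     # person-to-camera vs facing direction
        return abs(alpha) <= MAX_FACE_DEV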
The simulation was run with angle steps of 1°, 2°, and 3°. Some face angles at positions in the top left corner cannot be covered from the possible camera positions, resulting in incomplete solutions. The simulations were stopped once there was no further improvement, and the lowest number of cameras obtained was 44, in the 1°-step simulation, as shown in Figure 9 (red areas show uncovered face angles and positions). This is a very large number of cameras for a real implementation compared with our proposed system, which requires only five environmental cameras and some moving cameras, basically one camera per person.
Sensor fusion and quadrotor’s tracking and control
As multiple Kinect sensors are used, a quadrotor or a person is likely to be detected by more than one Kinect, so it is necessary to correctly associate data coming from the same object. Data fusion is implemented by simple weighted averaging, with the weights determined empirically.
Person fusion and tracking
The positional data for a person are the 3-D position (x, y, z) relative to the world coordinate frame. For the orientation, only the rotation angle about the vertical axis is required, as looking up or down cannot be detected by the human-tracking algorithm. This rotation angle is taken as the orientation of the whole body, since turning the head left or right cannot be observed.
At the beginning, a tracking_list is initialized as an empty list. In each sample, if tracking_list is empty, all detected heads that are close to each other in Euclidean distance are grouped as the same head, and the position and orientation within each group are averaged. If tracking_list is not empty, the detected heads are matched to those in tracking_list; a detection is considered to belong to the same person if it is close enough. Old and new data are then averaged to xavg, yavg, zavg with a ratio of 3:1, where the new data are first averaged equally among themselves.
If there are any heads left after the matching process, the remaining heads are grouped, averaged, and added to tracking_list. Frames that are not updated for too long are eliminated (the person is considered to have left the area and the tracking is lost). At the end, tracking_list is published to Robot Operating System (ROS) in the form of tf frames. The fused head is published as a frame with origin at the averaged (xavg, yavg, zavg), rotated by ψz,avg around the world’s z-axis so that the frame’s x-axis points into the person’s face.
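A simplified sketch of one fusion step is given below, with ROS publishing and the stale-frame removal reduced to a comment. Detections are assumed to be (position, yaw) pairs with NumPy positions; the matching threshold and the 3:1 weighting follow the description above, while everything else, including the naive yaw averaging, is an illustrative assumption.

    import numpy as np

    DIST_THRESH = 0.5      # matching distance threshold (m)
    OLD_NEW_RATIO = 3.0    # old data weighted 3:1 against the averaged new data

    def fuse_heads(tracking_list, detections, now):
        # tracking_list entries: {'pos': np.array(3), 'yaw': float, 'stamp': float}
        remaining = set(range(len(detections)))
        for head in tracking_list:
            near = [i for i in remaining
                    if np.linalg.norm(detections[i][0] - head['pos']) < DIST_THRESH]
            if not near:
                continue
            new_pos = np.mean([detections[i][0] for i in near], axis=0)
            new_yaw = float(np.mean([detections[i][1] for i in near]))  # naive, ignores wrap
            w = OLD_NEW_RATIO / (OLD_NEW_RATIO + 1.0)
            head['pos'] = w * head['pos'] + (1 - w) * new_pos           # 3:1 old-to-new
            head['yaw'] = w * head['yaw'] + (1 - w) * new_yaw
            head['stamp'] = now
            remaining -= set(near)
        while remaining:                      # leftover detections start new tracked heads
            i = remaining.pop()
            group = [i] + [j for j in remaining
                           if np.linalg.norm(detections[j][0] - detections[i][0]) < DIST_THRESH]
            remaining -= set(group)
            tracking_list.append({'pos': np.mean([detections[j][0] for j in group], axis=0),
                                  'yaw': float(np.mean([detections[j][1] for j in group])),
                                  'stamp': now})
        # Heads whose 'stamp' is too old would be dropped here; tf publishing is omitted.
        return tracking_list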
Quadrotor fusion and tracking
From the depth images, the Kinect sensors can only detect the quadrotor’s 3-D position, not its orientation. The positions are fused to provide the quadrotor’s 3-D position in the world coordinate frame. To handle noise and false detections, the number of cameras seeing the same object and the number of quadrotors in use are exploited. The orientation of the quadrotor is obtained from the on-board IMU and later combined with the 3-D position.
The fusion process is generally similar to the fusion for humans, with some modifications. As the quadrotor is quite small, false detections are possible after background subtraction. If more than one sensor detects an object around the same position, it is more likely that the object really exists at that position. A boolean flag tracking is therefore introduced: it is set to false if only one Kinect sees the object at its first detection, and true if two or more Kinects see it. The number of detecting cameras is also used to order the quadrotors in tracking_list during the initialization.
After the tracking flag is set and the quadrotor is kept in tracking_list, it is used in the tracking process. If the flag is false and, in the next time step, more than one camera can see the object, it is considered a quadrotor and tracking becomes true. On the other hand, if no camera can see the object in the next time step, the previous detection is considered noise and it is removed from tracking_list.
The number of detected quadrotors is also limited to the number of quadrotors being used (nquad), which is generally known beforehand. The length of the list is truncated to nquad, as it is impossible to have more quadrotors than that.
The new data are matched with the tracking_list in the same way as fusion of the head. As the quadrotor tends to move faster, the ratio of old data to new data is 1:1.
After the matching process, if there are any points left, they are grouped together by distance, averaged, and added to an extra list. If there is a true flag in extra and the length of tracking_list is less than nquad, the quadrotor in extra with the true flag is added to tracking_list. If the length of tracking_list is already nquad but there is a false flag in the list, the extra quadrotor with a true flag replaces that false quadrotor. The process is repeated until there are no more true flags in extra or no false flags in tracking_list. Before publishing tracking_list to ROS’s tf topic, outdated quadrotors are removed from tracking_list.
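The promotion and replacement rules can be summarized by the following sketch, where each entry carries the boolean tracking flag described above; the data layout is an assumption made for illustration.

    N_QUAD = 1   # number of quadrotors in use, known beforehand

    def reconcile(tracking_list, extra):
        # Entries are dicts with a boolean 'tracking' flag (True when seen by >= 2 Kinects).
        for candidate in (e for e in extra if e['tracking']):
            if len(tracking_list) < N_QUAD:
                tracking_list.append(candidate)      # room left: add the confirmed object
                continue
            for idx, q in enumerate(tracking_list):
                if not q['tracking']:
                    tracking_list[idx] = candidate   # replace a doubtful (false) entry
                    break
            else:
                break                                # no room and nothing doubtful: stop
        del tracking_list[N_QUAD:]                   # never track more than N_QUAD objects
        return tracking_list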
Quadrotor control
We previously proposed the threshold algorithm,14 with experiments using one Kinect to control a quadrotor tracking a person. In this article, the same algorithm is used with multiple Kinect sensors controlling one quadrotor.
When a person enters the area, the head is detected and fused, and the goal is calculated. Instead of directly using this goal for the control, it is averaged for 1 s and then set as the goal for the quadrotor, that is, (xTd,f, ψd,f)T = (x̄Td, ψ̄d)T. A threshold (xTth, ψth)T is applied around this fixed goal position.
From this point on, the new head position and orientation are constantly used to compute a goal, while the quadrotor’s goal is kept unchanged. If the person is standing still, the new goal is only affected by noise and should not go beyond the threshold consecutively for any length of time. If the new goal breaks the threshold consecutively for 1 s, the person is recognized as moving and the quadrotor’s goal (xTd,f, ψd,f)T is recalculated; the goal-averaging process then starts again. In this way, the quadrotor’s goal does not change frequently, resulting in a less oscillating trajectory, while the quadrotor can still move to track the human face. A sudden change of the goal position can also cause a problem for the proportional integral derivative (PID) controller, as the derivative term can become very high, resulting in a large control command and instability of the control system. To reduce the effect of this surge and provide a smooth trajectory, a simple low-pass filter is applied to the goal position output from the threshold algorithm by
(xTd,q, ψd,q)[i] = w (xTd,q, ψd,q)[i − 1] + (1 − w) (xTd,f, ψd,f)[i]    (9)

where (xTd,f, ψd,f)[i] is the output of the threshold algorithm at time step i, (xTd,q, ψd,q)[i] is the filtered goal used by the PID controller to control the quadrotor at time step i, and 0 ≤ w ≤ 1 is the weight of the filter.
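The combination of the threshold algorithm and the low-pass filter can be sketched as below. The goal is packed as (x, y, z, ψ) for brevity, yaw wrap-around is ignored, and the re-averaging of the goal over 1 s after a threshold violation is omitted; these simplifications are assumptions of the sketch, not part of the original implementation.

    import numpy as np

    THRESH_POS = 0.15             # position threshold (m)
    THRESH_YAW = np.radians(15)   # yaw threshold
    W = 0.98                      # low-pass weight w of equation (9)
    HOLD_TIME = 1.0               # seconds of consecutive violation before moving the goal

    class GoalFilter:
        def __init__(self, initial_goal):
            self.fixed = np.asarray(initial_goal, float)   # held goal (x, y, z, yaw)
            self.filtered = self.fixed.copy()
            self.violation = 0.0

        def step(self, raw_goal, dt):
            raw_goal = np.asarray(raw_goal, float)
            pos_err = np.linalg.norm(raw_goal[:3] - self.fixed[:3])
            yaw_err = abs(raw_goal[3] - self.fixed[3])     # wrap-around ignored here
            if pos_err > THRESH_POS or yaw_err > THRESH_YAW:
                self.violation += dt
                if self.violation >= HOLD_TIME:
                    self.fixed = raw_goal                  # person moved: update held goal
                    self.violation = 0.0
            else:
                self.violation = 0.0
            # Equation (9): first-order low-pass toward the held goal.
            self.filtered = W * self.filtered + (1 - W) * self.fixed
            return self.filtered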
Experiment and results
To evaluate the optimization result and the tracking performance, the Kinects were set up following the optimization result and a tracking experiment was conducted. Position-tracking accuracy and the tracking success ratio were also investigated.
System and implementation
Hardware components
The system is composed of Xbox 360 Kinect sensors, a Crazyflie quadrotor, and controlling computers.
- Xbox 360 Kinect
The Xbox 360 Kinect uses a structured light pattern and triangulation for depth calculation. The IR emitter projects a constant pattern of IR light speckles onto the scene; the IR camera captures those speckles and correlates them with a memorized reference pattern to give depth values.25 Depth images are transmitted at 30 Hz with a 640 × 480 pixel resolution.
- Interference between multiple Kinect sensors

Multiple Kinects project unmodulated IR patterns into the same area and therefore detect multiple patterns, resulting in confusion and loss of depth information. A mechanical modification using a vibrating unit, made of a simple DC motor and an unbalanced load, to vibrate each Kinect was proposed by Maimone and Fuchs26 and Butler et al.27 The motion moves each Kinect’s projector and IR camera synchronously, so its own pattern remains clearly visible, while the patterns from other Kinects move with a different direction and frequency and are blurred. Altogether, the depth sensing of the Kinect is recovered.
As this mechanism is relatively easy to reproduce, it was chosen to solve the interference problem. The vibration has the drawback of blurry RGB images, but as these are not used in our processing, this does not affect the system. The other drawback is that the vibration creates a disturbing sound, possibly due to the multiple movable parts of the structures holding the cameras. Redesigning the structure by adding some cushioning and reducing excessive vibration may help reduce the noise, but this is ignored at this point.
- Small-sized quadrotor
The Crazyflie 2.0 is a tiny quadrotor built using its PCB as the frame.28 With open-source firmware and libraries, a wiki page, and an active community of developers, it is a suitable platform for development. Its tiny size and light weight make it a good choice for indoor use around people. The specification is given in Table 2.
Table 2. Specifications of Crazyflie 2.0.
Software implementation
- Quadrotor detection
Prior to the quadrotor detection, the background is created from the minimum value of each pixel over a period of time, which corresponds to the nearest observed depth. New depth images are then compared to the background for any additional objects, which are classified as a quadrotor based on the detected size, so that small noise and human beings are not detected as a quadrotor. The detected quadrotor positions are published into ROS in the form of tf frames, where they are processed further by the sensor fusion process to merge data belonging to the same quadrotor.
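A minimal sketch of this background subtraction is given below; pixel_to_metric() is a hypothetical helper that converts a pixel blob and the depth image into a metric size and a 3-D position using the camera intrinsics, and the connected-component labeling of a real implementation is reduced to a comment.

    import numpy as np

    SIZE_MIN, SIZE_MAX = 0.07, 0.13    # accepted object size, i.e. 10 +/- 3 cm

    def build_background(depth_frames):
        # Per-pixel minimum over a period of time = nearest observed depth.
        # (Zero/invalid depth readings would need masking in practice.)
        return np.min(np.stack(depth_frames), axis=0)

    def detect_quadrotor(depth, background, pixel_to_metric, min_diff=0.1):
        foreground = (background - depth) > min_diff   # anything closer than the background
        # A real implementation would label connected components (e.g. with scipy.ndimage);
        # here the whole foreground is treated as a single candidate blob for brevity.
        if not foreground.any():
            return []
        size, position = pixel_to_metric(foreground, depth)
        return [position] if SIZE_MIN <= size <= SIZE_MAX else []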
- Human skeleton tracking
- Sensor fusion and tracking
If data from different Kinects are close enough, they are considered to belong to the same person or quadrotor and are fused together by averaging. In the case of the quadrotor, if more than one Kinect can see the same object, that object is more likely to really be the quadrotor. The number of detected quadrotors is limited by the number of quadrotors being used, predefined by the user. Tracking of both the person and the quadrotor is done based on the previous data.
- Goal position setup
The fused head frame’s x-axis points into the person’s face, and its yaw angle is denoted ψ̂head. The goal position xd is defined at a distance dgoal in front of the face and hgoal above it. These values should be adjusted according to the camera’s specification and settings; for the current experiment, dgoal = 1.5 m and hgoal = 0.6 m. Assuming that the camera is installed along the quadrotor’s x-axis, rotating the quadrotor by ψ̂head around the world’s z-axis will point the camera toward the face, so we set ψd = ψ̂head. xd and ψd are then used to set the goal frame for the quadrotor.
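A sketch of this goal computation, assuming the fused head frame’s x-axis points into the face as described earlier, could look like the following.

    import numpy as np

    D_GOAL = 1.5    # distance in front of the face (m)
    H_GOAL = 0.6    # height above the head (m)

    def goal_from_head(head_pos, psi_head):
        # psi_head is the yaw of the published head frame (x-axis into the face),
        # so the direction "in front of the face" is the opposite of that axis.
        head_pos = np.asarray(head_pos, float)
        out_of_face = -np.array([np.cos(psi_head), np.sin(psi_head), 0.0])
        goal_pos = head_pos + D_GOAL * out_of_face + np.array([0.0, 0.0, H_GOAL])
        psi_d = psi_head        # quadrotor yaw equal to the head frame yaw (see text)
        return goal_pos, psi_d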
- Control system
PID controllers are used to control the quadrotor. With the quadrotor’s x-axis pointing in the forward direction, the y-axis to the left, and the z-axis upward, the pitch angle controls motion along the x-axis and the roll angle controls motion along the y-axis (Figure 13). The yaw angle ψ is the counterclockwise rotation angle about the z-axis.
The quadrotor receives a command set of roll, pitch, yaw speed, and thrust (uϕ, uθ, uψ, uT), and the on-board firmware calculates the power of each motor. The error is obtained by finding the transformation from the quadrotor’s frame to the goal frame, resulting in ex, ey, ez, and eψ, which are used to compute uθ, uϕ, uT, and uψ, respectively. For x, y, and ψ, proportional derivative (PD) controllers are used, with the same coefficients for x and y. An integral part is included for z to compensate for the power drop during the flight.
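A compact sketch of this control structure is given below. The gains, the hover-thrust offset, and the sign conventions are placeholders, not the values of Table 3.

    class AxisPID:
        # One control channel; x, y, and yaw use kp/kd only, while z also uses ki.
        def __init__(self, kp, ki, kd):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_err = 0.0

        def step(self, err, dt):
            self.integral += err * dt
            deriv = (err - self.prev_err) / dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * deriv

    def control_command(e, pid_x, pid_y, pid_z, pid_yaw, dt, hover_thrust):
        # Map the error e = (ex, ey, ez, epsi), expressed in the quadrotor frame,
        # to the Crazyflie command (u_phi, u_theta, u_psi, u_T).
        ex, ey, ez, epsi = e
        u_theta = pid_x.step(ex, dt)              # pitch drives motion along x
        u_phi = -pid_y.step(ey, dt)               # roll drives motion along y (sign assumed)
        u_T = hover_thrust + pid_z.step(ez, dt)   # thrust with the integral term for z
        u_psi = pid_yaw.step(epsi, dt)            # yaw-rate command
        return u_phi, u_theta, u_psi, u_T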
Camera placement and calibration
The method of calibrating the relative position and orientation between two cameras was used. Another camera was added as the reference for the calibration, with its FOV intersecting the other cameras’ FOVs. Its orientation was also set to be simple, that is, with all rotation angles at zero, to minimize possible error in setting up the reference, compared with setting it up at any specific non-zero angle of one of the five cameras.
In ROS, there is a package, camera_pose_calibration,33 for extrinsic camera calibration. To calibrate the depth sensors, the images from the IR camera are used, with the IR pattern projector covered. The calibration process involves holding a checkerboard at different locations and outputs each camera’s position and orientation.
The calibration was done one by one between each camera and the reference camera. For each camera, after the coarse placement, its relative position and orientation were calibrated. If the deviation from the optimization result exceeded the error limit assumed in the optimization (i.e. 2° for angles and 0.03 m for position), the placement was adjusted and the calibration repeated until the error was within the limit. Figure 14 shows the Kinects set in place.
Experimental setup
An experiment with one moving person was set up to test the constructed system. A path was created to cover the space as shown in Figure 15, where the blue dots are the approximate positions (marked on the floor for the person to stand on), the black arrows show the approximate facing directions, and the dashed red line shows the order of movement with the movement numbers. The path was designed such that the quadrotor’s goal would not fall outside the area, with some margin to allow for errors. The path ends in the center, where the person rotates and stops at around …, −π, π/2, and 0 rad, respectively, before finishing at −π/2 rad.
Safety concerns were not explicitly dealt with in this implementation. The goal position of the quadrotor was set 0.6 m above the height of the person and 1.5 m away in the horizontal direction to reduce the risk of collision, but there was no limit on the closest distance the quadrotor could come to the person.
For better evaluation, the motion capture system by Motion Analysis34 was used to obtain more accurate positions and orientations. As the two systems were run separately, time synchronization was done manually. Our motion capture setup uses 12 “Kestrel Digital RealTime System” units, with 3 units at each corner of the area. Each unit can record up to 2048 × 1088 pixels at 300 fps and can provide the position of each marker with millimeter-level accuracy. The output of the position of each marker was set at 120 Hz in the experiment.
For the head’s position and rotation, the person wore a cap with four markers. On the quadrotor, four markers were attached to the tips of the wings, with another marker in front.
The quadrotor’s detection size was set to (10 ± 3) × 10−2 m. The PID controllers’ coefficients are given in Table 3. Threshold values of 0.15 m and 15° were applied, with the weight w in equation (9) set to 0.98. The displacement threshold for quadrotor fusion and head fusion was set to 0.5 m.
Table 3. PID controllers for each dimension.
Results
Tracking of person’s position
Tracking and controlling of quadrotor
The quadrotor’s 2-D position tracked by the motion capture system (blue solid line) and by the proposed system (dashed red line) is plotted in Figure 18. This shows that the experiment involved controlling the quadrotor throughout the area. There were some moments in which the quadrotor went beyond the limits, but the system still managed to bring it back into the region.
Human-tracking result
Figure 19 shows snapshots of the tracking experiment. Graphs comparing the quadrotor’s position and orientation with the goal set up from the person’s position are plotted in Figure 20. The dotted black line indicates the thresholded goal position and orientation, the dashed red line indicates the quadrotor’s fused position and orientation, and the blue solid line shows the position and orientation tracked by the motion capture system. The average absolute errors, comparing the motion capture result to the goal, were 0.23 m (SD 0.17 m) for the 3-D position and 0.15 rad (SD 0.13 rad) for the rotation angle. A video of the tracking experiment can be found in the supplementary file.