Mapping pupil position to screen targets

This post describes the relationships between the different elements of my eye-tracking interface set-up. I go through the geometric model that I plan to use to constrain my system.

To use eye tracking to interact with a computer, it is necessary to map the location of the pupil in an image (from a head-mounted camera, for example) to the fixation point of gaze on a screen. Most systems require calibration to account for variations in the positions of the cameras, the user and the screen. To calibrate the system you gather a set of training examples in which the user is asked to fixate on a set of points on the screen. A model is then fitted to this data so that gaze towards arbitrary points on the screen can be accurately determined.

Figure 1 - The set up

Figure 2 - the vectors

Figure 1 shows the main elements we consider in the tracking. We label these elements as shown in Figure 2. They are:

  • O is the origin of the system, which is centred on the fixed camera below the screen.
  • S_o is the origin of the screen relative to the camera origin O.
  • the vectors S_x and S_y are the basis vectors for the screen; they correspond to the width and height of a single pixel in the real world.
  • Y is the location of the target on the screen that the user is looking at.
  • H is the centre of the head target. The head target is a plane with four LEDs on it which are tracked by the fixed camera.
  • E is the centre of the eye which is being tracked.
  • the vector vec{EG} is a unit vector pointing in the direction that the eye is looking.
  • vec{EY} is the vector from the eye to the target.

It is useful to think of the system in two frames of reference. There is the frame of reference of the fixed camera and the frame of reference of the head.

Fixed camera frame of reference

In the fixed camera’s frame of reference we have three fixed but unknown vectors: S_o, S_x and S_y. Once these are known we can express the target vector Y^{(i)} as follows:

Y^{(i)} = S_o + t_x^{(i)}S_x + t_y^{(i)}S_y

where t_x^{(i)}, t_y^{(i)} are the screen pixel coordinates of the target for training example i.
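As a concrete sketch of this equation in numpy (all the geometry here is made up purely for illustration, with the camera at the origin):

```python
import numpy as np

# Hypothetical screen geometry, in metres, in the fixed-camera frame.
S_o = np.array([-0.25, 0.10, 0.60])   # screen origin relative to the camera
S_x = np.array([0.0005, 0.0, 0.0])    # one pixel's width in the world
S_y = np.array([0.0, -0.0005, 0.0])   # one pixel's height (screen y grows downward)

def screen_to_camera(t_x, t_y):
    """Map screen pixel coordinates (t_x, t_y) to a 3D point Y in the camera frame:
    Y = S_o + t_x * S_x + t_y * S_y."""
    return S_o + t_x * S_x + t_y * S_y

# The centre of a 1920x1080 screen under these made-up numbers.
Y = screen_to_camera(960, 540)
```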

The relationship between the head and the fixed camera frame of reference is described by the following equation

k = R^{(i)}(K - T^{(i)})

Where k in mathbb{R}^3 is a point in the head frame of reference and K in mathbb{R}^3 is the same point in the fixed-camera frame. R^{(i)} is a rotation matrix and T^{(i)} is the translation vector vec{OH}^{(i)}. To go back the other way you just need to use:

K = R^{prime (i)} k + T^{(i)}

Where R^{prime} is the transpose of R and hence the inverse rotation.
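A quick numpy sketch of the pair of transforms, using a made-up rotation (90 degrees about z) and translation:

```python
import numpy as np

def camera_to_head(K, R, T):
    """k = R (K - T): a camera-frame point K expressed in the head frame."""
    return R @ (K - T)

def head_to_camera(k, R, T):
    """K = R' k + T: the inverse transform, using the transpose of R
    as the inverse rotation."""
    return R.T @ k + T

# Made-up head pose: rotated 90 degrees about z, offset from the camera.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([0.1, 0.2, 0.5])

K = np.array([0.3, 0.2, 0.9])
k = camera_to_head(K, R, T)
K_back = head_to_camera(k, R, T)   # round trip recovers the original point
```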

The head frame of reference

We assume that the eye and the eye-camera are fixed in the head frame of reference. The position of the eye relative to the head target is fixed in the head coordinate space – we’ll call this translation o. Thus

E=R^{prime (i)}o + vec{OH}^{(i)}

We assume that there exists a mapping from the image of the pupil from the eye-camera to the vector vec{EG} (more on this in a future post).

  • p^{(i)} = (p_x^{(i)}, p_y^{(i)}) is the location of the centre of the pupil in the eye-camera image for the i^{th} training example.
  • g(p^{(i)}, theta) is the mapping from the pupil position in the eye image to the unit vector vec{EG} relative to the head, parameterised by the vector theta.
  • to get vec{EG} in the fixed-camera coordinates we need to transform it as follows:

vec{EG} = R^{prime (i)} g(p^{(i)}, theta)
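The real form of g is left for a future post, so the sketch below uses an entirely made-up stand-in (pupil coordinates mapping linearly to yaw/pitch angles) just to show how E and vec{EG} are assembled in the camera frame; none of the numbers or the parameterisation of theta come from the actual system:

```python
import numpy as np

def g(p, theta):
    """Stand-in for the pupil-to-gaze mapping: pupil coordinates map
    linearly to yaw/pitch angles, which define a unit direction vector.
    The real form of g is deliberately left open."""
    yaw = theta[0] * p[0] + theta[1]
    pitch = theta[2] * p[1] + theta[3]
    d = np.array([np.sin(yaw),
                  np.sin(pitch) * np.cos(yaw),
                  np.cos(pitch) * np.cos(yaw)])
    return d / np.linalg.norm(d)

def gaze_in_camera_frame(p, theta, o, R, T):
    """E = R' o + T and EG = R' g(p, theta), both in the fixed-camera frame."""
    E = R.T @ o + T
    EG = R.T @ g(p, theta)
    return E, EG

# Made-up values purely for illustration.
theta = np.array([0.01, 0.0, 0.01, 0.0])
p = np.array([10.0, 20.0])        # pupil centre in eye-camera pixels
R = np.eye(3)                     # head aligned with the camera
T = np.array([0.0, 0.0, 0.3])    # head target 30 cm from the camera
o = np.array([0.0, 0.05, 0.02])  # eye offset from the head target
E, EG = gaze_in_camera_frame(p, theta, o, R, T)
```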

The intercept of the gaze and the target

Consider a user looking at a target on a screen. We can consider their gaze as the line defined by the position of their eye E and the direction in which they are looking, vec{EG}. We consider the screen as a plane defined by its origin S_o and the in-plane basis vectors S_x and S_y.

Thus the line of gaze can be expressed parametrically as:

P(s) = E + svec{EG}

The plane can be defined by

n cdot (p - S_o) = 0

where p is a point on the plane and n = S_x times S_y is the normal to the plane. Provided the line of gaze is not parallel to the plane of the screen, we can plug P(s) into the definition of the plane:

  • n cdot (E + s vec{EG} - S_o) = 0
  • n cdot(E - S_o) + n cdot(s vec{EG}) = 0
  • s = frac{n cdot ( S_o - E) }{n cdot (vec{EG})}

Finally we have that

  • Y^{(i)} = S_o + t_x^{(i)} S_x + t_y^{(i)}S_y = E +frac{n cdot (S_o - E)}{n cdot (vec{EG}) } vec{EG}
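The ray–plane intersection above can be sketched in numpy as follows (the screen geometry and gaze here are made up; a head-on view 60 cm from the eye):

```python
import numpy as np

def gaze_screen_intersection(E, EG, S_o, S_x, S_y):
    """Intersect the gaze ray P(s) = E + s * EG with the screen plane
    defined by origin S_o and normal n = S_x x S_y."""
    n = np.cross(S_x, S_y)
    denom = n @ EG
    if abs(denom) < 1e-12:
        raise ValueError("gaze is parallel to the screen plane")
    s = (n @ (S_o - E)) / denom
    return E + s * EG

# Made-up geometry: screen 60 cm in front of the eye, viewed head-on.
S_o = np.array([0.0, 0.0, 0.6])
S_x = np.array([1e-3, 0.0, 0.0])
S_y = np.array([0.0, 1e-3, 0.0])
E = np.array([0.0, 0.0, 0.0])
EG = np.array([0.0, 0.0, 1.0])    # looking straight down the z axis
Y = gaze_screen_intersection(E, EG, S_o, S_x, S_y)
```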

The unknowns

In order to fit this model we need some training data. These data are obtained by having a user look at a series of targets on the screen and storing the following for each presentation i:

  • p^{(i)} = (p_x^{(i)}, p_y^{(i)}) : the pupil position
  • vec{OH}^{(i)} : the translation of the head from the camera
  • R^{(i)} : the rotation of the head relative to the camera
  • (t_x^{(i)}, t_y^{(i)}) : the screen pixel coordinates of the i^{th} target

The remaining unknown constants are:

  • o : the translation of the eye relative to the head
  • S_o : the origin of the screen relative to the camera
  • (S_x, S_y) : the screen basis vectors
  • theta : the parameter vector for mapping pupil position to the unit eye vector.

I will talk about obtaining these unknowns in a future post.

Once the unknowns have been determined we can predict the screen target position given the pupil position and the head’s position and orientation. We can calculate t_x and t_y independently as follows:

Assuming S_x and S_y are orthogonal, we take the dot product with each basis vector and normalise:

  • t_x = frac{S_x cdot left(E - S_o+frac{n cdot (S_o - E)}{n cdot (vec{EG})}vec{EG}right)}{S_x cdot S_x}
  • t_y = frac{S_y cdot left(E - S_o+frac{n cdot (S_o - E)}{n cdot (vec{EG})}vec{EG}right)}{S_y cdot S_y}
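Putting the pieces together, the prediction step can be sketched in numpy as below. This assumes the screen basis vectors S_x and S_y are orthogonal (hence the normalisation by S_x cdot S_x and S_y cdot S_y), and all the geometry is made up for illustration:

```python
import numpy as np

def predict_pixel(E, EG, S_o, S_x, S_y):
    """Recover (t_x, t_y) from the gaze/screen intersection, assuming the
    screen basis vectors S_x and S_y are orthogonal."""
    n = np.cross(S_x, S_y)
    s = (n @ (S_o - E)) / (n @ EG)
    d = E + s * EG - S_o               # intersection point relative to S_o
    t_x = (S_x @ d) / (S_x @ S_x)
    t_y = (S_y @ d) / (S_y @ S_y)
    return t_x, t_y

# Made-up geometry: build a gaze ray toward pixel (960, 540) and recover it.
S_o = np.array([-0.25, 0.10, 0.60])
S_x = np.array([0.0005, 0.0, 0.0])
S_y = np.array([0.0, -0.0005, 0.0])
E = np.array([0.0, 0.0, 0.0])
target = S_o + 960 * S_x + 540 * S_y   # pixel (960, 540) in the world
EG = target / np.linalg.norm(target)   # unit gaze direction toward it
t_x, t_y = predict_pixel(E, EG, S_o, S_x, S_y)
```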