As part of the Audio Analytic recruitment and interview process we ask candidates to write a short code sample based on the sorts of coding problems that we regularly encounter. Here we would like you to write a short program in C to implement the unsupervised K-means clustering algorithm. We have provided a data set to be clustered; you will write a program to cluster the data and return to us the program and the program’s output.
Although the problem is based on the K-means algorithm we have simplified the implementation for the purposes of the coding test; we recommend that you follow the specification in order to get the expected output.
The input data file (click here to download) is a text file containing comma separated 2D points, each number being a floating point number with one point specified per line (the last line is blank) – the file is written with Unix style line endings (i.e. a single ‘\n’ character at the end of each line), there is no other white space in the file. The input file contains approximately 10,000 points to be classified.
You will perform an unsupervised clustering for K=5 clusters, the initialisation of the cluster centres is listed in the table below:
The K-means algorithm that you should use is explained in the next section. When the clustering terminates you should have determined the assignment of each of the data points from the input data set and the clustering error metric.
The output file should be a plain text file that contains the following output, the first line will have “error = ” followed by the error metric printed to 3 decimal places (i.e. the printf format would be “error = %.3f”). Each of the following lines will be the cluster name for the input points in the same order as it appeared in the input data file.
The program should be run and its output dumped into a text file called “OUTPUT.TXT”. You should then put into a standard archive (.tar.gz/bz2, .zip, .rar etc) the source code for your program, a build script and your “OUTPUT.TXT” file. If you have any additional comments you wish to add to your submission please put them in a file called “COMMENTS.TXT”. Please then email the archive by replying to the last email you have from us and paste the first 10 lines of the output into the email body.
The K-means algorithm is an iterative 2-step expectation-maximisation algorithm. The following two steps are run repeatedly until no additional changes occur; this algorithm ensures that the square of the error metric will always decrease on every step (until convergence is achieved). The initial coordinates of the cluster centres should be taken from Table 1. Expectation Step
In the expectation step each input data point is assigned to the closest cluster centroid measured using the Euclidian distance metric (i.e. dist = SQRT([x-Cx]2 + [y-Cy]2 ) ). The total error metric should also be accumulated as the sum of the distances of each point to its assigned cluster centroid.
In the maximisation step the total error metric is minimised by adjusting the cluster centroids to the centre of the data points assigned to it. The new centroid of the cluster should be found by computing the mean of the point coordinates in each dimension (i.e. x and y coordinates are treated independently).
Convergence of the solution is achieved when the expectation step does not change the assignment of any data points; this can also be detected by checking if the total error metric does not change from the previous iteration.
A few more points
- You may hard-code any values that you like (except the output!);
- You may use any libraries that you wish to, but you must tell us which libraries those are;
- The cluster centroid initialisations is specified, there is no need to initialise the centroids for this exercise;
- We will be looking for code efficiency, cleanness and maintainability.