mirror of
https://github.moeyy.xyz/https://github.com/trekhleb/javascript-algorithms.git
synced 2024-12-27 15:41:16 +08:00
Simplify k-Means clustering algorithm.
This commit is contained in:
parent
b7cd425ce9
commit
569fd95bd0
@ -147,7 +147,7 @@ a set of rules that precisely define a sequence of operations.
|
|||||||
* **Machine Learning**
|
* **Machine Learning**
|
||||||
* `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
|
* `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
|
||||||
* `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
|
* `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
|
||||||
* `B` [k-Means](src/algorithms/ml/kmeans) - k-Means clustering algorithm
|
* `B` [k-Means](src/algorithms/ml/k-means) - k-Means clustering algorithm
|
||||||
* **Uncategorized**
|
* **Uncategorized**
|
||||||
* `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
|
* `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
|
||||||
* `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm
|
* `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm
|
||||||
|
@ -1,10 +1,10 @@
|
|||||||
# k-Means Algorithm
|
# k-Means Algorithm
|
||||||
|
|
||||||
The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimentions of vectors.
|
The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimensions of vectors.
|
||||||
|
|
||||||
In k-Means classification, the output is a set of classess asssigned to each vector. Each cluster location is continously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
|
In k-Means classification, the output is a set of classes assigned to each vector. Each cluster location is continuously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
|
||||||
|
|
||||||
The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used mostly for this task.
|
The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/math/euclidean-distance) is used mostly for this task.
|
||||||
|
|
||||||
![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)
|
![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)
|
||||||
|
|
||||||
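As a rough illustration of the distance measure mentioned above, here is a minimal sketch of Euclidean distance between two plain number arrays. The `distance` name and its error message are assumptions made for this example; it is not the repository's `euclideanDistance` helper, which this commit calls with matrix-shaped arguments.

```javascript
// Hypothetical helper for illustration only (not the repository's implementation).
// Computes the straight-line distance between two vectors of equal length.
function distance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sumOfSquares = 0;
  for (let i = 0; i < a.length; i += 1) {
    // Accumulate the squared difference along each dimension.
    sumOfSquares += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sumOfSquares);
}

// A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
console.log(distance([0, 0], [3, 4])); // 5
```

A smaller distance means a data point is more similar to a cluster center, which is what the reassignment step relies on.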
@ -13,9 +13,9 @@ _Image source: [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)_
|
|||||||
The algorithm is as follows:
|
The algorithm is as follows:
|
||||||
|
|
||||||
1. Check for errors like invalid/inconsistent data
|
1. Check for errors like invalid/inconsistent data
|
||||||
2. Initialize the k cluster locations with initial/random k points
|
2. Initialize the `k` cluster locations with initial/random `k` points
|
||||||
3. Calculate the distance of each data point from each cluster
|
3. Calculate the distance of each data point from each cluster
|
||||||
4. Assign the cluster label of each data point equal to that of the cluster at it's minimum distance
|
4. Assign the cluster label of each data point equal to that of the cluster at its minimum distance
|
||||||
5. Calculate the centroid of each cluster based on the data points it contains
|
5. Calculate the centroid of each cluster based on the data points it contains
|
||||||
6. Repeat steps 3-5 until the centroid locations stop changing
|
6. Repeat steps 3-5 until the centroid locations stop changing
|
||||||
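The assignment and centroid steps listed above can be sketched as two small functions. This is only an illustration under stated assumptions: the names `squaredDistance`, `assignClusters`, and `updateCentroids` are hypothetical, squared distance is used because it gives the same nearest cluster as the plain distance, and every cluster is assumed to hold at least one point.

```javascript
// Hypothetical helpers for illustration; not the code from this commit.

// Squared Euclidean distance; enough for comparing which centroid is closer.
function squaredDistance(a, b) {
  return a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);
}

// Steps 3-4: label each point with the index of its nearest centroid.
// Ties go to the lowest cluster index.
function assignClusters(points, centroids) {
  return points.map((point) => {
    let best = 0;
    for (let c = 1; c < centroids.length; c += 1) {
      if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
        best = c;
      }
    }
    return best;
  });
}

// Step 5: move each centroid to the mean of the points labelled with it.
// Assumes no cluster is empty; an empty cluster would produce NaN here.
function updateCentroids(points, labels, k) {
  const dim = points[0].length;
  const sums = Array.from({ length: k }, () => Array(dim).fill(0));
  const counts = Array(k).fill(0);
  points.forEach((point, i) => {
    counts[labels[i]] += 1;
    point.forEach((value, d) => {
      sums[labels[i]][d] += value;
    });
  });
  return sums.map((sum, c) => sum.map((value) => value / counts[c]));
}

const points = [[0, 0], [0, 1], [10, 10]];
const labels = assignClusters(points, [[0, 0], [0, 1]]); // [0, 1, 1]
const centroids = updateCentroids(points, labels, 2); // [[0, 0], [5, 5.5]]
console.log(assignClusters(points, centroids)); // [0, 0, 1]
```

Running the two functions in a loop until the labels stop changing gives the full procedure.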
|
|
||||||
@ -23,9 +23,9 @@ Here is a visualization of k-Means clustering for better understanding:
|
|||||||
|
|
||||||
![k-Means Visualization](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)
|
![k-Means Visualization](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)
|
||||||
|
|
||||||
_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)_
|
_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)_
|
||||||
|
|
||||||
The centroids are moving continously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between itrations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
|
The centroids are moving continuously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between iterations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
|
||||||
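One way to quantify "the difference in centroids between iterations" is to measure how far the centroids moved. The sketch below is an assumption-laden illustration: `maxCentroidShift`, `hasConverged`, and the `tolerance` default are hypothetical and do not appear in this commit, which instead stops when no data point changes its class.

```javascript
// Hypothetical helpers for illustration; not part of this commit.

// Returns the largest Euclidean distance any single centroid moved
// between two consecutive iterations.
function maxCentroidShift(previous, current) {
  let maxShift = 0;
  for (let c = 0; c < previous.length; c += 1) {
    let sumOfSquares = 0;
    for (let d = 0; d < previous[c].length; d += 1) {
      sumOfSquares += (current[c][d] - previous[c][d]) ** 2;
    }
    maxShift = Math.max(maxShift, Math.sqrt(sumOfSquares));
  }
  return maxShift;
}

// Treats the run as converged once no centroid moved more than `tolerance`.
function hasConverged(previous, current, tolerance = 1e-6) {
  return maxCentroidShift(previous, current) <= tolerance;
}

console.log(maxCentroidShift([[0, 0], [1, 1]], [[3, 4], [1, 1]])); // 5
console.log(hasConverged([[1, 2]], [[1, 2]])); // true
```

A tolerance-based check like this can stop earlier than waiting for the assignments to become fully stable, at the cost of picking a threshold.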
|
|
||||||
## References
|
## References
|
||||||
|
|
@ -1,36 +1,40 @@
|
|||||||
import kMeans from '../kmeans';
|
import KMeans from '../kMeans';
|
||||||
|
|
||||||
describe('kMeans', () => {
|
describe('kMeans', () => {
|
||||||
it('should throw an error on invalid data', () => {
|
it('should throw an error on invalid data', () => {
|
||||||
expect(() => {
|
expect(() => {
|
||||||
kMeans();
|
KMeans();
|
||||||
}).toThrowError('Either dataSet or labels or toClassify were not set');
|
}).toThrowError('The data is empty');
|
||||||
});
|
});
|
||||||
|
|
||||||
it('should throw an error on inconsistent data', () => {
|
it('should throw an error on inconsistent data', () => {
|
||||||
expect(() => {
|
expect(() => {
|
||||||
kMeans([[1, 2], [1]], 2);
|
KMeans([[1, 2], [1]], 2);
|
||||||
}).toThrowError('Inconsistent vector lengths');
|
}).toThrowError('Matrices have different shapes');
|
||||||
});
|
});
|
||||||
|
|
||||||
it('should find the nearest neighbour', () => {
|
it('should find the nearest neighbour', () => {
|
||||||
const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
|
const data = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
|
||||||
const k = 2;
|
const k = 2;
|
||||||
const expectedCluster = [0, 1, 0, 1, 1, 0, 1];
|
const expectedClusters = [0, 1, 0, 1, 1, 0, 1];
|
||||||
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
|
expect(KMeans(data, k)).toEqual(expectedClusters);
|
||||||
|
|
||||||
|
expect(KMeans([[0, 0], [0, 1], [10, 10]], 2)).toEqual(
|
||||||
|
[0, 0, 1],
|
||||||
|
);
|
||||||
});
|
});
|
||||||
|
|
||||||
it('should find the clusters with equal distances', () => {
|
it('should find the clusters with equal distances', () => {
|
||||||
const dataSet = [[0, 0], [1, 1], [2, 2]];
|
const dataSet = [[0, 0], [1, 1], [2, 2]];
|
||||||
const k = 3;
|
const k = 3;
|
||||||
const expectedCluster = [0, 1, 2];
|
const expectedCluster = [0, 1, 2];
|
||||||
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
|
expect(KMeans(dataSet, k)).toEqual(expectedCluster);
|
||||||
});
|
});
|
||||||
|
|
||||||
it('should find the nearest neighbour in 3D space', () => {
|
it('should find the nearest neighbour in 3D space', () => {
|
||||||
const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
|
const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
|
||||||
const k = 2;
|
const k = 2;
|
||||||
const expectedCluster = [1, 1, 0];
|
const expectedCluster = [1, 1, 0];
|
||||||
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
|
expect(KMeans(dataSet, k)).toEqual(expectedCluster);
|
||||||
});
|
});
|
||||||
});
|
});
|
85
src/algorithms/ml/k-means/kMeans.js
Normal file
@ -0,0 +1,85 @@
|
|||||||
|
import * as mtrx from '../../math/matrix/Matrix';
|
||||||
|
import euclideanDistance from '../../math/euclidean-distance/euclideanDistance';
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Assigns each data point to one of k clusters using the k-Means algorithm.
|
||||||
|
*
|
||||||
|
* @param {number[][]} data - array of dataSet points, i.e. [[0, 1], [3, 4], [5, 7]]
|
||||||
|
* @param {number} k - number of clusters
|
||||||
|
* @return {number[]} - the cluster index assigned to each data point
|
||||||
|
*/
|
||||||
|
export default function KMeans(
|
||||||
|
data,
|
||||||
|
k = 1,
|
||||||
|
) {
|
||||||
|
if (!data) {
|
||||||
|
throw new Error('The data is empty');
|
||||||
|
}
|
||||||
|
|
||||||
|
// Assign k clusters locations equal to the location of initial k points.
|
||||||
|
const dataDim = data[0].length;
|
||||||
|
const clusterCenters = data.slice(0, k);
|
||||||
|
|
||||||
|
// Continue optimization till convergence.
|
||||||
|
// Centroids should not be moving once optimized.
|
||||||
|
// Calculate distance of each candidate vector from each cluster center.
|
||||||
|
// Assign cluster number to each data vector according to minimum distance.
|
||||||
|
|
||||||
|
// Matrix of distance from each data point to each cluster centroid.
|
||||||
|
const distances = mtrx.zeros([data.length, k]);
|
||||||
|
|
||||||
|
// Vector data points' classes. The value of -1 means that no class has been assigned yet.
|
||||||
|
const classes = Array(data.length).fill(-1);
|
||||||
|
|
||||||
|
let iterate = true;
|
||||||
|
while (iterate) {
|
||||||
|
iterate = false;
|
||||||
|
|
||||||
|
// Calculate and store the distance of each data point from each cluster.
|
||||||
|
for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
|
||||||
|
for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
|
||||||
|
distances[dataIndex][clusterIndex] = euclideanDistance(
|
||||||
|
[clusterCenters[clusterIndex]],
|
||||||
|
[data[dataIndex]],
|
||||||
|
);
|
||||||
|
}
|
||||||
|
// Assign the closest cluster number to each dataSet point.
|
||||||
|
const closestClusterIdx = distances[dataIndex].indexOf(
|
||||||
|
Math.min(...distances[dataIndex]),
|
||||||
|
);
|
||||||
|
|
||||||
|
// Check if data point class has been changed and we still need to re-iterate.
|
||||||
|
if (classes[dataIndex] !== closestClusterIdx) {
|
||||||
|
iterate = true;
|
||||||
|
}
|
||||||
|
|
||||||
|
classes[dataIndex] = closestClusterIdx;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Recalculate cluster centroid values via all dimensions of the points under it.
|
||||||
|
for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
|
||||||
|
// Reset cluster center coordinates since we need to recalculate them.
|
||||||
|
clusterCenters[clusterIndex] = Array(dataDim).fill(0);
|
||||||
|
let clusterSize = 0;
|
||||||
|
for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
|
||||||
|
if (classes[dataIndex] === clusterIndex) {
|
||||||
|
// Register one more data point of current cluster.
|
||||||
|
clusterSize += 1;
|
||||||
|
for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
|
||||||
|
// Add data point coordinates to the cluster center coordinates.
|
||||||
|
clusterCenters[clusterIndex][dimensionIndex] += data[dataIndex][dimensionIndex];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// Calculate the average for each cluster center coordinate.
|
||||||
|
for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
|
||||||
|
clusterCenters[clusterIndex][dimensionIndex] = parseFloat(Number(
|
||||||
|
clusterCenters[clusterIndex][dimensionIndex] / clusterSize,
|
||||||
|
).toFixed(2));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Return the clusters assigned.
|
||||||
|
return classes;
|
||||||
|
}
|
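One edge case worth noting in the centroid update above: if a cluster ends up with no points, `clusterSize` is `0` and the division yields `NaN`. The sketch below shows one possible guard, keeping the previous center when a cluster is empty. The `recomputeCenter` name and the fallback policy are assumptions for illustration, not part of this commit; other policies (such as re-seeding the center from a data point) are equally plausible.

```javascript
// Hypothetical helper (not part of the commit): average the points of one
// cluster, falling back to the previous center when the cluster is empty.
function recomputeCenter(clusterPoints, previousCenter) {
  if (clusterPoints.length === 0) {
    // Avoid 0 / 0 = NaN by leaving the center where it was.
    return previousCenter.slice();
  }
  const dim = previousCenter.length;
  const center = Array(dim).fill(0);
  for (const point of clusterPoints) {
    for (let d = 0; d < dim; d += 1) {
      center[d] += point[d];
    }
  }
  // Divide each coordinate sum by the number of points in the cluster.
  return center.map((sum) => sum / clusterPoints.length);
}

console.log(recomputeCenter([[0, 0], [2, 4]], [9, 9])); // [1, 2]
console.log(recomputeCenter([], [1, 2])); // [1, 2]
```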
@ -1,98 +0,0 @@
|
|||||||
/**
|
|
||||||
* Calculates calculate the euclidean distance between 2 vectors.
|
|
||||||
*
|
|
||||||
* @param {number[]} x1
|
|
||||||
* @param {number[]} x2
|
|
||||||
* @returns {number}
|
|
||||||
*/
|
|
||||||
function euclideanDistance(x1, x2) {
|
|
||||||
// Checking for errors.
|
|
||||||
if (x1.length !== x2.length) {
|
|
||||||
throw new Error('Inconsistent vector lengths');
|
|
||||||
}
|
|
||||||
// Calculate the euclidean distance between 2 vectors and return.
|
|
||||||
let squaresTotal = 0;
|
|
||||||
for (let i = 0; i < x1.length; i += 1) {
|
|
||||||
squaresTotal += (x1[i] - x2[i]) ** 2;
|
|
||||||
}
|
|
||||||
return Number(Math.sqrt(squaresTotal).toFixed(2));
|
|
||||||
}
|
|
||||||
/**
|
|
||||||
* Classifies the point in space based on k-nearest neighbors algorithm.
|
|
||||||
*
|
|
||||||
* @param {number[][]} dataSet - array of dataSet points, i.e. [[0, 1], [3, 4], [5, 7]]
|
|
||||||
* @param {number} k - number of nearest neighbors which will be taken into account (preferably odd)
|
|
||||||
* @return {number[]} - the class of the point
|
|
||||||
*/
|
|
||||||
export default function kMeans(
|
|
||||||
dataSetm,
|
|
||||||
k = 1,
|
|
||||||
) {
|
|
||||||
const dataSet = dataSetm;
|
|
||||||
if (!dataSet) {
|
|
||||||
throw new Error('Either dataSet or labels or toClassify were not set');
|
|
||||||
}
|
|
||||||
|
|
||||||
// starting algorithm
|
|
||||||
// assign k clusters locations equal to the location of initial k points
|
|
||||||
const clusterCenters = [];
|
|
||||||
const nDim = dataSet[0].length;
|
|
||||||
for (let i = 0; i < k; i += 1) {
|
|
||||||
clusterCenters[clusterCenters.length] = Array.from(dataSet[i]);
|
|
||||||
}
|
|
||||||
|
|
||||||
// continue optimization till convergence
|
|
||||||
// centroids should not be moving once optimized
|
|
||||||
// calculate distance of each candidate vector from each cluster center
|
|
||||||
// assign cluster number to each data vector according to minimum distance
|
|
||||||
let flag = true;
|
|
||||||
while (flag) {
|
|
||||||
flag = false;
|
|
||||||
// calculate and store distance of each dataSet point from each cluster
|
|
||||||
for (let i = 0; i < dataSet.length; i += 1) {
|
|
||||||
for (let n = 0; n < k; n += 1) {
|
|
||||||
dataSet[i][nDim + n] = euclideanDistance(clusterCenters[n], dataSet[i].slice(0, nDim));
|
|
||||||
}
|
|
||||||
|
|
||||||
// assign the cluster number to each dataSet point
|
|
||||||
const sliced = dataSet[i].slice(nDim, nDim + k);
|
|
||||||
let minmDistCluster = Math.min(...sliced);
|
|
||||||
for (let j = 0; j < sliced.length; j += 1) {
|
|
||||||
if (minmDistCluster === sliced[j]) {
|
|
||||||
minmDistCluster = j;
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (dataSet[i].length !== nDim + k + 1) {
|
|
||||||
flag = true;
|
|
||||||
dataSet[i][nDim + k] = minmDistCluster;
|
|
||||||
} else if (dataSet[i][nDim + k] !== minmDistCluster) {
|
|
||||||
flag = true;
|
|
||||||
dataSet[i][nDim + k] = minmDistCluster;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
// recalculate cluster centriod values via all dimensions of the points under it
|
|
||||||
for (let i = 0; i < k; i += 1) {
|
|
||||||
clusterCenters[i] = Array(nDim).fill(0);
|
|
||||||
let classCount = 0;
|
|
||||||
for (let j = 0; j < dataSet.length; j += 1) {
|
|
||||||
if (dataSet[j][dataSet[j].length - 1] === i) {
|
|
||||||
classCount += 1;
|
|
||||||
for (let n = 0; n < nDim; n += 1) {
|
|
||||||
clusterCenters[i][n] += dataSet[j][n];
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
for (let n = 0; n < nDim; n += 1) {
|
|
||||||
clusterCenters[i][n] = Number((clusterCenters[i][n] / classCount).toFixed(2));
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
// return the clusters assigned
|
|
||||||
const soln = [];
|
|
||||||
for (let i = 0; i < dataSet.length; i += 1) {
|
|
||||||
soln.push(dataSet[i][dataSet[i].length - 1]);
|
|
||||||
}
|
|
||||||
return soln;
|
|
||||||
}
|
|