mirror of https://github.moeyy.xyz/https://github.com/trekhleb/javascript-algorithms.git
synced 2024-11-10 11:09:43 +08:00

Simplify k-Means clustering algorithm.

parent b7cd425ce9
commit 569fd95bd0
@@ -147,7 +147,7 @@ a set of rules that precisely define a sequence of operations.
 * **Machine Learning**
   * `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
   * `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
-  * `B` [k-Means](src/algorithms/ml/kmeans) - k-Means clustering algorithm
+  * `B` [k-Means](src/algorithms/ml/k-means) - k-Means clustering algorithm
 * **Uncategorized**
   * `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
   * `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm
@@ -1,10 +1,10 @@
 # k-Means Algorithm
 
-The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimentions of vectors.
+The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimensions of vectors.
 
-In k-Means classification, the output is a set of classess asssigned to each vector. Each cluster location is continously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
+In k-Means classification, the output is a set of classes assigned to each vector. Each cluster location is continuously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
 
-The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used mostly for this task.
+The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/math/euclidean-distance) is used mostly for this task.
 
 ![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)
 
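The README hunk above relies on Euclidean distance as the similarity measure. A minimal sketch of that calculation for two plain vectors follows. The `distance` helper is hypothetical and is not the repository's `euclideanDistance`, which appears to take matrices.

```javascript
// Hypothetical helper for illustration only.
// Returns the straight-line distance between two equal-length vectors.
function distance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sumOfSquares = 0;
  for (let i = 0; i < a.length; i += 1) {
    // Accumulate the squared difference in every dimension.
    sumOfSquares += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sumOfSquares);
}

// A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
const d = distance([0, 0], [3, 4]);
```

Because only the relative order of distances matters when picking the nearest cluster, the square root could be skipped in that step. It is kept here to match the definition.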
@@ -13,9 +13,9 @@ _Image source: [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)_
 The algorithm is as follows:
 
 1. Check for errors like invalid/inconsistent data
-2. Initialize the k cluster locations with initial/random k points
+2. Initialize the `k` cluster locations with initial/random `k` points
 3. Calculate the distance of each data point from each cluster
-4. Assign the cluster label of each data point equal to that of the cluster at it's minimum distance
+4. Assign the cluster label of each data point equal to that of the cluster at its minimum distance
 5. Calculate the centroid of each cluster based on the data points it contains
 6. Repeat steps 3-5 until the centroid locations stop changing
 
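One pass of the assignment and update steps listed in the hunk above can be sketched as two small functions. This is an illustrative sketch, not the code from this commit. The names `assignLabels` and `updateCentroids` are hypothetical, and leaving an empty cluster's centroid in place is an assumption.

```javascript
// Squared Euclidean distance is enough to decide which centroid is nearest.
function squaredDistance(a, b) {
  return a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);
}

// Steps 3-4: label every point with the index of its nearest centroid.
function assignLabels(points, centroids) {
  return points.map((point) => {
    let best = 0;
    for (let c = 1; c < centroids.length; c += 1) {
      if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
        best = c;
      }
    }
    return best;
  });
}

// Step 5: move each centroid to the mean of the points assigned to it.
function updateCentroids(points, labels, centroids) {
  return centroids.map((centroid, c) => {
    const members = points.filter((_, i) => labels[i] === c);
    // Assumption: a cluster with no members keeps its previous centroid.
    if (members.length === 0) return centroid;
    return centroid.map((_, d) => members.reduce((sum, p) => sum + p[d], 0) / members.length);
  });
}

const points = [[0, 0], [0, 1], [10, 10]];
const initial = [[0, 0], [0, 1]]; // step 2: use the first k points
const labels = assignLabels(points, initial); // [0, 1, 1]
const moved = updateCentroids(points, labels, initial); // [[0, 0], [5, 5.5]]
```

Running `assignLabels(points, moved)` once more yields `[0, 0, 1]`, after which the labels no longer change.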
@@ -23,9 +23,9 @@ Here is a visualization of k-Means clustering for better understanding:
 
 ![KNN Visualization 1](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)
 
-_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)_
+_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)_
 
-The centroids are moving continously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between itrations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
+The centroids are moving continuously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between iterations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
 
 ## References
 
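The shrinking centroid movement described in the hunk above suggests an alternative stopping rule: stop once no centroid moves more than a small tolerance. The sketch below is hypothetical. The implementation in this commit stops when point labels stop changing instead, and the `tolerance` default is an arbitrary assumption.

```javascript
// Largest distance that any single centroid moved between two iterations.
function maxCentroidShift(previous, current) {
  const shifts = previous.map((prev, c) => {
    const sumOfSquares = prev.reduce((sum, value, d) => sum + (value - current[c][d]) ** 2, 0);
    return Math.sqrt(sumOfSquares);
  });
  return Math.max(...shifts);
}

// Hypothetical stopping rule: converged when every centroid moved at most `tolerance`.
function hasConverged(previous, current, tolerance = 1e-3) {
  return maxCentroidShift(previous, current) <= tolerance;
}

const before = [[0, 0], [5, 5.5]];
const after = [[0, 0.5], [10, 10]];
const stillMoving = !hasConverged(before, after); // true: the second centroid moved far
const settled = hasConverged(after, after); // true: nothing moved
```

A tolerance-based rule can end the loop earlier than waiting for identical labels, at the cost of choosing a threshold that suits the scale of the data.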
@@ -1,36 +1,40 @@
-import kMeans from '../kmeans';
+import KMeans from '../kMeans';
 
 describe('kMeans', () => {
   it('should throw an error on invalid data', () => {
     expect(() => {
-      kMeans();
-    }).toThrowError('Either dataSet or labels or toClassify were not set');
+      KMeans();
+    }).toThrowError('The data is empty');
   });
 
   it('should throw an error on inconsistent data', () => {
     expect(() => {
-      kMeans([[1, 2], [1]], 2);
-    }).toThrowError('Inconsistent vector lengths');
+      KMeans([[1, 2], [1]], 2);
+    }).toThrowError('Matrices have different shapes');
   });
 
   it('should find the nearest neighbour', () => {
-    const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
+    const data = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
     const k = 2;
-    const expectedCluster = [0, 1, 0, 1, 1, 0, 1];
-    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
+    const expectedClusters = [0, 1, 0, 1, 1, 0, 1];
+    expect(KMeans(data, k)).toEqual(expectedClusters);
+
+    expect(KMeans([[0, 0], [0, 1], [10, 10]], 2)).toEqual(
+      [0, 0, 1],
+    );
   });
 
   it('should find the clusters with equal distances', () => {
     const dataSet = [[0, 0], [1, 1], [2, 2]];
     const k = 3;
     const expectedCluster = [0, 1, 2];
-    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
+    expect(KMeans(dataSet, k)).toEqual(expectedCluster);
   });
 
   it('should find the nearest neighbour in 3D space', () => {
     const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
     const k = 2;
     const expectedCluster = [1, 1, 0];
-    expect(kMeans(dataSet, k)).toEqual(expectedCluster);
+    expect(KMeans(dataSet, k)).toEqual(expectedCluster);
   });
 });
85
src/algorithms/ml/k-means/kMeans.js
Normal file
@@ -0,0 +1,85 @@
+import * as mtrx from '../../math/matrix/Matrix';
+import euclideanDistance from '../../math/euclidean-distance/euclideanDistance';
+
+/**
+ * Classifies the point in space based on k-Means algorithm.
+ *
+ * @param {number[][]} data - array of dataSet points, i.e. [[0, 1], [3, 4], [5, 7]]
+ * @param {number} k - number of clusters
+ * @return {number[]} - the class of the point
+ */
+export default function KMeans(
+  data,
+  k = 1,
+) {
+  if (!data) {
+    throw new Error('The data is empty');
+  }
+
+  // Assign k clusters locations equal to the location of initial k points.
+  const dataDim = data[0].length;
+  const clusterCenters = data.slice(0, k);
+
+  // Continue optimization till convergence.
+  // Centroids should not be moving once optimized.
+  // Calculate distance of each candidate vector from each cluster center.
+  // Assign cluster number to each data vector according to minimum distance.
+
+  // Matrix of distance from each data point to each cluster centroid.
+  const distances = mtrx.zeros([data.length, k]);
+
+  // Vector data points' classes. The value of -1 means that no class has been assigned yet.
+  const classes = Array(data.length).fill(-1);
+
+  let iterate = true;
+  while (iterate) {
+    iterate = false;
+
+    // Calculate and store the distance of each data point from each cluster.
+    for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
+      for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
+        distances[dataIndex][clusterIndex] = euclideanDistance(
+          [clusterCenters[clusterIndex]],
+          [data[dataIndex]],
+        );
+      }
+      // Assign the closest cluster number to each dataSet point.
+      const closestClusterIdx = distances[dataIndex].indexOf(
+        Math.min(...distances[dataIndex]),
+      );
+
+      // Check if data point class has been changed and we still need to re-iterate.
+      if (classes[dataIndex] !== closestClusterIdx) {
+        iterate = true;
+      }
+
+      classes[dataIndex] = closestClusterIdx;
+    }
+
+    // Recalculate cluster centroid values via all dimensions of the points under it.
+    for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
+      // Reset cluster center coordinates since we need to recalculate them.
+      clusterCenters[clusterIndex] = Array(dataDim).fill(0);
+      let clusterSize = 0;
+      for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
+        if (classes[dataIndex] === clusterIndex) {
+          // Register one more data point of current cluster.
+          clusterSize += 1;
+          for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
+            // Add data point coordinates to the cluster center coordinates.
+            clusterCenters[clusterIndex][dimensionIndex] += data[dataIndex][dimensionIndex];
+          }
+        }
+      }
+      // Calculate the average for each cluster center coordinate.
+      for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
+        clusterCenters[clusterIndex][dimensionIndex] = parseFloat(Number(
+          clusterCenters[clusterIndex][dimensionIndex] / clusterSize,
+        ).toFixed(2));
+      }
+    }
+  }
+
+  // Return the clusters assigned.
+  return classes;
+}
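In the centroid update of the new file above, each coordinate sum is divided by `clusterSize`. If a cluster receives no points, that division is `0 / 0`, which is `NaN` in JavaScript. One possible guard is sketched below. The `averageCenter` helper and its fallback to the previous center are assumptions for illustration, not part of this commit.

```javascript
// Hypothetical helper: averages the summed coordinates of one cluster and
// rounds to two decimals. Falls back to the previous center for an empty cluster.
function averageCenter(coordinateSums, clusterSize, previousCenter) {
  if (clusterSize === 0) {
    // Assumption: keep the old center rather than producing NaN coordinates.
    return previousCenter.slice();
  }
  return coordinateSums.map((sum) => Number((sum / clusterSize).toFixed(2)));
}

const unguarded = 0 / 0; // NaN: what an empty cluster produces without a guard
const guarded = averageCenter([0, 0], 0, [1, 2]); // [1, 2]
const averaged = averageCenter([10, 11], 2, [0, 0]); // [5, 5.5]
```

Other strategies exist, such as re-seeding an empty cluster from a data point. Which one fits depends on how the caller wants degenerate inputs handled.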
@@ -1,98 +0,0 @@
-/**
- * Calculates calculate the euclidean distance between 2 vectors.
- *
- * @param {number[]} x1
- * @param {number[]} x2
- * @returns {number}
- */
-function euclideanDistance(x1, x2) {
-  // Checking for errors.
-  if (x1.length !== x2.length) {
-    throw new Error('Inconsistent vector lengths');
-  }
-  // Calculate the euclidean distance between 2 vectors and return.
-  let squaresTotal = 0;
-  for (let i = 0; i < x1.length; i += 1) {
-    squaresTotal += (x1[i] - x2[i]) ** 2;
-  }
-  return Number(Math.sqrt(squaresTotal).toFixed(2));
-}
-/**
- * Classifies the point in space based on k-nearest neighbors algorithm.
- *
- * @param {number[][]} dataSet - array of dataSet points, i.e. [[0, 1], [3, 4], [5, 7]]
- * @param {number} k - number of nearest neighbors which will be taken into account (preferably odd)
- * @return {number[]} - the class of the point
- */
-export default function kMeans(
-  dataSetm,
-  k = 1,
-) {
-  const dataSet = dataSetm;
-  if (!dataSet) {
-    throw new Error('Either dataSet or labels or toClassify were not set');
-  }
-
-  // starting algorithm
-  // assign k clusters locations equal to the location of initial k points
-  const clusterCenters = [];
-  const nDim = dataSet[0].length;
-  for (let i = 0; i < k; i += 1) {
-    clusterCenters[clusterCenters.length] = Array.from(dataSet[i]);
-  }
-
-  // continue optimization till convergence
-  // centroids should not be moving once optimized
-  // calculate distance of each candidate vector from each cluster center
-  // assign cluster number to each data vector according to minimum distance
-  let flag = true;
-  while (flag) {
-    flag = false;
-    // calculate and store distance of each dataSet point from each cluster
-    for (let i = 0; i < dataSet.length; i += 1) {
-      for (let n = 0; n < k; n += 1) {
-        dataSet[i][nDim + n] = euclideanDistance(clusterCenters[n], dataSet[i].slice(0, nDim));
-      }
-
-      // assign the cluster number to each dataSet point
-      const sliced = dataSet[i].slice(nDim, nDim + k);
-      let minmDistCluster = Math.min(...sliced);
-      for (let j = 0; j < sliced.length; j += 1) {
-        if (minmDistCluster === sliced[j]) {
-          minmDistCluster = j;
-          break;
-        }
-      }
-
-      if (dataSet[i].length !== nDim + k + 1) {
-        flag = true;
-        dataSet[i][nDim + k] = minmDistCluster;
-      } else if (dataSet[i][nDim + k] !== minmDistCluster) {
-        flag = true;
-        dataSet[i][nDim + k] = minmDistCluster;
-      }
-    }
-    // recalculate cluster centriod values via all dimensions of the points under it
-    for (let i = 0; i < k; i += 1) {
-      clusterCenters[i] = Array(nDim).fill(0);
-      let classCount = 0;
-      for (let j = 0; j < dataSet.length; j += 1) {
-        if (dataSet[j][dataSet[j].length - 1] === i) {
-          classCount += 1;
-          for (let n = 0; n < nDim; n += 1) {
-            clusterCenters[i][n] += dataSet[j][n];
-          }
-        }
-      }
-      for (let n = 0; n < nDim; n += 1) {
-        clusterCenters[i][n] = Number((clusterCenters[i][n] / classCount).toFixed(2));
-      }
-    }
-  }
-  // return the clusters assigned
-  const soln = [];
-  for (let i = 0; i < dataSet.length; i += 1) {
-    soln.push(dataSet[i][dataSet[i].length - 1]);
-  }
-  return soln;
-}