Добавлено 04 June 2016 - 07:19
В помощь бамбукам в программировании (типа меня) решил выложить кусок кода по кластеризации с помощью К-средних.
Кто мучает зад в екселе может пригодится. Суть примерно такова на пальцах---есть у нас какие-то характеристики чего-то там ХЗ ...допустим команд или теннисистов и прочих...истов.
для примера покажу на 2-ух характеристиках (это первичные голоспособности или можно сказать МО голов в матче для дом-ком. и гостя)
вот типа такая таблица в екселе (кусок начала)
N G1 G2
1 1,437578815 2,162629758
2 2,398831873 0,824742268
3 0,858085809 0,520833333
4 1,759259259 1,126760563
5 0,903387704 1,058965102
6 1,259445844 1,941457587
7 1,861252115 0,641025641
8 1,457541191 0,871080139
9 1,048832272 1,28113879
10 1,896774194 1,413427562
11 3,485424588 1,26984127
12 0,696517413 1,297577855
13 1,070528967 1,573426573
итд.
Для макроса надо указать диапазон --указываем в месте с номером записи (напротив этого номера макрос потом вернёт на новом листе номер кластера и номера можно будет перенести в таблицу ретро данных при желании)...кроме того что собственно нам и надо (так как номера мы по идее потом и сами сможем вычислить)
вернёт центройды кластеров которые можно забить в другой макрос или встроенную пользовательскую функцию для вычисления номера текущей пары (которой в БД ретро нет)....
Короче для примера этого три столбика указываем макросу вместе с названиями столбцов (там выскакивает форма и в ней укажем).
вернёт вот так
Row Title Centroid
1 22
2 25
3 3
4 4
5 9
6 6
7 7
8 30
9 24
10 4
11 11
12 12
13 31
14 25
Итд
………
И центройды
G1 G2
Centroid 1 1,803281661 2,724977774
Centroid 2 2,531964172 0,462462353
Centroid 3 0,766622408 0,538307801
Centroid 4 1,817694831 1,258800239
Centroid 5 0,613161496 0,929501544
Centroid 6 1,143208693 1,950110175
Centroid 7 1,987655371 0,48704115
Centroid 8 1,329178087 0,768057358
Centroid 9 0,860811549 1,179025478
Centroid 10 2,125100911 1,700579006
Centroid 11 3,297781929 1,183288055
Centroid 12 0,56341852 1,44774447
Centroid 13 0,676079708 2,109946984
Centroid 14 2,700693921 0,881581638
Centroid 15 1,138309219 0,5153572
Centroid 16 2,561895645 1,338201546
Centroid 17 2,823474519 1,906330779
Centroid 18 1,138822493 2,399598699
Centroid 19 1,287469701 1,623651071
Centroid 20 1,908009282 0,896526437
Centroid 21 1,668020355 0,699246295
Centroid 22 1,649808326 2,059822317
Centroid 23 0,984029117 0,837128352
Centroid 24 1,103769405 1,359760858
Centroid 25 2,272613552 0,774294554
Centroid 26 2,191999879 1,176660469
Centroid 27 0,745261784 2,844384237
Centroid 28 1,426667124 1,283051046
Centroid 29 4,312170305 0,776953678
Centroid 30 1,543347751 1,000982379
Centroid 31 0,874060958 1,642373355
Centroid 32 1,64611422 1,58585708
Centroid 33 1,059077433 3,82291582
Centroid 34 1,495604132 0,439407432
Centroid 35 1,199098998 1,05285298
Centroid 36 3,2132208 0,573182037
Можно конечно больше параметров использовать для ваших нужд(это просто пример)
Считает долго но вроде работает нормально (качество деления ХЗ как проверить) но вот по приведённому примеру сами смотрите (вроде приемлемо)...
далее уже для кластеров можно строить например нейросети или регрессионные модели или вводить ограничения на другие параметры которые допустим не участвовали в кластеризации (но надо конечно следить за объёмом записей в кластерах--чтоб они были не очень маленькие желательно чтоб было больше 300-400 записей...но возможно некоторые сойдут и со 100-200 в виду их уникальности)
вот сам код (нашёл на просторах нета)
' k-Means Cluster Analysis
'---------------------------------------------------------------------------------------
Private Type Records
Dimension() As Double
Distance() As Double
Cluster As Integer
End Type
Dim Table As Range
Dim Record() As Records
Dim Centroid() As Records
Sub Run()
'Run k-Means
If Not kMeansSelection Then
Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")
End If
End Sub
Function kMeansSelection() As Boolean
'Get user table selection
On Error Resume Next
Set Table = Application.InputBox(Prompt:= _
"Please select the range to analyse.", _
Title:="Specify Range", Type:=8)
If Table Is Nothing Then Exit Function 'Cancelled
'Check table dimensions
If Table.Rows.Count < 4 Or Table.Columns.Count < 2 Then
Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns."
End If
'Get number of clusters
Dim numClusters As Integer
numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1)
If Not numClusters > 0 Or numClusters = False Then
Exit Function 'Cancelled
End If
If Err.Number = 0 Then
If kMeans(Table, numClusters) Then
outputClusters
End If
End If
kMeansSelection_Error:
kMeansSelection = (Err.Number = 0)
End Function
Function kMeans(Table As Range, Clusters As Integer) As Boolean
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)
'Clusters - Number of clusters to reduce records into.
On Error Resume Next
'Script Performance Variables
Dim PassCounter As Integer
'Initialize Data Arrays
ReDim Record(2 To Table.Rows.Count)
Dim r As Integer 'record
Dim d As Integer 'dimension index
Dim d2 As Integer 'dimension index
Dim c As Integer 'centroid index
Dim c2 As Integer 'centroid index
Dim di As Integer 'distance
Dim x As Double 'Variable Distance Placeholder
Dim y As Double 'Variable Distance Placeholder
For r = LBound(Record) To UBound(Record)
'Initialize Dimension Value Arrays
ReDim Record®.Dimension(2 To Table.Columns.Count)
'Initialize Distance Arrays
ReDim Record®.Distance(1 To Clusters)
For d = LBound(Record®.Dimension) To UBound(Record®.Dimension)
Record®.Dimension(d) = Table.Rows®.Cells(d).Value
Next d
Next r
'Initialize Initial Centroid Arrays
ReDim Centroid(1 To Clusters)
Dim uniqueCentroid As Boolean
For c = LBound(Centroid) To UBound(Centroid)
'Initialize Centroid Dimension Depth
ReDim Centroid©.Dimension(2 To Table.Columns.Count)
'Initialize record index to next record
r = LBound(Record) + c - 2
Do ' Loop to ensure new centroid is unique
r = r + 1 'Increment record index throughout loop to find unique record to use as a centroid
'Assign record dimensions to centroid
For d = LBound(Centroid©.Dimension) To UBound(Centroid©.Dimension)
Centroid©.Dimension(d) = Record®.Dimension(d)
Next d
uniqueCentroid = True
For c2 = LBound(Centroid) To c - 1
'Loop Through Record Dimensions and check if all are the same
x = 0
y = 0
For d2 = LBound(Centroid©.Dimension) To _
UBound(Centroid©.Dimension)
x = x + Centroid©.Dimension(d2) ^ 2
y = y + Centroid(c2).Dimension(d2) ^ 2
Next d2
uniqueCentroid = Not Sqr(x) = Sqr(y)
If Not uniqueCentroid Then Exit For
Next c2
Loop Until uniqueCentroid
Next c
'Calculate Distances from Centroids
Dim lowestDistance As Double
Dim lastCluster As Integer
Dim ClustersStable As Boolean
Do 'While Clusters are not Stable
PassCounter = PassCounter + 1
ClustersStable = True 'Until Proved otherwise
'Loop Through Records
For r = LBound(Record) To UBound(Record)
lastCluster = Record®.Cluster
lowestDistance = 0 'Reset lowest distance
'Loop through record distances to centroids
For c = LBound(Centroid) To UBound(Centroid)
'======================================================
' Calculate Elucidean Distance
'======================================================
' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
'------------------------------------------------------
' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2
' d(p,q) = X
x = 0
y = 0
'Loop Through Record Dimensions
For d = LBound(Record®.Dimension) To _
UBound(Record®.Dimension)
y = Record®.Dimension(d) - Centroid©.Dimension(d)
y = y ^ 2
x = x + y
Next d
x = Sqr(x) 'Get square root
'If distance to centroid is lowest (or first pass) assign record to centroid cluster.
If c = LBound(Centroid) Or x < lowestDistance Then
lowestDistance = x
'Assign distance to centroid to record
Record®.Distance© = lowestDistance
'Assign record to centroid
Record®.Cluster = c
End If
Next c
'Only change if true
If ClustersStable Then ClustersStable = Record®.Cluster = lastCluster
Next r
'Move Centroids to calculated cluster average
For c = LBound(Centroid) To UBound(Centroid) 'For every cluster
'Loop through cluster dimensions
For d = LBound(Centroid©.Dimension) To _
UBound(Centroid©.Dimension)
Centroid©.Cluster = 0 'Reset nunber of records in cluster
Centroid©.Dimension(d) = 0 'Reset centroid dimensions
'Loop Through Records
For r = LBound(Record) To UBound(Record)
'If Record is in Cluster then
If Record®.Cluster = c Then
'Use to calculate avg dimension for records in cluster
'Add to number of records in cluster
Centroid©.Cluster = Centroid©.Cluster + 1
'Add record dimension to cluster dimension for later division
Centroid©.Dimension(d) = Centroid©.Dimension(d) + _
Record®.Dimension(d)
End If
Next r
'Assign Average Dimension Distance
Centroid©.Dimension(d) = Centroid©.Dimension(d) / _
Centroid©.Cluster
Next d
Next c
Loop Until ClustersStable
kMeans = (Err.Number = 0)
End Function
Function outputClusters() As Boolean
Dim c As Integer 'Centroid Index
Dim r As Integer 'Row Index
Dim d As Integer 'Dimension Index
Dim oSheet As Worksheet
On Error Resume Next
Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)
'Loop Through Records
Dim rowNumber As Integer
rowNumber = 1
'Output Headings
With oSheet.Rows(rowNumber)
With .Cells(1)
.Value = "Row Title"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
With .Cells(2)
.Value = "Centroid"
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
End With
'Print by Row
rowNumber = rowNumber + 1 'Blank Row
For r = LBound(Record) To UBound(Record)
oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows®.Cells(1).Value
oSheet.Rows(rowNumber).Cells(2).Value = Record®.Cluster
rowNumber = rowNumber + 1
Next r
'Print Centroids - Headings
rowNumber = rowNumber + 1
For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)
With oSheet.Rows(rowNumber).Cells(d)
.Value = Table.Rows(1).Cells(d).Value
.Font.Bold = True
.HorizontalAlignment = xlCenter
End With
Next d
'Print Centroids
rowNumber = rowNumber + 1
For c = LBound(Centroid) To UBound(Centroid)
With oSheet.Rows(rowNumber).Cells(1)
.Value = "Centroid " & c
.Font.Bold = True
End With
'Loop through cluster dimensions
For d = LBound(Centroid©.Dimension) To UBound(Centroid©.Dimension)
oSheet.Rows(rowNumber).Cells(d).Value = Centroid©.Dimension(d)
Next d
rowNumber = rowNumber + 1
Next c
oSheet.Columns.AutoFit '//AutoFit columns to contents
outputClusters_Error:
outputClusters = (Err.Number = 0)
End Function
Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet
On Error Resume Next
'// If a Workbook wasn't specified, use the active workbook
If Workbook Is Nothing Then Set Workbook = ActiveWorkbook
Dim Num As Integer
'// If a worksheet(s) exist with the same name, add/increment a number after the name
While WorksheetExists(Name, Workbook)
Num = Num + 1
If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))
Name = Name & " (" & Num & ")"
Wend
'//Add a sheet to the workbook
Set addWorksheet = Workbook.Worksheets.Add
'//Name the sheet
addWorksheet.Name = Name
End Function
Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean
On Error Resume Next
WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")
On Error GoTo 0
End Function