k-means clustering in R on very large, sparse matrix?

Original post by movingabout, 2010/06/14. Tags: cluster, Mat, sparse-matrix, cluster-analysis, k-means, matrix

I am trying to do some k-means clustering on a very large matrix.

The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row).

The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.

Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.

Many thanks

4 Answers

#1

The bigmemory package (now a family of packages; see their website) used k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function, bigkmeans.
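
A minimal sketch of what that could look like, assuming the bigmemory and biganalytics packages are installed. The toy data, the variable names, and the choice of starting centers are illustrative assumptions, not part of the original answer. Note that a big.matrix stores every cell, so this approach works around the memory limit (for example with a file-backed matrix) rather than exploiting sparsity.

```r
library(bigmemory)
library(biganalytics)

set.seed(42)
# Toy data (assumption): two well-separated groups of 100 rows each.
x <- rbind(matrix(rnorm(100 * 5, mean = 0), ncol = 5),
           matrix(rnorm(100 * 5, mean = 10), ncol = 5))

# Copy into a big.matrix; for data larger than RAM a file-backed
# big.matrix would be used instead of this in-memory one.
bx <- as.big.matrix(x, type = "double")

# Start from one row of each group so the toy run is deterministic.
fit <- bigkmeans(bx, centers = x[c(1, 101), ], iter.max = 20)

# Cluster sizes; fit has the same structure as a stats::kmeans result.
table(fit$cluster)
```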

#2

Please check:

library(foreign)
?read.arff

Cheers.
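
If read.arff turns out not to accept the sparse layout, one fallback is to parse the data rows by hand. The sketch below assumes the Matrix package, purely numeric values, and the sparse ARFF convention of `{index value, ...}` rows with zero-based attribute indices. `read_sparse_arff_rows` and its arguments are hypothetical names, not part of foreign or any other package.

```r
library(Matrix)

# Parse sparse ARFF data rows such as "{0 1, 7 1}" into a sparse matrix.
# Assumptions: numeric values only, zero-based attribute indices.
read_sparse_arff_rows <- function(lines, n_cols) {
  lines <- trimws(lines)
  # Keep only sparse data rows; header lines start with "@" and are dropped.
  lines <- lines[grepl("^\\{.*\\}$", lines)]
  body <- gsub("^\\{|\\}$", "", lines)

  # Split each row into "index value" pairs, remembering how many per row.
  pairs <- strsplit(body, ",", fixed = TRUE)
  counts <- lengths(pairs)
  parts <- strsplit(trimws(unlist(pairs)), "[[:space:]]+")

  idx <- as.integer(vapply(parts, `[`, character(1), 1))
  val <- as.numeric(vapply(parts, `[`, character(1), 2))

  sparseMatrix(i = rep(seq_along(lines), counts),
               j = idx + 1L,               # ARFF indices start at 0, R at 1
               x = val,
               dims = c(length(lines), n_cols))
}

# Tiny in-memory example instead of a real file.
lines <- c("@relation toy", "@data", "{0 1, 3 1}", "{2 1}")
m <- read_sparse_arff_rows(lines, n_cols = 4)
```

For a real file, the lines would come from readLines() on the ARFF path, and n_cols from the number of @attribute declarations in the header.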

#3

sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should be good for matrices that are suitable for R (that is, ones that fit into memory).

http://cran.r-project.org/web/packages/sparcl/sparcl.pdf

= =

For really big matrices, I would try a solution with Apache Spark sparse matrices and MLlib, though I do not know how experimental it is now:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

https://spark.apache.org/docs/latest/mllib-clustering.html

#4

There's a special SparseM package for R that can hold it efficiently. If that doesn't work, I would try going to a higher-performance language, like C.
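
To illustrate the idea of keeping the data sparse all the way through clustering, here is a rough sketch. It swaps in the Matrix package rather than SparseM, only because Matrix ships with standard R installations. It also uses a hand-written Lloyd iteration, since stats::kmeans coerces its input with as.matrix and would densify the data. `sparse_kmeans`, its arguments, and the toy matrix are all hypothetical; only the k x p matrix of centers and an n x k distance matrix are held densely.

```r
library(Matrix)

# Lloyd's k-means on a sparse matrix x (rows = observations).
# init: row indices used as starting centers (assumption: chosen by caller).
sparse_kmeans <- function(x, k, init = sample.int(nrow(x), k), iter_max = 20) {
  n <- nrow(x)
  centers <- as.matrix(x[init, , drop = FALSE])   # dense k x p
  x_sq <- rowSums(x^2)                            # |x_i|^2, computed sparsely
  cluster <- integer(n)

  for (iter in seq_len(iter_max)) {
    # Squared distance: |x|^2 - 2 x.c + |c|^2, without densifying x.
    cross <- as.matrix(x %*% t(centers))          # dense n x k
    c_sq <- matrix(rowSums(centers^2), n, k, byrow = TRUE)
    d2 <- x_sq - 2 * cross + c_sq
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break        # assignments stable
    cluster <- new_cluster

    # Sparse n x k membership matrix gives per-cluster sums in one product.
    ind <- sparseMatrix(i = seq_len(n), j = cluster, x = rep(1, n),
                        dims = c(n, k))
    sizes <- colSums(ind)
    sums <- as.matrix(crossprod(ind, x))          # dense k x p
    keep <- sizes > 0                             # empty clusters keep old center
    centers[keep, ] <- sums[keep, , drop = FALSE] / sizes[keep]
  }
  list(cluster = cluster, centers = centers)
}

# Toy binary matrix: rows 1-50 use columns 1-3, rows 51-100 use columns 8-10.
x <- sparseMatrix(i = rep(1:100, each = 3),
                  j = c(rep(1:3, 50), rep(8:10, 50)),
                  x = rep(1, 300), dims = c(100, 10))
fit <- sparse_kmeans(x, k = 2, init = c(1, 51))
```

For the matrix in the question, the dense n x k distance matrix is the main memory cost, so it would probably need to be computed in row blocks for larger k.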
