k-means clustering in R on a very large, sparse matrix?


I am trying to do some k-means clustering on a very large matrix.

The matrix is approximately 500,000 rows by 4,000 columns, but it is very sparse (only a couple of "1" values per row).

The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.

Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.

Many thanks

4 solutions

#1


13  

The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains a k-means function (bigkmeans).
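As a rough illustration of that suggestion, here is a minimal sketch on toy data. The matrix sizes, the seed, and the choice of initial centers are assumptions made up for the example, not part of the original answer. Note that a big.matrix is dense: at 8 bytes per cell, 500,000 x 4,000 values would need about 16 GB, so for the real data a file-backed matrix (for example via read.big.matrix with a backingfile, possibly with a narrower storage type) would likely be needed.

```r
library(bigmemory)
library(biganalytics)

# Toy stand-in for the real data: 1000 rows x 20 columns,
# with at most two "1" entries per row (hypothetical sizes).
set.seed(1)
n_rows <- 1000
n_cols <- 20
dense <- matrix(0, nrow = n_rows, ncol = n_cols)
ones <- cbind(rep(seq_len(n_rows), times = 2),
              sample(n_cols, 2 * n_rows, replace = TRUE))
dense[ones] <- 1

# Convert to an in-memory big.matrix. For data that does not fit in RAM,
# read.big.matrix() with a backing file would be the analogous step.
x <- as.big.matrix(dense)

# Use five distinct rows as starting centers so that no two
# initial centers coincide (an illustrative choice).
init <- unique(dense)[1:5, ]

fit <- bigkmeans(x, centers = init, iter.max = 20)

# Cluster sizes
print(table(fit$cluster))
```

The result has the familiar k-means components (cluster assignments and centers), so downstream code written for stats::kmeans output should need little change.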

#2


1  

Please check:

library(foreign)
?read.arff

Cheers.

#3


1  

sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should be good for matrices that R can handle (that is, ones that fit into memory).

http://cran.r-project.org/web/packages/sparcl/sparcl.pdf

==

For really big matrices, I would try a solution using Apache Spark sparse matrices and MLlib, though I do not know how experimental it is at this point:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

https://spark.apache.org/docs/latest/mllib-clustering.html

#4


0  

There's a dedicated SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
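To show what holding the data sparsely can look like, here is a small sketch. It uses the Matrix package (which ships with standard R installations) rather than SparseM, so it is a swapped-in alternative and not the package named above. The triplet data, sizes, and seed are hypothetical.

```r
library(Matrix)

# Hypothetical data: (row, column) positions of the "1" entries,
# about two per row, as they might be parsed from a CSV or sparse ARFF file.
set.seed(42)
n_rows <- 1000
n_cols <- 50
i <- rep(seq_len(n_rows), each = 2)
j <- sample(n_cols, length(i), replace = TRUE)

# sparseMatrix() sums duplicated (i, j) pairs by default, so drop
# duplicates first to keep the entries as 0/1 indicators.
ij <- unique(cbind(i, j))

m <- sparseMatrix(i = ij[, 1], j = ij[, 2], x = 1,
                  dims = c(n_rows, n_cols))

# Only the non-zero entries are stored.
cat("non-zero entries:", nnzero(m), "\n")
cat("sparse size (bytes):", as.numeric(object.size(m)), "\n")
cat("dense size would be (bytes):", n_rows * n_cols * 8, "\n")
```

Keep in mind that stats::kmeans expects a dense numeric matrix, so a sparse object like this would have to be converted (and fit in memory as dense) or passed to a clustering routine that accepts sparse input.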
