I am trying to do some k-means clustering on a very large matrix.
The matrix is approximately 500,000 rows x 4,000 columns but very sparse (only a couple of "1" values per row).
The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.
Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.
Many thanks
13
The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function (bigkmeans).
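For illustration, here is a minimal sketch of what that might look like, assuming the bigmemory and biganalytics packages are installed. The toy data, its dimensions, and the choice of three clusters are made up for the example and are not from the question. Note that a big.matrix is a dense structure (optionally file-backed), so this route addresses the memory limit rather than exploiting sparsity.

```r
library(bigmemory)
library(biganalytics)

set.seed(1)

# Hypothetical toy data: 900 rows in three shifted groups, 5 columns.
# The real 500,000 x 4,000 matrix would more likely be a file-backed big.matrix.
m <- matrix(rnorm(900 * 5), ncol = 5) + rep(c(0, 5, 10), each = 300)
x <- as.big.matrix(m)

# k-means on the big.matrix; the result is structured like the output of kmeans()
fit <- bigkmeans(x, centers = 3, iter.max = 20, nstart = 2)

print(table(fit$cluster))
print(fit$centers)
```

For data that does not fit into RAM, the same call should work on a file-backed matrix, but I have not checked how well it scales to the sizes in the question.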
1
Please check:
library(foreign)
?read.arff
Cheers.
1
sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should work well for matrices that R can handle, i.e. ones that fit into memory.
http://cran.r-project.org/web/packages/sparcl/sparcl.pdf
==
For really big matrices, I would try a solution with Apache Spark sparse matrices and MLlib, though I do not know how experimental it still is:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$
https://spark.apache.org/docs/latest/mllib-clustering.html
0
There's a special SparseM package for R that can hold it efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
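To make the sparse-storage idea concrete, here is a rough sketch. It uses the Matrix package (shipped with standard R installations) rather than SparseM, purely because that is the one I can vouch for. The function sparse_kmeans is a hypothetical helper written for this example, not part of any package, and the toy data is invented. It runs plain Lloyd iterations and gets squared distances from a sparse-times-dense product, so the data matrix is never converted to dense form.

```r
library(Matrix)

set.seed(42)

# Hypothetical toy data: 600 rows, 30 columns, exactly two 1s per row.
# Rows in group g have their 1s inside the g-th block of 10 columns.
n <- 600; p <- 30
group <- rep(1:3, each = 200)
i <- rep(seq_len(n), each = 2)
j <- unlist(lapply(group, function(g) sample(((g - 1) * 10 + 1):(g * 10), 2)))
X <- sparseMatrix(i = i, j = j, x = rep(1, length(i)), dims = c(n, p))

# Minimal Lloyd-style k-means that keeps X sparse (illustrative helper only)
sparse_kmeans <- function(X, k, iter_max = 20) {
  # start from k randomly chosen rows; the centers are small, so dense is fine
  centers <- as.matrix(X[sample(nrow(X), k), , drop = FALSE])
  x_sq <- Matrix::rowSums(X^2)
  cluster <- integer(nrow(X))
  for (it in seq_len(iter_max)) {
    # ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * x.c, giving a dense n x k matrix
    cross <- as.matrix(X %*% t(centers))
    d2 <- outer(x_sq, rowSums(centers^2), "+") - 2 * cross
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    for (cl in seq_len(k)) {
      idx <- which(cluster == cl)
      # an empty cluster keeps its previous center
      if (length(idx) > 0) {
        centers[cl, ] <- Matrix::colMeans(X[idx, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

fit <- sparse_kmeans(X, k = 3)
print(table(fit$cluster, group))
```

Only the n x k distance matrix and the k x p centers are dense here, which at 500,000 rows and a handful of clusters should be manageable. A single random start can converge to a poor solution, so in practice one would probably run it several times and keep the best result.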