I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
我有一个大的,稀疏的二进制矩阵(大约39000 x 14000;大多数行只有一个“1”条目。我想把类似的行串在一起,但是我的初始计划需要太长时间才能完成:
d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")
The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
第一步没有完成,所以我不确定第二步会怎样。在R中,有什么方法可以有效地将一个大的,稀疏的,二进制矩阵的相似行分组?
11
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary")
. It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64 bit unsigned integers for speed. The Jaccard distance of two vectors x and y is equal to x ^ y / (x | y)
where ^
is the xor operator. The Hamming Weight calculation is used to count the number of bits set if the result of the xor
or or
is non-zero.
我已经编写了一些Rcpp代码和R代码,它可以计算二进制矩阵约的二进制/Jaccard距离。80x比dist快(x,方法= "二进制")。它将输入矩阵转换成一个原始矩阵,它是输入的转置(因此,位模式在内部是正确的顺序)。然后在一些c++代码中使用,该代码将数据处理为64位无符号整数,以加快速度。两个向量的Jaccard距离x和y = x ^ y /(x | y)^ xor运算符。如果xor的结果为非零,则使用汉明权重计算来计算集合的位数。
I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary")
for a few random datasets.
我将github上的代码放在https://github.com/NikNakk/binaryDist/并复制下面的两个文件。我已经确认了一些随机数据集的结果与dist(x, method = "二进制")相同。
On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
在14000行的数据集上,每一行有1到5个1到5个的列,大约需要11分钟。输出距离矩阵为5.7 GB。
#include <Rcpp.h>
using namespace Rcpp;
//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
const uint64_t m1 = 0x5555555555555555; //binary: 0101...
const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
int countBits(uint64_t x) {
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
return (x * h01)>>56; //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
uint64_t* x = (uint64_t*)RAW(rv);
return(countBits(*x));
}
// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
int nr(mat.nrow()), nc(mat.ncol());
int nw = nr / 8;
NumericVector res(nc * (nc - 1) / 2);
// Access the raw data as unsigned 64 bit integers
uint64_t* data = (uint64_t*)RAW(mat);
uint64_t a(0);
// Work through each possible combination of columns (rows in the original integer matrix)
for (int i = 0; i < nc - 1; i++) {
for (int j = i + 1; j < nc; j++) {
uint64_t sx = 0;
uint64_t so = 0;
// Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
for (int k = 0; k < nw; k++) {
uint64_t o = data[nw * i + k] | data[nw * j + k];
// If (x | y == 0) then (x ^ y) will also be 0
if (o) {
// Use Hamming weight method to calculate number of set bits
so = so + countBits(o);
uint64_t x = data[nw * i + k] ^ data[nw * j + k];
if (x) {
sx = sx + countBits(x);
}
}
}
res(a++) = (double)sx / so;
}
}
return (res);
}
library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}
as.PackedMatrix <- function(x, packWidth = 64L) {
UseMethod("as.PackedMatrix")
}
# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
stopifnot(packWidth %% 8 == 0, class(x) %in% c("matrix", "Matrix"))
storage.mode(x) <- "raw"
if (ncol(x) %% packWidth != 0) {
x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
}
out <- packBits(t(x))
dim(out) <- c(ncol(x) %/% 8, nrow(x))
class(out) <- "PackedMatrix"
out
}
# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
out <- rawToBits(x)
dim(out) <- c(nrow(x) * 8L, ncol(x))
storage.mode(out) <- "integer"
t(out)
}
# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
x <- replicate(nObs, {
y <- integer(nVariables)
y[sample(nVariables, sample(maxBits, 1))] <- 1L
if (packed) {
packRow(y, 64L)
} else {
y
}
})
if (packed) {
class(x) <- "PackedMatrix"
x
} else {
t(x)
}
}
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
i <- i + 1
}
out <- do.call("cbind", lst)
class(out) <- "PackedMatrix"
out
}
# Wrapper for the C++ code which
binaryDist <- function(x) {
if (class(x) != "PackedMatrix") {
x <- as.PackedMatrix(x)
}
dst <- bDist(x)
attr(dst, "Size") <- ncol(x)
attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
attr(dst, "method") <- "binary"
attr(dst, "call") <- match.call()
class(dst) <- "dist"
dst
}
x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)
system.time(bd <- binaryDist(x))
From original answer:
从最初的回答:
Other things to consider would be doing some prefiltering of comparisons between two rows with single ones since the distance will either be 0 for duplicates or 1 for any other possibility.
其他需要考虑的事情是对两行之间的比较进行一些预筛选,因为距离要么是0,要么是其他可能的1。
A couple of relatively straightforward options that might be faster without needing much code are the vegdist
function from the vegan package and the Dist
function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact it supports parallelisation.
一些相对简单的选择可能更快,但不需要太多代码,这是素食软件包的vegdist函数和amap包的Dist函数。如果您拥有多个内核并利用它支持并行化的事实,那么后者可能只会更快。
6
The reason this takes so long to compute is that the call to dist
is computing and storing more than 760 million pairwise distances. If your data is stored sparsely, this will take a long time and huge amount of storage. If your data is not stored sparsely, then each distance computation requires at least 14,000 operations, for a total operation count exceeding 1 quadrillion!
这需要花费很长时间来计算的原因是,对dist的调用是计算和存储超过7.6亿条的pairwise距离。如果您的数据存储稀疏,这将花费很长时间和大量的存储。如果您的数据没有稀疏地存储,那么每个距离计算至少需要14000个操作,总操作数超过1千万亿!
An approach that will be much quicker is k-means clustering, since it doesn't require pre-computing a distance matrix; at each iteration you will need only 39000*k distance calculations, where k is the number of clusters. To get pairwise distances that are similar to the Jaccard index (0 if identical, 1 if no indices coincide, in between if some but not all indices coincide), you could divide each row x
by sqrt(2*sum(x^2))
. For instance, if you had the following input matrix:
一种更快的方法是k-means聚类,因为它不需要预先计算一个距离矩阵;在每个迭代中,您只需要39000*k的距离计算,其中k是集群的数量。为了得到与Jaccard指数相似的成对距离(如果相同,如果没有指标重合,如果没有所有的指标一致),你可以用sqrt(2*sum(x 2))来划分每一行x。例如,如果您有以下输入矩阵:
(mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 0, 1, 1)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 0 0 0 0
# [2,] 0 0 0 1 1
the normalized version would be (assuming binary values only in the matrix; if this were not the case you would use rowSums(mat^2)
):
归一化的版本是(只在矩阵中假定二进制值;如果不这样你会使用rowSums(垫^ 2)):
(mat.norm <- mat / sqrt(2*rowSums(mat)))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7071068 0 0 0.0 0.0
# [2,] 0.0000000 0 0 0.5 0.5
These two observations (which have no indices in common), have Euclidean distance 1, coinciding with the Jaccard distance for this case.
这两个观察(没有共同的指标),有欧几里得距离1,与这个例子的Jaccard距离一致。
dist(mat.norm, "euclidean")
# 1
# 2 1
Additionally, identical observations will clearly have Euclidean distance 0, again corresponding to the Jaccard distance.
另外,相同的观测值显然会有欧几里得距离为0,这又与Jaccard距离相对应。
2
do you have duplicate rows? There is no need to compute their distances twice.
你有重复的行吗?没有必要计算它们的距离两次。
all rows with a single 1 will be 100% different from all rows with a single one in a different place.
所有带有单个1的行将与在不同位置的所有行完全不同。
Thus, it does not make sense to run clustering on such data. The output is rather predictable, and boils down to finding the 1.
因此,在这样的数据上运行集群是没有意义的。输出是相当可预测的,可以归结为找到1。
Try restricting your data set to those objects that have more than one 1 only. Unless you can get interesting results on these only, no need to continue further. Binary data has too little information.
尝试将您的数据集限制到那些只有1个以上的对象。除非你能在这些方面得到有趣的结果,否则就没有必要再继续下去了。二进制数据的信息太少。
本站翻译的文章,版权归属于本站,未经许可禁止转摘,转摘请注明本文地址:http://www.silva-art.net/blog/2015/06/19/7dff46f088adb162a633c31756aaf5c9.html。