I have a data frame that is mostly zeros (a sparse data frame?), something like this:
name,factor_1,factor_2,factor_3
ABC,1,0,0
DEF,0,1,0
GHI,0,0,1
The actual data is about 90,000 rows with 10,000 features. Can I convert this to a sparse matrix? I am expecting to gain time and space efficiencies by utilizing a sparse matrix instead of a data frame.
Any help would be appreciated
Update #1: Here is some code to generate the data frame. Thanks to Richard for providing it.
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF", "GHI"),
class = "factor"),
factor_1 = c(1L, 0L, 0L),
factor_2 = c(0L,1L, 0L),
factor_3 = c(0L, 0L, 1L)),
.Names = c("name", "factor_1","factor_2", "factor_3"),
class = "data.frame",
row.names = c(NA,-3L))
8
It might be a bit more memory efficient (but slower) to avoid copying all the data into a dense matrix:
y <- Reduce(cbind2, lapply(x[,-1], Matrix, sparse = TRUE))
rownames(y) <- x[,1]
#3 x 3 sparse Matrix of class "dgCMatrix"
#
#ABC 1 . .
#DEF . 1 .
#GHI . . 1
If you have sufficient memory, you should use Richard's answer, i.e., turn your data.frame into a dense matrix and then use Matrix.
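As a quick check on the two routes described above, here is a minimal sketch that builds the sparse matrix both ways from the question's toy data and confirms they hold the same values. The names by_column and via_dense are only illustrative, and the toy data frame stands in for the real 90,000 x 10,000 data.

```r
library(Matrix)

# Toy data frame from the question (a small stand-in for the real data)
x <- data.frame(
  name = c("ABC", "DEF", "GHI"),
  factor_1 = c(1L, 0L, 0L),
  factor_2 = c(0L, 1L, 0L),
  factor_3 = c(0L, 0L, 1L)
)

# Column-wise route: each column becomes a one-column sparse Matrix,
# then the pieces are bound together, so no full dense copy is made
by_column <- Reduce(cbind2, lapply(x[, -1], Matrix, sparse = TRUE))

# Dense route: one dense copy of the whole table, then conversion
via_dense <- Matrix(as.matrix(x[, -1]), sparse = TRUE)

# Both routes should hold the same values
same_values <- all(as.matrix(by_column) == as.matrix(via_dense))
same_values
```

The column-wise route trades speed for a lower peak memory footprint, which only matters when the dense copy would not fit in memory.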
4
I do this all the time and it's a pain in the butt, so I wrote a method for it called sparsify() in my R package, mltools. It operates on data.tables, which are just fancy data.frames.
To solve your specific problem...
Install mltools (or just copy the sparsify() method into your environment)
Load packages
library(data.table)
library(Matrix)
library(mltools)
Sparsify
x <- data.table(x) # convert x to a data.table
sparseM <- sparsify(x[, !"name"]) # sparsify everything except the name column
rownames(sparseM) <- x$name # set the rownames
> sparseM
3 x 3 sparse Matrix of class "dgCMatrix"
factor_1 factor_2 factor_3
ABC 1 . .
DEF . 1 .
GHI . . 1
In general, the sparsify() method is pretty flexible. Here are some examples of how you can use it:
Make some data. Notice data types and unused factor levels
dt <- data.table(
intCol=c(1L, NA_integer_, 3L, 0L),
realCol=c(NA, 2, NA, NA),
logCol=c(TRUE, FALSE, TRUE, FALSE),
ofCol=factor(c("a", "b", NA, "b"), levels=c("a", "b", "c"), ordered=TRUE),
ufCol=factor(c("a", NA, "c", "b"), ordered=FALSE)
)
> dt
intCol realCol logCol ofCol ufCol
1: 1 NA TRUE a a
2: NA 2 FALSE b NA
3: 3 NA TRUE NA c
4: 0 NA FALSE b b
Out-Of-The-Box Use
> sparsify(dt)
4 x 7 sparse Matrix of class "dgCMatrix"
intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,] 1 NA 1 1 1 . .
[2,] NA 2 . 2 NA NA NA
[3,] 3 NA 1 NA . . 1
[4,] . NA . 2 . 1 .
Convert NAs to 0s and Sparsify Them
> sparsify(dt, sparsifyNAs=TRUE)
4 x 7 sparse Matrix of class "dgCMatrix"
intCol realCol logCol ofCol ufCol_a ufCol_b ufCol_c
[1,] 1 . 1 1 1 . .
[2,] . 2 . 2 . . .
[3,] 3 . 1 . . . 1
[4,] . . . 2 . 1 .
Generate Columns That Identify NA Values
> sparsify(dt[, list(realCol)], naCols="identify")
4 x 2 sparse Matrix of class "dgCMatrix"
realCol_NA realCol
[1,] 1 NA
[2,] . 2
[3,] 1 NA
[4,] 1 NA
Generate Columns That Identify NA Values In the Most Memory Efficient Manner
> sparsify(dt[, list(realCol)], naCols="efficient")
4 x 2 sparse Matrix of class "dgCMatrix"
realCol_NotNA realCol
[1,] . NA
[2,] 1 2
[3,] . NA
[4,] . NA
3
You could make the first column into row names, then use Matrix() from the Matrix package.
rownames(x) <- x$name
x <- x[-1]
library(Matrix)
Matrix(as.matrix(x), sparse = TRUE)
# 3 x 3 sparse Matrix of class "dtCMatrix"
# factor_1 factor_2 factor_3
# ABC 1 . .
# DEF . 1 .
# GHI . . 1
where the original x data frame is
x <- structure(list(name = structure(1:3, .Label = c("ABC", "DEF",
"GHI"), class = "factor"), factor_1 = c(1L, 0L, 0L), factor_2 = c(0L,
1L, 0L), factor_3 = c(0L, 0L, 1L)), .Names = c("name", "factor_1",
"factor_2", "factor_3"), class = "data.frame", row.names = c(NA,
-3L))
3
Just how sparse is your matrix? That determines how much you can reduce its size.
Your example matrix has 3 1s and 6 0s. With that ratio, there's little space savings by naively using Matrix.
> library('pryr') # for object_size
> library('Matrix')
> m <- matrix(rbinom(9e4*1e4, 1, 1/3), ncol = 1e4)
> object_size(m)
3.6 GB
> object_size(Matrix(m, sparse = T))
3.6 GB
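To see why the two sizes match, note that a dense integer matrix costs 4 bytes per entry, while a dgCMatrix stores an 8-byte value plus a 4-byte row index for each non-zero entry, so the two break even when roughly a third of the entries are non-zero. The sketch below repeats the comparison on a much smaller matrix at two densities. The helper compare_sizes and the dimensions are illustrative assumptions, and base object.size() is used in place of pryr.

```r
library(Matrix)

set.seed(1)
n_row <- 2000
n_col <- 500

# Bytes used by a dense integer 0/1 matrix and by its sparse form,
# for a given share of non-zero entries
compare_sizes <- function(density) {
  dense <- matrix(rbinom(n_row * n_col, 1, density), ncol = n_col)
  sparse <- Matrix(dense, sparse = TRUE)
  c(dense = as.numeric(object.size(dense)),
    sparse = as.numeric(object.size(sparse)))
}

sizes_third <- compare_sizes(1/3)     # about a third of entries non-zero
sizes_one_pct <- compare_sizes(0.01)  # about 1% of entries non-zero

sizes_third
sizes_one_pct
```

At a density of one third the two sizes should come out close to each other, while at 1% the sparse form should be far smaller, so the savings on the real data depend on how many of its entries are actually non-zero.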