union
union(otherDataset)是数据合并,返回一个新的数据集,由原数据集和otherDataset联合而成。
val rdd1 = sc.parallelize(1 to 9, 3)
val rdd2 = rdd1.map(x => x * 2)
rdd2.collect
val rdd3 = rdd2.filter(x => x > 10)
rdd3.collect
val rdd4 = rdd1.union(rdd3)
rdd4.collect
res: Array[Int] = Array(1,2,3,4,5,6,7,8,9,12,14,16,18)
join
join(otherDataset, [numTasks])是连接操作,将输入数据集(K,V)和另外一个数据集(K,W)进行Join, 得到(K, (V,W));该操作是对于相同K的V和W集合进行笛卡尔积 操作,也即V和W的所有组合;
val rdd0 = sc.parallelize((1,1),(1,2),(1,3),(2,1),(2,2),(2,3), 3)
val rdd5 = rdd0.join(rdd0)
rdd5.collect
res: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (1,(2,1)), (1,(2,2)), (1,(2,3)), (1,(3,1)), (1,(3,2)), (1,(3,3)), (2,(1,1)), (2,(1,2)), (2,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)), (2,(3,1)), (2,(3,2)), (2,(3,3)))
cogroup
cogroup(otherDataset, [numTasks])是将输入数据集(K, V)和另外一个数据集(K, W)进行cogroup,得到一个格式为(K, Seq[V], Seq[W])的数据集。
val rdd6 = rdd0.cogroup(rdd0)
rdd6.collect
res: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))), (2,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))))
本站转载的文章为个人学习借鉴使用,本站对版权不负任何法律责任。如果侵犯了您的隐私权益,请联系我们删除。