I am working on the MovieLens data set. In one of the CSV files, the data is structured as:
movieId  movieTitle  genres
Here genres is itself a list of |-separated values, and the field is nullable.
I am trying to get a unique list of all the genres so that I can rearrange the data as follows:
movieId  movieTitle  genre1  genre2  ...  genreN
so that a row whose genres value is genre1|genre2 will look like:
1  Title1  1  1  0  ...  0
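To make the target layout concrete, here is a minimal sketch in plain Scala collections (no Spark), assuming each record is a (movieId, movieTitle, genres) triple whose genres field may be null. The `Movie` case class, the `GenreOneHot` object, and the sample rows are hypothetical names introduced only for illustration:

```scala
object GenreOneHot {
  // Hypothetical record type mirroring the three CSV columns; genres may be null.
  final case class Movie(movieId: String, movieTitle: String, genres: String)

  // Split a nullable, '|'-separated genres field into its parts.
  def splitGenres(genres: String): Seq[String] =
    Option(genres).toList.flatMap(_.split("\\|")).map(_.trim).filter(_.nonEmpty)

  // Sorted list of every distinct genre; this fixes the column order.
  def allGenres(movies: Seq[Movie]): Seq[String] =
    movies.flatMap(m => splitGenres(m.genres)).distinct.sorted

  // One row per movie: id, title, then a "1"/"0" flag per genre column.
  def oneHot(movies: Seq[Movie]): Seq[Seq[String]] = {
    val columns = allGenres(movies)
    movies.map { m =>
      val present = splitGenres(m.genres).toSet
      Seq(m.movieId, m.movieTitle) ++ columns.map(g => if (present(g)) "1" else "0")
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, not taken from the MovieLens files.
    val movies = Seq(
      Movie("1", "Title1", "genre1|genre2"),
      Movie("2", "Title2", "genre3"),
      Movie("3", "Title3", null)
    )
    println((Seq("movieId", "movieTitle") ++ allGenres(movies)).mkString("\t"))
    oneHot(movies).foreach(row => println(row.mkString("\t")))
  }
}
```

A movie with a null genres field simply gets 0 in every genre column.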
So far, I have been able to read the CSV file using the following code:
val conf = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context = new SparkContext(conf)
val sparkSession = SparkSession.builder()
  .appName(App.name)
  .config("header", "true")
  .config(conf = conf)
  .getOrCreate()
val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)
If I try something like:
movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()
Then I get the following exception:
java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1
I then also tried providing the schema explicitly using the following code:
val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))
and tried:
val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)
and then I got the same exception.
Is there any way in which I can get the set of genres as a List or a Set, so I can further massage the data into the desired format? Any help will be appreciated.
1
Here is how I got the set of genres:
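// Collect the raw genres column to the driver (one string per movie, possibly null).
val genreList: Array[String] = for (row <- movieFrame.select("genres").collect) yield row.getString(0)
// Split each non-null value on the literal '|' and flatten into a single array.
val genres: Array[String] = for {
  g <- genreList
  if g != null
  genre <- g.split("\\|")
} yield genre
val genreSet: Set[String] = genres.toSet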
-1
This worked to give an Array[Array[String]]:
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect()
To get an Array[String], flatten the result:
val genres = genreLst.flatten
or
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect().flatten
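Note that the snippets in this thread collect every row to the driver and only deduplicate afterwards. As a hedged alternative, the split and distinct steps can stay inside Spark using the `split` and `explode` functions from `org.apache.spark.sql.functions`. The sketch below assumes Spark SQL is on the classpath and runs against a local master; the `DistinctGenres` object, the `distinctGenres` helper, and the sample rows are hypothetical stand-ins for the `movieFrame` in the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object DistinctGenres {
  // Return the distinct genres found in a '|'-separated, nullable "genres" column.
  def distinctGenres(spark: SparkSession, movies: DataFrame): Set[String] = {
    import spark.implicits._
    movies
      .where(col("genres").isNotNull)                            // skip null genres
      .select(explode(split(col("genres"), "\\|")).as("genre"))  // one row per genre
      .distinct()                                                // deduplicate in Spark
      .as[String]
      .collect()                                                 // only distinct values reach the driver
      .toSet
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; the question uses App.sparkMaster.
    val spark = SparkSession.builder()
      .appName("distinct-genres-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows with the same three columns as the CSV.
    val movieFrame = Seq[(String, String, String)](
      ("1", "Title1", "genre1|genre2"),
      ("2", "Title2", "genre2|genre3"),
      ("3", "Title3", null)
    ).toDF("movieId", "movieTitle", "genres")

    println(distinctGenres(spark, movieFrame))
    spark.stop()
  }
}
```

Only the distinct genre values are collected here, which is usually a much smaller result than the full column.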