Converting a Spark DataFrame column to List[String] in Scala

Original question by Vinay Pandey, 2016/10/17. Tags: spark-dataframe, scala, apache-spark, csv, dataframe

I am working on the MovieLens data set. In one of the CSV files, the data is structured as:

movieId movieTitle genres

Here genres is itself a list of |-separated values, and the field is nullable.

I am trying to get a unique list of all the genres so that I can rearrange the data as follows:

movieId movieTitle genre1 genre2 ... genreN

A row whose genres value is genre1 | genre2 would then look like:

1 Title1 1 1 0 ... 0

So far, I have been able to read the csv file using the following code:

val conf         = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context      = new SparkContext(conf)
val sparkSession = SparkSession.builder()
                   .appName(App.name)
                   .config(conf = conf)
                   .getOrCreate()

// "header" is an option of the CSV reader, not a session config entry
val movieFrame: DataFrame = sparkSession.read.option("header", "true").csv(moviesPath)

If I try something like:

movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()

Then I get the following exception:

java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1

I then tried providing the schema explicitly, using the following code:

val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))

and tried:

val movieFrame: DataFrame = sparkSession.read.option("header", "true").schema(moviesSchema).csv(moviesPath)

and then I got the same exception.

Is there any way I can get the set of genres as a List or a Set, so that I can further massage the data into the desired format? Any help will be appreciated.

2 Answers

#1

Here is how I got the set of genres:

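// Collect the "genres" column to the driver; the field is nullable, so drop nulls
val genreList: Array[String] =
  movieFrame.select("genres").collect.map(_.getString(0)).filter(_ != null)
// Split each |-separated value into its individual genres
val genres: Array[String] = for {
  g     <- genreList
  genre <- g.split("\\|")
} yield genre
val genreSet: Set[String] = genres.toSet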
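
Once the genre set is available, the 0/1 layout described in the question can be derived from it. The following is only a minimal sketch using plain Scala collections rather than Spark; the sample rows, the object name GenreEncoding and the helper splitGenres are hypothetical and exist purely for illustration.

```scala
object GenreEncoding {
  // Hypothetical sample rows: (movieId, title, genres); genres may be null.
  val movies: Seq[(String, String, String)] = Seq(
    ("1", "Title1", "Adventure|Comedy"),
    ("2", "Title2", "Comedy|Drama"),
    ("3", "Title3", null)
  )

  // Split a |-separated genres value; a null field yields no genres.
  def splitGenres(raw: String): Seq[String] =
    Option(raw).map(_.split("\\|").toSeq).getOrElse(Seq.empty).filter(_.nonEmpty)

  // Unique genres, sorted so that the column order is stable.
  val genreColumns: Seq[String] =
    movies.flatMap(m => splitGenres(m._3)).distinct.sorted

  // One output row per movie: movieId, title, then a 0/1 flag per genre column.
  val encoded: Seq[Seq[String]] = movies.map { case (id, title, raw) =>
    val present = splitGenres(raw).toSet
    Seq(id, title) ++ genreColumns.map(g => if (present(g)) "1" else "0")
  }
}
```

With real data, the rows collected from movieFrame would take the place of the hard-coded movies sequence. Collecting to the driver is an assumption that only holds while the data fits in memory.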

#2

This gives an Array[Array[String]]:

    val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").distinct).collect()

To get an Array[String], flatten it:

    val genres = genreLst.flatten

Or, in a single expression:

    val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").distinct).collect().flatten
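
Note that distinct in the expressions above is applied per row, so the flattened array can still contain the same genre several times, and a null genres field would throw a NullPointerException at split. The sketch below illustrates both points with plain Scala collections; the object name UniqueGenres and the sample values are hypothetical stand-ins for the collected column.

```scala
object UniqueGenres {
  // Hypothetical stand-in for the values of the "genres" column; one is null.
  val column: Seq[String] = Seq("Comedy|Drama", "Drama|Comedy|Comedy", null, "Action")

  // Per-row distinct followed by flattening, with nulls dropped first.
  val perRow: Seq[String] =
    column.filter(_ != null).map(_.split("\\|").distinct).flatMap(_.toSeq)

  // Applying distinct after flattening removes repeats across rows as well.
  val unique: Seq[String] = perRow.distinct
}
```

Here perRow still lists Comedy and Drama twice, while unique holds each genre once.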
