Question:
Based on this description of datasets and dataframes I wrote this very short test code, but it fails with: error: value toDS is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.catalog.Table]
Answer:
toDS() is not a member of RDD[T]. Welcome to the bizarre world of Scala implicits, where nothing is what it seems to be.

toDS() is a member of DatasetHolder[T]. In SparkSession, there is an object called implicits. When brought into scope with an expression like import spark.implicits._ (where spark is your SparkSession), an implicit method called rddToDatasetHolder becomes available for resolution; its signature is roughly:

implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T]

When you call rdd.toDS(), the compiler first searches the RDD class and all of its superclasses for a method called toDS(). It doesn't find one, so it starts searching all the compatible implicits in scope. While doing so, it finds the rddToDatasetHolder method, which accepts an RDD instance and returns an object of a type which does have a toDS() method. Basically, the compiler rewrites

rdd.toDS()

into

rddToDatasetHolder(rdd).toDS()
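The rewrite the compiler performs can be modeled in plain Scala without Spark. This is only a toy sketch of the mechanism: MiniRDD, MiniDatasetHolder, and rddToHolder are invented names standing in for RDD, DatasetHolder, and rddToDatasetHolder, not Spark's real classes.

```scala
import scala.language.implicitConversions

// Stand-in for RDD[T]: it has NO toDS method of its own.
case class MiniRDD[T](data: List[T])

// Stand-in for DatasetHolder[T]: this is where toDS actually lives.
case class MiniDatasetHolder[T](data: List[T]) {
  def toDS(): List[T] = data // pretend this builds a Dataset
}

// Stand-in for rddToDatasetHolder: an implicit conversion the compiler
// may insert when it fails to find toDS on MiniRDD itself.
implicit def rddToHolder[T](rdd: MiniRDD[T]): MiniDatasetHolder[T] =
  MiniDatasetHolder(rdd.data)

val rdd = MiniRDD(List("a", "b"))
// The compiler rewrites rdd.toDS() into rddToHolder(rdd).toDS()
val ds = rdd.toDS()
```

Comment out the implicit def and the call no longer compiles, which mirrors what happens when import spark.implicits._ is missing.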
Now look at rddToDatasetHolder itself: thanks to the context bound [T : Encoder], it effectively has two argument lists, the explicit one taking the RDD[T] and an implicit one requiring a value of the Encoder[T] type. There are many predefined encoders for the standard Scala types, but for most complex custom types no predefined encoders exist.

So, in short: the existence of a predefined Encoder[String] makes it possible to call toDS() on an instance of RDD[String], but the absence of a predefined Encoder[org.apache.spark.sql.catalog.Table] makes it impossible to call toDS() on an instance of RDD[org.apache.spark.sql.catalog.Table].
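The encoder requirement can also be sketched without Spark. In this toy model, MiniEncoder and Holder are invented names: the point is that toDS only compiles when an implicit MiniEncoder[T] is in scope, just as Spark's toDS() needs an implicit Encoder[T].

```scala
// Toy type class standing in for org.apache.spark.sql.Encoder.
trait MiniEncoder[T] { def name: String }

// A "predefined" encoder for String, mimicking the ones that
// import spark.implicits._ brings into scope.
implicit val stringEncoder: MiniEncoder[String] =
  new MiniEncoder[String] { def name = "String" }

case class Holder[T](data: List[T]) {
  // Two argument lists: the explicit data is captured by the class,
  // and toDS demands an implicit MiniEncoder[T], just like the
  // context bound [T : Encoder] on rddToDatasetHolder.
  def toDS()(implicit enc: MiniEncoder[T]): (String, List[T]) =
    (enc.name, data)
}

val ok = Holder(List("x", "y")).toDS() // compiles: MiniEncoder[String] exists

case class Table(name: String)
// Holder(List(Table("t"))).toDS()     // would NOT compile: no MiniEncoder[Table]
```

The commented-out line reproduces the original error in miniature: the method exists, but the implicit argument cannot be supplied.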
By the way, SparkSession.implicits also contains the implicit class StringToColumn, which has a $ method. This is how the $"foo" expression gets converted to a Column instance for column foo.
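The trick behind $"foo" is Scala's string-interpolator desugaring: $"foo" becomes StringContext("foo").$(). A toy version, with FakeColumn and DollarToColumn as invented stand-ins for Spark's Column and StringToColumn:

```scala
// Stand-in for org.apache.spark.sql.Column.
case class FakeColumn(name: String)

// An implicit class on StringContext adds a $ interpolator, so that
// $"foo" desugars to StringContext("foo").$() and yields a FakeColumn.
implicit class DollarToColumn(val sc: StringContext) {
  def $(args: Any*): FakeColumn = FakeColumn(sc.s(args: _*))
}

val col = $"foo" // becomes StringContext("foo").$()
```

Any identifier can serve as an interpolator this way; Spark simply picked $ for brevity.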
Resolving all the implicit arguments and implicit transformations is why compiling Scala code is so dang slow.