Question:
I have a PySpark DataFrame with two columns, ID and count, where count is a map (Map&lt;str,int&gt;). The values in count are not sorted. I am trying to sort the entries inside the count column, keep only the top 4 by value, and remove the rest of the key-value pairs.
Answer:
You can first explode the map, keep the 4 rows with the highest count per ID, and then reconstruct it as a map:
[code]
import pyspark.sql.functions as F

df = df.select('id', F.explode('count')) \
    .withColumn('rn', F.expr('row_number() over (partition by id order by value desc)')) \
    .filter('rn <= 4') \
    .groupBy('id') \
    .agg(F.map_from_entries(F.collect_list(F.struct('key', 'value'))).alias('count'))

df.show(truncate=False)
[/code]
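For intuition, the per-row transformation that the pipeline above computes is equivalent to the following plain-Python sketch (the function name and sample data are illustrative, not from the original post):

[code]
def top_n_by_value(counts, n=4):
    # Sort the map's entries by value, descending, and keep the first n,
    # mirroring explode -> row_number -> filter -> map_from_entries.
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)

row = {"a": 3, "b": 10, "c": 1, "d": 7, "e": 5}
print(top_n_by_value(row))
# {'b': 10, 'd': 7, 'e': 5, 'a': 3}
[/code]

The Spark version distributes this same logic: `explode` turns each map entry into a row, `row_number()` partitioned by `id` ranks entries by value, and `map_from_entries` rebuilds the surviving entries into a map per ID.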