Devs Fixed

Resolved: Dataframe – Sort and select top 4 key-values from Dict/Map column in Pyspark dataframe

By Isaac Tonny on 17/06/2022 Issue

Question:

I have a PySpark DataFrame with two columns, ID and count. The count column is a dict/Map<str,int> whose entries are not sorted. I want to sort the entries in count by value, keep only the top 4, and remove the remaining key-value pairs.
I have
I want something like this, where only the top 4 entries by value remain in the count column
My approach
If this is not possible in PySpark, is it possible to convert to a pandas DataFrame and then keep the top 4 by value? Any help is much appreciated.

Answer:

You can first explode the map, get the 4 rows with the highest count per ID, and then reconstruct it as a map.

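from pyspark.sql import functions as F

# Explode the map into one (key, value) row per entry, rank the rows
# within each id by value (descending), keep the top 4, then rebuild the map.
df = df.select('id', F.explode('count')) \
    .withColumn('rn', F.expr('row_number() over (partition by id order by value desc)')) \
    .filter('rn <= 4') \
    .groupBy('id') \
    .agg(F.map_from_entries(F.collect_list(F.struct('key', 'value'))).alias('count'))
df.show(truncate=False)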
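
On the pandas fallback raised in the question: the per-row logic is ordinary Python, so it can be sketched without Spark at all. The helper name top_n_by_value and the sample data below are hypothetical, and the sketch assumes each cell of the count column reaches Python as a plain dict.

```python
import heapq


def top_n_by_value(counts, n=4):
    """Return a new dict holding the n entries with the largest values."""
    # heapq.nlargest yields the entries ordered from highest to lowest value.
    top = heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
    return dict(top)


# Hypothetical cell value from the count column.
row = {"a": 3, "b": 10, "c": 7, "d": 1, "e": 8, "f": 5}
result = top_n_by_value(row)
print(result)  # {'b': 10, 'e': 8, 'c': 7, 'f': 5}
```

With a pandas DataFrame, a function like this could be applied to the column with Series.apply, for example pdf['count'].apply(top_n_by_value). That route pulls all the data onto one machine, so the explode-and-rank approach is likely the better fit for large datasets.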

If you have a better answer, please add a comment. Thank you!

dataframe pandas pyspark python

© 2023 DEVSFIX.COM