In this post, we will see how to save a DataFrame with a records-per-file limit while also making sure the same id does not end up spread across multiple files.
Question:
suppose I have this dataframe:

| id | value |
|----|-------|
| A  | 1 |
| A  | 2 |
| A  | 3 |
| B  | 1 |
| B  | 2 |
| C  | 1 |
| D  | 1 |
| D  | 2 |
and so on. Basically, I want to make sure that even with a records-per-file limit, any given id appears in only one single file (assuming the number of records with that id is less than the limit).
Say I am trying to write the output as CSV with a records limit:
is there a way to ensure this? Thanks.
Best Answer:
You could ensure that all records with the same id end up in the same file with `repartition`:

- `repartition` together with `partitionBy`: in that case, you will have one file per id, which respects your constraint.
- `repartition` without `partitionBy`: in that case, records with the same id will still necessarily end up in the same file, but different ids will collide within the same file. Note that in this case you cannot really control the maximum size of a file, only the average size of each file. Say we have n records and want an average file size of s records; we could then repartition the data by id into roughly n / s partitions, so that each partition is written out as one file of about the desired size.

If you have a better answer, please add a comment about this, thank you!
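The second option above can be sketched as follows. This is a minimal illustration, not the answer's original code (which is not shown in the source): the Spark calls appear only in comments, and the hash partitioning that `repartition(num_parts, "id")` performs is simulated with plain Python to show why all records with one id land in a single partition, and hence a single file. The partition count `n / s` follows the average-file-size argument above; the sample records match the question's table.

```python
import math
from collections import defaultdict

# In Spark, the second option would look roughly like this (names are
# illustrative, not from the source):
#   num_parts = math.ceil(n / s)            # n records, average file size s
#   df.repartition(num_parts, "id").write.csv(path)
# repartition(num_parts, "id") hash-partitions rows by id, so every row
# with the same id is routed to the same partition, and each partition is
# written as one file.

records = [("A", 1), ("A", 2), ("A", 3), ("B", 1),
           ("B", 2), ("C", 1), ("D", 1), ("D", 2)]

n = len(records)           # total number of records
s = 3                      # desired average number of records per file
num_parts = math.ceil(n / s)

# Simulate hash partitioning by id: the target file is a pure function of
# the id alone, so a single id can never be split across two files.
files = defaultdict(list)
for rec_id, value in records:
    files[hash(rec_id) % num_parts].append((rec_id, value))

# Check the guarantee: every id appears in exactly one simulated file.
id_to_files = defaultdict(set)
for part, rows in files.items():
    for rec_id, _ in rows:
        id_to_files[rec_id].add(part)
assert all(len(parts) == 1 for parts in id_to_files.values())
```

Note that Spark's `maxRecordsPerFile` write option enforces a hard per-file limit by splitting large partitions into several files, which is exactly what can scatter one id across files; the repartition approach trades that hard limit for an average file size in exchange for the one-file-per-id guarantee.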
Source: Stackoverflow.com