
  • 1 Reproducing the error
  • 2 Cause and fix
  • 3 Problems when using union on DataFrames

1 Reproducing the error

ERROR queue.BoundedInMemoryExecutor: error producing records
org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://hdp-yl-1:8020/user/testJoin/test_join27/join/default/1d0f7a5b-fcbc-40aa-994d-ada47e3a3257-0_0-59-5054_20211119171950.parquet

2 Cause and fix



from pyspark.sql.functions import col
write_df2 = write_df2.withColumn("superior_emp_id", col("superior_emp_id").cast("string"))

3 Problems when using union on DataFrames

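Calling union on DataFrames in Spark can trigger the same error. The cause is stated in the union documentation: "Also as standard in SQL, this function resolves columns by position (not by name)". In other words, rows are concatenated purely by column position, and column names are ignored.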


temp_df_inc_left schema:
root
 |-- sub_total_trans_cost: double (nullable = true)
 |-- sub_total_trans_price: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- trans_code: string (nullable = true)
 |-- account_id: string (nullable = true)
 |-- pay_channel_code: string (nullable = true)
 |-- total_trans_price: double (nullable = true)
 |-- total_trans_cost: double (nullable = true)
 |-- create_time: string (nullable = true)
 |-- update_time: string (nullable = true)

temp_df_inc_right schema:
root
 |-- sub_total_trans_price: integer (nullable = true)
 |-- sub_total_trans_cost: double (nullable = true)
 |-- id: integer (nullable = true)
 |-- trans_code: string (nullable = true)
 |-- account_id: string (nullable = true)
 |-- pay_channel_code: string (nullable = true)
 |-- total_trans_price: double (nullable = true)
 |-- total_trans_cost: double (nullable = true)
 |-- create_time: string (nullable = true)
 |-- update_time: string (nullable = true)

20
|sub_total_trans_cost|sub_total_trans_price|id |trans_code                      |account_id|pay_channel_code|total_trans_price|total_trans_cost|create_time        |update_time        |
|2565.2              |2905.0               |147|9a4a7a3f424f43018fef4a2ec0188c1e|yun4      |Alipay          |3385.8           |4285.2          |2021-03-20 03:49:34|2021-03-20 03:49:34|
|1414.0              |1614.0               |133|b47b20ac89fe4ccfb9b29005d338ad51|yun8      |Cash            |1781.5           |2307.6          |2021-03-30 12:39:57|2021-03-30 12:39:57|
|2127.2              |2367.0               |138|d8403a2dd23c4f24ab46ed709a995be4|yun1      |Cash            |2620.8           |3557.6          |2021-01-14 20:31:25|2021-01-14 20:31:25|
only showing top 3 rows
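
The two schemas above differ only in the order of their first two columns, which is easy to miss by eye. A minimal sketch of a pre-union check follows. It assumes schemas are given as lists of (name, type) pairs, the same shape PySpark's DataFrame.dtypes returns (where integer appears as "int"). The helper name positional_mismatches is hypothetical, and only the first three columns are reproduced here.

```python
from typing import List, Tuple

# Same shape as PySpark's DataFrame.dtypes: [(column_name, type_string), ...]
Schema = List[Tuple[str, str]]
Mismatch = Tuple[int, Tuple[str, str], Tuple[str, str]]


def positional_mismatches(left: Schema, right: Schema) -> List[Mismatch]:
    """Return (position, left_field, right_field) wherever the schemas disagree.

    Hypothetical helper: a positional union is only safe when this list is empty.
    """
    if len(left) != len(right):
        raise ValueError(f"column counts differ: {len(left)} vs {len(right)}")
    return [
        (pos, l_field, r_field)
        for pos, (l_field, r_field) in enumerate(zip(left, right))
        if l_field != r_field
    ]


# First three columns of temp_df_inc_left and temp_df_inc_right.
left_schema = [
    ("sub_total_trans_cost", "double"),
    ("sub_total_trans_price", "int"),
    ("id", "int"),
]
right_schema = [
    ("sub_total_trans_price", "int"),
    ("sub_total_trans_cost", "double"),
    ("id", "int"),
]

mismatches = positional_mismatches(left_schema, right_schema)
for pos, l_field, r_field in mismatches:
    print(f"position {pos}: left={l_field} right={r_field}")
```

With real DataFrames the call would be positional_mismatches(temp_df_inc_left.dtypes, temp_df_inc_right.dtypes). Any reported position means union would put values under the wrong column name.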


Scaladoc of Dataset.unionByName:

   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
   *
   * This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
   * union (that does deduplication of elements), use this function followed by a [[distinct]].
   *
   * The difference between this function and [[union]] is that this function
   * resolves columns by name (not by position):
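
To make the difference between the two resolution modes concrete, here is a small plain-Python model. It is not Spark code. The function names union_by_position and union_by_name are hypothetical stand-ins for the behaviour of union and unionByName. The row values are borrowed from the table above, and which side each row came from is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple[object, ...]


def union_by_position(left_cols: Sequence[str], left_rows: Sequence[Row],
                      right_cols: Sequence[str], right_rows: Sequence[Row]) -> List[Row]:
    """Model of union: right-hand rows are appended unchanged; right_cols is ignored."""
    if len(left_cols) != len(right_cols):
        raise ValueError("both sides must have the same number of columns")
    return list(left_rows) + list(right_rows)


def union_by_name(left_cols: Sequence[str], left_rows: Sequence[Row],
                  right_cols: Sequence[str], right_rows: Sequence[Row]) -> List[Row]:
    """Model of unionByName: right-hand values are reordered to follow left_cols."""
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("both sides must have the same column names")
    order = [list(right_cols).index(name) for name in left_cols]
    return list(left_rows) + [tuple(row[i] for i in order) for row in right_rows]


left_cols = ["sub_total_trans_cost", "sub_total_trans_price", "id"]
left_rows = [(2565.2, 2905, 147)]
right_cols = ["sub_total_trans_price", "sub_total_trans_cost", "id"]
right_rows = [(1614, 1414.0, 133)]  # price first, then cost

by_position = union_by_position(left_cols, left_rows, right_cols, right_rows)
by_name = union_by_name(left_cols, left_rows, right_cols, right_rows)

print(by_position[1])  # the price 1614 sits under sub_total_trans_cost
print(by_name[1])      # values follow left_cols: cost, price, id
```

In PySpark the by-name variant corresponds to left.unionByName(right). Reordering one side with right.select(*left.columns) before calling union is another way to get the same alignment.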

