Top 10 PySpark DataFrame Interview Questions and Solutions

This article covers the top PySpark DataFrame interview questions with clear explanations and practical examples to help data engineers and Spark developers prepare effectively. Learn key concepts, real-world use cases, and expert-level answers to approach PySpark interviews with confidence.

Question 1: How do you create a DataFrame in Spark from a collection of data?

Solution:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

# Sample data
data = [("John", 25), ("Doe", 30), ("Jane", 28)]
columns = ["name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

# Stop Spark session
spark.stop()

Question 2: How do you select specific columns from a DataFrame?

Solution:

selected_df = df.select("name", "age")
selected_df.show()
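Interviewers often follow up by asking about column expressions. A minimal sketch, assuming the same df: col() builds a column expression, and alias() renames the column in the output (the alias name here is illustrative).

from pyspark.sql.functions import col

# Equivalent selection using column expressions; alias() renames the output column
selected_df = df.select(col("name"), col("age").alias("age_in_years"))
selected_df.show()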

Question 3: How do you filter rows in a DataFrame based on a condition?

Solution:

filtered_df = df.filter(df["age"] > 25)
filtered_df.show()
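A common follow-up is filtering on multiple conditions. A short sketch, assuming the same df: each comparison must be wrapped in parentheses and combined with & (and) or | (or).

from pyspark.sql.functions import col

# Combine conditions; parentheses are required around each comparison
filtered_df = df.filter((col("age") > 25) & (col("name") != "Doe"))
filtered_df.show()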

Question 4: How do you group by a column and perform an aggregation in Spark DataFrame?

Solution:

data = [
    ("John", "HR", 3000),
    ("Doe", "HR", 4000),
    ("Jane", "IT", 5000),
    ("Mary", "IT", 6000)
]

columns = ["name", "department", "salary"]

df = spark.createDataFrame(data, columns)

avg_salary_df = df.groupBy("department").avg("salary")

avg_salary_df.show()
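For several aggregations at once, agg() with aliased expressions is the usual follow-up. A sketch using the same df:

from pyspark.sql import functions as F

# Multiple aggregations per department, with readable output column names
summary_df = df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary"),
    F.count("*").alias("employee_count")
)
summary_df.show()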

Question 5: How do you join two DataFrames in Spark?

Solution:

data1 = [("John", 1), ("Doe", 2), ("Jane", 3)]
data2 = [(1, "HR"), (2, "IT"), (3, "Finance")]

columns1 = ["name", "dept_id"]
columns2 = ["dept_id", "department"]

df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)

joined_df = df1.join(df2, "dept_id", "inner")

joined_df.show()
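The join type is the third argument; swapping "inner" for "left", "right", or "outer" changes which unmatched rows are kept. For example, a left join keeps every row of df1 even when no matching dept_id exists in df2:

# Left join: unmatched rows from df1 appear with null department values
left_joined_df = df1.join(df2, "dept_id", "left")
left_joined_df.show()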

Question 6: How do you handle missing data in Spark DataFrame?

Solution:

data = [("John", None), ("Doe", 25), ("Jane", None), ("Mary", 30)]
columns = ["name", "age"]

df = spark.createDataFrame(data, columns)

df_filled = df.fillna({'age': 0})

df_filled.show()
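Handling missing data can also mean dropping incomplete rows rather than filling them. A minimal sketch on the same df:

# Drop rows where the age column is null
df_dropped = df.dropna(subset=["age"])
df_dropped.show()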

Question 7: How do you apply a custom function to a DataFrame column using UDF?

Solution (this example assumes the DataFrame with a department column from Question 4):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_uppercase(department):
    return department.upper() if department else None

convert_uppercase_udf = udf(convert_uppercase, StringType())

df_transformed = df.withColumn(
    "department_upper",
    convert_uppercase_udf(df["department"])
)

df_transformed.show()
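A strong follow-up answer notes that Python UDFs bypass Catalyst optimization and incur serialization overhead, so built-in functions are preferred when one exists. For this particular transformation, the built-in upper() achieves the same result:

from pyspark.sql import functions as F

# Built-in column function; no Python UDF overhead
df_transformed = df.withColumn("department_upper", F.upper(df["department"]))
df_transformed.show()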

Question 8: How do you sort a DataFrame by a specific column?

Solution:

sorted_df = df.orderBy("age")
sorted_df.show()
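orderBy() sorts in ascending order by default; for descending order, use desc(). A short sketch:

from pyspark.sql.functions import desc

# Sort by age in descending order
sorted_desc_df = df.orderBy(desc("age"))
sorted_desc_df.show()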

Question 9: How do you add a new column to a DataFrame?

Solution:

df_with_new_column = df.withColumn("new_column", df["age"] * 2)
df_with_new_column.show()
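New columns do not have to be derived from existing ones. As an illustrative sketch (the column names and values here are arbitrary), lit() adds a constant column and when()/otherwise() adds a conditional one:

from pyspark.sql import functions as F

# Constant column via lit(), conditional column via when()/otherwise()
df_extended = df.withColumn("source", F.lit("demo")) \
    .withColumn("is_senior", F.when(df["age"] >= 30, True).otherwise(False))
df_extended.show()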

Question 10: How do you remove duplicate rows from a DataFrame?

Solution:

data = [("John", 25), ("Doe", 30), ("Jane", 28), ("John", 25)]
columns = ["name", "age"]

df = spark.createDataFrame(data, columns)

df_deduplicated = df.dropDuplicates()

df_deduplicated.show()
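dropDuplicates() also accepts a subset of columns, which is useful when rows should be considered duplicates based on a key rather than on every column:

# Keep one row per name, regardless of differing values in other columns
df_dedup_by_name = df.dropDuplicates(["name"])
df_dedup_by_name.show()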

PySpark Client Mode and Cluster Mode

Apache Spark can run in multiple deployment modes, including client and cluster modes, which determine where the Spark driver program runs and how tasks are scheduled across the cluster. Understanding the differences between these modes is essential for optimizing Spark job performance and resource utilization.

1. PySpark Client Mode

Client mode is a deployment mode where the Spark driver runs on the machine where the spark-submit command is executed. The driver program communicates with the cluster’s executors to schedule and execute tasks.

Key Characteristics of Client Mode:

Driver Location: Runs on the machine where the user launches the application.

Best for Interactive Use: Ideal for development, debugging, and interactive sessions such as notebooks (e.g., Jupyter) where you want immediate feedback.

Network Dependency: The driver must maintain a constant connection with the executors. If the network connection between the client machine and the cluster is unstable, the job can fail.

Resource Utilization: The client machine’s resources (CPU, memory) are used for the driver, so a powerful client machine is beneficial.

Additional characteristics:

  • More suitable for local development and testing
  • Easier to debug due to local access to logs
  • Sensitive to VPNs, firewalls, and unstable network connections

Code Implementation for Client Mode:

To run a PySpark application in client mode, use the spark-submit command with --deploy-mode client. Here’s an example:

spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-cores 2 \
--executor-memory 4G \
--driver-memory 2G \
my_pyspark_script.py

Explanation:

  • --master yarn : Specifies YARN as the cluster manager
  • --deploy-mode client : Runs the driver on the client machine
  • --num-executors, --executor-cores, --executor-memory : Executor resource configuration
  • --driver-memory : Driver memory allocation
  • my_pyspark_script.py : PySpark application script

Example:

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ClientModeExample") \
    .getOrCreate()

# Sample DataFrame creation
data = [("John", 30), ("Doe", 25), ("Alice", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Perform operations
df.show()
df.groupBy("Age").count().show()

# Stop SparkSession
spark.stop()

2. PySpark Cluster Mode

Cluster mode is a deployment mode where the Spark driver runs inside the cluster, typically on one of the worker nodes, rather than on the client machine. This mode is more suitable for production jobs that require high availability and reliability.

Key Characteristics of Cluster Mode:

Driver Location: Runs on one of the cluster’s worker nodes.

Best for Production: Suitable for production environments where long-running jobs need stability and don’t require interactive sessions.

Less Network Dependency: Since the driver runs within the cluster, it maintains a more stable connection with the executors, reducing the risk of job failures due to network issues.

Resource Management: Utilizes cluster resources for the driver, freeing up client resources and often providing more powerful hardware for the driver process.

Additional characteristics:

  • More resilient to network disruptions
  • Recommended for scheduled batch pipelines
  • Better suited for automation via schedulers such as Airflow, Jenkins, or Oozie

Code Implementation for Cluster Mode:

To run a PySpark application in cluster mode, use spark-submit with --deploy-mode cluster. Here’s an example:

spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 8G \
--driver-memory 4G \
--conf spark.yarn.submit.waitAppCompletion=false \
my_pyspark_script.py

Explanation:

  • --master yarn : Specifies YARN
  • --deploy-mode cluster : Driver runs inside cluster
  • --conf spark.yarn.submit.waitAppCompletion=false : Submits asynchronously

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ClusterModeExample") \
    .getOrCreate()

df = spark.read.csv("hdfs:///path/to/input.csv", header=True, inferSchema=True)

result_df = df.filter(df['age'] > 30).groupBy("city").count()

result_df.write.csv("hdfs:///path/to/output.csv")

spark.stop()
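Because the driver’s location differs between the two modes, it can be useful for an application to confirm at runtime how it was launched. A minimal sketch, assuming an active SparkSession named spark:

# Read the deploy mode from the Spark configuration ("client" or "cluster")
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "unknown")
print("Deploy mode:", deploy_mode)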

Choosing Between Client Mode and Cluster Mode

Use Client Mode:

  • For interactive analysis or debugging
  • For smaller workloads
  • When running jobs locally

Use Cluster Mode:

  • For production batch jobs
  • For long-running workloads
  • When driver requires high compute resources

Conclusion

Understanding the differences between client mode and cluster mode in PySpark is crucial for effectively managing resources and optimizing job performance. Client mode is great for development and debugging, while cluster mode is ideal for production environments where stability and resource management are critical.