Explore 150+ PySpark Interview Questions and Answers designed for beginners to advanced professionals. This complete guide includes Spark fundamentals, RDDs, DataFrames, Spark SQL, joins, performance tuning, real-world scenarios, and coding questions with in-depth explanations to help you crack PySpark and Big Data interviews.
PySpark Basics
1. What is PySpark?
PySpark is the Python interface to Apache Spark, built to process very large datasets efficiently using distributed computing. Instead of running on a single machine like traditional Python scripts, PySpark spreads the workload across multiple machines in a cluster. This allows it to handle millions or even billions of records while maintaining high performance.
For example, imagine processing terabytes of website click logs to understand user behavior. Doing this in Pandas on a laptop would likely crash due to memory limits, but PySpark can divide the data into partitions and process them in parallel across many worker nodes. This makes PySpark a powerful choice for big data analytics, machine learning, and ETL pipelines.
2. How is PySpark different from Pandas?
PySpark and Pandas serve similar purposes but operate at very different scales. Pandas is designed for small to medium datasets that fit into a single machine’s memory, making it ideal for quick analysis and prototyping. PySpark, on the other hand, is built for large-scale data processing and runs across multiple machines in a distributed environment.
For example, analyzing a 100 MB CSV file in Pandas is fast and convenient, but analyzing a 500 GB dataset is nearly impossible on a single system. PySpark solves this by splitting the dataset into chunks and processing them simultaneously across a cluster. Additionally, PySpark uses lazy evaluation and query optimization, which helps improve performance when working with complex transformations.
3. What are the main components of Spark architecture?
Spark architecture is built around several key components that work together to execute distributed workloads. The Driver controls the execution of the application and coordinates all tasks. The Cluster Manager is responsible for allocating resources across machines. The Executors are worker processes that perform actual computations on partitions of data. The SparkSession or SparkContext serves as the entry point for creating and managing Spark jobs.
For example, when you run a Spark job to analyze sales transactions, the Driver determines what needs to be done, the Cluster Manager assigns CPU and memory resources, and Executors process chunks of transaction data in parallel. This modular design allows Spark to scale efficiently across large clusters.
4. What is the Spark Driver?
The Spark Driver is the central control unit of a Spark application. It creates execution plans, converts logical queries into physical execution steps, divides work into stages and tasks, and assigns tasks to Executors. It also tracks job progress and collects final results.
For example, when running a PySpark job to compute total revenue per product, the Driver decides how to break the computation into tasks, sends those tasks to Executors, and then aggregates the final output. If the Driver stops running, the Spark job fails because it is responsible for orchestrating the entire process.
5. What is an Executor in Spark?
An Executor is a background process running on a worker node that performs computations assigned by the Driver. Executors run tasks in parallel, store intermediate results in memory or disk, and return computed results to the Driver.
For example, if a dataset has 200 partitions and the cluster has multiple Executors, each Executor processes a subset of partitions concurrently. More Executors and more CPU cores typically result in faster execution, especially for large joins or aggregations.
6. What is a SparkContext?
SparkContext is the low-level interface that establishes a connection between a Spark application and the cluster. It manages job execution, allocates resources, and allows the creation of RDDs.
For example, if you want to create an RDD from a list of values or load a raw text file into Spark memory, SparkContext provides that capability. While SparkContext is still used internally, most modern Spark applications use SparkSession instead.
7. What is SparkSession?
SparkSession is the unified entry point for Spark applications, introduced in Spark 2.x. It consolidates SQLContext and HiveContext and wraps SparkContext, providing a single interface for working with RDDs, DataFrames, Spark SQL, and streaming workloads.
For example, when reading a CSV file into a DataFrame, SparkSession handles schema inference, query optimization, and execution without requiring multiple configuration objects. This simplifies application development and improves productivity.
8. What is the difference between SparkContext and SparkSession?
SparkContext focuses on low-level cluster communication and RDD operations, whereas SparkSession provides a high-level API that supports structured data, SQL queries, machine learning, and streaming. SparkSession internally contains SparkContext, meaning developers usually do not need to interact with SparkContext directly.
In real practice, SparkSession is preferred because it simplifies code and supports modern Spark features.
9. What are RDDs in PySpark?
RDD stands for Resilient Distributed Dataset, which is Spark’s foundational data structure. It represents an immutable, distributed collection of objects that can be processed in parallel across cluster nodes.
For example, reading a text file into Spark creates an RDD where each line becomes a distributed element that can be processed independently.
10. What are the characteristics of RDDs?
RDDs are immutable, distributed, fault-tolerant, lazily evaluated, and partitioned across multiple nodes. These features allow Spark to scale computations while ensuring reliability and performance.
For instance, if a node fails, Spark can reconstruct lost RDD partitions using transformation lineage.
11. What is immutability in RDDs?
Immutability means once an RDD is created, it cannot be modified. Any transformation results in a new RDD, leaving the original unchanged. This ensures thread safety and prevents inconsistent results in distributed environments.
For example, applying map() on an RDD returns a new RDD instead of modifying the original one.
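The same behavior can be seen in a plain-Python analogy (no Spark required): map() produces a new collection and never mutates its input, just as an RDD transformation returns a new RDD.

```python
# Plain-Python analogy for RDD immutability: map() yields a new
# collection and the original stays unchanged.
original = [1, 2, 3]
doubled = list(map(lambda x: x * 2, original))

print(original)  # unchanged: [1, 2, 3]
print(doubled)   # new collection: [2, 4, 6]
```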
12. What is fault tolerance in Spark?
Fault tolerance allows Spark to recover from failures without losing data. Spark uses lineage to recompute lost partitions rather than storing multiple copies of data.
For example, if an executor crashes during processing, Spark rebuilds missing data using previous transformation steps.
13. What is lineage in Spark?
Lineage is the record of transformations applied to create an RDD. It enables Spark to recompute lost data in case of failures.
If an RDD was created using map() and filter(), Spark stores that history to rebuild it when needed.
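The idea can be sketched in plain Python (a toy analogy, not Spark code): store the chain of transformations rather than the materialized result, and replay the chain when a partition is lost.

```python
# Toy sketch of lineage-based recovery: keep the list of
# transformations, so a lost partition can be rebuilt by
# replaying them against the source data.
lineage = [
    ("map", lambda x: x * 2),
    ("filter", lambda x: x > 4),
]

def rebuild(source, lineage):
    """Recompute a partition from its source data and lineage."""
    data = source
    for op, fn in lineage:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
    return data

# Simulate losing the computed partition and rebuilding it.
recovered = rebuild([1, 2, 3, 4], lineage)
print(recovered)  # [6, 8]
```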
14. What is lazy evaluation in Spark?
Lazy evaluation means Spark delays execution until an action such as count() or collect() is called. This allows Spark to optimize the execution plan before running it.
For example, Spark may skip unnecessary transformations if they do not affect the final output.
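Python generators give an intuitive (non-Spark) feel for this: building the pipeline does no work, and computation only happens when a terminal step, the equivalent of an action, consumes it.

```python
# Generator analogy for lazy evaluation: nothing runs until a
# terminal step (the "action") consumes the pipeline.
log = []

def tracked_double(x):
    log.append(x)          # record when work actually happens
    return x * 2

pipeline = (tracked_double(x) for x in range(5))  # no work yet
assert log == []                                  # still lazy

result = list(pipeline)    # the "action": forces execution
print(result)              # [0, 2, 4, 6, 8]
print(log)                 # [0, 1, 2, 3, 4]
```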
15. What are transformations and actions in Spark?
Transformations define how data should change, while actions trigger actual computation and return results.
For example, filter() builds a transformation, but count() executes the job.
16. What is the difference between transformations and actions in Spark?
Transformations define how Spark should process data, but they do not trigger execution immediately. They are lazy, meaning Spark builds an execution plan but waits until an action is called before running anything. This allows Spark to optimize the entire workflow and avoid unnecessary computation. Examples of transformations include map(), filter(), and select(). Actions, on the other hand, trigger actual execution and return results to the driver or write output to storage. Examples include count(), collect(), show(), and save(). For example, filtering a dataset creates a transformation, but calling count() forces Spark to compute the result.
17. What is narrow vs wide transformation?
A narrow transformation processes data within the same partition and does not require data movement across the cluster. This makes it faster and more efficient because Spark can execute tasks locally. Examples include map() and filter(). A wide transformation requires shuffling data across multiple partitions, which increases network overhead and slows execution. Operations such as groupByKey(), join(), and distinct() are wide transformations. For example, filtering records is narrow, but grouping records by a key requires redistributing data, making it wide.
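The distinction can be sketched with plain-Python lists standing in for partitions (a toy analogy, not Spark code): a narrow operation touches each partition in isolation, while a wide one must regroup records across all partitions.

```python
# Partitions modeled as nested lists of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter runs independently inside each partition,
# with no data movement between them.
filtered = [[kv for kv in p if kv[1] > 1] for p in partitions]

# Wide: grouping by key needs records from *all* partitions,
# which in real Spark means a shuffle.
grouped = {}
for p in partitions:
    for k, v in p:
        grouped.setdefault(k, []).append(v)

print(filtered)  # [[('b', 2)], [('a', 3), ('c', 4)]]
print(grouped)   # {'a': [1, 3], 'b': [2], 'c': [4]}
```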
18. What is map vs flatMap?
The map() function transforms each input record into exactly one output record, preserving the structure of the dataset. The flatMap() function allows one input record to produce multiple output records and then flattens the result into a single list. For example, if each row contains a sentence, map() would return sentences as they are, while flatMap() could split each sentence into individual words and return a flattened list of words. flatMap() is useful when dealing with nested or multi-value data.
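The sentence-splitting example can be mirrored in plain Python (an analogy, not RDD code), with itertools.chain playing the role of the flattening step:

```python
# Plain-Python sketch of map() vs flatMap() on sentences.
from itertools import chain

sentences = ["hello world", "spark is fast"]

# map(): one output element per input element (nested lists).
mapped = [s.split() for s in sentences]

# flatMap(): split, then flatten into a single word list.
flat_mapped = list(chain.from_iterable(s.split() for s in sentences))

print(mapped)       # [['hello', 'world'], ['spark', 'is', 'fast']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'is', 'fast']
```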
19. What is filter transformation?
The filter() transformation is used to select records that match a specific condition, returning a new dataset while keeping the original unchanged. It is commonly used in data cleaning, validation, and segmentation tasks. For example, filtering transactions where the purchase amount is greater than 5000 allows analysts to focus on high-value customers. Since filter() is a transformation, Spark does not execute it immediately but waits until an action is called.
20. What is reduceByKey vs groupByKey?
reduceByKey() performs aggregation before shuffling data, which reduces network traffic and improves performance. It combines values locally on each partition before sending them across the cluster. groupByKey(), however, collects all values for each key without performing local aggregation, which can consume more memory and slow down execution. For example, when summing total sales per customer, reduceByKey() is more efficient because it reduces data earlier in the pipeline.
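The map-side combine can be sketched in plain Python (a toy model of partitions, not Spark code): each partition sums its own keys first, so only the small per-partition summaries would cross the network.

```python
# Toy sketch of why reduceByKey() ships less data: each partition
# combines values locally before anything is "shuffled".
partitions = [
    [("c1", 10), ("c2", 5), ("c1", 7)],
    [("c1", 3), ("c2", 8)],
]

# Local combine per partition (the map-side reduce).
local = []
for p in partitions:
    sums = {}
    for k, v in p:
        sums[k] = sums.get(k, 0) + v
    local.append(sums)
# local == [{'c1': 17, 'c2': 5}, {'c1': 3, 'c2': 8}]
# Only these summaries shuffle, not every original record.

# Final merge across partitions.
totals = {}
for sums in local:
    for k, v in sums.items():
        totals[k] = totals.get(k, 0) + v
print(totals)  # {'c1': 20, 'c2': 13}
```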
21. What is distinct in RDD?
The distinct() operation removes duplicate elements from an RDD by grouping identical values across partitions. Since it requires shuffling data, it can be expensive on large datasets. It is commonly used when extracting unique records, such as removing duplicate customer IDs or email addresses from logs. For example, ensuring that a dataset contains only unique user IDs before further processing.
22. What is repartition in RDD?
repartition() changes the number of partitions in an RDD and always triggers a shuffle, redistributing data evenly across the cluster. It is useful when increasing parallelism or balancing skewed data before performing heavy operations like joins or aggregations. For example, increasing partitions before processing a large dataset can improve CPU utilization and execution speed.
23. What is coalesce in RDD?
coalesce() reduces the number of partitions without triggering a shuffle, making it more efficient than repartition() when shrinking partitions. It is commonly used after filtering large datasets to reduce the number of output files. For example, reducing partitions before writing data to storage helps avoid creating many small files.
24. What is collect() and when should it be avoided?
The collect() action retrieves all records from an RDD or DataFrame to the driver machine. This can cause memory overflow if the dataset is large, so it should be avoided in production environments. It is best used only for small datasets or debugging. A safer alternative is take(n), which retrieves only a limited number of records for preview.
25. How does Spark handle RDD persistence?
RDD persistence allows Spark to store computed results in memory or disk to avoid recomputation. This is useful when the same dataset is reused multiple times, such as in machine learning training loops or iterative analytics. Persisting RDDs improves performance by reducing redundant computation and speeding up repeated operations.
26. What is checkpointing in RDD?
Checkpointing saves RDD data to stable storage such as HDFS, breaking long lineage chains and improving recovery time in case of failures. It is commonly used in streaming jobs or long-running pipelines to prevent expensive recomputation from the beginning.
27. How does RDD fault tolerance work?
RDD fault tolerance works through lineage tracking. Spark stores the history of transformations applied to create an RDD, allowing lost partitions to be recomputed if a node fails. This avoids storing full copies of data while still ensuring reliability.
DataFrames and Spark SQL
28. What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed, structured dataset organized in rows and columns, similar to a table in a database or an Excel sheet. Unlike RDDs, DataFrames have a schema, meaning column names and data types are defined. This allows Spark to optimize queries automatically using the Catalyst Optimizer and Tungsten Engine.
DataFrames are easier to use than RDDs because they support SQL queries, built-in functions, and optimized execution. They are ideal for analytics, reporting, and machine learning workloads where structured data is involved.
Example: Creating a DataFrame from a list of records
data = [(1, "John", 5000), (2, "Sara", 6000)]
df = spark.createDataFrame(data, ["id", "name", "salary"])
df.show()
29. How is DataFrame different from RDD?
RDDs are low-level, unstructured, and require manual optimization, whereas DataFrames are high-level, structured, and optimized automatically. RDD operations are written using functional programming, while DataFrames allow SQL-style queries and column-based operations.
DataFrames consume less memory, execute faster, and are easier to maintain in production. Spark can analyze DataFrame queries and optimize them, but it cannot do this effectively with RDDs.
Example: Filtering salary using RDD vs DataFrame
RDD: rdd.filter(lambda x: x[2] > 5000)
DataFrame: df.filter(df.salary > 5000).show()
30. What is Catalyst Optimizer?
Catalyst Optimizer is Spark’s query optimization engine for DataFrames and Spark SQL. It analyzes query plans and rewrites them into more efficient execution strategies.
It performs tasks such as:
- Removing unused columns
- Reordering joins
- Pushing filters earlier to reduce data scanning
This helps Spark run queries faster without requiring manual tuning.
Example:
df.filter(df.salary > 5000).select("name")
Spark optimizes this to filter first, then select, reducing computation.
31. What is Tungsten Execution Engine?
Tungsten is Spark’s low-level performance engine that improves memory management and CPU efficiency. It stores data in a binary format, reduces JVM object creation, and minimizes garbage collection overhead.
This makes Spark significantly faster for aggregations, joins, and large-scale analytics.
Example: Running large aggregation queries like
df.groupBy("department").sum("salary")
executes faster due to Tungsten’s optimized memory handling.
32. What is schema inference in DataFrames?
Schema inference allows Spark to automatically detect column names and data types when reading files like CSV or JSON. This saves time and ensures Spark correctly interprets numbers, dates, and strings.
It is useful when working with structured datasets where manual schema definition would be tedious.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()
33. How do you create a DataFrame from CSV?
Spark provides built-in CSV readers that support headers and automatic type detection.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
This converts CSV rows into structured columns for analytics.
34. How do you create a DataFrame from JSON?
Spark reads JSON files and automatically converts nested structures into DataFrame columns.
Example:
df = spark.read.json("data.json")
df.printSchema()
df.show()
This is useful when working with API logs, event data, or semi-structured files.
35. What is the difference between select and withColumn?
select() is used to choose or transform columns, while withColumn() is used to add or modify a column.
Example:
df.select("name", "salary")
df = df.withColumn("bonus", df.salary * 0.1)
Use select() to extract data and withColumn() to create new calculated fields.
36. What is explode in PySpark?
explode() transforms array or nested list columns into multiple rows, expanding complex data into a flat format.
Example:
explode(col("items"))
If a column contains [A, B, C], it becomes three separate rows.
37. What is when and otherwise?
when() and otherwise() apply conditional logic, similar to IF-ELSE statements in SQL or Python.
Example:
when(col("score") > 80, "High").otherwise("Low")
This is commonly used for grading systems, classification, or labeling records.
38. What is window function in PySpark?
Window functions perform calculations across related rows while keeping each row in the output. They are useful for ranking, running totals, and trend analysis.
Example: Ranking employees by salary within each department
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

w = Window.partitionBy("dept").orderBy(df.salary.desc())
df.withColumn("rank", rank().over(w))
39. What is rank vs dense_rank?
rank() assigns the same rank to tied values but skips numbers afterward, while dense_rank() does not skip numbers.
Example:
Salaries: 100, 100, 90
Rank → 1, 1, 3
Dense Rank → 1, 1, 2
Used in leaderboards or performance ranking.
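The two numbering schemes can be reproduced in plain Python for the salary list above (an illustration of the semantics, not Spark code):

```python
# rank() vs dense_rank() computed by hand for [100, 100, 90].
salaries = [100, 100, 90]  # already sorted descending

unique_desc = sorted(set(salaries), reverse=True)  # [100, 90]

# rank(): position of the first occurrence, so numbers skip after ties.
ranks = [salaries.index(s) + 1 for s in salaries]

# dense_rank(): position among distinct values, no gaps.
dense_ranks = [unique_desc.index(s) + 1 for s in salaries]

print(ranks)        # [1, 1, 3]
print(dense_ranks)  # [1, 1, 2]
```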
40. How do you remove duplicates in PySpark?
Duplicates can be removed using distinct() or dropDuplicates().
Example:
df.distinct().show()
This ensures each row appears only once.
41. What is dropDuplicates()?
dropDuplicates() removes duplicate rows based on specific columns, giving more control than distinct().
Example:
df.dropDuplicates(["email"])
This keeps only one record per email address.
42. How do you rename columns in DataFrame?
Columns can be renamed for clarity or standardization.
Example:
df = df.withColumnRenamed("old_name", "new_name")
This improves readability and data consistency.
43. What is alias in PySpark?
alias() assigns temporary names to DataFrames or columns, improving readability in complex joins or SQL queries.
Example:
df.alias("employees")
Useful when joining the same DataFrame multiple times.
44. How do you join DataFrames?
Joining DataFrames merges datasets based on a common key.
Example:
df1.join(df2, "id", "inner")
Used to combine customer and order data.
45. What types of joins are supported in Spark?
Spark supports:
- Inner Join
- Left Join
- Right Join
- Full Outer Join
- Left Semi Join
- Left Anti Join
Example:
df1.join(df2, "id", "left")
Used to keep all records from one dataset even if no match exists.
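The difference between inner and left semantics can be sketched with two plain-Python dicts standing in for the customer and order tables (a toy model, not Spark code):

```python
# Toy sketch of inner vs left join semantics on a shared key.
customers = {1: "John", 2: "Sara", 3: "Amir"}
orders = {1: 250, 3: 900}

# Inner: only keys present on both sides survive.
inner = {k: (customers[k], orders[k]) for k in customers if k in orders}

# Left: every customer survives; missing orders become None (null).
left = {k: (customers[k], orders.get(k)) for k in customers}

print(inner)  # {1: ('John', 250), 3: ('Amir', 900)}
print(left)   # {1: ('John', 250), 2: ('Sara', None), 3: ('Amir', 900)}
```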
Joins and Performance
46. What is broadcast join?
A broadcast join is a join strategy where Spark copies a small table to all executor nodes so that each node can join it locally with a large table. This avoids shuffling the large dataset across the network, making joins significantly faster. Spark automatically chooses broadcast joins when one table is smaller than the broadcast threshold, or it can be manually forced.
This technique is highly effective when joining a large fact table with a small dimension table.
Example:
from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")
47. When should broadcast join be used?
Broadcast join should be used when one dataset is small enough to fit in executor memory. It is ideal for dimension tables such as country codes, product categories, or configuration mappings.
It should be avoided when the smaller table is too large, because broadcasting consumes executor memory and can cause out-of-memory failures. By default, Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB); manually broadcasting tables up to a few hundred megabytes can work if executors have enough memory.
Example: Joining a large sales dataset with a small currency lookup table.
48. What is shuffle join?
A shuffle join happens when Spark must redistribute both tables across partitions so that matching keys land on the same executor. This requires heavy network transfer and disk I/O, making it slower than broadcast joins.
Shuffle joins occur when neither dataset is small enough to broadcast.
Example: Joining two very large datasets like transaction logs and customer activity history.
49. What causes shuffle in Spark?
Shuffle occurs when Spark must move data across partitions or nodes. It is typically triggered by wide transformations like groupBy, join, distinct, and repartition.
Shuffle is expensive because it involves network transfer, disk writes, and reorganization of data across executors.
Example: Running groupBy("customer_id") on a massive dataset.
50. How do you optimize joins in Spark?
Join optimization involves reducing shuffle, choosing correct join strategies, and filtering data early. Common techniques include broadcast joins, filtering before joining, partitioning data properly, caching frequently used tables, and tuning shuffle partitions.
Example:
df_filtered = df_large.filter("year = 2024")
df_filtered.join(broadcast(df_small), "id")
51. What is join skew?
Join skew happens when some join keys contain far more records than others, causing uneven workload distribution. This results in certain executors processing much more data, slowing down the job.
Example: One customer ID appearing millions of times while others appear only a few times.
52. How do you handle skewed joins?
Skewed joins can be handled using salting, broadcast joins, adaptive query execution (AQE), splitting skewed keys, or filtering heavy keys separately.
Example: Isolating a highly frequent key and processing it independently.
53. What is salting technique?
Salting adds a random value to skewed keys so that Spark distributes data more evenly across partitions, preventing a single executor from becoming overloaded.
Example: Adding a random suffix to customer IDs before joining.
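A minimal plain-Python sketch of the idea (a toy model; real salting usually uses a random salt, and the small side of the join is replicated once per salt value to match):

```python
# Toy sketch of salting: spread one hot key over N sub-keys so its
# records no longer all land in the same partition. A round-robin
# salt is used here instead of a random one, for reproducibility.
N_SALTS = 4
hot_records = [("cust_1", i) for i in range(12)]  # one skewed key

salted = [(f"{k}_{i % N_SALTS}", v)
          for i, (k, v) in enumerate(hot_records)]

keys = sorted({k for k, _ in salted})
print(keys)  # ['cust_1_0', 'cust_1_1', 'cust_1_2', 'cust_1_3']
```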
54. What is sort merge join?
Sort merge join is Spark’s default join strategy for large datasets. It sorts both datasets on join keys and then merges them efficiently. It performs well when both datasets are large and already partitioned.
Example: Joining two large historical datasets.
55. How does Spark choose join strategy?
Spark chooses join strategy based on table size, configuration thresholds, statistics, and cost estimation. It may select broadcast join, shuffle hash join, or sort merge join automatically.
AQE can change join strategy dynamically at runtime.
Partitioning and Shuffling
56. What is partitioning in Spark?
Partitioning divides data into smaller chunks so Spark can process it in parallel across multiple executors. Proper partitioning improves performance and resource utilization.
Example: Partitioning sales data by region to speed up region-based queries.
57. What is default partition count?
The default partition count depends on Spark configuration:
- RDDs → based on the number of available cores
- Spark SQL shuffles → spark.sql.shuffle.partitions (default 200)
58. What is repartition vs coalesce?
repartition() increases or decreases partitions with a shuffle, ensuring even distribution. coalesce() reduces partitions without a shuffle, making it faster when shrinking partitions.
Example:
df.repartition(20)
df.coalesce(5)
59. What happens during shuffle?
During shuffle, Spark writes intermediate data to disk, transfers it over the network, and reorganizes partitions for downstream tasks. This is one of the most expensive operations in Spark.
60. Why is shuffle expensive?
Shuffle is expensive due to network transfer, disk I/O, serialization, deserialization, and sorting overhead. It can slow down Spark jobs significantly if not optimized.
61. How do you reduce shuffle overhead?
Shuffle overhead can be reduced by using broadcast joins, reducing partition count, caching reused data, filtering early, and avoiding unnecessary wide transformations.
62. What is data skew?
Data skew occurs when some partitions contain significantly more data than others, causing performance imbalance and executor delays.
63. How do you detect skew in Spark?
Skew can be detected using Spark UI, analyzing stage duration, checking task execution times, and monitoring partition sizes.
64. What is adaptive partitioning?
Adaptive partitioning dynamically adjusts partition sizes at runtime to balance workload and improve efficiency.
65. What is AQE (Adaptive Query Execution)?
AQE allows Spark to optimize query plans at runtime by changing join strategies, merging partitions, and handling skew dynamically.
Caching and Persistence
66. What is cache in Spark?
Caching stores data in memory so Spark can reuse it without recomputation. It is useful when datasets are reused multiple times.
67. What is persist in Spark?
Persist stores data using custom storage levels, allowing memory-only, disk-only, or hybrid storage.
68. What is the difference between cache and persist?
cache() is a shortcut for memory-only storage, while persist() allows custom storage strategies like memory+disk.
69. What storage levels are available in Spark?
Spark supports:
- MEMORY_ONLY
- MEMORY_AND_DISK
- DISK_ONLY
- OFF_HEAP
plus replicated variants such as MEMORY_AND_DISK_2.
70. When should caching be used?
Caching should be used when data is reused frequently, such as in iterative ML algorithms or repeated queries.
71. What happens if memory is insufficient for cache?
If memory is insufficient, Spark spills cached data to disk or recomputes partitions when needed.
72. How do you unpersist DataFrames?
df.unpersist()
PySpark Performance Optimization
73. What are common Spark performance bottlenecks?
Common bottlenecks include shuffle overhead, skewed data, insufficient memory, too many partitions, slow storage, and inefficient joins.
74. How do you tune Spark jobs?
Spark jobs are tuned by adjusting executor memory, cores, partition counts, caching strategy, join method, and query optimization.
75. What is predicate pushdown?
Predicate pushdown pushes filters to the data source, reducing data read volume.
Example: Filtering records before reading Parquet.
76. What is column pruning?
Column pruning reads only required columns, reducing I/O and memory usage.
77. What is broadcast variable?
A broadcast variable allows small datasets to be shared across executors efficiently, reducing redundant memory usage.
78. What is Kryo serialization?
Kryo is a faster serialization framework that reduces memory usage and speeds up data transfer.
79. How do you tune shuffle partitions?
Tune using:
spark.conf.set("spark.sql.shuffle.partitions", "100")
80. What is speculative execution?
Speculative execution reruns slow tasks on other executors to reduce job delays.
81. What is dynamic partition pruning?
It avoids scanning unnecessary partitions by filtering partitions dynamically during execution.
82. How does Tungsten improve performance?
Tungsten improves memory efficiency, CPU utilization, and binary execution speed.
83. How does Catalyst optimize queries?
Catalyst rewrites query plans using rule-based and cost-based optimization.
84. How do you optimize wide transformations?
Wide transformations are optimized using broadcast joins, reducing shuffle, caching, partition tuning, and filtering early.
85. How do you optimize Spark SQL queries?
Spark SQL queries are optimized using proper indexing formats (Parquet), partition pruning, AQE, Catalyst optimization, and caching.