Explore 150+ PySpark Interview Questions and Answers designed for beginners to advanced professionals. This complete guide includes Spark fundamentals, RDDs, DataFrames, Spark SQL, joins, performance tuning, real-world scenarios, and coding questions with in-depth explanations to help you crack PySpark and Big Data interviews.
PySpark Basics
1. What is PySpark?
PySpark is the Python interface to Apache Spark, built to process very large datasets efficiently using distributed computing. Instead of running on a single machine like traditional Python scripts, PySpark spreads the workload across multiple machines in a cluster. This allows it to handle millions or even billions of records while maintaining high performance.
For example, imagine processing terabytes of website click logs to understand user behavior. Doing this in Pandas on a laptop would likely crash due to memory limits, but PySpark can divide the data into partitions and process them in parallel across many worker nodes. This makes PySpark a powerful choice for big data analytics, machine learning, and ETL pipelines.
2. How is PySpark different from Pandas?
PySpark and Pandas serve similar purposes but operate at very different scales. Pandas is designed for small to medium datasets that fit into a single machine’s memory, making it ideal for quick analysis and prototyping. PySpark, on the other hand, is built for large-scale data processing and runs across multiple machines in a distributed environment.
For example, analyzing a 100 MB CSV file in Pandas is fast and convenient, but analyzing a 500 GB dataset is nearly impossible on a single system. PySpark solves this by splitting the dataset into chunks and processing them simultaneously across a cluster. Additionally, PySpark uses lazy evaluation and query optimization, which helps improve performance when working with complex transformations.
3. What are the main components of Spark architecture?
Spark architecture is built around several key components that work together to execute distributed workloads. The Driver controls the execution of the application and coordinates all tasks. The Cluster Manager is responsible for allocating resources across machines. The Executors are worker processes that perform actual computations on partitions of data. The SparkSession or SparkContext serves as the entry point for creating and managing Spark jobs.
For example, when you run a Spark job to analyze sales transactions, the Driver determines what needs to be done, the Cluster Manager assigns CPU and memory resources, and Executors process chunks of transaction data in parallel. This modular design allows Spark to scale efficiently across large clusters.
4. What is the Spark Driver?
The Spark Driver is the central control unit of a Spark application. It creates execution plans, converts logical queries into physical execution steps, divides work into stages and tasks, and assigns tasks to Executors. It also tracks job progress and collects final results.
For example, when running a PySpark job to compute total revenue per product, the Driver decides how to break the computation into tasks, sends those tasks to Executors, and then aggregates the final output. If the Driver stops running, the Spark job fails because it is responsible for orchestrating the entire process.
5. What is an Executor in Spark?
An Executor is a background process running on a worker node that performs computations assigned by the Driver. Executors run tasks in parallel, store intermediate results in memory or disk, and return computed results to the Driver.
For example, if a dataset has 200 partitions and the cluster has multiple Executors, each Executor processes a subset of partitions concurrently. More Executors and more CPU cores typically result in faster execution, especially for large joins or aggregations.
6. What is a SparkContext?
SparkContext is the low-level interface that establishes a connection between a Spark application and the cluster. It manages job execution, allocates resources, and allows the creation of RDDs.
For example, if you want to create an RDD from a list of values or load a raw text file into Spark memory, SparkContext provides that capability. While SparkContext is still used internally, most modern Spark applications use SparkSession instead.
7. What is SparkSession?
SparkSession is the unified entry point for Spark applications, introduced in Spark 2.x. It consolidates SQLContext and HiveContext and wraps SparkContext, providing a single interface for working with RDDs, DataFrames, Spark SQL, and streaming workloads.
For example, when reading a CSV file into a DataFrame, SparkSession handles schema inference, query optimization, and execution without requiring multiple configuration objects. This simplifies application development and improves productivity.
8. What is the difference between SparkContext and SparkSession?
SparkContext focuses on low-level cluster communication and RDD operations, whereas SparkSession provides a high-level API that supports structured data, SQL queries, machine learning, and streaming. SparkSession internally contains SparkContext, meaning developers usually do not need to interact with SparkContext directly.
In real practice, SparkSession is preferred because it simplifies code and supports modern Spark features.
9. What are RDDs in PySpark?
RDD stands for Resilient Distributed Dataset, which is Spark’s foundational data structure. It represents an immutable, distributed collection of objects that can be processed in parallel across cluster nodes.
For example, reading a text file into Spark creates an RDD where each line becomes a distributed element that can be processed independently.
10. What are the characteristics of RDDs?
RDDs are immutable, distributed, fault-tolerant, lazily evaluated, and partitioned across multiple nodes. These features allow Spark to scale computations while ensuring reliability and performance.
For instance, if a node fails, Spark can reconstruct lost RDD partitions using transformation lineage.
11. What is immutability in RDDs?
Immutability means once an RDD is created, it cannot be modified. Any transformation results in a new RDD, leaving the original unchanged. This ensures thread safety and prevents inconsistent results in distributed environments.
For example, applying map() on an RDD returns a new RDD instead of modifying the original one.
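The same behavior can be seen in a plain-Python analogy (no Spark required): map() produces a new collection and never mutates its input, just as an RDD transformation returns a new RDD.

```python
# Plain-Python analogy for RDD immutability: map() yields a new
# collection and the original stays unchanged.
original = [1, 2, 3]
doubled = list(map(lambda x: x * 2, original))

print(original)  # unchanged: [1, 2, 3]
print(doubled)   # new collection: [2, 4, 6]
```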
12. What is fault tolerance in Spark?
Fault tolerance allows Spark to recover from failures without losing data. Spark uses lineage to recompute lost partitions rather than storing multiple copies of data.
For example, if an executor crashes during processing, Spark rebuilds missing data using previous transformation steps.
13. What is lineage in Spark?
Lineage is the record of transformations applied to create an RDD. It enables Spark to recompute lost data in case of failures.
If an RDD was created using map() and filter(), Spark stores that history to rebuild it when needed.
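The idea can be sketched in plain Python (a toy analogy, not Spark code): store the chain of transformations rather than the materialized result, and replay the chain when a partition is lost.

```python
# Toy sketch of lineage-based recovery: keep the list of
# transformations, so a lost partition can be rebuilt by
# replaying them against the source data.
lineage = [
    ("map", lambda x: x * 2),
    ("filter", lambda x: x > 4),
]

def rebuild(source, lineage):
    """Recompute a partition from its source data and lineage."""
    data = source
    for op, fn in lineage:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
    return data

# Simulate losing the computed partition and rebuilding it.
recovered = rebuild([1, 2, 3, 4], lineage)
print(recovered)  # [6, 8]
```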
14. What is lazy evaluation in Spark?
Lazy evaluation means Spark delays execution until an action such as count() or collect() is called. This allows Spark to optimize the execution plan before running it.
For example, Spark may skip unnecessary transformations if they do not affect the final output.
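Python generators give an intuitive (non-Spark) feel for this: building the pipeline does no work, and computation only happens when a terminal step, the equivalent of an action, consumes it.

```python
# Generator analogy for lazy evaluation: nothing runs until a
# terminal step (the "action") consumes the pipeline.
log = []

def tracked_double(x):
    log.append(x)          # record when work actually happens
    return x * 2

pipeline = (tracked_double(x) for x in range(5))  # no work yet
assert log == []                                  # still lazy

result = list(pipeline)    # the "action": forces execution
print(result)              # [0, 2, 4, 6, 8]
print(log)                 # [0, 1, 2, 3, 4]
```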
15. What are transformations and actions in Spark?
Transformations define how data should change, while actions trigger actual computation and return results.
For example, filter() builds a transformation, but count() executes the job.
16. What is the difference between transformations and actions in Spark?
Transformations define how Spark should process data, but they do not trigger execution immediately. They are lazy, meaning Spark builds an execution plan but waits until an action is called before running anything. This allows Spark to optimize the entire workflow and avoid unnecessary computation. Examples of transformations include map(), filter(), and select(). Actions, on the other hand, trigger actual execution and return results to the driver or write output to storage. Examples include count(), collect(), show(), and save(). For example, filtering a dataset creates a transformation, but calling count() forces Spark to compute the result.
17. What is narrow vs wide transformation?
A narrow transformation processes data within the same partition and does not require data movement across the cluster. This makes it faster and more efficient because Spark can execute tasks locally. Examples include map() and filter(). A wide transformation requires shuffling data across multiple partitions, which increases network overhead and slows execution. Operations such as groupByKey(), join(), and distinct() are wide transformations. For example, filtering records is narrow, but grouping records by a key requires redistributing data, making it wide.
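The distinction can be sketched with plain-Python lists standing in for partitions (a toy analogy, not Spark code): a narrow operation touches each partition in isolation, while a wide one must regroup records across all partitions.

```python
# Partitions modeled as nested lists of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter runs independently inside each partition,
# with no data movement between them.
filtered = [[kv for kv in p if kv[1] > 1] for p in partitions]

# Wide: grouping by key needs records from *all* partitions,
# which in real Spark means a shuffle.
grouped = {}
for p in partitions:
    for k, v in p:
        grouped.setdefault(k, []).append(v)

print(filtered)  # [[('b', 2)], [('a', 3), ('c', 4)]]
print(grouped)   # {'a': [1, 3], 'b': [2], 'c': [4]}
```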
18. What is map vs flatMap?
The map() function transforms each input record into exactly one output record, preserving the structure of the dataset. The flatMap() function allows one input record to produce multiple output records and then flattens the result into a single list. For example, if each row contains a sentence, map() would return sentences as they are, while flatMap() could split each sentence into individual words and return a flattened list of words. flatMap() is useful when dealing with nested or multi-value data.
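The sentence-splitting example can be mirrored in plain Python (an analogy, not RDD code), with itertools.chain playing the role of the flattening step:

```python
# Plain-Python sketch of map() vs flatMap() on sentences.
from itertools import chain

sentences = ["hello world", "spark is fast"]

# map(): one output element per input element (nested lists).
mapped = [s.split() for s in sentences]

# flatMap(): split, then flatten into a single word list.
flat_mapped = list(chain.from_iterable(s.split() for s in sentences))

print(mapped)       # [['hello', 'world'], ['spark', 'is', 'fast']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'is', 'fast']
```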
19. What is filter transformation?
The filter() transformation is used to select records that match a specific condition, returning a new dataset while keeping the original unchanged. It is commonly used in data cleaning, validation, and segmentation tasks. For example, filtering transactions where the purchase amount is greater than 5000 allows analysts to focus on high-value customers. Since filter() is a transformation, Spark does not execute it immediately but waits until an action is called.
20. What is reduceByKey vs groupByKey?
reduceByKey() performs aggregation before shuffling data, which reduces network traffic and improves performance. It combines values locally on each partition before sending them across the cluster. groupByKey(), however, collects all values for each key without performing local aggregation, which can consume more memory and slow down execution. For example, when summing total sales per customer, reduceByKey() is more efficient because it reduces data earlier in the pipeline.
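The map-side combine can be sketched in plain Python (a toy model of partitions, not Spark code): each partition sums its own keys first, so only the small per-partition summaries would cross the network.

```python
# Toy sketch of why reduceByKey() ships less data: each partition
# combines values locally before anything is "shuffled".
partitions = [
    [("c1", 10), ("c2", 5), ("c1", 7)],
    [("c1", 3), ("c2", 8)],
]

# Local combine per partition (the map-side reduce).
local = []
for p in partitions:
    sums = {}
    for k, v in p:
        sums[k] = sums.get(k, 0) + v
    local.append(sums)
# local == [{'c1': 17, 'c2': 5}, {'c1': 3, 'c2': 8}]
# Only these summaries shuffle, not every original record.

# Final merge across partitions.
totals = {}
for sums in local:
    for k, v in sums.items():
        totals[k] = totals.get(k, 0) + v
print(totals)  # {'c1': 20, 'c2': 13}
```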
21. What is distinct in RDD?
The distinct() operation removes duplicate elements from an RDD by grouping identical values across partitions. Since it requires shuffling data, it can be expensive on large datasets. It is commonly used when extracting unique records, such as removing duplicate customer IDs or email addresses from logs. For example, ensuring that a dataset contains only unique user IDs before further processing.
22. What is repartition in RDD?
repartition() changes the number of partitions in an RDD and always triggers a shuffle, redistributing data evenly across the cluster. It is useful when increasing parallelism or balancing skewed data before performing heavy operations like joins or aggregations. For example, increasing partitions before processing a large dataset can improve CPU utilization and execution speed.
23. What is coalesce in RDD?
coalesce() reduces the number of partitions without triggering a shuffle, making it more efficient than repartition() when shrinking partitions. It is commonly used after filtering large datasets to reduce the number of output files. For example, reducing partitions before writing data to storage helps avoid creating many small files.
24. What is collect() and when should it be avoided?
The collect() action retrieves all records from an RDD or DataFrame to the driver machine. This can cause memory overflow if the dataset is large, so it should be avoided in production environments. It is best used only for small datasets or debugging. A safer alternative is take(n), which retrieves only a limited number of records for preview.
25. How does Spark handle RDD persistence?
RDD persistence allows Spark to store computed results in memory or disk to avoid recomputation. This is useful when the same dataset is reused multiple times, such as in machine learning training loops or iterative analytics. Persisting RDDs improves performance by reducing redundant computation and speeding up repeated operations.
26. What is checkpointing in RDD?
Checkpointing saves RDD data to stable storage such as HDFS, breaking long lineage chains and improving recovery time in case of failures. It is commonly used in streaming jobs or long-running pipelines to prevent expensive recomputation from the beginning.
27. How does RDD fault tolerance work?
RDD fault tolerance works through lineage tracking. Spark stores the history of transformations applied to create an RDD, allowing lost partitions to be recomputed if a node fails. This avoids storing full copies of data while still ensuring reliability.
DataFrames and Spark SQL
28. What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed, structured dataset organized in rows and columns, similar to a table in a database or an Excel sheet. Unlike RDDs, DataFrames have a schema, meaning column names and data types are defined. This allows Spark to optimize queries automatically using the Catalyst Optimizer and Tungsten Engine.
DataFrames are easier to use than RDDs because they support SQL queries, built-in functions, and optimized execution. They are ideal for analytics, reporting, and machine learning workloads where structured data is involved.
Example: Creating a DataFrame from a list of records
data = [(1, "John", 5000), (2, "Sara", 6000)]
df = spark.createDataFrame(data, ["id", "name", "salary"])
df.show()
29. How is DataFrame different from RDD?
RDDs are low-level, unstructured, and require manual optimization, whereas DataFrames are high-level, structured, and optimized automatically. RDD operations are written using functional programming, while DataFrames allow SQL-style queries and column-based operations.
DataFrames consume less memory, execute faster, and are easier to maintain in production. Spark can analyze DataFrame queries and optimize them, but it cannot do this effectively with RDDs.
Example: Filtering salary using RDD vs DataFrame
RDD: rdd.filter(lambda x: x[2] > 5000)
DataFrame: df.filter(df.salary > 5000).show()
30. What is Catalyst Optimizer?
Catalyst Optimizer is Spark’s query optimization engine for DataFrames and Spark SQL. It analyzes query plans and rewrites them into more efficient execution strategies.
It performs tasks such as:
- Removing unused columns
- Reordering joins
- Pushing filters earlier to reduce data scanning
This helps Spark run queries faster without requiring manual tuning.
Example:
df.filter(df.salary > 5000).select("name")
Spark optimizes this to filter first, then select, reducing computation.
31. What is Tungsten Execution Engine?
Tungsten is Spark’s low-level performance engine that improves memory management and CPU efficiency. It stores data in a binary format, reduces JVM object creation, and minimizes garbage collection overhead.
This makes Spark significantly faster for aggregations, joins, and large-scale analytics.
Example: Running large aggregation queries like
df.groupBy("department").sum("salary")
executes faster due to Tungsten’s optimized memory handling.
32. What is schema inference in DataFrames?
Schema inference allows Spark to automatically detect column names and data types when reading files like CSV or JSON. This saves time and ensures Spark correctly interprets numbers, dates, and strings.
It is useful when working with structured datasets where manual schema definition would be tedious.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()
33. How do you create a DataFrame from CSV?
Spark provides built-in CSV readers that support headers and automatic type detection.
Example:
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
This converts CSV rows into structured columns for analytics.
34. How do you create a DataFrame from JSON?
Spark reads JSON files and automatically converts nested structures into DataFrame columns.
Example:
df = spark.read.json("data.json")
df.printSchema()
df.show()
This is useful when working with API logs, event data, or semi-structured files.
35. What is the difference between select and withColumn?
select() is used to choose or transform columns, while withColumn() is used to add or modify a column.
Example:
df.select("name", "salary")
df = df.withColumn("bonus", df.salary * 0.1)
Use select() to extract data and withColumn() to create new calculated fields.
36. What is explode in PySpark?
explode() transforms array or nested list columns into multiple rows, expanding complex data into a flat format.
Example:
explode(col("items"))
If a column contains [A, B, C], it becomes three separate rows.
37. What is when and otherwise?
when() and otherwise() apply conditional logic, similar to IF-ELSE statements in SQL or Python.
Example:
when(col("score") > 80, "High").otherwise("Low")
This is commonly used for grading systems, classification, or labeling records.
38. What is window function in PySpark?
Window functions perform calculations across related rows while keeping each row in the output. They are useful for ranking, running totals, and trend analysis.
Example: Ranking employees by salary within each department
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

w = Window.partitionBy("dept").orderBy(df.salary.desc())
df.withColumn("rank", rank().over(w))
39. What is rank vs dense_rank?
rank() assigns the same rank to tied values but skips numbers afterward, while dense_rank() does not skip numbers.
Example:
Salaries: 100, 100, 90
Rank → 1, 1, 3
Dense Rank → 1, 1, 2
Used in leaderboards or performance ranking.
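The two numbering schemes can be reproduced in plain Python for the salary list above (an illustration of the semantics, not Spark code):

```python
# rank() vs dense_rank() computed by hand for [100, 100, 90].
salaries = [100, 100, 90]  # already sorted descending

unique_desc = sorted(set(salaries), reverse=True)  # [100, 90]

# rank(): position of the first occurrence, so numbers skip after ties.
ranks = [salaries.index(s) + 1 for s in salaries]

# dense_rank(): position among distinct values, no gaps.
dense_ranks = [unique_desc.index(s) + 1 for s in salaries]

print(ranks)        # [1, 1, 3]
print(dense_ranks)  # [1, 1, 2]
```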
40. How do you remove duplicates in PySpark?
Duplicates can be removed using distinct() or dropDuplicates().
Example:
df.distinct().show()
This ensures each row appears only once.
41. What is dropDuplicates()?
dropDuplicates() removes duplicate rows based on specific columns, giving more control than distinct().
Example:
df.dropDuplicates(["email"])
This keeps only one record per email address.
42. How do you rename columns in DataFrame?
Columns can be renamed for clarity or standardization.
Example:
df = df.withColumnRenamed("old_name", "new_name")
This improves readability and data consistency.
43. What is alias in PySpark?
alias() assigns temporary names to DataFrames or columns, improving readability in complex joins or SQL queries.
Example:
df.alias("employees")
Useful when joining the same DataFrame multiple times.
44. How do you join DataFrames?
Joining DataFrames merges datasets based on a common key.
Example:
df1.join(df2, "id", "inner")
Used to combine customer and order data.
45. What types of joins are supported in Spark?
Spark supports:
- Inner Join
- Left Join
- Right Join
- Full Outer Join
- Left Semi Join
- Left Anti Join
Example:
df1.join(df2, "id", "left")
Used to keep all records from one dataset even if no match exists.
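The difference between inner and left semantics can be sketched with two plain-Python dicts standing in for the customer and order tables (a toy model, not Spark code):

```python
# Toy sketch of inner vs left join semantics on a shared key.
customers = {1: "John", 2: "Sara", 3: "Amir"}
orders = {1: 250, 3: 900}

# Inner: only keys present on both sides survive.
inner = {k: (customers[k], orders[k]) for k in customers if k in orders}

# Left: every customer survives; missing orders become None (null).
left = {k: (customers[k], orders.get(k)) for k in customers}

print(inner)  # {1: ('John', 250), 3: ('Amir', 900)}
print(left)   # {1: ('John', 250), 2: ('Sara', None), 3: ('Amir', 900)}
```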
Joins and Performance
46. What is broadcast join?
A broadcast join is a join strategy where Spark copies a small table to all executor nodes so that each node can join it locally with a large table. This avoids shuffling the large dataset across the network, making joins significantly faster. Spark automatically chooses broadcast joins when one table is smaller than the broadcast threshold, or it can be manually forced.
This technique is highly effective when joining a large fact table with a small dimension table.
Example:
from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")
47. When should broadcast join be used?
Broadcast join should be used when one dataset is small enough to fit in executor memory. It is ideal for dimension tables such as country codes, product categories, or configuration mappings.
It should be avoided when the smaller table is too large, because broadcasting consumes executor memory and can cause out-of-memory failures. By default, Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB); manually broadcasting tables up to a few hundred megabytes can work if executors have enough memory.
Example: Joining a large sales dataset with a small currency lookup table.
48. What is shuffle join?
A shuffle join happens when Spark must redistribute both tables across partitions so that matching keys land on the same executor. This requires heavy network transfer and disk I/O, making it slower than broadcast joins.
Shuffle joins occur when neither dataset is small enough to broadcast.
Example: Joining two very large datasets like transaction logs and customer activity history.
49. What causes shuffle in Spark?
Shuffle occurs when Spark must move data across partitions or nodes. It is typically triggered by wide transformations like groupBy, join, distinct, and repartition.
Shuffle is expensive because it involves network transfer, disk writes, and reorganization of data across executors.
Example: Running groupBy("customer_id") on a massive dataset.
50. How do you optimize joins in Spark?
Join optimization involves reducing shuffle, choosing correct join strategies, and filtering data early. Common techniques include broadcast joins, filtering before joining, partitioning data properly, caching frequently used tables, and tuning shuffle partitions.
Example:
df_filtered = df_large.filter("year = 2024")
df_filtered.join(broadcast(df_small), "id")
51. What is join skew?
Join skew happens when some join keys contain far more records than others, causing uneven workload distribution. This results in certain executors processing much more data, slowing down the job.
Example: One customer ID appearing millions of times while others appear only a few times.
52. How do you handle skewed joins?
Skewed joins can be handled using salting, broadcast joins, adaptive query execution (AQE), splitting skewed keys, or filtering heavy keys separately.
Example: Isolating a highly frequent key and processing it independently.
53. What is salting technique?
Salting adds a random value to skewed keys so that Spark distributes data more evenly across partitions, preventing a single executor from becoming overloaded.
Example: Adding a random suffix to customer IDs before joining.
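A minimal plain-Python sketch of the idea (a toy model; real salting usually uses a random salt, and the small side of the join is replicated once per salt value to match):

```python
# Toy sketch of salting: spread one hot key over N sub-keys so its
# records no longer all land in the same partition. A round-robin
# salt is used here instead of a random one, for reproducibility.
N_SALTS = 4
hot_records = [("cust_1", i) for i in range(12)]  # one skewed key

salted = [(f"{k}_{i % N_SALTS}", v)
          for i, (k, v) in enumerate(hot_records)]

keys = sorted({k for k, _ in salted})
print(keys)  # ['cust_1_0', 'cust_1_1', 'cust_1_2', 'cust_1_3']
```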
54. What is sort merge join?
Sort merge join is Spark’s default join strategy for large datasets. It sorts both datasets on join keys and then merges them efficiently. It performs well when both datasets are large and already partitioned.
Example: Joining two large historical datasets.
55. How does Spark choose join strategy?
Spark chooses join strategy based on table size, configuration thresholds, statistics, and cost estimation. It may select broadcast join, shuffle hash join, or sort merge join automatically.
AQE can change join strategy dynamically at runtime.
Partitioning and Shuffling
56. What is partitioning in Spark?
Partitioning divides data into smaller chunks so Spark can process it in parallel across multiple executors. Proper partitioning improves performance and resource utilization.
Example: Partitioning sales data by region to speed up region-based queries.
57. What is default partition count?
The default partition count depends on Spark configuration:
- RDDs → based on the number of available cores
- Spark SQL shuffles → spark.sql.shuffle.partitions (default 200)
58. What is repartition vs coalesce?
repartition() increases or decreases partitions with a shuffle, ensuring even distribution. coalesce() reduces partitions without a shuffle, making it faster when shrinking partitions.
Example:
df.repartition(20)
df.coalesce(5)
59. What happens during shuffle?
During shuffle, Spark writes intermediate data to disk, transfers it over the network, and reorganizes partitions for downstream tasks. This is one of the most expensive operations in Spark.
60. Why is shuffle expensive?
Shuffle is expensive due to network transfer, disk I/O, serialization, deserialization, and sorting overhead. It can slow down Spark jobs significantly if not optimized.
61. How do you reduce shuffle overhead?
Shuffle overhead can be reduced by using broadcast joins, reducing partition count, caching reused data, filtering early, and avoiding unnecessary wide transformations.
62. What is data skew?
Data skew occurs when some partitions contain significantly more data than others, causing performance imbalance and executor delays.
63. How do you detect skew in Spark?
Skew can be detected using Spark UI, analyzing stage duration, checking task execution times, and monitoring partition sizes.
64. What is adaptive partitioning?
Adaptive partitioning dynamically adjusts partition sizes at runtime to balance workload and improve efficiency.
65. What is AQE (Adaptive Query Execution)?
AQE allows Spark to optimize query plans at runtime by changing join strategies, merging partitions, and handling skew dynamically.
Caching and Persistence
66. What is cache in Spark?
Caching stores data in memory so Spark can reuse it without recomputation. It is useful when datasets are reused multiple times.
67. What is persist in Spark?
Persist stores data using custom storage levels, allowing memory-only, disk-only, or hybrid storage.
68. What is the difference between cache and persist?
cache() is a shortcut for memory-only storage, while persist() allows custom storage strategies like memory+disk.
69. What storage levels are available in Spark?
Spark supports:
- MEMORY_ONLY
- MEMORY_AND_DISK
- DISK_ONLY
- OFF_HEAP
plus replicated variants such as MEMORY_AND_DISK_2.
70. When should caching be used?
Caching should be used when data is reused frequently, such as in iterative ML algorithms or repeated queries.
71. What happens if memory is insufficient for cache?
If memory is insufficient, Spark spills cached data to disk or recomputes partitions when needed.
72. How do you unpersist DataFrames?
df.unpersist()
PySpark Performance Optimization
73. What are common Spark performance bottlenecks?
Common bottlenecks include shuffle overhead, skewed data, insufficient memory, too many partitions, slow storage, and inefficient joins.
74. How do you tune Spark jobs?
Spark jobs are tuned by adjusting executor memory, cores, partition counts, caching strategy, join method, and query optimization.
75. What is predicate pushdown?
Predicate pushdown pushes filters to the data source, reducing data read volume.
Example: Filtering records before reading Parquet.
76. What is column pruning?
Column pruning reads only required columns, reducing I/O and memory usage.
77. What is broadcast variable?
A broadcast variable allows small datasets to be shared across executors efficiently, reducing redundant memory usage.
78. What is Kryo serialization?
Kryo is a faster serialization framework that reduces memory usage and speeds up data transfer.
79. How do you tune shuffle partitions?
Tune using:
spark.conf.set("spark.sql.shuffle.partitions", "100")
80. What is speculative execution?
Speculative execution reruns slow tasks on other executors to reduce job delays.
81. What is dynamic partition pruning?
It avoids scanning unnecessary partitions by filtering partitions dynamically during execution.
82. How does Tungsten improve performance?
Tungsten improves memory efficiency, CPU utilization, and binary execution speed.
83. How does Catalyst optimize queries?
Catalyst rewrites query plans using rule-based and cost-based optimization.
84. How do you optimize wide transformations?
Wide transformations are optimized using broadcast joins, reducing shuffle, caching, partition tuning, and filtering early.
85. How do you optimize Spark SQL queries?
Spark SQL queries are optimized using proper indexing formats (Parquet), partition pruning, AQE, Catalyst optimization, and caching.