Common Interview Questions

What are the differences between a clustered and a non-clustered index?

A clustered index sorts and stores data rows in the table based on the key values, meaning there can only be one clustered index per table. A non-clustered index, however, creates a separate structure for the index with pointers to the actual data rows, and multiple non-clustered indexes can exist on a single table.

Explain normalization and its different forms.

Normalization organizes data in a database to reduce redundancy and improve integrity. Common forms include:

1NF: Ensure all columns contain atomic data.
2NF: Satisfy 1NF and remove partial dependencies.
3NF: Remove transitive dependencies.
Higher forms address more advanced structures.

What is the difference between DELETE, TRUNCATE, and DROP?

DELETE removes specific rows with a WHERE clause, allowing rollback. TRUNCATE removes all rows quickly without logging individual deletions, but rollback is not possible. DROP deletes the table itself, including its structure and data.

What are transactions, and why are they important?

Transactions ensure a sequence of database operations is completed successfully as a single unit, following ACID properties (Atomicity, Consistency, Isolation, Durability). They maintain database reliability and consistency, even in failure scenarios.

What is the difference between OLAP and OLTP?

OLAP (Online Analytical Processing) is designed for complex queries and analysis of large datasets, often in data warehouses. OLTP (Online Transaction Processing) supports high-volume transactional applications with fast query processing and real-time data integrity.

What are stored procedures, and why are they used?

Stored procedures are precompiled SQL statements stored in the database. They improve performance by reducing query parsing time, enhance security by abstracting queries from the application, and promote code reuse.

Explain the concept of referential integrity.

Referential integrity ensures that relationships between tables remain consistent. For instance, a foreign key in one table must correspond to a primary key in another, preventing orphaned rows or invalid references.

What is database partitioning, and what are its benefits?

Partitioning divides a large table into smaller, more manageable pieces based on specific criteria, like range or hash. It improves query performance, simplifies maintenance, and enhances scalability.

What are the differences between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN?

INNER JOIN: Returns rows with matching keys in both tables.
LEFT JOIN: Includes all rows from the left table, with nulls for non-matching rows in the right table.
RIGHT JOIN: Mirrors LEFT JOIN for the right table.
FULL JOIN: Combines LEFT and RIGHT JOIN to include all rows from both tables.

What is denormalization, and why is it used?

Denormalization introduces redundancy into a database to improve read performance by reducing the need for complex joins. It is commonly used in data warehousing and OLAP systems.

How does indexing affect database performance?

Indexing improves query speed by allowing the database to find rows quickly without scanning the entire table. However, excessive or poorly designed indexes can degrade performance by increasing write and storage overhead.

What are the differences between VARCHAR and TEXT in SQL?

VARCHAR has a defined length limit, stored inline with the row. TEXT stores larger strings, typically off-row, with a pointer in the main row. TEXT is less efficient for operations requiring inline data access.

What are database views, and how are they used?

Views are virtual tables representing the result of a query. They simplify complex queries, provide abstraction, enhance security by restricting data access, and can sometimes optimize performance.

What is the purpose of foreign keys?

Foreign keys enforce referential integrity by linking a column in one table to the primary key in another. They prevent invalid data entries and maintain logical relationships between tables.

Explain the difference between horizontal and vertical scaling in databases.

Horizontal scaling (sharding) distributes data across multiple servers, improving concurrency and storage capacity. Vertical scaling adds resources (CPU, memory) to a single server, improving performance for existing workloads.

What is the purpose of database replication?

Replication involves copying data from one database server to another. It enhances availability, disaster recovery, and load balancing while reducing latency for geographically distributed users.

What is an execution plan, and why is it important?

An execution plan outlines how the database engine executes a query, detailing steps like table scans, index usage, and joins. It helps identify performance bottlenecks and optimize queries.

What is a deadlock, and how can it be prevented?

A deadlock occurs when two or more transactions block each other, waiting indefinitely for resources. Prevention strategies include consistent locking order, using timeouts, or employing deadlock detection algorithms.

What is the difference between primary key and unique key?

A primary key uniquely identifies each row and cannot contain null values. A unique key also enforces uniqueness but allows nulls, and a table can have multiple unique keys.

What are common causes of slow database performance?

Causes include missing indexes, poor query design, excessive joins, lack of normalization, high contention, and hardware limitations. Profiling and optimization tools can help diagnose issues.

Explain the purpose of ACID properties in databases.

ACID (Atomicity, Consistency, Isolation, Durability) ensures reliable transaction processing. It guarantees that operations are completed fully or not at all, maintaining database integrity under concurrent or failure conditions.

What is a materialized view?

A materialized view stores the result of a query physically on disk, unlike regular views. This improves performance for expensive or frequently accessed queries but requires maintenance when the underlying data changes.

What is the difference between optimistic and pessimistic locking?

Optimistic locking assumes minimal contention, verifying data consistency at transaction end. Pessimistic locking prevents conflicts by locking resources during the transaction.

What are triggers, and when are they used?

Triggers are automatic actions executed in response to specific database events (e.g., INSERT, UPDATE, DELETE). They enforce rules, audit changes, or maintain derived data.

What is a surrogate key, and how does it differ from a natural key?

A surrogate key is an artificial identifier (e.g., auto-increment ID) used as a primary key, independent of business logic. Natural keys derive from meaningful data but may be less stable.

What is the difference between schema and instance in a database?

A schema defines the database structure (tables, relationships), while an instance is the data stored at a particular time. Schemas are static, whereas instances change over time.

What is sharding, and how is it implemented?

Sharding divides a database into smaller, distributed parts called shards. Implementation strategies include range-based, hash-based, or geo-based partitioning.

Explain the difference between data integrity and data consistency.

Data integrity ensures data accuracy and validity through constraints and rules, while consistency ensures data adheres to logical rules within a transaction or across the database.

What is a foreign key constraint, and how can it affect performance?

Foreign key constraints enforce relationships between tables, preventing invalid data. However, they can impact performance during inserts, updates, or deletes due to additional checks.

How can you optimize a SQL query?

Optimization techniques include indexing, avoiding SELECT *, rewriting subqueries as joins, filtering early, and analyzing execution plans. Query performance depends on the specific database engine.

What are NoSQL databases, and how do they differ from relational databases?

NoSQL databases are designed for unstructured, semi-structured, or polymorphic data. Unlike relational databases, they don’t rely on predefined schemas or SQL for queries. They excel in scalability and flexibility for use cases like real-time analytics and large-scale distributed systems.

What are the different types of NoSQL databases?

NoSQL databases are categorized as:

Key-Value Stores (e.g., Redis, DynamoDB).
Document Stores (e.g., MongoDB, Couchbase).
Column-Family Stores (e.g., Cassandra, HBase).
Graph Databases (e.g., Neo4j, ArangoDB).

When would you choose a NoSQL database over a relational database?

NoSQL is preferable for applications requiring:

Horizontal scalability.
Flexible or evolving schemas.
High write throughput.
Handling unstructured or semi-structured data (e.g., JSON, XML).

What are eventual consistency and strong consistency in NoSQL databases?

Eventual consistency ensures that all replicas of the database will eventually synchronize, suitable for high-availability systems. Strong consistency ensures immediate synchronization across all nodes, prioritizing data accuracy over availability.

What is CAP theorem, and how does it apply to NoSQL databases?

The CAP theorem states that a distributed system can only guarantee two of three: Consistency, Availability, and Partition Tolerance. NoSQL databases often sacrifice consistency for availability and partition tolerance.

How does MongoDB handle indexing, and what are its limitations?

MongoDB supports single-field, compound, and multikey indexes for efficient querying. However, excessive indexes can slow writes, and there’s a limit to the number of indexes per collection.

What are the differences between a document-oriented database and a key-value store?

Document-oriented databases (e.g., MongoDB) store structured data in document formats like JSON or BSON. Key-value stores (e.g., Redis) associate keys with values, focusing on fast retrieval of unstructured data.

What is a graph database, and when should it be used?

Graph databases (e.g., Neo4j) represent data as nodes (entities) and edges (relationships), ideal for scenarios like social networks, fraud detection, and recommendation engines where relationships are critical.

How does Cassandra handle replication and consistency?

Cassandra employs configurable replication across nodes for fault tolerance and offers tunable consistency, allowing developers to choose between consistency levels like ONE, QUORUM, or ALL, depending on the use case.

What are the key features of DynamoDB?

DynamoDB is a fully managed NoSQL key-value and document database with features like auto-scaling, high availability, strong and eventual consistency options, and integration with AWS services.

What is polyglot persistence?

Polyglot persistence involves using different types of databases (SQL, NoSQL, graph) for different components of a system based on specific data and query requirements, optimizing performance and scalability.

What is the difference between columnar databases and row-oriented databases?

Columnar databases (e.g., BigQuery, Apache HBase) store data by columns, optimizing for analytical queries on large datasets. Row-oriented databases store data by rows, better suited for transactional workloads.

What is MapReduce, and how does it relate to NoSQL databases?

MapReduce is a programming model for distributed computation, often used in NoSQL systems like Hadoop. It processes large datasets by mapping tasks to nodes and reducing results to a final output.

How does Couchbase differ from MongoDB?

Couchbase combines key-value and document storage with caching, offering high performance and consistency. MongoDB is document-focused with richer querying capabilities but relies on external caching mechanisms.

What is BASE in NoSQL databases, and how does it differ from ACID?

BASE (Basically Available, Soft state, Eventual consistency) is a set of principles emphasizing availability and performance over strict consistency, suitable for distributed NoSQL systems compared to ACID’s strong consistency.

How does Redis handle data persistence?

Redis supports data persistence via snapshots (RDB files) and append-only files (AOF). These methods enable recovery after failures, though Redis prioritizes in-memory speed for real-time applications.

What is a time-series database, and what are some examples?

Time-series databases are optimized for storing and querying time-stamped data, such as metrics and logs. Examples include InfluxDB, TimescaleDB, and Prometheus.

How does Cassandra handle writes and ensure durability?

Cassandra writes are distributed across nodes using a consistent hashing mechanism. Writes are first appended to a commit log for durability, then stored in-memory tables before being flushed to disk.

What are graph traversal algorithms in graph databases, and why are they important?

Graph traversal algorithms (e.g., depth-first search, breadth-first search) navigate relationships in graph databases. They are essential for querying paths, relationships, or networks efficiently.

What are some challenges of scaling NoSQL databases?

Challenges include:

Managing data consistency across distributed nodes.
Handling network latency in geographically dispersed systems.
Designing partitioning strategies that minimize hot spots.

What are the benefits of a multi-model database?

Multi-model databases (e.g., ArangoDB) support multiple data models like document, graph, and key-value in one system, reducing the need for multiple databases and simplifying development.

How do column-family stores like Cassandra differ from traditional relational databases?

Column-family stores structure data into rows and columns but group columns into families. This design optimizes high-write throughput and distributed storage for massive datasets, unlike relational databases’ strict schemas.

What is eventual consistency in Amazon DynamoDB, and how does it differ from strong consistency?

Eventual consistency allows reads to retrieve outdated data temporarily for higher availability. Strong consistency ensures the latest data but may reduce availability under high load.

What is the purpose of Redis’ pub/sub model?

Redis’ pub/sub model allows message broadcasting to subscribed channels, enabling real-time messaging and event-driven architectures like notifications or chat systems.

How does Elasticsearch index and retrieve data?

Elasticsearch uses inverted indexes for efficient full-text search. It tokenizes and analyzes text data during indexing, enabling fast retrieval through relevance-scored queries.

What are key features of Google BigQuery?

BigQuery is a serverless, highly scalable columnar data warehouse optimized for analytical queries. It supports SQL-based querying, integrations with Google Cloud, and real-time analytics.

What is consistency in the context of distributed databases like Cassandra?

Consistency ensures data uniformity across replicas. In Cassandra, it is tunable, allowing developers to trade off strict consistency for lower latency and higher availability.

How does HBase handle large-scale data storage?

HBase uses a distributed, column-family-oriented design built on Hadoop HDFS. It supports sparse datasets with billions of rows and columns, optimized for write-heavy workloads.

What are the main use cases for document-oriented databases?

Document-oriented databases are ideal for applications requiring flexible, nested data structures (e.g., JSON). Common use cases include content management systems, user profiles, and IoT data storage.

What is the purpose of sharding in MongoDB?

Sharding in MongoDB distributes data across multiple servers or clusters to handle massive datasets and improve horizontal scalability. Data is partitioned using shard keys.