Q: 1 Which of the following is not a data classification technique?
Bayesian belief networks
Support Vector Machine
KNN (K-Nearest Neighbours)
Principal component analysis
[ Option D ]
Data classification is a type of supervised learning where the goal is to assign data points to predefined classes based on training data.
| TECHNIQUE NAME | DESCRIPTION |
|---|---|
| Bayesian Belief Networks | Probabilistic models that classify data based on conditional dependencies between variables; useful for uncertain or probabilistic scenarios. |
| Support Vector Machine (SVM) | Finds the optimal hyperplane that separates data points into different classes, works well for high-dimensional data. |
| K-Nearest Neighbors (KNN) | Classifies a data point based on the majority class of its nearest neighbors in the feature space, simple and intuitive. |
| Decision Trees | Builds a tree-like model of decisions and their possible consequences to classify data, interpretable and widely used. |
| Random Forest | An ensemble of decision trees that improves classification accuracy by aggregating multiple trees' predictions. |
| Neural Networks | Models complex relationships using layers of interconnected nodes; suitable for large and complex datasets. |
Principal Component Analysis (PCA) is not a classification technique. PCA is a dimensionality reduction technique used to reduce the number of features while preserving variance.
Q: 2 What is the total number of non-empty subsets of a 100-item frequent itemset?
100
2^100
2^100 − 1
100!
[ Option C ]
For a set containing n items, the total number of subsets is given by 2^n. This includes the empty set. To find the number of non-empty subsets, we subtract the empty set, i.e., 2^n − 1.
Here, the itemset has 100 items, so the total number of non-empty subsets is 2^100 − 1.
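To make the count concrete, here is a minimal Python sketch (not part of the original question) that verifies the 2^n − 1 formula by brute-force enumeration on a small set and then applies the formula for n = 100:

```python
from itertools import combinations

def count_nonempty_subsets(items):
    """Count non-empty subsets by enumerating all combinations of size 1..n."""
    n = len(items)
    return sum(len(list(combinations(items, k))) for k in range(1, n + 1))

small = ["A", "B", "C", "D"]
assert count_nonempty_subsets(small) == 2 ** len(small) - 1  # 15 non-empty subsets

# Enumerating subsets of a 100-item itemset is infeasible, so we use the formula:
print(2 ** 100 - 1)  # 1267650600228229401496703205375
```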
Q: 3 Which of the following statements is incorrect?
OLTP adopts Entity-Relationship model
OLAP adopts star or snowflake model
OLTP consists of read-only operations
OLTP consists of short, atomic transactions
[ Option C ]
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) serve different purposes in database systems:
OLTP: adopts the Entity-Relationship model and consists of short, atomic transactions that involve both reads and writes, so the statement that OLTP consists of read-only operations is incorrect.
OLAP: adopts the star or snowflake model and serves analytical workloads that are mostly read-only.
Q: 4 In the context of data warehousing, the semantic heterogeneity and structure of data are challenges in which of the following?
Data reduction
Data integration
Data cleaning
Data transformation
[ Option B ]
In a Data Warehouse, data comes from multiple heterogeneous sources, such as different databases, formats, and structures. The process of combining this data into a single, consistent view is known as Data Integration.
Q: 5 In the context of mining descriptive statistical measures of data, which of the following sets represents the measures of central tendency and the measures of dispersion of data, respectively?
{Mean, Mode, Range}, {Median, Variance, Standard Deviation}
{Mean, Mode, Median}, {Range, Variance, Standard Deviation}
{Median, Variance, Standard Deviation}, {Mean, Mode, Range}
{Mean, Range, Variance}, {Mode, Median, Standard Deviation}
[ Option B ]
In descriptive statistics, data is analyzed using two main types of measures: measures of central tendency and measures of dispersion.
Measures of Central Tendency describe the center or average of a data set and indicate where most of the data values lie. Common examples include the mean, median, and mode.
Measures of Dispersion describe the spread or variability of data, i.e., how much the data values differ from the central value. Common examples include the range, variance, and standard deviation.
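For a quick illustration, here is a minimal Python sketch (standard library only, on a made-up sample) that computes the three common measures of central tendency and the three common measures of dispersion listed above:

```python
import statistics

data = [2, 4, 4, 7, 9, 10, 13]   # made-up sample values

# Measures of central tendency
print("mean  :", statistics.mean(data))    # 7.0
print("median:", statistics.median(data))  # 7
print("mode  :", statistics.mode(data))    # 4 (most frequent value)

# Measures of dispersion
print("range :", max(data) - min(data))        # 11
print("var   :", statistics.pvariance(data))   # population variance
print("stdev :", statistics.pstdev(data))      # population standard deviation
```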
Q: 6 In the context of multidimensional data models for a data warehouse, the fact table contains:
Names of the facts as well as keys to each of the related dimension tables.
List of dimensions.
List of users or experts.
Fact table is abstract and it remains empty.
[ Option A ]
In a multidimensional data model used in a data warehouse, data is organized into fact tables and dimension tables.
The Fact Table is the central table that contains quantitative data, also called measures or facts, such as sales amount, quantity, or profit.
In addition to the facts, the fact table contains foreign keys that link it to the associated dimension tables.
Dimension Tables provide descriptive context for the facts, such as time, product, customer, or location.
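As a concrete illustration, the sketch below (assuming pandas is available; table contents are purely illustrative) builds a tiny fact table whose columns are the measures plus foreign keys into two dimension tables, and joins them back together:

```python
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["Pen", "Book"]})
dim_time = pd.DataFrame({"time_id": [10, 11], "month": ["Jan", "Feb"]})

# Fact table: measures (units_sold, revenue) + foreign keys to each dimension table
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "time_id": [10, 10, 11],
    "units_sold": [5, 2, 7],
    "revenue": [10.0, 30.0, 14.0],
})

# Joining the fact table with its dimension tables reconstructs the analytical view
view = fact_sales.merge(dim_product, on="product_id").merge(dim_time, on="time_id")
print(view)
```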
Q: 7 If a decision tree classifier keeps expanding until every training instance is correctly classified, but the test error rate begins to increase, what is the most likely outcome?
Underfitting
Generalization
Overfitting
Cross-validation error minimized
[ Option C ]
When a decision tree is grown to perfectly classify every training instance, it may start capturing noise and random fluctuations in the training data rather than just the underlying patterns. This results in a phenomenon called overfitting, where the model performs exceptionally well on the training set but fails to generalize to new, unseen data.
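A minimal sketch, assuming scikit-learn is available, of the typical overfitting pattern: training accuracy keeps rising as the tree is allowed to grow deeper, while test accuracy stalls or drops:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification data
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (2, 4, 8, None):  # None lets the tree expand until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```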
Q: 8 In data warehouse technology, a multiple dimensional view can be implemented using different OLAP storage models. Which of the following correctly distinguishes between ROLAP, MOLAP, and HOLAP?
ROLAP uses multidimensional arrays; MOLAP uses relational tables; HOLAP uses only materialized views.
ROLAP uses a relational or extended-relational DBMS to store and manage warehouse data; MOLAP uses array-based multidimensional storage engines; HOLAP combines ROLAP and MOLAP technology.
ROLAP is faster than MOLAP at indexing summarized data; MOLAP supports sparse data better than ROLAP; HOLAP doesn't support drill-down operations.
ROLAP can only support numeric data types; MOLAP supports single-level storage; HOLAP requires columnar storage engines.
[ Option B ]
In Data Warehouse technology, OLAP (Online Analytical Processing) systems provide multidimensional views of data for fast analysis. There are three main OLAP storage models.
| OLAP TYPE | STORAGE | DATA REPRESENTATION | ADVANTAGES | REMARK |
|---|---|---|---|---|
| ROLAP (Relational OLAP) | Relational or extended-relational databases. | Data stored in relational tables; multidimensional views generated using SQL. | Scales well for large datasets, supports detailed data. | Slower query performance on aggregated data. |
| MOLAP (Multidimensional OLAP) | Specialized multidimensional storage engines. | Data stored in arrays / cubes. | Fast query performance, efficient aggregation and summarization. | Handles sparse data well using compression. |
| HOLAP (Hybrid OLAP) | Combines relational tables and multidimensional cubes. | Detailed data in ROLAP, aggregated data in MOLAP. | Balances storage efficiency and query performance. | Provides both scalability and speed. |
Q: 9 Which of the following statement(s) is/are true about OLAP?
I. These systems have a larger number of users than database systems.
II. Accesses to these systems are mostly read-only operations.
Only I
Only II
Both I and II
Neither I nor II
[ Option B ]
OLAP (Online Analytical Processing) systems are designed for complex analysis of large volumes of data. They are optimized for query performance and analytical operations, rather than for handling large numbers of concurrent users.
OLAP systems primarily involve read-only operations, such as slicing, dicing, and aggregating data, rather than frequent updates or inserts.
Q: 10 Which of the following techniques cannot be used for removal of noise from data?
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin complement
Smoothing by bin boundaries
[ Option C ]
Noise removal in data preprocessing aims to reduce errors or random variations in datasets.
SMOOTHING BY BIN MEANS: Replaces each value in a bin with the mean of the bin to reduce variability.
SMOOTHING BY BIN MEDIANS: Replaces each value with the median of the bin, which is robust to outliers.
SMOOTHING BY BIN BOUNDARIES: Replaces values with the closest boundary (min or max) of the bin to limit extreme values.
"Smoothing by bin complement" is not a recognized noise-removal technique, which is why option C is correct.
Q: 11 Match the clustering approach (Column 1) with its correct description (Column 2):
| Column 1 (Clustering Approach) | Column 2 (Description) |
|---|---|
| 1. Agglomerative Method | A. Begins with each data object as its own cluster and merges them iteratively. |
| 2. Divisive Method | B. Uses density rather than distance to form clusters, enabling discovery of arbitrary shapes. |
| 3. Density Based Method | C. Starts with all data in one cluster and then recursively splits into smaller clusters. |
1 – A, 2 – C, 3 – B
1 – C, 2 – A, 3 – B
1 – B, 2 – C, 3 – A
1 – C, 2 – B, 3 – A
[ Option A ]
Clustering approaches can be categorized based on how they form groups of data.
| CLUSTERING APPROACH | DESCRIPTION |
|---|---|
| Agglomerative Method | Bottom-up hierarchical approach, begins with each data object as its own cluster and merges them iteratively. |
| Divisive Method | Top-down hierarchical approach, starts with all data in one cluster and recursively splits into smaller clusters. |
| Density-Based Method | Forms clusters based on density rather than distance, allowing discovery of arbitrarily shaped clusters and handling noise. |
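For a concrete comparison, here is a minimal sketch (assuming scikit-learn and NumPy, on synthetic data) that runs a bottom-up agglomerative method and a density-based method (DBSCAN) side by side:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
# Two small, well-separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])

agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)   # merges clusters bottom-up
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)         # groups dense regions; -1 = noise

print("agglomerative cluster sizes:", np.bincount(agglo_labels))
print("dbscan cluster sizes       :", np.bincount(dbscan_labels[dbscan_labels >= 0]))
```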
Q: 12 The 0-D cuboid, which holds the highest level of summarization, is also known as:
Base cuboid
Apex cuboid
Intermediate cuboid
Multi-dimensional cube
[ Option B ]
In data warehousing and OLAP, a cuboid represents a specific level of aggregation in a multidimensional cube.
The 0-D cuboid is the highest level of summarization, meaning it aggregates data across all dimensions, providing only a single summarized value for the entire dataset. This cuboid is also called the Apex Cuboid because it sits at the top of the aggregation lattice.
Q: 13 In the context of data warehousing, suppose 'smoothing by bin boundaries' is applied for data cleaning on the data [4, 8, 15, 21, 21, 24, 25, 28, 34] with equal-frequency bins of size 3 (namely bin1, bin2, and bin3). After smoothing, the bin2 data is given by:
21, 21, 24
22, 22, 22
21, 24, 24
21, 22.5, 24
[ Option A ]
Smoothing by bin boundaries is a data cleaning technique used to reduce the effect of noise or outliers in a dataset. The process involves dividing data into bins and then replacing each value in a bin with the closest bin boundary value, either the minimum or the maximum of the bin.
Given the data [4, 8, 15, 21, 21, 24, 25, 28, 34] and equal-frequency bins of size 3:
Bin1 = [4, 8, 15], Bin2 = [21, 21, 24], Bin3 = [25, 28, 34].
Smoothing by bin boundaries for Bin2:
Bin boundaries: Min = 21, Max = 24.
Replace each value in Bin2 with the nearest boundary: 21 → 21, 21 → 21, 24 → 24.
Finally, the smoothed Bin2 is [21, 21, 24].
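The same computation as a minimal Python sketch, with each value replaced by the nearer of the two bin boundaries:

```python
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

# Equal-frequency bins of size 3: [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

smoothed = []
for b in bins:
    lo, hi = min(b), max(b)
    # Replace each value with whichever boundary it is closer to (ties go to the lower boundary)
    smoothed.append([lo if (v - lo) <= (hi - v) else hi for v in b])

print(smoothed[1])  # Bin2 -> [21, 21, 24]
```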
Q: 14 Which of the following defines the measure 'precision' in the context of metrics for evaluating classifier performance, if TP, TN, FP, FN refer to the number of true positive, true negative, false positive and false negative respectively?
TP/(TP+FP)
TN/(TN+FN)
TP/(TN+FN)
TN/(TP+FP)
[ Option A ]
In classification problems, evaluating the performance of a classifier involves several metrics, one of which is Precision. Precision measures the accuracy of positive predictions.
Mathematically, precision is defined as TP / (TP + FP).
Where:
TP = the number of positive tuples correctly labeled as positive (true positives).
FP = the number of negative tuples incorrectly labeled as positive (false positives).
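A minimal Python sketch with hypothetical confusion-matrix counts, showing how precision is computed (recall is included only for contrast):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

tp, fp, fn = 80, 20, 10  # hypothetical confusion-matrix counts
print(f"precision = {precision(tp, fp):.2f}")  # 80 / (80 + 20) = 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # 80 / (80 + 10) = 0.89
```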
Q: 15 Which of the following statement(s) is/are true about schemas for multidimensional data models?
I. The dimension tables of the star schema model are kept in normalized form to reduce redundancies.
II. Multiple fact tables share dimension tables in the snowflake schema.
Only I
Only II
Both I and II
Neither I nor II
[ Option D ]
In a star schema, dimension tables are denormalized; it is the snowflake schema that keeps dimension tables in normalized form to reduce redundancy. A schema in which multiple fact tables share dimension tables is a fact constellation (galaxy) schema, not a snowflake schema. Hence neither statement is true.
Q: 16 In the context of data warehousing, which of the following is not a data transformation strategy?
Normalization
Discretization
Attribute construction
Wavelet transforms
[ Option D ]
In data warehousing, data transformation is the process of converting data from its original form into a format suitable for analysis. Normalization, discretization, and attribute construction are all data transformation strategies, whereas wavelet transforms are a dimensionality-reduction technique used for data reduction, not data transformation.
Q: 17 How does the time complexity of the K-means algorithm change as the number of clusters K increases?
Time complexity decreases linearly with K.
Time complexity remains constant with respect to K.
Time complexity increases linearly with K.
Time complexity increases exponentially with K.
[ Option C ]
The K-Means algorithm works by repeatedly assigning data points to clusters and updating the cluster centroids until convergence. The time complexity of K-Means is approximately O(n × K × I × d).
Where:
n = number of data points.
K = number of clusters.
I = number of iterations.
d = number of dimensions.
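The sketch below (assuming NumPy; not an optimized implementation) makes the O(n × K × I × d) cost visible: each of the I iterations compares all n points against all K centroids across all d dimensions:

```python
import numpy as np

def kmeans(X, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]          # K initial centroids
    for _ in range(iters):                                       # I iterations
        # n x K distance matrix: every point vs every centroid, each entry costs O(d)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # assignment step
        for k in range(K):                                       # centroid update step
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

X = np.random.default_rng(1).normal(size=(300, 2))
centroids, labels = kmeans(X, K=3)
print(np.bincount(labels))
```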
Q: 18 Which of the following techniques is NOT used to improve the efficiency of the Apriori algorithm?
Hash-based technique
Transaction reduction
Partitioning
Selection
[ Option D ]
The Apriori algorithm is a popular method for mining frequent itemsets in a dataset. Since generating candidate itemsets and scanning the database repeatedly can be computationally expensive, several techniques are used to improve its efficiency:
| TECHNIQUE | DESCRIPTION |
|---|---|
| Hash-based technique | Uses hash tables to reduce the number of candidate itemsets counted. |
| Transaction reduction | Removes transactions that do not contain frequent items to reduce database scans. |
| Partitioning | Divides the dataset into smaller partitions to find local frequent itemsets and merge them. |
"Selection" is not one of the techniques used to improve the efficiency of the Apriori algorithm.
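To make one of the listed techniques concrete, here is a simplified Python sketch of transaction reduction on a made-up set of transactions; it is an illustration only, not a full Apriori implementation:

```python
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "e"}]
min_support = 2

# Count support of individual items and keep the frequent 1-itemsets
counts = {}
for t in transactions:
    for item in t:
        counts[item] = counts.get(item, 0) + 1
frequent_items = {i for i, c in counts.items() if c >= min_support}

# Transaction reduction: drop infrequent items, then drop transactions that are
# too short to support any frequent 2-itemset in later scans
reduced = [t & frequent_items for t in transactions]
reduced = [t for t in reduced if len(t) >= 2]

print("frequent 1-itemsets:", frequent_items)   # {'a', 'b', 'c'}
print("reduced transactions:", reduced)
```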
Q: 19 OLAP stands for:
Online Analysis Processing
Online Analytical Processing
Online Application Program
Open Line Application Program
[ Option B ]
OLAP stands for Online Analytical Processing. OLAP systems support complex, mostly read-only analysis of large volumes of data, in contrast to OLTP (Online Transaction Processing) systems, which handle day-to-day transactional workloads.
Q: 20 What does a quantile-quantile (Q-Q) plot display?
It displays all of the data for the given attribute and plots quantile information.
It is a graphical method for summarizing the distribution of a given attribute.
The quantiles of one univariate distribution against the corresponding quantiles of another.
It is a useful method for providing a first look at bivariate data to see clusters of points and outliers.
[ Option C ]
A Q-Q plot is used to compare two distributions by plotting the quantiles of one distribution against the corresponding quantiles of another. If the points fall roughly along a straight line, the distributions are similar. Q-Q plots are commonly used to check whether data follows a theoretical distribution or to compare two datasets.
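A minimal sketch (assuming NumPy, with synthetic samples) of the idea behind a Q-Q plot: compute matching quantiles of two samples and compare them; plotting one set of quantiles against the other gives the Q-Q plot itself:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0, scale=1, size=1000)
sample_b = rng.normal(loc=0, scale=1, size=1000)   # drawn from the same distribution

probs = np.linspace(0.05, 0.95, 10)                # quantile levels to compare
q_a = np.quantile(sample_a, probs)
q_b = np.quantile(sample_b, probs)

for p, qa, qb in zip(probs, q_a, q_b):
    print(f"q={p:.2f}:  A={qa:+.2f}  B={qb:+.2f}")
# Plotting q_a against q_b (e.g. with matplotlib) gives the Q-Q plot; points near
# the line y = x indicate that the two distributions are similar.
```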
Q: 21 Suppose the original training set contains 100 positive and 1000 negative tuples. After under-sampling, the new training set will contain:
1000 positive and 1000 negative tuples
100 positive and 100 negative tuples
550 positive and 550 negative tuples
All positive tuples only
[ Option B ]
Under-sampling is a technique used to address class imbalance in a training dataset. When one class significantly outnumbers the other class, under-sampling reduces the number of instances in the majority class to match the number of instances in the minority class.
In this case, the minority (positive) class has 100 tuples, so the majority (negative) class is randomly reduced to 100 tuples as well, giving a balanced training set of 100 positive and 100 negative tuples.
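A minimal Python sketch of random under-sampling using the counts from the question (the tuples themselves are placeholders):

```python
import random

random.seed(0)
positives = [("pos", i) for i in range(100)]     # minority class: 100 tuples
negatives = [("neg", i) for i in range(1000)]    # majority class: 1000 tuples

# Keep all minority tuples; randomly keep an equal number of majority tuples
negatives_sampled = random.sample(negatives, k=len(positives))
balanced = positives + negatives_sampled

print(len(positives), len(negatives_sampled))    # 100 100
```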
Thank you so much for taking the time to read my Computer Science MCQs section carefully. Your support and interest mean a lot, and I truly appreciate you being part of this journey. Stay connected for more insights and updates! If you'd like to explore more tutorials and insights, check out my YouTube channel.
Don’t forget to subscribe and stay connected for future updates.