DS-200 Online Practice Questions and Answers

Questions 4

Why should you stop an iterative machine learning algorithm as soon as the performance of the model on a test set stops improving?

A. To avoid the need for cross-validating the model

B. To prevent overfitting

C. To increase the VC (Vapnik-Chervonenkis) dimension for the model

D. To keep the number of terms in the model as small as possible

E. To maintain the highest VC (Vapnik-Chervonenkis) dimension for the model
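
The stopping rule described in the question can be sketched as follows. This is a minimal illustration, not a reference implementation: the `step` and `evaluate` callbacks, the `patience` parameter, and the toy score curve are all assumptions made for the example.

```python
def train_with_early_stopping(step, evaluate, max_iters=1000, patience=5):
    """Run training iterations until the held-out score stops improving.

    step(i)    -- hypothetical callback that performs one training iteration
    evaluate() -- hypothetical callback returning a held-out score (higher is better)
    patience   -- how many non-improving iterations to tolerate before stopping
    """
    best_score = float("-inf")
    best_iter = -1
    for i in range(max_iters):
        step(i)
        score = evaluate()
        if score > best_score:
            # New best held-out score: remember where it occurred.
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop training.
            break
    return best_iter, best_score


# Toy usage: a held-out score that rises, peaks at iteration 10, then falls.
state = {"i": 0}


def toy_step(i):
    state["i"] = i


def toy_evaluate():
    return -(state["i"] - 10) ** 2


best_iter, best_score = train_with_early_stopping(toy_step, toy_evaluate)
```

In the toy run the loop keeps the iteration with the best held-out score and gives up a few iterations after the score starts to fall.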

Browse 60 Q&As

Questions 5

What is the default delimiter for Hive tables?

A. ^A (Control-A)

B. , (comma)

C. \t (tab)

D. : (colon)

Questions 6

Refer to the exhibit.

Which point in the figure is the mode?

A. A

B. B

C. C

Questions 7

You are building a k-nearest neighbor (k-NN) classifier on a labeled set of points in a high-dimensional space. You determine that the classifier has a large error on the training data. What is the most likely problem?

A. High-dimensional spaces effectively make local neighborhoods global

B. k-NN computation does not converge in high dimensions

C. k was too small

D. The VC-dimension of a k-NN classifier is too high

Questions 8

Which best describes the primary function of Flume?

A. Flume is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with an infrastructure consisting of sources and sinks for importing and evaluating large data sets

B. Flume acts as a Hadoop filesystem for log files

C. Flume imports data from SQL/relational databases into your Hadoop cluster

D. Flume provides a query language for Hadoop similar to SQL

E. Flume is a distributed service for collecting and moving large amounts of data into HDFS as it is produced from streaming data flows

Questions 9

You have a directory containing a number of comma-separated files. Each file has three columns and each filename has a .csv extension. You want to have a single tab-separated file (all.tsv) that contains all the rows from all the files.

Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?

A. find . -name '*.csv' -print0 | xargs -0 cat | tr ',' '\t' > all.tsv

B. find . -name '*.csv' | cat | awk 'BEGIN {FS = "," OFS = "\t"} {print $1, $2, $3}' > all.tsv

C. find . -name '*.csv' | tr ',' '\t' | cat > all.tsv

D. find . -name '*.csv' | cat > all.tsv

E. cat *.csv > all.tsv
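
For comparison, the same task (merging every .csv file into one tab-separated file) can be sketched in Python. This is only an illustrative sketch; the function name `merge_csv_to_tsv` and its arguments are assumptions for the example. Files are opened one at a time, so the number of input files is not constrained by the shell's argument-length limit.

```python
import csv
from pathlib import Path


def merge_csv_to_tsv(directory, output_path):
    """Concatenate every *.csv under `directory` into one tab-separated file.

    Returns the number of rows written.
    """
    rows_written = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        # rglob walks the directory tree, much like `find . -name '*.csv'`.
        for path in sorted(Path(directory).rglob("*.csv")):
            with open(path, newline="") as src:
                for row in csv.reader(src):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `merge_csv_to_tsv(".", "all.tsv")` would write the merged rows to all.tsv in the current directory.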

Questions 10

A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week. One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate how hard a particular bug is to fix?

A. The tech lead's estimate of how many hours would be needed to fix the bug.

B. The priority of the bug according to the project manager

C. The number of years that the engineer who was assigned the bug has worked at the company

D. The number of bugs that had been found in each sub-component of the project
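
The arithmetic in the question (an average of five bugs per engineer with nobody at exactly five) can be checked with a small sketch. The per-engineer counts below are invented purely for illustration; only the totals of 20 engineers and 100 bugs come from the question.

```python
from statistics import mean

# Hypothetical per-engineer counts: 20 engineers, 100 bugs fixed in total.
bugs_fixed = [1, 2, 3, 4, 6, 7, 8, 9] * 2 + [4, 6, 4, 6]

engineers = len(bugs_fixed)
total_bugs = sum(bugs_fixed)
average = mean(bugs_fixed)
anyone_at_average = average in bugs_fixed
```

The mean summarizes the team as a whole and need not match any individual engineer's count.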

Questions 11

You are building a system to perform outlier detection for a large online retailer. You need to build a system to detect if the total dollar value of sales is outside the norm for each U.S. state, as determined from the physical location of the buyer for each purchase. The retailer's data sources are scattered across multiple systems and databases and are unorganized, with little coordination or shared data or keys between the various data sources.

Below are the sources of data available to you. Determine which three will give you the smallest set of data sources but still allow you to implement the outlier detector by state.

A. Database of employees that includes only the employee ID, start date, and department

B. Database of users that contains only their user ID, name, and a list of every item the user has viewed

C. Transaction log that contains only basket ID, basket amount, time of sale completion, and a session ID

D. Database of user sessions that includes only session ID, corresponding user ID, and the corresponding IP address

E. External database mapping IP addresses to geographic locations

F. Database of items that includes only the item name, item ID, and warehouse location

G. Database of shipments that includes only the basket ID, shipment address, shipment date, and shipment method

Questions 12

How can the naiveté of the naive Bayes classifier be advantageous?

A. It does not require you to make strong assumptions about the data because it is non-parametric

B. It significantly reduces the size of the parameter space, thus reducing the risk of overfitting

C. It allows you to reduce bias with no tradeoff in variance

D. It guarantees convergence of the estimator
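
To make the notion of parameter-space size concrete, the sketch below counts the likelihood parameters needed with and without the conditional-independence assumption. It assumes binary features and leaves the class priors out of both counts; the function names are hypothetical.

```python
def full_joint_params(n_features, n_classes):
    """Parameters for an unrestricted joint distribution over binary features.

    Each class needs one probability per feature combination, minus one
    because the probabilities must sum to 1.
    """
    return n_classes * (2 ** n_features - 1)


def naive_bayes_params(n_features, n_classes):
    """Parameters under the naive (conditional independence) assumption.

    Each class needs a single Bernoulli parameter per binary feature.
    """
    return n_classes * n_features


# Example: 20 binary features and 2 classes.
full_count = full_joint_params(20, 2)
naive_count = naive_bayes_params(20, 2)
```

With 20 binary features and 2 classes, the unrestricted model needs 2,097,150 parameters while the naive model needs 40.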

Questions 13

You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers' access logs into your Hadoop cluster for analysis?

A. Sample the logs from the web servers and copy them into HDFS using curl

B. Channel these click streams into Hadoop using Hadoop Streaming

C. Write a MapReduce job with the web servers for mappers and the Hadoop cluster nodes for reducers

D. Import all user clicks from your OLTP databases into Hadoop using Sqoop

E. Ingest the web server logs into HDFS using Flume

Exam Code: DS-200

Exam Name: Data Science Essentials

Last Update: May 14, 2024

Questions: 60 Q&As