
{

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## Question 1\n",

"\n",

"\n",

"This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n",

"\n",

"On Canvas, you will see a CSV file named \"THA_diamonds.csv\". This file is a small subset of a real dataset on diamond prices in a [Kaggle competition](https://www.kaggle.com/shivam2503/diamonds). You will use this dataset for this question and the next question. \n",

"\n",

"**Some Background Information:** In our version of the dataset, the `price` feature has been discretized as `low`, `medium`, and `high`, and `premium`. If you are interested, these levels correspond to the following price ranges in the actual diamonds dataset:\n",

"- `low` price: price between \\\\$1000 and \\\\$2000\n",

"- `medium` price: price between \\\\$2000 and \\\\$3000\n",

"- `high` price: price between \\\\$3000 and \\\\$3500\n",

"- `premium` price: price between \\\\$3500 and \\\\$4000\n",

"\n",

"**Question Overview:** For this question, you will use the (unweighted) KNN algorithm for predicting the `carat` (numerical) target feature for the following single observation using the **Euclidean distance** metric with different number of neighbors:\n",

"- `cut` = good\n",

"- `color` = D\n",

"- `depth` = 60\n",

"- `price` = premium\n",

"- (`carat` = 0.71 but you will pretend that you do not have this information)\n",

"\n",

"In practice, you would use cross-validation or train-test split for determining optimal values of KNN hyperparameters. **However, as far as this assessment is concerned, you are to use entire data for training.**\n",

"\n",

"\n",

"### Part A (15 points)\n",

"Prepare your dataset for KNN modeling. Specifically, \n",

"1. Perform one-hot encoding of the categorical descriptive features in the input dataset.\n",

"2. Scale your descriptive features to be between 0 and 1.\n",

"3. Display the **last** 10 rows after one-hot encoding and scaling.\n",

"\n",

"**IMPORTANT NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.**"

]

},
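
{

"cell_type": "markdown",

"metadata": {},

"source": [

"The one-hot encoding and min-max scaling steps above can be sketched as follows. This is a minimal illustration rather than the required solution; it assumes a pandas `DataFrame` named `df` loaded from the CSV, with `carat` as the target:\n",

"\n",

"```python\n",

"import pandas as pd\n",

"\n",

"df = pd.read_csv('THA_diamonds.csv')\n",

"\n",

"# 1. One-hot encode the categorical descriptive features\n",

"df_encoded = pd.get_dummies(df.drop(columns=['carat']))\n",

"\n",

"# 2. Min-max scale every descriptive feature to the [0, 1] range\n",

"df_scaled = (df_encoded - df_encoded.min()) / (df_encoded.max() - df_encoded.min())\n",

"\n",

"# 3. Display the last 10 rows after encoding and scaling\n",

"df_scaled.tail(10)\n",

"```"

]

},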

{

"cell_type": "markdown",

"metadata": {},

"source": [

"**NOTE:** For Parts (B), (C), and (D) below, you are **not** allowed to use the `KNeighborsRegressor()` in Scikit-Learn module, but rather use manual calculations (via either Python or Excel). That is, you will need to show and explain all your solution steps **without** using Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes. \n",

"\n",

"### Part B (5 points)\n",

"What is the prediction of the 1-KNN algorithm (i.e., k=1 in KNN) for the `carat` target feature using your manual calculations (using the Euclidean distance metric) for the single observation given above?\n",

"\n",

"### Part C (5 points)\n",

"What is the prediction of the 5-KNN algorithm?\n",

"\n",

"### Part D (5 points)\n",

"What is the prediction of the 10-KNN algorithm?\n",

"\n",

"\n",

"### Part E (15 points)\n",

"\n",

"This part (E) is an exception to the solution mode instructions for this question. In particular, you will need to use the `KNeighborsRegressor()` in Scikit-Learn to perform the same predictions in each Part (B) to (D). That is, \n",

"- What is the prediction of the 1-KNN algorithm using `KNeighborsRegressor()`?\n",

"- What is the prediction of the 5-KNN algorithm using `KNeighborsRegressor()`?\n",

"- What is the prediction of the 10-KNN algorithm using `KNeighborsRegressor()`?\n",

"\n",

"Are you able to get the same results as in your manual calculations? Please explain.\n",

"\n",

"\n",

"### Part F: Wrap-up (5 points)\n",

"\n",

"**IMPORTANT NOTE: This Wrap-up section is mandatory. That is, for Parts (B) to (E) (inclusive), you will not get any points for solutions not presented in the table format explained below.** \n",

"\n",

"Add and display two tables called **\"df_summary_manual\"** and **\"df_summary_sklearn\"** respectively:\n",

"- For the table **\"df_summary_manual\"**, you will report your results for Parts (B) to (D) using your manual calculations.\n",

"- For the table **\"df_summary_sklearn\"**, you will report your results for the 3 predictions in Part (E) using `KNeighborsRegressor()`.\n",

"\n",

"\n",

"Each of these tables need to have the following 3 columns:\n",

"- method\n",

"- prediction for the observation given (to be rounded to 3 decimal places)\n",

"- is_best (True or False - only the best prediction's is_best flag needs to be True and all the others need to be False)\n",

"\n",

"Your table needs to have 3 rows (one for each method) in each table that summarizes your results. These tables should look like below:\n",

"\n",

"|method | prediction | is_best |\n",

"|---|---|---\n",

"|1-KNN | ? | ? | ? |\n",

"|5-KNN | ? | ? | ? |\n",

"|10-KNN | ? | ? | ? |\n",

"\n",

"In case of a Pandas data frame, you can populate this data frame line by line by referring to Cell #6 in our [Pandas tutorial](https://www.featureranking.com/tutorials/python-tutorials/pandas/).\n"

]

},
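
{

"cell_type": "markdown",

"metadata": {},

"source": [

"As background for the manual calculations in Parts (B) to (D), unweighted KNN regression can be sketched as below. All names are hypothetical: `df_scaled` is the prepared data from Part A, and `query` is the given observation, one-hot encoded and scaled the same way:\n",

"\n",

"```python\n",

"import numpy as np\n",

"\n",

"# Euclidean distance from the query observation to every training row\n",

"distances = np.sqrt(((df_scaled.values - query) ** 2).sum(axis=1))\n",

"\n",

"# Unweighted k-NN prediction: the mean carat of the k nearest rows\n",

"for k in (1, 5, 10):\n",

"    nearest = np.argsort(distances)[:k]\n",

"    print(k, round(df['carat'].iloc[nearest].mean(), 3))\n",

"```"

]

},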

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## Question 2"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"This question is inspired from Exercise 3 in Chapter 4 in the textbook. \n",

"\n",

"You will use the same CSV file as in Question 1 named \"THA_diamonds.csv\". You will build a simple decision tree with **depth 1** using this dataset for predicting the `price` (categorical) target feature using the **Entropy** split criterion. \n",

"\n",

"To clarify, for Question 1, your target feature will be `carat` whereas for this Question 2, your target feature will be `price`."

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part A (10 points)\n",

"\n",

"The dataset for this question has 2 numerical descriptive features, `carat` and `depth`. \n",

"1. Discretize these 2 features separately as \"category_1\", \"category_2\", and \"category_3\" respectively using the *equal-frequency binning* technique. \n",

"2. Display the first 10 rows after discretization of these two features.\n",

"\n",

"After this discretization, all features in your dataset will be categorical (which we will assume to be **\"nominal categorical\"**). \n",

"\n",

"For this question, please do **NOT** perform any one-hot-encoding of the categorical descriptive features nor any scaling. Also, please do **NOT** perform any train-test splits.\n",

"\n",

"**IMPORTANT NOTE: If your discretizations are incorrect, you will not get full credit for a correct follow-through.**"

]

},
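
{

"cell_type": "markdown",

"metadata": {},

"source": [

"Equal-frequency binning places (roughly) the same number of rows in each bin; with three bins, the cut points are the 1/3 and 2/3 quantiles of the feature. A hedged sketch using pandas, with `df` as in Question 1:\n",

"\n",

"```python\n",

"import pandas as pd\n",

"\n",

"labels = ['category_1', 'category_2', 'category_3']\n",

"for col in ['carat', 'depth']:\n",

"    df[col] = pd.qcut(df[col], q=3, labels=labels)\n",

"\n",

"df.head(10)\n",

"```"

]

},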

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part B (5 points)\n",

"\n",

"Compute the impurity of the `price` target feature."

]

},
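
{

"cell_type": "markdown",

"metadata": {},

"source": [

"For a target with class proportions $p_1, \\dots, p_k$, the entropy is $H = -\\sum_i p_i \\log_2 p_i$. A minimal sketch of this computation, using the same hypothetical `df`:\n",

"\n",

"```python\n",

"import numpy as np\n",

"\n",

"# Class proportions of the price target\n",

"probs = df['price'].value_counts(normalize=True)\n",

"\n",

"# Entropy (impurity) in bits\n",

"impurity = -(probs * np.log2(probs)).sum()\n",

"```"

]

},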

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part C (20 points)\n",

"\n",

"**IMPORTANT NOTE: For Parts C and D below, you will not get any points for solutions not presented in the required table format.** \n",

"\n",

"In this part, you will determine the root node for your decision tree.\n",

"\n",

"Your answer to this part needs to be a table and it needs to be called **\"df_splits\"**. Also, it needs to have the following 4 columns:\n",

"- split\n",

"- remainder\n",

"- info_gain\n",

"- is_optimal (True or False - only the optimal split's is_optimal flag needs to be True and the others need to be False)\n",

"\n",

"In your **\"df_splits\"** table, you should have **one row for each descriptive feature in the dataset**. As an example for your **\"df_splits\"** table, consider the `spam prediction` example in Table 4.2 in the textbook (**FIRST** Edition) on page 121, which was also covered in lectorials. The `df_splits` table would look something like the table below.\n",

"\n",

"|split| remainder | info_gain| is_optimal |\n",

"|---|---|---|---|\n",

"|suspicious words | ? | ? | True |\n",

"|unknown sender | ? | ? | False |\n",

"|contains images | ? | ? | False |\n",

"\n",

"**HINT:** Your `df_splits` table should have 4 rows."

]

},
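
{

"cell_type": "markdown",

"metadata": {},

"source": [

"For each candidate split, the remainder is the entropy of each resulting partition weighted by that partition's share of the rows, and the information gain is the target entropy minus the remainder. A sketch for a single feature (helper names are hypothetical):\n",

"\n",

"```python\n",

"import numpy as np\n",

"\n",

"def entropy(series):\n",

"    p = series.value_counts(normalize=True)\n",

"    return -(p * np.log2(p)).sum()\n",

"\n",

"def remainder(df, feature, target='price'):\n",

"    weights = df[feature].value_counts(normalize=True)\n",

"    return sum(w * entropy(df.loc[df[feature] == level, target])\n",

"               for level, w in weights.items())\n",

"\n",

"info_gain = entropy(df['price']) - remainder(df, 'cut')\n",

"```"

]

},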

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part D (15 points)\n",

"\n",

"In this part, you will **assume** the `carat` descriptive feature is at the root node (**NOTE:** This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the `price` target variable. \n",

"\n",

"Your answer to this part needs to be a table and it needs to be called **\"df_pred\"**. Also, it needs to have the following 6 columns:\n",

"- leaf_condition\n",

"- low_price_prob (probability)\n",

"- medium_price_prob\n",

"- high_price_prob\n",

"- premium_price_prob\n",

"- leaf_prediction\n",

"\n",

"As an example, continuing the spam prediction problem, assume the `suspicious words` descriptive feature is at the root node. The `df_pred` table would look something like the table below.\n",

"\n",

"|leaf_condition| spam_prob | ham_prob | leaf_prediction |\n",

"|---|---|---|---|\n",

"|suspicious words == true | ? | ? | ? |\n",

"|suspicious words == false | ? | ? | ? |\n",

"\n",

"**HINT:** Your `df_pred` table should have 3 rows."

]

},
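
{

"cell_type": "markdown",

"metadata": {},

"source": [

"With `carat` at the root, each of its three levels defines one leaf; the leaf probabilities are the relative frequencies of the `price` classes within that partition, and the leaf prediction is the majority class. A sketch with hypothetical names:\n",

"\n",

"```python\n",

"# Class probabilities of price within each carat level\n",

"probs = (df.groupby('carat')['price']\n",

"           .value_counts(normalize=True)\n",

"           .unstack(fill_value=0))\n",

"\n",

"# Majority-class prediction for each leaf\n",

"leaf_prediction = probs.idxmax(axis=1)\n",

"```"

]

},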

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": []

}

],

"metadata": {

"kernelspec": {

"display_name": "Python 3",

"language": "python",

"name": "python3"

},

"language_info": {

"codemirror_mode": {

"name": "ipython",

"version": 3

},

"file_extension": ".py",

"mimetype": "text/x-python",

"name": "python",

"nbconvert_exporter": "python",

"pygments_lexer": "ipython3",

"version": "3.8.5"

}

},

"nbformat": 4,

"nbformat_minor": 4

}

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## Question 1\n",

"\n",

"\n",

"This question is inspired from Exercise 3 in Chapter 5 in the textbook. \n",

"\n",

"On Canvas, you will see a CSV file named \"THA_diamonds.csv\". This file is a small subset of a real dataset on diamond prices in a [Kaggle competition](https://www.kaggle.com/shivam2503/diamonds). You will use this dataset for this question and the next question. \n",

"\n",

"**Some Background Information:** In our version of the dataset, the `price` feature has been discretized as `low`, `medium`, and `high`, and `premium`. If you are interested, these levels correspond to the following price ranges in the actual diamonds dataset:\n",

"- `low` price: price between \\\\$1000 and \\\\$2000\n",

"- `medium` price: price between \\\\$2000 and \\\\$3000\n",

"- `high` price: price between \\\\$3000 and \\\\$3500\n",

"- `premium` price: price between \\\\$3500 and \\\\$4000\n",

"\n",

"**Question Overview:** For this question, you will use the (unweighted) KNN algorithm for predicting the `carat` (numerical) target feature for the following single observation using the **Euclidean distance** metric with different number of neighbors:\n",

"- `cut` = good\n",

"- `color` = D\n",

"- `depth` = 60\n",

"- `price` = premium\n",

"- (`carat` = 0.71 but you will pretend that you do not have this information)\n",

"\n",

"In practice, you would use cross-validation or train-test split for determining optimal values of KNN hyperparameters. **However, as far as this assessment is concerned, you are to use entire data for training.**\n",

"\n",

"\n",

"### Part A (15 points)\n",

"Prepare your dataset for KNN modeling. Specifically, \n",

"1. Perform one-hot encoding of the categorical descriptive features in the input dataset.\n",

"2. Scale your descriptive features to be between 0 and 1.\n",

"3. Display the **last** 10 rows after one-hot encoding and scaling.\n",

"\n",

"**IMPORTANT NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.**"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"**NOTE:** For Parts (B), (C), and (D) below, you are **not** allowed to use the `KNeighborsRegressor()` in Scikit-Learn module, but rather use manual calculations (via either Python or Excel). That is, you will need to show and explain all your solution steps **without** using Scikit-Learn. The reason for this restriction is so that you get to learn how some things work behind the scenes. \n",

"\n",

"### Part B (5 points)\n",

"What is the prediction of the 1-KNN algorithm (i.e., k=1 in KNN) for the `carat` target feature using your manual calculations (using the Euclidean distance metric) for the single observation given above?\n",

"\n",

"### Part C (5 points)\n",

"What is the prediction of the 5-KNN algorithm?\n",

"\n",

"### Part D (5 points)\n",

"What is the prediction of the 10-KNN algorithm?\n",

"\n",

"\n",

"### Part E (15 points)\n",

"\n",

"This part (E) is an exception to the solution mode instructions for this question. In particular, you will need to use the `KNeighborsRegressor()` in Scikit-Learn to perform the same predictions in each Part (B) to (D). That is, \n",

"- What is the prediction of the 1-KNN algorithm using `KNeighborsRegressor()`?\n",

"- What is the prediction of the 5-KNN algorithm using `KNeighborsRegressor()`?\n",

"- What is the prediction of the 10-KNN algorithm using `KNeighborsRegressor()`?\n",

"\n",

"Are you able to get the same results as in your manual calculations? Please explain.\n",

"\n",

"\n",

"### Part F: Wrap-up (5 points)\n",

"\n",

"**IMPORTANT NOTE: This Wrap-up section is mandatory. That is, for Parts (B) to (E) (inclusive), you will not get any points for solutions not presented in the table format explained below.** \n",

"\n",

"Add and display two tables called **\"df_summary_manual\"** and **\"df_summary_sklearn\"** respectively:\n",

"- For the table **\"df_summary_manual\"**, you will report your results for Parts (B) to (D) using your manual calculations.\n",

"- For the table **\"df_summary_sklearn\"**, you will report your results for the 3 predictions in Part (E) using `KNeighborsRegressor()`.\n",

"\n",

"\n",

"Each of these tables need to have the following 3 columns:\n",

"- method\n",

"- prediction for the observation given (to be rounded to 3 decimal places)\n",

"- is_best (True or False - only the best prediction's is_best flag needs to be True and all the others need to be False)\n",

"\n",

"Your table needs to have 3 rows (one for each method) in each table that summarizes your results. These tables should look like below:\n",

"\n",

"|method | prediction | is_best |\n",

"|---|---|---\n",

"|1-KNN | ? | ? | ? |\n",

"|5-KNN | ? | ? | ? |\n",

"|10-KNN | ? | ? | ? |\n",

"\n",

"In case of a Pandas data frame, you can populate this data frame line by line by referring to Cell #6 in our [Pandas tutorial](https://www.featureranking.com/tutorials/python-tutorials/pandas/).\n"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## Question 2"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"This question is inspired from Exercise 3 in Chapter 4 in the textbook. \n",

"\n",

"You will use the same CSV file as in Question 1 named \"THA_diamonds.csv\". You will build a simple decision tree with **depth 1** using this dataset for predicting the `price` (categorical) target feature using the **Entropy** split criterion. \n",

"\n",

"To clarify, for Question 1, your target feature will be `carat` whereas for this Question 2, your target feature will be `price`."

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part A (10 points)\n",

"\n",

"The dataset for this question has 2 numerical descriptive features, `carat` and `depth`. \n",

"1. Discretize these 2 features separately as \"category_1\", \"category_2\", and \"category_3\" respectively using the *equal-frequency binning* technique. \n",

"2. Display the first 10 rows after discretization of these two features.\n",

"\n",

"After this discretization, all features in your dataset will be categorical (which we will assume to be **\"nominal categorical\"**). \n",

"\n",

"For this question, please do **NOT** perform any one-hot-encoding of the categorical descriptive features nor any scaling. Also, please do **NOT** perform any train-test splits.\n",

"\n",

"**IMPORTANT NOTE: If your discretizations are incorrect, you will not get full credit for a correct follow-through.**"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part B (5 points)\n",

"\n",

"Compute the impurity of the `price` target feature."

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part C (20 points)\n",

"\n",

"**IMPORTANT NOTE: For Parts C and D below, you will not get any points for solutions not presented in the required table format.** \n",

"\n",

"In this part, you will determine the root node for your decision tree.\n",

"\n",

"Your answer to this part needs to be a table and it needs to be called **\"df_splits\"**. Also, it needs to have the following 4 columns:\n",

"- split\n",

"- remainder\n",

"- info_gain\n",

"- is_optimal (True or False - only the optimal split's is_optimal flag needs to be True and the others need to be False)\n",

"\n",

"In your **\"df_splits\"** table, you should have **one row for each descriptive feature in the dataset**. As an example for your **\"df_splits\"** table, consider the `spam prediction` example in Table 4.2 in the textbook (**FIRST** Edition) on page 121, which was also covered in lectorials. The `df_splits` table would look something like the table below.\n",

"\n",

"|split| remainder | info_gain| is_optimal |\n",

"|---|---|---|---|\n",

"|suspicious words | ? | ? | True |\n",

"|unknown sender | ? | ? | False |\n",

"|contains images | ? | ? | False |\n",

"\n",

"**HINT:** Your `df_splits` table should have 4 rows."

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"### Part D (15 points)\n",

"\n",

"In this part, you will **assume** the `carat` descriptive feature is at the root node (**NOTE:** This feature may or may not be the optimal root node, but you will just assume it is). Under this assumption, you will make predictions for the `price` target variable. \n",

"\n",

"Your answer to this part needs to be a table and it needs to be called **\"df_pred\"**. Also, it needs to have the following 6 columns:\n",

"- leaf_condition\n",

"- low_price_prob (probability)\n",

"- medium_price_prob\n",

"- high_price_prob\n",

"- premium_price_prob\n",

"- leaf_prediction\n",

"\n",

"As an example, continuing the spam prediction problem, assume the `suspicious words` descriptive feature is at the root node. The `df_pred` table would look something like the table below.\n",

"\n",

"|leaf_condition| spam_prob | ham_prob | leaf_prediction |\n",

"|---|---|---|---|\n",

"|suspicious words == true | ? | ? | ? |\n",

"|suspicious words == false | ? | ? | ? |\n",

"\n",

"**HINT:** Your `df_pred` table should have 3 rows."

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": []

}

],

"metadata": {

"kernelspec": {

"display_name": "Python 3",

"language": "python",

"name": "python3"

},

"language_info": {

"codemirror_mode": {

"name": "ipython",

"version": 3

},

"file_extension": ".py",

"mimetype": "text/x-python",

"name": "python",

"nbconvert_exporter": "python",

"pygments_lexer": "ipython3",

"version": "3.8.5"

}

},

"nbformat": 4,

"nbformat_minor": 4

}

Answered 4 days after May 06, 2021
