# Future Problem¶

Generalized neural net

# Future Problem¶

Unit tests in our shared game implementation

b. Do computation & modeling problem 86 -- you did this problem before using your custom KNN, but do it once more using sklearn's KNN implementation. You can re-use code; I just want to make sure you know how to use sklearn's KNN.
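
For reference, the core scikit-learn calls look something like this (a minimal sketch on made-up toy data, not problem 86's dataset):

```python
from sklearn.neighbors import KNeighborsClassifier

# made-up toy data: two features, binary labels
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# n_neighbors plays the role of k in your custom KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

predictions = knn.predict([[0.5, 0.5], [5.5, 5.5]])
```

The fit/predict pattern is the same one you'll use for every sklearn model.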

# Future Problem¶

Extra Credit

You can get 100 points worth of extra credit on this assignment (i.e. a full assignment's worth of extra credit) if you create your own heuristic scoring technique and demonstrate that it performs better than the basic scoring technique shown in this assignment.

To demonstrate that it performs better, you'll need to run the same experiments for it and plot it on the same graph as the basic scoring technique.

# Future Problem¶

Watch Elijah's video explaining the native game state implementation:


# Problem 131¶

Fill out the end-of-year survey:

https://forms.gle/oo8ECxjN3XTnrv4j8

You can use the extra time to catch up on any old assignments and/or prepare for the final.

### Submission¶

Send a screenshot of the email receipt that you get after completing the survey

# Problem 130¶

Watch Prof. Wierman's videos:

Answer the following questions:

1. Fill in the blank: According to Paul Rothemund, synthetic biologists are $\_\_\_\_\_$-oriented while molecular programmers are $\_\_\_\_\_$-oriented.

2. In what kind of organism can you find a single-stranded loop of DNA?

3. What is the point of making smiley faces and other pictures out of DNA? Why are scientists researching this?

4. In computer science lingo, what can DNA tiles be used to represent?

5. What kind of unnatural DNA structure did Rothemund design in 2006? How big is the structure relative to the size of a bacterium?

6. In Qian's DNA neural network, how does the network determine whether a given array represents an "L" or a "T"?

7. In Qian's DNA neural network example, if the network were given an "I", what would the weighted sum be using the "L" weight, what would the weighted sum be using the "T" weight, and what would the network classify the "I" as?

If you have any missing assignments... in particular, titanic analysis assignments... start catching up on those.

Lastly, the final meeting with Prof. Wierman may need to be rescheduled, possibly to Friday 6/4.

### The Final¶

The final is supposed to take place on Wednesday 6/2 from 11am-1pm, but I hear a lot of you have AP tests then. If you have an AP test that day, then I'll just leave the time window open on Canvas so you can take it any time between Tuesday 6/1 and Thursday 6/3.

Any topic that appeared on an assignment this semester is fair game.

Here are the notes from class. (I'll update this with more notes as we do more review.)

https://photos.app.goo.gl/aWPbEveNBgoFURJ27

Here is a list of topics to help you focus your studying.

• basics of Haskell & C++
• numpy, pandas, sklearn
• all the models we've covered (in particular: linear/logistic regression, polynomial regression, k-nearest neighbors, k-means clustering)
• breadth-first and depth-first search
• roulette probability selection
• hill climbing (as a general concept)
• logistic regression when the target variable has 0's and/or 1's
• fitting logistic regression via gradient descent
• integral estimation (left, right, midpoint, trapezoidal, Simpson's)
• Euler estimation
• predator-prey and SIR modeling
• interaction terms, indicator (dummy) variables
• underfitting/overfitting
• distance/shortest paths in graphs
• Dijkstra's algorithm
• train/test datasets
• using linear regression with nonlinear functions
• titanic analysis
• cross-validation
• normalization
• clustering

# Problem 129¶

Create an elbow curve for k-means clustering on the titanic dataset, using min-max normalization.

Remember that the titanic dataset is provided here:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/processed_titanic_data.csv

In your clustering, use all the rows of the data set, but only these columns:

["Sex", "Pclass", "Fare", "Age", "SibSp"]

The first few rows of the normalized data set should be as follows:

["Sex", "Pclass", "Fare", "Age", "SibSp"]
[0, 1, 0.01415106, 0.27117366, 0.125]
[1, 0, 0.13913574, 0.4722292,  0.125]
[1, 1, 0.01546857, 0.32143755, 0]

Then, just as before, make a plot of sum squared distance to cluster centers vs $k$ for k=[1,2,3,...,25].

Choose k to be at the elbow of the graph (looks like k=4). Then, fit a k-means model with k=4, add the cluster label as a column in your data set, and find the column averages.

Tip: Use groupby: df.groupby(['cluster']).mean()
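
Here is a sketch of the min-max and groupby steps in pandas (toy numbers, not the titanic data; the cluster labels are made up in place of real k-means output):

```python
import pandas as pd

# toy stand-in for the titanic columns (values made up)
df = pd.DataFrame({"Fare": [10.0, 50.0, 30.0],
                   "Age":  [20.0, 40.0, 30.0]})

# min-max normalization: rescale each column to [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# after fitting k-means, attach each row's cluster label
# and average the columns within each cluster
df["cluster"] = [0, 1, 0]
means = df.groupby(["cluster"]).mean()
```

The `means` table has one row per cluster, which is exactly the format shown below.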

Here is an example of the format for your output. Your numbers might be different.

              Sex    Pclass       Fare        Age     SibSp
cluster
0        1.000000  2.183908  38.759867  28.815940  0.000000
1        0.502110  2.092827  45.046011  29.253985  1.118143
2        0.456522  2.847826  52.115039  14.601963  4.369565
3        0.000000  2.419355  20.452848  31.896441  0.000000

To help us interpret the clusters, add a column for Survived (the mean survival rate in each cluster) and add a column for count (i.e. the number of data points in each cluster).

Note: We only include Survived AFTER the clustering. Later, we'll want to incorporate clustering into our predictive model, and we don't know the Survived values for the passengers we're trying to predict.

Here is an example of the format for your output. Your numbers might be different.

              Sex    Pclass       Fare        Age     SibSp  Survived  count
cluster
0        1.000000  2.183908  38.759867  28.815940  0.000000  0.787356  174.0
1        0.502110  2.092827  45.046011  29.253985  1.118143  0.527426  237.0
2        0.456522  2.847826  52.115039  14.601963  4.369565  0.152174   46.0
3        0.000000  2.419355  20.452848  31.896441  0.000000  0.168203  434.0

Then, interpret the clusters. Write down, roughly, what kind of passengers each cluster represents.

### Submission¶

Code that generates the plot and prints out the mean data grouped by cluster

Overleaf doc with the grouped data as a table, and your interpretation of what each cluster means

# Problem 128¶

Generate an elbow graph for the same data set as in the previous assignment, except using scikit-learn's k-means implementation. This problem will mainly be an exercise in looking up and using documentation.

It's possible that the sum squared error values may come out a bit different due to scikit-learn using a different method to assign initial clusters. That's okay. Just check that the elbow of the graph still occurs at k=3.
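
The relevant pieces of the scikit-learn API are the KMeans class and its inertia_ attribute, which holds the total squared distance of the data points to their cluster centers. A minimal sketch on made-up data:

```python
from sklearn.cluster import KMeans

# made-up data with two obvious clusters
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]

errors = []
for k in range(1, 4):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(data)
    errors.append(model.inertia_)  # total squared error for this k

# plot errors vs k and look for the elbow
```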

Submission: Code that generates the elbow plot using scikit-learn's implementation.

Note: For this problem, put your code in a separate file (don't just overwrite the file from the previous assignment). This way, when I grade assignments, I can still run the code from the previous assignment.

# Problem 127¶

Since AP tests are starting this week, the assignments will be shorter, starting with this assignment.

When clustering data, we often don't know how many clusters are in the data to begin with.

A common way to determine the number of clusters is using the "elbow method", which involves plotting the total "squared error" and then finding where the graph has an "elbow", i.e. goes from sharply decreasing to gradually decreasing.

Here, the "squared error" associated with any data point is its squared distance from its cluster center. If a data point $(1.1,1.8,3.5)$ is assigned to a cluster whose center is $(1,2,3),$ then the squared error associated with that data point would be

$$(1.1-1)^2 + (1.8-2)^2 + (3.5-3)^2 = 0.3.$$

The total squared error is just the sum of squared error associated with all the data points.
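
In code, the worked example above looks like this (a minimal sketch):

```python
def squared_error(point, center):
    # squared Euclidean distance from a data point to its cluster center
    return sum((p - c) ** 2 for p, c in zip(point, center))

# the worked example: point (1.1, 1.8, 3.5) with center (1, 2, 3)
error = squared_error((1.1, 1.8, 3.5), (1, 2, 3))

# the total squared error is the sum of squared_error over all
# data points, each paired with its own cluster center
```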

Watch the following video to learn about the elbow method:

Recall the following dataset of cookie ingredients:

columns = ['Portion Eggs',
'Portion Butter',
'Portion Sugar',
'Portion Flour']

data = [[0.14, 0.14, 0.28, 0.44],
[0.22, 0.1, 0.45, 0.33],
[0.1, 0.19, 0.25, 0.4],
[0.02, 0.08, 0.43, 0.45],
[0.16, 0.08, 0.35, 0.3],
[0.14, 0.17, 0.31, 0.38],
[0.05, 0.14, 0.35, 0.5],
[0.1, 0.21, 0.28, 0.44],
[0.04, 0.08, 0.35, 0.47],
[0.11, 0.13, 0.28, 0.45],
[0.0, 0.07, 0.34, 0.65],
[0.2, 0.05, 0.4, 0.37],
[0.12, 0.15, 0.33, 0.45],
[0.25, 0.1, 0.3, 0.35],
[0.0, 0.1, 0.4, 0.5],
[0.15, 0.2, 0.3, 0.37],
[0.0, 0.13, 0.4, 0.49],
[0.22, 0.07, 0.4, 0.38],
[0.2, 0.18, 0.3, 0.4]]

Use the elbow method to construct a graph of error vs k. For each value of k, you should do the following:

• To initialize the clusters, assign the first row in the dataset to the first cluster, the second row to second cluster, and so on, looping back to the first cluster after you assign a row to the $k$th cluster. So the cluster assignments will look like this:

{
1: [0, k, 2k, ...],
2: [1, k+1, 2k+1, ...],
3: [2, k+2, 2k+2, ...],
...
k: [k-1, 2k-1, 3k-1, ...]
}

Check the logs if you need some more concrete examples.

• For each value of k, you should run the k-means algorithm until it converges, and then compute the squared error.
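
The round-robin initialization described in the first bullet can be sketched as follows (clusters numbered from 1, rows from 0; the function name is my own):

```python
def initial_clusters(num_rows, k):
    # row i goes to cluster (i % k) + 1, so cluster j gets rows j-1, j-1+k, ...
    return {j: list(range(j - 1, num_rows, k)) for j in range(1, k + 1)}
```

For the 19-row cookie dataset with k=3, this reproduces the initial clusters used in the previous problem.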

You should get the following result:

Then, estimate the number of clusters in the data by finding the "elbow" in the graph.

Note: Here is a log to help you debug.

### Submission¶

Link to repl.it code that generates the plot

Github commit to machine-learning repository

In your submission, write down your estimated number of clusters in the data set.

# Problem 126¶

Previously, we ran into the issue that the Gobble game tree is too big to work with. So, what we can do instead is repeatedly generate a smaller local tree, and use that instead.

Minimax Algorithm Using Local Trees

Each time it's your player's turn to move, you can build a local tree as follows:

• Use the current game state as the root node

• Generate more nodes corresponding to $N$ turns of the game

• $N=1$ would mean you stop after generating the child nodes.

• $N=2$ would mean you stop after generating the grandchild nodes.

• and so on...

• Assign scores to the leaf nodes of the local tree (I'll explain this more after these bullet points).

• Propagate the scores up the tree using the standard minimax approach.

• Choose your action in accordance with the standard minimax strategy (i.e. choose the action which takes you to the highest-score child).

Scoring Non-Terminal States

How do we assign scores to the leaf nodes of the local tree? The local tree only tracks the possibilities of the game $N$ turns into the future, and at that point, it's unlikely that either player has won the game.

What we can do is use a heuristic technique to assign scores. The word "heuristic" refers to a technique that is intuitive and practical, though not necessarily optimal.

In our case, a good heuristic technique is to create a score that gets higher when you're in a better position to win (and is highest when you have won).

For tic-tac-toe-like games, we can create a heuristic score like this:

• ADD 100 for each row, column, or diagonal that contains 3 of YOUR OWN pieces.

• ADD 10 for each row, column, or diagonal that contains 2 of YOUR OWN pieces, and where the remaining spot has nobody's piece in it.

• SUBTRACT 100 for each row, column, or diagonal that contains 3 of your OPPONENT'S pieces.

• SUBTRACT 10 for each row, column, or diagonal that contains 2 of YOUR OPPONENT'S pieces, and where the remaining spot has nobody's piece in it.
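
The four rules above can be sketched like this (the board encoding and function name are my own assumptions, not from the shared repo; a cell holds 'X', 'O', or None):

```python
def heuristic_score(board, me, opponent):
    # all 8 lines: 3 rows, 3 columns, 2 diagonals
    lines = [[board[r][c] for c in range(3)] for r in range(3)]
    lines += [[board[r][c] for r in range(3)] for c in range(3)]
    lines += [[board[i][i] for i in range(3)],
              [board[i][2 - i] for i in range(3)]]

    score = 0
    for line in lines:
        mine, theirs = line.count(me), line.count(opponent)
        empty = line.count(None)
        if mine == 3:
            score += 100      # completed line of your own pieces
        elif mine == 2 and empty == 1:
            score += 10       # two of yours, third spot open
        if theirs == 3:
            score -= 100      # completed line of opponent pieces
        elif theirs == 2 and empty == 1:
            score -= 10       # two of theirs, third spot open
    return score
```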

Experiment: create a Gobble player that uses a local game tree approach, and match it up against a random player for 200 games (alternating who goes first).

Repeat the above experiment for $N=1,2,3,$ and so on, stopping at the value of $N$ for which the experiment takes more than 3 minutes to run.

Make a table of win rate & loss rate vs N in an Overleaf doc and submit it along with a replit link and github commit.

Note: This heuristic scoring technique is pretty basic so it might not perform super well. But I think it should at least do a bit better than the random player.

# Problem 125¶

Riley -- once you've cleaned up your code, open a pull request to add your Gobble game to the shared repo. You can accept your own pull request for this. Please do this today (Wednesday) so that everyone can use it for this assignment.

George -- be ready to present your Gobble implementation on Friday. If you're stuck, that's okay; just present where you got stuck and what you tried to get around it.

Everyone -- create a branch of our shared repo called your-name-game-tree. Put your game tree in there, and then create a minimax player and test it using our shared repo. Have it play 200 games against a random player (100 as first player, 100 as second player) and post its win rate on Slack.

### Submission¶

Link to your branch with the minimax player

# Problem 124¶

### Clustering¶

Clustering in General

"Clustering" is the act of finding "groups" of similar records within data.

Watch this video to get a general sense of what clustering is and why we care about it. (Best to play it at 1.5 or 1.75x speed to save time)

K-Means Clustering

Your task will be to implement a basic clustering technique called "k-means clustering". Here is a video describing k-means clustering:

Here is a summary of k-means clustering:

1. Initialize the clusters

• Randomly divide the data into k parts. Each part represents an initial "cluster".

• Compute the mean of each part. Each mean represents an initial cluster center.

2. Update the clusters

• Re-assign each record to the cluster with the nearest center (using Euclidean distance).

• Compute the new cluster centers by taking the mean of the records in each cluster.

3. Keep repeating step 2 until the clusters don't change after the update.
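
Here is a sketch of one update (step 2) in plain Python, with names of my own choosing; it ignores the edge case where a cluster ends up empty:

```python
def mean(points):
    # componentwise mean of a list of points
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def update_once(clusters, data):
    # clusters maps cluster number -> list of row indices into data
    centers = {c: mean([data[i] for i in rows]) for c, rows in clusters.items()}
    new_clusters = {c: [] for c in clusters}
    for i, point in enumerate(data):
        nearest = min(centers, key=lambda c: squared_distance(point, centers[c]))
        new_clusters[nearest].append(i)
    return new_clusters
```

Repeating update_once until the clusters stop changing is exactly step 3.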

Write a KMeans clustering class and use it to classify the following data.

# these column labels aren't necessary to use
# in the problem, but they make the problem more
# concrete when you're thinking about what the data
# means.
columns = ['Portion Eggs',
'Portion Butter',
'Portion Sugar',
'Portion Flour']

data = [[0.14, 0.14, 0.28, 0.44],
[0.22, 0.1, 0.45, 0.33],
[0.1, 0.19, 0.25, 0.4],
[0.02, 0.08, 0.43, 0.45],
[0.16, 0.08, 0.35, 0.3],
[0.14, 0.17, 0.31, 0.38],
[0.05, 0.14, 0.35, 0.5],
[0.1, 0.21, 0.28, 0.44],
[0.04, 0.08, 0.35, 0.47],
[0.11, 0.13, 0.28, 0.45],
[0.0, 0.07, 0.34, 0.65],
[0.2, 0.05, 0.4, 0.37],
[0.12, 0.15, 0.33, 0.45],
[0.25, 0.1, 0.3, 0.35],
[0.0, 0.1, 0.4, 0.5],
[0.15, 0.2, 0.3, 0.37],
[0.0, 0.13, 0.4, 0.49],
[0.22, 0.07, 0.4, 0.38],
[0.2, 0.18, 0.3, 0.4]]

# we usually don't know the classes of the
# data we're trying to cluster, but I'm providing
# them here so that you can actually see that the
# k-means algorithm succeeds.

'Fortune',
'Sugar',
'Fortune',
'Sugar',
'Sugar',
'Sugar',
'Fortune',
'Fortune',
'Sugar',
'Sugar',
'Fortune',
'Shortbread']

Make sure your class passes the following test:

# initial_clusters is a dictionary where the key
# represents the cluster number and the value is
# a list of indices (i.e. row numbers in the data set)
# of records that are said to be in that cluster

>>> initial_clusters = {
1: [0,3,6,9,12,15,18],
2: [1,4,7,10,13,16],
3: [2,5,8,11,14,17]
}
>>> kmeans = KMeans(initial_clusters, data)
>>> kmeans.run()
>>> kmeans.clusters
{
1: [0, 2, 5, 7, 9, 12, 15, 18],
2: [3, 6, 8, 10, 14, 16],
3: [1, 4, 11, 13, 17]
}

Here are some step-by-step tests to help you along:

>>> initial_clusters = {
1: [0,3,6,9,12,15,18],
2: [1,4,7,10,13,16],
3: [2,5,8,11,14,17]
}
>>> kmeans = KMeans(initial_clusters, data)

### ITERATION 1
>>> kmeans.update_clusters_once()

>>> kmeans.clusters
{
1: [0, 3, 6, 9, 12, 15, 18],
2: [1, 4, 7, 10, 13, 16],
3: [2, 5, 8, 11, 14, 17]
}
>>> kmeans.centers
{
1: [0.113, 0.146, 0.324, 0.437],
2: [0.122, 0.115, 0.353, 0.427],
3: [0.117, 0.11, 0.352, 0.417]
}
>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
2: ['Fortune', 'Fortune', 'Shortbread', 'Sugar', 'Fortune', 'Sugar'],
3: ['Shortbread', 'Shortbread', 'Sugar', 'Fortune', 'Sugar', 'Fortune']
}

### ITERATION 2
>>> kmeans.update_clusters_once()

>>> kmeans.clusters
{
1: [0, 2, 5, 6, 7, 9, 10, 12, 15, 18],
2: [14, 16],
3: [1, 3, 4, 8, 11, 13, 17]
}

>>> kmeans.centers
{
1: [0.111, 0.158, 0.302, 0.448],
2: [0.0, 0.115, 0.4, 0.495],
3: [0.159, 0.08, 0.383, 0.379]
}

>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
2: ['Sugar', 'Sugar'],
3: ['Fortune', 'Sugar', 'Fortune', 'Sugar', 'Fortune', 'Fortune', 'Fortune']
}

### ITERATION 3
>>> kmeans.update_clusters_once()

>>> kmeans.clusters
{
1: [0, 2, 5, 7, 9, 12, 15, 18],
2: [3, 6, 8, 10, 14, 16],
3: [1, 4, 11, 13, 17]
}

>>> kmeans.centers
{
1: [0.133, 0.171, 0.291, 0.416],
2: [0.018, 0.1, 0.378, 0.51],
3: [0.21, 0.08, 0.38, 0.346]
}

>>> {cluster_number: [classes[i] for i in cluster_indices] \
for cluster_number, cluster_indices in kmeans.clusters.items()}
{
2: ['Sugar', 'Sugar', 'Sugar', 'Sugar', 'Sugar', 'Sugar'],
3: ['Fortune', 'Fortune', 'Fortune', 'Fortune', 'Fortune']
}

### Submission¶

Repl.it link to your k-means tests (and your github commit)

# Problem 123¶

### Gobble Implementation¶

Using our shared tic-tac-toe implementation as a starting point, implement the "Gobble" game that was described during class.

• 3x3 board, just like tic-tac-toe. Player wins when they have 3 pieces in a row.

• Pieces of 3 sizes: 1, 2, 3. You can use a larger-size piece to cover a smaller-size piece.

• Each player has $k$ pieces of each size. This is a parameter that we may want to vary.

I already copied the tic-tac-toe implementation into a "gobble" folder, so you just need to create a branch and modify the existing code to implement Gobble.

https://github.com/eurisko-us/games-cohort-1

Next class, be ready to present your Gobble implementation (i.e. what you changed in the existing tic-tac-toe implementation).

### Game Tree Analysis¶

Write some code to create game trees and answer the following questions:

a. How many nodes are in a full tic-tac-toe game tree, and how long does it take to construct?

b. How many nodes are in a full Gobble game tree with $k=2,$ and how long does it take to construct?

c. How many nodes are in a full Gobble game tree with $k=3,$ and how long does it take to construct?

d. How many nodes are in a full Gobble game tree with $k=4,$ and how long does it take to construct?

e. How many nodes are in a full Gobble game tree with $k=5,$ and how long does it take to construct?

### Submission¶

Link to gobble code

Link to overleaf doc with your answers to the game tree analysis questions

# Problem 122¶

a. Take your code from the previous problem and run it again, this time on the titanic dataset.

Remember that the titanic dataset is provided here:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/processed_titanic_data.csv

Filter the above dataset down to the first 100 rows, and only these columns:

["Survived", "Sex", "Pclass", "Fare", "Age","SibSp"]

Then, just as before, make a plot of leave-one-out accuracy vs $k$ for k=[1,3,5,7,...,99]. Overlay the 4 resulting plots: "unscaled", "simple scaling", "min-max", "z-score". You should get the following result:

b. Compute the relative speed at which your code runs (relative to mine). The way you can do this is to run this code snippet 5 times and take the average time:

import time
start = time.time()

counter = 0
for _ in range(1000000):
    counter += 1

end = time.time()
print(end - start)

When I do this, I get an average time of about 0.15 seconds. So to find your relative speed, divide your result by mine.

c. Speed up your code in part (a) so it runs in (your relative speed) * 45 seconds or less. I took a deeper dive into some code that was running slow for students, and it turns out the code just needs to be written more efficiently.

To make the code more efficient, you need to avoid unnecessarily repeating expensive operations. Anything involving a dataset transformation is usually expensive.

• The very first thing you do should be processing all of your data and splitting it into your X and y arrays. DON'T do this every time you fit a model -- just do it once at the beginning.

• In general, avoid repeatedly processing the data set. If there's something you're doing to the data set over and over again, just do it once at the beginning.

You can time your code using the following setup:

import time
start_time = time.time()

# ... the code you want to time goes here ...

end_time = time.time()
print('time taken:', end_time - start_time)

REALLY IMPORTANT:

• While you make your code more efficient, you'll need to repeatedly run it to see if your actions are actually decreasing the time it takes to run. Instead of running the full analysis each time, just run a couple values of $k$. That way, you're not waiting a long time for your code to run each time. Once you've decreased this partial run time by a lot, you can run your entire analysis again.

• If you get stuck for more than 10 minutes without making progress, ping me on Slack so that I can take a look at your code and let you know if there's anything else that's making it slow.

d. Complete quiz corrections for any problems you missed. (I'll have the quizzes graded by tonight, 5/5.) That will either involve revising your free response answers or revising your code and sending me the revised version.

### Submission¶

Link to KNN code that runs in (your relative speed) * 45 seconds or less. When I run your code, it should print out the total time it took to run.

Quiz corrections

# Problem 121¶

Before fitting a k-nearest neighbors model, it's common to "normalize" the data so that all the features lie within the same range. Otherwise, variables with larger ranges are given greater distance contributions (which is usually not what we want).

The following video explains 3 different normalization techniques: simple scaling, min-max scaling, and z-scoring.
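
For a single column of values, the three techniques can be sketched as follows (plain Python; the z-score version here uses the population standard deviation, which may differ slightly from the video's convention):

```python
def simple_scale(xs):
    # divide every value by the maximum value
    m = max(xs)
    return [x / m for x in xs]

def min_max(xs):
    # rescale so the minimum maps to 0 and the maximum to 1
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    # subtract the mean, divide by the standard deviation
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sigma for x in xs]
```

In parts (b)-(d), you'd apply the chosen technique to each feature column before fitting.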

Consider the following dataset. The goal is to use the features to predict the book type (children's book vs adult book).

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/datasets/book-data.csv

First, read in this dataset and change the "book type" column to be numeric (1 if adult book, 0 if children's book).

a. Create a "leave-one-out accuracy vs k" curve for k=[1,3,5,...,99].

b. Repeat (a), but this time normalize the data using simple scaling beforehand.

c. Repeat (a), but this time normalize the data using min-max scaling beforehand.

d. Repeat (a), but this time normalize the data using z-scoring beforehand.

e. Overlay all 4 plots on the same graph. Be sure to include a legend that labels the plots as "unscaled", "simple scaling", "min-max", "z-score".

You should get the following result:

f. Answer the big question: why does normalization improve the accuracy? (Or equivalently, why did the model perform worse on the unnormalized data?)

### Submission¶

Overleaf doc with plot and explanation, as well as a link to the code that you wrote to generate the plot.

# Problem 120¶

### KNN - Titanic Survival Modeling¶

Note: Previously, this problem had consisted of a KNN model on the full titanic dataset along with normalization techniques. The analysis was taking too long on chromebooks, so I've reduced the size of the dataset. Also, the normalization techniques weren't having an effect on the result, so I took that off this assignment but will revise the normalization task and put it on the next assignment. Any code you wrote for the normalization techniques will be useful in the next assignment.

In this problem, your task is to use scikit-learn's k-nearest neighbors implementation to predict survival in a portion of the titanic survival modeling dataset.

Remember that the fully-processed dataset is here:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/processed_titanic_data.csv

Take that fully-processed dataset and filter it down to the first 100 rows, and only these columns:

[
"Survived",
"Sex",
"Pclass",
"Fare",
"Age",
"SibSp"
]

Then, create a plot of leave-one-out accuracy vs $k$ for the following values of $k{:}$

[1,3,5,10,15,20,30,40,50,75]

You should get the following result:

### K-Fold Cross Validation¶

K-fold cross validation is similar to leave-one-out cross validation, except that instead of repeatedly leaving out one record, we split the dataset into $k$ sections or "folds" and repeatedly leave out one of those folds.

This video explains it pretty well, with a really good visual at the end:
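
To make the bookkeeping concrete, here is a sketch of the arithmetic (toy numbers, not the ones in the questions below; assumes k divides the number of records evenly):

```python
def k_fold_counts(num_records, k):
    # k-fold CV fits one model per fold; each model is validated on
    # one fold and trained on the remaining k - 1 folds
    fold_size = num_records // k
    return {"models": k,
            "train_size": num_records - fold_size,
            "test_size": fold_size}
```

Setting k equal to the number of records shrinks each validation fold to a single record.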

Answer the following questions:

1. If we had a dataset with 800 records and we used 2-fold cross validation, how many models would we fit, how many records would each model be trained on, and how many records would each model be validated (i.e. tested) on?

2. If we had a dataset with 800 records and we used 8-fold cross validation, how many models would we fit, how many records would each model be trained on, and how many records would each model be validated (i.e. tested) on?

3. If we had a dataset with 800 records, for what value of $k$ would $k$-fold cross validation be equivalent to leave-one-out cross validation?

### Submission¶

• Link to your code that generates the plot

• Overleaf doc with the plot and the answers to the 3 questions

# Problem 119¶

### Minimax Strategy Player¶

a. Implement a minimax player for your tic-tac-toe game.

Remember that the minimax strategy works as follows:

1. Create a game tree with all the states of the tic tac toe game
2. Identify the nodes that represent terminal states and assign them 1, -1, or 0 depending on whether it corresponds to a win, loss, or tie for you
3. Repeatedly propagate those scores up the tree to parent nodes.

• If the game state of the parent node implies that it's your turn, then the score of that node is the maximum value of the child scores (since you want to maximize your score).

• If the game state of the parent node implies that it's the opponent's turn, then the score of that node is the minimum value of the child scores (since your opponent wants to minimize your score).

4. Always make the move that takes you to the highest-score child state. (If there are ties, then you can choose randomly.)
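
Steps 1-4 can be condensed into a recursion on an abstract game tree (a minimal sketch; here a node is either a terminal score of +1, -1, or 0, or a list of child nodes, and maximizing says whose turn it is):

```python
def minimax(node, maximizing):
    # terminal states carry their score directly: +1 win, -1 loss, 0 tie
    if isinstance(node, (int, float)):
        return node
    scores = [minimax(child, not maximizing) for child in node]
    # your turn: take the max; opponent's turn: they take the min
    return max(scores) if maximizing else min(scores)
```

In the real player, each child corresponds to a legal move, and you pick the move leading to the child with the highest score.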

b. Check that your minimax strategy usually beats a random strategy. Run as many minimax vs random matchups as you can in 3 minutes, alternating who goes first. What percentage of the time does minimax win? Post your win percentage on Slack.

### Prep for Discussion Next Class¶

Heads up that next class, we'll discuss whether to use Riley's tic-tac-toe implementation or Colby's implementation. If you have strong feelings one way or the other, prepare by creating a pro/con list that you can reference during the discussion.

### Submission¶

Github commit for minimax code, and post your minimax win percentage on Slack.

Remember, quiz Friday! See the previous assignment for information on what's on it.

# Problem 118¶

I was going to have us create tic-tac-toe playing agents, but then I realized that creating the tic-tac-toe game is enough work for one assignment. So that will be the goal of this assignment.

### Tic-Tac-Toe Game¶

I invited everyone to a github team called cohort-1. Accept the invite and you will be granted write access to the following repository:

https://github.com/eurisko-us/games-cohort-1

In that repository, create a folder tic-tac-toe and create a basic tic-tac-toe game in there. There should be a Game class that accepts two Strategy classes, similar to how space-empires works. (You can make additional classes as you see fit.)

You should also include some basic tests to demonstrate that the game works properly.

Prepare a 3-5 minute presentation about your implementation for Wednesday. Don't exceed 5 minutes. As usual, the things to address are the following:

1. What is your general architecture? E.g. what classes, how is state stored/updated, how do players communicate with the game.
2. What are some things your game does elegantly?
3. What are some things that could be improved, and what kind of state are they in (are they just nice-to-haves, or are there things that will cause major issues down the road if not adjusted)?

You can show parts of your code, but DON'T go through it line-by-line. This is supposed to be a quick elevator pitch of your implementation.

(There's no submission for this assignment; your grade will be based on your presentation during class.)

### Quiz Friday¶

Forward/backward selection, basic manipulations with pandas / numpy / sklearn. Also, the videos that Prof. Wierman assigned:

There won't be any game tree stuff (we'll wait until we're further along that path).

# Problem 117¶

Overview: There are 2 parts to this assignment: backward selection and space empires.

### Backward Selection¶

In this assignment, you'll do "backward selection", which is very similar to forward selection except that we start with all features and remove features that don't improve the accuracy.

One key difference is that with backward selection, we'll just loop through all the features once and remove any features that don't improve the accuracy. This is different from forward selection (in forward selection, we looped through all the features repeatedly).

• The reason why we'll just loop through all the features once is that backward selection is expensive (it takes a long time to fit each model when we're using all the features).

A couple notes:

• Use 100 iterations and set random_state=0 (it's a parameter in the logistic regressor; check out the documentation for more info)

• 100 iterations isn't enough for the regressor to converge, but since things run slow on the chromebooks, we'll just do this exercise with 100 iterations regardless. To suppress convergence warnings, set the following:

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

Results

Initially, using all the features, testing accuracy should be about 0.788

Then, after backward selection, testing accuracy should have increased to about 0.831

For your ease of debugging, all the features along with information about each iteration of backward selection are shown in the log below.

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/logistic-regressor-backward-selection-100-iterations.txt

### Space Empires¶

In class, we decided to use George's implementation, with the following tweaks:

• allow it to run all tests at once if we want (like how Colby did)

• get the phase from the game state (like how David did)

Elijah - your implementation was clever, but George's seemed simpler for everyone to build off of.

This weekend:

• George - merge your pull request (do this ASAP, definitely by Saturday, because Colby and David's tasks depend on this). There are probably merge conflicts that you'll have to resolve.

• Colby - after George has merged his pull request, create a new branch to include the capability to run all unit tests at once. Then create a pull request and merge the code.

• David - after George has merged his pull request, create a new branch and tweak the code to get the phase from the game state. Then create a pull request and merge the code.

• Riley - create a page on the wiki called "Unit Test Descriptions" and write a brief description for each of the existing unit tests. This way, we can look to this page to understand what we do and don't have unit tests for already.

• Elijah - everyone was having an issue translating between the native game state and the standard game state, so you'll need to either

• write documentation for the native game state and write functions for translating to and from the standard game state, or

• update the standard game state (and its documentation) so that we no longer need a native game state

Since you're the one who knows the most about the native game state, this can be your call.

# Problem 116¶

### Quiz Corrections¶

If there were any problems you didn't get right, fix them and show all your work (or all your code).

### Intro to Minimax Algorithm¶

To introduce the idea of how one can design intelligent agents, we'll implement an intelligent agent that solves tic-tac-toe using the minimax algorithm. But before we actually implement it, we need to understand it at a high level.

Watch the first 8 minutes of the following video that explains the minimax algorithm. (You can probably set it to 1.5x speed)

Then, answer the following questions:

1. What does the root of the game tree represent?

2. What does each edge of the game tree represent?

3. What are the scores of a win, a loss, a tie? (3 answers)

4. Is your opponent the maximizing player or the minimizing player?

5. If a node has a child with score +1 and a child with score -1, then what is the score of the node?

6. If a node has two children with score +1, one child with score 0, and one child with score -1, then what is the score of the node?

7. Draw the full game tree proceeding from the following root node, and label each node with its score according to the minimax algorithm. There should be 12 nodes in total.

X | O | X
---------
  | O | O
---------
  |   | X

You can do the drawing on paper, take a picture, and put that in your Overleaf doc.
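Once you understand the idea, the core recursion is quite small. Here's a minimal sketch (with an assumed toy representation: a leaf is its score, an internal node is the list of its children -- this is not our shared game code):

```python
def minimax(node, maximizing):
    # a leaf is just its score: +1 win, -1 loss, 0 tie
    if isinstance(node, int):
        return node
    # an internal node is a list of children; the player to move alternates
    child_scores = [minimax(child, not maximizing) for child in node]
    return max(child_scores) if maximizing else min(child_scores)
```

Question 7's game tree works the same way, just with tic-tac-toe boards as the nodes.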

### Unit Testing Presentation¶

On Friday, everyone will give a 5-minute presentation of their unit testing framework, and based on that, we will decide what kind of framework to use for our shared implementation.

Before class, run through your 5-minute presentation and make sure that you are able to do the following in 5 minutes:

1. Explain how your framework works (at a high level).

2. Show off key pieces of code. If there are any really elegant pieces, show them off. If there are any messy pieces, be forthcoming about it.

3. Show how you run your testing framework on the existing unit tests and show the output.

You don't have to make slides or anything super formal. You just need to describe things clearly and concisely.

### Submission¶

Overleaf doc with quiz corrections & minimax answers

# Problem 115¶

This is a catch-up assignment. Please prioritize problem 114 -- it's important to have that done by Wednesday's class.

# Problem 114¶

a. If there's anything that you find confusing about our game implementation, post on Slack for discussion.

• In particular, George -- you should ask what you were wondering about the game state and include what you printed out for the game state.

• This is now our shared implementation, so it's everyone's responsibility to maintain it. If there's anything that you find confusing, then take initiative to ask about what it means on Slack. If there's any part of the code that you don't like, kick off a discussion about changing it. It's everyone's code now.

b. We currently have 4 unit tests in the unit_tests folder: movement test 1, economic tests 1 and 2, and combat test 1. Create a file that executes these unit tests.

• You DON'T have to debug the game so that the tests pass. I just want you to create a unit test framework that runs each test and says whether it passes or not.

• You can change the structure of the test files if you want, e.g. structuring the test description in a more standard format, or whatever you want to do to make it easier to run the unit tests.

• Then, next week, we can compare the tradeoffs of everyone's frameworks and agree upon a format to use going forward.

• Develop your unit testing file on your own separate branch. But you don't have to make a pull request (we'll decide which branch to pull in during the next class).
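For a sense of scale, the core of such a framework can be just a few lines. The sketch below makes a hypothetical assumption that each unit test is wrapped as a zero-argument callable that raises on failure -- your structure for the test files can and probably will differ:

```python
def run_tests(tests):
    # tests: dict mapping test name -> zero-argument callable that
    # raises (e.g. AssertionError) on failure -- a hypothetical structure
    results = {}
    for name, test in tests.items():
        try:
            test()
            results[name] = 'PASS'
        except Exception:
            results[name] = 'FAIL'  # a crash or failed assertion counts as a failure
    return results
```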

### Submission¶

A link to your unit tests, and a link to the commit on your branch.

# Problem 113¶

### Space Empires¶

If you haven't already, get your strategy working in our shared game implementation and create a pull request so we can merge in class on Friday.

This is important. If you're confused by any part in the code, DON'T use that as an excuse for not having this done. Post on Slack and we'll clear up any confusions in the code.

Lastly, don't worry if the game doesn't work exactly as intended right now. We'll start with unit tests on Friday. I just want everyone's strategy to run on our shared game without giving errors.

### Forward Selection¶

Previously, you built a logistic model with 167 features, and got the following results using max_iter=10,000:

training: 0.848
testing: 0.811

It turned out that running that many iterations was taking a while (5 minutes) for some students, so let's use max_iter=1,000 instead. The logistic regressor might not fully converge, which means the model will probably be slightly worse, but that's okay because right now we're just going through this modeling process for educational purposes.

Using max_iter=1,000, I get the following results:

training: 0.846
testing: 0.808

Yours should be pretty similar.

Now, you'll notice that the training accuracy is quite a bit higher than the testing accuracy. This is because we now have a LOT of features in our dataset, and not all of them are useful, which means it's harder for the model to figure out what is useful. The model ends up fitting to some "noise" in the data (see https://en.wikipedia.org/wiki/Noisy_data) and that causes it to pick up on some random patterns that aren't actually meaningful. The model becomes paranoid!

To fix this issue, we need to carry out feature selection, in which we attempt to select only the features that are actually useful to the model.

One type of feature selection method is forward selection, in which we begin with an empty model and add in variables one by one. In each forward step, you add the one variable that gives the single best improvement to your model.

Your task is to carry out forward selection on those 167 features.

• Initially, you'll assume a model with no features. You don't actually build this model, but you assume its accuracy is 0.

• Each forward step, you'll need to create a new model for each possible feature you might add next.

• The next feature should always be the feature that gives you the largest accuracy when included in your model.

• If there are any ties, you can just use the feature that you checked first. That way, you'll be able to compare to the log I provide at the bottom of the assignment.
• Stopping Criterion: If the feature that gives the largest accuracy doesn't actually improve the accuracy of the model, then stop.

• In general, in the $n$th step of forward selection, you should be testing out models with $n$ features, $n-1$ of which are the same across all the models.
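Putting the bullets above together, the forward selection loop can be sketched like this, where `score` is a hypothetical stand-in for fitting a logistic regressor on the given features and returning its accuracy:

```python
def forward_select(all_features, score):
    # start from the empty model, whose accuracy we take to be 0
    selected, best = [], 0.0
    while True:
        best_feature, best_s = None, best
        for f in all_features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best_s:  # strict improvement; ties keep the first feature checked
                best_s, best_feature = s, f
        if best_feature is None:  # stopping criterion: no feature improves accuracy
            return selected, best
        selected.append(best_feature)
        best = best_s
```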

Put this problem in a separate file. I'll give you the processed data set so that you can be sure you're using the right starting point (it should match up with yours, but just in case it doesn't, you can still do this problem without having to go down the rabbit hole of debugging your data processing).

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/processed_titanic_data.csv

Your task is to take the processed data set and carry out forward selection. You should end up with the features and accuracies shown below.

['Sex', 'Pclass * SibSp', 'Pclass * Fare', 'Pclass * CabinType=E', 'Fare * CabinType=D', 'SibSp * CabinType=B', 'SibSp>0', 'Fare * CabinType=A']
training: 0.818
testing: 0.806

Print out a log like that given in the file below. This log is given to help you debug.

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/logistic-regressor-forward-selection.txt

IMPORTANT: While initially writing your code, change max_iter to a small number like 10 so that you're not waiting around for your log to generate each time. Once your code seems like it's working as intended, THEN update the iterations to 1000 and check that your results match up with those given in the log above.

You'll notice that we were able to remove a TON of the features, and get nearly the same testing accuracy. The training accuracy also got closer to the testing accuracy. That's good.

However, the testing accuracy didn't increase. It actually went down a bit. In a future assignment, we'll talk about another feature selection method that solves this issue.

### Submission¶

Just the repl.it link to your file and the commit link for GitHub.

Also, remember that there's a quiz on Friday (as outlined on the previous assignment).

# Problem 112¶

### Titanic Survival Prediction - Interaction Features¶

Put your code for this problem in the file that you've been using to do the titanic survival prediction using pandas, numpy, and sklearn.

Previously, we left off using a logistic regression with the following features:

['Sex', 'Pclass', 'Fare', 'Age', 'SibSp', 'SibSp>0', 'Parch>0', 'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S', 'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']

We got the following accuracy:

training accuracy: 0.8260
testing accuracy: 0.7903

Now, let's introduce some interaction terms. You'll need to create another column for each non-redundant interaction between features. An interaction is redundant if the two features are derived from the same original feature.

• SibSp and SibSp>0 are redundant

• All the features that start with Embarked= are redundant with each other

• All the features that start with CabinType= are redundant with each other

I can't give you a list of all these features because then you could just copy over that list and use it as a starting point. But I can tell you that there will be 167 features in total, not including Survival (which is not actually a feature since that's what we're trying to predict). There are 20 non-interaction features and 147 interaction features for a total of 167 features.

There are many ways to accomplish this. My suggestion is to first just create a list of all the names of interaction terms between non-redundant features,

['Sex * Pclass', 'Sex * Fare', ...]

and then loop through that list to create the actual column in your dataframe for each interaction feature.
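One way to generate that list is to compare the "base" feature that each column was derived from, and skip pairs that share a base. A sketch (the helper names here are mine, not required):

```python
def base_feature(name):
    # 'CabinType=A' -> 'CabinType', 'SibSp>0' -> 'SibSp', 'Sex' -> 'Sex'
    return name.split('=')[0].split('>')[0]

def interaction_names(features):
    # all pairs of features, skipping redundant pairs (same original feature)
    names = []
    for i, f1 in enumerate(features):
        for f2 in features[i + 1:]:
            if base_feature(f1) != base_feature(f2):
                names.append(f'{f1} * {f2}')
    return names
```

Then each name can be split back into its two factors (e.g. `left, right = name.split(' * ')`) to create the corresponding dataframe column.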

If you fit your regressor using all 167 features with max_iter=10000, you should get the following result (rounded to 3 decimal places):

training: 0.848
testing: 0.811

Note that at this point, our model is probably overfitting a bit. In a future assignment, we'll fix that by introducing some basic "feature selection" methods.

### Submission¶

Just submit the repl.it link to your file along with the GitHub commit to your kaggle repository. Your file should print out your training and testing accuracy, which should match up with the given result.

### Quiz¶

We'll have a quiz on Friday on the following topics:

• logistic regression (pseudoinverse & gradient descent)

• basic data processing / model fitting with pandas / numpy / sklearn

Note that in class today, we reviewed the logistic regression part, but the questions I ask on the quiz aren't going to be exactly the same as the ones we went over in the review. The quiz will check whether you've developed intuition from really understanding the answers to those questions, and the intuition should carry over to similar but slightly different questions.

I may ask you to do some computations by hand, so make sure you're able to do that too (I'd suggest to work out the first iteration in problem 76 by hand and make sure that the gradient & updated weights you get match up with what's in the log).

# Problem 111¶

a. Get your level 3 strategy working against NumbersBerserker. Work on a separate branch and create a pull request when you're done. Also, post your win rate on Slack.

The strategies are in space-empires-cohort-1/src/strategies/level_3

https://github.com/eurisko-us/space-empires-cohort-1/tree/main/src/strategies

• If you want to further optimize your strategy, then create some other strategy players and make sure your strategy wins against them too. (After break, we'll run strategy matchups. The winner will receive 50% extra credit, 2nd place 30%, and 3rd place 10%.)

b. Watch the videos that Prof. Wierman assigned during the last meeting. Make sure you're watching them closely enough to talk about them afterwards.

# Problem 110¶

(This is a short ~30 minute assignment since we have Wednesday off.)

Now that you've built a logistic regressor that uses gradient descent, you've "unlocked" the privilege to use sklearn's LogisticRegression.

Previously, you carried out a Titanic prediction problem using sklearn's linear regressor. For this problem, just tweak the code you wrote to use the logistic regressor instead.

After you replace LinearRegression with LogisticRegression in your code, you'll have to

• tweak a parameter of the regressor to get it to run long enough to converge

• update your code to support the format in which the logistic regressor returns information

I'm not going to tell you exactly how to fix those issues, because the point of this problem is to give you practice debugging and reading documentation.

Tip: To find the official documentation on sklearn's logistic regressor, do a google search with the query "sklearn logistic regression".

You should get the output below. The predictions with the logistic regressor turn out to be a little bit better than those with the linear regressor.

features: [
'Sex',
'Pclass',
'Fare',
'Age',
'SibSp', 'SibSp>0',
'Parch>0',
'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S',
'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']

training accuracy: 0.8260
testing accuracy: 0.7903

coefficients:
{
'Constant': 1.894,
'Sex': 2.5874,
'Pclass': -0.6511,
'Fare': -0.0001,
'Age': -0.0398,
'SibSp': -0.545,
'SibSp>0': 0.4958,
'Parch>0': 0.0499,
'Embarked=C': -0.2078, 'Embarked=None': 0.0867, 'Embarked=Q': 0.479, 'Embarked=S': -0.3519,
'CabinType=A': -0.0498, 'CabinType=B': 0.0732, 'CabinType=C': -0.2125, 'CabinType=D': 0.7214, 'CabinType=E': 0.4258, 'CabinType=F': 0.6531, 'CabinType=G': -0.7694, 'CabinType=None': -0.5863, 'CabinType=T': -0.2496
}

### Submission Template¶

Just submit the repl.it link to your code. When I run it, it should print out the information above.

# Problem 109¶

### Refresher¶

Previously, we built a LogisticRegressor that worked by reducing the regression task down to the task of finding the least-squares solution to a linear system.

More precisely, the task of fitting the logistic function

$$y=\dfrac{1}{1+e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}$$

was reduced to the task of fitting the linear regression

$$\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n = \ln \left( \dfrac{1}{y} - 1 \right).$$

### Issue with LogisticRegressor¶

Although this is a slick way to solve the problem, it suffers from the fact that we have to do something "hacky" in order to fit any data points with $y=0$ or $y=1.$

In such cases, we can't just run the model as usual, because the $\ln \left( \dfrac{1}{y}-1 \right)$ term blows up -- so our "hack" has been to

• change any instances of $y=0$ to a small decimal like $y=0.1$ or $y=0.001,$ and

• change any instances of $y=1$ to $1$ minus the small decimal, like $y=0.9$ or $y=0.999,$

depending on the context of the problem.

But this isn't a great way to deal with the issue, because the resulting logistic function can change significantly depending on what small decimal we use. The difference between small decimals may seem like such a minor difference, but when we plug these values in the $\ln \left( \dfrac{1}{y} - 1 \right)$ term, we get wildly different results, which leads to quite different fits.

PART A. To illustrate how different these fits can be, fit 4 instances of your current LogisticRegressor to the following dataset:

• one instance where you change all instances of y=0 to y=0.1 and all instances of y=1 to y=0.9

• another instance where you change all instances of y=0 to y=0.01 and all instances of y=1 to y=0.99

• another instance where you change all instances of y=0 to y=0.001 and all instances of y=1 to y=0.999

• another instance where you change all instances of y=0 to y=0.0001 and all instances of y=1 to y=0.9999

df = DataFrame.from_array(
[[1,0],
[2,0],
[3,0],
[2,1],
[3,1],
[4,1]],
columns = ['x', 'y'])

Put these all on the same plot, along with the data, and put them in an Overleaf doc. Be sure to label each curve with 0.1, 0.01, 0.001, or 0.0001 as appropriate.

If you need a refresher on plotting / labeling curves, see here:

https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-10-1

If you need a refresher on including data in plots, see here:

https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-33-1

Explain: How does the plot change as the small decimal is varied?

### Gradient Descent to the Rescue¶

Instead, we can use gradient descent to fit our logistic function. We want to choose the coefficients that minimize the sum of squared error (the RSS).

PART B. In your LogisticRegressor class, write the following methods:

• calc_rss() - calculates the sum of squared error for the regressor

• set_coefficients(coeffs) - allows you to manually set the coefficients of your regressor by passing in a dictionary of coefficients

• calc_gradient(delta) - computes the partial derivatives of the RSS with respect to each coefficient

• gradient_descent(alpha, delta, num_steps, debug_mode=False) - carries out a given number of steps of gradient descent. If debug_mode=True, then print out every step of the way.

Note that we wrote a gradient descent optimizer a while back:

https://www.eurisko.us/files/assignment_problems_cohort_2_10th.html#Problem-34-2

You can use this as a refresher on how to code up gradient descent, and you might be able to copy/paste some code from here.

• Unfortunately, you can't just pass your logistic regressor into this gradient descent optimizer class -- we wrote the optimizer to work on functions whose parameters were passed in as individual arguments, whereas our LogisticRegressor stores its coefficients in a dictionary.

Note that we will use the central difference approximation

$$f'(x) \approx \dfrac{f(x+\delta) - f(x-\delta)}{2\delta}.$$
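Applied to each coefficient in turn, the central difference approximation gives the gradient of the RSS. Here's a sketch, written as a standalone function for illustration (in your class it would be a method using self):

```python
def calc_gradient(reg, delta):
    # central difference: nudge each coefficient by +/- delta, measure the
    # change in RSS, and divide by the width of the nudge (2 * delta)
    gradient = {}
    for name in reg.coefficients:
        original = reg.coefficients[name]
        reg.coefficients[name] = original + delta
        rss_plus = reg.calc_rss()
        reg.coefficients[name] = original - delta
        rss_minus = reg.calc_rss()
        reg.coefficients[name] = original  # restore before moving on
        gradient[name] = (rss_plus - rss_minus) / (2 * delta)
    return gradient
```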

Here is a test case:

df = DataFrame.from_array(
[[1,0],
[2,0],
[3,0],
[2,1],
[3,1],
[4,1]],
columns = ['x', 'y'])

reg = LogisticRegressor(df, dependent_variable='y')

reg.set_coefficients({'constant': 0.5, 'x': 0.5})

alpha = 0.01
delta = 0.01
num_steps = 20000

reg.gradient_descent(alpha, delta, num_steps)

reg.coefficients

{'constant': 2.7911, 'x': -1.1165}

Here are logs for every step of the way:

Make a plot of the resulting logistic curve, along with the data, and put it in an Overleaf doc. Be sure to label your curve with "gradient descent".

### Submission Template¶

link to Overleaf doc (just contains 2 plots and the explanation of the first plot): ____
repl.it link to code that generated the plots: _____
commit link (machine-learning): ____

# Problem 108¶

Going forward, we need to start using models from an external machine learning library after building the initial versions of the corresponding models. Most of the learning comes from building the first version, and debugging these subtle issues takes up too much time. Plus, it's good to know how to work with external libraries.

So instead of "build everything from scratch and maintain it forever", our motto will be "build the first version from scratch and then switch to a popular library".

### Important Note¶

If you're behind on any machine learning problems, don't worry about catching up. Just start off with this problem. This problem doesn't depend on anything you've written previously.

### The Problem¶

Create a new repository called kaggle. Create a folder titanic, and put your dataset and analysis file in there. Remember that the dataset is here:

https://www.kaggle.com/c/titanic/data?select=train.csv

In this assignment, you will create an analysis.py file that carries out an analysis similar to that described in problem 107, using the libraries numpy, pandas, and sklearn. You should follow along with the relevant parts of the walkthrough in the class recording:

https://vimeo.com/529459397

Here are the relevant parts. (But read the rest of the assignment before starting.)

• [0:35-0:42] Set up the environment & read in the dataframe

• [0:42-0:50] Process Sex by changing male to 0 and female to 1

• [0:56-1:02] Process Age by replacing all NaNs with the mean age

• [1:02-1:09] Process SibSp and Parch. Keep SibSp, but also add the indicator variable (i.e. dummy variable) SibSp>0. Add the indicator variable Parch>0 as well, and get rid of Parch.

• [1:17-1:42] Split into train/test, fit the regressor, get the predictions, compute training/testing accuracy. (At this point, don't worry about checking your numbers match up with mine, since I wasn't showing exactly which columns were being used in the regressor.)

• [1:42-1:46] State the columns to be used in the regressor. (Here, your numbers should match up with mine, since I show exactly which columns are being used in the regressor.)

• [1:46-1:56] Process Cabin into CabinType and create the corresponding indicator variables. Also, create the corresponding indicator variables for Embarked. Make sure to delete Cabin, CabinType, and Embarked afterwards.

• [2:00-2:02] Run the final model. Your numbers should match up with mine.

You can just follow along with the walkthrough in the class recording and turn in the code you write as you followed along.

Note that watching me type and speak at normal (slow) pace is a waste of time, so play the video on 2x speed. You can access the speed controls by clicking on the gear icon in the bottom-right of the video.

I think this is a 90-minute problem. The relevant parts of the recording take up 70 minutes, and if you play at 2x speed, it's only 35 minutes. If we budget one to two times that for writing the code as you follow along, then we're up to about 90 minutes. But if you find yourself taking longer or getting stuck anywhere, please let me know.
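For reference, the processing steps listed above can be sketched as a single function. This is a hedged sketch, not the walkthrough's exact code: column names follow the kaggle train.csv, and details (e.g. column ordering) may differ slightly from the recording.

```python
import pandas as pd

def process(df):
    df = df.copy()
    # Sex: male -> 0, female -> 1
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
    # Age: replace NaNs with the mean age
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    # indicator variables for SibSp and Parch; keep SibSp, drop Parch
    df['SibSp>0'] = (df['SibSp'] > 0).astype(int)
    df['Parch>0'] = (df['Parch'] > 0).astype(int)
    # Cabin -> CabinType (first letter, or 'None'), then indicator columns
    df['CabinType'] = df['Cabin'].str[:1]
    for col in ['CabinType', 'Embarked']:
        df[col] = df[col].fillna('None')
        for value in sorted(df[col].unique()):
            df[f'{col}={value}'] = (df[col] == value).astype(int)
    # drop the columns that were replaced by indicators
    return df.drop(columns=['Parch', 'Cabin', 'CabinType', 'Embarked'])
```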

Here is the documentation for LinearRegression():

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

At the end, your code should print out the following (where numbers are rounded to 4 decimal places):

features: [
'Sex',
'Pclass',
'Fare',
'Age',
'SibSp', 'SibSp>0',
'Parch>0',
'Embarked=C', 'Embarked=None', 'Embarked=Q', 'Embarked=S',
'CabinType=A', 'CabinType=B', 'CabinType=C', 'CabinType=D', 'CabinType=E', 'CabinType=F', 'CabinType=G', 'CabinType=None', 'CabinType=T']

training accuracy: 0.81
testing accuracy: 0.7749

coefficients:
{
'Constant': 0.696,
'Sex': 0.5283,
'Pclass': -0.0978,
'Fare': 0.0,
'Age': -0.0058,
'SibSp': -0.0585, 'SibSp>0': 0.0422,
'Parch>0': 0.0097,
'Embarked=C': -0.0547, 'Embarked=None': 0.052, 'Embarked=Q': 0.0709, 'Embarked=S': -0.0682,
'CabinType=A': 0.0447, 'CabinType=B': 0.0371, 'CabinType=C': -0.0124, 'CabinType=D': 0.1818, 'CabinType=E': 0.1088, 'CabinType=F': 0.2593, 'CabinType=G': -0.2797, 'CabinType=None': -0.0677, 'CabinType=T': -0.2717
}

### Submission Template:¶

Just submit 2 things:

1. the repl.it link to kaggle/titanic/analysis.py
2. the link to your github commit

# Problem 107¶

### Announcement¶

We're going to cut down on Eurisko assignment durations by a third. We've made a lot of progress, and most of you have AP tests coming up, so we're going to ease off the gas pedal a bit. We're going to hit the brakes on Haskell, C++, and code review, since you've had some basic exposure to those things and pursuing them further isn't going to be as valuable to the goals of the class as the space empires and machine learning stuff. Each assignment will consist of a single problem in one of the following areas:

• implementing something in space empires
• implementing part of a machine learning model
• implementing part of a data structure (e.g. Matrix, DataFrame)
• prepping/exploring some data for modeling
• carrying out a model and interpreting the results
• writeups (such as blog posts)

### Titanic Survival Modeling¶

For this problem, you'll need to turn in both your analysis code and an Overleaf writeup. The code should print out all the checks that are provided to you in this problem.

Note: after this problem was released, I realized I forgot to include a Constant column, as we should normally do for linear regression. However, the main things to be learned on this assignment don't really depend on the constant, so carry on without it.

a. Continue processing your data as follows:

• Sex - replace "male" with 0 and "female" with 1

• Age - replace any instances of None with the mean age (which should be about 29.699)

• SibSp - this was one of the variables that didn't have a clear positive or negative association with Survival. When SibSp=0, survival was low; when SibSp>=1, survival started higher but then decreased as SibSp increased.

So, what we can do is create a dummy variable SibSp=0 that equals 1 when SibSp is equal to 0 (and 0 otherwise). And we'll keep SibSp as well. This way, the variable SibSp=0 can be given a negative coefficient that offsets the coefficient of SibSp in the case when SibSp equals 0.

• Parch - we'll replace this with a dummy variable Parch=0, because the only significant difference in the data is whether or not Parch is equal to 0. Among passengers who had Parch greater than 0, it doesn't look like there's much variation in survival.

• CabinType - replace this with dummy variables of the form CabinType=A, CabinType=B, CabinType=C, CabinType=D, CabinType=E, CabinType=F, CabinType=G, CabinType=None, CabinType=T.

• Embarked - replace this with dummy variables of the form Embarked=C, Embarked=None, Embarked=Q, Embarked=S.

Now, your data should all be numeric, and we can put it into a linear regressor.

Note: To get predictions out of the linear regressor, we'll interpret the linear regression's output in the following way.

• if the linear regressor predicts a value less than 0.5, then it predicts the passenger did not survive (i.e. it predicts survival=0)

• if the linear regressor predicts a value greater than or equal to 0.5, then it predicts the passenger survived (i.e. it predicts survival=1)

b. Create train and test datasets. Use the first 500 records for training, and the rest for testing. Start out just training a model which uses Sex as the only feature. This will be our baseline.

train accuracy: 0.8
test accuracy:  0.7698

{'Sex': 0.7420}

Note that accuracy is just the number of correct classifications divided by the total number of classifications.
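For example:

```python
# accuracy = number of correct classifications / total number of classifications
predictions = [1, 0, 1, 1, 0]
actual = [1, 0, 0, 1, 1]
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
# 3 of the 5 predictions match, so the accuracy is 0.6
```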

c. Now, introduce Pclass. Uh oh! Why didn't our test accuracy get any better? Write your explanation in an Overleaf doc.

train accuracy: 0.8
test accuracy:  0.7698

{'Sex': 0.6514, 'Pclass': 0.0419}

Hint: Look at the Sex coefficient.

d. Bring in some more features: Fare, Age, SibSp, SibSp=0, Parch=0. The test accuracy still hasn't gotten any better. Why?

train accuracy: 0.796
test accuracy:  0.7698

{
'Sex': 0.5833,
'Pclass': -0.0123,
'Fare': 0.0012,
'Age': 0.0008,
'SibSp': -0.0152,
'SibSp=0': 0.0478,
'Parch=0': 0.0962
}

e. Bring in some more features: Embarked=C, Embarked=None, Embarked=Q, Embarked=S. Now the model actually got better. Why is the model more accurate now?

train accuracy: 0.806
test accuracy:  0.7903

{
'Sex': 0.4862,
'Pclass': -0.1684,
'Fare': 0.0002,
'Age': -0.0056,
'SibSp': -0.0719,
'SibSp=0': -0.0784,
'Parch=0': -0.0269,
'Embarked=C': 0.9179,
'Embarked=None': 1.0522,
'Embarked=Q': 0.9282,
'Embarked=S': 0.8544
}

f. Bring in some more features: CabinType=A, CabinType=B, CabinType=C, CabinType=D, CabinType=E, CabinType=F, CabinType=G, CabinType=None. The model is continuing to get better.

train accuracy: 0.816
test accuracy:  0.8005

{
'Sex': 0.4840,
'Pclass': -0.1313,
'Fare': 0.0003,
'Age': -0.0058,
'SibSp': -0.0724,
'SibSp=0': -0.0823,
'Parch=0': -0.0187,
'Embarked=C': 0.5446,
'Embarked=None': 0.6773,
'Embarked=Q': 0.5522,
'Embarked=S': 0.4829,
'CabinType=A': 0.3830,
'CabinType=B': 0.3360,
'CabinType=C': 0.2686,
'CabinType=D': 0.4311,
'CabinType=E': 0.4973,
'CabinType=F': 0.4679,
'CabinType=G': 0.0858,
'CabinType=None': 0.2634
}

g. Now, introduce CabinType=T. You'll probably see the accuracy go down. I won't include a check because different people will get different results for this one. Why did the accuracy go down?

This is subtle, so I'll give a hint. Look at the entries of $(X^TX)^{-1}$ and compare to what the entries looked like before you introduced CabinType=T. The entries get extremely large/small.

So, there are really two questions:

1. Why are the extremely large/small entries of $(X^TX)^{-1}$ leading to lower classification accuracy?
2. (Harder) Why are the entries of $(X^TX)^{-1}$ getting extremely large/small in the first place?

### Space Empires¶

Our shared game implementation is here:

https://github.com/eurisko-us/space-empires-cohort-1

Here is a high-level guide of the process for making changes to our shared repository:

https://guides.github.com/introduction/flow/

To check out a new branch:

>>> git checkout -b justin-comment
Switched to a new branch 'justin-comment'

Add a comment to yourname-comment.txt

>>> git status
On branch justin-comment
Untracked files:
(use "git add <file>..." to include in what will be committed)

justin-comment.txt

nothing added to commit but untracked files present (use "git add" to track)

>>> git add justin-comment.txt
>>> git commit -m "create Justin's comment"
[justin-comment 542f30e] create Justin's comment
1 file changed, 1 insertion(+)
create mode 100644 justin-comment.txt

Push to your branch

>>> git push origin justin-comment
Username for 'https://github.com': jpskycak
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 309 bytes | 309.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote:
remote: Create a pull request for 'justin-comment' on GitHub by visiting:
remote:      https://github.com/eurisko-us/space-empires-cohort-1/pull/new/justin-comment
remote:
To https://github.com/eurisko-us/space-empires-cohort-1.git
* [new branch]      justin-comment -> justin-comment

On GitHub, it will show that your branch is a commit ahead, and possibly even a few commits behind (if other people have made commits in the time since you first created your branch).

Click "Pull request", and create the pull request. Don't merge it yet, though. We'll do that during class.

### Submission Template¶

For your submission, copy and paste your links into the following template:

overleaf link to explanations: _____

repl.it link to file that prints out
the results of your model (it should
match up with the checks in the
assignment): _____

commit link (machine-learning): ____

# Problem 106-1¶

### Space Empires¶

I had a chat with Jason this morning about our approach to space empires. We've been building the games separately so that each person gets a maximum learning experience, but now we're at a point where this method of development is becoming so time consuming that it keeps us from making progress down other avenues (neural nets, sql parser, etc). Not to mention, we have a deadline: we need to have level 3 working in 3 weeks for our next meeting with Prof. Wierman.

So, he's ok with the idea of everyone in the class working on the same game implementation. We'll discuss that more during the next class, but for now, hit the brakes on space empires and focus on getting the problems on this current assignment done.

### Neural Networks¶

Now that you've had plenty of practice computing weight gradients, let's go back to implementations.

Consider the following dataset, whose points follow the function $y=A \sin (Bx)$ for some constants $A,B.$

[(0, 0.0),
(1, 1.44),
(2, 2.52),
(3, 2.99),
(4, 2.73),
(5, 1.8),
(6, 0.42),
(7, -1.05),
(8, -2.27),
(9, -2.93),
(10, -2.88),
(11, -2.12),
(12, -0.84),
(13, 0.65),
(14, 1.97),
(15, 2.81),
(16, 2.97),
(17, 2.4),
(18, 1.24),
(19, -0.23)]

Consider the following neural network:

$$\begin{matrix} & & n_2 \\ & & \uparrow \\ & & n_1 \\ & & \uparrow \\ & & n_0 \\ \end{matrix}$$

Let the activation functions be as follows: $f_0(x) = x,$ $f_1(x) = \sin(x),$ $f_2(x) = x.$

Then $a_2 = w_{12} \sin( w_{01} i_0 ),$ so we can use this network to fit our function $y=A \sin (Bx).$

Use this neural network to fit the dataset, starting with $w_{01} = w_{12} = 1$ and using a learning rate of $0.001.$ Loop through the dataset $1000$ times, applying a gradient descent update at each point (i.e. $20$ gradient descent updates per loop). So, there will be $20\,000$ gradient descent updates in total.

Your final weights should be $w_{01} = 0.42, w_{12} = 2.83$ rounded to $2$ decimal places.
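To make the training loop concrete, here is a minimal sketch of it, assuming the loss convention $E = \tfrac{1}{2}(a_2 - y)^2$ (if your code uses $E = (a_2 - y)^2$, each gradient picks up a factor of 2 and the trajectory will differ slightly):

```python
import math

# Sketch of SGD on the two-weight network a2 = w12 * sin(w01 * x).
# Loss convention assumed here: E = (1/2) * (a2 - y)^2.
data = [(0, 0.0), (1, 1.44), (2, 2.52), (3, 2.99), (4, 2.73),
        (5, 1.8), (6, 0.42), (7, -1.05), (8, -2.27), (9, -2.93),
        (10, -2.88), (11, -2.12), (12, -0.84), (13, 0.65), (14, 1.97),
        (15, 2.81), (16, 2.97), (17, 2.4), (18, 1.24), (19, -0.23)]

def sse(w01, w12):
    # sum of squared errors over the whole dataset
    return sum((w12 * math.sin(w01 * x) - y) ** 2 for x, y in data)

w01, w12 = 1.0, 1.0
lr = 0.001
for _ in range(1000):                 # 1000 passes over the dataset
    for x, y in data:                 # one gradient update per point
        s = math.sin(w01 * x)
        err = w12 * s - y                              # dE/da2
        grad_w01 = err * w12 * math.cos(w01 * x) * x   # chain rule through sin
        grad_w12 = err * s
        w01 -= lr * grad_w01
        w12 -= lr * grad_w12
```

Note that both gradients are computed from the same weights before either update is applied.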

Here is a log to help you debug. The numbers are rounded to 4 decimal places.

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/debugging-help/neural-net-106.txt

Here are the weight updates worked out for the second data point:

# Problem 106-2¶

### Titanic¶

In the Titanic dataset, let's get a sense of how the continuous variables (Age and Fare) relate to Survived.

a. For Age, filter the records down to age categories (0-10, 10-20, 20-30, ..., 70-80) and compute the survival rate (i.e. mean survival) in each category. Exclude any Nones from the analysis.

• Put a table in an overleaf document. Round the survival rate to $2$ decimal places (otherwise it's difficult to read).

• In the table, include the counts in parentheses. So each table entry should look like survivalRate (count). So if the survival rate were 0.13 and the count were 27 people, then you'd put 0.13 (27).

• What does the table tell you about the relationship between age and survival?

• Give a plausible explanation for why this is.

b. For Fare, filter the records down to fare categories (0-5, 5-10, 10-20, 20-50, 50-100, 100-200, 200+) and compute the survival rate (i.e. mean survival) in each category. Exclude any Nones from the analysis.

• Put a table in the overleaf document and answer the same questions that you did for part (a).
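The bin-then-average pattern for both parts can be sketched on plain (value, survived) pairs like this (helper name is hypothetical; an open-ended top bin like 200+ would need one extra case):

```python
# Sketch: bin records by value, then compute (survival rate, count)
# per bin. Works for both the age and fare category breakdowns.
def survival_by_bin(records, cutoffs):
    # cutoffs like [0, 10, 20, ...] define half-open bins [lo, hi)
    bins = {(lo, hi): [] for lo, hi in zip(cutoffs[:-1], cutoffs[1:])}
    for value, survived in records:
        if value is None:             # exclude Nones from the analysis
            continue
        for (lo, hi), group in bins.items():
            if lo <= value < hi:
                group.append(survived)
                break
    # survival rate rounded to 2 places, with the count alongside
    return {b: (round(sum(g) / len(g), 2), len(g))
            for b, g in bins.items() if g}
```

In the assignment itself you would pull the (Age, Survived) or (Fare, Survived) pairs out of your DataFrame first.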

### SQL Parser¶

Update your query method to support ORDER BY. The query

df.query("SELECT selectColname1, selectColname2, selectColname3 ORDER BY orderColname1 order1, orderColname2 order2, orderColname3 order3")

should be parsed and read into the following primitive operations:

df.order_by(orderColname3, order3)
.order_by(orderColname2, order2)
.order_by(orderColname1, order1)
.select([selectColname1, selectColname2, selectColname3])
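One way to sketch this is below; the tiny DataFrame is a stand-in for the class built earlier in the course (its order_by here takes the SQL keyword 'ASC'/'DESC' directly rather than an ascending= flag):

```python
# Minimal sketch of query() with SELECT and ORDER BY support.
class DataFrame:
    def __init__(self, rows, columns):
        self.rows = rows                  # list of dicts
        self.columns = columns

    @classmethod
    def from_array(cls, arr, columns):
        return cls([dict(zip(columns, row)) for row in arr], columns)

    def to_array(self):
        return [[row[c] for c in self.columns] for row in self.rows]

    def select(self, colnames):
        return DataFrame([{c: row[c] for c in colnames} for row in self.rows],
                         colnames)

    def order_by(self, colname, order='ASC'):
        # sorted() is stable, which is what makes the last-to-first
        # application in query() produce the right tiebreaks
        new_rows = sorted(self.rows, key=lambda row: row[colname],
                          reverse=(order == 'DESC'))
        return DataFrame(new_rows, self.columns)

    def query(self, q):
        if ' ORDER BY ' in q:
            select_part, order_part = q.split(' ORDER BY ')
            terms = [t.strip().split() for t in order_part.split(',')]
            orderings = [(t[0], t[1] if len(t) > 1 else 'ASC') for t in terms]
        else:
            select_part, orderings = q, []
        colnames = [c.strip()
                    for c in select_part.replace('SELECT', '', 1).split(',')]
        result = self
        # apply order_by clauses last-to-first so the first listed column
        # becomes the primary sort key; select comes last so ordering
        # columns need not appear in the select list
        for colname, order in reversed(orderings):
            result = result.order_by(colname, order)
        return result.select(colnames)
```

Applying order_by last-to-first relies on the sort being stable: each earlier clause then only breaks ties left by the later ones.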

Assert that your method passes the following tests:

>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)

>>> df.query("SELECT lastname, firstname, age ORDER BY age DESC").to_array()
[['Trapp', 'Charles', 17],
['Smith', 'Anna', 13],
['Mendez', 'Sylvia', 9],
['Fray', 'Kevin', 5]]

>>> df.query("SELECT firstname ORDER BY lastname ASC").to_array()
[['Kevin'],
['Sylvia'],
['Anna'],
['Charles']]

Assert that your method passes these tests as well:

>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Melvin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Carl', 'Trapp', 17],
['Anna', 'Smith', 13],
['Hannah', 'Smith', 13],
['Sylvia', 'Mendez', 9],
['Cynthia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)

>>> df.query("SELECT lastname, firstname, age ORDER BY age ASC, firstname DESC").to_array()
[['Fray', 'Melvin', 5],
['Fray', 'Kevin', 5],
['Mendez', 'Sylvia', 9],
['Mendez', 'Cynthia', 9],
['Smith', 'Hannah', 13],
['Smith', 'Anna', 13],
['Trapp', 'Charles', 17],
['Trapp', 'Carl', 17]]

# Problem 106-3¶

### Commit + Review¶

• Commit your code to Github.

• Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)

### Submission Template¶

For your submission, copy and paste your links into the following template:

repl.it link to neural net implementation that prints out the final weights: _____
overleaf link to titanic analysis: _____
repl.it link to sql parser: _____

link to resolved issue: ____
Commit links (machine-learning): ____

# Problem 105-1¶

This will be a "consolidation problem." Your task is to make sure that you have Problem 104-1 completed by the end of the weekend, with the exception that you don't have to run your classmates' unit tests. You just have to get movement test 1 working and write your own unit test as assigned in Problem 104-1.

Remember that to initialize your game, you may need to loop through your game state to initialize some Player and Unit objects accordingly. If you get stuck or confused, please post on Slack.

Remember that to push your unit tests up to Github, you'll need to clone the repo, make your changes, and commit and push your changes. Here is how to do this:

• Clone the repo:

>>> git clone https://github.com/eurisko-us/space-empires-cohort-1.git
• Create your new unit tests. A fast way to make the necessary files is to cd into the desired location and then touch some files, like this:

>>> ls
space-empires-cohort-1
>>> cd space-empires-cohort-1/
>>> ls
>>> cd unit_tests/
>>> ls
movement_test_1
>>> mkdir combat_test_1
>>> cd combat_test_1
>>> touch description.txt initial_state.py final_state.py strategies.py
>>> ls
description.txt final_state.py initial_state.py strategies.py
• Commit and push your unit tests.

>>> git status
(will show the files you modified)
>>> git add *
(add all the files you modified)
>>> git commit -m "add combat test 1"
>>> git push origin
• Check that the repo was updated successfully. Go to https://github.com/eurisko-us/space-empires-cohort-1 and make sure your unit tests are there.

# Problem 105-2¶

### Quiz Corrections¶

Correct any errors on your quiz (if you got a score under 100%). You can just submit corrected code and/or explanations (you don't have to explain why you got it wrong in the first place).

Remember that we went through the quiz during class, so if you have any questions or need any help, look at the recording first.

### C++¶

Write a C++ program that creates an array {11, 12, 13, 14} and prints out the memory address of the array and of each element.

Format your output like this:

array has address 0x7fff58f44160
index 0 has value 11 and address 0x7fff58f44160
index 1 has value 12 and address 0x7fff58f44164
index 2 has value 13 and address 0x7fff58f44168
index 3 has value 14 and address 0x7fff58f4416c

Note that your memory addresses will not be the same as those above. (Each time you run the program, the memory addresses will be different.)

Note: If you're having trouble figuring out where to start, remember that we've answered conceptual questions about pointers and the syntax of pointers using this resource:

https://www.learncpp.com/cpp-tutorial/introduction-to-pointers/

# Problem 105-3¶

### Commit + Review¶

• Commit your code to Github.

• Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)

### Submission Template¶

For your submission, copy and paste your links into the following template:

github link to space empires unit test that you created: ____
link to repl.it file in which you run movement test 1: ____
link to quiz corrections (if applicable): _____
link to c++ problem: _____

link to resolved issue: ____
Commit links (space-empires, assignment-problems): ____

# Problem 104-1¶

This is the new repo where we'll store our logs, unit tests, and wiki pages:

https://github.com/eurisko-us/space-empires-cohort-1

You should all have write access to the repo.

### Create Unit Tests¶

Each person will create 1 unit test. You can use Colby's sheet for inspiration, or make up your own unit test.

Before you write your unit test, though, check in with the other person who's doing a test for the same phase to make sure that your test is different from theirs.

• David: create movement test 2

• George: create combat test 1

• Colby: create combat test 2

• Elijah: create economic test 1

• Riley: create economic test 2

Post on slack if you run into any trouble pushing your tests up to the repo.

### Run Unit Tests¶

Clone eurisko-us/space-empires-cohort-1

Create a file to run all the unit tests. You can start making progress on this right away, since movement test 1 already exists.

Once your classmates push their tests, you can run them.
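One building block for the runner, assuming game states are (possibly nested) dicts: a helper that reports exactly where the actual final state differs from the expected one (function name is hypothetical):

```python
# Sketch: recursively compare two state dicts and collect
# (path, actual_value, expected_value) triples for every mismatch.
def diff_states(actual, expected, path=''):
    diffs = []
    for key in sorted(set(actual) | set(expected), key=str):
        a, e = actual.get(key), expected.get(key)
        here = f'{path}/{key}'
        if isinstance(a, dict) and isinstance(e, dict):
            diffs += diff_states(a, e, here)    # recurse into nested state
        elif a != e:
            diffs.append((here, a, e))
    return diffs
```

An empty result means the phase produced exactly the expected state; a nonempty result tells you which keys to debug.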

# Problem 104-2¶

### SQL Parser¶

We're going to write a method in our DataFrame called query, that will take a string with SQL-like syntax as input and execute the corresponding operations on our dataframe.

Let's start off simple, with the select statement only.

Write a function query that takes a select query of the form

df.query("SELECT colname1, colname2, colname3")

and returns a dataframe with the appropriate select statement applied:

df.select([colname1, colname2, colname3])

Here is a concrete example that you should write a test for:

>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)

>>> df.query('SELECT firstname, age').to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]

Make sure your function is general (it should not be tailored to a specific number of columns).
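The generic column-list parse can be sketched as follows (helper name is hypothetical); splitting on commas keeps it independent of the number of columns:

```python
# Sketch: turn "SELECT a, b, c" into ['a', 'b', 'c'] for any
# number of columns.
def parse_select(query):
    assert query.startswith('SELECT ')
    return [c.strip() for c in query[len('SELECT '):].split(',')]
```

Your query method would then pass this list straight to df.select.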

### Titanic Survival Exploration¶

Now that we are able to use our group_by and aggregate methods in our dataframes, let's return to the Titanic dataset.

We now have the following columns in our dataframe, and our current task is to figure out how each of these columns are related to survival (if at all).

[
"Pclass",
"Surname",
"Sex",
"Age",
"SibSp",
"Parch",
"TicketType",
"TicketNumber",
"Fare",
"CabinType",
"CabinNumber",
"Embarked"
]

Let's start with the columns that consist of few categories and are therefore relatively easy to analyze.

Put your answers to the following questions in an overleaf doc. Include a table for each answer, and be sure to explain what the data tells you about how that variable is related to survival (if anything), as well as why you think that relationship happens.

Note that there is not always a single correct answer regarding why the relationship happens, but you should try to come up with a plausible explanation.

To look up what a variable actually represents, check the data dictionary here: https://www.kaggle.com/c/titanic/data

a. Group your dataframe by Pclass and find the survival rate (i.e. the mean of the survival variable) and the count of records for each Pclass.

You should get the following result. What does this result tell you about how Pclass is related to survival? Why do you think this is?

Pclass  meanSurvival  count
1       0.629630      216
2       0.472826      184
3       0.242363      491

b. Group your dataframe by Sex and find the survival rate and count of records for each sex.

You should get the following result. What does this result tell you about how Sex is related to survival? Why do you think this is?

Sex     meanSurvival count
female  0.742038     314
male    0.188908     577

c. Continuing the same analysis method as in parts (a) and (b): what is the table for SibSp, what does it tell you about how SibSp is related to survival, and why do you think this is?

d. Continuing the same analysis method: what is the table for Parch, what does it tell you about how Parch is related to survival, and why do you think this is?

e. Continuing the same analysis method: what is the table for CabinType, what does it tell you about how CabinType is related to survival, and why do you think this is?

f. Continuing the same analysis method: what is the table for Embarked, what does it tell you about how Embarked is related to survival, and why do you think this is?

In case you're interested, here is what we'll be doing in future assignments:

• exploring some of the continuous variables (e.g. Age and Fare)
• fitting models to the data
• featurizing "messier" data like Surname, TicketType, etc and seeing if it improves our models

# Problem 104-3¶

### Commit + Review¶

• Commit your code to Github.

• Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)

### Submission Template¶

For your submission, copy and paste your links into the following template:

github link to space empires unit test that you created: ____
link to repl.it file in which you run the unit tests: ____
link to DataFrame.query test: ____
overleaf writeup for titanic survival exploration: _____
link to resolved issue: ____
Commit links (space-empires, machine-learning): ____

# Problem 103-1¶

### Summary¶

The current primary problem can be to finish up the slinky development thing from last time, and also write a method that initializes the game with a given game state.

And then on Wednesday's assignment, we can write unit tests that @Colby put in the doc (as well as any other unit tests you guys want).

And then after we've got those unit tests working, we can do a couple rounds of slinky development to scale up to the level 3 game that we were trying to implement before. I'm sure we can succeed in getting it done before the next meeting with Prof. Wierman.

Also, remember to watch the lectures that Prof. Wierman put in the chat sometime before our next meeting.

### Finish Up Slinky Development¶

You guys can have 2 more days to work on getting your game to match up with the logs.

Important! The logs have been updated! As of Sunday, the combat shown in the logs was screwed up.

The rule I implemented was that you score a hit if

die roll >= (attack strength) + (attack technology) - (defense strength) - (defense technology)

But actually, it should be that you get a hit if

die roll = 1
or die roll <= (attack strength) + (attack technology) - (defense strength) - (defense technology)
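The corrected rule as a one-function sketch (function name is hypothetical):

```python
# Corrected hit rule: a roll of 1 always hits; otherwise the roll
# must be at or below the attack-minus-defense threshold.
def scores_hit(roll, atk_strength, atk_tech, def_strength, def_tech):
    threshold = atk_strength + atk_tech - def_strength - def_tech
    return roll == 1 or roll <= threshold
```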

I also implemented some additional information suggested by George, such as survivors after combat, and removing Homeworld from the combat order.

So, get your logs to match up with mine from Problem 102-1, and submit your diffs. The logs are in cohort-1/102-pre-level-3-game of the slinky-development repo:

### Prep for Unit Tests¶

We can do unit tests without too much additional infrastructure. We can just make a method in our game that initializes the game with a given game state, and then we can run the appropriate phase (movement, combat, economic) and make sure the game state afterwards is as we expect it to be.

For this assignment, just make the method that initializes your game with a given game state, and make sure that you will be able to update that state incrementally (i.e. by running the appropriate phase).

# Problem 103-2¶

### SQL Primitives: Group By & Aggregate¶

The next thing we need to do in our titanic prediction modeling is to determine which features are useful for predicting survival. However, this will involve some extensive data processing, and it will be much easier to do this if we first build some SQL primitives.

You should already have methods select, where, and order_by implemented in your DataFrame class. Check to make sure you have these methods and that they pass the following tests.

• Note: You may have previously written these methods under slightly different names. You may need to rename select_columns to just select, and select_rows_where to just where.
>>> df = DataFrame.from_array(
[['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]],
columns = ['firstname', 'lastname', 'age']
)

>>> df.select(['firstname','age']).to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]

>>> df.where(lambda row: row['age'] > 10).to_array()
[['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]

>>> df.order_by('firstname').to_array()
[['Anna', 'Smith', 13],
['Charles', 'Trapp', 17],
['Kevin', 'Fray', 5],
['Sylvia', 'Mendez', 9]]

>>> df.order_by('firstname', ascending=False).to_array()
[['Sylvia', 'Mendez', 9],
['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]

>>> df.select(['firstname','age']).where(lambda row: row['age'] > 10).order_by('age').to_array()
[['Anna', 13],
['Charles', 17]]

At this point, writing a "select-where-order" SQL statement in terms of the primitives seems obvious. Just apply the select, where, and order primitives in that order. Right?

Not exactly. The intuitive order only works when the columns referenced in where and order_by also appear in the select statement. So, to carry out a "select-where-order" SQL statement, we really need to apply the primitives in the order where, order, select.

A concrete example is shown below.

# this query FAILS because we filtered out the 'age' column
# before applying the where condition, and the where condition
# references the 'age' column

>>> df.select(['firstname']).where(lambda row: row['age'] > 10).order_by('age').to_array()
ERROR

# this query SUCCEEDS because we apply the where condition
# before filtering out the 'age' column

>>> df.where(lambda row: row['age'] > 10).order_by('age').select(['firstname']).to_array()
[['Anna'],
['Charles']]

Your task on this problem is to implement another primitive we will need: group_by. Make sure your implementation passes the test below.

>>> df = DataFrame.from_array(
[
['Kevin Fray', 52, 100],
['Charles Trapp', 52, 75],
['Anna Smith', 52, 50],
['Sylvia Mendez', 52, 100],
['Kevin Fray', 53, 80],
['Charles Trapp', 53, 95],
['Anna Smith', 53, 70],
['Sylvia Mendez', 53, 90],
['Anna Smith', 54, 90],
['Sylvia Mendez', 54, 80],
],
columns = ['name', 'assignmentId', 'score']
)

>>> df.group_by('name').to_array()
[
['Kevin Fray', [52, 53], [100, 80]],
['Charles Trapp', [52, 53], [75, 95]],
['Anna Smith', [52, 53, 54], [50, 70, 90]],
['Sylvia Mendez', [52, 53, 54], [100, 90, 80]],
]

Also, implement a method called aggregate(colname, how) that aggregates colname according to the way that is specified in how (count, max, min, sum, avg). Make sure your implementation passes the tests below.

>>> df.group_by('name').aggregate('score', 'count').to_array()
[
['Kevin Fray', [52, 53], 2],
['Charles Trapp', [52, 53], 2],
['Anna Smith', [52, 53, 54], 3],
['Sylvia Mendez', [52, 53, 54], 3],
]

>>> df.group_by('name').aggregate('score', 'max').to_array()
[
['Kevin Fray', [52, 53], 100],
['Charles Trapp', [52, 53], 95],
['Anna Smith', [52, 53, 54], 90],
['Sylvia Mendez', [52, 53, 54], 100],
]

>>> df.group_by('name').aggregate('score', 'min').to_array()
[
['Kevin Fray', [52, 53], 80],
['Charles Trapp', [52, 53], 75],
['Anna Smith', [52, 53, 54], 50],
['Sylvia Mendez', [52, 53, 54], 80],
]

>>> df.group_by('name').aggregate('score', 'sum').to_array()
[
['Kevin Fray', [52, 53], 180],
['Charles Trapp', [52, 53], 170],
['Anna Smith', [52, 53, 54], 210],
['Sylvia Mendez', [52, 53, 54], 270],
]

>>> df.group_by('name').aggregate('score', 'avg').to_array()
[
['Kevin Fray', [52, 53], 90],
['Charles Trapp', [52, 53], 85],
['Anna Smith', [52, 53, 54], 70],
['Sylvia Mendez', [52, 53, 54], 90],
]
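A minimal sketch of both methods on a stripped-down DataFrame (a stand-in for the class built earlier in the course; only the pieces needed here are included):

```python
# Sketch of the group_by and aggregate primitives.
class DataFrame:
    def __init__(self, rows, columns):
        self.rows = rows                  # list of dicts
        self.columns = columns

    @classmethod
    def from_array(cls, arr, columns):
        return cls([dict(zip(columns, row)) for row in arr], columns)

    def to_array(self):
        return [[row[c] for c in self.columns] for row in self.rows]

    def group_by(self, colname):
        # collect the other columns into lists, one group per key,
        # preserving the order in which keys first appear
        others = [c for c in self.columns if c != colname]
        groups = {}
        for row in self.rows:
            group = groups.setdefault(row[colname], {c: [] for c in others})
            for c in others:
                group[c].append(row[c])
        new_rows = [{colname: key, **vals} for key, vals in groups.items()]
        return DataFrame(new_rows, [colname] + others)

    def aggregate(self, colname, how):
        # each supported 'how' maps to a reduction over the grouped list
        ops = {'count': len, 'max': max, 'min': min, 'sum': sum,
               'avg': lambda v: sum(v) / len(v)}
        new_rows = [{**row, colname: ops[how](row[colname])}
                    for row in self.rows]
        return DataFrame(new_rows, self.columns)
```

Note that aggregate replaces only the named column, leaving the other grouped lists (like assignmentId above) untouched.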

### SQL¶

The goal of this problem is to find the number of missing assignments for each student (across all classes) for the following data:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/sql-tables/4.sql

This problem will involve the use of subqueries. Since this is our first problem involving subqueries (other than some simple stuff on SQL Zoo), I've scaffolded it a bit for you.

First, write a query to get the number of assignments that were assigned in each class. Let's call this Query 1. (Tip: use "count distinct")

classId numAssigned
2307    3
3110    2
4990    3

Then, get the number of assignments that each student has completed in each class. Let's call this Query 2. (Tip: group by both studentId and classId)

studentId   classId numCompleted
1   2307    3
1   3110    2
1   4990    2
2   2307    2
2   3110    2
2   4990    3
3   2307    1
3   3110    2
3   4990    1
4   2307    3
4   3110    1
4   4990    3
5   2307    1
5   3110    2
5   4990    3

Join the results of queries 1 and 2 so that you can compute each student's number of missing assignments. (Tip: use queries 1 and 2 as subqueries)

studentId   classId numMissing
1   2307    0
1   3110    0
1   4990    1
2   2307    1
2   3110    0
2   4990    0
3   2307    2
3   3110    0
3   4990    2
4   2307    0
4   3110    1
4   4990    0
5   2307    2
5   3110    0
5   4990    0

Then, use the previous query to find the total number of missing assignments.

name    totalNumMissing
Franklin Walton 1
Sylvia Sanchez  1
Harry Ng    4
Ishmael Smith   1
Kinga Shenko    2
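The subquery-join step can be sketched with Python's sqlite3 on a tiny hypothetical schema (the table and column names below are made up stand-ins for whatever 4.sql actually defines):

```python
import sqlite3

# Hypothetical schema: assignments(classId, assignmentId) and
# completions(studentId, classId, assignmentId).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE assignments (classId INT, assignmentId INT)")
cur.execute("CREATE TABLE completions (studentId INT, classId INT, assignmentId INT)")
cur.executemany("INSERT INTO assignments VALUES (?, ?)",
                [(2307, 1), (2307, 2), (3110, 1)])
cur.executemany("INSERT INTO completions VALUES (?, ?, ?)",
                [(1, 2307, 1), (1, 2307, 2), (1, 3110, 1), (2, 2307, 1)])

# Queries 1 and 2 become subqueries in FROM; joining on classId lets
# us subtract numCompleted from numAssigned.
rows = cur.execute("""
    SELECT c.studentId, c.classId, a.numAssigned - c.numCompleted AS numMissing
    FROM (SELECT classId, COUNT(DISTINCT assignmentId) AS numAssigned
          FROM assignments GROUP BY classId) a
    JOIN (SELECT studentId, classId, COUNT(DISTINCT assignmentId) AS numCompleted
          FROM completions GROUP BY studentId, classId) c
    ON a.classId = c.classId
    ORDER BY c.studentId, c.classId
""").fetchall()
```

One caveat: a student with zero completions in a class drops out of this inner join entirely, so if that case can occur you'd need a LEFT JOIN from the full student-class combinations instead.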

# Problem 103-3¶

### Commit + Review¶

• Commit your code to Github.

• Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to code that generates your space-empires logs: ____
Link to diff that shows your logs are the same as the given logs: ____
Repl.it link to group_by and aggregate tests: ____
Resolved issue: _____
Commit links (space-empires, machine-learning): ____

# Problem 102-1¶

Note: We have a meeting with Prof. Wierman on Monday from 11:30am-12:30pm. Put this on your calendar and set some kind of alarm so you don't forget. I'll paste the meeting link in Slack when it's time. Also, please prepare to turn on your video for this meeting (it's a more formal setting).

The Space Empires game is posing a bit of a challenge in that we need a "log of truth" to match up against when we're reconciling games. It's inefficient when we all try to reconcile at the same time, because that's an $n \times n$ problem. It's also not feasible for me to create the log of truth due to the magnitude of the additional time commitment that would be needed to code up the game.

I talked to Jason last night about this, and he had a pretty good idea. What we can do instead is repeatedly have 2 people create logs for some strategy matchup, reconcile their logs to form the log of truth, and then have the rest of the class reconcile against that log of truth. Then, 2 new people will create the next log of truth, resulting in a slinky-like effect. (So we'll refer to this as "slinky development".)

I'll start off the first round. Check out cohort-1/102-pre-level-3-game of the slinky-development repo:

The rules for the game are in rules.txt and the logs show the simulation results for several random seeds.

Your task is to replicate these logs with your game, exactly the way they appear in the repo, using the exact strategies in strategies.py. Note that a tab is written as \t. (But be sure to post on Slack if you think there are any errors.)

Then, copy your logs into https://www.diffchecker.com/ to verify that your logs match up perfectly with the logs in the repo. You'll save/submit the link to your diffs (example: https://www.diffchecker.com/57HDK3vO) along with a link to the code that you used to generate your logs.

# Problem 102-2¶

### Titanic Survival Modeling¶

The first step towards building our models is deciding which independent variables to include in our model (i.e. which variables might be useful for predicting survival?). There is a data dictionary at https://www.kaggle.com/c/titanic/data that describes what each variable means. Here are the first couple rows, for reference:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S

Some variables will not be useful in our model:

• PassengerId is just the row number of the dataset. It has nothing to do with the actual properties of passengers. We can discard it.

Other variables may not be useful as-is, but they may be useful after some additional processing:

• Name has too many categories to be useful in its entirety. However, the surname alone may be useful, given that passengers in the same family likely stuck together and took similar paths leading to survival or death.

• Ticket appears to be formatted as a ticket type and ticket number. If we split those up into two variables (ticket type and ticket number), then we may be able to find some use in those.

• Cabin appears to be formatted as a cabin type and cabin number. If we split those up into two variables, then we may be able to find some use in those.

Other variables seem like they may be useful with minimal processing: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked.

Your task is to split Name, Ticket, and Cabin into the sub-variables mentioned above (Surname, TicketType, TicketNumber, CabinType, CabinNumber). Next time, we'll analyze all the variables to determine how much they tell us about survival, but for now, let's just worry about creating those sub-variables that we want to investigate.

• (Note that we also want to investigate Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked, but these variables won't need to be split like Name, Ticket, and Cabin do, so we don't need to worry about them right now)

Note: In the following problems, your dataframe method apply will be useful (see problem 28-2) and so will Python's split method (https://www.geeksforgeeks.org/python-string-split/)

a. Get the Surname from Name. In the way the names are formatted, it appears that the surname always consists of the characters preceding the first comma.

• While we're at it, let's get rid of that awkward quote at the beginning of the surname. You can do this by just ignoring the first character.

b. Split Cabin into CabinType and CabinNumber, e.g. the cabin B51 has type B and number 51.

• If you look at the dataset, you'll see that Cabin sometimes has multiple cabin numbers, e.g. B51 B53 B55. The cabin types appear to all be the same, while the cabin number is incremented by a small amount for each cabin. So, we can get a decent approximation by just considering the first entry (in the case of B51 B53 B55, we'll just consider B51).

• Keep CabinType as a string but set CabinNumber to be an integer. (You may wish to write a method in your DataFrame that converts a column to a desired type.)

c. Split Ticket into TicketType and TicketNumber, e.g. the ticket SOTON/O.Q. 3101312 has type SOTON and number 3101312.

• Watch out! Some tickets don't have a type, so it would be None. For example, the ticket 19877 would have type None and number 19877.

• Keep TicketType as a string but set TicketNumber to be an integer.

Here's an example of what the output should look like. First, read in the data as usual:

>>> from somefile import parse_line
>>> data_types = {
"PassengerId": int,
"Survived": int,
"Pclass": int,
"Name": str,
"Sex": str,
"Age": float,
"SibSp": int,
"Parch": int,
"Ticket": str,
"Fare": float,
"Cabin": str,
"Embarked": str
}
>>> df = DataFrame.from_csv("data/dataset_of_knowns.csv", data_types=data_types, parser=parse_line)
>>> df.columns
["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

>>> df.to_array()[:5]
[[1, 0, 3, '"Braund, Mr. Owen Harris"', "male", 22.0, 1, 0, "A/5 21171", 7.25, None, "S"],
[2, 1, 1, '"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"', "female", 38.0, 1, 0, "PC 17599", 71.2833, "C85", "C"],
[3, 1, 3, '"Heikkinen, Miss. Laina"', "female", 26.0, 0, 0, "STON/O2. 3101282", 7.925, None, "S"],
[4, 1, 1, '"Futrelle, Mrs. Jacques Heath (Lily May Peel)"', "female", 35.0, 1, 0, "113803", 53.1, "C123", "S"],
[5, 0, 3, '"Allen, Mr. William Henry"', "male", 35.0, 0, 0, "373450", 8.05, None, "S"]]

Then, process your df. You don't have to write generalized code for this part. This can be a one-off thing.

After processing, your dataframe should look like this:

>>> df.columns
["PassengerId", "Survived", "Pclass", "Surname", "Sex", "Age", "SibSp", "Parch", "TicketType", "TicketNumber", "Fare", "CabinType", "CabinNumber", "Embarked"]

>>> df.to_array()[:5]
[[1, 0, 3, "Braund", "male", 22.0, 1, 0, "A/5", 21171, 7.25, None, None, "S"],
[2, 1, 1, "Cumings", "female", 38.0, 1, 0, "PC", 17599, 71.2833, "C", 85, "C"],
[3, 1, 3, "Heikkinen", "female", 26.0, 0, 0, "STON/O2.", 3101282, 7.925, None, None, "S"],
[4, 1, 1, "Futrelle", "female", 35.0, 1, 0, None, 113803, 53.1, "C", 123, "S"],
[5, 0, 3, "Allen", "male", 35.0, 0, 0, None, 373450, 8.05, None, None, "S"]]
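The three splits can be sketched as plain helper functions (names are hypothetical); in the assignment you'd apply them to columns via your DataFrame's apply method:

```python
# Sketch of the Name / Cabin / Ticket splits described above.
def get_surname(name):
    # surname = everything before the first comma, minus the
    # leading quote character
    return name.split(',')[0][1:]

def split_cabin(cabin):
    # 'B51 B53 B55' -> keep only the first entry -> ('B', 51)
    if cabin is None:
        return (None, None)
    first = cabin.split(' ')[0]
    number = int(first[1:]) if first[1:] else None  # some cabins lack a number
    return (first[0], number)

def split_ticket(ticket):
    # 'A/5 21171' -> ('A/5', 21171); a bare number like '19877'
    # has no type -> (None, 19877)
    parts = ticket.split(' ')
    if len(parts) == 1:
        return (None, int(parts[0]))
    return (' '.join(parts[:-1]), int(parts[-1]))
```

Note split_ticket assumes the last token is numeric; if the dataset contains tickets with no number at all, you'd need one more guard.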

# Problem 102-3¶

### Commit + Review¶

• Commit your code to Github.

• Resolve 1 GitHub issue on one of your own repositories. (If you don't have any issues to resolve, just write a note in your submission that that's the case.)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to code that generates your space-empires logs: ____
Link to diff that shows your logs are the same as the given logs: ____
Repl.it link to titanic dataset processing: ____
Resolved issue: _____
Commit links (space-empires, machine-learning): ____

# Problem 101-1¶

### Space Empires¶

Per our discussion in class, we'll refactor our level 3 games before returning to resolving discrepancies.

I've updated the wiki

https://github.com/eurisko-us/eurisko-us.github.io/wiki/Space-Empires-Rules-(Cohort-1,-Level-3)

with the following changes:

• Eliminate hidden_game_state_for_combat per George's suggestion -- put all that information in combat_state instead.

• In the game state:

• instead of using an array of players, make it a dictionary where the key is the player number.

• store turn_created for all units, as this might make debugging easier

• Store a num for each ship. In the game, the units are identified in the form type-number, like scout-2, where the number is only unique to a particular type of ship, i.e. there can be a scout-2 and a destroyer-2. So we should probably take care of that now. And that allows us to keep our units in the form of an array, which is what we had before.

• Due to the change in the game state, there are some resulting changes in the format of outputs in the strategy template

If you have any more refactoring ideas or disagree with any of the above refactorings, post on #machine-learning and we can discuss.

Also, in the wiki, there's now a section called "Gotchas" where you can write down any subtle rules that you've encountered, that you think others may not have implemented. For example:

During combat, you can't attack a colony until all ships have been destroyed

### Neural Nets¶

In Problem 94-1, you needed to create a logistic regressor neural network. Previously, this was a bit difficult because we hadn't had enough practice computing weight gradients. But now, we've had much more practice, so I think it should be within reach.

Make sure this logistic regressor neural net is working. If you managed to get it working on assignment 94, then you can just submit the code that you already wrote. Otherwise, if you didn't manage to get it working before, then your task is to get it working now.
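For reference, a one-neuron logistic regressor can be sketched like this (the exact loss convention in the course's version is an assumption here):

```python
import math

# Minimal sketch of a logistic regressor y = sigmoid(w*x + b),
# trained by SGD on squared error E = (pred - y)^2.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit(data, lr=0.1, epochs=1000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = sigmoid(w * x + b)
            # chain rule: dE/d(pre-activation) = 2(pred-y) * sigmoid'
            grad = 2 * (pred - y) * pred * (1 - pred)
            w -= lr * grad * x
            b -= lr * grad
    return w, b
```

On linearly separable 1-D data this should learn a positive weight that pushes predictions toward 0 on one side and 1 on the other.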

# Problem 101-2¶

### Titanic Survival Modeling - Loading the Data¶

Location: machine-learning/kaggle/titanic/data_loading.py

a. Make an account on Kaggle.com so that we can walk through a Titanic prediction task.

• Go to https://www.kaggle.com/c/titanic/data, scroll down to the bottom, and click "download all". You'll get a zip file called titanic.zip.

• Upload titanic.zip into machine-learning/kaggle/titanic/data. Then, run unzip machine-learning/kaggle/titanic/data/titanic.zip in the command line to unzip the file.

• This gives us 3 files: train.csv, test.csv, and gender_submission.csv. The file train.csv contains data about a bunch of passengers along with whether or not they survived. Our goal is to use train.csv to build a model that will predict the outcome of passengers in test.csv (for which the survival data is not given).

• IMPORTANT: To prevent confusion, rename train.csv to dataset_of_knowns.csv, rename test.csv to unknowns_to_predict.csv, and rename gender_submission.csv to predictions_from_gender_model.csv.

b. In your DataFrame class, update your method from_csv so that it accepts the following (optional) arguments:

• a line parser

• a dictionary of data types

If you encounter any empty strings, then save those as None rather than the type given in the dictionary of data types.

>>> from somefile import parse_line
>>> data_types = {
"PassengerId": int,
"Survived": int,
"Pclass": int,
"Name": str,
"Sex": str,
"Age": float,
"SibSp": int,
"Parch": int,
"Ticket": str,
"Fare": float,
"Cabin": str,
"Embarked": str
}
>>> df = DataFrame.from_csv("data/dataset_of_knowns.csv", data_types=data_types, parser=parse_line)
>>> df.columns
["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

>>> df.to_array()[:5]
[[1, 0, 3, '"Braund, Mr. Owen Harris"', "male", 22.0, 1, 0, "A/5 21171", 7.25, None, "S"],
[2, 1, 1, '"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"', "female", 38.0, 1, 0, "PC 17599", 71.2833, "C85", "C"],
[3, 1, 3, '"Heikkinen, Miss. Laina"', "female", 26.0, 0, 0, "STON/O2. 3101282", 7.925, None, "S"],
[4, 1, 1, '"Futrelle, Mrs. Jacques Heath (Lily May Peel)"', "female", 35.0, 1, 0, "113803", 53.1, "C123", "S"],
[5, 0, 3, '"Allen, Mr. William Henry"', "male", 35.0, 0, 0, "373450", 8.05, None, "S"]]
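The empty-string rule above can be sketched on its own. Here is a minimal, hypothetical helper (not the full from_csv method): empty strings become None, and everything else is cast using the data_types dictionary.

```python
# Minimal sketch of the type-casting rule: '' -> None, otherwise
# cast the string using the type given in data_types.

def cast_entries(entries, columns, data_types):
    row = []
    for column, entry in zip(columns, entries):
        if entry == "":
            row.append(None)                     # empty string -> None
        else:
            row.append(data_types[column](entry))  # e.g. int("1") -> 1
    return row

columns = ["PassengerId", "Age", "Cabin"]
data_types = {"PassengerId": int, "Age": float, "Cabin": str}
cast_entries(["1", "22", ""], columns, data_types)
# -> [1, 22.0, None]
```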

# Problem 101-3¶

### Commit¶

• Commit your code to Github.

(You don't have to make or resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

repl.it link to logistic neural net: _____

commits: _____
(machine-learning, space-empires)

# Problem 100-1¶

Announcement: There will be a quiz on Friday. Topics will include SQL, C++, and neural net gradient computations.

### Space Empires¶

Note: I put some information on the wiki here:

https://github.com/eurisko-us/eurisko-us.github.io/wiki/Space-Empires-Rules-(Cohort-1,-Level-3)

If you see any mistakes or any information that you think should be added, post about it on Slack. If your classmates agree, then you can go ahead and edit the wiki entry with your updates.

a. If your strategy assumes that unit arrays are ordered in any particular way, refactor your strategy so that it doesn't. Then, send it to me again so that I can upload it into the submissions folder.

For example, David's strategy intends to take some action with half of its scouts by checking if ship_index % 2 == 1. But there is no guarantee that this will be true for any scouts.

• For example, the unit array in one person's game might be

[Scout, Shipyard, Scout, Shipyard, Scout, Shipyard, Scout, Colony]

which would result in no scouts taking the desired action.

• On the other hand, the unit array in another person's game might be

[Shipyard, Scout, Shipyard, Scout, Shipyard, Scout, Colony, Scout]

which would result in all scouts taking the desired action.

As our games work right now, we can't assume that the index tells us anything about what type of ship it is. Rather, to check what type of ship it is, we need to look at game_state['players'][self.player_index]['units'][ship_index] to check if it's actually a scout.

If you wanted to send half of your scouts to the enemy, this is how you could do it:

units = game_state['players'][self.player_index]['units']
scout_indices = [i for i, unit in enumerate(units) if unit['type'] == 'Scout']

# take every other scout, so we get half of them no matter where
# the scouts happen to sit in the units array
half_scout_indices = scout_indices[::2]

if ship_index in half_scout_indices:
    unit = units[ship_index]
    translation_towards_enemy = get_translation_towards_enemy(unit)
    return translation_towards_enemy
else:
    return (0, 0)

b. Make sure you have two different game modes:

• "debug mode" - stop the game whenever a player makes an invalid decision, such as moving out of bounds or trying to buy something that they can't afford.

• "competition mode" - if a player makes an invalid decision, ignore it and move on.

When you're building your strategy, you should use "debug mode" to make sure your strategy is doing what you intend for it to do.

But when we run the matchups, we should use "competition mode". This way, as long as everyone's games are effectively the same, we won't have any discrepancies to debug, and we can spend more time moving the needle forward instead of debugging people's strategies (which isn't a great use of time).

• Even if the strategies make invalid decisions, those decisions will be ignored by everyone's games, which means the games will give the same results.
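The two modes could be sketched like this (resolve_decision and InvalidDecisionError are illustrative names, not the real game API):

```python
# Hypothetical sketch of "debug mode" vs "competition mode".

class InvalidDecisionError(Exception):
    pass

def resolve_decision(decision, is_valid, mode="debug"):
    # is_valid stands in for your game's validity checks
    # (in bounds, affordable purchase, etc.)
    if not is_valid:
        if mode == "debug":
            # stop the game so you can inspect the bad decision
            raise InvalidDecisionError(decision)
        # competition mode: ignore the invalid decision and move on
        return "ignored"
    return "applied"
```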

c. Continue debugging level 3. If you're not able to get the debugging done due to it taking a long time, then come to the next class with some ideas for how to speed up the process.

### Neural Networks¶

Note: Next time we do neural networks, we'll switch back to implementing them in code.

Compute $\dfrac{\partial E}{\partial w_{47}},$ $\dfrac{\partial E}{\partial w_{14}},$ and $\dfrac{\partial E}{\partial w_{01}}$ given the following values:

• $y_\textrm{actual}=1,$

• $a_k=k+11$ and $f'_k(i_k) = k+1$ for all $k,$

• $w_{ab} = a+b$ for all $a,b.$

You should get the following:

\begin{align*} \dfrac{\partial E}{\partial w_{47}} &= 897,600 \\[5pt] \dfrac{\partial E}{\partial w_{14}} &= 156,024,000 \\[5pt] \dfrac{\partial E}{\partial w_{01}} &= 6,925,962,560 \\[5pt] \end{align*}

# Problem 100-2¶

### Titanic Survival Modeling - Line Parser¶

Location: machine-learning/kaggle/titanic/parse_line.py

Write a function parse_line that parses a comma-delimited line into its respective entries. For now, return all the entries as strings.

There are a couple "gotchas" to be aware of:

• If two commas appear in sequence, it means that the entry between them is empty. So, the line "7.25,,S" would be read as three entries, ['7.25', '', 'S'].

• If a comma appears within quotes, then the comma is part of that entry. For example:

• the line "'Braund', 'Mr. Owen Harris', male" would be three entries: ['Braund', '"Mr. Owen Harris"', 'male']

• the line "'Braund, Mr. Owen Harris', male" would be two entries: ["'Braund, Mr. Owen Harris'", "male"]

Here is a template for the recommended implementation:

def parse_line(line):
    entries = []       # will be our final output

    entry_str = ""     # stores the string of the current entry
                       # that we're building up

    inside_quotes = False   # true if we're inside quotes

    quote_symbol = None     # stores the type of quotes we're inside,
                            # i.e. single quotes "'" or
                            # double quotes '"'

    for char in line:
        # if we're at a comma that's not inside quotes,
        # store the current entry string. In other words,
        # append entry_str to our list of entries and reset
        # the value of entry_str

        # otherwise, if we're not at a comma or we're at a
        # comma that's inside quotes, then keep building up
        # the entry string (i.e. append char to entry_str)

        # if the char is a single or double quote, and is equal
        # to the quote symbol or there is no quote symbol,
        # then flip the truth value of inside_quotes and
        # update the quote symbol (the current character when
        # entering quotes, None when leaving them)

    # append the current entry string to entries and return entries

Here are some tests:

>>> line_1 = "1,0,3,'Braund, Mr. Owen Harris',male,22,1,0,A/5 21171,7.25,,S"
>>> parse_line(line_1)
['1', '0', '3', "'Braund, Mr. Owen Harris'", 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S']

>>> line_2 = '102,0,3,"Petroff, Mr. Pastcho (""Pentcho"")",male,,0,0,349215,7.8958,,S'
>>> parse_line(line_2)
['102', '0', '3', '"Petroff, Mr. Pastcho (""Pentcho"")"', 'male', '', '0', '0', '349215', '7.8958', '', 'S']

>>> line_3 = '187,1,3,"O\'Brien, Mrs. Thomas (Johanna ""Hannah"" Godfrey)",female,,1,0,370365,15.5,,Q'
>>> parse_line(line_3)
['187', '1', '3', '"O\'Brien, Mrs. Thomas (Johanna ""Hannah"" Godfrey)"', 'female', '', '1', '0', '370365', '15.5', '', 'Q']
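For reference, here is one possible implementation that follows the commented template and passes the tests above (a sketch; your own version may differ in the details):

```python
def parse_line(line):
    entries = []
    entry_str = ""
    inside_quotes = False
    quote_symbol = None

    for char in line:
        if char == "," and not inside_quotes:
            # comma outside quotes: close off the current entry
            entries.append(entry_str)
            entry_str = ""
        else:
            # keep building up the current entry (quotes included)
            entry_str += char

        if char in ("'", '"') and (quote_symbol is None or char == quote_symbol):
            inside_quotes = not inside_quotes
            quote_symbol = char if inside_quotes else None

    entries.append(entry_str)   # don't forget the last entry
    return entries
```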

### C++¶

Read the following page:

https://www.learncpp.com/cpp-tutorial/dynamic-memory-allocation-with-new-and-delete/

Then, answer the following questions in an overleaf doc:

1. What are the differences between static memory allocation, automatic memory allocation, and dynamic memory allocation?

2. The following statement is false. Correct it.

To dynamically allocate an integer and assign the address to a pointer so we can access it later, we use the syntax int *ptr{ new int };. This tells our program to download some new memory from the internet and store a pointer to the new memory.

3. The following statement is false. Correct it.

The syntax destroy ptr; destroys the dynamically allocated memory that was accessible through ptr. Because it was destroyed, this memory address can no longer be used by the computer in the future.

4. What does a bad_alloc exception mean?

5. What is a null pointer? What makes it different from a normal pointer? What can we use it for, that we can't use a normal pointer for?

6. What is a memory leak, and why are memory leaks bad?

7. Does the following bit of code cause a memory leak? If so, why?

int value = 5;
int *ptr{ new int{} };
ptr = &value;

8. Does the following bit of code cause a memory leak? If so, why?

int value{ 5 };
int *ptr{ new int{} };
delete ptr;
ptr = &value;

9. Does the following bit of code cause a memory leak? If so, why?

int *ptr{ new int{} };
ptr = new int{};

10. Does the following bit of code cause a memory leak? If so, why?

int *ptr{ new int{} };
delete ptr;
ptr = new int{};

# Problem 100-3¶

### Commit¶

• Commit your code to Github.

(You don't have to make or resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Neural net overleaf: _____
repl.it link to parser: _____
C++ overleaf link: _____

commits: _____
(machine-learning, space-empires)

# Problem 99-1¶

If you haven't already, turn in your Titanic prediction writeup on Canvas. It's important.

### Space Empires¶

Meet with each classmate over the weekend to resolve level 3 discrepancies.

Once you've resolved discrepancies with a classmate, check the corresponding box in the spreadsheet.

### Neural Networks¶

Note: We've been using the symbol $\textrm d$ for our derivative, i.e. $\dfrac{\textrm dE}{\textrm dw_{ij}}.$ However, it would be more clear to write this as a partial derivative, since the error $E$ depends on all of our weights (not just one weight). So we will use the convention $\dfrac{\partial E}{\partial w_{ij}}$ going forward.

Your task: Compute $\dfrac{\partial E}{\partial w_{35}},$ $\dfrac{\partial E}{\partial w_{45}},$ $\dfrac{\partial E}{\partial w_{13}},$ $\dfrac{\partial E}{\partial w_{23}},$ $\dfrac{\partial E}{\partial w_{14}},$ $\dfrac{\partial E}{\partial w_{24}},$ $\dfrac{\partial E}{\partial w_{01}},$ and $\dfrac{\partial E}{\partial w_{02}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.

$$\begin{matrix} & n_5 \\ & \nearrow \hspace{1.25cm} \nwarrow \\ n_3 & & n_4 \\ \uparrow & \nwarrow \hspace{1cm} \nearrow & \uparrow \\[-10pt] | & \diagdown \diagup & | \\[-10pt] | & \diagup \diagdown & | \\[-10pt] | & \diagup \hspace{1cm} \diagdown & | \\ n_1 & & n_2\\ & \nwarrow \hspace{1.25cm} \nearrow \\ & n_0 \\ \end{matrix}$$

Show ALL your work! (If some work is the same as what you've already written down for a previous gradient computation, you can just put dot-dot-dot. But if you get stuck, then go back and write down all intermediate steps.) Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$).

Check your answer by substituting the following values:

$$y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \\ a_4 = 6 \\ a_5 = 7 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 8 \\ f_1'(i_1) = 9 \\ f_2'(i_2) = 10 \\ f_3'(i_3) = 11 \\ f_4'(i_4) = 12 \\ f_5'(i_5)=13 \end{matrix} \qquad \begin{matrix} w_{01} = 14 \\ w_{02} = 15 \\ w_{13} = 16 \\ w_{14} = 17 \\ w_{23} = 18 \\ w_{24} = 19 \\ w_{34} = 20 \\ w_{35} = 21 \\ w_{45} = 22 \end{matrix}$$

You should get the following:

\begin{align*} \dfrac{\partial E}{\partial w_{35}} &= 780 \\[5pt] \dfrac{\partial E}{\partial w_{45}} &= 936 \\[5pt] \dfrac{\partial E}{\partial w_{13}} &= 108108 \\[5pt] \dfrac{\partial E}{\partial w_{23}} &= 144144 \\[5pt] \dfrac{\partial E}{\partial w_{14}} &= 123552 \\[5pt] \dfrac{\partial E}{\partial w_{24}} &= 164736 \\[5pt] \dfrac{\partial E}{\partial w_{01}} &= 22980672 \\[5pt] \dfrac{\partial E}{\partial w_{02}} &= 28622880 \end{align*}

# Problem 99-2¶

Note: I was going to have us load the Titanic survival data, but I think we need to talk about the parsing algorithm during class beforehand. So, this will need to wait until next week. Instead, we'll do some C++ and SQL.

### SQL¶

On sqltest.net, create a sql table by copying the following script:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/sql-tables/assignments-2.sql

Then, compute the average assignment score of each student, along with the number of assignments they've completed. List the results from highest average score to lowest average score, and include the full names of the students.

This is what your output should look like:

name            avgScore  numCompleted
Sylvia Sanchez  95.0000   2
Ishmael Smith   91.2500   4
Franklin Walton 90.0000   1
Kinga Shenko    83.3333   3
Harry Ng        72.5000   4

### C++¶

Observe that the following code can be used to increase the entries in an array by some amount, via a helper function:

#include <iostream>

void incrementArray(int arr[], int length, int amt)
{
    for (int i = 0; i < length; i++)
        arr[i] += amt;
}

int main()
{
    int array[] = {10, 20, 30, 40};
    int length = sizeof(array) / sizeof(array[0]);
    int amt = 3;

    incrementArray(array, length, amt);

    for (int i = 0; i < length; i++)
        std::cout << array[i] << " ";

    return 0;
}

--- output ---
13 23 33 43

Write a function dotProduct that computes the dot product of two input arrays. (You'll need to include the length as the input, too.)

#include <iostream>
#include <cassert>

// write dotProduct here

int main()
{
    int array1[] = {1, 2, 3, 4};
    int array2[] = {5, 6, 7, 8};
    int length = sizeof(array1) / sizeof(array1[0]);
    int ans = dotProduct(array1, array2, length);

    std::cout << "Testing...\n";
    assert(ans == 70);
    std::cout << "Success!";

    return 0;
}

# Problem 99-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

(You don't have to resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Neural net overleaf: _____
C++ repl.it link: _____

commits: _____
(assignment-problems, space-empires)

Created issue: _____

# Problem 98-1¶

### Space Empires¶

a. If you haven't already, fix any issues in your level 3 strategy and send it to me. During class, I heard that Colby, George, and Elijah may have things to fix:

• Colby -- can't use tactics in level 3

• George -- only colonies have "turn created"

• Elijah -- can't get "type" from the combat dictionary. Instead, you need to get the player number and unit number from the combat dictionary, and then look up "type" in hidden_game_state_for_combat.

• George mentioned that this is pretty convoluted, and that we should have all this info in combat state, and just pass in the normal hidden game state. I agree! We'll refactor this before moving on to level 4. But to avoid any confusion, let's hold off on refactoring until we finish these level 3 matchups.

b. Once everyone has finalized their strategies, simulate the matchups.

### Titanic Prediction¶

It sounds like a lot of our models were breaking on the Titanic dataset, which caused the previous assignment to require waaaaaay more debugging time than I had budgeted for. My apologies; I didn't intend for it to become a day-long problem.

You can have another 2 days to finish up your writeup. Let's focus on just getting the writeup done, even if the numbers aren't looking right. If some model breaks, and you're not able to fix it after a couple minutes, just move on to the rest of the models, even if the numbers look wrong.

Over the next several weeks, we'll step through the same modeling process more carefully, one step at a time, one model at a time, fixing any errors that arise along the way. Your task right now is just to run your existing models on the dataset and write up what you get, even if the numbers don't look right. In order to plan out the step-by-step approach, I need to know where we stand right now (i.e. what results you're currently getting for the models).

# Problem 98-2¶

### Quiz Corrections¶

Submit corrections for any problem you got wrong. Try to do these corrections without looking at the recording of what we went over in class.

You don't have to explain what you got wrong or why. Just send in the correct results.

### C++¶

Put the answers to these questions in an overleaf doc.

In C++, you can think of strings as arrays of numbers that represent characters.

char myString[]{ "hello world" };
int length = sizeof(myString) / sizeof(myString[0]);
for (int i = 0; i < length; i++) {
    std::cout << myString[i];
}
std::cout << "\n";
std::cout << "the length of this string is " << length;

--- output ---
hello world
the length of this string is 12

Note that the length of the string is always one more than the number of characters (including spaces) in the string. This is because, under the hood, C++ needs to add a "null terminator" to the end of the string so that it knows where the string stops.

So the array contains all the numeric codes of the letters in the string, plus a null terminator at the end (which you don't see when the string is printed out).

Question. Suppose you create an array that contains all the lowercase letters of the English alphabet in alphabetical order. What would the length of this array be? (If your answer is 26, please re-read the paragraphs above.)

Then, answer the following questions:

1. Suppose you use int x{ 5 } to set the variable x to have the value of 5. What is the difference between x and &x?

2. Suppose you want to make a pointer p that points to the memory address of x (from question 1). How do you initialize p?

3. Suppose you have

int v{ 5 };
int* ptr{ &v };

Without using the symbol v, what notation can you use to get the value of v? (Hint: get the value stored at the memory address of v)

4. Suppose you declare a pointer of type int. Can you use it to point to the memory address of a variable of type char?

# Problem 98-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

(You don't have to resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Modeling writeup: _____
Overleaf (C++ answers): _____

commit: _____
(machine-learning)

Created issue: _____

# Problem 97-1¶

This problem is the beginning of some more involved machine learning tasks. To ease the transition, this will be the only problem on this assignment.

This problem is just as important as space empires and neural nets, and the modeling techniques covered will 100% be on future quizzes and the final. Be sure to do this problem well. If you've run into any issues with your space empires simulations, DO THIS PROBLEM FIRST before you go back to space empires.

Make an account on Kaggle.com so that we can walk through a Titanic prediction task.

• Go to https://www.kaggle.com/c/titanic/data, scroll down to the bottom, and click "download all". You'll get a zip file called titanic.zip.

• Upload titanic.zip into machine-learning/datasets/titanic/. Then, run unzip machine-learning/datasets/titanic/titanic.zip in the command line to unzip the file.

• This gives us 3 files: train.csv, test.csv, and gender_submission.csv. The file train.csv contains data about a bunch of passengers along with whether or not they survived. Our goal is to use train.csv to build a model that will predict the outcome of passengers in test.csv (for which the survival data is not given).

• IMPORTANT: To prevent confusion, rename train.csv to dataset_of_knowns.csv, rename test.csv to unknowns_to_predict.csv, and rename gender_submission.csv to predictions_from_gender_model.csv.

• The file predictions_from_gender_model.csv is an example of predictions from a really, really basic model: if the passenger is female, predict that they survived; if the passenger is male, predict that they did not survive.

To build a model, we will proceed with the following steps:

• Feature Selection - deciding which variables we want in our model. This is usually a subset of the original features.

• Model Selection - ranking our models from best to worst, based on cross-validation performance. (We'll train each model on half the data, use it to predict the other half of the data, and see how accurate it is.)

• Submission - taking our best model, training it on the full dataset_of_knowns.csv, running it on unknowns_to_predict.csv, generating a predictions.csv file, and uploading it to Kaggle.com for scoring.

For this problem, you will need to write what you did for each of these steps in an Overleaf doc (kind of like you would in a lab journal). So, open up one now and let's continue.

### Feature Selection¶

In your Overleaf doc, create a section called "Feature Selection". Make a bulleted list of all the features along with your justification for using or not using the feature in your model.

Important: There is a data dictionary at https://www.kaggle.com/c/titanic/data that describes what each feature means.

It will be helpful to look at the actual values of the variables as well. For example, here are the first 5 records in the dataset:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S

For every feature that you decide to keep, give a possible theory for how the feature may help predict whether a passenger survived.

• For example, you should keep the Age feature because it's likely that younger passengers were given priority when boarding lifeboats.

For every feature that you decide to remove, explain why it 1) is irrelevant to the prediction task, or 2) would take too long to transform into a worthwhile feature.

• For example, you can remove the ticket feature because it's formatted so weirdly (e.g. A/5 21171). It's possible that there may be some information here, but it would take a while to figure out how to turn this into a worthwhile feature that we could actually plug into our model. (There's multiple parts to the ticket number and it's a combination of letters and numbers, so it's not straightforward how to use it.)

### Model Selection¶

Split dataset_of_knowns.csv in half, selecting every other row for training and leaving the leftover rows for testing.
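The every-other-row split can be sketched on stand-in data (the list of integers below is a placeholder for the actual rows):

```python
# Every-other-row split: even-index rows for training, odd for testing.
rows = list(range(10))   # stand-in for the rows of dataset_of_knowns.csv

training_rows = rows[::2]    # rows 0, 2, 4, ... -> training
testing_rows = rows[1::2]    # rows 1, 3, 5, ... -> testing
```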

Fit several models to the training dataset:

• Linear regressor - if output is greater than or equal to 0.5, predict category 1 (survived); if output is less than 0.5, predict category 0 (didn't survive). You may wish to include interaction terms that you think would be important, since linear regressors do not capture interactions by default.

• Logistic regressor - same notes as above (for the linear regressor)

• Gini decision tree - conveniently, decision trees predict categorical variables by default ("survived" has 2 categories, 0 and 1), and they also capture interactions by default (so don't include any interaction terms when you feed the data into the Gini tree). We'll try out 2 different models, one with max_depth=5 and another with max_depth=10.

• Random forest - same notes as above (for the Gini decision tree). We'll try out 2 different models, one with max_depth=3 and num_trees=1000, and another with max_depth=5 and num_trees=1000.

• Naive Bayes - note that you'll need to take any features that are quantitative and re-label their values by categories. By default, you can just use 2 categories: "low" and "high", where a value is labeled as "low" if it's less than or equal to the mean and "high" if it's above the mean.

• k-Nearest Neighbors - for any variables that are quantitative, transform them as $$x \to \dfrac{x - \min(x)}{\max(x) - \min(x)}$$ so that they fit into the interval $[0,1]$. For any variables that are categorical, leave them be. Use a "Manhattan" distance metric (the sum of absolute differences). Note that if a variable is categorical, then the distance between 2 values should be counted as $0$ if they are the same and $1$ if they are different. We'll try 2 different models, k=5 and k=10.

• The reason for the Manhattan distance metric instead of the Euclidean distance metric is so that differences between categorical variables do not drastically overpower differences between quantitative variables.

• For example, suppose you had two data points (0.2, 0.7, "dog", "red") and (0.5, 0.1, "cat", "red"). Then the distance would be as follows:

distance
= |0.2-0.5| + |0.7-0.1| + int("dog"!="cat") + int("red"!="red")
= 0.3 + 0.6 + 1 + 0
= 1.9
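The mixed Manhattan distance worked out above can be sketched as a small helper (manhattan_distance is an illustrative name, and treating "string-valued means categorical" is a simplifying assumption):

```python
# Mixed Manhattan distance: absolute differences for quantitative
# features, 0/1 mismatch for categorical (string-valued) features.

def manhattan_distance(p, q):
    total = 0
    for a, b in zip(p, q):
        if isinstance(a, str) or isinstance(b, str):
            total += int(a != b)   # categorical: 0 if equal, 1 if not
        else:
            total += abs(a - b)    # quantitative: absolute difference
    return total

d = manhattan_distance((0.2, 0.7, "dog", "red"), (0.5, 0.1, "cat", "red"))
# 0.3 + 0.6 + 1 + 0, i.e. 1.9 up to floating-point rounding
```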

Then, use these models to predict survival for the training dataset and the testing dataset separately. Make a table in your Overleaf doc that contains the resulting accuracy rates.

• For example, suppose there are 100 rows in the training dataset and 100 rows in the testing dataset (to be clear, the actual number of rows in your dataset will probably be different). You train your Gini decision tree with max_depth=10 on the training dataset, and then use it to predict on the testing dataset. You get 98 correct predictions on the training dataset and 70 correct predictions on the testing dataset (which, by the way, is an indication that you're overfitting -- your max_depth is probably too high). Then your table looks like this:
Model         | Training Accuracy | Testing Accuracy
----------------------------------------------------------
Gini depth 10 | 98%               | 70%

To be clear, your table should have 9 rows, one for each model: linear, logistic, Gini depth 5, Gini depth 10, random forest depth 3, random forest depth 5, naive Bayes, 5-nearest-neighbors, 10-nearest-neighbors.

### Submission¶

Take your best model (i.e. the one with the highest testing accuracy), re-train it on the entire dataset_of_knowns.csv, and evaluate its predictions on unknowns_to_predict.csv.

Save your results as predictions.csv, and make sure they follow the exact same format as predictions_from_gender_model.csv.

• By "the exact same format", I mean THE EXACT SAME FORMAT. Make sure the header is exactly the same. Make sure that you're writing the 0's and 1s as integers, not strings. Make sure that you include the PassengerId column. Make sure that the values in the PassengerId column match up exactly with those in predictions_from_gender_model.csv. The only thing that should be different is the values in the Survived column.

Click on the "Submit Predictions" button on the right side of the screen and submit your file predictions.csv. You should get a screen that looks like the image below, but has your predictions.csv instead of gender-submissions.csv. You should hopefully get a higher score than 0.76555 (which is the baseline accuracy of the "all women survive" model).

Take a screenshot of this screen, post it on #machine-learning, and include it in your Overleaf writeup.

### What to Turn In¶

Just the Overleaf writeup and a commit link to your machine-learning repo. That's it.

# Problem 96-1¶

### Space Empires¶

Once your strategy is finalized, Slack it to me and I'll upload it here.

https://github.com/eurisko-us/eurisko-us.github.io/tree/master/files/strategies/cohort-1/level-3

Then, once everyone's strategies are submitted, I'll make an announcement, and you can download the strategies from the above folder and run all pairwise battles for 100 games.

• Go through max_turns=100 before declaring a draw. I think this should run quickly enough, since we decreased from 500 games to 100 games, but if any 100-game matchups are taking longer than a couple minutes to run, then post about it and we'll figure something out.

• Put your data in the spreadsheet:

• Remember to switch the order of the players halfway through the simulation so that each player goes first an equal number of times.

• Seed the games: game 1 has seed 1, game 2 has seed 2, and so on. This way, we should all get exactly the same results.
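The seeding scheme above can be sketched like this (play_one_game is a stand-in for your actual game loop):

```python
import random

# Game k always uses seed k, so every player reproduces the same games.

def play_one_game():
    # stand-in: pretend the game outcome is one random draw
    return random.random()

def run_games(num_games=100):
    results = []
    for game_number in range(1, num_games + 1):
        random.seed(game_number)   # game 1 -> seed 1, game 2 -> seed 2, ...
        results.append(play_one_game())
    return results

assert run_games() == run_games()  # same seeds, same results every time
```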

As usual, there will be prizes:

• 1st place: 50 pts extra credit in the assignments category
• 2nd place: 30 pts extra credit in the assignments category
• 3rd place: 10 pts extra credit in the assignments category

### Neural Nets¶

Compute $\dfrac{\textrm dE}{\textrm dw_{34}},$ $\dfrac{\textrm dE}{\textrm dw_{24}},$ $\dfrac{\textrm dE}{\textrm dw_{13}},$ $\dfrac{\textrm dE}{\textrm dw_{12}},$ and $\dfrac{\textrm dE}{\textrm dw_{01}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.

$$\begin{matrix} & & n_4 \\ & \nearrow & & \nwarrow \\ n_2 & & & & n_3 \\ & \nwarrow & & \nearrow \\ & & n_1 \\ & & \uparrow \\ & & n_0 \\ \end{matrix}$$

Show ALL your work! Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$)

Check your answer by substituting the following values:

$$y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \\ a_4 = 6 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 7 \\ f_1'(i_1) = 8 \\ f_2'(i_2) = 9 \\ f_3'(i_3) = 10 \\ f_4'(i_4) = 11 \end{matrix} \qquad \begin{matrix} w_{01} = 12 \\ w_{12} = 13 \\ w_{13} = 14 \\ w_{24} = 15 \\ w_{34} = 16 \end{matrix}$$

You should get $$\dfrac{\textrm dE}{\textrm d w_{34}} = 550, \qquad \dfrac{\textrm dE}{\textrm d w_{24}} = 440, \qquad \dfrac{\textrm dE}{\textrm d w_{13}} = 52800, \qquad \dfrac{\textrm dE}{\textrm d w_{12}} = 44550, \qquad \dfrac{\textrm dE}{\textrm d w_{01}} = 7031200.$$

# Problem 96-2¶

### Haskell¶

Write a recursive function merge that merges two sorted lists. To do this, you can compare the first elements of each list, make the lesser one the next element, and then merge the lists that remain.

merge (x:xs) (y:ys) = if x < y
    then _______
    else _______
merge [] xs = ____
merge xs [] = ____

main = print (merge [1,2,5,8] [3,4,6,7,10])
-- should print [1,2,3,4,5,6,7,8,10]

### SQL¶

On sqltest.net, create a sql table by copying the following script:

https://raw.githubusercontent.com/eurisko-us/eurisko-us.github.io/master/files/sql-tables/assignments-1.sql

Then, compute the average assignment score of each student. List the results from highest to lowest, along with the full names of the students.

This is what your output should look like:

fullname    avgScore
Ishmael Smith   90.0000
Sylvia Sanchez  86.6667
Kinga Shenko    85.0000
Franklin Walton 80.0000
Harry Ng    78.3333

Hint: You'll have to use a join and a group by.

# Problem 96-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

(You don't have to resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

neural nets overleaf: _____
put your game results in the spreadsheet (but you don't have to paste the link)

commits: _____
(space-empires, assignment-problems)

Created issue: _____

# Problem 95-1¶

### Neural Nets¶

Notation

• $n_k$ - the $k$th neuron

• $a_k$ - the activity of the $k$th neuron

• $i_k$ - the input to the $k$th neuron. This is the weighted sum of activities of the parents of $n_k.$ If $n_k$ has no parents, then $i_k$ comes from the data directly.

• $f_k$ - the activation function of the $k$th neuron. Note that in general, we have $a_k = f_k(i_k)$

• $w_{k \ell}$ - the weight of the connection $n_k \to n_\ell.$ In your code, this is weights[(k,l)].

• $E = (y_\textrm{predicted} - y_\textrm{actual})^2$ is the squared error that results from using the neural net to predict the value of the dependent variable, given values of the independent variables

• $w_{k \ell} \to w_{k \ell} - \alpha \dfrac{\textrm dE}{\textrm dw_{k\ell}}$ is the gradient descent update, where $\alpha$ is the learning rate

Example

For a simple network $$\begin{matrix} & & n_2 \\ & \nearrow & & \nwarrow \\ n_0 & & & & n_1,\end{matrix}$$ we have:

\begin{align*} y_\textrm{predicted} &= a_2 \\ &= f_2(i_2) \\ &= f_2(w_{02} a_0 + w_{12} a_1) \\ &= f_2(w_{02} f_0(i_0) + w_{12} f_1(i_1) ) \\ \\ \dfrac{\textrm dE}{\textrm dw_{02}} &= \dfrac{\textrm d}{\textrm dw_{02}} \left[ (y_\textrm{predicted} - y_\textrm{actual})^2 \right] \\ &= \dfrac{\textrm d}{\textrm dw_{02}} \left[ (a_2 - y_\textrm{actual})^2 \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d}{\textrm dw_{02}} \left[ a_2 - y_\textrm{actual} \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d }{\textrm dw_{02}} \left[ a_2 \right] \\ &= 2(a_2 - y_\textrm{actual}) \dfrac{\textrm d }{\textrm dw_{02}} \left[ f_2(i_2) \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) \dfrac{\textrm d }{\textrm dw_{02}} \left[ i_2 \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) \dfrac{\textrm d }{\textrm dw_{02}} \left[ w_{02} a_0 + w_{12} a_1 \right] \\ &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) a_0 \\ \\ \dfrac{\textrm dE}{\textrm dw_{12}} &= 2(a_2 - y_\textrm{actual}) f_2'(i_2) a_1 \end{align*}

THE ACTUAL PROBLEM STATEMENT

Compute $\dfrac{\textrm dE}{\textrm dw_{23}},$ $\dfrac{\textrm dE}{\textrm dw_{12}},$ and $\dfrac{\textrm dE}{\textrm dw_{01}}$ for the following network. (It's easiest to do it in that order.) Put your work in an Overleaf doc.

$$\begin{matrix} n_3 \\ \uparrow \\ n_2 \\ \uparrow \\ n_1 \\ \uparrow \\ n_0 \end{matrix}$$

Show ALL your work! Also, make sure to use the simplest notation possible (for example, instead of writing $f_k(i_k),$ write $a_k$)

Check your answer by substituting the following values:

$$y_\textrm{actual}=1 \qquad \begin{matrix} a_0 = 2 \\ a_1 = 3 \\ a_2 = 4 \\ a_3 = 5 \end{matrix} \qquad \begin{matrix} f_0'(i_0) = 6 \\ f_1'(i_1) = 7 \\ f_2'(i_2) = 8 \\ f_3'(i_3) = 9 \end{matrix} \qquad \begin{matrix} w_{01} = 10 \\ w_{12} = 11 \\ w_{23} = 12 \end{matrix}$$

You should get $$\dfrac{\textrm dE}{\textrm d w_{23}} = 288, \qquad \dfrac{\textrm dE}{\textrm d w_{12}} = 20736, \qquad \dfrac{\textrm dE}{\textrm d w_{01}} = 1064448.$$

Note: On the next couple assignments, we'll do the same exercise with progressively more advanced networks. This problem is relatively simple so that you have a chance to get used to working with the notation.
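As with the other derivative exercises, you can sanity-check the stated answers numerically. The sketch below assumes $E = (a_3 - y_\textrm{actual})^2$ and plugs the given values into the chain rule along the single path $n_0 \to n_1 \to n_2 \to n_3$:

```python
# Numeric check of the stated answers for the chain network.
y = 1
a = {0: 2, 1: 3, 2: 4, 3: 5}          # activities a_k
fp = {0: 6, 1: 7, 2: 8, 3: 9}         # derivatives f_k'(i_k)
w = {(0,1): 10, (1,2): 11, (2,3): 12}

dE_da3 = 2 * (a[3] - y)
dE_dw23 = dE_da3 * fp[3] * a[2]                                        # 288
dE_dw12 = dE_da3 * fp[3] * w[(2,3)] * fp[2] * a[1]                     # 20736
dE_dw01 = dE_da3 * fp[3] * w[(2,3)] * fp[2] * w[(1,2)] * fp[1] * a[0]  # 1064448
```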

### Space Empires¶

Finish creating your game level 3 strategy. (See problem 93-1 for a description of game level 3, which you should have implemented by now.) Then, implement the following strategy and run it against your level 3 strategy:

• NumbersBerserkerLevel3 - always buys as many scouts as possible, and each time it buys a scout, immediately sends it on a direct route to attack the opponent.

Post on #machine-learning with your strategy's stats against these strategies:

MyStrategy vs NumbersBerserker
- MyStrategy win rate: __%
- MyStrategy loss rate: __%
- draw rate: __%

On the next assignment, we'll have the official matchups.

# Problem 95-2¶

### C++¶

Write a function calcSum(m,n) that computes the sum of the matrix product of an ascending $m \times n$ and a descending $n \times m$ array, where the array entries are taken from $\{ 1, 2, ..., mn \}.$ For example, if $m=2$ and $n=3,$ then

\begin{align*} \textrm{ascending} &= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \\[3pt] \textrm{descending} &= \begin{bmatrix} 6 & 5 \\ 4 & 3 \\ 2 & 1 \end{bmatrix} \\[3pt] (\textrm{ascending})(\textrm{descending}) &= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 6 & 5 \\ 4 & 3 \\ 2 & 1 \end{bmatrix} \\[3pt] &= \begin{bmatrix} 20 & 14 \\ 56 & 41 \end{bmatrix} \\[3pt] \textrm{sum} \Big( (\textrm{ascending})(\textrm{descending}) \Big) &= 131 \end{align*}
#include <iostream>
#include <cassert>

// define calcSum

int main() {
    // write an assert for the test case m=2, n=3
}
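Before writing the C++, it can help to sanity-check the $m=2,$ $n=3$ example. Here is one way to sketch the same computation in Python (assuming row-major fill of the entries, as in the example above):

```python
def calc_sum(m, n):
    # Ascending m x n matrix filled row-major with 1..mn
    asc = [[r * n + c + 1 for c in range(n)] for r in range(m)]
    # Descending n x m matrix filled row-major with mn..1
    desc = [[m * n - (r * m + c) for c in range(m)] for r in range(n)]
    # Sum of all entries of the m x m matrix product
    return sum(
        sum(asc[i][k] * desc[k][j] for k in range(n))
        for i in range(m) for j in range(m)
    )
```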

### SQL¶

On sqltest.net, create the following tables:

CREATE TABLE age (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
lastname VARCHAR(30),
age VARCHAR(30)
);

INSERT INTO age (id, lastname, age)
VALUES ('1', 'Walton', '12');

INSERT INTO age (id, lastname, age)
VALUES ('2', 'Sanchez', '13');

INSERT INTO age (id, lastname, age)
VALUES ('3', 'Ng', '14');

INSERT INTO age (id, lastname, age)
VALUES ('4', 'Smith', '15');

INSERT INTO age (id, lastname, age)
VALUES ('5', 'Shenko', '16');

CREATE TABLE name (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
firstname VARCHAR(30),
lastname VARCHAR(30)
);

INSERT INTO name (id, firstname, lastname)
VALUES ('1', 'Franklin', 'Walton');

INSERT INTO name (id, firstname, lastname)
VALUES ('2', 'Sylvia', 'Sanchez');

INSERT INTO name (id, firstname, lastname)
VALUES ('3', 'Harry', 'Ng');

INSERT INTO name (id, firstname, lastname)
VALUES ('4', 'Ishmael', 'Smith');

INSERT INTO name (id, firstname, lastname)
VALUES ('5', 'Kinga', 'Shenko');

Then, write a query to get the full names of the people, along with their ages, in alphabetical order of last name. The output should look like this:

Harry Ng is 14.
Sylvia Sanchez is 13.
Kinga Shenko is 16.
Ishmael Smith is 15.
Franklin Walton is 12.

Tip: You'll need to use string concatenation and a join.

# Problem 95-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

(You don't have to resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Overleaf: _____
Repl.it link to C++ file: _____

assignment-problems commit: _____
space-empires commit: _____

Created issue: _____

# Problem 94-1¶

### Space Empires¶

Reconcile remaining discrepancies in game level 2 so we can crown the winners:

Then, write your first custom strategy for the level 3 game. We'll start matchups on Wednesday. We'll go through several rounds of matchups on this level since the game is starting to become more rich.

(We'll have the same extra credit prizes for 1st / 2nd / 3rd place)

Note: In decide_which_unit_to_attack, be sure to use 'player' and 'unit' instead of 'player_index' and 'unit_index'.

# combat_state is a dictionary in the form coordinates : combat_order
# {
#    (1,2): [{'player': 1, 'unit': 0},
#            {'player': 0, 'unit': 1},
#            {'player': 1, 'unit': 1},
#            {'player': 1, 'unit': 2}],
#    (2,2): [{'player': 2, 'unit': 0},
#            {'player': 3, 'unit': 1},
#            {'player': 2, 'unit': 1},
#            {'player': 2, 'unit': 2}]
# }

### Neural Net-Based Logistic Regressor¶

Make sure you get this problem done completely. Neural nets have a very steep learning curve and they're going to be sticking with us until the end of the semester.

a. Given $\sigma(x) = \dfrac{1}{1+e^{-x}},$ prove that $\sigma'(x) = \sigma(x) (1-\sigma(x)).$ Write this proof in an Overleaf doc.
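Once you've written the proof, a quick numerical spot-check with finite differences can catch algebra mistakes (this sketch is just a check, not part of the proof):

```python
import math

def sigma(x):
    return 1 / (1 + math.exp(-x))

# Compare the claimed derivative sigma(x) * (1 - sigma(x)) against a
# central finite difference at a few sample points.
h = 1e-6
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    analytic = sigma(x) * (1 - sigma(x))
    numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
    assert abs(analytic - numeric) < 1e-6
```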

b. In neural networks, neurons are often given "activation functions", where

node.activity = node.activation_function(weighted sum of inputs to node)

In this problem, you'll extend your neural net to include activation functions. Then, you'll equip the neurons with activations so as to implement a logistic regressor.

>>> weights = {(0,2): -0.1, (1,2): 0.5}

>>> def linear_function(x):
        return x
>>> def linear_derivative(x):
        return 1
>>> def sigmoidal_function(x):
        return 1/(1+math.exp(-x))
>>> def sigmoidal_derivative(x):
        s = sigmoidal_function(x)
        return s * (1 - s)

>>> activation_types = ['linear', 'linear', 'sigmoidal']
>>> activation_functions = {
        'linear': {
            'function': linear_function,
            'derivative': linear_derivative
        },
        'sigmoidal': {
            'function': sigmoidal_function,
            'derivative': sigmoidal_derivative
        }
    }

>>> nn = NeuralNetwork(weights, activation_types, activation_functions)

>>> data_points = [
        {'input': [1,0], 'output': [0.1]},
        {'input': [1,1], 'output': [0.2]},
        {'input': [1,2], 'output': [0.4]},
        {'input': [1,3], 'output': [0.7]}
    ]
>>> for i in range(1,10001):
        err = 0
        for data_point in data_points:
            nn.update_weights(data_point)
            err += nn.calc_squared_error(data_point)
        if i < 5 or i % 1000 == 0:
            print('iteration {}'.format(i))
            print('    updated weights: {}'.format(nn.weights))
            print('    error: {}'.format(err))
            print()

iteration 1
gradient: {(0, 2): 0.03184692266577955, (1, 2): 0.09554076799733865}
updated weights: {(0, 2): -0.10537885784041535, (1, 2): 0.4945789883636697}
error: 0.40480006957774683

iteration 2
gradient: {(0, 2): 0.031126202300065627, (1, 2): 0.09337860690019688}
updated weights: {(0, 2): -0.11072951375555531, (1, 2): 0.48919868238711295}
error: 0.3989945995186133

iteration 3
gradient: {(0, 2): 0.030367826123201307, (1, 2): 0.09110347836960392}
updated weights: {(0, 2): -0.11605116651884796, (1, 2): 0.4838609744178689}
error: 0.3932640005281893

iteration 4
gradient: {(0, 2): 0.029572207383720784, (1, 2): 0.08871662215116236}
updated weights: {(0, 2): -0.12134303561025003, (1, 2): 0.4785677220228999}
error: 0.3876106111541695

iteration 1000
gradient: {(0, 2): -0.04248103992359947, (1, 2): -0.12744311977079842}
updated weights: {(0, 2): -1.441870816044744, (1, 2): 0.6320712307086241}
error: 0.03103391055967604

iteration 2000
gradient: {(0, 2): -0.026576913835657988, (1, 2): -0.07973074150697396}
updated weights: {(0, 2): -1.8462575194764488, (1, 2): 0.8112377281576201}
error: 0.010469324799663702

iteration 3000
gradient: {(0, 2): -0.019389915442213898, (1, 2): -0.058169746326641694}
updated weights: {(0, 2): -2.0580006793189596, (1, 2): 0.903267622168482}
error: 0.004993174823452696

iteration 4000
gradient: {(0, 2): -0.01536481706566838, (1, 2): -0.04609445119700514}
updated weights: {(0, 2): -2.187017035077964, (1, 2): 0.9588032475551099}
error: 0.002982405174006053

iteration 5000
gradient: {(0, 2): -0.012858896793162088, (1, 2): -0.038576690379486266}
updated weights: {(0, 2): -2.2717393677429842, (1, 2): 0.995065996436664}
error: 0.00211991513136444

iteration 6000
gradient: {(0, 2): -0.011201146193726709, (1, 2): -0.033603438581180124}
updated weights: {(0, 2): -2.3298248394321606, (1, 2): 1.0198377357361068}
error: 0.0017156674543843792

iteration 7000
gradient: {(0, 2): -0.010062009597155228, (1, 2): -0.030186028791465685}
updated weights: {(0, 2): -2.370740520022862, (1, 2): 1.037244660012689}
error: 0.0015153961429219282

iteration 8000
gradient: {(0, 2): -0.009259319779522148, (1, 2): -0.027777959338566444}
updated weights: {(0, 2): -2.400083365137227, (1, 2): 1.0497070597284772}
error: 0.0014124679719747604

iteration 9000
gradient: {(0, 2): -0.008683873946383038, (1, 2): -0.026051621839149115}
updated weights: {(0, 2): -2.4213875864199608, (1, 2): 1.058744505427183}
error: 0.0013582149901490035

iteration 10000
gradient: {(0, 2): -0.00826631063707707, (1, 2): -0.024798931911231212}
updated weights: {(0, 2): -2.4369901278483534, (1, 2): 1.065357551487286}
error: 0.001329102258719855

>>> nn.weights
should be close to
{(0,2): -2.44, (1,2): 1.07}

because the data points all lie approximately on the sigmoid
output = 1/(1 + e^(-(input[0] * -2.44 + input[1] * 1.07)) )

Super Important: You'll have to update your gradient descent to account for the activation functions. This will require using the chain rule. In our case, we'll have

squared_error = (y_predicted - y_actual)^2

d(squared_error)/d(weights)
= 2 (y_predicted - y_actual) d(y_predicted - y_actual)/d(weights)
= 2 (y_predicted - y_actual) [ d(y_predicted)/d(weights) - 0]
= 2 (y_predicted - y_actual) d(y_predicted)/d(weights)

y_predicted
= nodes[2].activity
= nodes[2].activation_function(nodes[2].input)
= nodes[2].activation_function(
      weights[(0,2)] * nodes[0].activity
      + weights[(1,2)] * nodes[1].activity
  )
= nodes[2].activation_function(
      weights[(0,2)] * nodes[0].activation_function(nodes[0].input)
      + weights[(1,2)] * nodes[1].activation_function(nodes[1].input)
  )

d(y_predicted)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
  * d(nodes[2].input)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
  * d(weights[(0,2)] * nodes[0].activity + weights[(1,2)] * nodes[1].activity)/d(weights[(0,2)])
= nodes[2].activation_derivative(nodes[2].input)
  * nodes[0].activity

by the same reasoning as above:

d(y_predicted)/d(weights[(1,2)])
= nodes[2].activation_derivative(nodes[2].input)
  * nodes[1].activity

Note: If no activation_functions variable is passed in, then assume all activation functions are linear.
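If you want to verify your chain-rule gradient before wiring it into the class, you can compare the analytic formula against a finite-difference estimate on a standalone version of this two-input network. The names below are illustrative, not the class's API, and the input nodes are assumed linear (so their activity equals the raw input):

```python
import math

def sigmoidal(x):
    return 1 / (1 + math.exp(-x))

def sigmoidal_derivative(x):
    s = sigmoidal(x)
    return s * (1 - s)

def predict(weights, inputs):
    # Weighted sum into the output node, then the sigmoidal activation.
    i2 = weights[(0,2)] * inputs[0] + weights[(1,2)] * inputs[1]
    return sigmoidal(i2)

def analytic_gradient(weights, inputs, y_actual):
    # d(squared_error)/d(w) = 2 (y_pred - y_actual) * f'(i2) * input
    i2 = weights[(0,2)] * inputs[0] + weights[(1,2)] * inputs[1]
    common = 2 * (sigmoidal(i2) - y_actual) * sigmoidal_derivative(i2)
    return {(0,2): common * inputs[0], (1,2): common * inputs[1]}

weights = {(0,2): -0.1, (1,2): 0.5}
inputs, y = [1, 3], 0.7
grad = analytic_gradient(weights, inputs, y)

# Finite-difference check: bump each weight by h and compare slopes.
h = 1e-6
for key in weights:
    bumped = dict(weights)
    bumped[key] += h
    numeric = ((predict(bumped, inputs) - y)**2
               - (predict(weights, inputs) - y)**2) / h
    assert abs(grad[key] - numeric) < 1e-4
```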

# Problem 94-2¶

### HashTable¶

Write a class HashTable that generalizes the hash table you previously wrote. This class should store an array of buckets, and the hash function should add up the alphabet indices of the input string and mod the result by the number of buckets.

>>> ht = HashTable(num_buckets = 3)
>>> ht.buckets
[[], [], []]
>>> ht.hash_function('cabbage')
2    (because 2+0+1+1+0+6+4 mod 3 = 14 mod 3 = 2)

>>> ht.insert('cabbage', 5)
>>> ht.buckets
[[], [], [('cabbage',5)]]

>>> ht.insert('cab', 20)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5)]]

>>> ht.insert('c', 17)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5), ('c',17)]]

>>> ht.insert('ac', 21)
>>> ht.buckets
[[('cab', 20)], [], [('cabbage',5), ('c',17), ('ac', 21)]]

>>> ht.find('cabbage')
5
>>> ht.find('cab')
20
>>> ht.find('c')
17
>>> ht.find('ac')
21
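If you want to check your hash function against the example above, here is a minimal standalone sketch of just the hashing step (assuming lowercase alphabetic keys, as in the examples):

```python
def hash_function(string, num_buckets):
    # Sum the alphabet indices ('a' = 0, 'b' = 1, ..., 'z' = 25),
    # then mod by the number of buckets.
    return sum(ord(ch) - ord('a') for ch in string) % num_buckets
```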

### SQL¶

This is a really quick problem, mostly just getting you to learn the ropes of the process we'll be using for doing SQL problems going forward (now that we're done with SQL Zoo).

On https://sqltest.net/, create a table with the following script:

CREATE TABLE people (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(30) NOT NULL,
age VARCHAR(50)
);

INSERT INTO people (id, name, age)
VALUES ('1', 'Franklin', '12');

INSERT INTO people (id, name, age)
VALUES ('2', 'Sylvia', '13');

INSERT INTO people (id, name, age)
VALUES ('3', 'Harry', '14');

INSERT INTO people (id, name, age)
VALUES ('4', 'Ishmael', '15');

INSERT INTO people (id, name, age)
VALUES ('5', 'Kinga', '16');

Then select all teenage people whose names do not start with a vowel, and order by oldest first.

In order to run the query, you need to click the "Select Database" dropdown in the very top-right corner (so top-right that it might partially run off your screen) and select MySQL 5.6.

This is what your result should be:

id  name    age
5   Kinga   16
3   Harry   14
2   Sylvia  13

Copy the link where it says "Link for sharing your example:". This is what you'll submit for your assignment.

# Problem 94-3¶

There will be a quiz on Friday over things that we've done with C++, Haskell, SQL, and Neural Nets.

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

(You don't have to resolve any issues on this assignment)

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to custom level 3 strategy: ____
Overleaf link to proof of derivative of sigmoid: ____
Repl.it link to neural network: ____
Repl.it link to hash table: ____

Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Commit link for machine-learning repo: _____

Created issue: _____

# Problem 93-1¶

### Space Empires¶

Reconcile highlighted discrepancies

Implement game level 3

• Regular (repeated) economic phases -- once every turn

• Change the starting CP back to 0 (now that we have repeated economic phases, we no longer need the extra CP boost at the beginning).

• 3 movement rounds on each turn

• 7x7 board - starting positions are now (3,0) and (3,6)

Since we had to postpone the neural net problem, you can use the extra time to begin implementing your custom player for the level 3 game (we'll have the level 3 battles soon).

# Problem 93-2¶

### Hash Tables¶

Location: assignment-problems/hash_table.py

Under the hood, Python dictionaries are hash tables.

The most elementary (and inefficient) version of a hash table would be a list of tuples. For example, if we wanted to implement the dictionary {'a': [0,1], 'b': 'abcd', 'c': 3.14}, then we'd have the following:

list_of_tuples = [('a', [0,1]), ('b', 'abcd'), ('c', 3.14)]

To add a new key-value pair to the dictionary, we'd just append the corresponding tuple to list_of_tuples, and to look up the value for some key, we'd just loop through list_of_tuples until we got to the tuple with the key we wanted (and return the value).
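For instance, a naive lookup over this list of tuples might look like the following sketch:

```python
list_of_tuples = [('a', [0, 1]), ('b', 'abcd'), ('c', 3.14)]

def naive_find(pairs, key):
    # Linear scan: check each (key, value) tuple until the key matches.
    for k, v in pairs:
        if k == key:
            return v
```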

But searching through a long array is very slow. So, to be more efficient, we use several list_of_tuples (which we'll call "buckets"), and we use a hash_function to tell us which bucket to put the new key-value pair in.

Complete the code below to implement a special case of an elementary hash table. We'll expand on this example soon, but let's start with something simple.

array = [[], [], [], [], []] # has 5 empty "buckets"

def hash_function(string):
    # return the sum of character indices in the string
    # (where "a" has index 0, "b" has index 1, ..., "z" has index 25)
    # modulo 5

    # for now, let's just assume the string consists of lowercase
    # letters with no other characters or spaces

def insert(array, key, value):
    # apply the hash function to the key to get the bucket index.
    # then append the (key, value) pair to the bucket.

def find(array, key):
    # apply the hash function to the key to get the bucket index.
    # then loop through the bucket until you get to the tuple with the desired key,
    # and return the corresponding value.

Here's an example of how the hash table will work:

>>> print(array)
array = [[], [], [], [], []]

>>> insert(array, 'a', [0,1])
>>> insert(array, 'b', 'abcd')
>>> insert(array, 'c', 3.14)
>>> print(array)
[[('a',[0,1])], [('b','abcd')], [('c',3.14)], [], []]

>>> insert(array, 'd', 0)
>>> insert(array, 'e', 0)
>>> insert(array, 'f', 0)
>>> print(array)
[[('a',[0,1]), ('f',0)], [('b','abcd')], [('c',3.14)], [('d',0)], [('e',0)]]

Test your code as follows:

alphabet = 'abcdefghijklmnopqrstuvwxyz'
for i, char in enumerate(alphabet):
key = 'someletters'+char
value = [i, i**2, i**3]
insert(array, key, value)

for i, char in enumerate(alphabet):
key = 'someletters'+char
output_value = find(array, key)
desired_value = [i, i**2, i**3]
assert output_value == desired_value

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-3/problem

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/Using_Null (queries 7, 8, 9, 10)

# Problem 93-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to neural network: ____
Repl.it link to hash table: ____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for assignment-problems repo: _____
Commit link for machine-learning repo: _____

Created issue: _____
Resolved issue: _____

# Problem 92-1¶

a. Once your strategy is finalized, Slack it to me and I'll upload it here.

https://github.com/eurisko-us/eurisko-us.github.io/tree/master/files/strategies/cohort-1/level-2

If your strategy is getting crushed by NumbersBerserker, keep in mind that it's okay to copy NumbersBerserker and then tweak it a little bit with your own spin. Your strategy should have some original component, but it does not need to be 100% original (or even mostly original).

Then, once everyone's strategies are submitted, download the strategies from the above folder and run all pairwise battles for 500 games.

Assuming our games match up so that we can actually agree about who won, there will be prizes:

• 1st place: 50% extra credit on the assignment
• 2nd place: 30% extra credit on the assignment
• 3rd place: 10% extra credit on the assignment

b. Time for an introduction to neural nets! In this problem, we'll create a really simple neural network that is essentially a "neural net"-style implementation of linear regression. We'll start off with something simple and familiar, but we'll implement much more advanced models in the near future.

Note: It seems like we need to merge our graph library into our machine-learning library. So, let's do that. The src directory of your machine-learning library should now look like this:

src/
- models/
  - linear_regressor.py
  - neural_network.py
  - ...
- graphs/
  - weighted_graph.py
  - ...

(If you have a better idea for the structure of our library, feel free to do it your way and bring it up for discussion during the next class)

Create a NeuralNetwork class that inherits from your weighted graph class. Pass in a dictionary of weights to determine the connectivity and initial weights.

>>> weights = {(0,2): -0.1, (1,2): 0.5}
>>> nn = NeuralNetwork(weights)

This is a graphical representation of the model:

        nodes[2]                   ("output layer")
        ^      ^
       /        \
weights[(0,2)]  weights[(1,2)]
     ^              ^
    /                \
nodes[0]          nodes[1]         ("input layer")

To make a prediction, our simple neural net computes a weighted sum of the input values. (Again, this will become more involved in the future, but let's not worry about that just yet.)

>>> nn.predict([1,3])
1.4

behind the scenes:

assign nodes[0] a value of 1 and nodes[1] a value of 3,
and then return the following:

weights[(0,2)] * nodes[0].value + weights[(1,2)] * nodes[1].value
= -0.1 * 1 + 0.5 * 3
= 1.4

If we know the output that's supposed to be associated with a given input, we can compute the error in the prediction.

We'll use the squared error, so that we can frame the problem of fitting the neural network as "choosing weights which minimize the squared error".

To find the weights which minimize the squared error, we can perform gradient descent. As we'll see in the future, calculating the gradient of the weights can get a little tricky (it requires a technique called "backpropagation"). But for now, you can just hard-code the process for this particular network.

>>> data_point = {'input': [1,3], 'output': [7]}
>>> nn.calc_squared_error(data_point)
31.36     [ because (7-1.4)^2 = 5.6^2 = 31.36 ]

Computing the gradient of the squared error with respect to the weights gives

{(0,2): -11.2, (1,2): -33.6}

behind the scenes:

squared_error = (y_actual - y_predicted)^2

d(squared_error)/d(weights)
= 2 (y_actual - y_predicted) d(y_actual - y_predicted)/d(weights)
= 2 (y_actual - y_predicted) [ 0 - d(y_predicted)/d(weights) ]
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights)

remember that
y_predicted = weights[(0,2)] * nodes[0].value + weights[(1,2)] * nodes[1].value

so
d(y_predicted)/d(weights[(0,2)]) = nodes[0].value
d(y_predicted)/d(weights[(1,2)]) = nodes[1].value

Therefore

d(squared_error)/d(weights[(0,2)])
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights[(0,2)])
= -2 (y_actual - y_predicted) nodes[0].value
= -2 (7 - 1.4) (1)
= -11.2

d(squared_error)/d(weights[(1,2)])
= -2 (y_actual - y_predicted) d(y_predicted)/d(weights[(1,2)])
= -2 (y_actual - y_predicted) nodes[1].value
= -2 (7 - 1.4) (3)
= -33.6

Once we've got the gradient, we can update the weights using gradient descent.

>>> nn.update_weights(data_point, learning_rate=0.01)

new_weights = old_weights - learning_rate * gradient
= {(0,2): -0.1, (1,2): 0.5}
- 0.01 * {(0,2): -11.2, (1,2): -33.6}
= {(0,2): -0.1, (1,2): 0.5}
+ {(0,2): 0.112, (1,2): 0.336}
= {(0,2): 0.012, (1,2): 0.836}

If we repeatedly loop through a dataset and update the weights for each data point, then we should get a model whose error is minimized.

Caveat: the minimum will be a local minimum, which is not guaranteed to be a global minimum.

Here is a test case with some data points that are on the line $y=1+2x.$ Our network is set up to fit any line of the form $y = \beta_0 \cdot 1 + \beta_1 \cdot x,$ where $\beta_0 =$ weights[(0,2)] and $\beta_1=$ weights[(1,2)].

Note that this line can be written as

output = 1 * input[0] + 2 * input[1]

In this particular case, the weights should converge to the true values (1 and 2).

>>> weights = {(0,2): -0.1, (1,2): 0.5}
>>> nn = NeuralNetwork(weights)
>>> data_points = [
{'input': [1,0], 'output': [1]},
{'input': [1,1], 'output': [3]},
{'input': [1,2], 'output': [5]},
{'input': [1,3], 'output': [7]}
]
>>> for _ in range(1000):
for data_point in data_points:
nn.update_weights(data_point)

>>> nn.weights
should be really close to
{(0,2): 1, (1,2): 2}

because the data points all lie on the line
output = input[0] * 1 + input[1] * 2
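The whole fitting loop, stripped of the class structure, can be sketched in a few lines (variable names here are illustrative, not the class's API):

```python
# Hard-coded gradient descent for the two-weight linear model
# y_predicted = w0 * input[0] + w1 * input[1], minimizing squared error.
data_points = [
    {'input': [1, 0], 'output': [1]},
    {'input': [1, 1], 'output': [3]},
    {'input': [1, 2], 'output': [5]},
    {'input': [1, 3], 'output': [7]},
]
w0, w1 = -0.1, 0.5
learning_rate = 0.01

for _ in range(1000):
    for point in data_points:
        x, y = point['input'], point['output'][0]
        y_predicted = w0 * x[0] + w1 * x[1]
        # d(squared_error)/d(w) = -2 * (y - y_predicted) * x
        w0 -= learning_rate * (-2 * (y - y_predicted) * x[0])
        w1 -= learning_rate * (-2 * (y - y_predicted) * x[1])

# w0, w1 should now be very close to 1 and 2
```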

Once you've got your final weights, post them on #results.

# Problem 92-2¶

### Quiz Corrections¶

Originally I was going to put the hash table problem here, but I figured we should discuss it in class first. Also, we should do quiz corrections. So it will be on the next assignment instead.

For this assignment, please correct any errors on your quiz (if you got a score under 100%). You'll just need to submit your repl.it links again, with the corrected code.

Remember that we went through the quiz during class, so if you have any questions or need any help, look at the recording first.

Note: Since this quiz corrections problem is much lighter than the usual problem that would go in its place, there will be a couple more Shell and SQL problems than usual.

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

Resources:

https://www.robelle.com/smugbook/regexpr.html

https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html

Problems:

https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-4/problem

https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-5/problem

https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-1/problem

https://www.hackerrank.com/challenges/text-processing-in-linux-the-sed-command-2/problem

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/Using_Null (queries 1, 2, 3, 4, 5, 6)

# Problem 92-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to neural network: ____
Repl.it links to quiz corrections (if applicable): _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for machine-learning repo: _____

Created issue: _____
Resolved issue: _____

# Problem 91-1¶

a. Re-run your decision tree on the sex prediction problem. Make 5 train-test splits of 80% train and 20% test, like we originally did. Now that your Gini trees match up from the previous assignment, they should match up here. Also, make sure to propagate any changes in your Gini tree to your random tree. Our random forest results should be pretty close as well.

b. Create a custom strategy for the level 2 game. Test it against NumbersBerserkerLevel2 and FlankerLevel2. On Wednesday's assignment, we'll have our strategies battle against each other.

Put your results in the usual spreadsheet:

# Problem 91-2¶

### Commit¶

• Commit your code to Github.

• We'll skip reviews on this assignment, to save you a bit of time.

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to hash table: _____

Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____

# Problem 90-1¶

This weekend, your only primary problem is to resolve discrepancies in your Gini decision tree & games (both level 1 and level 2).

Please be sure to get the game discrepancies resolved, so that we can have our custom level 2 strategies battle next week. Then, I'll let Jason know we're ready to speak with Prof. Wierman about designing optimal strategies for our level 2 game.

# Problem 90-2¶

### C++¶

At the beginning of the year, we wrote a Python function called simple_sort that sorts a list by repeatedly finding the smallest element and appending it to a new list.

Now, you will sort a list in C++ using a similar technique. However, because working with arrays in C++ is a bit trickier, we will modify the implementation so that it only involves the use of a single array. The way we do this is by swapping:

• Find the smallest element in the array
• Swap it with the first element of the array
• Find the next-smallest element in the array
• Swap it with the second element of the array
• ...

For example:

array: [30, 50, 20, 10, 40]
indices to consider: 0, 1, 2, 3, 4
elements to consider: 30, 50, 20, 10, 40
smallest element: 10
swap with first element: [10, 50, 20, 30, 40]

---

array: [10, 50, 20, 30, 40]
indices to consider: 1, 2, 3, 4
elements to consider: 50, 20, 30, 40
smallest element: 20
swap with second element: [10, 20, 50, 30, 40]

---

array: [10, 20, 50, 30, 40]
indices to consider: 2, 3, 4
elements to consider: 50, 30, 40
smallest element: 30
swap with third element: [10, 20, 30, 50, 40]

...

final array: [10, 20, 30, 40, 50]

Write your code in the template below.

# include <iostream>
# include <cassert>
int main()
{
int array[5]{ 30, 50, 20, 10, 40 };

// your code here

std::cout << 'Testing...\n';

assert(array[0]==10);
assert(array[1]==20);
assert(array[2]==30);
assert(array[3]==40);
assert(array[4]==50);

std::cout << "Succeeded";

return 0;
}

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

Resources:

https://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/

Problems:

https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-1/problem

https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-2/problem

https://www.hackerrank.com/challenges/text-processing-in-linux-the-grep-command-3/problem

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/More_JOIN_operations (queries 13, 14, 15)

# Problem 90-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____

Created issue: _____
Resolved issue: _____

# Problem 89-1¶

On this problem, we'll do some debugging based on the results from our spreadsheet:

### Debugging¶

a. Compare your results to your classmates' results for Indices of misclassified data points (zero-indexed: the index of the first data point in the dataset would be index 0). If you and a classmate have different results, do some pair debugging to figure out what caused the difference and how you guys need to reconcile it.

b. Compare your results to your classmates' results for Flanker vs Berserker | Simulate 10 games with random seeds 1-10; list game numbers on which Flanker wins. If you and a classmate have different results, do some pair debugging to figure out what caused the difference and how you guys need to reconcile it.

c. Modify your level 2 game so that each player starts with 4 shipyards in addition to 3 scouts. (If a player doesn't start out with shipyards, then the NumberBerserker strategy can't actually do what it's intended to do.)

Then, re-run the game level 2 matchups and put your results in the sheet (put them on the sheet for the current assignment, #89).

### Random Forest¶

d. Make the following adjustment to your random forest:

• In your random decision tree, create a training_percentage parameter that governs the percent of the training data that you actually use to fit the model.

• In our case, we have about 70 records, and in each test-train split, we're using 80% as training data, so that's about 56 records. Now, if we set training_percentage = 0.3, then we randomly choose $0.3 \times 56 \approx 17$ records from the training data to actually fit the decision tree.

• When randomly selecting the records, use random selection with replacement. In other words, it's okay to select duplicate data records.

• When you initialize the random forest, pass a training_percentage parameter that, in turn, gets passed to the random decision trees.

• The reason why choosing training_percentage < 1 can be useful is that it speeds up the time to train the random forest, and also, it allows different models to get different "perspectives" on the data, thereby creating a more diverse "hive mind" (and higher diversity generally leads to higher performance when it comes to ensemble models, i.e. models consisting of many smaller sub-models)

e. On the sex prediction dataset, train the following models on the first half of the data and test on the second half of the data.

• A single random decision tree with max_depth = 4 and training_percentage = 0.3.

• Random forest with 10 trees with max_depth = 4 and training_percentage = 0.3.

• Random forest with 100 trees with max_depth = 4 and training_percentage = 0.3.

• Random forest with 1,000 trees with max_depth = 4 and training_percentage = 0.3.

• Random forest with 10,000 trees with max_depth = 4 and training_percentage = 0.3.

Paste the accuracy into the spreadsheet.

# Problem 89-2¶

First, observe the following Haskell code which computes the sum of all the squares under 1000:

>>> sum (takeWhile (<1000) (map (^2) [1..]))
10416

(If you don't see why this works, then run each part of the expression: first map (^2) [1..], and then takeWhile (<1000) (map (^2) [1..]), and then the full expression sum (takeWhile (<1000) (map (^2) [1..])).)

Now, recall the Collatz conjecture (if you don't remember it, ctrl+F "collatz conjecture" to jump to the problem where we covered it).

The following Haskell code can be used to recursively generate the sequence or "chain" of Collatz numbers, starting with an initial number n.

chain :: (Integral a) => a -> [a]
chain 1 = [1]
chain n
    | even n = n : chain (n `div` 2)
    | odd n  = n : chain (n*3 + 1)

Here are the chains for several initial numbers:

>>> chain 10
[10,5,16,8,4,2,1]
>>> chain 1
[1]
>>> chain 30
[30,15,46,23,70,35,106,53,160,80,40,20,10,5,16,8,4,2,1]

Your problem: Write a Haskell function firstNumberWithChainLengthAtLeast n that finds the first number whose chain length is at least n.

Check: firstNumberWithChainLengthAtLeast 15 should return 7.

To see why this check works, observe the first few chains shown below:

1: [1] (length 1)
2: [2,1] (length 2)
3: [3,10,5,16,8,4,2,1] (length 8)
4: [4,2,1] (length 3)
5: [5,16,8,4,2,1] (length 6)
6: [6,3,10,5,16,8,4,2,1] (length 9)
7: [7,22,11,34,17,52,26,13,40,20,10,5,16,8,4,2,1] (length 17)

7 is the first number whose chain is at least 15 numbers long.
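The logic can be sanity-checked in Python before writing the Haskell version (the assignment itself asks for Haskell):

```python
def chain_length(n):
    # length of the Collatz chain starting at n
    length = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

def first_number_with_chain_length_at_least(target):
    # scan upward until a chain of the desired length appears
    n = 1
    while chain_length(n) < target:
        n += 1
    return n
```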

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

Problems:

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/More_JOIN_operations (queries 9, 10, 11, 12)

# Problem 89-3¶

### Commit + Review¶

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### Submission Template¶

For your submission, copy and paste your links into the following template:

Repl.it link to Haskell code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____

Created issue: _____
Resolved issue: _____

# Problem 89-4¶

There will be a 45-minute quiz that you can take any time on Thursday. (We don't have school Friday.)

The quiz will cover C++ and Haskell.

• For C++, you will need to be comfortable working with arrays.

• For Haskell, you'll need to be comfortable working with list comprehensions and compositions of functions.

You will need to write C++ and Haskell functions to calculate some values. It will be somewhat similar to the meta-Fibonacci sum problem, except the computation will be different (and simpler).

# Problem 88-1¶

### Gini Tree¶

On the sex prediction dataset, train a Gini decision tree on the first half of the data and test on the second half of the data.

(If there's an odd number of data points, then round so that the first half of the data will have one more record than the second half)

Paste your prediction accuracy into the spreadsheet, along with the indices of any misclassified data points (zero-indexed: the index of the first data point in the dataset would be index 0).

### Space Empires Level 1¶

Note the following from the rulebook:

(7.6) There is no limit to the number of Ship Yards that may occupy the same system

(8.2.2) Building Ship Yards: Ship Yards may only be built at planets that produced income (not new colonies) in the Economic Phase. Ship Yards may be purchased and placed at multiple planets, but no more than one per planet. Additional Ship Yards may be purchased at those planets in future Economic Phases. Ship Yards are produced by the Colony itself and therefore do not require Ship Yards to build them.

So, only a colony can buy a shipyard, and only once per economic phase. Shipyards cannot build other shipyards. If a colony with existing shipyards builds a shipyard, the building of the new shipyard does not affect how much hullsize the other shipyards can build on that turn. That is to say, building a shipyard at a colony uses up CP but not hullsize building capacity.

Problem: Using your level 1 game, simulate 20 games of Flanker vs Berserker with random seeds 1 through 20.

• Define the random seed at the beginning of the game. Make sure your die rolls match up with those shown in the demonstration below.

• Let Flanker go first on games 1-10, and let Berserker go first on games 11-20.

• For each of the 20 games, store the game log in space-empires/logs/21-02-05-flanker-vs-berserker.txt. In the game log, on each turn, you should log any ship movements, any battle locations, the combat order, the dice rolls on each attack during combat, and whether or not each attack resulted in a hit.

• In the spreadsheet, paste the game numbers on which the Flanker won. For example, if Flanker won on games 2, 3, 5, 8, 9, 13, 15, 19, then you'd paste 2, 3, 5, 8, 9, 13, 15, 19 into the spreadsheet.

• Check the game numbers you pasted in against those of your classmates. Any discrepancy corresponds to a game on which you and your classmate had different outcomes. So, for any discrepancies, inspect your game logs against your classmate's, and figure out where your game logs started to differ.

import random
import math

for game_num in range(1, 6):
    random.seed(game_num)
    first_few_die_rolls = [math.ceil(10*random.random()) for _ in range(7)]
    print('first few die rolls of game {}'.format(game_num))
    print('\t', first_few_die_rolls, '\n')

---

first few die rolls of game 1
[2, 9, 8, 3, 5, 5, 7]

first few die rolls of game 2
[10, 10, 1, 1, 9, 8, 7]

first few die rolls of game 3
[3, 6, 4, 7, 7, 1, 1]

first few die rolls of game 4
[3, 2, 4, 2, 1, 5, 10]

first few die rolls of game 5
[7, 8, 8, 10, 8, 10, 1]

### Space Empires Level 2¶

Implement toggles that you can use to set level 2 of the game:

• Change initial CP to 10. So really, the players start with 10 CP, and then get 20 CP income, for a total of 30 CP that they're able to spend on ships / technology / maintenance.

• Allow players to buy technology (but as for ships, they can still only buy scouts)

• Have 1 economic phase and that's it.

In the level 2 game, we will have matchups between several strategies.

• NumbersBerserkerLevel2 - spends all its CP buying more scouts. This Berserker thinks that the best way to win is to bring in a bunch of unskilled reinforcements. Sends all the scouts directly towards the enemy home base.

• MovementBerserkerLevel2 - buys movement technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.

• AttackBerserkerLevel2 - buys attack technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.

• DefenseBerserkerLevel2 - buys defense technology first and then buys another scout. Then sends all the scouts directly towards the enemy home base.

• FlankerLevel2 - buys movement technology then buys another scout. Then uses that fast scout to perform the flanking maneuver.

Perform 1000 simulations for each matchup, just like you did with level 1. Remember to randomize the die rolls and switch who goes first at game 500. Put your results in the spreadsheet.

When doing the 1000 simulations, set random.seed(game_num) like you are now doing with the level 1 game. This way, we'll be able to backtrack any discrepancies to the individual game number.

# Problem 88-2¶

### C++¶

Implement the metaFibonacciSum function in C++:

#include <iostream>
#include <cassert>

int metaFibonacciSum(int n)
{
// return the result immediately if n<2

// otherwise, construct an array called "terms"
// that contains the Fibonacci terms at indices
// 0, 1, ..., n

// construct an array called "extendedTerms" that
// contains the Fibonacci terms at indices
// 0, 1, ..., a_n (where a_n is the nth Fibonacci term)

// when you fill up this array, many of the terms can
// simply be copied from the existing "terms" array. But
// if you need additional terms, you'll have to compute
// them the usual way (by adding the previous 2 terms)

// then, create an array called "partialSums" that
// contains the partial sums S_0, S_1, ..., S_{a_n}

// finally, add up the desired partial sums,
// S_{a_0} + S_{a_1} + ... + S_{a_n},
// and return this result

}

int main()
{
std::cout << "Testing...\n";

assert(metaFibonacciSum(6)==74);

std::cout << "Success!";

return 0;
}

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/More_JOIN_operations (queries 5, 6, 7, 8)

# Problem 88-3¶

Review; 10% of assignment grade; 15 minutes estimate

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Repl.it link to space-empires/logs: ___
Repl.it link to space empires game lvl 1 simulation runner: ___
Repl.it link to space empires game lvl 2 simulation runner: ___

Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for space-empires repo: _____

Created issue: _____
Resolved issue: _____

# Problem 87-1¶

a.

• It's possible that unit 0 might be a colony (not a scout), which would be problematic for the current implementation of the Flanker strategy. Fix the implementation so that the flanking unit is chosen as the first scout, not just the unit at index 0 (since there is no guarantee this is a scout). Check your game logs to make sure that the scout is actually doing the flanking, as intended.

• Re-specify hidden_game_state_for_combat. Currently, it shows all of the opponent's units, but it should really only show those units involved in the particular combat that's taking place.

• hidden_game_state_for_combat - like hidden_game_state, but reveal the type / hits_left / technology of the opponent's ships that are in the particular combat.
• Run 1000 random simulations for each of the following matchups. Remember to have both strategies get an equal number of games as the player who goes first. Paste your results into the spreadsheet.

• Berserker vs Dumb
• Berserker vs Random
• Flanker vs Random
• Flanker vs Berserker
• Note that there should be no ties. (If you're getting a tie, post on Slack so we can clear up what's going wrong.)

• Make sure you're using a 10-sided die.
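One possible sketch of the combat-visibility filter described above (assuming opponent units are dicts with a 'coords' key, as in the Level 1 strategy code; the helper name is hypothetical):

```python
def visible_units_for_combat(opponent_units, combat_coords):
    # Reveal type / hits_left / technology only for opponent units located
    # at the combat coordinates; elsewhere, expose the coords only.
    visible = []
    for unit in opponent_units:
        if unit['coords'] == combat_coords:
            visible.append(dict(unit))                   # full information
        else:
            visible.append({'coords': unit['coords']})   # hidden information
    return visible
```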

b. Re-run the sex prediction problem (Problem 77-1) and paste your results in the spreadsheet. Now that our decision trees and random forests are passing tests, we should get very similar accuracy results.

c. Submit quiz corrections -- say what you got wrong, why you got it wrong, what the correct answer is, and why it's correct.

# Problem 87-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

Location: assignment-problems

Let $a_k$ be the $k$th Fibonacci number (with $a_0 = 0$) and let $S_k = a_0 + a_1 + ... + a_k$ be the $k$th partial sum of the Fibonacci sequence. Write a function metaFibonacciSum that takes an input $n$ and computes the sum

$$\sum\limits_{k=0}^n S_{a_k} = S_{a_0} + S_{a_1} + ... + S_{a_n}.$$

For example, if we wanted to compute the result for n=6, then we'd need to

• compute the Fibonacci numbers $a_0$ through $a_6$: $$a_0=0, a_1=1, a_2=1, a_3=2, a_4=3, a_5=5, a_6=8$$

• compute the partial sums $S_0$ through $S_8$: \begin{align*} S_0 &= 0 \\ S_1 &= 0 + 1 = 1 \\ S_2 &= 0 + 1 + 1 = 2 \\ S_3 &= 0 + 1 + 1 + 2 = 4 \\ S_4 &= 0 + 1 + 1 + 2 + 3 = 7 \\ S_5 &= 0 + 1 + 1 + 2 + 3 + 5 = 12 \\ S_6 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 = 20 \\ S_7 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 + 13 = 33 \\ S_8 &= 0 + 1 + 1 + 2 + 3 + 5 + 8 + 13 + 21 = 54 \\ \end{align*}

Add up the desired sums:

\begin{align*} \sum\limits_{k=0}^6 S_{a_k} &= S_{a_0} + S_{a_1} + S_{a_2} + S_{a_3} + S_{a_4} + S_{a_5} + S_{a_6} \\ &= S_{0} + S_{1} + S_{1} + S_{2} + S_{3} + S_{5} + S_{8} \\ &= 0 + 1 + 1 + 2 + 4 + 12 + 54 \\ &= 74 \end{align*}
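The worked example above can be cross-checked with a short Python sketch (the assignment itself asks for a Haskell implementation):

```python
def fib(n):
    # nth Fibonacci number, with fib(0) = 0
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def partial_sum(k):
    # S_k = a_0 + a_1 + ... + a_k
    return sum(fib(i) for i in range(k + 1))

def meta_fibonacci_sum(n):
    # S_{a_0} + S_{a_1} + ... + S_{a_n}
    return sum(partial_sum(fib(k)) for k in range(n + 1))
```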

Here's a template:

-- first, define a recursive function "fib"
-- to compute the nth Fibonacci number

-- once you've defined "fib", proceed to the
-- steps below
firstKEntriesOfSequence k = -- your code here; should return the list [a_0, a_1, ..., a_k]
kthPartialSum k = -- your code here; returns a single number
termsToAddInMetaSum n = -- your code here; should return the list [S_{a_0}, S_{a_1}, ..., S_{a_n}]
metaSum n = -- your code here; returns a single number

main = print (metaSum 6) -- should come out to 74

### Shell¶

Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

### SQL¶

Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/More_JOIN_operations (queries 1, 2, 3, 4)

# Problem 87-3¶

Review; 10% of assignment grade; 15 minutes estimate

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

REMEMBER TO PASTE YOUR RESULTS IN HERE:

Quiz corrections: ____

Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____

Created issue: _____
Resolved issue: _____

# Problem 86-1¶

Primary problems; 60% of assignment grade; 90 minutes estimate

a. Assert that your decision trees pass some tests. (They likely will, so this problem will likely only take 10 minutes or so; I just want to make sure we're all clear before we go back to improving our random forest, modeling real-world datasets, and moving on to neural nets.)

(i) Assert that BOTH your gini decision tree and random decision tree pass the following test.

• Create a dataset consisting of 100 points $$\left\{ (x,y,\textrm{label}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \right\},$$ where $$\textrm{label} = \begin{cases} \textrm{positive}, \quad x>0, y > 0 \\ \textrm{negative}, \quad \textrm{otherwise} \end{cases}$$

• Predict the label of this dataset. Train on 100% of the data and test on 100% of the data.

• You should get an accuracy of 100%.

• You should have exactly 2 splits

Note: Your tree should look exactly like one of these:

            split y=0
           /         \
       y < 0          y > 0
     pure neg       split x=0
                   /          \
               x < 0          x > 0
             pure neg       pure pos

or

            split x=0
           /         \
       x < 0          x > 0
     pure neg       split y=0
                   /          \
               y < 0          y > 0
             pure neg       pure pos
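As a quick cross-check, the part (i) dataset can be generated like this (a sketch; your own data-generation code may differ):

```python
def make_quadrant_dataset():
    # 100 lattice points with x, y in -5..5, excluding points on the axes;
    # labeled positive only in the upper-right quadrant
    data = []
    for x in range(-5, 6):
        for y in range(-5, 6):
            if x * y != 0:
                label = 'positive' if x > 0 and y > 0 else 'negative'
                data.append((x, y, label))
    return data
```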

(ii) Assert that your gini decision tree passes Tests 1,2,3,4 from problem 84-1.

(iii) Assert that your random forest with 10 trees passes Tests 1,2,3,4 from problem 84-1.

b. Run each pair of Level1 players against each other for 100 random games. Do this in space-empires/analysis/level_1_matchups.py. Then, post your results to #results:

(I've included FlankerStrategyLevel1 at the bottom of this problem.)

Simulation results for 100 games, level 1:

Random vs Dumb:
- Random wins __% of the time
- Dumb wins __% of the time

Berserker vs Dumb:
- Berserker wins __% of the time
- Dumb wins __% of the time

Berserker vs Random:
- Berserker wins __% of the time
- Random wins __% of the time

Flanker vs Random:
- Flanker wins __% of the time
- Random wins __% of the time

Flanker vs Berserker:
- Flanker wins __% of the time
- Berserker wins __% of the time

Important simulation requirements:

• Use (at least) 100 simulated games to generate the win percentages for each matchup. This way, we can guarantee that we should all get similar win percentages.

• Use actual random rolls (not just increasing or decreasing rolls). We want each of the 100 simulated games to occur under different rolling conditions.

• Randomize who goes first. For example, in Flanker vs Berserker, Flanker should go first on 50 games and Berserker should go first on 50 games.

• Use a 10-sided die (this is what's used in the official game)

class FlankerStrategyLevel1:
    # Sends 2 of its units directly towards the enemy home colony.
    # Sends 1 unit slightly to the side to avoid any combat
    # that happens on the direct path between home colonies.

    def __init__(self, player_index):
        self.player_index = player_index
        self.flank_direction = (1,0)

    def decide_ship_movement(self, unit_index, hidden_game_state):
        myself = hidden_game_state['players'][self.player_index]
        opponent_index = 1 - self.player_index
        opponent = hidden_game_state['players'][opponent_index]

        unit = myself['units'][unit_index]
        x_unit, y_unit = unit['coords']
        x_opp, y_opp = opponent['home_coords']

        translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)]

        # unit 0 does the flanking
        if unit_index == 0:
            dist = abs(x_unit - x_opp) + abs(y_unit - y_opp)
            delta_x, delta_y = self.flank_direction
            reverse_flank_direction = (-delta_x, -delta_y)

            # at the start, sidestep
            if unit['coords'] == myself['home_coords']:
                return self.flank_direction

            # at the end, reverse the sidestep to get to enemy
            elif dist == 1:
                return reverse_flank_direction

            # during the journey to the opponent, don't
            # reverse the sidestep
            else:
                translations.remove(reverse_flank_direction)

        best_translation = (0,0)
        smallest_distance_to_opponent = 999999999999
        for translation in translations:
            delta_x, delta_y = translation
            x = x_unit + delta_x
            y = y_unit + delta_y
            dist = abs(x - x_opp) + abs(y - y_opp)
            if dist < smallest_distance_to_opponent:
                best_translation = translation
                smallest_distance_to_opponent = dist

        return best_translation

    def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
        # attack opponent's first ship in combat order
        combat_order = combat_state[coords]
        player_indices = [unit['player_index'] for unit in combat_order]

        opponent_index = 1 - self.player_index
        for combat_index, unit in enumerate(combat_order):
            if unit['player_index'] == opponent_index:
                return combat_index

# Problem 86-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

Location: assignment-problems

a. Skim the following section of http://learnyouahaskell.com/higher-order-functions.

Function composition

Consider the function $$f(x,y) = -\max \left( x, \tan(\cos(y)) \right)$$

This function can be implemented as

>>> f x y = negate (max x (tan (cos y)))

or, we can implement it using function composition notation as follows:

>>> f x = negate . max x . tan . cos

Note that although max is a function of two variables, max x is a function of one variable (since one of the inputs is already supplied). So, we can chain it together with other single-variable functions.
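The same partial-application idea can be sketched in Python with functools.partial (just an analogy, not part of the assignment):

```python
from functools import partial

# max normally takes two arguments; fixing the first gives a
# one-argument function, analogous to Haskell's "max x"
max_with_5 = partial(max, 5)

print(max_with_5(3))   # 5
print(max_with_5(8))   # 8
```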

Previously, you wrote a function tail' in Tail.hs that finds the last n elements of a list by reversing the list, taking the first n elements of the reversed list, and then reversing the result.

Rewrite the function tail' using composition notation, so that it's cleaner. Run Tail.hs again to make sure it still gives the same output as before.

b. Write a function isPrime that determines whether a nonnegative integer x is prime. You can use the same approach that you did with one of our beginning Python problems: loop through numbers between 2 and x-1 and see if you can find any factors.

Note that neither 0 nor 1 are prime.

Here is a template for your file isPrime.cpp:

#include <iostream>
#include <cassert>

bool isPrime(int x)
{
// your code here
}

int main()
{
assert(!isPrime(0));
assert(!isPrime(1));
assert(isPrime(2));
assert(isPrime(3));
assert(!isPrime(4));
assert(isPrime(5));
assert(isPrime(7));
assert(!isPrime(9));
assert(isPrime(11));
assert(isPrime(13));
assert(!isPrime(15));
assert(!isPrime(16));
assert(isPrime(17));
assert(isPrime(19));
assert(isPrime(97));
assert(!isPrime(99));
assert(isPrime(13417));

std::cout << "Success!";

return 0;
}

Your program should work like this

>>> g++ isPrime.cpp -o isPrime
>>> ./isPrime
Success!

c. Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

Here's a reference to the sort command: https://www.thegeekstuff.com/2013/04/sort-files/

Note that the "tab" character must be specified as $'\t'. These problems are super quick, so we'll do several.

d. Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/The_JOIN_operation (queries 12, 13)

# Problem 86-3¶

Review; 10% of assignment grade; 15 minutes estimate

• Commit your code to Github.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.

• Resolve 1 GitHub issue on one of your own repositories.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Repl.it link to machine-learning/tests/test_random_forest.py: _____
Repl.it link to space-empires/analysis/level_1_matchups.py: _____
Repl.it link to Haskell code: _____
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____

Commit link for machine-learning repo: _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____

Created issue: _____
Resolved issue: _____

# Problem 85-1¶

Primary problems; 60% of assignment grade; 90 minutes estimate

a. Make the following updates to your game:

• Put "board_size" as an attribute in the game_state. We've got our grid size set as $(5,5)$ for now (i.e. a $5 \times 5$ grid).

• Player loses the game when their home colony is destroyed.

• Ships cannot move diagonally.

• For most of the strategy functions, the input will need to be a partially-hidden game state. There are two types of hidden game states in particular:

• hidden_game_state_for_combat - has all the information except for planet locations and the opponent's CP

• hidden_game_state - has all the information except for planet locations, the opponent's CP, and the type / hits_left / technology of the opponent's units.
(The opponent's units are still in array form, and you can see their locations, but that's it -- you don't know anything else about them.)

b. Some background... We need to get some sort of game competition going in the upcoming week so that you guys have something to work on with Prof. Wierman from Caltech. Jason and I talked a bit and came to the conclusion that we need to get something working as soon as possible, even if it doesn't use all the stuff we've implemented so far. Plus, the main thing that will be of interest to Prof. Wierman is the types of algorithms you guys are using in your strategies (he won't care if the game doesn't have all the features we want -- it just needs to be rich enough to permit some different strategies).

So, let's focus on a very limited type of game and gradually expand it after we get it working. The first type of game we'll consider will be subject to the following constraints.

(i) Implement optional arguments in your game that "switch off" some of the parts when they are set to False:

• There are only 2 planets: one for each home colony. That's it. Switch off the part that creates additional planets.

• Players start with 3 scouts and their home colony and that's it. No colonyships. No shipyards. Just switch off the line where you give the player the colonyships / shipyards.

• Movement phase consists of just 1 round. Switch off the lines in your game for the other 2 rounds.

• There will be no economic phase. Just switch off that line of your game.

• Players are not allowed to screen ships. Switch off the line where the game asks the player what ships they want to screen.

So, the game will consist of each player starting with 3 scouts, moving them around the board, having combat whenever they meet, and trying to reach and destroy the opponent's home colony.

Note that nothing we've done is a wasted effort.
We're just going to put the other features (technology, other ship types, planets, ship screening, etc.) on pause until we get our games working under the simplest constraints. Then, we'll bring all that other stuff back in.

(ii) I've included code for these strategies at the bottom of this problem. The strategies are named with Level1 at the end because this is like the "level 1" version of our game. We'll make a level 2 version in the next week, and then a level 3 version, and so on, until we've re-introduced all the features we've been working on.

• DumbStrategyLevel1 - sends all of its units to the right

• RandomStrategyLevel1 - moves its units randomly

• BerserkerStrategyLevel1 - sends all of its units directly towards the enemy home colony

Write some tests for these strategies:

• If a BerserkerStrategyLevel1 plays against a DumbStrategyLevel1, the BerserkerStrategyLevel1 wins, and in the final game state each player still has 3 scouts. Announce on Slack once you have this test working.

• If a BerserkerStrategyLevel1 plays against a BerserkerStrategyLevel1, there is a winner, and in the final game state one player has 0 scouts. Announce on Slack once you have this test working.

• If a BerserkerStrategyLevel1 plays against a RandomStrategyLevel1, the BerserkerStrategyLevel1 should win the majority of the time. (To test this, run 100 games and compute how many times BerserkerStrategyLevel1 wins.) Announce on Slack once you have this test working.

(iii) Write a custom strategy called YournameStrategyLevel1.

• Write up a rationale for your strategy in an Overleaf doc. Explain why you take the actions you do. Explain why it will defeat DumbStrategyLevel1 and RandomStrategyLevel1. Explain why you think it might beat BerserkerStrategyLevel1.

(iv) Make sure your custom strategy passes the following tests.

• Make sure it defeats the DumbStrategyLevel1 all the time. Announce on Slack once you have this test working.
• Make sure it defeats RandomStrategyLevel1 the majority of the time. Announce on Slack once you have this test working. • Try to have your strategy defeat BerserkerStrategyLevel1 the majority of the time, too. (To test this, run 100 games and compute how many times BerserkerStrategyLevel1 wins.) Announce on Slack if you get this test working. class DumbStrategyLevel1: # Sends all of its units to the right def __init__(self, player_index): self.player_index = player_index def decide_ship_movement(self, unit_index, hidden_game_state): myself = hidden_game_state['players'][self.player_index] unit = myself['units'][unit_index] x_unit, y_unit = unit['coords'] board_size_x, board_size_y = game_state['board_size'] unit_is_at_edge = (x_unit == board_size_x-1) if unit_is_at_edge: return (0,0) else: return (1,0) def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index): # attack opponent's first ship in combat order combat_order = combat_state[coords] player_indices = [unit['player_index'] for unit in combat_order] opponent_index = 1 - self.player_index for combat_index, unit in enumerate(combat_order): if unit['player_index'] == opponent_index: return combat_index class RandomStrategyLevel1: # Sends all of its units to the right def __init__(self, player_index): self.player_index = player_index def decide_ship_movement(self, unit_index, hidden_game_state): myself = hidden_game_state['players'][self.player_index] unit = myself['units'][unit_index] x_unit, y_unit = unit['coords'] translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)] board_size_x, board_size_y = hidden_game_state['board_size'] while True: translation = random.choice(translations) delta_x, delta_y = translation x_new = x_unit + delta_x y_new = y_unit + delta_y if 0 <= x_new and 0 <= y_new and x_new <= board_size_x-1 and y_new <= board_size_y-1: return translation def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, 
attacker_index):
        # attack opponent's first ship in combat order
        combat_order = combat_state[coords]
        opponent_index = 1 - self.player_index
        for combat_index, unit in enumerate(combat_order):
            if unit['player_index'] == opponent_index:
                return combat_index

class BerserkerStrategyLevel1:
    # Sends all of its units directly towards the enemy home colony

    def __init__(self, player_index):
        self.player_index = player_index

    def decide_ship_movement(self, unit_index, hidden_game_state):
        myself = hidden_game_state['players'][self.player_index]
        opponent_index = 1 - self.player_index
        opponent = hidden_game_state['players'][opponent_index]
        unit = myself['units'][unit_index]
        x_unit, y_unit = unit['coords']
        x_opp, y_opp = opponent['home_coords']
        translations = [(0,0), (1,0), (-1,0), (0,1), (0,-1)]
        best_translation = (0,0)
        smallest_distance_to_opponent = 999999999999
        for translation in translations:
            delta_x, delta_y = translation
            x = x_unit + delta_x
            y = y_unit + delta_y
            dist = abs(x - x_opp) + abs(y - y_opp)
            if dist < smallest_distance_to_opponent:
                best_translation = translation
                smallest_distance_to_opponent = dist
        return best_translation

    def decide_which_unit_to_attack(self, hidden_game_state_for_combat, combat_state, coords, attacker_index):
        # attack opponent's first ship in combat order
        combat_order = combat_state[coords]
        opponent_index = 1 - self.player_index
        for combat_index, unit in enumerate(combat_order):
            if unit['player_index'] == opponent_index:
                return combat_index

# Problem 85-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

Location: assignment-problems

a. Skim the following section of http://learnyouahaskell.com/higher-order-functions.
Maps and filters

Pay attention to the following examples:

>>> map (+3) [1,5,3,1,6]
[4,8,6,4,9]
>>> filter (>3) [1,5,3,2,1,6,4,3,2,1]
[5,6,4]

Create a Haskell file SquareSingleDigitNumbers.hs and write a function squareSingleDigitNumbers that takes a list and returns the squares of the values that are less than 10.

To check your function, print squareSingleDigitNumbers [2, 7, 15, 11, 5]. You should get a result of [4, 49, 25].

This is a one-liner. If you get stuck for more than 10 minutes, ask for help on Slack.

b. Write a C++ program to calculate the height of a ball that falls from a tower.

• Create a file constants.h to hold your gravity constant:

#ifndef CONSTANTS_H
#define CONSTANTS_H

namespace myConstants {
    const double gravity(9.8); // in meters/second squared
}

#endif

• Create a file simulateFall.cpp

#include <iostream>
#include "constants.h"

double calculateDistanceFallen(int seconds) {
    // approximate distance fallen after a particular number of seconds
    double distanceFallen = myConstants::gravity * seconds * seconds / 2;
    return distanceFallen;
}

void printStatus(int time, double height) {
    std::cout << "At " << time << " seconds, the ball is at height "
              << height << " meters\n";
}

int main() {
    using namespace std;
    cout << "Enter the initial height of the tower in meters: ";
    double initialHeight;
    cin >> initialHeight;

    // your code here
    // use calculateDistanceFallen to find the height now
    // use calculateDistanceFallen and printStatus
    // to generate the desired output
    // if the height now goes negative, then the status
    // should say that the height is 0 and the program
    // should stop (since the ball stops falling at height 0)

    return 0;
}

Your program should work like this:

>>> g++ simulateFall.cpp -o simulateFall
>>> ./simulateFall
Enter the initial height of the tower in meters: 100
At 0 seconds, the ball is at height 100 meters
At 1 seconds, the ball is at height 95.1 meters
At 2 seconds, the ball is at height 80.4 meters
At 3 seconds, the ball is at height
55.9 meters
At 4 seconds, the ball is at height 21.6 meters
At 5 seconds, the ball is at height 0 meters

c. Complete these Shell coding challenges and submit screenshots. Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.

Here's a reference to the sort command: https://www.thegeekstuff.com/2013/04/sort-files/

These problems are super quick, so we'll do several.

d. Complete these SQL coding challenges and submit screenshots. For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

https://sqlzoo.net/wiki/The_JOIN_operation (queries 10, 11)

# Problem 85-3¶

Review; 10% of assignment grade; 15 minutes estimate

Now, everyone should have a handful of issues on their repositories. So we'll go back to making 1 issue and resolving 1 issue.

• Make 1 GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate.
• Resolve 1 GitHub issue on one of your own repositories.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Link to space empires tests with the new strategies: _____
Link to overleaf doc with your custom strategy rationale: _____
Repl.it link to Haskell code: _____
Repl.it link to C++ code: _____
Link to Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for assignment-problems repo: _____
Created issue: _____
Resolved issue: _____

# Problem 84-1¶

Primary problems; 60% of assignment grade; 90 minutes estimate

a. Implement calc_shortest_path(start_node, end_node) in your weighted graph.

• To do this, you first need to carry out Dijkstra's algorithm to find the d-values.
• Then, you need to find the edges for the shortest-path tree. To do this, loop through all the edges (a,b), and if the difference in d-values is equal to the weight, i.e.
nodes[b].dvalue - nodes[a].dvalue == weight[(a,b)], include the edge in your list of edges for the shortest-path tree.
• Using your list of edges for the shortest-path tree, create a Graph object and run calc_shortest_path on it. By constructing the shortest-path tree, we have reduced the problem of finding the shortest path in a weighted graph to the problem of finding the shortest path in an unweighted graph, which we have already solved.

Check your function by carrying out the following tests for the graph given in Problem 83-1.

>>> weighted_graph.calc_shortest_path(8,4)
[8, 0, 3, 4]
>>> weighted_graph.calc_shortest_path(8,7)
[8, 0, 1, 7]
>>> weighted_graph.calc_shortest_path(8,6)
[8, 0, 3, 2, 5, 6]

b. Assert that your random decision tree passes the following tests.

Test 1

• Create a dataset consisting of 100 points $$\Big[ (x,y,\textrm{label}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \Big],$$ where $$\textrm{label} = \begin{cases} \textrm{positive}, \quad xy > 0 \\ \textrm{negative}, \quad xy < 0 \end{cases}$$
• Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 100%.

Test 2

• Create a dataset consisting of 150 points \begin{align*} &\Big[ (x,y,\textrm{A}) \mid x,y \in \mathbb{Z}, \,\, -5 \leq x,y \leq 5, \,\, xy \neq 0 \Big] \\ &+ \Big[ (x,y,\textrm{B}) \mid x,y \in \mathbb{Z}, \,\, 1 \leq x,y \leq 5 \Big] \\ &+ \Big[ (x,y,\textrm{B}) \mid x,y \in \mathbb{Z}, \,\, 1 \leq x,y \leq 5 \Big]. \end{align*} This dataset consists of $100$ data points labeled "A" distributed evenly throughout the plane and $50$ data points labeled "B" in quadrant I. Each integer pair in quadrant I will have $1$ data point labeled "A" and $2$ data points labeled "B".
• Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data.
You should get an accuracy of 83.3% (25/150 misclassified).

Test 3

• Create a dataset consisting of 1000 points $$\Big[ (x,y,z,\textrm{label}) \mid x,y,z \in \mathbb{Z}, \,\, -5 \leq x,y,z \leq 5, \,\, xyz \neq 0 \Big],$$ where $$\textrm{label} = \begin{cases} \textrm{positive}, \quad xyz > 0 \\ \textrm{negative}, \quad xyz < 0 \end{cases}$$
• Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 100%.
• Note: These are a lot of data points, but the tree won't need to do many splits, so the code should run quickly. If the code takes a long time to run, it means you've got an issue, and you should post on Slack if you can't figure out why it's taking so long.

Test 4

• Create a dataset consisting of 1250 points \begin{align*} &\Big[ (x,y,z,\textrm{A}) \mid x,y,z \in \mathbb{Z}, \,\, -5 \leq x,y,z \leq 5, \,\, xyz \neq 0 \Big] \\ &+ \Big[ (x,y,z,\textrm{B}) \mid x,y,z \in \mathbb{Z}, \,\, 1 \leq x,y,z \leq 5 \Big] \\ &+ \Big[ (x,y,z,\textrm{B}) \mid x,y,z \in \mathbb{Z}, \,\, 1 \leq x,y,z \leq 5 \Big]. \end{align*} This dataset consists of $1000$ data points labeled "A" distributed evenly throughout the eight octants and $250$ data points labeled "B" in octant I. Each integer point in octant I will have $1$ data point labeled "A" and $2$ data points labeled "B".
• Train a random decision tree to predict the label of this dataset. Train on 100% of the data and test on 100% of the data. You should get an accuracy of 90% (125/1250 misclassified).
• Note: These are a lot of data points, but the tree won't need to do many splits, so the code should run quickly. If the code takes a long time to run, it means you've got an issue, and you should post on Slack if you can't figure out why it's taking so long.

c. Update your game to use 0 at the head of the prices list for the technologies that start at level 1.
'technology_data': {
    # lists containing the price to purchase the next level
    'shipsize': [0, 10, 15, 20, 25, 30],
    'attack': [20, 30, 40],
    'defense': [20, 30, 40],
    'movement': [0, 20, 30, 40, 40, 40],
    'shipyard': [0, 20, 30]
}

This way, you can do this:

price = game_state['technology_data'][tech_type][level]

instead of this:

if tech_type in ['shipsize', 'movement', 'shipyard']:
    price = game_state['technology_data'][tech_type][level-1]
else:
    price = game_state['technology_data'][tech_type][level]

# Problem 84-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

### PART 1¶

Location: assignment-problems

Skim the following section of http://learnyouahaskell.com/recursion.

A few more recursive functions

Pay attention to the following example. take n myList returns the first n entries of myList.

take' :: (Num i, Ord i) => i -> [a] -> [a]
take' n _
    | n <= 0 = []
take' _ [] = []
take' n (x:xs) = x : take' (n-1) xs

Create a Haskell file Tail.hs and write a function tail' that takes a list and returns the last n values of the list. Here's the easiest way to do this...

• Write a helper function reverseList that reverses a list. This will be a recursive function, which you can define using the following template:

reverseList :: [a] -> [a]
reverseList [] = (your code here -- base case)
reverseList (x:xs) = (your code here -- recursive formula)

Here, x is the first element of the input list and xs is the rest of the elements. For the recursive formula, just call reverseList on the rest of the elements and put the first element of the list at the end. You'll need to use the ++ operation for list concatenation.

• Once you've written reverseList and tested to make sure it works as intended, you can implement tail' by reversing the input list, calling take' on the reversed list, and reversing the result.

To check your function, print tail' 4 [8, 3, -1, 2, -5, 7]. You should get a result of [-1, 2, -5, 7].
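The reverse / take / reverse plan can be mirrored in Python as a sanity check of the expected behavior (this is just a reference sketch; the assignment itself asks for Haskell):

```python
def reverse_list(lst):
    # recursive reversal, mirroring the structure of the reverseList template
    if len(lst) == 0:
        return []  # base case: reversing an empty list gives an empty list
    # reverse the rest of the elements, then put the first element at the end
    return reverse_list(lst[1:]) + [lst[0]]

def tail(n, lst):
    # last n values of lst: reverse, take the first n, reverse again
    return reverse_list(reverse_list(lst)[:n])

print(tail(4, [8, 3, -1, 2, -5, 7]))  # [-1, 2, -5, 7]
```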
If you get stuck anywhere in this problem, don't spend a bunch of time staring at it. Be sure to post on Slack. These Haskell problems can be tricky if you're not taking the right approach from the beginning, but after a bit of guidance, it can become much simpler. ### PART 2¶ Complete these C++/Shell/SQL coding challenges and submit screenshots. • For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator. • For SQL, each screenshot should include the problem number, the successful smiley face, and your query. C++ https://www.hackerrank.com/challenges/inheritance-introduction/problem • Guess what? After this problem, we're done with the useful C++ problems on HackerRank. Next time, we'll start some C++ coding in Repl.it. We'll start by re-implementing a bunch of problems that we did when we were first getting used to Python. Shell https://www.hackerrank.com/challenges/text-processing-tr-1/problem https://www.hackerrank.com/challenges/text-processing-tr-2/problem https://www.hackerrank.com/challenges/text-processing-tr-3/problem • Helpful templates: $ echo "Hello" | tr "e" "E"
HEllo
$ echo "Hello how are you" | tr " " '-'
Hello-how-are-you
$ echo "Hello how are you 1234" | tr -d [0-9]
Hello how are you
$ echo "Hello how are you" | tr -d [a-e]
Hllo how r you

• More info on tr here: https://www.thegeekstuff.com/2012/12/linux-tr-command/
• These problems are all very quick. If you find yourself spending more than a couple minutes on these, be sure to ask for help.

SQL

https://sqlzoo.net/wiki/The_JOIN_operation (queries 7, 8, 9)

# Problem 84-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

• Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Link to weighted graph tests: _____
Link to random decision tree tests: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for machine-learning repo: _____
Commit link for graph repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____

# Problem 83-1¶

Primary problems; 60% of assignment grade; 90 minutes estimate

a. In your random decision tree, make the random split selection a little bit smarter. First, randomly choose a feature (i.e. variable name) to split on. But then, instead of choosing a random split for that feature, choose the optimal split as determined by the Gini metric.

• This is technically what's meant by a random decision tree -- it chooses each split feature randomly, but it chooses the best split value for each feature.
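The "random feature, best split" rule can be sketched as follows. The data format here (each point as a `(features_dict, label)` pair) and the helper names are assumptions for illustration, not the interface of your machine-learning repo:

```python
import random

def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for label in set(labels):
        p = labels.count(label) / n
        impurity -= p ** 2
    return impurity

def best_split_for_feature(points, feature):
    # points: list of (features_dict, label); returns (threshold, weighted gini)
    values = sorted({features[feature] for features, _ in points})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2  # midpoint between consecutive values
        left = [label for features, label in points if features[feature] < threshold]
        right = [label for features, label in points if features[feature] >= threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best

points = [({'x': -1, 'y': 2}, 'A'), ({'x': 1, 'y': 2}, 'B'),
          ({'x': -2, 'y': 1}, 'A'), ({'x': 2, 'y': 1}, 'B')]
feature = random.choice(['x', 'y'])      # the feature is chosen randomly...
print(feature, best_split_for_feature(points, feature))  # ...its split is not
```

For this toy dataset, splitting on 'x' finds the pure split at threshold 0.0 (weighted impurity 0.0), while 'y' cannot separate the classes.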
I intended for us to make this update right after problem 77-1 and compare the results, but I forgot about it, so we'll do it now. (And then, in the future, we'll run the analysis again using the max_depth parameter plus another speedup trick.)

b. Run your analysis from 77-1 again, now that your random decision tree has been updated. Post your results on #results.

• (Don't set a max_depth yet -- we'll do that in the near future.)

c. Create a strategy class AggressiveStrategy that buys ships/technology in the same way as CombatPlayer, but sends all their ships directly upward (or downward) towards the enemy home colony.

• This should ideally result in battles in multiple locations on the path between the two home colonies, and there should be an actual winner of the game.
• Battle two AggressiveStrategy players against each other. Post the following on #results:

Ascending die rolls:
- num turns: ___
- num combats: ___
- winner: ___ (Player 0 or Player 1?)
- Player 0 ending CP: ___
- Player 1 ending CP: ___

Descending die rolls:
- num turns: ___
- num combats: ___
- winner: ___ (Player 0 or Player 1?)
- Player 0 ending CP: ___
- Player 1 ending CP: ___

# Problem 83-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

### PART 1¶

Location: assignment-problems

Skim the following sections of http://learnyouahaskell.com/syntax-in-functions.

Hello recursion
Maximum awesome

Pay attention to the following example, especially:

maximum' :: (Ord a) => [a] -> a
maximum' [] = error "maximum of empty list"
maximum' [x] = x
maximum' (x:xs)
    | x > maxTail = x
    | otherwise = maxTail
    where maxTail = maximum' xs

Create a Haskell file SmallestPositive.hs and write a function findSmallestPositive that takes a list and returns the smallest positive number in the list. The format will be similar to that shown in the maximum' example above.

To check your function, print findSmallestPositive [8, 3, -1, 2, -5, 7]. You should get a result of 2.
Important: In your function findSmallestPositive, you will need to compare x to 0, which means we must assume that the items x can not only be ordered (Ord) but are also numbers (Num). So, you will need to have findSmallestPositive :: (Num a, Ord a) => [a] -> a.

Note: It is not necessary to put a "prime" at the end of your function name, as shown in the example.

### PART 2¶

Complete these C++/Shell/SQL coding challenges and submit screenshots.

• For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
• For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

C++

https://www.hackerrank.com/challenges/c-tutorial-class/problem

Shell

https://www.hackerrank.com/challenges/text-processing-tail-1/problem
https://www.hackerrank.com/challenges/text-processing-tail-2/problem
https://www.hackerrank.com/challenges/text-processing-in-linux---the-middle-of-a-text-file/problem

• Helpful templates:

tail -n 11             # Last 11 lines
tail -c 20             # Last 20 characters
head -n 10 | tail -n 5 # Get the first 10 lines, and then get the last 5 lines of those 10 lines (so the final result is lines 6-10)

• These problems are all one-liners. If you find yourself spending more than a couple minutes on these, be sure to ask for help.

SQL

https://sqlzoo.net/wiki/The_JOIN_operation (queries 4,5,6)

# Problem 83-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

• Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
### SUBMISSION TEMPLATE¶ For your submission, copy and paste your links into the following template: Link to Overleaf doc: _____ Repl.it link to Haskell code: _____ Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____ Commit link for space-empires repo: _____ Commit link for machine-learning repo: _____ Commit link for assignment-problems repo: _____ Issue 1: _____ Issue 2: _____ # Problem 83-4¶ There will be a 45-minute quiz on Friday from 8:30-9:15. It will mainly be a review of the ML algorithms we've implemented so far, and their use for modeling purposes. Know how to do the following things: • Answer questions about similarities and differences between linear regression, logistic regression, k nearest neighbors, naive bayes, and Gini decision trees. • Answer questions about overfitting, underfitting, training datasets, testing datasets, train-test splits. # Problem 82-1¶ Primary problems; 60% of assignment grade; 90 minutes estimate a. Schedule pair coding sessions to finish game refactoring. Once you've gotten someone else's strategy integrated, update the "Current Completion" portion of the progress sheet: https://docs.google.com/spreadsheets/d/1zUqn5OvF3_U3XJ_d25vtBiFkRB3RgSQSXNv6wga8aeI/edit?usp=sharing • Saturday: • Group 1: Riley, Colby, Elijah • Group 2: George, David • Sunday: • Group 1: Colby, David, Elijah • Group 2: Riley, George b. Create a class WeightedGraph where each edge has an edge weight. Include two methods calc_shortest_path and calc_distance that accomplish the same goals as in your Graph class. But since this is a weighted graph, the actual algorithms for accomplishing those goals are a bit different. • Initialize the WeightedGraph with a weights dictionary instead of an edges list. The edges list just had a list of edges, whereas the weights dictionary will have its keys as edges and its values as the weights of those edges. 
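A minimal stand-in for such a WeightedGraph, with calc_distance implemented via Dijkstra's algorithm as specified in the bullets that follow, might look like the sketch below (it treats each key of the weights dictionary as an undirected edge; your full class will also need calc_shortest_path):

```python
INF = 9999999999  # stand-in for infinity, as suggested in the problem

class WeightedGraph:
    def __init__(self, weights, vertex_values):
        # weights: dict with edges (a, b) as keys and edge weights as values
        self.vertex_values = vertex_values
        self.neighbors = {n: [] for n in range(len(vertex_values))}
        for (a, b), w in weights.items():
            self.neighbors[a].append((b, w))
            self.neighbors[b].append((a, w))

    def calc_distance(self, start, end):
        dvalues = {n: INF for n in self.neighbors}
        dvalues[start] = 0
        unvisited = set(self.neighbors)
        while unvisited:
            # the current node is the unvisited node with the smallest d-value
            current = min(unvisited, key=lambda n: dvalues[n])
            if current == end:
                return dvalues[end]  # terminal node's d-value is now final
            unvisited.remove(current)
            for neighbor, weight in self.neighbors[current]:
                if neighbor in unvisited:
                    # replace the d-value only if the new path is shorter
                    dvalues[neighbor] = min(dvalues[neighbor],
                                            dvalues[current] + weight)
        return dvalues[end]

weights = {(0,1): 3, (1,7): 4, (7,2): 2, (2,5): 1, (5,6): 8,
           (0,3): 2, (3,2): 6, (3,4): 1, (4,8): 8, (8,0): 4}
graph = WeightedGraph(weights, ['a','b','c','d','e','f','g','h','i'])
print([graph.calc_distance(8, n) for n in range(9)])
# [4, 7, 12, 6, 7, 13, 21, 11, 0]
```

This reproduces the expected output of the test given in the problem statement.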
• Implement the method calc_distance using Dijkstra's algorithm (https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Algorithm). This algorithm works by assigning all other nodes an initial d-value and then iteratively updating those d-values until they actually represent the distances to those nodes.
• Initial d-values: the initial node is assigned $0$; all other nodes are assigned $\infty$ (use a large number like $9999999999$). Set the current node to be the initial node.
• For each unvisited neighbor of the current node, compute (current node's d-value) + (edge weight). If this sum is less than the neighbor's d-value, then replace the neighbor's d-value with the sum.
• Update the current node to be the unvisited node that has the smallest d-value, and keep repeating the procedure until the terminal node has been visited. (Once the terminal node has been visited, its d-value is guaranteed to be correct.)

Important: a node is not considered visited until it has been set as a current node. Even if you updated the node's d-value at some point, the node is not visited until it is the current node.

• Test your code on the following example:

>>> weights = {
    (0,1): 3,
    (1,7): 4,
    (7,2): 2,
    (2,5): 1,
    (5,6): 8,
    (0,3): 2,
    (3,2): 6,
    (3,4): 1,
    (4,8): 8,
    (8,0): 4
}
>>> vertex_values = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
>>> weighted_graph = WeightedGraph(weights, vertex_values)
>>> weighted_graph.calc_distance(8,4)
7
>>> [weighted_graph.calc_distance(8,n) for n in range(9)]
[4, 7, 12, 6, 7, 13, 21, 11, 0]

# Problem 82-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

### PART 1¶

Location: assignment-problems

Skim the following section of http://learnyouahaskell.com/syntax-in-functions.
Let it be

Pay attention to the following example, especially:

calcBmis :: (RealFloat a) => [(a, a)] -> [a]
calcBmis xs = [bmi | (w, h) <- xs, let bmi = w / h ^ 2, bmi >= 25.0]

Create a Haskell file ProcessPoints.hs and write a function smallestDistances that takes a list of 3-dimensional points and returns the distances of any points that are within 10 units from the origin.

To check your function, print smallestDistances [(5,5,5), (3,4,5), (8,5,8), (9,1,4), (11,0,0), (12,13,14)]. You should get a result of [8.67, 7.07, 9.90].

• Note: The given result is shown to 2 decimal places. You don't have to round your result. I just didn't want to list out all the digits in the test.

### PART 2¶

Complete these C++/Shell/SQL coding challenges and submit screenshots.

• For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
• For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

C++

https://www.hackerrank.com/challenges/c-tutorial-struct/problem

Shell

https://www.hackerrank.com/challenges/text-processing-cut-7/problem
https://www.hackerrank.com/challenges/text-processing-cut-8/problem
https://www.hackerrank.com/challenges/text-processing-cut-9/problem
https://www.hackerrank.com/challenges/text-processing-head-1/problem
https://www.hackerrank.com/challenges/text-processing-head-2/tutorial

• Remember to check out the tutorial tabs.
• Note that if you want to start at index 2 and then go until the end of a line, you can just omit the ending index. For example, cut -c2- means print characters $2$ and onwards for each line in the file.
• Also remember the template cut -d',' -f2-4, which means print fields $2$ through $4$ for each line in the file, where the fields are separated by the delimiter ','.
• You can also look at this resource for some examples: https://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
• These problems are all one-liners.
If you find yourself spending more than a couple minutes on these, be sure to ask for help.

SQL

https://sqlzoo.net/wiki/SUM_and_COUNT (queries 6,7,8)
https://sqlzoo.net/wiki/The_JOIN_operation (queries 1,2,3)

# Problem 82-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

1. Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
2. ~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Repl.it link to WeightedGraph tests: _____
Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for space-empires repo: _____
Commit link for graph repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____

# Problem 81-1¶

Primary problems; 60% of assignment grade; 90 minutes estimate

a. If your game doesn't already do this, make it so that if a player commits an invalid move (such as moving off the grid), the game stops.

b.
Now that we've solved a bunch of issues in our games, it's time to slow down and focus on 1 strategy at a time.

• Schedule a pair coding session with your partner(s) below, sometime today or tomorrow. Let me know when you've scheduled it. During your session, make sure that your DumbStrategy passes their tests, and that their DumbStrategy passes your tests.
• Riley & David
• Elijah, Colby, & George
• Refactoring will go a lot faster if we do it synchronously in small groups, instead of doing asynchronous refactoring with the entire group. By doing small-group synchronous refactoring, it'll be easier to keep a stream of communication going until the DumbStrategy works.
• In case you need it: Problem 80-1 has templates of the game_state and the Strategy class.

c. Make the following adjustment to your random forest:

• In your random decision tree, create a max_depth parameter that stops splitting any nodes beyond the max_depth. For example, if max_depth = 2, then you would stop splitting a node once it is 2 units away from the root of the tree.
• A consequence of this is that the terminal nodes might not be pure. If a terminal node is impure, then it represents the majority class. (If there are equal amounts of each class, just choose randomly.) When you initialize the random forest, pass a max_depth parameter that, in turn, gets passed to the random decision trees.
• We've got a couple more adjustments to make, but I figured we should break up this task over multiple assignments since it's a bit of work and we've also got to keep making progress on the game refactoring.

# Problem 81-2¶

Supplemental problems; 30% of assignment grade; 60 minutes estimate

### PART 1¶

Location: assignment-problems

Observe the following example:

bmiTell :: (RealFloat a) => a -> a -> String
bmiTell weight height
    | bmi <= underweightThreshold = "The patient may be underweight. If this is the case, the patient should be recommended a higher-calorie diet."
    | bmi <= normalThreshold = "The patient may be at a normal weight."
    | otherwise = "The patient may be overweight. If this is the case, the patient should be recommended exercise and a lower-calorie diet."
    where bmi = weight / height ^ 2
          underweightThreshold = 18.5
          normalThreshold = 25.0

Create a Haskell file RecommendClothing.hs and write a function recommendClothing that takes the input degreesCelsius, converts it to degreesFahrenheit (multiply by $\dfrac{9}{5}$ and add $32$), and makes the following recommendations:

• If the temperature is $\geq 80 \, ^\circ \textrm{F}$, then recommend to wear a shortsleeve shirt.
• If the temperature is $> 65 \, ^\circ \textrm{F}$ but $< 80 \, ^\circ \textrm{F}$, then recommend to wear a longsleeve shirt.
• If the temperature is $> 50 \, ^\circ \textrm{F}$ but $< 65 \, ^\circ \textrm{F}$, then recommend to wear a sweater.
• If the temperature is $\leq 50 \, ^\circ \textrm{F}$, then recommend to wear a jacket.

### PART 2¶

Complete these C++/Shell/SQL coding challenges and submit screenshots.

• For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.
• For SQL, each screenshot should include the problem number, the successful smiley face, and your query.

C++

https://www.hackerrank.com/challenges/c-tutorial-strings/problem

• Note that you can slice strings like this: myString.substr(1, 3)

Shell

https://www.hackerrank.com/challenges/text-processing-cut-2/problem
https://www.hackerrank.com/challenges/text-processing-cut-3/problem
https://www.hackerrank.com/challenges/text-processing-cut-4/problem
https://www.hackerrank.com/challenges/text-processing-cut-5/problem
https://www.hackerrank.com/challenges/text-processing-cut-6/problem

• Here are some useful templates:
• cut -c2-4 means print characters $2$ through $4$ for each line in the file.
• cut -d',' -f2-4 means print fields $2$ through $4$ for each line in the file, where the fields are separated by the delimiter ','.
• You can also look at this resource for some examples: https://www.folkstalk.com/2012/02/cut-command-in-unix-linux-examples.html
• These problems are all one-liners. If you find yourself spending more than a couple minutes on these, be sure to ask for help.

SQL

https://sqlzoo.net/wiki/SUM_and_COUNT (queries 1,2,3,4,5)

# Problem 81-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

1. Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.
2. ~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.

### SUBMISSION TEMPLATE¶

For your submission, copy and paste your links into the following template:

Repl.it link to Haskell code: _____
Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____
Commit link for machine-learning repo: _____
Commit link for assignment-problems repo: _____
Issue 1: _____
Issue 2: _____

# Problem 80-1¶

Primary problems; 50% of assignment grade; 60 minutes estimate

a.
(i) In your game state, make the following updates: • Change "hits" to "hits_left" • Change "Homeworld" to "Colony" • Put another key in your game state: game_state["players"]["home_coords"] • Make sure there are no strings with spaces in them -- instead, we'll use underlines • Attack and defense tech starts at 0; movement, ship size, and shipyard tech all start at 1 • Colony ships are not affected by technology in general • Add "ship_size_needed" to the "unit_data" key in the game state. Otherwise, the player doesn't know what ship size technology it needs before it can buy a ship. • Change the output of decide_purchases to specify locations at which to build the ships, like this: { 'units': [{'type': 'Scout', 'coords': (2,1)}, {'type': 'Scout', 'coords': (2,1)}, {'type': 'Destroyer', 'coords': (2,1)}, 'technology': ['defense', 'attack', 'attack'] } The updated game state is shown below: game_state = { 'turn': 4, 'phase': 'Combat', # Can be 'Movement', 'Economic', or 'Combat' 'round': None, # if the phase is movement, then round is 1, 2, or 3 'player_whose_turn': 0, # index of player whose turn it is (or whose ship is attacking during battle), 'winner': None, 'players': [ {'cp': 9, 'home_coords': (6,3), 'units': [ {'coords': (5,10), 'type': 'Scout', 'hits_left': 1, 'technology': { 'attack': 1, 'defense': 0, 'movement': 3 }}, {'coords': (1,2), 'type': 'Destroyer', 'hits_left': 1, 'technology': { 'attack': 0, 'defense': 0, 'movement': 2 }}, {'coords': (6,0), 'type': 'Homeworld', 'hits_left': 2, 'turn_created': 0 }, {'coords': (5,3), 'type': 'Colony', 'hits_left': 1, 'turn created': 2 }], 'technology': {'attack': 1, 'defense': 0, 'movement': 3, 'shipsize': 1} }, {'cp': 15, 'home_coords': (0,3), 'units': [ {'coords': (1,2), 'type': 'Battlecruiser', 'hits_left': 1, 'technology': { 'attack': 0, 'defense': 0, 'movement': 1 }}, {'coords': (1,2), 'type': 'Scout', 'hits_left': 1, 'technology': { 'attack': 1, 'defense': 0, 'movement': 1 }}, {'coords': (5,10), 'type': 
'Scout', 'hits_left': 1, 'technology': { 'attack': 1, 'defense': 0, 'movement': 1 }}, {'coords': (6,12), 'type': 'Homeworld', 'hits_left': 3, 'turn_created': 0 }, {'coords': (5,10), 'type': 'Colony', 'hits_left': 3, 'turn_created': 1 }], 'technology': {'attack': 1, 'defense': 0, 'movement': 1, 'shipsize': 1} }], 'planets': [(5,3), (5,10), (1,2), (4,8), (9,1)], 'unit_data': { 'Battleship': {'cp_cost': 20, 'hullsize': 3, 'shipsize_needed': 5, 'tactics': 5, 'attack': 5, 'defense': 2, 'maintenance': 3}, 'Battlecruiser': {'cp_cost': 15, 'hullsize': 2, 'shipsize_needed': 4, 'tactics': 4, 'attack': 5, 'defense': 1, 'maintenance': 2}, 'Cruiser': {'cp_cost': 12, 'hullsize': 2, 'shipsize_needed': 3, 'tactics': 3, 'attack': 4, 'defense': 1, 'maintenance': 2}, 'Destroyer': {'cp_cost': 9, 'hullsize': 1, 'shipsize_needed': 2, 'tactics': 2, 'attack': 4, 'defense': 0, 'maintenance': 1}, 'Dreadnaught': {'cp_cost': 24, 'hullsize': 3, 'shipsize_needed': 6, 'tactics': 5, 'attack': 6, 'defense': 3, 'maintenance': 3}, 'Scout': {'cp_cost': 6, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 1, 'attack': 3, 'defense': 0, 'maintenance': 1}, 'Shipyard': {'cp_cost': 3, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 3, 'attack': 3, 'defense': 0, 'maintenance': 0}, 'Decoy': {'cp_cost': 1, 'hullsize': 0, 'shipsize_needed': 1, 'tactics': 0, 'attack': 0, 'defense': 0, 'maintenance': 0}, 'Colonyship': {'cp_cost': 8, 'hullsize': 1, 'shipsize_needed': 1, 'tactics': 0, 'attack': 0, 'defense': 0, 'maintenance': 0}, 'Base': {'cp_cost': 12, 'hullsize': 3, 'shipsize_needed': 2, 'tactics': 5, 'attack': 7, 'defense': 2, 'maintenance': 0}, }, 'technology_data': { # lists containing the price to purchase the next level 'shipsize': [0, 10, 15, 20, 25, 30], 'attack': [20, 30, 40], 'defense': [20, 30, 40], 'movement': [0, 20, 30, 40, 40, 40], 'shipyard': [0, 20, 30] } } The Strategy template is shown below:  class CombatStrategy: def __init__(self, player_index): self.player_index = player_index def
will_colonize_planet(self, coordinates, game_state): ... return either True or False def decide_ship_movement(self, unit_index, game_state): ... return a "translation" which is a tuple representing the direction in which the ship moves. # For example, if a unit located at (1,2) wanted to # move to (1,1), then the translation would be (0,-1). def decide_purchases(self, game_state): ... return { 'units': list of unit objects you want to buy, 'technology': list of technology attributes you want to upgrade } # for example, if you wanted to buy 2 Scouts, 1 Destroyer, # upgrade defense technology once, and upgrade attack # technology twice, you'd return # { # 'units': [{'type': 'Scout', 'coords': (2,1)}, # {'type': 'Scout', 'coords': (2,1)}, # {'type': 'Destroyer', 'coords': (2,1)}], # 'technology': ['defense', 'attack', 'attack'] # } def decide_removal(self, game_state): ... return the unit index of the ship that you want to remove. for example, if you want to remove the unit at index 2, return 2 def decide_which_unit_to_attack(self, combat_state, coords, attacker_index): # combat_state is a dictionary in the form coordinates : combat_order # { # (1,2): [{'player': 1, 'unit': 0}, # {'player': 0, 'unit': 1}, # {'player': 1, 'unit': 1}, # {'player': 1, 'unit': 2}], # (2,2): [{'player': 2, 'unit': 0}, # {'player': 3, 'unit': 1}, # {'player': 2, 'unit': 1}, # {'player': 2, 'unit': 2}] # } # attacker_index is the index of your unit, whose turn it is # to attack. ... return the index of the ship you want to attack in the combat order. # in the above example, if you want to attack player 1's unit 1, # then you'd return 2 because it corresponds to # combat_state['order'][2] def decide_which_units_to_screen(self, combat_state, coords): # again, the combat_state is the combat_state for the # particular battle ...
return the indices of the ships you want to screen in the combat order # in the above example, if you are player 1 and you want # to screen units 1 and 2, you'd return [2,3] because # the ships you want to screen are # combat_state['order'][2] and combat_state['order'][3] # NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY, # YOU CAN JUST RETURN AN EMPTY ARRAY (ii) Once your game state / strategies are ready to be tested, post on #machine-learning to let your classmates know. (iii) Run your classmates' strategies after they post that the strategies are ready. If there are any issues with their strategy, post on #machine-learning to let them know. I'm hoping that, possibly with a little back-and-forth fixing, we can have all the strategies working in everyone's games by the end of the long weekend. b. Create a steepest_descent_optimizer(n) optimizer for the 8 queens problem, which starts with the best of 100 random locations arrays, and on each iteration, repeatedly compares all possible next location arrays that result from moving one queen by one space, and chooses the one that results in the minimum cost. The algorithm will run for n iterations. Some clarifications: • By "starts with the best of 100 random locations arrays", I mean that you should start by generating 100 random locations arrays and selecting the lowest-cost array to be your initial locations array. • There are $8$ queens, and each queen can move in one of $8$ directions (up, down, left, right, or in a diagonal direction) unless one of those directions is blocked by another queen or invalid due to being off the board. • So, the number of possible "next location arrays" resulting from moving one queen by one space will be around $8 \times 8 = 64,$ though probably a little bit less. This means that on each iteration, you'll have to check about $64$ possible next location arrays and choose the one that minimizes the cost function. • If multiple configurations minimize the cost, randomly select one of them.
If every next configuration increases the cost, then terminate the algorithm and return the current locations. Important: We didn't discuss this in class, so be sure to post on Slack if you get confused on any part of this problem. Your function should again return the following dictionary: { 'locations': array that resulted in the lowest cost, 'cost': the actual value of that lowest cost } Print out the cost of your steepest_descent_optimizer for n=10,50,100,500,1000. Once you have those printouts, post it on Slack in the #results channel. # Problem 80-2¶ Supplemental problems; 40% of assignment grade; 60 minutes estimate ### PART 1¶ Location: assignment-problems/refactor_string_processing.py The following code is supposed to turn a string into an array. Currently, it's messy, and there's some subtle issues with the code. Clean up the code and get it to work. Some particular things to fix are: • Putting whitespace where appropriate • Naming variables clearly • Deleting any pieces of code that aren't necessary string = '"alpha","beta","gamma","delta"\n1,2,3,4\n5.0,6.0,7.0,8.0' strings = [x.split(',') for x in string.split('\n')] length_of_string = len(string) arr = [] for string in strings: newstring = [] if len(string) > 0: for char in string: if char[0]=='"' and char[-1]=='"': char = char[1:] elif '.' in char: char = int(char) else: char = float(char) newstring.append(char) arr.append(newstring) print(arr) --- What it should print: [['alpha', 'beta', 'gamma', 'delta'], [1, 2, 3, 4], [5.0, 6.0, 7.0, 8.0]] What actually happens: Traceback (most recent call last): File "datasets/myfile.py", line 10, in <module> char = int(char) ValueError: invalid literal for int() with base 10: '5.0' ### PART 2¶ Location: assignment-problems Skim the following section of http://learnyouahaskell.com/syntax-in-functions. Pattern matching Create Haskell file Fibonacci.hs and write a function nthFibonacciNumber that computes the nth Fibonacci number, starting with$n=0$. 
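As a companion to Problem 80-1b above, the steepest-descent loop can be sketched in Python as follows. This is a sketch only: calc_cost is the cost function from Problem 79, and the neighbor generation and tie-breaking follow the clarifications given there, with the legality checks simplified.

```python
import random

def calc_cost(locations):
    # number of pairs of queens sharing a row, column, or diagonal
    # (the cost function from Problem 79)
    cost = 0
    for i in range(len(locations)):
        for j in range(i + 1, len(locations)):
            (r1, c1), (r2, c2) = locations[i], locations[j]
            if r1 == r2 or c1 == c2 or abs(r1 - r2) == abs(c1 - c2):
                cost += 1
    return cost

def random_locations():
    # 8 queens on distinct squares of an 8x8 board
    squares = [(r, c) for r in range(8) for c in range(8)]
    return random.sample(squares, 8)

def neighbors(locations):
    # every configuration reachable by moving one queen one space in any
    # of the 8 directions, staying on the board and off occupied squares
    result = []
    for i, (r, c) in enumerate(locations):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue
                moved = (r + dr, c + dc)
                on_board = 0 <= moved[0] < 8 and 0 <= moved[1] < 8
                if on_board and moved not in locations:
                    result.append(locations[:i] + [moved] + locations[i + 1:])
    return result

def steepest_descent_optimizer(n):
    # start with the best of 100 random locations arrays
    best = min((random_locations() for _ in range(100)), key=calc_cost)
    for _ in range(n):
        options = neighbors(best)
        lowest = min(calc_cost(option) for option in options)
        if lowest > calc_cost(best):
            break  # every move increases the cost -- terminate early
        # break ties among the minimizers at random
        best = random.choice(
            [option for option in options if calc_cost(option) == lowest])
    return {'locations': best, 'cost': calc_cost(best)}
```

Your own neighbor enumeration and tie-breaking may reasonably differ; the key invariant is that the cost never increases from one iteration to the next.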
Remember that the Fibonacci sequence is $0,1,1,2,3,5,8,\ldots$ where each number comes from adding the previous two. To check your function, print nthFibonacciNumber 20. You should get a result of 6765. Note: This part of the section will be very useful, since it talks about how to write a recursive function.
factorial :: (Integral a) => a -> a
factorial 0 = 1
factorial n = n * factorial (n - 1)
### PART 3¶ Complete these C++/Shell/SQL coding challenges and submit screenshots. • For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator. • For SQL, each screenshot should include the problem number, the successful smiley face, and your query. C++ https://www.hackerrank.com/challenges/arrays-introduction/problem • Note that when the input is in the form of numbers separated by a space, you can read it into an array: for (int i=0; i<n; i++) { cin >> a[i]; } You can read the array out in a similar way. Shell https://www.hackerrank.com/challenges/text-processing-cut-1/problem • Tip: for this problem, you can read input lines from a file using the following syntax:
while read line
do
(your code here)
done
Again, be sure to check out the top-right "Tutorial" tab. SQL https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial (queries 9,10) # Problem 80-3¶ Review; 10% of assignment grade; 15 minutes estimate Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits. Additionally, do the following: 1. Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created. 2. ~Resolve an issue that has been made on your own GitHub repository.
When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them. ### SUBMISSION TEMPLATE¶ For your submission, copy and paste your links into the following template: Repl.it link to Haskell code: _____ Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____ Commit link for machine-learning repo: ____ Commit link for space-empires repo: ____ Issue 1: _____ Issue 2: _____ # Problem 79-1¶ Primary problems; 45% of assignment grade; 90 minutes estimate ### PART 1¶ Adjustments to the game... • If you have grid_size, rename to board_size. It should be a tuple (x,y) instead of just 1 integer • Updates to combat strategy: • decide_purchases, the units should be strings (not objects) • Updates to game state: • Change "location" to "coords" • Unit types should be strings • Add a dictionaries unit_data and technology_data • Person-specific corrections (if you haven't addressed these already): • Colby • The conventions for his shipyard/colonyship class naming are incorrect: He has an underscore, he uses something called 'movement_round' in the gamestate. 
• There are a ton of other really small naming differences in the gamestate, like a player's units vs ships, Board_size vs grid_size • Riley • Board_size vs grid_size • decide_which_ship_to_attack arguments are reversed • combat_state is in a different format • George • His unit folder is called "units" (plural) • Shipyard and colonyship filenames don't have underscores • Shipyard vs ShipYard • coords vs pos in ship state • name vs type in ship state • David • It looks like he passes in some datastruct instead of a dictionary and accesses properties using a dot instead of square brackets. • He used some non-Python syntax (++ increment) in his CombatStrategy • In decide_ship_movement he has the ship index as the attribute, which is correct, but then he uses ship.coordinates as if he is handling the object. game_state = { 'turn': 4, 'phase': 'Combat', # Can be 'Movement', 'Economic', or 'Combat' 'round': None, # if the phase is movement, then round is 1, 2, or 3 'player_whose_turn': 0, # index of player whose turn it is (or whose ship is attacking during battle), 'winner': None, 'players': [ {'cp': 9, 'units': [ {'coords': (5,10), 'type': 'Scout', 'hits': 0, 'technology': { 'attack': 1, 'defense': 0, 'movement': 3 }}, {'coords': (1,2), 'type': 'Destroyer', 'hits': 0, 'technology': { 'attack': 0, 'defense': 0, 'movement': 2 }}, {'coords': (6,0), 'type': 'Homeworld', 'hits': 0, 'turn_created': 0 }, {'coords': (5,3), 'type': 'Colony', 'hits': 0, 'turn_created': 2 }], 'technology': {'attack': 1, 'defense': 0, 'movement': 3, 'shipsize': 0} }, {'cp': 15, 'units': [ {'coords': (1,2), 'type': 'Battlecruiser', 'hits': 1, 'technology': { 'attack': 0, 'defense': 0, 'movement': 1 }}, {'coords': (1,2), 'type': 'Scout', 'hits': 0, 'technology': { 'attack': 1, 'defense': 0, 'movement': 1 }}, {'coords': (5,10), 'type': 'Scout', 'hits': 0, 'technology': { 'attack': 1, 'defense': 0, 'movement': 1 }}, {'coords': (6,12), 'type': 'Homeworld', 'hits': 0, 'turn_created': 0 }, {'coords':
(5,10), 'type': 'Colony', 'turn_created': 1 }], 'technology': {'attack': 1, 'defense': 0, 'movement': 1, 'shipsize': 1} }], 'planets': [(5,3), (5,10), (1,2), (4,8), (9,1)], 'unit_data': { 'Battleship': {'cp_cost': 20, 'hullsize': 3, 'shipsize_needed': 5}, 'Battlecruiser': {'cp_cost': 15, 'hullsize': 2, 'shipsize_needed': 4}, 'Cruiser': {'cp_cost': 12, 'hullsize': 2, 'shipsize_needed': 3}, 'Destroyer': {'cp_cost': 9, 'hullsize': 1, 'shipsize_needed': 2}, 'Dreadnaught': {'cp_cost': 24, 'hullsize': 3, 'shipsize_needed': 6}, 'Scout': {'cp_cost': 6, 'hullsize': 1, 'shipsize_needed': 1}, 'Shipyard': {'cp_cost': 3, 'hullsize': 1, 'shipsize_needed': 1}, 'Decoy': {'cp_cost': 1, 'hullsize': 0, 'shipsize_needed': 1}, 'Colonyship': {'cp_cost': 8, 'hullsize': 1, 'shipsize_needed': 1}, 'Base': {'cp_cost': 12, 'hullsize': 3, 'shipsize_needed': 2}, }, 'technology_data': { # lists containing price to purchase the next level level 'shipsize': [10, 15, 20, 25, 30], 'attack': [20, 30, 40], 'defense': [20, 30, 40], 'movement': [20, 30, 40, 40, 40], 'shipyard': [20, 30] } }  class CombatStrategy: def __init__(self, player_index): self.player_index = player_index def will_colonize_planet(self, coordinates, game_state): ... return either True or False def decide_ship_movement(self, unit_index, game_state): ... return a "translation" which is a tuple representing the direction in which the ship moves. # For example, if a unit located at (1,2) wanted to # move to (1,1), then the translation would be (0,-1). def decide_purchases(self, game_state): ... return { 'units': list of unit objects you want to buy, 'technology': list of technology attributes you want to upgrade } # for example, if you wanted to buy 2 Scouts, 1 Destroyer, # upgrade defense technology once, and upgrade attack # technology twice, you'd return # { # 'units': ['Scout', 'Scout', 'Destroyer'], # 'technology': ['defense', 'attack', 'attack'] # } def decide_removal(self, game_state): ... 
return the unit index of the ship that you want to remove. for example, if you want to remove the unit at index 2, return 2 def decide_which_unit_to_attack(self, combat_state, coords, attacker_index) # combat_state is a dictionary in the form coordinates : combat_order # { # (1,2): [{'player': 1, 'unit': 0}, # {'player': 0, 'unit': 1}, # {'player': 1, 'unit': 1}, # {'player': 1, 'unit': 2}], # (2,2): [{'player': 2, 'unit': 0}, # {'player': 3, 'unit': 1}, # {'player': 2, 'unit': 1}, # {'player': 2, 'unit': 2}] # } # attacker_index is the index of your unit, whose turn it is # to attack. ... return the index of the ship you want to attack in the combat order. # in the above example, if you want to attack player 1's unit 1, # then you'd return 2 because it corresponds to # combat_state['order'][2] def decide_which_units_to_screen(self, combat_state): # again, the combat_state is the combat_state for the # particular battle ... return the indices of the ships you want to screen in the combat order # in the above example, if you are player 1 and you want # to screen units 1 and 2, you'd return [2,3] because # the ships you want to screen are # combat_state['order'][2] and combat_state['order'][3] # NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY, # YOU CAN JUST RETURN AN EMPTY ARRAY ### PART 2¶ Location: machine-learning/analysis/8_queens.py We're going to be exploring approaches to solving the 8-queens problem on the next couple assignments. The 8-queens problem is a challenge to place 8 queens on a chess board in a way that none can attack each other. Remember that in chess, queens can attack any piece that is on the same row, column, or diagonal. So, the 8-queens problem is to place 8 queens on a chess board so that none of them are on the same row, column, or diagonal. a. 
Write a function show_board(locations) that takes a list of locations of 8 queens and prints out the corresponding board by placing periods in empty spaces and the index of the location in any space occupied by a queen. >>> locations = [(0,0), (6,1), (2,2), (5,3), (4,4), (7,5), (1,6), (2,6)] >>> show_board(locations)
0  .  .  .  .  .  .  .
.  .  .  .  .  .  6  .
.  .  2  .  .  .  7  .
.  .  .  .  .  .  .  .
.  .  .  .  4  .  .  .
.  .  .  3  .  .  .  .
.  1  .  .  .  .  .  .
.  .  .  .  .  5  .  .
Tip: To print out a row, you can first construct it as an array and then print the corresponding string, which consists of the array entries separated by two spaces: >>> row_array = ['0', '.', '.', '.', '.', '.', '.', '.'] >>> row_string = '  '.join(row_array) # note that '  ' is TWO spaces >>> print(row_string) 0  .  .  .  .  .  .  . b. Write a function calc_cost(locations) that computes the "cost", i.e. the number of pairs of queens that are on the same row, column, or diagonal. For example, in the board above, the cost is 10: 1. Queen 2 and queen 7 are on the same row 2. Queen 6 and queen 7 are on the same column 3. Queen 0 and queen 2 are on the same diagonal 4. Queen 0 and queen 4 are on the same diagonal 5. Queen 2 and queen 4 are on the same diagonal 6. Queen 3 and queen 4 are on the same diagonal 7. Queen 4 and queen 7 are on the same diagonal 8. Queen 3 and queen 7 are on the same diagonal 9. Queen 1 and queen 6 are on the same diagonal 10. Queen 3 and queen 5 are on the same diagonal Verify that the cost of the above configuration is 10: >>> calc_cost(locations) 10 Tip 1: It will be easier to debug your code if you write several helper functions -- one which takes two coordinate pairs and determines whether they're on the same row, another which determines whether they're on the same column, another which determines if they're on the same diagonal. Tip 2: To check if two locations are on the same diagonal, you can compute the slope between those two points and check if the slope comes out to $1$ or $-1.$ c.
Write a function random_optimizer(n) that generates n random locations arrays for the 8 queens, and returns the following dictionary: { 'locations': array that resulted in the lowest cost, 'cost': the actual value of that lowest cost } Then, print out the cost of your random_optimizer for n=10,50,100,500,1000. Once you have those printouts, post it on Slack in the #results channel. # Problem 79-2¶ Supplemental problems; 45% of assignment grade; 60 minutes estimate ### PART 1¶ Location: assignment-problems/refactor_linear_regressor.py The following code is taken from a LinearRegressor class. While most of the code will technically work, there may be a couple subtle issues, and the code is difficult to read. Refactor this code so that it is more readable. It should be easy to glance at and understand what's going on. Some particular things to fix are: • Putting whitespace where appropriate • Naming variables clearly • Expanding out complicated one-liners • Deleting any pieces of code that aren't necessary Important: • You don't have to actually run the code. This is just an exercise in improving code readability. You just need to copy and paste the code below into a file and clean it up. • Don't spend more than 20 min on this problem. You should fix the things that jump out at you as messy, but don't worry about trying to make it absolutely perfect.  
def calculate_coefficients(self):
    final_dict = {}
    mat = [[1 for x in list(self.df.data_dict.values())[0][0]]]
    mat_dict = {}
    for key in self.df.data_dict:
        if key != self.dependent_variable:
            mat_dict[key] = self.df.data_dict[key]
    for row in range(len(mat_dict)):
        mat.append(list(self.df.data_dict.values())[row][0])
    mat = Matrix(mat)
    mat = mat.transpose()
    mat_t = mat.transpose()
    mat_mult = mat_t.matrix_multiply(mat)
    mat_inv = mat_mult.inverse()
    mat_pseudoinv = mat_inv.matrix_multiply(mat_t)
    multiplier = [[num] for num in list(self.df.data_dict.values())[1][0]]
    multiplier_mat = mat_pseudoinv.matrix_multiply(Matrix(multiplier))
    for num in range(len(multiplier_mat.elements)):
        if num == 0:
            key = 'constant'
        else:
            key = list(self.df.data_dict.keys())[num-1]
        final_dict[key] = [row[0] for row in multiplier_mat.elements][num]
    return final_dict
### PART 2¶ Location: assignment-problems Skim the following section of http://learnyouahaskell.com/syntax-in-functions. Pattern matching Create Haskell file CrossProduct.hs and write a function crossProduct in it that takes two 3-dimensional tuples, (x1,x2,x3) and (y1,y2,y3), as input and computes the cross product. To check your function, print crossProduct (1,2,3) (3,2,1). You should get a result of (-4,8,-4). Note: This part of the section will be very useful:
addVectors :: (Num a) => (a, a) -> (a, a) -> (a, a)
addVectors (x1, y1) (x2, y2) = (x1 + x2, y1 + y2)
Note that the top line just states the "type" of addVectors. This line says that addVectors works with Numbers a, and it takes two inputs of the form (a, a) and (a, a) and gives an output of the form (a, a). Here, a just stands for the type, Number. ### PART 3¶ Complete these C++/Shell/SQL coding challenges and submit screenshots. • For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator. • For SQL, each screenshot should include the problem number, the successful smiley face, and your query.
C++ https://www.hackerrank.com/challenges/c-tutorial-pointer/problem • Don't overthink this one. The solution is very, very short. Be sure to ask if you have trouble. Shell https://www.hackerrank.com/challenges/bash-tutorials---arithmetic-operations/problem • Be sure to check out the top-right "Tutorial" tab to read about the commands necessary to solve this problem. SQL # Problem 79-3¶ Review; 10% of assignment grade; 15 minutes estimate Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of$0$until you resubmit with links to your commits. Additionally, do the following: 1. Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created. 2. ~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them. 
### SUBMISSION TEMPLATE¶ For your submission, copy and paste your links into the following template: PART 1 repl.it link for space-empires refactoring: ____ repl.it link for 8 queens: ____ PART 2 refactor_linear_regressor repl.it link: _____ Repl.it link to Haskell code: _____ Link to C++/Shell/SQL screenshots (Overleaf or Google Doc): _____ PART 3 Issue 1: _____ Issue 2: _____ # Problem 78-1¶ Primary problems; 45% of assignment grade; 30-75 minutes estimate ### Part 1¶ Make sure that Problem 77-1-a is done so that we can discuss the results next class. If you've already finished this, you can submit the same link that you did for Problem 77. Note: your table should have only 5 entries, exactly 1 entry for each model. For each model, you should count all the correct predictions (over all train-test splits) and divide by the total number of predictions (over all train-test splits). Also note that, altogether, it will probably take 5 minutes to train the models on all the splits. This is because we've implemented the simplest version of a random forest that could possibly be conceived, and it's really inefficient. We will make it more efficient next time. ### Part 2¶ For each classmate, make a list of specific things (if any) that they have to fix in their strategies in order for them to seamlessly integrate into our game. Next time, we will aggregate and discuss all these fixes and hopefully our strategies will integrate seamlessly after that. # Problem 78-2¶ Supplemental problems; 60% of assignment grade; 75 minutes estimate ### PART 1¶ Recall the standard normal distribution: $$p(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$$ Previously, you wrote a function calc_standard_normal_probability(a,b) using a Riemann sum with step size 0.001.
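That earlier fixed-step version might have looked something like this (a sketch only, assuming a left-endpoint Riemann sum; your original implementation may have used a different rule):

```python
import math

def standard_normal_pdf(x):
    # p(x) = 1/sqrt(2*pi) * e^(-x^2/2)
    return math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

def calc_standard_normal_probability(a, b):
    # left-endpoint Riemann sum with a fixed step size of 0.001
    step = 0.001
    total = 0.0
    x = a
    while x < b:
        total += standard_normal_pdf(x) * step
        x += step
    return total
```

For example, calc_standard_normal_probability(0, 1) should come out close to 0.3413, the standard normal probability of the interval $[0, 1]$.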
Now, you will generalize the function: • use an arbitrary number $n$ of subintervals (the step size will be $(b-a)/n$) • allow 5 different rules for computing the sum ("left endpoint", "right endpoint", "midpoint", "trapezoidal", "simpson") The resulting function will be calc_standard_normal_probability(a,b,n,rule). Note: The rules are from AP Calc BC. They are summarized below for a partition $\{ x_0, x_1, \ldots, x_n \}$ and step size $\Delta x.$ \begin{align*} \textrm{Left endpoint rule} &= \Delta x \left[ f(x_0) + f(x_1) + \ldots + f(x_{n-1}) \right] \\[7pt] \textrm{Right endpoint rule} &= \Delta x \left[ f(x_1) + f(x_2) + \ldots + f(x_{n}) \right] \\[7pt] \textrm{Midpoint rule} &= \Delta x \left[ f \left( \dfrac{x_0+x_1}{2} \right) + f \left( \dfrac{x_1+x_2}{2} \right) + \ldots + f\left( \dfrac{x_{n-1}+x_{n}}{2} \right) \right] \\[7pt] \textrm{Trapezoidal rule} &= \Delta x \left[ 0.5f(x_0) + f(x_1) + f(x_2) + \ldots + f(x_{n-1}) + 0.5f(x_{n}) \right] \\[7pt] \textrm{Simpson's rule} &= \dfrac{\Delta x}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + \ldots + 4f(x_{n-1}) + f(x_{n}) \right] \end{align*} For each rule, estimate $P(0 \leq x \leq 1)$ by making a plot of the estimate versus the number of subintervals for the even numbers $n \in \{ 2, 4, 6, \ldots, 100 \}.$ The resulting graph should look something like this. Post your plot on #computation-and-modeling once you've got it. ### PART 2¶ Location: assignment-problems Skim the following section of http://learnyouahaskell.com/starting-out. Texas ranges I'm a list comprehension Create Haskell file ComplicatedList.hs and write a function calcList in it that takes an input number n and counts the number of ordered pairs [x,y] that satisfy $-n \leq x,y \leq n$ and $x-y \leq \dfrac{xy}{2} \leq x+y$ and $x,y \notin \{ -2, -1, 0, 1, 2 \}.$ This function should generate a list comprehension and then count the length of that list. To check your function, print calcList 50.
You should get a result of$16.$### PART 3¶ Complete these C++/Shell/SQL coding challenges and submit screenshots. https://www.hackerrank.com/challenges/c-tutorial-for-loop/problem https://www.hackerrank.com/challenges/c-tutorial-functions/problem https://www.hackerrank.com/challenges/bash-tutorials---comparing-numbers/problem https://www.hackerrank.com/challenges/bash-tutorials---more-on-conditionals/problem https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial (queries 4,5,6) • For C++/Shell, each screenshot should include your username, the problem title, and the "Status: Accepted" indicator. • For SQL, each screenshot should include the problem number, the successful smiley face, and your query. • Here's a helpful example of some bash syntax. (The spaces on the inside of the brackets are really important! It won't work if you remove the spaces, i.e. [$n -gt 100])

read n
if [ $n -gt 100 ] || [$n -lt -100 ]
then
echo What a large number.
else
echo The number is smol.
if [ $n -eq 13 ]
then
echo And it\'s unlucky!!!
fi
fi
### PART 4¶ a. b. • Remember that for a probability distribution $f(x),$ the cumulative distribution function (CDF) is $F(x) = P(X \leq x) = \displaystyle \int_{-\infty}^x f(x) \, \textrm dx.$ • Remember that $EX$ means $\textrm E[X].$ # Problem 78-3¶ Review; 10% of assignment grade; 15 minutes estimate Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits. Additionally, do the following: 1. Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created. 2. ~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them. ### SUBMISSION TEMPLATE¶ For your submission, copy and paste your links into the following template: Commit link to machine-learning repo (if any changes were required): _____ Repl.it link to Haskell code: _____ Commit link for assignment-problems repo: _____ Link to C++/SQL screenshots (Overleaf or Google Doc): _____ Link to probability solutions (on Overleaf): _____ Issue 1: _____ Issue 2: _____ # Problem 77-1¶ Primary problems; 45% of assignment grade; 75 minutes estimate a. You'll need to do part 1 of the supplemental problem before you do this problem.
(i) Download the freshman_lbs.csv dataset from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, read it into a DataFrame, and create 5 test-train splits: 1. Testing data = first 20% of the records, training data = remaining 80% 2. Testing data = second 20% of the records, training data = remaining 80% 3. Testing data = third 20% of the records, training data = remaining 80% 4. Testing data = fourth 20% of the records, training data = remaining 80% 5. Testing data = fifth 20% of the records, training data = remaining 80% Note that you'll need to convert the appropriate entries to numbers (instead of strings) in the dataset. There are 2 options for doing this: • Option 1: don't worry about fixing the format within the read_csv method. Just do something like df = df.apply('weight', lambda x: int(x)) afterwards, before you pass the dataframe into your model. • Option 2: when you read in the csv, after you do the lines = file.read().split('\n') entries = [line.split(',') for line in lines] thing, you can loop through the entries, and if entry[0]+entry[-1] == '""', then you can set entry = entry[1:-1] to remove the quotes. Otherwise, if entry[0]+entry[-1] != '""', then you can try to do entry = float(entry[1:-1]). (ii) For each test-train split, fit each of the following models on the training data and use it to predict the sexes on the testing data. (You are predicting sex as a function of weight and BMI, and you can just use columns corresponding to September data.) • Decision tree using Gini split criterion • A single random decision tree • Random forest with 10 trees • Random forest with 100 trees • Random forest with 1000 trees (iii) For each model, compute the accuracy (count the total number of correct classifications and divide by the total number of classifications). Put these results in a table in an Overleaf document. 
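The five contiguous test-train splits described in part (i) can be built along these lines. This is a sketch over a plain list of records; adapt it to your own DataFrame interface, and note that the last fold simply absorbs any remainder when the record count is not divisible by 5.

```python
def five_fold_splits(records):
    # each split tests on one contiguous 20% chunk of the records
    # and trains on the remaining 80%
    n = len(records)
    fold_size = n // 5
    splits = []
    for k in range(5):
        start = k * fold_size
        end = (k + 1) * fold_size if k < 4 else n  # last fold takes the remainder
        splits.append({
            'test': records[start:end],
            'train': records[:start] + records[end:],
        })
    return splits
```

Each record then appears in exactly one test set across the five splits, which is what makes the accuracy computation in part (iii) cover the whole dataset.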
Note that the total number of classifications should be equal to the total number of records in the dataset (you did 5 test-train splits, and each test-train split involved testing on 20% of the data).

(iv) Below the table, analyze the results. Did you expect these results, or did they surprise you? Why do you think you got the results you did?

b. For each of your classmates, copy over their DumbStrategy and CombatStrategy and run your DumbPlayer/CombatPlayer tests using your classmate's strategy. Fill out the following information for each classmate:

• Name of classmate
• When you copied over their DumbStrategy and ran your DumbPlayer tests, did they pass? If not, then what's the issue? Is it a problem with your game, or with their strategy class?
• When you copied over their CombatStrategy and ran your CombatPlayer tests, did they pass? If not, then what's the issue? Is it a problem with your game, or with their strategy class?

# Problem 77-2¶

Supplemental problems; 45% of assignment grade; 75 minutes estimate

### PART 1¶

In your machine-learning repository, create a folder machine-learning/datasets/. Go to https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, download the file airtravel.csv, and put it in your datasets/ folder.

In Python, you can read a csv as follows:

>>> path_to_datasets = '/home/runner/machine-learning/datasets/'
>>> filename = 'airtravel.csv'
>>> with open(path_to_datasets + filename, "r") as file:
...     print(file.read())

"Month", "1958", "1959", "1960"
"JAN", 340, 360, 417
"FEB", 318, 342, 391
"MAR", 362, 406, 419
"APR", 348, 396, 461
"MAY", 363, 420, 472
"JUN", 435, 472, 535
"JUL", 491, 548, 622
"AUG", 505, 559, 606
"SEP", 404, 463, 508
"OCT", 359, 407, 461
"NOV", 310, 362, 390
"DEC", 337, 405, 432

Write a @classmethod called DataFrame.from_csv(path_to_csv, header=True) that constructs a DataFrame from a csv file (similar to how DataFrame.from_array(arr) constructs the DataFrame from an array).
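A possible sketch of from_csv is below. The DataFrame internals shown here (storing columns and rows, and delegating to from_array) are illustrative assumptions, not the required implementation; note that entries keep their surrounding quote characters, matching the test below.

```python
class DataFrame:
    # Minimal stand-in for the course's DataFrame class; only the pieces
    # needed to illustrate from_csv are shown, and they are assumptions.
    def __init__(self, data, columns):
        self.data = data
        self.columns = columns

    @classmethod
    def from_array(cls, arr, columns):
        return cls(arr, columns)

    def to_array(self):
        return [self.columns] + [row[:] for row in self.data]

    @classmethod
    def from_csv(cls, path_to_csv, header=True):
        with open(path_to_csv, "r") as file:
            lines = [line for line in file.read().split('\n') if line != '']
        # split each line on commas and strip surrounding whitespace only
        # (quote characters are deliberately kept, e.g. '"Month"')
        entries = [[entry.strip() for entry in line.split(',')]
                   for line in lines]
        if header:
            return cls.from_array(entries[1:], columns=entries[0])
        return cls.from_array(entries, columns=list(range(len(entries[0]))))
```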
Test your method as follows:

>>> path_to_datasets = '/home/runner/machine-learning/datasets/'
>>> filename = 'airtravel.csv'
>>> filepath = path_to_datasets + filename
>>> df = DataFrame.from_csv(filepath, header=True)
>>> df.to_array()
[['"Month"', '"1958"', '"1959"', '"1960"'],
 ['"JAN"', '340', '360', '417'],
 ['"FEB"', '318', '342', '391'],
 ['"MAR"', '362', '406', '419'],
 ['"APR"', '348', '396', '461'],
 ['"MAY"', '363', '420', '472'],
 ['"JUN"', '435', '472', '535'],
 ['"JUL"', '491', '548', '622'],
 ['"AUG"', '505', '559', '606'],
 ['"SEP"', '404', '463', '508'],
 ['"OCT"', '359', '407', '461'],
 ['"NOV"', '310', '362', '390'],
 ['"DEC"', '337', '405', '432']]

### PART 2¶

Location: assignment-problems

Skim the following section of http://learnyouahaskell.com/starting-out:

• An intro to lists

Create a Haskell file ListProcessing.hs and write a function prodFirstLast in Haskell that takes an input list arr and computes the product of the first and last elements of the list. Then, apply this function to the input [4,2,8,5].

Tip: use the !! operator and the length function.

Your file will look like this:

prodFirstLast arr = (your code here)

main = print (prodFirstLast [4,2,8,5])

Note that, to print out an integer, we use print instead of putStrLn. (You can also use print for most strings. The difference is that putStrLn can show non-ASCII characters like "я" whereas print cannot.)

Run your function and make sure it gives the desired output (which is 20).

### PART 3¶

a. Complete these introductory C++ coding challenges and submit screenshots:

https://www.hackerrank.com/challenges/c-tutorial-basic-data-types/problem
https://www.hackerrank.com/challenges/c-tutorial-conditional-if-else/problem

b.
Complete these Bash coding challenges and submit screenshots:

https://www.hackerrank.com/challenges/bash-tutorials---a-personalized-echo/problem
https://www.hackerrank.com/challenges/bash-tutorials---the-world-of-numbers/problem

(Each screenshot should include your username, the problem title, and the "Status: Accepted" indicator.)

c. Complete SQL queries 1-3 here and submit screenshots: https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial

(Each screenshot should include the problem number, the successful smiley face, and your query.)

### PART 4¶

a. As we will see in the near future, the standard normal distribution comes up A LOT in the context of statistics. It is defined as

$$p(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$

The reason why we haven't encountered it until now is that it's difficult to integrate. In practice, it's common to use a pre-computed table of values to look up probabilities from this distribution.

The actual problem: Write a function calc_standard_normal_probability(a,b) to approximate $P(a \leq X \leq b)$ for the standard normal distribution, using a Riemann sum with step size 0.001.

To check your function, print out estimates of the following probabilities:

• $P(-1 \leq X \leq 1)$
• $P(-2 \leq X \leq 2)$
• $P(-3 \leq X \leq 3)$

Your estimates should come out close to 0.68, 0.955, and 0.997, respectively. (https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)

b.

• "CDF" stands for Cumulative Distribution Function. The CDF of a probability distribution $f(x)$ is defined as

$$F(x) = P(X \leq x) = \int_{-\infty}^x f(x) \, \textrm dx.$$

• Your answer for the CDF will be a piecewise function (3 pieces).

• $EX$ means $E[X].$

c.

# Problem 77-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

1.
Make 2 GitHub issues on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include the links to the issues you created.

2. ~Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)~ Let's actually hold off on this bit for the next couple of weeks, so that we can build up an inventory of issues on our repositories. Then, once we have an inventory of 5-10 issues to choose from each time, we can start resolving them.

# Problem 76-1¶

Primary problems; 40% of assignment grade; 60 minutes estimate

a. Create a RandomForest class in machine-learning/src/random-forest that is initialized with a value n that represents the number of random decision trees to use. The RandomForest should have a fit() method and a predict() method, just like the DecisionTree.

• The fit() method should fit all the random decision trees.

• The predict() method should get a prediction from each random decision tree, and then return the prediction that occurred most frequently. (If there are multiple predictions that occurred most frequently, then choose randomly among them.)

So it should work like this:

rf = RandomForest(10)    # random forest consisting of 10 random trees
rf.fit(df)               # fit all 10 of those trees to the dataframe
rf.predict(observation)  # have each of the 10 trees make a prediction, and
                         # return the majority vote of the 10 trees

b. Refactor the combat_state in your game.
Previously, it looked like this:

[
    {'location': (1,2),
     'order': [{'player': 1, 'unit': 0},
               {'player': 0, 'unit': 1},
               {'player': 1, 'unit': 1}]
    },
    {'location': (5,10),
     'order': [{'player': 0, 'unit': 0},
               {'player': 1, 'unit': 2},
               {'player': 1, 'unit': 4}]
    }
]

Now, we will refactor the above into this:

{
    (1,2): [{'player': 1, 'unit': 0},
            {'player': 0, 'unit': 1},
            {'player': 1, 'unit': 1},
            {'player': 1, 'unit': 2}],
    (2,2): [{'player': 2, 'unit': 0},
            {'player': 3, 'unit': 1},
            {'player': 2, 'unit': 1},
            {'player': 2, 'unit': 2}]
}

As a result, we will also have to update the inputs to decide_which_unit_to_attack. Originally, the inputs were as follows:

decide_which_unit_to_attack(self, combat_state, attacker_index)

Now, we will have to include an additional input location as follows:

decide_which_unit_to_attack(self, combat_state, location, attacker_index)

c. Refactor your decide_removals function into a function decide_removal (singular, not plural) that returns the index of a single ship to remove. So, it will return a single integer instead of an array. Then, refactor your game so that it calls decide_removal repeatedly until no more removals are required. This will prevent a situation in which our game crashes because a player did not remove enough ships.

def decide_removal(self, game_state):
    ...
    # return the unit index of the ship that you want to remove.
    # for example, if you want to remove the unit at index 2, return 2

# Problem 76-2¶

Supplemental problems; 50% of assignment grade; 75 minutes estimate

PART 1

Location: assignment-problems

Write a function random_draw(distribution) that draws a random number from the probability distribution. Assume that the distribution is an array such that distribution[i] represents the probability of drawing i.
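One possible sketch, using the cumulative-distribution approach spelled out in this problem:

```python
import random

def random_draw(distribution):
    # 1. turn the distribution into a cumulative distribution
    cumulative = []
    total = 0
    for p in distribution:
        total += p
        cumulative.append(total)
    # 2. choose a random number between 0 and 1
    r = random.random()
    # 3. return the index of the first cumulative value greater than r
    for i, c in enumerate(cumulative):
        if c > r:
            return i
    return len(distribution) - 1  # guard against floating-point round-off
```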
Here are some examples:

• random_draw([0.5, 0.5]) will return 0 or 1 with equal probability
• random_draw([0.25, 0.25, 0.5]) will return 0 a quarter of the time, 1 a quarter of the time, and 2 half of the time
• random_draw([0.05, 0.2, 0.15, 0.3, 0.1, 0.2]) will return 0 5% of the time, 1 20% of the time, 2 15% of the time, 3 30% of the time, 4 10% of the time, and 5 20% of the time

The way to implement this is to

1. turn the distribution into a cumulative distribution,
2. choose a random number between 0 and 1, and then
3. find the index of the first value in the cumulative distribution that is greater than the random number.

Distribution: [0.05, 0.2, 0.15, 0.3, 0.1, 0.2]
Cumulative distribution: [0.05, 0.25, 0.4, 0.7, 0.8, 1.0]

Choose a random number between 0 and 1: 0.77431

The first value in the cumulative distribution that is greater than 0.77431 is 0.8. This corresponds to the index 4. So, return 4.

To test your function, generate 1000 random numbers from each distribution and ensure that their average is close to the true expected value of the distribution. In other words, for each of the following distributions, print out the true expected value, and then print out the average of 1000 random samples.

• [0.5, 0.5]
• [0.25, 0.25, 0.5]
• [0.05, 0.2, 0.15, 0.3, 0.1, 0.2]

PART 2

Location: assignment-problems

Skim the following sections of http://learnyouahaskell.com/starting-out:

• Ready, set, go!
• Baby's first functions

Create a Haskell file ClassifyNumber.hs and write a function classifyNumber in Haskell that takes an input number x and returns

• "negative" if x is negative
• "nonnegative" if x is nonnegative.

Then, apply this function to the input 5. Your file will look like this:

classifyNumber x = (your code here)

main = putStrLn (classifyNumber 5)

Now, run your function by typing the following into the command line:

>>> ghc --make ClassifyNumber
>>> ./ClassifyNumber

ghc is a Haskell compiler.
It will compile or "make" an executable object using your .hs file. The command ./ClassifyNumber actually runs your executable object.

PART 3

Complete this introductory C++ coding challenge: https://www.hackerrank.com/challenges/cpp-input-and-output/problem

Submit a screenshot that includes the name of the problem (top left), your username (top right), and Status: Accepted (bottom).

PART 4

Complete this introductory Shell coding challenge: https://www.hackerrank.com/challenges/bash-tutorials---looping-and-skipping/problem

The following example of a for loop will be helpful:

for i in {2..10}
do
    ((n = 5 * i))
    echo $n
done

Note: You can solve this problem with just a single for loop.

Again, submit a screenshot that includes the name of the problem (top left), your username (top right), and Status: Accepted (bottom), just like in part 3.

PART 5

Complete queries 11-14 here: https://sqlzoo.net/wiki/SELECT_from_Nobel_Tutorial

As usual, include a screenshot for each problem that includes the problem number, the successful smiley face, and your query.

PART 6

Location: Overleaf

Complete the following probability problems:

a.

• Use conditional probability. In other words, compute the probability that C has exactly $4$ spaces, given that A and B have exactly 7 spaces (together).

b.

• Write your answer using sigma notation or "dot dot dot" notation.

# Problem 76-3¶

Review; 10% of assignment grade; 15 minutes estimate

Commit your code to GitHub. When you submit your assignment, include a link to your commit(s). If you don't do this, your assignment will receive a grade of $0$ until you resubmit with links to your commits.

Additionally, do the following:

1. Make a GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include a link to the issue you created.

2. Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved. (If you don't have any issues on any of your repositories, then you don't have to do anything, but state that this is the case when you turn in your assignment.)

# Problem 75-1¶

Location: machine-learning/src/decision_tree.py

Update your DecisionTree to have the option to build the tree via random splits. By "random splits", I mean that the tree should randomly choose from the possible splits, and it should keep splitting until each leaf node is pure.

>>> dt = DecisionTree(split_metric = 'gini')
>>> dt.fit(df)
Fits the decision tree using the Gini metric

>>> dt = DecisionTree(split_metric = 'random')
>>> dt.fit(df)
Fits the decision tree by randomly choosing splits

# Problem 75-2¶

Estimated Time: 60 minutes

Submit corrections to final (put your corrections in an overleaf doc). I made a final review video that goes through each problem, available here: https://vimeo.com/496684498

For each correction, explain

1. what misunderstanding you had, and
2. how you get to the correct result.

Important: The majority of the misunderstandings should NOT be "I ran out of time", and when you explain how to get to the correct result, SHOW ALL WORK.

# Problem 75-3¶

Make sure that problem 73-1 is done. In the next assignment, you will run everyone else's strategies and they will run yours as well. We should all get the same results.

# Problem 75-4¶

Estimated Time: 20 minutes

Important! If you don't do the things below, your assignment will receive a grade of zero.

1. Commit your code to GitHub. When you submit your assignment, include a link to your commit(s).

2. Make a GitHub issue on your assigned classmate's repository (but NOT assignment-problems). See eurisko.us/resources/#code-reviews to determine your assigned classmate. When you submit your assignment, include a link to the issue you created.

3. Resolve an issue that has been made on your own GitHub repository. When you submit your assignment, include a link to the issue you resolved.

# Problem 74-1¶

Wrapping up the semester...

• Read the edited version of your blog post here. If you'd like to see any changes, post on Slack by the end of the week. Otherwise, these are going up on the website!

• Turn in any missing assignments / resubmissions / reviews / test corrections by Sunday 1/3 at the very latest. Finish strong! I want to give out strong grades, but I can only do that if you're up to date with all your work and you've done it well.

# Problem 74-2¶

Study for the final!

Probability/Statistics

definitions of independent/disjoint, conditional probability, mean, variance, standard deviation, covariance, how variance/covariance are related to expectation identifying probability distributions, solving for an unknown constant so that a probability distribution is valid, discrete uniform, continuous uniform, exponential, poisson, using cumulative distributions i.e. P(a <= x < b) = P(x < b) - P(x < a), KL divergence, joint distributions, basic probability computations with joint distributions, likelihood distribution, posterior/prior distributions

Machine Learning

pseudoinverse, fitting a linear regression, fitting a logistic regression, end behaviors of linear and logistic regression, interaction terms, using linear regression to fit the coefficients of a nonlinear function, categorical variables, naive bayes, k-nearest neighbors, decision trees, leave-one-out cross validation, underfitting/overfitting, training/testing datasets (testing datasets are also known as validation datasets)

Algorithms

Intelligent search (backtracking), depth-first search, breadth-first search, shortest path in a graph using breadth-first search, quicksort, computing big-O notation given a recurrence, bisection search (also known as binary search)

Simulation

euler estimation, SIR model, predator-prey model, hodgkin-huxley model, translating a description into a system of differential equations

Review

Basic string processing (something like separate_into_words and reverse_word_order from Quiz 1), Implementing a recursive sequence, unlisting, big-O notation, matrix multiplication, converting to reduced row echelon form, determinant using rref, determinant using cofactors, why determinant using rref is faster than determinant using cofactors, inverse via augmented matrix, tally sort, merge sort (also know how to merge two sorted lists), swap sort, Newton-Raphson (i.e. the “zero of tangent line” method), gradient descent, grid search (also know how to compute cartesian product), Linked list, tree, stack, queue, converting between binary and decimal

# Problem 73-1¶

Estimated Time: 2 hours

Points: 20

Refactor your game so that strategies adhere to this format exactly. Put the strategies as separate files in src/strategies.

Note: If you have any disagreements with the strategy template below, post on Slack, and we can discuss.

from units.base import Base
from units.battlecruiser import Battlecruiser
from units.battleship import Battleship
from units.colony import Colony
from units.cruiser import Cruiser
from units.destroyer import Destroyer
from units.scout import Scout
from units.shipyard import Shipyard

class CombatStrategy:

    def __init__(self, player_index):
        self.player_index = player_index

    def will_colonize_planet(self, coordinates, game_state):
        ...
        # return either True or False

    def decide_ship_movement(self, unit_index, game_state):
        ...
        # return a "translation", which is a tuple representing
        # the direction in which the ship moves.

        # For example, if a unit located at (1,2) wanted to
        # move to (1,1), then the translation would be (0,-1).

    def decide_purchases(self, game_state):
        ...
        # return {
        #     'units': list of unit objects you want to buy,
        #     'technology': list of technology attributes you want to upgrade
        # }

        # for example, if you wanted to buy 2 Scouts, 1 Destroyer,
        # upgrade defense technology once, and upgrade attack
        # technology twice, you'd return
        # {
        #     'units': [Scout, Scout, Destroyer],
        #     'technology': ['defense', 'attack', 'attack']
        # }

    def decide_removals(self, game_state):
        ...
        # return a list of unit indices of ships that you want to remove.

        # for example, if you want to remove your 0th and 3rd units, you'd
        # return [0, 3]

    def decide_which_unit_to_attack(self, combat_state, attacker_index):

        # combat_state is the combat_state for the particular battle
        # being considered. It will take the form
        # {'location': (1,2),
        #  'order': [{'player': 1, 'unit': 0},
        #            {'player': 0, 'unit': 1},
        #            {'player': 1, 'unit': 1},
        #            {'player': 1, 'unit': 2}],
        # }.

        # attacker_index is the index of your unit, whose turn it is
        # to attack.

        ...
        # return the index of the ship you want to attack in the
        # combat order.

        # in the above example, if you want to attack player 1's unit 1,
        # then you'd return 2 because it corresponds to
        # combat_state['order'][2]

    def decide_which_units_to_screen(self, combat_state):

        # again, the combat_state is the combat_state for the
        # particular battle

        ...
        # return the indices of the ships you want to screen
        # in the combat order

        # in the above example, if you are player 1 and you want
        # to screen units 1 and 2, you'd return [2,3] because
        # the ships you want to screen are
        # combat_state['order'][2] and combat_state['order'][3]

        # NOTE: FOR COMBATSTRATEGY AND DUMBSTRATEGY,
        # YOU CAN JUST RETURN AN EMPTY ARRAY

Note: for technology upgrades, you'll likely have to translate between strings of technology names and technology stored as Player attributes. The setattr and getattr functions may be helpful:

>>> class Cls:
...     pass
>>> obj = Cls()
>>> setattr(obj, "foo", "bar")
>>> obj.foo
'bar'
>>> getattr(obj, "foo")
'bar'

# Problem 73-2¶

Estimated Time: 1 hour

Points: 15

We need to extend our EulerEstimator to allow for "time delay". To do this, we'll need to keep a cache of data for the necessary variables. However, it's going to be very hard to build this if we always have to refer to variables by their index in the point. So, in this problem, we're going to update our EulerEstimator so that we can refer to variables by their actual names.

Refactor your EulerEstimator so that x is a dictionary instead of an array. This way, we can reference components of x by their actual labels rather than having to always use indices.

For example, to run our SIR model, we originally did this:

derivatives = [
    (lambda t, x: -0.0003*x[0]*x[1]),
    (lambda t, x: 0.0003*x[0]*x[1] - 0.02*x[1]),
    (lambda t, x: 0.02*x[1])
]
starting_point = (0, (1000, 1, 0))

estimator = EulerEstimator(derivatives, starting_point)

Now, we need to refactor it into this:

derivatives = {
    'susceptible': (lambda t, x: -0.0003*x['susceptible']*x['infected']),
    'infected': (lambda t, x: 0.0003*x['susceptible']*x['infected'] - 0.02*x['infected']),
    'recovered': (lambda t, x: 0.02*x['infected'])
}
starting_point = (0, {'susceptible': 1000, 'infected': 1, 'recovered': 0})

estimator = EulerEstimator(derivatives, starting_point)

Update the code in test_euler_estimator.py and 3_neuron_network.py to adhere to this new convention.

When I check your submission, I'm going to check that your EulerEstimator has been initialized with a dictionary in each of these files, and I'm going to run each of these files to make sure that they generate the same plots as before.
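For reference, a minimal sketch of a dict-based estimator is below. The class shape and the step(dt) method are illustrative assumptions, not the required API; the point is that derivatives and state are both keyed by label.

```python
# Minimal sketch of a dict-based EulerEstimator (assumed interface).
class EulerEstimator:
    def __init__(self, derivatives, starting_point):
        self.derivatives = derivatives      # {label: (lambda t, x: ...)}
        self.t, self.x = starting_point     # x is {label: value}

    def step(self, dt):
        # evaluate every derivative at the current point BEFORE updating,
        # so all components are advanced from the same state
        dx = {label: f(self.t, self.x) * dt
              for label, f in self.derivatives.items()}
        self.x = {label: value + dx[label] for label, value in self.x.items()}
        self.t += dt

derivatives = {
    'susceptible': (lambda t, x: -0.0003*x['susceptible']*x['infected']),
    'infected': (lambda t, x: 0.0003*x['susceptible']*x['infected'] - 0.02*x['infected']),
    'recovered': (lambda t, x: 0.02*x['infected'])
}
starting_point = (0, {'susceptible': 1000, 'infected': 1, 'recovered': 0})
estimator = EulerEstimator(derivatives, starting_point)
estimator.step(0.1)
```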

# Problem 72-1¶

Estimated Time: 15 minutes

Location:

machine-learning/analysis/scatter_plot.py

Points: 5

Make a scatter plot of the following dataset consisting of the points (x, y, class). When the class is A, color the dot red. When it is B, color the dot blue. Post your plot on slack once you've got it.

data = [[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],[2,13,'B'],
[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],[3,13,'B'],
[2,12,'B'],[2,12,'B'],
[3,12,'A'],[3,12,'A'],
[3,11,'A'],[3,11,'A'],
[3,11.5,'A'],[3,11.5,'A'],
[4,11,'A'],[4,11,'A'],
[4,11.5,'A'],[4,11.5,'A'],
[2,10.5,'A'],[2,10.5,'A'],
[3,10.5,'B'],
[4,10.5,'A']]

In the plot, make the dot size proportional to the number of points at that location.

For example, to plot a data set

[
(1,1),
(2,4), (2,4),
(3,9), (3,9), (3,9), (3,9),
(4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16), (4,16)
]

you would use the following code:

import matplotlib.pyplot as plt
plt.scatter(x=[1, 2, 3, 4], y=[1, 4, 9, 16], s=[20, 40, 80, 160], c='red')

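To get the proportional sizes for the actual dataset, one approach (a sketch; the helper name and the scale factor 20 are arbitrary choices) is to collapse duplicate points with collections.Counter:

```python
from collections import Counter

# Collapse repeated (x, y, class) points into unique points whose dot
# size is proportional to multiplicity, ready to pass to plt.scatter.
def dot_sizes(data, scale=20):
    counts = Counter(tuple(point) for point in data)
    xs = [point[0] for point in counts]
    ys = [point[1] for point in counts]
    sizes = [scale * n for n in counts.values()]
    colors = ['red' if point[2] == 'A' else 'blue' for point in counts]
    return xs, ys, sizes, colors

# usage: xs, ys, sizes, colors = dot_sizes(data)
#        plt.scatter(x=xs, y=ys, s=sizes, c=colors)
```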

# Problem 72-2¶

Estimated Time: 10-60 minutes (depending on whether you've got bugs)

Location:

machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py

Points: 10

Refactor your DecisionTree so that the dataframe is passed in the fit method (not when the decision tree is initialized). Also, create a method to classify points.

Then, make sure your decision tree passes the following tests, using the data from problem 72-1.

Note: Based on visually inspecting a plot of the data, I think these tests are correct, but if you get something different (that looks reasonable), post on Slack so I can check.

df = DataFrame.from_array(data, columns = ['x', 'y', 'class'])

>>> dt = DecisionTree()
>>> dt.fit(df)

The tree should look like this:

(13A, 15B)
/      \
(y < 12.5)       (y >= 12.5)
(13A, 3B)        (12B)
/         \
(x < 2.5)          (x >= 2.5)
(2A, 2B)                (11A, 1B)
/     \                  /        \
(y < 11.25)   (y >= 11.25)  (y < 10.75)     (y >= 10.75)
(2A)          (2B)          (1A, 1B)        (10A)
/      \
(x < 3.5)        (x >= 3.5)
(1B)        (1A)

>>> dt.root.best_split
('y', 12.5)
>>> dt.root.low.best_split
('x', 2.5)
>>> dt.root.low.low.best_split
('y', 11.25)
>>> dt.root.low.high.best_split
('y', 10.75)
>>> dt.root.low.high.low.best_split
('x', 3.5)

>>> dt.classify({'x': 2, 'y': 11.5})
'B'
>>> dt.classify({'x': 2.5, 'y': 13})
'B'
>>> dt.classify({'x': 4, 'y': 12})
'A'
>>> dt.classify({'x': 3.25, 'y': 10.5})
'B'
>>> dt.classify({'x': 3.75, 'y': 10.5})
'A'

# Problem 72-3¶

Estimated time: 45 minutes

Location: Overleaf

a.

b.

• For part (a), you need to compute the probability of each "path" to the desired outcome:
\begin{align*} P( \geq \text{2 born in same month}) &= P( \geq \text{2 born in same month} \, | \, k=5) P(k=5) \\ &\quad + P(\geq \text{2 born in same month} \, | \, k=10) P(k=10) \\ &\quad + P(\geq \text{2 born in same month} \, | \, k=15) P(k=15) \end{align*}
• For part (b), start with $P(\text{k=10} \, | \, \text{2 born in same month} ),$ and use the following two equivalent statements of Bayes' theorem:
\begin{align*} P(A \, | \, B) &= \dfrac{P(A \textrm{ and } B)}{P(B)} \\ P(A \textrm{ and } B) &= P(B \, | \, A) P(A) \end{align*}

# Problem 72-4¶

Estimated time: 45 minutes

Location: Overleaf

• Complete queries 1-10 in SQL Zoo Module 3. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.

# Problem 71-1¶

Estimated Time: 45 minutes

Location:

machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py

Points: 15

If you haven't already, create a split() method in your DecisionTree (not the same as the split() method in your Node!) that splits the tree at the node with highest impurity.

Then, create a fit() method in your DecisionTree that keeps on split()-ing until all terminal nodes are completely pure.

Assert that the following tests pass:

>>> df = DataFrame.from_array(
[[1, 11, 'A'],
[1, 12, 'A'],
[2, 11, 'A'],
[1, 13, 'B'],
[2, 13, 'B'],
[3, 13, 'B'],
[3, 11, 'B']],
columns = ['x', 'y', 'class']
)
>>> dt = DecisionTree(df)

# currently, the decision tree looks like this:

(3A, 4B)

>>> dt.split()
# now, the decision tree looks like this:

(3A, 4B)
/      \
(y < 12.5)       (y >= 12.5)
(3A, 1B)         (3B)

>>> dt.split()
# now, the decision tree looks like this:

(3A, 4B)
/      \
(y < 12.5)       (y >= 12.5)
(3A, 1B)         (3B)
/         \
(x < 2.5)          (x >= 2.5)
(3A)               (1B)

>>> dt.root.high.row_indices
[3, 4, 5]
>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]

>>> dt = DecisionTree(df)

# currently, the decision tree looks like this:

(3A, 4B)

>>> dt.fit()
# now, the decision tree looks like this:

(3A, 4B)
/      \
(y < 12.5)       (y >= 12.5)
(3A, 1B)         (3B)
/         \
(x < 2.5)          (x >= 2.5)
(3A)               (1B)

>>> dt.root.high.row_indices
[3, 4, 5]
>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]

# Problem 71-2¶

Estimated time: 45 minutes

Location: Overleaf

• Complete queries 6-13 in SQL Zoo Module 2. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.

# Problem 71-3¶

Estimated time: 45 minutes

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

b.

c.

# Problem 70-1¶

Estimated time: 60 min

Locations:

machine-learning/src/leave_one_out_cross_validator.py
machine-learning/tests/test_leave_one_out_cross_validator.py

Write a class LeaveOneOutCrossValidator that computes percent_accuracy (also known as "leave-one-out cross validation") for any input classifier. For a refresher, see problem 58-1.

Assert that LeaveOneOutCrossValidator passes the following tests:

>>> df = the cookie dataset that's in test_k_nearest_neighbors_classifier.py
>>> knn = KNearestNeighborsClassifier(k=5)

>>> cv = LeaveOneOutCrossValidator(knn, df, prediction_column='Cookie Type')
[ Note: under the hood, the LeaveOneOutCrossValidator should
create a leave_one_out_df and do
knn.fit(leave_one_out_df, prediction_column='Cookie Type') ]

>>> cv.accuracy()
0.7894736842105263 (Updated!)

Note: the following is included to help you debug.
Row 0 -- True Class is Shortbread; Predicted Class was Shortbread
Row 1 -- True Class is Shortbread; Predicted Class was Shortbread
Row 2 -- True Class is Shortbread; Predicted Class was Shortbread
Row 3 -- True Class is Shortbread; Predicted Class was Shortbread
Row 4 -- True Class is Sugar; Predicted Class was Sugar
Row 5 -- True Class is Sugar; Predicted Class was Sugar
Row 6 -- True Class is Sugar; Predicted Class was Sugar
Row 7 -- True Class is Sugar; Predicted Class was Shortbread
Row 8 -- True Class is Sugar; Predicted Class was Shortbread
Row 9 -- True Class is Sugar; Predicted Class was Sugar
Row 10 -- True Class is Fortune; Predicted Class was Fortune (Updated!)
Row 11 -- True Class is Fortune; Predicted Class was Fortune
Row 12 -- True Class is Fortune; Predicted Class was Fortune
Row 13 -- True Class is Fortune; Predicted Class was Shortbread
Row 14 -- True Class is Fortune; Predicted Class was Fortune (Updated!)
Row 15 -- True Class is Shortbread; Predicted Class was Sugar
Row 16 -- True Class is Shortbread; Predicted Class was Shortbread
Row 17 -- True Class is Shortbread; Predicted Class was Shortbread
Row 18 -- True Class is Shortbread; Predicted Class was Shortbread

>>> accuracies = []
>>> for k in range(1, len(data)-1):
>>>    knn = KNearestNeighborsClassifier(k)
>>>    cv = LeaveOneOutCrossValidator(knn, df, prediction_column='Cookie Type')
>>>    accuracies.append(cv.accuracy())

>>> accuracies
[0.5789473684210527,
0.5789473684210527, #(Updated!)
0.5789473684210527,
0.5789473684210527,
0.7894736842105263, #(Updated!)
0.6842105263157895,
0.5789473684210527,
0.5789473684210527, #(Updated!)
0.6842105263157895, #(Updated!)
0.5263157894736842,
0.47368421052631576, #(Updated!)
0.42105263157894735,
0.42105263157894735, #(Updated!)
0.3684210526315789, #(Updated!)
0.3684210526315789, #(Updated!)
0.3684210526315789, #(Updated!)
0.42105263157894735]
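The idea behind the validator can be sketched with plain lists as follows. This is an illustration of the algorithm only (the tiny 1-nearest-neighbor classifier is just for demonstration); your actual class should work with your DataFrame and KNearestNeighborsClassifier as described above.

```python
# A throwaway 1-nearest-neighbor classifier, used only to demonstrate
# the leave-one-out loop below.
class OneNearestNeighbor:
    def fit(self, rows, labels):
        self.rows, self.labels = rows, labels

    def predict(self, row):
        distances = [sum((a - b) ** 2 for a, b in zip(row, other))
                     for other in self.rows]
        return self.labels[distances.index(min(distances))]

def leave_one_out_accuracy(rows, labels, make_classifier):
    correct = 0
    for i in range(len(rows)):
        # fit on everything EXCEPT row i, then predict row i
        clf = make_classifier()
        clf.fit(rows[:i] + rows[i+1:], labels[:i] + labels[i+1:])
        if clf.predict(rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)
```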

# Problem 70-2¶

Estimated time: 45 minutes

Location: Overleaf

Suppose you are a mission control analyst who is looking down at an enemy headquarters through a satellite view, and you want to get an estimate of how many tanks they have. Most of the headquarters is hidden, but you notice that near the entrance, there are four tanks visible, and these tanks are labeled with the numbers $52, 30, 68, 7.$ So, you assume that they have $N$ tanks that they have labeled with numbers from $1$ to $N.$

Your commander asks you for an estimate: with $95\%$ certainty, what's the max number of tanks they have? Be sure to show your work.

In this problem, you'll answer that question using the same process that you used in Problem 41-1. See here for some additional clarifications that were added to this problem when it was given to the Computation & Modeling class.
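As a sanity check on your analytical answer, here is one frequentist-style computation (an illustration only; the process referenced from Problem 41-1 may differ, e.g. it may be Bayesian). It finds the largest $N$ for which seeing a maximum label of 68 among 4 tanks still has probability at least 5%, under the simplifying assumption that the 4 labels are drawn uniformly and independently from 1 to $N$:

```python
# P(all 4 observed labels <= 68 | N tanks) is (68/N)**4 under the
# independence simplification; find the largest N keeping this >= 5%.
m, k = 68, 4   # largest observed label, number of observed tanks
N = m
while (m / (N + 1)) ** k >= 0.05:
    N += 1
print(N)
```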

# Problem 70-3¶

Estimated time: 30 minutes

Location: Overleaf

a.

b.

c.

# Problem 69-1¶

George & David, this will be a catch-up problem for you guys. You guys are missing a handful of recent assignments, and there are some key problems that serve as foundations for future problems. These are the key problems: 67-1, 66-1, 62-1 (in that order of importance).

• Your task for this assignment is to complete and submit those problems.

Colby, this is also a catch-up problem for you -- your task is to complete 68-1.

Eli & Riley, you'll get 10 points for this problem because you're up-to-date.

# Problem 69-2¶

Grading: extra credit (you can get 200% on this assignment)

Location: assignment-problems/sudoku_solver.py

Use "intelligent search" to solve the following mini sudoku puzzle. Fill in the grid so that every row, every column, and every 3x2 box contains the digits 1 through 6.

For a refresher on "intelligent search", see problem 44-1.

Format your output so that when your code prints out the result, it prints out the result in the shape of a sudoku puzzle:

-----------------
| . . 4 | . . . |
| . . . | 2 3 . |
-----------------
| 3 . . | . 6 . |
| . 6 . | . . 2 |
-----------------
| . 2 1 | . . . |
| . . . | 5 . . |
-----------------
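A minimal backtracking sketch of this kind of search, assuming the 2-row-by-3-column boxes shown in the drawing (the grid encoding, with 0 for an empty cell, and the helper names are my own; the required pretty-printing still needs to be layered on top):

```python
# 0 represents an empty cell; boxes are 2 rows by 3 columns
grid = [
    [0, 0, 4, 0, 0, 0],
    [0, 0, 0, 2, 3, 0],
    [3, 0, 0, 0, 6, 0],
    [0, 6, 0, 0, 0, 2],
    [0, 2, 1, 0, 0, 0],
    [0, 0, 0, 5, 0, 0],
]

def is_valid(grid, row, col, digit):
    # digit must not already appear in the row, column, or box
    if digit in grid[row]:
        return False
    if digit in (grid[r][col] for r in range(6)):
        return False
    box_r, box_c = 2 * (row // 2), 3 * (col // 3)
    return all(grid[r][c] != digit
               for r in range(box_r, box_r + 2)
               for c in range(box_c, box_c + 3))

def solve(grid):
    for row in range(6):
        for col in range(6):
            if grid[row][col] == 0:
                for digit in range(1, 7):
                    if is_valid(grid, row, col, digit):
                        grid[row][col] = digit
                        if solve(grid):
                            return True
                        grid[row][col] = 0  # dead end: backtrack
                return False  # no digit fits this cell
    return True  # no empty cells left, so the grid is solved

if solve(grid):
    for row in grid:
        print(row)
```

The "intelligence" here is that a branch is abandoned as soon as a placement violates a row, column, or box constraint, rather than filling the whole grid before checking.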

# Problem 68-1¶

Estimated Time: 2-3 hours

Location:

machine-learning/src/decision_tree.py
machine-learning/tests/test_decision_tree.py

Points: 15

In this problem, you will create the first iteration of a class DecisionTree that builds a decision tree by repeatedly looping through all possible splits and choosing the split with the highest "goodness of split".

We will use the following simple dataset:

['x', 'y', 'class']
[1, 11, 'A']
[1, 12, 'A']
[2, 11, 'A']
[1, 13, 'B']
[2, 13, 'B']
[3, 13, 'B']
[3, 11, 'B']

For this dataset, "all possible splits" means all midpoints between distinct entries in the sorted data columns.

• The sorted distinct entries of x are 1, 2, 3.

• The sorted distinct entries of y are 11, 12, 13.

So, "all possible splits" are x=1.5, x=2.5, y=11.5, y=12.5.

Assert that the following tests pass. Note that you will need to create a Node class for the nodes in your decision tree.

>>> df = DataFrame.from_array(
[[1, 11, 'A'],
[1, 12, 'A'],
[2, 11, 'A'],
[1, 13, 'B'],
[2, 13, 'B'],
[3, 13, 'B'],
[3, 11, 'B']],
columns = ['x', 'y', 'class']
)
>>> dt = DecisionTree(df)

>>> dt.root.row_indices
[0, 1, 2, 3, 4, 5, 6] # these are the indices of data points in the root node

>>> dt.root.class_counts
{
'A': 3,
'B': 4
}

>>> dt.root.impurity
0.490 # rounded to 3 decimal places

>>> dt.root.possible_splits.to_array()
# dt.root.possible_splits is a dataframe with columns
# ['feature', 'value', 'goodness of split']
# Note: below is rounded to 3 decimal places

[['x', 1.5,  0.085],
['x', 2.5,  0.147],
['y', 11.5, 0.085],
['y', 12.5, 0.276]]

>>> dt.root.best_split
('y', 12.5)

>>> dt.root.split()
# now, the decision tree looks like this:

         (3A, 4B)
        /        \
 (y < 12.5)   (y >= 12.5)
  (3A, 1B)       (3B)

# "low" refers to the "<" child node
# "high" refers to the ">=" child node
>>> dt.root.low.row_indices
[0, 1, 2, 6]
>>> dt.root.high.row_indices
[3, 4, 5]

>>> dt.root.low.impurity
0.375
>>> dt.root.high.impurity
0

>>> dt.root.low.possible_splits.to_array()

[['x', 1.5,  0.125],
['x', 2.5,  0.375],
['y', 11.5, 0.042]]

>>> dt.root.low.best_split
('x', 2.5)

>>> dt.root.low.split()
# now, the decision tree looks like this:

              (3A, 4B)
             /        \
      (y < 12.5)   (y >= 12.5)
       (3A, 1B)       (3B)
      /        \
(x < 2.5)   (x >= 2.5)
   (3A)        (1B)

>>> dt.root.low.low.row_indices
[0, 1, 2]
>>> dt.root.low.high.row_indices
[6]

>>> dt.root.low.low.impurity
0
>>> dt.root.low.high.impurity
0
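The impurity and goodness-of-split numbers above can be sanity-checked with a pair of small standalone helpers (these are not the required class, just a way to verify the arithmetic):

```python
from collections import Counter

def gini_impurity(labels):
    # G = 1 - sum of squared class proportions
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def goodness_of_split(parent, low, high):
    # impurity of the parent minus the size-weighted impurities of the children
    n = len(parent)
    return (gini_impurity(parent)
            - len(low) / n * gini_impurity(low)
            - len(high) / n * gini_impurity(high))

root = ['A', 'A', 'A', 'B', 'B', 'B', 'B']
print(round(gini_impurity(root), 3))  # 0.49

# the split y < 12.5 sends rows 0, 1, 2, 6 low and rows 3, 4, 5 high
print(round(goodness_of_split(root, ['A', 'A', 'A', 'B'], ['B', 'B', 'B']), 3))  # 0.276
```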

# Problem 67-1¶

Estimated time: 0-10 hours (?)

Grading: 1,000,000,000 points (okay, not actually that many, but this is IMPORTANT because we need to get our game working for the opportunity with Caltech)

Problem 59-2 was to refactor the DumbPlayer tests to use the game state, and make sure they pass. If you haven't completed this yet, you'll need to do that before starting on this problem.

This problem involves refactoring the way we structure players. Currently, we have a class DumbPlayer that does everything we'd expect from a dumb player. But really, the only reason why DumbPlayer is dumb is that it uses a dumb strategy.

So, we are going to replace DumbPlayer with a class DumbStrategy, and refactor Player so that we can initialize like this:

>>> dumb_player_1 = Player(strategy = DumbStrategy)
>>> dumb_player_2 = Player(strategy = DumbStrategy)
>>> game = Game(dumb_player_1, dumb_player_2)

a. Write a class DumbStrategy in the file src/strategies/dumb_strategy.py that contains the strategies for the following methods:

• will_colonize_planet(colony_ship, game_state): returns either True or False; will be called whenever a player's colony ship lands on an uncolonized planet

• decide_ship_movement(ship, game_state): returns the coordinates to which the player wishes to move their ship.

• decide_purchases(game_state): returns a list of ship and/or technology types that you want to purchase; will be called during each economic round.

• decide_removals(game_state): returns a list of ships that you want to remove; will be called during any economic round when your total maintenance cost exceeds your CP.

• decide_which_ship_to_attack(attacking_ship, game_state): looks at the ships in the combat order and decides which to attack; will be called whenever it's your turn to attack

b. Refactor your class Player so that you can initialize a dumb player like this:

>>> dumb_player_1 = Player(strategy = DumbStrategy)
>>> dumb_player_2 = Player(strategy = DumbStrategy)
>>> game = Game(dumb_player_1, dumb_player_2)

c. Make sure that all your tests in tests/test_game_state_dumb_player.py still pass.

d. Write a class CombatStrategy in the file src/strategies/combat_strategy.py that contains the strategies for the same methods as DumbStrategy. But this time, the strategies should be the same as those that are used in CombatPlayer.

e. Refactor your tests in tests/test_game_state_dumb_player.py and make sure they still pass. When you initialize the game, you should do so like this:

>>> combat_player_1 = Player(strategy = CombatStrategy)
>>> combat_player_2 = Player(strategy = CombatStrategy)
>>> game = Game(combat_player_1, combat_player_2)

# Problem 67-2¶

Take a look at all your assignments so far in this course. If there are any assignments with low grades, that you haven't already resubmitted, then be sure to resubmit them.

Also, if you haven't already, submit quiz corrections for all of the quizzes we've had so far!

# Problem 66-1¶

Estimated time: 45 min

Locations:

machine-learning/src/k_nearest_neighbors_classifier.py
machine-learning/tests/test_k_nearest_neighbors_classifier.py

Update your KNearestNeighborsClassifier so that

• k is defined upon initialization,
• the model is fit by calling fit, and passing in the data & dependent variable, and
• when we classify an observation, all we need to pass in is the observation.

Update the tests, too, and make sure they still pass.

>>> df = the cookie dataset that's in test_k_nearest_neighbors_classifier.py
>>> knn = KNearestNeighborsClassifier(k=5)
>>> knn.fit(df, dependent_variable='Cookie Type') # dependent_variable is the new name for prediction_column
>>> observation = the observation that's in test_k_nearest_neighbors_classifier.py
>>> knn.classify(observation) # we no longer pass in k
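A minimal sketch of this interface, using a plain list of dicts in place of the course's DataFrame class (Euclidean distance and majority vote are assumptions carried over from the earlier KNN problems):

```python
from collections import Counter

class KNearestNeighborsClassifier:
    def __init__(self, k):
        self.k = k  # k is fixed at initialization

    def fit(self, data, dependent_variable):
        self.data = data
        self.dependent_variable = dependent_variable
        self.features = [c for c in data[0] if c != dependent_variable]

    def classify(self, observation):
        # Euclidean distance from each stored row to the observation
        def distance(row):
            return sum((row[f] - observation[f]) ** 2 for f in self.features) ** 0.5
        nearest = sorted(self.data, key=distance)[:self.k]
        labels = [row[self.dependent_variable] for row in nearest]
        return Counter(labels).most_common(1)[0][0]  # majority vote

data = [{'x': 0.0, 'Cookie Type': 'A'},
        {'x': 0.1, 'Cookie Type': 'A'},
        {'x': 1.0, 'Cookie Type': 'B'}]
knn = KNearestNeighborsClassifier(k=1)
knn.fit(data, dependent_variable='Cookie Type')
print(knn.classify({'x': 0.05}))  # A
```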

# Problem 66-2¶

Estimated Time: 45 min

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

b.

c.

# Problem 66-3¶

Estimated time: 30 min

Location: Overleaf

• Complete queries 1-5 in SQL Zoo Module 2. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.

• Complete Module 8 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and put it in the overleaf doc.

# Problem 65-1¶

Estimated Time: 30 min

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

• Tip: for (b), compute $1-P(\textrm{complement}).$ Here, the complement is the event that you get no aces.

b.

• Tip: again, compute $1-P(\textrm{complement}).$

c.

• Remember that PMF means "probability mass function". This is just the function $P(Z=z).$

• Tip: Find the possible values of $Z,$ and then find the probabilities of those values of $Z$ occurring. Your answer will be a piecewise function: $$P(z) = \begin{cases} \_\_\_, \, z=\_\_\_ \\ \_\_\_, \, z=\_\_\_ \\ \ldots \end{cases}$$

# Problem 65-2¶

Estimated time: 30 min

Location: Overleaf

• Complete queries 11-15 in the SQL Zoo. Take a screenshot of each successful query (with the successful smiley face showing) and put them in the overleaf doc.

• Complete Module 7 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and put it in the overleaf doc.

# Problem 64-1¶

Estimated time: 60 min

Location: assignment-problems/quicksort.py

Previously, you wrote a variant of quicksort that involved splitting the list into two parts (one part $\leq$ the pivot, and another part $>$ the pivot), and then recursively calling quicksort on those parts.

However, this algorithm can be made more efficient by keeping everything in the same list (rather than creating two new lists). You can do this by swapping elements rather than breaking them out into new lists.

Your task is to write a quicksort algorithm that uses only one list, and uses swaps to re-order elements within that list, per the quicksort algorithm. Here is an example of how to do that.

Make sure your algorithm passes the same test as the quicksort without swaps (that you did on the previous assignment).
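One standard way to do this swap-based partitioning is the Lomuto scheme, sketched below with the rightmost entry as the pivot (this is one valid approach, not necessarily the one in the linked example):

```python
def quicksort(lst, low=0, high=None):
    """In-place quicksort using the Lomuto partition scheme
    with the rightmost entry as the pivot."""
    if high is None:
        high = len(lst) - 1
    if low < high:
        pivot = lst[high]
        i = low  # boundary of the "<= pivot" region
        for j in range(low, high):
            if lst[j] <= pivot:
                lst[i], lst[j] = lst[j], lst[i]  # swap into the low region
                i += 1
        lst[i], lst[high] = lst[high], lst[i]  # put the pivot in its final place
        quicksort(lst, low, i - 1)
        quicksort(lst, i + 1, high)
    return lst

print(quicksort([5, 8, -1, 9, 10, 3.14, 2, 0, 7, 6]))
# [-1, 0, 2, 3.14, 5, 6, 7, 8, 9, 10]
```

Because every move is a swap within the one list, no extra lists are allocated at any level of the recursion.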

# Problem 64-2¶

Estimated time: 30 min

Location: Overleaf

Complete queries 1-10 in the SQL Zoo. Here's a reference for the LIKE operator, which will come in handy.

Take a screenshot of each successful query and put them in an overleaf doc. When a query is successful, you'll see a smiley face appear. Your screenshots should look like this:

# Problem 64-3¶

Estimated Time: 60 min

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

• The PMF tells you the probabilities of values of $X.$ For example, from the PMF, we have $P(X=0) = 0.2.$ You just need to plug in these values of $X$ into the function $Y=X(X-1)(X-2)$ and sum up any probabilities for which the same value of $Y$ is obtained.

b.

• Remember Bayes' rule: $P(A \, | \, B) = \dfrac{P(A \cap B)}{P(B)}$

c.

• If two events $X$ and $Y$ are "independent", then $P(X \cap Y) = P(X) P(Y).$

• If two events $X$ and $Y$ are "disjoint", then $P(X \cap Y) = 0.$

d.

• Try setting up a system of equations.

# Problem 64-4¶

Estimated Time: 15 min

Location: Overleaf

(Taken from Introduction to Statistical Learning)

This problem is VERY similar to the test/train analysis you did in the previous assignment. But this time, you don't have to actually code up anything. You just have to use the concepts of overfitting and underfitting to justify your answers.

# Problem 63-1¶

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

b.

• This problem involves summing up probabilities over all possible paths that lead to a desired outcome. An easy way to do this is to use a tree diagram.

c.

d.

• Note: "with replacement" means that each time a ball is drawn, it is put back in for the next draw. So, it would be possible to draw the same ball more than once.

e.

• Note: "without replacement" means that each time a ball is drawn, it is NOT put back in for the next draw. So, it would NOT be possible to draw the same ball more than once.

f.

• Note: CDF stands for "Cumulative Distribution Function" and is defined as $\textrm{CDF}(x) = P(X \leq x).$

g.

# Problem 63-2¶

• Complete Module 6 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

• Complete Module 4 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

# Problem 63-3¶

Resolve the suggestions/comments on your blog post. Copy and paste everything back into Overleaf, and take a final proofread. Read it through to make sure everything is grammatically correct and makes sense. Submit your shareable Overleaf link along with the assignment.

# Problem 62-1¶

Create a class NaiveBayesClassifier within machine-learning/src/naive_bayes_classifier.py that passes the following tests. These tests should be written in tests/test_naive_bayes_classifier.py using assert statements.

>>> df = DataFrame.from_array(
[
[False, False, False],
[True, True, True],
[True, True, True],
[False, False, False],
[False, True, False],
[True, True, True],
[True, False, False],
[False, True, False],
[True, False, True],
[False, True, False]
],
columns = ['errors', 'links', 'scam']
)
>>> naive_bayes = NaiveBayesClassifier(df, dependent_variable='scam')

>>> naive_bayes.probability('scam', True)
0.4
>>> naive_bayes.probability('scam', False)
0.6

>>> naive_bayes.conditional_probability(('errors',True), given=('scam',True))
1.0
>>> naive_bayes.conditional_probability(('links',False), given=('scam',True))
0.25

>>> naive_bayes.conditional_probability(('errors',True), given=('scam',False))
0.16666666666666666
>>> naive_bayes.conditional_probability(('links',False), given=('scam',False))
0.5

>>> observed_features = {
'errors': True,
'links': False
}
>>> naive_bayes.likelihood(('scam',True), observed_features)
0.1
>>> naive_bayes.likelihood(('scam',False), observed_features)
0.05

>>> naive_bayes.classify('scam', observed_features)
True

Note: in the event of a tie, choose the value of the dependent variable that occurred most frequently in the dataset.
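The likelihood arithmetic above can be reproduced with a standalone sketch (plain functions rather than the required class, with the dataset inlined as dicts):

```python
def probability(data, column, value):
    return sum(1 for row in data if row[column] == value) / len(data)

def conditional_probability(data, event, given):
    matching = [row for row in data if row[given[0]] == given[1]]
    return sum(1 for row in matching if row[event[0]] == event[1]) / len(matching)

def likelihood(data, dependent_variable, hypothesis, observed_features):
    # naive Bayes: P(class) times the product of P(feature | class)
    result = probability(data, dependent_variable, hypothesis)
    for feature, value in observed_features.items():
        result *= conditional_probability(data, (feature, value),
                                          (dependent_variable, hypothesis))
    return result

rows = [(False, False, False), (True, True, True), (True, True, True),
        (False, False, False), (False, True, False), (True, True, True),
        (True, False, False), (False, True, False), (True, False, True),
        (False, True, False)]
data = [{'errors': e, 'links': l, 'scam': s} for e, l, s in rows]

observed_features = {'errors': True, 'links': False}
print(round(likelihood(data, 'scam', True, observed_features), 10))   # 0.1
print(round(likelihood(data, 'scam', False, observed_features), 10))  # 0.05
```

Since 0.1 > 0.05, classifying this observation as a scam matches the expected output of classify.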

# Problem 62-2¶

Location: assignment-problems/quicksort_without_swaps.py

Implement a function quicksort that implements the variant of quicksort described here: https://www.youtube.com/watch?v=XE4VP_8Y0BU

• Note: this variant of quicksort is very similar to mergesort.

Use your function to sort the list [5,8,-1,9,10,3.14,2,0,7,6] (write a test with an assert statement). Choose the pivot as the rightmost entry.
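A sketch of this variant, with the rightmost entry as the pivot and new lists built at each level rather than swaps:

```python
def quicksort(lst):
    """Quicksort variant that returns a new sorted list instead of
    swapping in place, using the rightmost entry as the pivot."""
    if len(lst) <= 1:
        return lst
    pivot, rest = lst[-1], lst[:-1]
    lesser = [x for x in rest if x <= pivot]   # everything <= pivot
    greater = [x for x in rest if x > pivot]   # everything > pivot
    return quicksort(lesser) + [pivot] + quicksort(greater)

assert quicksort([5, 8, -1, 9, 10, 3.14, 2, 0, 7, 6]) == [-1, 0, 2, 3.14, 5, 6, 7, 8, 9, 10]
```

The split-recurse-concatenate structure is what makes this version feel like mergesort.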

# Problem 62-3¶

Location: Writeup in Overleaf; code in machine-learning/analysis/assignment_62.py

Watch this video FIRST: https://youtu.be/EuBBz3bI-aA?t=29

a. Create a dataset as follows:

$$\left\{ (x, y) \, \Bigg| \, \begin{matrix} x=0.1, 0.2, \ldots, 10 \\ y=3+0.5x^2 + \epsilon, \, \epsilon \sim \mathcal{U}(-5, 5) \end{matrix} \right\}$$

Split the dataset into two subsets:

• a training dataset consisting of 80% of the data points, and
• a testing dataset consisting of 20% of the data points.

To do this, you can randomly remove 20% of the data points from the dataset.
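One way to generate and split this dataset is sketched below (the seed is an arbitrary choice, included only so the split is reproducible):

```python
import random

random.seed(0)  # arbitrary seed, just for reproducibility

# y = 3 + 0.5 x^2 + epsilon, epsilon ~ Uniform(-5, 5), for x = 0.1, 0.2, ..., 10
data = [(n / 10, 3 + 0.5 * (n / 10) ** 2 + random.uniform(-5, 5))
        for n in range(1, 101)]

# shuffle, then take 80% for training and the remaining 20% for testing
random.shuffle(data)
split_index = int(0.8 * len(data))
train, test = data[:split_index], data[split_index:]
print(len(train), len(test))  # 80 20
```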

b. Fit 5 models to the data: a linear regressor, a quadratic regressor, a cubic regressor, a quartic regressor, and a quintic regressor. Compute the residual sum of squares (RSS) for each model on the training data. Which model is most accurate on the training data? Explain why.

c. Compute the RSS for each model on the testing data. Which model is most accurate on the testing data? Explain why.

d. Based on your findings, which model is the best model for the data? Justify your choice.

# Problem 61-1¶

Location: Overleaf

Construct a decision tree model for the following data. Include the Gini impurity and goodness of split at each node. You should choose the splits so as to maximize the goodness of split each time. Also, draw a picture of the decision boundary on the graph.

# Problem 61-2¶

Location: simulation/analysis/3-neuron-network.py

There are a couple things we need to update in our BiologicalNeuron and BiologicalNeuralNetwork, to make the model more realistic.

The first thing is that the synapse only releases neurotransmitters when a neuron has "fired". So, the voltage due to synapse inputs should not be a sum of all the raw voltages of the corresponding neurons. Instead, we should only sum the voltages that are over some threshold, say, $50 \, \textrm{mV}.$

So, our model becomes

$$\dfrac{\textrm dV}{\textrm dt} = \underbrace{\dfrac{1}{C} \left[ s(t) - I_{\text{Na}}(t) - I_{\text K}(t) - I_{\text L}(t) \right]}_\text{neuron in isolation} + \underbrace{\dfrac{1}{C} \left( \sum\limits_{\begin{matrix} \textrm{synapses from} \\ \textrm{other neurons} \\ \textrm{with } V(t) > 50 \end{matrix}} V_{\text{other neuron}}(t) \right)}_\text{interactions with other neurons}.$$

Update your BiologicalNeuralNetwork using the above model. The resulting graph should stay mostly the same (but this update to the model will be important when we're simulating many neurons).
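The thresholded synapse term can be isolated into a small helper, sketched here under the state-packing convention used in Problem 58-2 (the function name is my own, not a required method):

```python
THRESHOLD_MV = 50  # a neuron only releases neurotransmitter once it has "fired"

def synapse_input(x, parent_indices, C):
    """Voltage contribution from parent neurons, counting only parents
    whose voltage exceeds the firing threshold. The state vector x packs
    each neuron's variables as [V0, n0, m0, h0, V1, n1, m1, h1, ...]."""
    total = sum(x[4 * p] for p in parent_indices if x[4 * p] > THRESHOLD_MV)
    return total / C
```

With three parents at 60 mV, 10 mV, and 70 mV, only the first and third contribute, so the raw sum is 130 mV before dividing by the capacitance.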

# Problem 61-3¶

Make suggestions on your assigned classmate's blog post. If anything is unclear, uninteresting, or awkwardly phrased, make a suggestion to improve it. You should the "suggesting" feature of Google Docs and type in how you would rephrase or rewrite the particular portions.

Be sure to look for and correct any grammar mistakes as well. This is the second round of review, so I'm expecting there to be NO grammar mistakes whatsoever after you're done reviewing.

# Problem 61-4¶

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

b.

# Problem 61-5¶

Complete Module 3 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

# Problem 60-1¶

Location: Overleaf

For two positive functions $f(n)$ and $g(n),$ we say that $f = O(g)$ if

$$\lim\limits_{n \to \infty} \dfrac{f(n)}{g(n)} < \infty,$$

or equivalently, there exists a constant $c$ such that

$$f(n) < c \cdot g(n)$$

for all sufficiently large $n.$
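For example, here is the shape of such an argument for a different function: to show $2n + 3 = O(n),$ note that

$$\lim\limits_{n \to \infty} \dfrac{2n+3}{n} = \lim\limits_{n \to \infty} \left( 2 + \dfrac{3}{n} \right) = 2 < \infty, \qquad \text{or equivalently,} \qquad 2n + 3 < 6n \, \text{ for all } n \geq 1.$$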

Using the definition above, prove the following:

a. $3n^2 + 2n + 1 = O(n^2).$

b. $O(f + g) = O(\max(f,g)).$

• Note: in your proof, you should show that if $h = O(f+g),$ then $h = O(\max(f,g)).$

c. $O(f) \cdot O(g) = O(f \cdot g).$

• Note: in your proof, you should show that if $x = O(f)$ and $y = O(g),$ then $x \cdot y = O(f \cdot g).$

d. If $f = O(g)$ and $g = O(h)$ then $f = O(h).$

# Problem 60-2¶

Location: Overleaf

(Taken from Introduction to Probability: Statistics and Random Processes by Hossein Pishro-Nik)

a.

b.

• Check: you should get a result of $0.1813.$ If you get stuck, then here's a link to a similar example, worked out.

c.

• Check: you should get a result of $1/4.$ Remember the triangle inequality! And remember that you can visualize this problem geometrically:

# Problem 60-3¶

Location: Overleaf

(Taken from Introduction to Statistical Learning)

IMPORTANT:

• For part (a), write out the model for salary of a male in this dataset, and the model for salary of a female in this dataset, and use these models to justify your answer.

• Perhaps counterintuitively, question (c) is false. I want you to provide a thorough explanation of why this is by coming up with a situation in which there would be a significant interaction, but the interaction term is small.

# Problem 60-4¶

• Complete Module 5 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

• Complete Module 2 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

# Problem 60-5¶

Grading: 5 points (if you've completed these things already, then you get 5 points free)

• Resolve any comments/suggestions in your blog post Google Doc

• Catch up on any problems you haven't fully completed: BiologicalNeuralNetwork, DumbPlayer tests, percent_correct with KNearestNeighborsClassifier

# Problem 59-1¶

Location: Overleaf

Construct a decision tree model for the following data, using the splits shown.

Remember that the formula for Gini impurity for a group with class distribution $\vec p$ is

$$G(\vec p) = \sum_i p_i (1-p_i)$$

and that the "goodness-of-split" is quantified as

$$\text{goodness} = G(\vec p_\text{pre-split}) - \sum_\text{post-split groups} \dfrac{N_\text{group}}{N_\text{pre-split}} G(\vec p_\text{group}).$$

See the updated Eurisko Assignment Template for an example of constructing a decision tree in latex for a graph with given splits.

• Be sure to include the class counts, impurity, and goodness of split at each node

• Be sure to label each edge with the corresponding decision criterion.

This resource may also be helpful for reference.

# Problem 59-2¶

Grading: 10 points (5 points for writing tests, 5 points for passing tests)

Revise tests/test_dumb_player.py, so that it uses the actual game state. You can refer to Problem 23-3 for the tests.

For example, the first test is as follows:

At the end of Turn 1 Movement Phase:
Player 0 has 3 scouts at (4,0)
Player 1 has 3 scouts at (4,4)

Phrased in terms of the game state, we could write the test as

game_state = game.generate_state()
player_0_scout_locations = [u.location for u in game_state.players[0].units if u.type == Scout]
player_1_scout_locations = [u.location for u in game_state.players[1].units if u.type == Scout]
assert set(player_0_scout_locations) == set([(4,0), (4,0), (4,0)])
assert set(player_1_scout_locations) == set([(4,4), (4,4), (4,4)])

Given the refactoring that we've been doing, your tests might not run successfully the first time. But don't spend all your time on this problem only. If your tests don't pass, then make sure to complete all the other problems in this assignment before you start debugging your game.

# Problem 59-3¶

Make suggestions on your assigned classmate's blog post. If anything is unclear, uninteresting, or awkwardly phrased, make a suggestion to improve it. You should use the "suggesting" feature of Google Docs and type in how you would rephrase or rewrite the particular portions.

Be sure to look for and correct any grammar mistakes as well. You'll be graded on how thorough your suggestions are. Everyone should be making plenty of suggestions (there are definitely at least 10 suggestions to be made on everyone's drafts).

• Elijah: review Colby's

• Riley: review David's

• George: review Riley's

• David: review George's

• Colby: review Elijah's

# Problem 59-4¶

• Complete Module 4 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

• Complete Module 1 of Sololearn's SQL Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

# Problem 58-1¶

Recall the following cookie dataset (that has been augmented with some additional examples):

['Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]
[['Shortbread'  ,     0.14     ,       0.14     ,      0.28     ,     0.44      ],
['Shortbread'  ,     0.10     ,       0.18     ,      0.28     ,     0.44      ],
['Shortbread'  ,     0.12     ,       0.10     ,      0.33     ,     0.45      ],
['Shortbread'  ,     0.10     ,       0.25     ,      0.25     ,     0.40      ],
['Sugar'       ,     0.00     ,       0.10     ,      0.40     ,     0.50      ],
['Sugar'       ,     0.00     ,       0.20     ,      0.40     ,     0.40      ],
['Sugar'       ,     0.02     ,       0.08     ,      0.45     ,     0.45      ],
['Sugar'       ,     0.10     ,       0.15     ,      0.35     ,     0.40      ],
['Sugar'       ,     0.10     ,       0.08     ,      0.35     ,     0.47      ],
['Sugar'       ,     0.00     ,       0.05     ,      0.30     ,     0.65      ],
['Fortune'     ,     0.20     ,       0.00     ,      0.40     ,     0.40      ],
['Fortune'     ,     0.25     ,       0.10     ,      0.30     ,     0.35      ],
['Fortune'     ,     0.22     ,       0.15     ,      0.50     ,     0.13      ],
['Fortune'     ,     0.15     ,       0.20     ,      0.35     ,     0.30      ],
['Fortune'     ,     0.22     ,       0.00     ,      0.40     ,     0.38      ],
['Shortbread'  ,     0.05     ,       0.12     ,      0.28     ,     0.55      ],
['Shortbread'  ,     0.14     ,       0.27     ,      0.31     ,     0.28      ],
['Shortbread'  ,     0.15     ,       0.23     ,      0.30     ,     0.32      ],
['Shortbread'  ,     0.20     ,       0.10     ,      0.30     ,     0.40      ]]

When fitting our k-nearest neighbors models, we have been using this dataset to predict the type of a cookie based on its ingredient portions. We've also seen that issues can arise when $k$ is too small or too large.

So, what is a good value of $k?$

a. To explore this question, plot the function $$y= \text{percent_correct}(k), \qquad k=1,2,3, \ldots, 16,$$ where $\text{percent_correct}(k)$ is the percentage of points in the dataset that the $k$-nearest neighbors model would classify correctly.

for each data point:
    1. fit a kNN model to all the data EXCEPT that data point
    2. use the kNN model to classify the data point
    3. determine whether the kNN classification matches up
       with the actual class of the data point

percent_correct = num_correct_classifications / tot_num_data_points
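This leave-one-out loop can be written generically; in the sketch below, make_model and the dict-based rows are my own assumptions, and any classifier with the fit/classify interface from Problem 66-1 would slot in:

```python
def percent_correct(make_model, data, dependent_variable):
    """Leave-one-out accuracy: hold each point out, fit a fresh model on
    the rest, and check the held-out point's predicted class.
    make_model() should return a fresh classifier exposing the
    fit/classify interface from Problem 66-1."""
    num_correct = 0
    for i, point in enumerate(data):
        rest = data[:i] + data[i+1:]  # all the data EXCEPT this point
        model = make_model()
        model.fit(rest, dependent_variable=dependent_variable)
        observation = {key: value for key, value in point.items()
                       if key != dependent_variable}
        if model.classify(observation) == point[dependent_variable]:
            num_correct += 1
    return num_correct / len(data)
```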

You should get the following result:

b. Based on what you know about what happens when $k$ is too small or too large, does the shape of your plot make sense?

c. What would be an appropriate value (or range of values) of $k$ for this modeling task? Justify your answer by referring to your plot.

# Problem 58-2¶

Get your BiologicalNeuralNetwork fully working. Note that there's a particular issue with defining functions within for loops that Elijah pointed out on Slack:

Consider the following code:

funcs = []
for i in range(10):
    funcs.append(lambda x: x * i)
for f in funcs:
    print(f(5))

You'd expect to see 0, 5, 10, 15, 20, etc.
BUT what you actually get is 45, 45, 45, 45, 45, etc.
Instead of multiplying by the value i had when the lambda was created, it is just doing 9 * 5 every time.
This is because the lambda refers to the place where i is stored in memory, and i ends up as 9 once the loop finishes. So instead of capturing the value of i at the moment the lambda is created, each lambda uses the last value that i was set to before it is actually called. The way you fix this problem is:

funcs = []
for i in range(10):
    funcs.append(lambda x, i=i: x * i)
for f in funcs:
    print(f(5))

So, when you're getting your derivatives, you'll need to do the following:

network_derivatives = []
for each neuron (with index i):

    parent_indices = indices of neurons that send synapses to neuron i

    # x is [V0, n0, m0, h0,
    #       V1, n1, m1, h1,
    #       V2, n2, m2, h2, ...]

    network_derivatives += [
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dV(t, x[4*i : 4*(i+1)])
                + 1/neuron.C * sum(x[4*p] for p in parent_indices)
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dn(t, x[4*i : 4*(i+1)])
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dm(t, x[4*i : 4*(i+1)])
        ),
        (
            lambda t, x, i=i, neuron=neuron:
                neuron.dh(t, x[4*i : 4*(i+1)])
        )
    ]

# Problem 58-3¶

For blog post draft #4, I want you to do the following:

George: Linear and Logistic Regression, Part 1: Understanding the Models

1. Make plots of the images you wanted to include, and insert them into your post. You can use character arguments in plt.plot(); here's a reference

2. You haven't really hit on why the logistic model takes a sigmoid shape. You should talk about the $e^{\beta x}$ term, where $\beta$ is negative. What happens when $x$ gets really negative? What happens when $x$ gets really positive?

3. For your linear regression, you should use coefficients $\beta_0, \beta_1, \ldots$ just like you did in the logistic regression. This will help drive the point home that logistic regression is just a transformation of linear regression.

4. We're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

Colby: Linear and Logistic Regression, Part 2: Fitting the Models

1. In your explanation of the pseudoinverse, make sure to state that in most modeling contexts, our matrix $X$ is taller than it is wide, because we have lots of data points. So our matrix $X$ is usually not invertible because it is a tall rectangular matrix.

2. In your explanation of the pseudoinverse, be more careful with your language: the pseudoinverse $(\mathbf{X}^T\mathbf{X})^{-1}$ is not equivalent to the standard inverse $\mathbf{X}^{-1}.$ The equation $\mathbf{X} \vec \beta = \vec y$ is usually not solvable, because the standard inverse $\mathbf{X}^{-1}$ usually does not exist. But the pseudoinverse $(\mathbf{X}^T\mathbf{X})^{-1}$ usually does exist, and the solution $\vec \beta = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \vec y$ minimizes the sum of squared error between the desired output $\vec y$ and the actual output $\mathbf{X} \vec \beta.$

3. Decide between bracket matrices (bmatrix) and parenthesis matrices (pmatrix). You sometimes use bmatrix, and other times pmatrix. Choose one convention and stick to it.

4. On page 2, show the following intermediate steps (fill in the dots): \begin{align*} \vec \beta &= \ldots \\ \vec \beta &= \begin{pmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{pmatrix}^{-1} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 4 \\ 0 & 0 & 2 & 5 \end{pmatrix} \begin{pmatrix} 0.1 \\ 0.2 \\ 0.5 \\ 0.6 \end{pmatrix} \\ \vec \beta &= \begin{pmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{pmatrix} \begin{pmatrix} 0.1 \\ 0.2 \\ 0.5 \\ 0.6 \end{pmatrix} \\ \vec \beta &= \ldots \end{align*}

5. Remove any parts of your code that are not relevant to the blog post (even if you used them for some part of an assignment). For example, you should remove the rounding part from apply_coeffs.

6. Make sure your code follows the same conventions everywhere. For example, you sometimes say coefficients, and other times just coeffs. For the purposes of this blog post, we want to be as clear as possible, so always use coefficients instead of coeffs.

7. Make your code tight so that there is zero redundancy. For example, in __init__, the argument ratings is redundant because these values are already in your dataframe, and you already have a variable prediction_column that indicates the relevant column in your dataframe. So eliminate ratings from your code.

8. After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables

1. In the statements of your models, you need a constant term, and you need to standardize your coefficient labels and your variable names. You use different conventions in a lot of places -- sometimes you call the coefficients $a,b,c,\ldots,$ other times $c_1, c_2, c_3,\ldots,$ and other times $\beta_1,\beta_2,\beta_3,\ldots.$ Likewise, you sometimes say "beef" and "pb", while other times you say "roast beef" and "peanut butter". You need to standardize these names. I think using $\beta$'s or $c$'s for coefficients and abbreviations for variable names is preferable. So, for example, one of your equations would turn into $y = \beta_0 + \beta_1(\textrm{beef}) + \beta_2(\textrm{pb}) + \beta_3(\textrm{mayo}) + \beta_4(\textrm{jelly})$ or $y = c_0 + c_1(\textrm{beef}) + c_2(\textrm{pb}) + c_3(\textrm{mayo}) + c_4(\textrm{jelly}).$

2. After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

David: Predator-Prey Modeling with Euler Estimation

1. Explain this more clearly:

Euler estimation works by adding the derivative of an equation to each given value over and over again. This is because the derivatives are the instantaneous rates of change so we add it at each point to accurately show the the equation. Adding each point up from an equation is also equivalent to an integral.

2. Clean up your code on page 3. It's hard to tell what's going on. Put these code snippets into a single clean function, and change up the naming/structure so that it's clear what's going on. The code doesn't have to be the same as what's actually in your Euler estimator.

3. After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

Elijah: Solving Magic Squares using Backtracking

1. In your nested for loops, you should be using range(1,10) because 0 is not considered as an element of the magic square.

2. In your code snippets, you should name everything very descriptively so that it's totally obvious what things represent. For example, instead of s = int(len(arr)**0.5), you could say side_length = int(len(arr)**0.5).

3. After you do the steps above, we're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

# Problem 57-1¶

Location: machine-learning/src/k_nearest_neighbors_classifier.py

Create a class KNearestNeighborsClassifier that works as follows. Leverage existing methods in your DataFrame class to do the brunt of the processing.

>>> df = DataFrame.from_array(
[['Shortbread'  ,     0.14     ,       0.14     ,      0.28     ,     0.44      ],
['Shortbread'  ,     0.10     ,       0.18     ,      0.28     ,     0.44      ],
['Shortbread'  ,     0.12     ,       0.10     ,      0.33     ,     0.45      ],
['Shortbread'  ,     0.10     ,       0.25     ,      0.25     ,     0.40      ],
['Sugar'       ,     0.00     ,       0.10     ,      0.40     ,     0.50      ],
['Sugar'       ,     0.00     ,       0.20     ,      0.40     ,     0.40      ],
['Sugar'       ,     0.10     ,       0.08     ,      0.35     ,     0.47      ],
['Sugar'       ,     0.00     ,       0.05     ,      0.30     ,     0.65      ],
['Fortune'     ,     0.20     ,       0.00     ,      0.40     ,     0.40      ],
['Fortune'     ,     0.25     ,       0.10     ,      0.30     ,     0.35      ],
['Fortune'     ,     0.22     ,       0.15     ,      0.50     ,     0.13      ],
['Fortune'     ,     0.15     ,       0.20     ,      0.35     ,     0.30      ],
['Fortune'     ,     0.22     ,       0.00     ,      0.40     ,     0.38      ]],
columns = ['Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]
)
>>> knn = KNearestNeighborsClassifier(df, prediction_column = 'Cookie Type')
>>> observation = {
'Portion Eggs': 0.10,
'Portion Butter': 0.15,
'Portion Sugar': 0.30,
'Portion Flour': 0.45
}

>>> knn.compute_distances(observation)
Returns a dataframe representation of the following array:

[[0.047, 'Shortbread'],
[0.037, 'Shortbread'],
[0.062, 'Shortbread'],
[0.122, 'Shortbread'],
[0.158, 'Sugar'],
[0.158, 'Sugar'],
[0.088, 'Sugar'],
[0.245, 'Sugar'],
[0.212, 'Fortune'],
[0.187, 'Fortune'],
[0.396, 'Fortune'],
[0.173, 'Fortune'],
[0.228, 'Fortune']]

Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.

>>> knn.nearest_neighbors(observation)
Returns a dataframe representation of the following array:

[[0.037, 'Shortbread'],
[0.047, 'Shortbread'],
[0.062, 'Shortbread'],
[0.088, 'Sugar'],
[0.122, 'Shortbread'],
[0.158, 'Sugar'],
[0.158, 'Sugar'],
[0.173, 'Fortune'],
[0.187, 'Fortune'],
[0.212, 'Fortune'],
[0.228, 'Fortune'],
[0.245, 'Sugar'],
[0.396, 'Fortune']]

Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.

>>> knn.compute_average_distances(observation)

{
'Shortbread': 0.067,
'Sugar': 0.162,
'Fortune': 0.239
}

Note: the above has been rounded to 3 decimal places for ease of viewing, but you should not round yourself.

>>> knn.classify(observation, k=5)
'Shortbread'

(In the case of a tie, choose whichever class has the lower average distance. If that is still a tie, then pick randomly.)
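The core of this class can be prototyped without the DataFrame dependency. Here is a minimal sketch using plain lists (the names `rows`, `labels`, and the helper functions are illustrative, not part of the required interface):

```python
from collections import Counter

def euclidean_distance(p, q):
    # straight-line distance between two equal-length feature vectors
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def classify(rows, labels, observation, k):
    # sort (distance, label) pairs and take a majority vote among the k nearest
    distances = sorted(
        (euclidean_distance(row, observation), label)
        for row, label in zip(rows, labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

rows = [
    [0.14, 0.14, 0.28, 0.44], [0.10, 0.18, 0.28, 0.44],
    [0.12, 0.10, 0.33, 0.45], [0.10, 0.25, 0.25, 0.40],
    [0.00, 0.10, 0.40, 0.50], [0.00, 0.20, 0.40, 0.40],
    [0.10, 0.08, 0.35, 0.47], [0.00, 0.05, 0.30, 0.65],
    [0.20, 0.00, 0.40, 0.40], [0.25, 0.10, 0.30, 0.35],
    [0.22, 0.15, 0.50, 0.13], [0.15, 0.20, 0.35, 0.30],
    [0.22, 0.00, 0.40, 0.38],
]
labels = ['Shortbread'] * 4 + ['Sugar'] * 4 + ['Fortune'] * 5
observation = [0.10, 0.15, 0.30, 0.45]
print(classify(rows, labels, observation, k=5))  # 'Shortbread'
```

Your actual class should wrap this logic around your DataFrame methods rather than raw lists.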

# Problem 57-2¶

Location: simulation/analysis/3-neuron-network.py

IMPORTANT UPDATE: I'm not going to take points off if your BiologicalNeuralNetwork isn't fully working. I'll expect to see at least the skeleton of it written, but there's a subtle issue with lambda functions, which Elijah pointed out, that we need to talk about in class tomorrow.

Create a Github repository named simulation, and organize the code as follows:

simulation/
|- src/
|  |- euler_estimator.py
|  |- biological_neuron.py
|  |- biological_neural_network.py
|- tests/
|  |- test_euler_estimator.py
|- analysis/
|  |- predator_prey.py
|  |- sir_epidemiology.py

Rename your class Neuron to be BiologicalNeuron. (This is to avoid confusion when we create another neuron class in the context of machine learning.)

Create a class BiologicalNeuralNetwork to simulate a network of interconnected neurons. This class will be initialized with two arguments:

• neurons - a list of neurons in the network

• synapses - a list of "directed edges" that correspond to connections between neurons

To simulate your BiologicalNeuralNetwork, you will use an EulerEstimator where x is a long array of V,n,m,h for each neuron. So, if you are simulating 3 neurons (neurons 0,1,2), then you will be passing in $4 \times 3 = 12$ derivatives:

$$\mathbf{x} = (V_0, n_0, m_0, h_0, V_1, n_1, m_1, h_1, V_2, n_2, m_2, h_2)$$

Note that you will have to add extra terms to the voltage derivatives to represent the synapses. The updated derivative of voltage is as follows: $$\dfrac{\textrm dV}{\textrm dt} = \underbrace{\dfrac{1}{C} \left[ s(t) - I_{\text{Na}}(t) - I_{\text K}(t) - I_{\text L}(t) \right]}_\text{neuron in isolation} + \underbrace{\dfrac{1}{C} \left( \sum\limits_{\begin{matrix} \textrm{synapses from} \\ \textrm{other neurons} \end{matrix}} V_{\text{other neuron}}(t) \right)}_\text{interactions with other neurons}.$$

So, in the case of 3 neurons connected as $0 \to 1 \to 2,$ the full system of equations would be as follows:

\begin{align*} \dfrac{\textrm dV_0}{\textrm dt} &= \textrm{neuron_0.dV}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dn_0}{\textrm dt} &= \textrm{neuron_0.dn}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dm_0}{\textrm dt} &= \textrm{neuron_0.dm}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dh_0}{\textrm dt} &= \textrm{neuron_0.dh}(V_0,n_0,m_0,h_0) \\ \dfrac{\textrm dV_1}{\textrm dt} &= \textrm{neuron_1.dV}(V_1,n_1,m_1,h_1) + \dfrac{1}{\textrm{neuron_1.C}} V_0(t) \\ \dfrac{\textrm dn_1}{\textrm dt} &= \textrm{neuron_1.dn}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dm_1}{\textrm dt} &= \textrm{neuron_1.dm}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dh_1}{\textrm dt} &= \textrm{neuron_1.dh}(V_1,n_1,m_1,h_1) \\ \dfrac{\textrm dV_2}{\textrm dt} &= \textrm{neuron_2.dV}(V_2,n_2,m_2,h_2) + \dfrac{1}{\textrm{neuron_2.C}} V_1(t) \\ \dfrac{\textrm dn_2}{\textrm dt} &= \textrm{neuron_2.dn}(V_2,n_2,m_2,h_2) \\ \dfrac{\textrm dm_2}{\textrm dt} &= \textrm{neuron_2.dm}(V_2,n_2,m_2,h_2) \\ \dfrac{\textrm dh_2}{\textrm dt} &= \textrm{neuron_2.dh}(V_2,n_2,m_2,h_2) \end{align*}

Test your BiologicalNeuralNetwork as follows:

>>> def electrode_voltage(t):
    if t > 10 and t < 11:
        return 150
    elif t > 20 and t < 21:
        return 150
    elif t > 30 and t < 40:
        return 150
    elif t > 50 and t < 51:
        return 150
    elif t > 53 and t < 54:
        return 150
    elif t > 56 and t < 57:
        return 150
    elif t > 59 and t < 60:
        return 150
    elif t > 62 and t < 63:
        return 150
    elif t > 65 and t < 66:
        return 150
    return 0

>>> neuron_0 = BiologicalNeuron(stimulus = electrode_voltage)
>>> neuron_1 = BiologicalNeuron()
>>> neuron_2 = BiologicalNeuron()
>>> neurons = [neuron_0, neuron_1, neuron_2]

>>> synapses = [(0,1), (1,2)]
The neural network resembles a directed graph:
0 --> 1 --> 2

>>> network = BiologicalNeuralNetwork(neurons, synapses)
>>> euler = EulerEstimator(
derivatives = network.get_derivatives(),
point = network.get_starting_point()
)
>>> plt.plot([n/2 for n in range(160)], [electrode_voltage(n/2) for n in range(160)])
>>> euler.plot([0, 80], step_size = 0.001)
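The lambda subtlety mentioned above is Python's late binding of closure variables: every lambda built in a loop reads the loop variable's value at call time, not at definition time, so all of them end up using its final value. A minimal demonstration, with the standard default-argument fix (the derivative signature `(t, x)` mirrors what get_derivatives would produce):

```python
# Each function built in the loop closes over the SAME variable i,
# so by the time any of them is called, i holds its final value (2).
broken = [lambda t, x: x[i] for i in range(3)]

# Binding i as a default argument freezes its value at definition time.
fixed = [lambda t, x, i=i: x[i] for i in range(3)]

point = [10, 20, 30]
print(broken[0](0, point))  # 30 -- every "broken" lambda reads index 2
print(fixed[0](0, point))   # 10 -- each "fixed" lambda keeps its own index
```

When get_derivatives assembles one derivative function per state variable in a loop, use the default-argument form so each function remembers which index of $\mathbf{x}$ it belongs to.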

# Problem 57-3¶

Complete Module 3 of Sololearn's C++ Course. Take a screenshot of the completed module, with your user profile showing, and submit it along with the assignment.

# Problem 56-1¶

Location: machine-learning/tests/test_data_frame.py

Implement the following functionality in your DataFrame, and assert that these tests pass.

a. Loading an array. You'll need to use @classmethod for this one (read about it here).

>>> columns = ['firstname', 'lastname', 'age']
>>> arr = [['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13],
['Sylvia', 'Mendez', 9]]
>>> df = DataFrame.from_array(arr, columns)

b. Selecting columns by name

>>> df.select_columns(['firstname','age']).to_array()
[['Kevin', 5],
['Charles', 17],
['Anna', 13],
['Sylvia', 9]]

c. Selecting rows by index

>>> df.select_rows([1,3]).to_array()
[['Charles', 'Trapp', 17],
['Sylvia', 'Mendez', 9]]

d. Selecting rows which satisfy a particular condition (given as a lambda function)

>>> df.select_rows_where(
lambda row: len(row['firstname']) >= len(row['lastname'])
and row['age'] > 10
).to_array()
[['Charles', 'Trapp', 17]]

e. Ordering the rows by given column

>>> df.order_by('age', ascending=True).to_array()
[['Kevin', 'Fray', 5],
['Sylvia', 'Mendez', 9],
['Anna', 'Smith', 13],
['Charles', 'Trapp', 17]]

>>> df.order_by('firstname', ascending=False).to_array()
[['Sylvia', 'Mendez', 9],
['Kevin', 'Fray', 5],
['Charles', 'Trapp', 17],
['Anna', 'Smith', 13]]
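One possible minimal implementation that satisfies the tests above is sketched below. The internal storage choice (a plain list of row lists) is just one option, not a requirement:

```python
class DataFrame:
    def __init__(self, rows, columns):
        # rows: list of row lists; columns: list of column names
        self.rows = rows
        self.columns = columns

    @classmethod
    def from_array(cls, arr, columns):
        # alternate constructor: build a DataFrame from a 2D array
        return cls([list(row) for row in arr], columns)

    def to_array(self):
        return [list(row) for row in self.rows]

    def select_columns(self, names):
        idxs = [self.columns.index(name) for name in names]
        return DataFrame([[row[i] for i in idxs] for row in self.rows], names)

    def select_rows(self, indices):
        return DataFrame([self.rows[i] for i in indices], self.columns)

    def select_rows_where(self, condition):
        # condition receives each row as a dict keyed by column name
        kept = [row for row in self.rows
                if condition(dict(zip(self.columns, row)))]
        return DataFrame(kept, self.columns)

    def order_by(self, column, ascending=True):
        i = self.columns.index(column)
        ordered = sorted(self.rows, key=lambda row: row[i],
                         reverse=not ascending)
        return DataFrame(ordered, self.columns)
```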

# Problem 56-2¶

For blog post draft #3, I want you to do the following:

George: Linear and Logistic Regression, Part 1: Understanding the Models

1. Make plots of the images you wanted to include, and insert them into your post. You can use character arguments in plt.plot(); here's a reference

2. You haven't really hit on why the logistic model takes a sigmoid shape. You should talk about the $e^{\beta x}$ term, where $\beta$ is negative. What happens when $x$ gets really negative? What happens when $x$ gets really positive?

3. For your linear regression, you should use coefficients $\beta_0, \beta_1, \ldots$ just like you did in the logistic regression. This will help drive the point home that logistic regression is just a transformation of linear regression.

4. We're ready to move onto the text editing phase! The next time you submit your blog post, put it in a Google Doc and share it with me so that I can make "suggestions" on it.

Colby: Linear and Logistic Regression, Part 2: Fitting the Models

1. You haven't defined what $y'$ means. Be sure to do that.

2. You've got run-on sentences and incorrect comma usage everywhere. Fix that. Read each sentence aloud and make sure it makes sense as a complete sentence.

3. In your explanation of the pseudoinverse, you say that it's a generalization of the matrix inverse when the matrix may not be invertible. That's correct. But you should also explain why, in our case, the matrix $X$ is not expected to be invertible. (Think: only square matrices are invertible. Is $X$ square?)

4. Your matrix entries are backwards. For example, the first row should be $x_{11}, x_{12}, \ldots, x_{1n}.$

5. Wherever you use ln, you should use \ln instead.

6. In your linear/logistic regression functions, you should use $x_1, x_2, \ldots, x_m$ instead of $a,b, \ldots z.$

7. In your examples of the linear and logistic regression, you should use 3-dimensional data points instead of just 2-dimensional data points. This way, your example can demonstrate how you deal with multiple input variables. Also, you should set up some context around the example. Come up with a concrete situation in which your data points could be observations, and you want to predict something.

Riley: Linear and Logistic Regression, Part 3: Categorical Variables, Interaction Terms, and Nonlinear Transformations of Variables

1. Instead of inserting a screenshot of the raw data, put it in a data table. See the template for how to create data tables.

2. You have some models in text: y = a(roast beef)+b(peanut butter) and y = a(roast beef)+b(peanut butter)+c(roast beef)(peanut butter). These should be on their own lines, as equations.

3. There are some sections where the wording is really confusing. Proofread your paper and make sure that everything is expressed clearly. For example, this is not expressed clearly:

So for example we could not regress y = x^a. The most important attribute of this is that we can plot a logistic regression using this method. This is possible because the format of the logistic regression is ....

4. At the very end, when you talk about transforming a dataset to fit a quadratic, it's not clear what you're doing. (I know what you're trying to say, but if I didn't already know, then I'd probably be confused.) You should explain how, in general, if we want to fit a nonlinear regression model $$y= \beta_1 f_1(x_1) + \beta_2 f_2(x_2) + \cdots,$$ then we have to transform the data as $$(x_1, x_2, \ldots ,y) \to (f_1(x_1), f_2(x_2), \ldots, y)$$ and then fit a linear regression to the points of the form $(f_1(x_1), f_2(x_2), \ldots, y).$

David: Predator-Prey Modeling with Euler Estimation

1. Fix your latex formatting -- follow the latex commandments.

2. Use pseudocode formatting (see the template)

3. In the predator-prey model that you stated, you need to explain where each term comes from. You've sort of hit on this below the model, but you haven't explicitly paired each individual term with its explanation. Also, why do we multiply the $DW$ together for some of the terms? Imagine that the reader knows what a derivative is, but has no experience using a differential equation for modeling purposes.

4. You should also explain that this equation is difficult to solve analytically, so that's why we're going to turn to Euler estimation.

5. You should explain where these recurrences come from. Why does this provide a good estimation of the function? (You should talk about rates of change) \begin{align*}D(t + \Delta t) &\approx D(t) + D'(t) \Delta t \\ W(t + \Delta t) &\approx W(t) + W'(t) \Delta t \end{align*}

6. When explaining Euler estimation, you should show the computations for the first few points in the plot. This way, the reader can see a concrete example of the process that you're actually carrying out to generate the plot.
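For items 5 and 6, a bare Euler update loop for a generic Lotka-Volterra-style system makes a good anchor. The coefficients below are made up purely for illustration (they are not David's model); the point is the recurrence $D(t+\Delta t) \approx D(t) + D'(t)\Delta t$ applied step by step:

```python
# Toy predator-prey derivatives with made-up coefficients
def dD(D, W):
    return 0.5 * D - 0.02 * D * W   # growth term minus interaction term

def dW(D, W):
    return 0.01 * D * W - 0.3 * W   # interaction term minus decay term

D, W, t, dt = 40.0, 10.0, 0.0, 0.1
for _ in range(3):   # show the first few Euler steps explicitly
    # simultaneous update: both derivatives use the OLD values of D and W
    D, W = D + dD(D, W) * dt, W + dW(D, W) * dt
    t += dt
    print(round(t, 1), round(D, 3), round(W, 3))
```

Printing the first few iterations like this is exactly the kind of concrete computation a reader can check by hand.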

Elijah: Solving Magic Squares using Backtracking

Make sure that the rest of your content is there on the next draft:

1. How can you overcome the inefficiency of brute-force search using "backtracking", i.e. intelligent search? https://en.wikipedia.org/wiki/Sudoku_solving_algorithms#Backtracking

2. Write some code for how to implement backtracking using a bunch of nested for loops (i.e. the ugly solution). Run some actual simulations to see how long it takes you to find a solution to a 3x3 magic square using backtracking. Then try the 4x4, 5x5, etc and make a graph.

3. How can you write the code more compactly using a single while loop?

# Problem 56-3¶

Complete Module 1 AND Module 2 of Sololearn's C++ Course. Take a screenshot of each completed module, with your user profile showing, and submit both screenshots along with the assignment.

# Problem 55-1¶

Location: Overleaf

Naive Bayes classification is a way to classify a new observation consisting of multiple features, if we have data about how other observations were classified. It involves choosing the class that maximizes the posterior distribution of the classes, given the observation.

\begin{align*} \text{class} &= \underset{\text{class}}{\arg\max} \, P(\text{class} \, | \, \text{observed features}) \\ &= \underset{\text{class}}{\arg\max} \, \dfrac{P(\text{observed features} \, | \, \text{class}) P(\text{class})}{P(\text{observed features})} \\ &= \underset{\text{class}}{\arg\max} \, P(\text{observed features} \, | \, \text{class}) P(\text{class})\\ &= \underset{\text{class}}{\arg\max} \, \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{class}) P(\text{class})\\ &= \underset{\text{class}}{\arg\max} \, P(\text{class}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{class}) \\ \end{align*}

The key assumption (used in the final line) is that all the features are independent:

\begin{align*} P(\text{observed features} \, | \, \text{class}) = \prod\limits_{\text{observed} \\ \text{features}} P(\text{feature} \, | \, \text{class}) \end{align*}

Suppose that you want to find a way to classify whether an email is a phishing scam or not, based on whether it has errors and whether it contains links.

After checking 10 emails in your inbox, you came up with the following data set:

1. No errors, no links; NOT scam
2. Contains errors, contains links; SCAM
3. Contains errors, contains links; SCAM
4. No errors, no links; NOT scam
5. No errors, contains links; NOT scam
6. Contains errors, contains links; SCAM
7. Contains errors, no links; NOT scam
8. No errors, contains links; NOT scam
9. Contains errors, no links; SCAM
10. No errors, contains links; NOT scam

Now, you look at 4 new emails. For each of the new emails, compute

$$P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam})$$

and decide whether it is a scam.

a. No errors, no links. You should get

$$P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = 0 \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{4}.$$

b. Contains errors, contains links. You should get

$$P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = \dfrac{3}{10} \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{20}.$$

c. Contains errors, no links. You should get

$$P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = \dfrac{1}{10} \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{20}.$$

d. No errors, contains links. You should get

$$P(\text{scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{scam}) = 0 \\[10pt] \text{and} \\[10pt] P(\text{not scam}) \prod\limits_{\text{observed}\\ \text{features}} P(\text{feature} \, | \, \text{not scam}) = \dfrac{1}{4}.$$
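A short script to check the answers above. The 0/1 feature encoding (1 = "contains", 0 = "no") and the function name `score` are just conveniences here, and exact fractions avoid floating-point noise:

```python
from fractions import Fraction

# (has_errors, has_links, label) for the 10 checked emails, in order
emails = [
    (0, 0, 'not scam'), (1, 1, 'scam'),     (1, 1, 'scam'),
    (0, 0, 'not scam'), (0, 1, 'not scam'), (1, 1, 'scam'),
    (1, 0, 'not scam'), (0, 1, 'not scam'), (1, 0, 'scam'),
    (0, 1, 'not scam'),
]

def score(has_errors, has_links, label):
    # P(class) * P(errors feature | class) * P(links feature | class)
    in_class = [e for e in emails if e[2] == label]
    p_class = Fraction(len(in_class), len(emails))
    p_errors = Fraction(sum(e[0] == has_errors for e in in_class), len(in_class))
    p_links = Fraction(sum(e[1] == has_links for e in in_class), len(in_class))
    return p_class * p_errors * p_links

print(score(1, 1, 'scam'), score(1, 1, 'not scam'))  # 3/10 1/20
```

Whichever label scores higher is the classification; part (b), for instance, comes out scam.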

# Problem 55-2¶

Refactor your Hodgkin-Huxley neuron simulation so that the functions governing the internal state of the neuron are encapsulated within a class Neuron.

>>> def stimulus(t):
    if t > 10 and t < 11:
        return 150
    elif t > 20 and t < 21:
        return 150
    elif t > 30 and t < 40:
        return 150
    elif t > 50 and t < 51:
        return 150
    elif t > 53 and t < 54:
        return 150
    elif t > 56 and t < 57:
        return 150
    elif t > 59 and t < 60:
        return 150
    elif t > 62 and t < 63:
        return 150
    elif t > 65 and t < 66:
        return 150
    return 0
>>> neuron = Neuron(stimulus)
>>> neuron.plot_activity()

The above code should generate the SAME plot that you generated previously.

Note: Do NOT make plot_activity() into a gigantic function. Rather, you should keep your code modular, using helper functions when appropriate. Multiple helper functions will be needed to achieve this implementation with good code quality.

# Problem 55-3¶

Create an EconomicEngine that handles the following:

• CP income
• maintenance costs
• ship removals
• ship/technology purchases

This will be in the spirit of how MovementEngine handles movement and CombatEngine handles combat.

In your EconomicEngine, include a method generate_economic_state() that generates the following information:

economic_state = {
'income': 20,
'maintenance cost': 5
}
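A structural sketch of that method is below. The attribute names (`colonies`, `ships`) are placeholders, not part of the required interface, and a full implementation would fold in more income sources; the +3 CP/colony and -1 CP/ship figures come from the economic phase examples later on this page:

```python
class EconomicEngine:
    # Sketch only -- constructor arguments are illustrative placeholders.
    def __init__(self, colonies, ships):
        self.colonies = colonies   # e.g. ['home colony']
        self.ships = ships         # e.g. [{'type': 'Scout', 'maintenance': 1}]

    def generate_economic_state(self):
        income = 3 * len(self.colonies)   # +3 CP per colony
        maintenance = sum(ship['maintenance'] for ship in self.ships)
        return {'income': income, 'maintenance cost': maintenance}

engine = EconomicEngine(colonies=['home colony'],
                        ships=[{'type': 'Scout', 'maintenance': 1}] * 3)
print(engine.generate_economic_state())  # {'income': 3, 'maintenance cost': 3}
```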

# Problem 54-1¶

Location: Overleaf

Note: Points will be deducted for poor latex quality. If you're writing up your latex and anything looks off, make sure to post about it so you can fix it before you submit. FOLLOW THE LATEX COMMANDMENTS!

The dataset below displays the ratio of ingredients for various cookie recipes.

['ID', 'Cookie Type' ,'Portion Eggs','Portion Butter','Portion Sugar','Portion Flour' ]

[[ 1 , 'Shortbread'  ,     0.14     ,       0.14     ,      0.28     ,     0.44      ],
[ 2  , 'Shortbread'  ,     0.10     ,       0.18     ,      0.28     ,     0.44      ],
[ 3  , 'Shortbread'  ,     0.12     ,       0.10     ,      0.33     ,     0.45      ],
[ 4  , 'Shortbread'  ,     0.10     ,       0.25     ,      0.25     ,     0.40      ],
[ 5  , 'Sugar'       ,     0.00     ,       0.10     ,      0.40     ,     0.50      ],
[ 6  , 'Sugar'       ,     0.00     ,       0.20     ,      0.40     ,     0.40      ],
[ 7  , 'Sugar'       ,     0.10     ,       0.08     ,      0.35     ,     0.47      ],
[ 8  , 'Sugar'       ,     0.00     ,       0.05     ,      0.30     ,     0.65      ],
[ 9  , 'Fortune'     ,     0.20     ,       0.00     ,      0.40     ,     0.40      ],
[ 10 , 'Fortune'     ,     0.25     ,       0.10     ,      0.30     ,     0.35      ],
[ 11 , 'Fortune'     ,     0.22     ,       0.15     ,      0.50     ,     0.13      ],
[ 12 , 'Fortune'     ,     0.15     ,       0.20     ,      0.35     ,     0.30      ],
[ 13 , 'Fortune'     ,     0.22     ,       0.00     ,      0.40     ,     0.38      ]]

Suppose you're given a cookie recipe and you want to determine whether it is a shortbread cookie, a sugar cookie, or a fortune cookie. The cookie recipe consists of 0.10 portion eggs, 0.15 portion butter, 0.30 portion sugar, and 0.45 portion flour. We will infer the classification of this cookie using the "$k$ nearest neighbors" approach.

Part 1: How to do $k$ nearest neighbors.

a. This cookie can be represented as the point $P(0.10, 0.15, 0.30, 0.45).$ Compute the Euclidean distance between $P$ and each of the points corresponding to cookies in the dataset.

• NOTE: YOU DON'T HAVE TO SHOW YOUR CALCULATIONS. Just write a Python script to do the calculations for you and print out the results, and in your writeup you can just include the final results.

b. Consider the 5 points that are closest to $P.$ (These are the 5 "nearest neighbors".) What cookie IDs are they, and what types of cookies are represented by these points?

c. What cookie classification showed up most often in the 5 nearest neighbors? What inference can you make about the recipe corresponding to the point $P$?

Part 2: The danger of using too large a $k$

a. What happens if we try to perform the $k$ nearest neighbors approach with $k=13$ (i.e. the full dataset) to infer the cookie classification of point $P?$ What issue occurs, and why does it occur?

b. For each classification of cookie, find the average distance between $P$ and the points corresponding to the cookies in that classification. Explain how this resolves the issue you identified in part (a).
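The note in part (a) suggests scripting the calculations; one possible sketch of such a script is below (rounding is for display only):

```python
data = [
    (1,  'Shortbread', 0.14, 0.14, 0.28, 0.44),
    (2,  'Shortbread', 0.10, 0.18, 0.28, 0.44),
    (3,  'Shortbread', 0.12, 0.10, 0.33, 0.45),
    (4,  'Shortbread', 0.10, 0.25, 0.25, 0.40),
    (5,  'Sugar',      0.00, 0.10, 0.40, 0.50),
    (6,  'Sugar',      0.00, 0.20, 0.40, 0.40),
    (7,  'Sugar',      0.10, 0.08, 0.35, 0.47),
    (8,  'Sugar',      0.00, 0.05, 0.30, 0.65),
    (9,  'Fortune',    0.20, 0.00, 0.40, 0.40),
    (10, 'Fortune',    0.25, 0.10, 0.30, 0.35),
    (11, 'Fortune',    0.22, 0.15, 0.50, 0.13),
    (12, 'Fortune',    0.15, 0.20, 0.35, 0.30),
    (13, 'Fortune',    0.22, 0.00, 0.40, 0.38),
]
P = (0.10, 0.15, 0.30, 0.45)

# (distance, ID, type) for every cookie, sorted nearest-first
distances = sorted(
    (sum((a - b) ** 2 for a, b in zip(row[2:], P)) ** 0.5, row[0], row[1])
    for row in data
)
for dist, cookie_id, cookie_type in distances:
    print(round(dist, 3), cookie_id, cookie_type)

# average distance from P to each classification (for Part 2b)
for cookie_type in ['Shortbread', 'Sugar', 'Fortune']:
    ds = [d for d, _, t in distances if t == cookie_type]
    print(cookie_type, round(sum(ds) / len(ds), 3))
```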

# Problem 54-2¶

Location: Overleaf

Note: Points will be deducted for poor latex quality. If you're writing up your latex and anything looks off, make sure to post about it so you can fix it before you submit. FOLLOW THE LATEX COMMANDMENTS!

Suppose you want to estimate the probability that you will get into a particular competitive college. You had a bunch of friends a year ahead of you that applied to the college, and these are their results:

• Martha was accepted. She was the 95th percentile of her class, got a 33 on the ACT, and had an internship at a well-known company the summer before she applied to college.

• Jeremy was rejected. He was in the 95th percentile of his class and got a 34 on the ACT.

• Alphie was accepted. He was in the 92nd percentile of his class, got a 35 on the ACT, and had agreed to play on the college's basketball team if accepted.

• Dennis was rejected. He was in the 85th percentile of his class, got a 30 on the ACT, and had committed to run on the college's track team if accepted.

# Problem 34-3¶

(Space Empires; 45 min)

Create another game events file for 3 turns using Combat Players. This time, use DESCENDING die rolls: 6, 5, 4, 3, 2, 1, 6, 5, 4, ...

The game events file for ascending die rolls is provided below. So, you should use the same template, using DESCENDING die rolls.

Save your file as notes/combat_player_game_events_descending_rolls.txt, and rename your other file to notes/combat_player_game_events_ascending_rolls.txt

This problem is worth 2 points for completion, and 2 points for being correct.

STARTING CONDITIONS

Players 1 and 2
are CombatTestingPlayers
have an initial fleet of 3 scouts, 3 colony ships, 4 ship yards

---

TURN 1 - MOVEMENT PHASE

Player 1:
Scouts 1,2,3: (2,0) --> (2,2)
Colony Ships 4,5,6: (2,0) --> (2,2)

Player 2:
Scouts 1,2,3: (2,4) --> (2,2)
Colony Ships 4,5,6: (2,4) --> (2,2)

COMBAT PHASE

Colony Ships are removed

| PLAYER |        SHIP        | HEALTH  |
----------------------------------------
|    1   |         Scout 1    |    1    |
|    1   |         Scout 2    |    1    |
|    1   |         Scout 3    |    1    |
|    2   |         Scout 1    |    1    |
|    2   |         Scout 2    |    1    |
|    2   |         Scout 3    |    1    |

Attack 1
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 1
Hit or Miss: Hit

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |
|    2   |         Scout 2        |    1    |
|    2   |         Scout 3        |    1    |

Attack 2
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 2
Largest Roll to Hit: 3
Die Roll: 2
Hit or Miss: Hit

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |
|    2   |         Scout 3        |    1    |

Attack 3
Attacker: Player 1 Scout 3
Defender: Player 2 Scout 3
Largest Roll to Hit: 3
Die Roll: 3
Hit or Miss: Hit

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |

Combat phase complete

------------------------

TURN 1 - ECONOMIC PHASE

Player 1

INCOME/MAINTENANCE (starting CP: 20)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts = -3 CP

PURCHASES (starting CP: 20)
ship size technology 2: -10 CP
destroyer: -9 CP

REMAINING CP: 1

Player 2

INCOME/MAINTENANCE (starting CP: 20)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0

PURCHASES (starting CP: 23)
ship size technology 2: -10 CP
destroyer: -9 CP

REMAINING CP: 4

------------------------

TURN 2 - MOVEMENT PHASE

Player 1:
Scouts 1,2,3: stay at (2,2)
Destroyer 1 : (2,0) --> (2,2)

Player 2:
Destroyer 1: (2,4) --> (2,2)

------------------------

TURN 2 - COMBAT PHASE

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Destroyer 1    |    1    |
|    2   |         Destroyer 1    |    1    |
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |

Attack 1
Attacker: Player 1 Destroyer 1
Defender: Player 2 Destroyer 1
Largest Roll to Hit: 4
Die Roll: 4
Hit or Miss: Hit

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Destroyer 1    |    1    |
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |

------------------------

TURN 2 - ECONOMIC PHASE

Player 1

INCOME/MAINTENANCE (starting CP: 1)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts -1 CP/Destroyer x 1 Destroyer = -4 CP

REMAINING CP: 0

Player 2

INCOME/MAINTENANCE (starting CP: 4)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0

PURCHASES (starting CP: 7)
scout: -6 CP

REMAINING CP: 1

------------------------

TURN 3 - MOVEMENT PHASE

Player 1:
Scouts 1,2,3: stay at (2,2)
Destroyer 1 : stay at (2,2)

Player 2:
Scout 1: (2,4) --> (2,2)

------------------------

TURN 3 - COMBAT PHASE

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Destroyer 1    |    1    |
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |
|    2   |         Scout 1        |    1    |

Attack 1
Attacker: Player 1 Destroyer 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 4
Die Roll: 5
Hit or Miss: Miss

Attack 2
Attacker: Player 1 Scout 1
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 6
Hit or Miss: Miss

Attack 3
Attacker: Player 1 Scout 2
Defender: Player 2 Scout 1
Largest Roll to Hit: 3
Die Roll: 1
Hit or Miss: Hit

| PLAYER |          SHIP          | HEALTH  |
---------------------------------------------
|    1   |         Destroyer 1    |    1    |
|    1   |         Scout 1        |    1    |
|    1   |         Scout 2        |    1    |
|    1   |         Scout 3        |    1    |

------------------------

TURN 3 - ECONOMIC PHASE

Player 1

INCOME/MAINTENANCE (starting CP: 0)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: -1 CP/Scout x 3 Scouts -1 CP/Destroyer x 1 Destroyer = -4 CP

REMOVALS
remove scout 3 due to inability to pay maintenance costs

REMAINING CP: 0

Player 2

INCOME/MAINTENANCE (starting CP: 1)
colony income: +3 CP/Colony x 1 Colony = +3 CP
maintenance costs: 0

REMAINING CP: 4