2 minute read

SQL, precision, recall, and fun with Numpy

What I did

  • Reviewed, practiced, and improved my skills with SQL
  • Read some practice data science questions and answers

What I learned

  • The precision-recall trade-off (again)
\[\text{Precision} = \frac{\text{\# of True Positives}}{\text{\# of True Positives + \# of False Positives}} = \frac{\text{True Positives}}{\text{Predicted Positives}}\]


\[\text{Recall} = \frac{\text{\# of True Positives}}{\text{\# of True Positives + \# of False Negatives}} = \frac{\text{True Positives}}{\text{Actual Positives}}\]

  • Think of precision as the percent of your positive predictions that are actually relevant, and recall as the percent of the truly relevant items that you managed to find. For example, say you are classifying cats and dogs, and your classifier predicts all 20 pictures as cats. If, in reality, 10 are cats and 10 are dogs, then your precision for predicting cats is 50% but your recall for predicting cats is 100%.

\(\text{Precision} = \frac{10~\text{cats}}{10~\text{cats} + 10~\text{dogs called cats}} = 50\%\) \(\text{Recall} = \frac{10~\text{cats}}{10~\text{cats} + 0~\text{missed cats}} = 100\%\)

The tradeoff is that chasing higher recall usually means letting in more false positives, which drags precision down. High precision is important for things like search results where you want only relevant results. JSTOR, for example, has shitty precision and high recall, unless you really dig deep into the advanced search features.
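Here’s a minimal numpy sketch of that cats-and-dogs example (the 1 = cat / 0 = dog encoding is just my assumption for illustration):

import numpy as np

# 1 = cat, 0 = dog (label encoding assumed for this sketch)
actual    = np.array([1]*10 + [0]*10)  # 10 real cats, 10 real dogs
predicted = np.ones(20, dtype=int)     # classifier calls everything a cat

tp = np.sum((predicted == 1) & (actual == 1))  # 10 true positives
fp = np.sum((predicted == 1) & (actual == 0))  # 10 false positives
fn = np.sum((predicted == 0) & (actual == 1))  # 0 false negatives

precision = tp / (tp + fp)  # 0.5
recall    = tp / (tp + fn)  # 1.0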

  • Numpy Euclidean Distance
    • np.linalg.norm(np_array1 - np_array2). Passing the difference of two numpy vectors to this method gives the Euclidean distance between them. Technically, with its default parameter it returns the Frobenius norm for matrices and the 2-norm for vectors, and for 1D arrays the 2-norm of the difference is exactly the Euclidean distance. Euclidean distance is the p = 2 case of the more general Minkowski distance (which is a proper metric for p ≥ 1).
import numpy as np

vec1 = [1,2,3,4,5]
vec2 = [6,7,8,9,10]

def euclidean_dist(vec1, vec2):
  # the norm of the difference vector is the Euclidean distance
  return np.linalg.norm(np.array(vec1) - np.array(vec2))
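Calling it on the vectors above (the difference is [-5, -5, -5, -5, -5], so the value is \(\sqrt{125}\)):

euclidean_dist(vec1, vec2)  # sqrt(125) ≈ 11.18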
  • Numpy Cosine Distance
    Recall that the cosine similarity between two vectors is \(\frac{a \cdot b}{\|a\| \, \|b\|}\), where \(\|a\|\) is the Euclidean norm of \(a\), i.e. \(\sqrt{a_1^2 + a_2^2 + \dots}\). Cosine distance is 1 minus this similarity.
vec1 = [1,2,3,4]
vec2 = [5,6,7,8]

# 1st way
def cosine_dist(vec1, vec2):
  vec1 = np.array(vec1)
  vec2 = np.array(vec2)
  # dot product divided by the product of the norms
  # note: this returns the cosine similarity; subtract from 1 for the distance
  return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# 2nd way
def cosine_dist(vec1, vec2):
  vec1 = np.array(vec1)
  vec2 = np.array(vec2)
  return (vec1 @ vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
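As a sanity check, assuming scipy is available (its cosine function returns the distance, i.e. 1 minus the similarity):

from scipy.spatial.distance import cosine

cosine_dist(vec1, vec2)  # ≈ 0.969 (similarity)
1 - cosine(vec1, vec2)   # ≈ 0.969, matches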
  • Don’t forget about np.where(boolean condition, value if true, value if false)
vec = [1,2,3,4]

def truncate(vec):
  vec = np.array(vec)
  return np.where(vec > 1, 1, vec).tolist()
  # without the tolist() it returns a numpy array

# this is the equivalent
def truncate(vec):
  return [1 if val > 1 else val for val in vec]
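Both versions clip anything above 1 down to 1 and leave the rest alone:

truncate([1,2,3,4])  # [1, 1, 1, 1]
truncate([0,1,5])    # [0, 1, 1]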
  • An unstable NN will give you NaNs, and it’s important to check whether you have any.
vec = [1,2,3,4]

def any_nans(vec):
  return np.isnan(np.array(vec)).any()

# or also, less elegantly
def any_nans(vec):
  for value in np.array(vec).ravel():
    if np.isnan(value):
      return True
  return False
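Quick check with both versions:

any_nans([1,2,3,4])          # False
any_nans([1, np.nan, 3, 4])  # True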
  • *args as a parameter lets you pass more positional arguments than are explicitly named in the signature.
def print_2x(*args):
  # args arrives as a tuple of all the positional arguments
  print([2*num for num in args])

print_2x(1,2,5,5,6,7)
# [2, 4, 10, 10, 12, 14]

What I will do next

  • More SQL practice and take the IBM coding test