##### A bit of statistics and probability

###### 16-Jan-2016 | Anton Alex Nesterov

While evil geniuses are trying to born new age terminator in google ad lab cellars I would like to share my attempt to make predictions based on naive bayes classification. If you know something about statistics and probability theory you might know that ‘naive bayes’ isn’t actually so naive. If you don’t know anything about probability theory, please don’t be afraid - it is a very intuitive field.

#### Probability in simple terms

We can’t know the future but we can count it - this is pretty much of what probability theory does. When you throw a coin you don’t know which side is going to be atop (tail or head), but you know that there is 50% chance for each side. What would be chances if you throw it two times? Well, actually you can list all possible outcomes ([Head,Head] [Head,Tail] [Tail,Head] [Tail,Tail]), when a mathematician counts proportions of all outcomes he usually calls it ‘probability distribution’. In this example we can see that possibility to get both tails or both heads is lower than to get tail-head or head-tail, this type of distribution called ‘normal’ or ‘Gaussian’ probability distribution and this propritions work for any number of possibilities, there are gaussian formula to distribute ‘continues’ values and pascals rule for ‘disctrete’. We will take this principle and the Bayes’ Law to find probabilities. If you want to master this kind of analytics just type ‘probability theory’ on youtube and you’ll find a lot of decent examples.

#### Imagine a ~~company~~ client

Let’s imagine that in some parallel universe we owe a music label. Our label has a decent collection of experimental music and we decided to create a web site to provide our customers an access and attract new customers. Our first issue is define some common vector of new client’s preferences using only his social profile. Thankfully for years of work our company gained thousands of social network profiles with related observations and we can benefit from that. So we decided to add a simple suggestion algorithm to our website. Let’s say if an user decides to share his social network profile our algorithm will try to guess what kind of music the user would tend to listen.

#### Approach

I should say that under the hood of such services the suggestion algorithms usually use something like k-nearest neighbor. Anyway bayes pretty good at predicting a main vector of user’s preferences.

Let’s assume that we have a thousand randomly selected user profiles. The profiles have following data: Age, Sex, Relationship status, Professional Area, Number of Friends, Views on smoking, Views on alcohol, Education Level, Prefered Movies/Books Genre, Preferred Music, etc.

I used a random dataset generator so It happened that in our parrallel universe people can be comletely mature in 14 years, here’s the dataset we have.

#### Preparing the data

In order to realize now to count possibilities let’s take a sample from our dataset.

Age | Sex | Relationship Status | Professional Area | Number of Friends | Views on smoking | Views on alcohol | Education Level | Prefered Movie Genre | Prefered Music Genres |
---|---|---|---|---|---|---|---|---|---|

47 | Male | Divorced | Research and Development | 194 | Negative | Positive | PhD | Detective | Industrial Rock, Neo Classical, Rap, IDM |

We have to take in consideration that the genre preferences can be somehow related to each other and other attributes. That means we need to find all genres in dataset at first place

In our estimation it would look like following:

Age | Sex | Relationship Status | Professional Area | ..n.. | Music Genre #1 | Music Genre #2 | Music Genre #3 | Music Genre #..n.. |
---|---|---|---|---|---|---|---|---|

47 | Male | Divorced | Research and Development | ..n.. | Positive | Positive | Negative | Positive |

22 | Female | Single | Advertising | ..n.. | Negative | Positive | Positive | Positive |

30 | Male | Divorced | Research and Development | ..n.. | Negative | Positive | Negative | Positive |

#### What to count?

First, we need to split dataset to training set which will be used to make predictions, and a testing set which we will use to test our suggestion algorithm. Naturally a training set has to be larger than test values. Let’s split it 70% to 30%.

Secondly, we have to do some frequency estimations. We need to count how many times each attribute corresponds to the preferred music genre in train set. We also need to know total number of people related to each category.

#### Following code counts and structurizes all needed values using dataset:

This code counts all occurrences of a particular attribute and frequency of attribute occurrences when a genre is specified.

```
import csv, math
class DataSet(object):
def __init__(self, path):
self.row_names = [
'sex',
'relationship status',
'professional field',
'number of friends',
'views on smoking',
'views on alcohol',
'education level',
'prefered movie genre'
]
self.trainset = []
self.testset = []
with open(path) as f:
self.data = list(csv.reader(f, delimiter='|'))[1:]
#split dataset
import random
trainSize = int(len(self.data)*0.7)
data = list(self.data)
while len(self.trainset) < trainSize:
self.trainset.append(data.pop(random.randrange(len(data))))
self.testset = data
self.genres = {}
self.count()
# find average values
def stdev(numbers):
avg = sum(numbers) / len(numbers)
variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
return math.sqrt(variance)
for k in self.genres.keys():
age = self.genres[k]['age']
self.genres[k]['age_stdev'] = stdev(age)
self.genres[k]['age'] = sum(age) / len(age)
friends = self.genres[k]['number of friends']
self.genres[k]['number of friends stdev'] = stdev(friends)
self.genres[k]['number of friends'] = sum(friends) / len(friends)
def count(self):
"""Count frequecies for each genre"""
for row in self.trainset:
if (not len(row)):
continue
row_names = self.row_names
#count total
for i,e in enumerate(row[1:-1]):
attr_ = self.__dict__.get(row_names[i].replace(" ","_"),{})
attr_[e] = attr_.get(e,0)+1
setattr(self, row_names[i].replace(" ","_"), attr_)
genres = row[-1].split(',')
for genre in genres:
genre = genre.strip()
if not genre: continue
if not self.genres.get(genre):
self.genres[genre] = {}
genr = self.genres[genre]
genr['prefered'] = genr.get('prefered', 0) + 1
#add age
ages_ = genr.get('age',[])
ages_.append(int(row[0]))
genr['age'] = ages_
# count frequecy values for base attributes
for i,e in enumerate(row[1:-1]):
if (row_names[i]=='number of friends'):
data_ = genr.get(row_names[i],[])
data_.append(int(e))
genr[row_names[i]] = data_
continue
#add to genre
data_ = genr.get(row_names[i],{})
data_[e] = data_.get(e,0)+1
genr[row_names[i]] = data_
#add frequecies genres that occured with this genre
for other in [i for i in genres if i!=genre]:
other = other.strip()
if not other: continue
genr[other] = genr.get(other,0) + 1
self.genres[genre] = genr
def gauss(self, x, mean, stdev):
exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent
```

#### Usage:

```
dataset = DataSet('parrallel-universe-people.csv')
#How many females?
dataset.sex['Female'] #-> ~278
#How many smokers?
dataset.views_on_smoking['Positive'] #-> ~296
#How many smokers do listen Raggy?
dataset.genres['Raggy']['views on smoking']['Positive'] #-> ~55
```

#### A bit of math

Now when we’ve got some calculations it is time to find out how can we benefit from this numbers. Let’s see what can say previous code example. The data says that 55 out of 296 smokers would prefer to listen raggy among other music genres. So we have reasons to believe that a smoker from the parallel universe has roughly 18.5% chance to listen raggy music and 81.5% chance to listen other genres. But we have more independent conditions which can affect preferences so this is the case when we can use Joint Probability Distribution.

Joint distribution for independent variables:

In our case we will need to find probability of the genre prefernce when particular profile attributes take place. And since we know exact proportions from the dataset we don’t need to distribute most of attribute possibilities. We only will need to use distribution formulas on the age and number of friend attributes. The genre with maximal probability value will be considered as a best guess, which means we will sort suggestions by argmax.

I will use the Gaussian formula to find probability values for the ‘age’ and ‘number of friends’ attributes.

#### Express It in Python:

```
def gauss(x, mean, dev):
return (1 / (math.sqrt(2*math.pi) * dev)) * math.exp(-(math.pow(x-mean,2)/(2*math.pow(dev,2)))
```

#### Let’s put it all together.

It’s time to express all this principles in python. Following code estimates probablities for each presented genre and sort it by argmax. Since we get a sorted list of suggestions I will normalize probability values by some integer in order to avoid issues with floating point arithmetics.

```
dataset = DataSet('parrallel-universe-people.csv')
import operator
def guessGenres(row):
genres = dataset.genres.keys()
values = {}
for genre in genres:
data = dataset.genres[genre]
length = len(dataset.trainset)
double = lambda x: int(x * 1000) / 1000.0 #round to thousands
age_prob = double( dataset.gauss(int(row[0]),data['age'], data['age_stdev']) )
sex_prob = double( data['sex'].get(row[1],0) / length )
rel_prob = double( data['relationship status'].get(row[2],0) / length )
pro_prob = double( data['professional field'].get(row[3],0) / length )
fre_prob = double( dataset.gauss(int(row[4]),data['number of friends'], data['number of friends stdev']) )
smo_prob = double( data['views on smoking'].get(row[5],0) / length )
alc_prob = double( data['views on alcohol'].get(row[6],0) / length )
edu_prob = double( data['education level'].get(row[7],0) / length )
gen_prob = double( data['prefered movie genre'].get(row[8],0) / length )
probs = [age_prob, sex_prob, rel_prob, pro_prob, fre_prob, smo_prob, alc_prob, edu_prob, gen_prob]
for genr in row[9].split(','): #append other genres to the estimation
genr = genr.strip()
if not genr: continue
probs.append(double(data.get(genr,0)/ length ))
result = 1
for x in probs:
result *= (x*100) #normalize x
#print(result)
values[genre] = result
return dict(sorted(values.items(), key=operator.itemgetter(1)))
```

Finally, in order to check efficiency of the algorightm lets try it on our testing set. Since the data in this example was generated randomly, I will estimate efficiency using 3 genres instead of one - it has to increase chances.

```
success=0
fail = 0
for i in dataset.testset:
guess = guessGenres(dataset.testset[0])
#print(guess)
if i[-1].strip() and \
list(guess.keys())[0] in i[-1] or\
list(guess.keys())[1] in i[-1] or\
list(guess.keys())[2] in i[-1]:
success += 1
elif i[-1].strip():
fail += 1
print(len(dataset.testset))
print(success, fail)
```

I ran this kind of test a dosen of times generating random datasets every time , the algorithm has shown 60% accuracy in average. Math says that if I do it a big number of times it would be near to 50% which is just right value for randomly generated data. Out of experience i can say that we can expect from similar algorithm roughly 75% accuracy on real data.