TECH: Ganitha – Naive-Bayes Classifiers

Andres Perez

Jun 23, 2014

This post discusses the implementation of Naive-Bayes classification in Ganitha, Tresata’s open-source machine-learning library built on Scalding. A Naive-Bayes classifier is a probabilistic classifier based on Bayes’ theorem. The underlying model is “naive” because it assumes that the attributes are conditionally independent of each other given the class. Despite this simplifying assumption, Naive-Bayes learning is surprisingly effective in a wide range of applications. Though not as powerful as decision-tree learning, it is considerably less computationally complex than many other forms of classifiers, and in many cases the naive assumption has little impact on the quality of predictions.

Naive-Bayes Classifying

Ganitha supplies three of the more popular forms of Naive-Bayes classifiers: Gaussian, multinomial, and Bernoulli. Gaussian Naive-Bayes is used for continuous data and assumes that the features associated with each class are normally distributed. The multinomial and Bernoulli event models deal with discrete features, a common example being the classification of a document given the words (features) present in its text. In this case, each word has a score assigned to it for each label, or class. In multinomial Naive-Bayes, each feature vector holds the term frequencies of the words found in a document. We make the ‘bag-of-words’ assumption, in which a document is represented as a multiset of words, disregarding grammar and word order. In Bernoulli Naive-Bayes, features represent binary occurrences, and the absence of a word/feature also affects the calculated probabilities.
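To make the difference between the three event models concrete, here is a minimal sketch of the per-class log-likelihood each one computes. It is plain Scala written for illustration and is not taken from Ganitha; the object name `EventModels`, the function names, and the parameter names are all hypothetical, and the parameters (mean, variance, term probabilities) are assumed to have been estimated already and to be strictly between their bounds so that no logarithm of zero is taken.

```scala
// Illustrative only: hypothetical names, not Ganitha's API.
object EventModels {
  // Gaussian: log density of x under a normal distribution with the given mean and variance.
  def gaussianLogLik(x: Double, mean: Double, variance: Double): Double =
    -0.5 * math.log(2 * math.Pi * variance) - (x - mean) * (x - mean) / (2 * variance)

  // Multinomial: each term contributes its count times log(theta_i).
  // The multinomial coefficient is omitted because it is the same for every class.
  def multinomialLogLik(counts: Array[Double], theta: Array[Double]): Double =
    counts.zip(theta).map { case (c, t) => c * math.log(t) }.sum

  // Bernoulli: a present feature contributes log(p_i), an absent one log(1 - p_i).
  def bernoulliLogLik(present: Array[Boolean], p: Array[Double]): Double =
    present.zip(p).map { case (b, pi) => if (b) math.log(pi) else math.log(1 - pi) }.sum
}
```

Note that in the multinomial function a term with a count of zero adds nothing to the sum, whereas in the Bernoulli function an absent feature still contributes `log(1 - p_i)`. That is the sense in which absence affects the Bernoulli model.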

Each classifier consists of a training phase, where an NBModel is constructed from the training set of data, and a classifying, or predicting, phase. In the classifying phase, each data point to be classified is given a probability for each label (a log probability is used here, which avoids numerical underflow when many small probabilities are multiplied), and the label with the highest, or *maximum a posteriori*, probability is assigned to the data point.
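The two phases can be sketched in a few lines of plain, in-memory Scala for the multinomial case. This is an illustration under stated assumptions rather than Ganitha's implementation: the `Model` case class stands in for the NBModel mentioned above and is hypothetical, as are the function names, and the additive (Laplace) smoothing parameter `alpha` is an assumption added here so that unseen terms do not produce a log of zero.

```scala
// Illustrative only: an in-memory sketch, not Ganitha's Scalding-based code.
object MultinomialNBSketch {
  // Hypothetical model container: log prior and per-term log-probabilities for each label.
  case class Model(logPrior: Map[String, Double], logLik: Map[String, Array[Double]])

  // Training phase: estimate log priors and smoothed per-class term log-probabilities.
  def train(data: Seq[(String, Array[Double])], alpha: Double = 1.0): Model = {
    val n = data.size.toDouble
    val byLabel = data.groupBy(_._1)
    val logPrior = byLabel.map { case (label, rows) => label -> math.log(rows.size / n) }
    val logLik = byLabel.map { case (label, rows) =>
      // Sum the term counts over all training vectors of this class.
      val totals = rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      val denom = totals.sum + alpha * totals.length
      label -> totals.map(c => math.log((c + alpha) / denom))
    }
    Model(logPrior, logLik)
  }

  // Log posterior (up to an additive constant) of each label for one feature vector.
  def scores(m: Model, x: Array[Double]): Map[String, Double] =
    m.logPrior.map { case (label, lp) =>
      label -> (lp + x.zip(m.logLik(label)).map { case (c, l) => c * l }.sum)
    }

  // Classifying phase: assign the maximum a posteriori label.
  def classify(m: Model, x: Array[Double]): String = scores(m, x).maxBy(_._2)._1
}
```

For example, training on `Seq(("spam", Array(3.0, 0.0)), ("ham", Array(0.0, 3.0)))` and then calling `classify` on `Array(2.0, 0.0)` compares `log 0.5 + 2 log 0.8` for "spam" against `log 0.5 + 2 log 0.2` for "ham", so the first label wins.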

Support for Vector Types

Ganitha provides a simple framework for supporting additional vector types. By creating an object extending the VectorHelper or DenseVectorHelper class and implementing the supported methods, you can add support for a custom vector type from an outside library to use with Naive-Bayes. As an example, the code to add support for Jblas using vectors backed by org.jblas.DoubleMatrix objects is as follows:

import org.jblas.{ DoubleMatrix => JblasVector }
import scala.math.abs

object JblasVectorHelper extends DenseVectorHelper[JblasVector] {
  def plus(v1: JblasVector, v2: JblasVector) = v1.add(v2)
  // Copy the data first so that the in-place multiply leaves v untouched.
  def scale(v: JblasVector, k: Double) = { val v2 = new JblasVector(v.data.clone); v2.mmuli(k) }
  def toString(v: JblasVector) = v.toString
  def size(v: JblasVector) = v.rows
  def sum(v: JblasVector) = v.sum
  def dot(v1: JblasVector, v2: JblasVector) = v1.dot(v2)
  def map(v: JblasVector, f: Double => Double) = new JblasVector(v.data.clone.map(f))
  def l1Distance(v1: JblasVector, v2: JblasVector): Double = v1.distance1(v2)
  def euclidean(v1: JblasVector, v2: JblasVector): Double = v1.distance2(v2)
  def cosine(v1: JblasVector, v2: JblasVector): Double = {
    val dotProd = v1.dot(v2)
    // Don't waste calculations on orthogonal vectors or 0;
    // returning here also avoids dividing by a zero norm.
    if (abs(dotProd) < 0.00000001) 1.0
    else {
      val denom = v1.norm2 * v2.norm2
      1.0 - abs(dotProd / denom)
    }
  }
  def iterator(v: JblasVector) = v.data.iterator
}

We’ve added support for some popular vector representations from open-source libraries, including Mahout, Breeze, Jblas, and Saddle.

You can see our code for implementing Naive-Bayes classifiers, and give it a run yourself, at our GitHub page. We welcome contributions to Ganitha, as well as suggestions for what machine-learning applications built on Scalding you’d like to see open-sourced next!
