
Comparing Classifiers in R
(2016)


Dataset

The following writeup compares various classifiers on a face-matching task. The training dataset can be found here:
https://courses.engr.illinois.edu/cs498df3/data/pubfig_dev_50000_pairs.txt

Each vector consists of a label followed by attribute measurements for two faces, as produced by a complex vision program. The label indicates whether the faces belong to the same class (i.e., the same person), and the rest of the vector consists of (attribute values for face 1) followed by (attribute values for face 2). The dataset is large (around 100 MB). Various classifiers were trained on it, and accuracy was reported on an unseen evaluation set via an online Kaggle competition.
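As a quick sanity check (a sketch, not part of the original code), the layout can be confirmed after downloading; this assumes the de-headered, tab-separated copy that the training code below reads:

#sketch: confirm the pairs-file layout (assumes tab-separated values, header removed)
pairs <- read.table('pubfig_dev_50000_pairs_no_header.txt', sep='\t', header=F);
dim(pairs);       #expect 147 columns: 1 label, then 73 attributes for each of 2 faces
table(pairs[,1]); #label counts: 1 = same person, 0 = different people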

Results

For each of the classifiers below, training was performed on all data from “pubfig_train_50000_pairs.txt”. Where applicable and when time permitted, parameter tuning was performed using the “pubfig_kaggle_#.txt” files. The classification accuracy of each trained classifier is shown for the training data itself, one of the validation sets, and the test/evaluation data. The test accuracy was reported by Kaggle, as the ground-truth labels for the evaluation set are not available.
[Figure: Accuracies for all classifiers]
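Since evaluation-set accuracy comes back from Kaggle, predictions have to be written out in the competition's submission format. A minimal sketch is below, given a vector of predicted labels from any of the classifiers that follow; the two-column (Id, Prediction) layout is an assumption, not the confirmed competition template.

#sketch: write predicted labels to a CSV for Kaggle scoring
#NOTE: the (Id, Prediction) column layout here is assumed, not confirmed
submission <- data.frame(Id = seq_along(predictedLabels), Prediction = predictedLabels);
write.csv(submission, 'kaggle_submission.csv', row.names = FALSE);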

Implementation

Basic Setup

#------------------------------------- SETUP WORK --------------------------------------
#clear the workspace and console
rm(list=ls())
cat("\014") #code to send ctrl+L to the console and therefore clear the screen
 
#install packages as necessary:
#install.packages("klaR"); install.packages("caret"); install.packages("randomForest")
#install.packages("RANN")
#install.packages("cluster")
 
#import libraries
library(klaR)
library(caret)
library(stringr)
library(randomForest)
#note: svmlight() is a function in klaR (loaded above); it also requires the SVMlight binaries
library(RANN) #for finding approximate nearest neighbors
library(cluster)
library(stats)
 
#----------------------------------------Setup the workspace--------------------------------
setwd('~/WorkingDirectoryNameGoesHere')



Pre-Process Features

The key to an initial performance jump was pre-processing the features. Each example consists of two feature vectors; these were replaced by a single vector of scaled, per-attribute absolute differences between the two (i.e., how far apart the two faces are along each attribute, standardized per column).
###-------------------------------TRAINING DATA---------------------------------------------
#retrieve all the data (each row is 147 items long: 1 label followed by 2x73 feature vectors)
data <- read.table('pubfig_dev_50000_pairs_no_header.txt',sep='\t',header=F);
 
#grab the labels
labels <- data[,1];
#grab the features (2x 73 features vectors)
features <- matrix(NA,nrow=nrow(data),ncol=(ncol(data)-1)/2);
#features <- data[,-1]; #accuracy: .53
 
#pre-process the features to yield a single 73-item vector: the scaled absolute difference between each respective pair of attributes
#feature set 1:
features <- scale(abs(as.matrix(data[,2:74]) - as.matrix(data[,75:ncol(data)]))); #accuracy: 0.7684
 
#feature set 2: feature set 1 + squared values of feature set 1
#features <- scale(abs(as.matrix(data[,2:74]) - as.matrix(data[,75:ncol(data)]))); #accuracy: 0.7684
#features <- cbind(features,features^2);
#sum(features[,74] == features[,1]^2) #checked



Defining Data Partitions

During the initial prototyping of the various classifiers, a smaller subset of the data (~20–50% of the training set) was used, split into training and testing portions. Surprisingly, results on these smaller splits were better than the final results from training on the entire training set and testing on the validation sets. For example, an SVM reached over 80% accuracy on a smaller train/test split, but no SVM reached 80% when trained on all the data. This may reflect human bias in selectively reporting results to ourselves. Prototyping on smaller sets was necessary, however, as re-training a classifier such as the RBF SVM on the entire dataset for each of many gamma values would have taken a long time.
###-------------------------------Validation data-------------------------------------------
val_featureData_1 <- read.table('pubfig_kaggle_1.txt', sep='\t', header=F);
val_featureData_2 <- read.table('pubfig_kaggle_2.txt', sep='\t', header=F);
val_featureData_3 <- read.table('pubfig_kaggle_3.txt', sep='\t', header=F);
val_labelsData_1 <- read.table('pubfig_kaggle_1_solution.txt', sep=",", header=F);
val_labelsData_2 <- read.table('pubfig_kaggle_2_solution.txt', sep=",", header=F);
val_labelsData_3 <- read.table('pubfig_kaggle_3_solution.txt', sep=",", header=F);
 
#grab the labels
val_labels_1 <- val_labelsData_1[,2];
val_labels_2 <- val_labelsData_2[,2];
val_labels_3 <- val_labelsData_3[,2];
 
val_features_1 <- matrix(NA,nrow=nrow(val_featureData_1),ncol=ncol(val_featureData_1)/2);
val_features_1 <- scale(abs(as.matrix(val_featureData_1[,1:73]) - as.matrix(val_featureData_1[,74:ncol(val_featureData_1)])));
 
val_features_2 <- matrix(NA,nrow=nrow(val_featureData_2),ncol=ncol(val_featureData_2)/2);
val_features_2 <- scale(abs(as.matrix(val_featureData_2[,1:73]) - as.matrix(val_featureData_2[,74:ncol(val_featureData_2)])));
 
val_features_3 <- matrix(NA,nrow=nrow(val_featureData_3),ncol=ncol(val_featureData_3)/2);
val_features_3 <- scale(abs(as.matrix(val_featureData_3[,1:73]) - as.matrix(val_featureData_3[,74:ncol(val_featureData_3)])));
 
 
 
###-------------------------------Evaluation/Testing DATA ----------------------------------
eval_featureData <- read.table('pubfig_kaggle_eval.txt', sep='\t', header=F);
eval_features <- matrix(NA,nrow=nrow(eval_featureData),ncol=ncol(eval_featureData)/2);
eval_features <- scale(abs(as.matrix(eval_featureData[,1:73]) - as.matrix(eval_featureData[,74:ncol(eval_featureData)])));
 
 
 
#-------------------------split up the data for testing and training------------------------------
 
###IF USING THE VALIDATION DATA...
trainingLabels <- labels;
trainingFeatures <- features;
testLabels <- val_labels_1;     #or val_labels_2 / val_labels_3
testFeatures <- val_features_1; #or val_features_2 / val_features_3 (use eval_features when generating Kaggle predictions)
 
###IF ONLY USING THE TRAINING DATA... need to split it up into test and train portions
#there is too much data for rapid iteration, use this to scale down how big the initial pool is during prototyping
useDataIndices <- createDataPartition(y=labels, p=.5, list=FALSE); 
testDataIndices <- createDataPartition(y=labels[useDataIndices], p=.2, list=FALSE);
trainingLabels <- labels[useDataIndices]; trainingLabels <- trainingLabels[-testDataIndices];
testLabels <- labels[useDataIndices]; testLabels <- testLabels[testDataIndices];
trainingFeatures <- features[useDataIndices,]; trainingFeatures <- trainingFeatures[-testDataIndices,];
testFeatures <- features[useDataIndices,]; testFeatures <- testFeatures[testDataIndices,];



Linear SVM

A linear SVM worked surprisingly well, given that the data intuitively seems to extend radially out from an ideal distance of zero; a linear decision boundary was not expected to fit that structure so well.
#run an SVM
svm <- svmlight(trainingFeatures,trainingLabels,pathsvm='Path to svm goes here')
predictedLabels<-predict(svm, testFeatures) 
foo<-predictedLabels$class #"foo" = class labels (1 or 0) for each item in test set
#get classification accuracy:
accuracy<-sum(foo==testLabels)/(sum(foo==testLabels)+sum(!(foo==testLabels)))
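The accuracy line above just computes the fraction of predictions matching the labels. Since the same computation recurs for every classifier below, it could be factored into a small helper (a sketch, not in the original code):

#sketch of a helper: fraction of predictions that match the true labels
classification_accuracy <- function(predicted, actual) {
  mean(predicted == actual);
}
accuracy <- classification_accuracy(foo, testLabels);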



Naive Bayes

The naïve Bayes classifier took a long time to train and performed adequately, though it ended up with the lowest final accuracy.
#run Naive Bayes
model<-train(trainingFeatures, as.factor(trainingLabels), 'nb', trControl=trainControl(method='cv', number=10))
teclasses<-predict(model,newdata=testFeatures)
cm<-confusionMatrix(data=teclasses, reference=as.factor(testLabels))
accuracy<-cm$overall[1]



Random Forests

The random forest classifier took a long time to train, but performed well. It was nearly as accurate as the SVMs, with a training time between those of the linear SVM and the RBF SVM.
#run Random Forest
#variant 1: also pass ytest so randomForest reports test-set error directly
faceforest.allvals <- randomForest(x=trainingFeatures,y=trainingLabels,
                                   xtest=testFeatures,ytest=testLabels);
 
#variant 2: no test labels; threshold the predicted values manually below
faceforest.allvals <- randomForest(x=trainingFeatures,y=trainingLabels,
                                   xtest=testFeatures);
predictedLabels <- faceforest.allvals$test$predicted > .5
predictedLabels[predictedLabels] = 1;
predictedLabels[!predictedLabels] = 0;
foo <- predictedLabels;
accuracy<-sum(foo==testLabels)/(sum(foo==testLabels)+sum(!(foo==testLabels)))
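Because the labels here are numeric, randomForest runs in regression mode, which is why the test predictions above are thresholded at 0.5. A sketch of the alternative (assuming the same data): pass factor labels so the forest performs classification directly.

#sketch: with factor labels, randomForest runs in classification mode
#and returns class predictions directly (no manual threshold needed)
faceforest.cls <- randomForest(x=trainingFeatures, y=as.factor(trainingLabels),
                               xtest=testFeatures);
foo <- as.numeric(as.character(faceforest.cls$test$predicted));
accuracy <- mean(foo == testLabels);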



Approximate Nearest Neighbors

The classification strategy using nearest neighbors was guaranteed to yield 100% accuracy here, because each face's attribute vector appears verbatim in the reference dictionary: an exact nearest-neighbor lookup recovers the identity, and comparing the two looked-up names decides the pair. The question was whether an approximate nearest neighbor package could match that performance. The results indicate yes: both training and testing accuracy were 100%, meaning the approximate algorithm consistently selected the correct nearest neighbor, and did so significantly faster than an exact search.
#######################
#Part 2: Approximate Nearest Neighbors
#######################
# -have a reference dictionary of people (names) and many images of that person's face represented as attribute vectors
# -want to determine if two feature vectors represent face images of the same person... 
# -will use the reference dictionary as a lookup table, and simply find the nearest neighbor 
#   (in this case exact same feature vector) from the dictionary for both of the example's 
#   feature vectors. Compare the labels (names) of each found neighbor and see if they're the same name.
# -expect 100% accuracy here with a nearest neighbors classifier, but how close can we get to 100% with an 
#   approximate nearest neighbors package? Answer: basically 100%. 
# - Purpose: to demonstrate that approximate nearest neighbors can work just about the same as nearest neighbors
 
#retrieve the reference dictionary of names and associated face images
namedata <- read.table("pubfig_attributes.txt",header=F,sep="\t");
namedata.att <- namedata[,-c(1:2)];
 
n <- 10000;
face1.nn <- nn2(namedata.att,data[1:n,2:74],k=1);
face2.nn <- nn2(namedata.att,data[1:n,75:ncol(data)],k=1);
 
predicted <- namedata[face1.nn$nn.idx,1] == namedata[face2.nn$nn.idx,1]; #creates a boolean vector
predicted[predicted] = 1; #turn that boolean vector into 0's and 1's
accuracy = sum(data[1:n,1] == predicted) / length(predicted);
print(accuracy)
 
 
#using the evaluation data... (reuse the reference dictionary loaded above)
 
face1.nn <- nn2(namedata.att, eval_featureData[,1:73], k=1);
face2.nn <- nn2(namedata.att, eval_featureData[,74:ncol(eval_featureData)], k=1);
 
predicted <- namedata[face1.nn$nn.idx,1] == namedata[face2.nn$nn.idx,1]; #creates a boolean vector
predicted[predicted] = 1; #turn that boolean vector into 0's and 1's
 
#no ground-truth labels exist for the evaluation set, so these predictions
#were submitted to Kaggle for scoring rather than checked locally



Radial Basis Function SVM

As noted in the linear SVM section, a linear decision boundary was not expected to fit this data well; instead, an SVM using a radial basis function (RBF) kernel was predicted to perform better, which it did. An RBF SVM can produce a non-linear decision boundary by implicitly mapping the data to a higher-dimensional feature space. Although the RBF SVM yielded the highest accuracy, the gain was minimal and came at the cost of significantly longer training time. Both SVM variants were also tried with appended features equal to the square of the pre-processed feature vector; this did not improve performance, and those results were never generated for the entire dataset, so they are absent from the table. Most likely the extra features added little on top of the initial pre-processing.

The RBF SVM most likely worked well because the data was appropriately pre-processed and because the RBF kernel largely subsumes the linear and polynomial SVM variants. Unfortunately, the slightly better accuracy is probably not worth the much longer training time the RBF SVM requires.

#run a radial basis function (RBF) kernel SVM (in theory, this subsumes both the linear and polynomial kernels)
svm <- svmlight(trainingFeatures,trainingLabels, svm.options = "-t 2 -g .03", pathsvm='Path to svm goes here')
predictedLabels<-predict(svm, testFeatures) 
foo<-predictedLabels$class #"foo" = class labels (1 or 0) for each item in test set
#get classification accuracy:
accuracy<-sum(foo==testLabels)/(sum(foo==testLabels)+sum(!(foo==testLabels))) 
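The gamma value above (-g .03) was presumably chosen via the validation-file tuning described earlier. A minimal sketch of such a sweep, assuming the same svmlight setup and an illustrative grid of gamma values:

#sketch: sweep gamma and keep the value that scores best on validation set 1
#(the grid below is illustrative, not the exact values tried)
gammas <- c(0.01, 0.03, 0.1, 0.3);
val_acc <- sapply(gammas, function(g) {
  svm_g <- svmlight(trainingFeatures, trainingLabels,
                    svm.options=paste("-t 2 -g", g), pathsvm='Path to svm goes here');
  mean(predict(svm_g, val_features_1)$class == val_labels_1);
});
best_gamma <- gammas[which.max(val_acc)];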



Other Variants

Lastly, a classifier combining k-means clustering with a separate SVM per cluster was attempted, but its performance was right on par with the other SVMs. A voting system across the per-cluster SVMs (sketched after the code below) did not help much, and soft/fuzzy clustering packages did not cooperate for this classifier. Neither approach beat the RBF SVM.
#######################
#Other Variant: K Means Clustering + SVMLight
#######################
# Training:
#   -run kmeans clustering on the training data to create k clusters
#   -for each cluster
#     -train a separate svm classifier using the training data in that cluster
#
# Classify a test example:
#   -determine which cluster is closest to the example point
#   -use the svm from that chosen cluster to classify the example point
 
#accuracy at k = 5, 50% of dataset used, 20% test 80% train was 0.7614
 
 
k = 5;  #number of clusters to use... recall that there are only two classes
kmeans.train.output <- kmeans(trainingFeatures,centers=k); 
 
#the trainingLabels are provided as 0 and 1, but the svm wants them to map to -1 and 1... so convert 0's to -1
trainingLabels[trainingLabels==0] = -1;
testLabels[testLabels==0] = -1;
 
#train an SVM for each cluster
svm_list <- list();
for (i in 1:k){
  svm_list[[i]]<-svmlight(trainingFeatures[kmeans.train.output$cluster==i,], trainingLabels[kmeans.train.output$cluster==i], pathsvm='Path to svm goes here')
}
 
#select the nearest cluster center for each test example
nearest_cluster_center <- nn2(kmeans.train.output$centers,testFeatures,k=1);
 
 
#create a structure to hold the predicted labels of any example near each SVM
predictedLabelsList = vector(mode = "list", length = k);
for(i in 1:k){
  if(sum(nearest_cluster_center$nn.idx==i) > 0){
    predictedLabelsList[[i]] = predict(svm_list[[i]], testFeatures[nearest_cluster_center$nn.idx==i,]) 
  }
}
 
#predict the label of the test examples using the svm associated with the nearest cluster to that test example
predictedLabels = rep(NA,nrow(testFeatures));
for(i in 1:k){
  if(sum(nearest_cluster_center$nn.idx==i) > 0 && !is.null(predictedLabelsList[[i]]$class)){
    predictedLabels[nearest_cluster_center$nn.idx==i] = as.numeric(predictedLabelsList[[i]]$class);
  }
}
 
#predict() returns factor levels, which coerce to 1 and 2; map back to the -1/+1 labels
predictedLabels[predictedLabels==1] = -1;
predictedLabels[predictedLabels==2] = 1;
 
accuracy = sum(predictedLabels == testLabels)/length(predictedLabels);
print(accuracy)
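For reference, the voting variant mentioned above could look like the following sketch: every test example is classified by all k cluster SVMs and the majority label wins. This is illustrative only; the original voting code was not included in the writeup.

#sketch of the voting variant: classify each test example with all k SVMs
#and take the majority vote (the original voting code was not shown)
votes <- sapply(1:k, function(i) as.numeric(predict(svm_list[[i]], testFeatures)$class));
votes[votes==1] <- -1; votes[votes==2] <- 1;  #map factor levels 1/2 back to -1/+1
votedLabels <- sign(rowSums(votes));  #ties (row sum of 0) stay 0 and count as errors
accuracy_vote <- mean(votedLabels == testLabels);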


