Using unsupervised clustering methods, I seek to extract the distinct writing styles hidden within a few class labels of my dataset

mstrand1/Identifying-hentaigana-w-unsupervised-learning

If the notebook fails to load, please use this viewer

The Kuzushiji-49 dataset contains 232,365 Japanese characters in the training data alone. The original author of this dataset states that there may be multiple ways of writing certain characters (called Hentaigana, a typical feature of ancient Japanese writing). However, the data labels do not account for these different styles and collapse all the Hentaigana for a character into one label. A computer vision model trained on this data would therefore likely have more difficulty learning a label if that label covers two very different styles of writing the same character.

The aim of this project is to identify and extract these different writing styles and to see whether prior knowledge of these differences leads to more accurate classification. To find these Hentaigana, I build and apply a pipeline of unsupervised methods, including K-means and Gaussian mixture clustering and t-SNE embedding. Specifically, I experiment with tuning the parameters of t-SNE to better visualize the high-dimensional patterns in the images. Since t-SNE is not invertible, I simultaneously tune other, invertible clustering methods for a best-guess match to the t-SNE result, in an attempt to capture stand-out clusters which may be Hentaigana. Accuracy is gauged using PyCaret, which streamlines the modeling process and allows the convenient comparison of many models across a range of metrics.
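
As a rough illustration of the pipeline described above, here is a minimal sketch using scikit-learn. It is not the code from the project notebook: the function name `cluster_one_class`, its parameters, the PCA pre-reduction step, and the choice of two styles per label are all assumptions made for this example, and randomly generated 28x28 arrays stand in for the Kuzushiji-49 images of a single label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def cluster_one_class(images, n_styles=2, perplexity=30.0, seed=0):
    """Cluster the images of ONE label and embed them in 2-D with t-SNE.

    Hypothetical helper for illustration; n_styles and perplexity are the
    kind of parameters one would tune, not values taken from the project.
    """
    # Flatten each image into a feature vector.
    X = images.reshape(len(images), -1).astype(np.float64)

    # Assumption: reduce with PCA first so the mixture model and t-SNE
    # work on a compact representation.
    n_components = min(20, X.shape[0] - 1, X.shape[1])
    X_reduced = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    # Two clustering methods whose cluster assignments can be compared.
    kmeans_labels = KMeans(
        n_clusters=n_styles, n_init=10, random_state=seed
    ).fit_predict(X_reduced)
    gmm_labels = GaussianMixture(
        n_components=n_styles, covariance_type="diag", random_state=seed
    ).fit_predict(X_reduced)

    # t-SNE is used only for the 2-D picture; perplexity must be smaller
    # than the number of samples.
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(X_reduced)

    return embedding, kmeans_labels, gmm_labels


# Synthetic stand-in for one label that hides two writing styles.
rng = np.random.default_rng(0)
style_a = rng.normal(0.2, 0.05, size=(100, 28, 28))
style_b = rng.normal(0.8, 0.05, size=(100, 28, 28))
images = np.concatenate([style_a, style_b])
true_style = np.repeat([0, 1], 100)

embedding, kmeans_labels, gmm_labels = cluster_one_class(images)

# Agreement between the two clusterings (1.0 means identical partitions).
agreement = adjusted_rand_score(kmeans_labels, gmm_labels)
```

Plotting `embedding` colored by `kmeans_labels` or `gmm_labels` gives a quick visual check of whether the clusters line up with the groups t-SNE separates. On real character images the styles are far less cleanly separated than in this toy data, so the number of clusters and the perplexity would need tuning per label.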

This project was inspired by my undergraduate machine learning course. My goal is to improve my understanding of unsupervised learning methods and of working with large datasets.
