GitHub - mstrand1/Identifying-hentaigana-w-unsupervised-learning: Using unsupervised clustering methods, I seek to extract writing styles hidden within a few class labels within my dataset

If the notebook fails to load, please use this viewer

The Kuzushiji-49 dataset contains 232,365 Japanese characters in the training data alone. The original author of this dataset states that there may be multiple ways of writing certain characters (called Hentaigana, a typical feature of ancient Japanese writing). However, the data labels do not account for these different styles and collapse all the Hentaigana for a character into one label. A computer vision model would therefore likely have more difficulty learning a label if that label covers two very different styles of writing the same character.

The aim of this project is to identify and extract these different writing styles, to see whether prior knowledge of these differences leads to more accurate classification. To find these Hentaigana, I build and apply a pipeline of unsupervised methods, including K-means, Gaussian Mixture, and t-SNE. Specifically, I experiment with tuning the parameters of t-SNE to better visualize the high-dimensional patterns in the images. Since t-SNE is not invertible, I simultaneously tune other, invertible, clustering methods for a best-guess match to t-SNE, in an attempt to capture stand-out clusters which may be Hentaigana. Accuracy is gauged using PyCaret, which streamlines the modeling process and allows convenient comparison of many models across a range of metrics.
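The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function names `make_synthetic_class` and `find_style_clusters`, the PCA pre-reduction, and every parameter value are assumptions chosen for the example. Synthetic 28x28 images with two artificial "styles" stand in for the images of a single Kuzushiji-49 label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def make_synthetic_class(n_per_style=100, seed=0):
    """Hypothetical stand-in for one class label that hides two writing styles."""
    rng = np.random.default_rng(seed)
    style_a = np.zeros((28, 28))
    style_a[:, 12:16] = 255.0  # "style A": a vertical stroke
    style_b = np.zeros((28, 28))
    style_b[12:16, :] = 255.0  # "style B": a horizontal stroke
    templates = np.stack([style_a, style_b])
    true_style = np.repeat([0, 1], n_per_style)
    noise = rng.normal(0.0, 25.0, size=(2 * n_per_style, 28, 28))
    images = np.clip(templates[true_style] + noise, 0.0, 255.0)
    return images, true_style


def find_style_clusters(images, n_styles=2, n_components=20,
                        perplexity=30.0, seed=0):
    """Cluster images that share ONE label into candidate writing styles.

    Returns K-means labels, Gaussian-mixture labels, and a 2-D t-SNE embedding.
    """
    # Flatten each image to a feature vector and scale pixels to [0, 1].
    flat = images.reshape(len(images), -1).astype(np.float64) / 255.0
    # Assumption: reduce with PCA first so the clusterers work in fewer dimensions.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(flat)

    kmeans_labels = KMeans(
        n_clusters=n_styles, n_init=10, random_state=seed
    ).fit_predict(reduced)
    gmm_labels = GaussianMixture(
        n_components=n_styles, covariance_type="diag", random_state=seed
    ).fit_predict(reduced)

    # t-SNE is used only for visualization; perplexity is the main knob to tune.
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(reduced)
    return kmeans_labels, gmm_labels, embedding


if __name__ == "__main__":
    images, true_style = make_synthetic_class()
    km, gm, emb = find_style_clusters(images)
    # Agreement between the two clusterers hints at a stable style split.
    print("K-means vs GMM agreement (ARI):", adjusted_rand_score(km, gm))
    print("t-SNE embedding shape:", emb.shape)
```

Coloring the t-SNE embedding by the K-means or Gaussian Mixture labels is one way to check whether a clusterer recovers the groups that t-SNE shows visually. On real data the number of styles per label is unknown, so `n_styles` would need to be explored rather than fixed.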

This project was inspired by my undergraduate machine learning course. My goal is to improve my understanding of unsupervised learning methods and of working with large datasets.

Repository files: Hentaigana Analysis.ipynb, README.md
