How Do I Make A Yomichan Dictionary?

Warning

Documentation on this page will no longer be updated, as it has been moved to the official Yomitan documentation.


I get this question a lot, so here's an overview of how to make your own Yomichan dictionary.

As a prerequisite, you need to be somewhat familiar with JSON and coding in a language of choice. The general process is as follows:

  1. Acquire data (from a website, app, dump, etc.)
  2. Parse the data so you can make sense of it
  3. Format the data into JSON files that comply with the Yomichan dictionary schemas
  4. Export the data into a zip file containing the relevant JSON files
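
As a rough illustration of steps 2 and 3, here is a minimal Python sketch that parses a made-up tab-separated source into term bank entries. The input format (`term<TAB>reading<TAB>glosses`) and the `parse_line` helper are assumptions invented for this example, and the 8-field entry layout is my reading of the term bank v3 schema, so check it against the schema itself before relying on it.

```python
import json

def parse_line(line: str) -> list:
    """Turn one 'term<TAB>reading<TAB>gloss; gloss' line into a term bank entry."""
    term, reading, glosses = line.rstrip("\n").split("\t")
    return [
        term,                                      # headword
        reading,                                   # kana reading
        "",                                        # definition tags (space-separated), none here
        "",                                        # deinflection rules, empty for non-inflecting words
        0,                                         # score used when sorting results
        [g.strip() for g in glosses.split(";")],   # list of definitions
        0,                                         # sequence number
        "",                                        # term tags (space-separated), none here
    ]

# Hypothetical raw data standing in for whatever you scraped or dumped.
raw = "猫\tねこ\tcat; feline\n犬\tいぬ\tdog\n"

entries = [parse_line(line) for line in raw.splitlines() if line.strip()]

# ensure_ascii=False keeps the Japanese text readable in the output file.
term_bank_json = json.dumps(entries, ensure_ascii=False)
```

The resulting `term_bank_json` string is what would be saved as `term_bank_1.json`.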

Tools

  • Yomichan Dictionary Builder - This is a node package I built to help with making dictionaries. It greatly simplifies the process of making dictionaries, so please try it out if you use TypeScript or JavaScript.
  • hasUTF16SurrogatePairAt - This is important for checking if a kanji/hanzi is a surrogate pair. If so, its length is 2 in JavaScript so you need to account for that when doing string operations.
  • japanese-furigana-normalize - This package provides a utility function to normalize Japanese readings containing furigana. It is particularly useful for creating Yomitan dictionaries and ensuring the readings are properly aligned with the kanji characters.
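
To see why surrogate pairs matter, the Python sketch below counts UTF-16 code units, which is what JavaScript's string length reports. It is only an illustration of the underlying issue, not the API of the hasUTF16SurrogatePairAt package; the helper names `utf16_length` and `is_surrogate_pair` are made up for this example.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's String.length reports."""
    return len(text.encode("utf-16-le")) // 2

def is_surrogate_pair(char: str) -> bool:
    """True if a single character needs two UTF-16 code units (code point above U+FFFF)."""
    return len(char) == 1 and ord(char) > 0xFFFF

rare = "\U00020BB7"   # 𠮷, a kanji outside the Basic Multilingual Plane
common = "吉"         # U+5409, inside the Basic Multilingual Plane

rare_units = utf16_length(rare)       # two code units in UTF-16
common_units = utf16_length(common)   # one code unit in UTF-16
```

Python counts code points, so `len(rare)` is 1 here, while the same string has length 2 in JavaScript. Slicing or indexing by character in JavaScript therefore needs to treat such kanji as two units.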

Read the Schemas

You'll want to get very familiar with the Yomichan/Yomitan schemas for dictionaries - these schemas define how Yomichan dictionaries are structured. You can read about how JSON Schemas work here. I recommend trying codebeautify, json-schema-viewer, and jsonhero for help breaking down the schemas. For looking at raw JSON files in the browser, I use json-viewer for a better viewing experience.

Below is a list of the Yomichan dictionary schemas and what they're used for, as well as the expected filename. Note that for data files with numbers in them, the number starts at 1 and enumerates upwards.

| Schema | Expected Filename | Usage |
| --- | --- | --- |
| dictionary-index-schema.json | index.json | The schema for the index.json file that contains metadata about the dictionary. PLEASE ALWAYS PUT AS MUCH DETAIL IN THIS AS POSSIBLE. Note that this information can be displayed in Yomichan by going to the dictionaries overview page and clicking the three dots, then Details... |
| dictionary-kanji-bank-v3-schema.json | kanji_bank_${number}.json | Contains information used in the kanji viewer - meaning, readings, statistics, and codepoints. Unfortunately a lot of the structuring is hardcoded and can't be customized nearly as much as with term definitions. |
| dictionary-kanji-meta-bank-v3-schema.json | kanji_meta_bank_${number}.json | The meta bank for kanji information. Right now, this is only used to store kanji frequency data. |
| dictionary-tag-bank-v3-schema.json | tag_bank_${number}.json | The tag bank. This is where you'll define tags for kanji and term dictionaries, for example to specify parts of speech or kanken level. These are generally displayed in Yomichan as grey tags next to the dictionary name. |
| dictionary-term-bank-v3-schema.json | term_bank_${number}.json | The term bank for term information. This is where dictionary readings, definitions, and such are stored. |
| dictionary-term-meta-bank-v3-schema.json | term_meta_bank_${number}.json | Where meta information about terms is stored. This currently includes frequency data and pitch accent data. |
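
Since numbered data files start at 1 and count upwards, a large dictionary is usually split into several bank files. Below is a minimal Python sketch of that numbering. The `split_into_banks` helper and the chunk size are assumptions for illustration (pick whatever size suits your data), and the `[term, "freq", value]` entry shape is my reading of the term meta bank schema.

```python
import json

def split_into_banks(entries: list, prefix: str, chunk_size: int = 10000) -> dict:
    """Map filenames like 'term_bank_1.json' to JSON text, numbering from 1."""
    banks = {}
    for start in range(0, len(entries), chunk_size):
        number = start // chunk_size + 1          # first file is _1, not _0
        chunk = entries[start:start + chunk_size]
        banks[f"{prefix}_{number}.json"] = json.dumps(chunk, ensure_ascii=False)
    return banks

# Hypothetical frequency entries for the term meta bank.
freq_entries = [[f"term{i}", "freq", i] for i in range(1, 6)]

# A tiny chunk size is used here only to make the split visible.
banks = split_into_banks(freq_entries, "term_meta_bank", chunk_size=2)
filenames = list(banks)
```

With five entries and a chunk size of 2, this yields `term_meta_bank_1.json` through `term_meta_bank_3.json`.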

Packaging A Dictionary

A dictionary is not restricted to being only a kanji dictionary, term dictionary, frequency dictionary, or accent dictionary. It can have multiple types of kanji/term/tag information within the zip file, as is shown in the official test dictionary. Once you have an index.json and the relevant data files for your dictionary, you simply zip them up with all the data .json files in the root directory of the zip, NOT in subfolders. I recommend zipping them at the highest compression level possible - generally the json data files can be compressed to a fraction of their original size.
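
The packaging step can be sketched with Python's standard `zipfile` module as follows. The `build_dictionary_zip` helper and the example index values are made up for illustration; the index here only fills in a few fields, whereas a real one should carry as much detail as the index schema allows.

```python
import io
import json
import zipfile

def build_dictionary_zip(index: dict, banks: dict) -> bytes:
    """Return zip bytes with index.json and every bank file in the archive root."""
    buffer = io.BytesIO()
    # ZIP_DEFLATED with compresslevel=9 is the strongest deflate setting.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9) as archive:
        archive.writestr("index.json", json.dumps(index, ensure_ascii=False))
        for filename, entries in banks.items():
            # Guard against accidentally nesting data files in a subfolder.
            if "/" in filename or "\\" in filename:
                raise ValueError(f"{filename} must sit in the zip root, not a subfolder")
            archive.writestr(filename, json.dumps(entries, ensure_ascii=False))
    return buffer.getvalue()

# Hypothetical metadata and a single one-entry term bank.
index = {
    "title": "Example Dictionary",
    "revision": "2024-01-01",
    "format": 3,
    "author": "you",
    "description": "A tiny example dictionary.",
}
banks = {"term_bank_1.json": [["猫", "ねこ", "", "", 0, ["cat"], 0, ""]]}

zip_bytes = build_dictionary_zip(index, banks)
```

To produce a file you can import, write the bytes out, for example with `open("example.zip", "wb").write(zip_bytes)`.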

Examples

  • The term origins dictionary is a small example of a simple dictionary without any bells or whistles.
  • The official test dictionary is a great resource to see an example of a dictionary that utilizes the full range of features currently defined in the schema.
  • The latest JMDict has complex (and good) formatting.
  • Dictionaries made by stephenmk's jitenbot like the jitenon-dictionaries, 大辞林第四版, and 新明解第八版 have very nice formatting.
  • Dictionaries made by the dictionary anon like 岩波, 三省堂, 広辞苑 have nice formatting.

Schema Validation

For schema validation, I recommend configuring VSCode to validate schemas, though you could also use a website like jsonschemavalidator to test.

If you want to use VSCode to validate schemas, here's the relevant settings JSON value to use following the above instructions.

  "json.schemas": [
    {
      "fileMatch": ["kanji_bank_*.json"],
      "url": "https://github.com/themoeway/yomitan/raw/master/ext/data/schemas/dictionary-kanji-bank-v3-schema.json"
    },
    {
      "fileMatch": ["kanji_meta_bank_*.json"],
      "url": "https://github.com/themoeway/yomitan/raw/master/ext/data/schemas/dictionary-kanji-meta-bank-v3-schema.json"
    },
    {
      "fileMatch": ["tag_bank_*.json"],
      "url": "https://github.com/themoeway/yomitan/raw/master/ext/data/schemas/dictionary-tag-bank-v3-schema.json"
    },
    {
      "fileMatch": ["term_bank_*.json"],
      "url": "https://github.com/themoeway/yomitan/raw/master/ext/data/schemas/dictionary-term-bank-v3-schema.json"
    },
    {
      "fileMatch": ["term_meta_bank_*.json"],
      "url": "https://github.com/themoeway/yomitan/raw/master/ext/data/schemas/dictionary-term-meta-bank-v3-schema.json"
    }
  ],

Conjugation

For Japanese terms to be conjugated by Yomichan, they need to have an appropriate part of speech tag (as can be seen in the term bank schema). The part of speech labels are documented on the official JMDict page here. If you're making a Japanese dictionary without too many terms, you might be able to simply copy the parts of speech from JMDict as long as the terms mostly overlap. I have developed an npm package that can help with stealing conjugations from JMDict - you can see an example of getDeinflectorsForTermReading in the logic used to create the JP-Mongolian dictionary.
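
As a rough sketch of what copying parts of speech from JMDict can look like, the Python below maps JMDict part of speech labels to the rule identifiers stored in the fourth field of a term entry. The `rules_for_pos` helper and the exact mapping are my own assumptions, not the logic of the npm package mentioned above; the rule identifiers used (`v1`, `v5`, `vs`, `vk`, `adj-i`) are the ones I understand the term bank schema to list, so verify them there.

```python
# JMDict labels that map one-to-one onto a rule identifier (assumed mapping).
EXACT_RULES = {
    "v1": "v1", "v1-s": "v1",              # ichidan verbs
    "vs": "vs", "vs-i": "vs", "vs-s": "vs",  # suru verbs
    "vk": "vk",                            # kuru verb
    "adj-i": "adj-i",                      # i-adjectives
}

def rules_for_pos(jmdict_pos: list) -> str:
    """Build the space-separated rules string for a term from its JMDict labels."""
    rules = []
    for pos in jmdict_pos:
        # All godan labels (v5k, v5r, v5u, ...) collapse into the single v5 rule.
        rule = "v5" if pos.startswith("v5") else EXACT_RULES.get(pos)
        if rule and rule not in rules:
            rules.append(rule)
    return " ".join(rules)

# A term entry whose rules field lets the verb be deinflected.
taberu = ["食べる", "たべる", "v1 vt", rules_for_pos(["v1", "vt"]), 0, ["to eat"], 0, ""]
```

Labels that do not inflect, such as `n` for nouns, produce an empty rules string.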

Tag Categories

There isn't any official documentation on the second element of each tag bank entry (the tag's category), but the possible values can be found here in the Yomitan source code. The tag categories are as follows:

  • name
  • expression
  • popular
  • frequent
  • archaism
  • dictionary
  • frequency
  • partOfSpeech
  • search
  • pronunciation-dictionary

The main user-facing effect of choosing a tag category for a tag in the tag bank is that CSS will be applied to change the color of the tag. You can view the colors here.
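
To show where the category sits, here is a small Python sketch that builds tag bank entries. The five-element layout (name, category, sorting order, notes, score) is my reading of the tag bank schema, and the `make_tag` helper, the example tags, and the check against the category list are assumptions added for illustration.

```python
import json

# Categories from the list above.
KNOWN_CATEGORIES = frozenset({
    "name", "expression", "popular", "frequent", "archaism", "dictionary",
    "frequency", "partOfSpeech", "search", "pronunciation-dictionary",
})

def make_tag(name: str, category: str, notes: str = "", order: int = 0, score: int = 0) -> list:
    """Build one tag bank entry: [name, category, sorting order, notes, score]."""
    return [name, category, order, notes, score]

tags = [
    make_tag("v1", "partOfSpeech", "Ichidan verb", order=-3),
    make_tag("arch", "archaism", "archaic term", order=-4),
]

# Flag tags whose category is not in the list, since they would likely get no special color.
unknown = [tag[0] for tag in tags if tag[1] not in KNOWN_CATEGORIES]

tag_bank_json = json.dumps(tags, ensure_ascii=False)
```

The `tag_bank_json` string is what would be saved as `tag_bank_1.json`.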