This repository contains the HUE dataset, the Hanja language models, and baseline code for them. See our paper for more details about HUE and the Hanja PLMs.
HUE is composed of 4 tasks:
- Chronological Attribution (CA)
- Topic Classification (TC)
- Named Entity Recognition (NER)
- Summary Retrieval (SR)
HUE aims to encourage the training of Hanja language models that help analyze Korean historical documents written in Hanja, an extinct language.
You can download all 12 here, or individually from the table below:
| Model Name | Size |
|---|---|
| AnchiBERT + AJD/DRS | 379.2 MB |
| mBERT + AJD/DRS | 379.2 MB |
Make sure you have installed the packages listed in environment.yml. If you use conda, you can create an environment from this file with the following command:
conda env create -f environment.yml
Code, data, and models should be placed in the following directory tree.
HUE
├── code
│   ├── HUE_fine-tuning_Chronological_Attribution.ipynb
│   ├── HUE_fine-tuning_Named_Entity_Recognition.ipynb
│   ├── HUE_fine-tuning_Summary_Retrieval.ipynb
│   └── HUE_fine-tuning_Topic_Classification.ipynb
├── dataset
│   ├── HUE_Chronological_Attribution
│   │   ├── HUE_Chronological_Attribution.csv
│   │   ├── HUE_Chronological_Attribution_dev.csv
│   │   ├── HUE_Chronological_Attribution_test.csv
│   │   └── HUE_Chronological_Attribution_train.csv
│   ├── HUE_Named_Entity_Recognition
│   │   ├── HUE_Named_Entity_Recognition_dev.csv
│   │   ├── HUE_Named_Entity_Recognition_test.csv
│   │   └── HUE_Named_Entity_Recognition_train.csv
│   ├── HUE_Summary_Retrieval
│   │   ├── HUE_Summary_Retrieval_dev.csv
│   │   ├── HUE_Summary_Retrieval_test.csv
│   │   └── HUE_Summary_Retrieval_train.csv
│   └── HUE_Topic_Classification
│       ├── HUE_Topic_Classification.csv
│       ├── HUE_Topic_Classification_dev.csv
│       ├── HUE_Topic_Classification_test.csv
│       └── HUE_Topic_Classification_train.csv
├── model
│   ├── AnchiBERT+AJD-DRS
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.txt
│   └── mBERT+AJD-DRS
│       ├── config.json
│       ├── pytorch_model.bin
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       └── vocab.txt
└── tokenizer
    ├── AnchiBERT+AJD-DRS
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   └── vocab.txt
    └── mBERT+AJD-DRS
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── vocab.txt
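
Before running the notebooks, it can help to confirm that everything is where the tree above says it should be. The following is a minimal sketch, not part of the repository: the helper names (`split_path`, `expected_paths`, `missing_paths`) are hypothetical, and the file names are copied from the tree above. It checks only the notebooks, the train/dev/test splits, and the model and tokenizer files; the unsplit CSVs (for example `HUE_Topic_Classification.csv`) are not checked.

```python
from pathlib import Path

# Names below are taken from the directory tree in this README.
TASKS = [
    "Chronological_Attribution",
    "Named_Entity_Recognition",
    "Summary_Retrieval",
    "Topic_Classification",
]
SPLITS = ["train", "dev", "test"]
MODELS = ["AnchiBERT+AJD-DRS", "mBERT+AJD-DRS"]
TOKENIZER_FILES = ["special_tokens_map.json", "tokenizer_config.json", "vocab.txt"]
MODEL_FILES = ["config.json", "pytorch_model.bin"] + TOKENIZER_FILES


def split_path(root, task, split):
    """Return the path of one dataset split (hypothetical helper)."""
    return Path(root) / "dataset" / f"HUE_{task}" / f"HUE_{task}_{split}.csv"


def expected_paths(root):
    """List the notebook, split, model, and tokenizer files from the tree."""
    root = Path(root)
    paths = []
    for task in TASKS:
        paths.append(root / "code" / f"HUE_fine-tuning_{task}.ipynb")
        paths.extend(split_path(root, task, split) for split in SPLITS)
    for model in MODELS:
        paths.extend(root / "model" / model / name for name in MODEL_FILES)
        paths.extend(root / "tokenizer" / model / name for name in TOKENIZER_FILES)
    return paths


def missing_paths(root):
    """Return the expected files that do not exist under root."""
    return [path for path in expected_paths(root) if not path.exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains HUE/.
    for path in missing_paths("HUE"):
        print("missing:", path)
```

If the script prints nothing, every file it checks was found. `split_path("HUE", "Summary_Retrieval", "dev")` gives the location of a single split, which can then be read with any CSV reader.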