innerNULL/mia

My Implementations' Archive

Archive Index

Reproduced Papers

Model Training/Inference Runners

Crawlers

ETL

  • Line-Based Splitter to Generate Train/Dev/Test Dataset
    • bash ./bin/etl/train_dev_test_splitter_for_lines_data.sh ${DATA_LINES_PATH} ${DEV_DATA_SIZE} ${TEST_DATA_SIZE}
  • SlimPajama-DC Text Corpus Low-Length Filtering and Deduplication
    • python ./bin/etl/dataset/text_corpus/text_corpus_slimpajama_dc_processor.py ./bin/etl/dataset/text_corpus/text_corpus_slimpajama_dc_processor.json
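The two ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the scripts in this repository: the function names `split_lines` and `filter_and_dedup`, the `seed` and `min_chars` parameters, the shuffle before splitting, and the use of exact hash matching for deduplication are all hypothetical choices made for the example. The actual processor may use different thresholds or fuzzy deduplication.

```python
import hashlib
import random


def split_lines(lines, dev_size, test_size, seed=0):
    """Shuffle lines, carve off dev and test sets, keep the rest as train.

    Hypothetical helper: shuffling with a fixed seed is an assumption made
    here so that the split is reproducible.
    """
    if dev_size + test_size > len(lines):
        raise ValueError("dev_size + test_size exceeds the number of lines")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


def filter_and_dedup(texts, min_chars=200):
    """Drop documents shorter than min_chars, then drop exact duplicates.

    Hypothetical helper: documents are compared after collapsing whitespace
    and lowercasing, which is a simple stand-in for a real dedup strategy.
    """
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.split()).lower()
        if len(normalized) < min_chars:
            continue  # low-length filter
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept
```

Each helper takes an in-memory list, so reading the input file and writing the train/dev/test or cleaned output files is left to the caller.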