How is prefix_idxs determined? #15

Open
RENNY-Jenius opened this issue Jan 9, 2024 · 2 comments

Comments

@RENNY-Jenius

I'd like to ask how prefix_idxs is actually determined in the Predictor class in icl/util_classes/predictor_classes.py. I see that it is set differently for different datasets.
    if task_name == 'sst2':
        self.prefix_idxs = [tokenizer.encode('Sentiment', add_special_tokens=False)[-1],
                            tokenizer.encode(':', add_special_tokens=False)[0]]
    elif task_name == 'agnews':
        self.prefix_idxs = [tokenizer.encode('Answer', add_special_tokens=False)[-1],
                            tokenizer.encode(':', add_special_tokens=False)[0]]
    elif task_name == 'trec':
        self.prefix_idxs = [tokenizer.encode(' Type', add_special_tokens=False)[-1],
                            tokenizer.encode(':', add_special_tokens=False)[0]]
    elif task_name == 'emo':
        self.prefix_idxs = [tokenizer.encode('Emotion', add_special_tokens=False)[-1],
                            tokenizer.encode(':', add_special_tokens=False)[0]]
How should it be determined for other datasets (such as gsm8k)? Thanks.

@leanwang326
Collaborator

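Ah, this simply follows the prompt template for each task given in the appendix (the templates themselves were copied from how others handle these datasets), and then takes the two tokens immediately before the label.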
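As a rough illustration of "the two tokens immediately before the label", here is a minimal sketch. `ToyTokenizer` and `prefix_idxs_for` are hypothetical names made up for this example and are not part of the repository; the toy tokenizer only mimics the `encode(text, add_special_tokens=False)` call shape used in the snippet above so that the example runs without downloading a model. With a real subword tokenizer the ids can depend on surrounding whitespace, which is presumably why the trec branch encodes ' Type' with a leading space.

```python
import re
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: words and punctuation become ids."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        # Split into word pieces and single punctuation marks, assigning ids on first sight.
        pieces = re.findall(r"\w+|[^\w\s]", text)
        return [self.vocab.setdefault(p, len(self.vocab)) for p in pieces]


def prefix_idxs_for(tokenizer, label_word: str, separator: str = ":") -> List[int]:
    """Same pattern as in the issue: last token of the label word, first token of the separator."""
    word_ids = tokenizer.encode(label_word, add_special_tokens=False)
    sep_ids = tokenizer.encode(separator, add_special_tokens=False)
    return [word_ids[-1], sep_ids[0]]


tokenizer = ToyTokenizer()
prefix_idxs = prefix_idxs_for(tokenizer, "Sentiment")

# In a filled-in template, the label is the last token; the two tokens before it
# should be exactly the prefix tokens.
prompt_ids = tokenizer.encode("Review: great movie Sentiment: positive")
print(prefix_idxs, prompt_ids[-3:-1])
```

For a new dataset, the same idea would apply: look at the template, find the word and separator that sit directly before the label, and check on a real tokenized prompt that those two ids actually appear there.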

@leanwang326
Collaborator

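The issue with gsm8k is this: if your prompt requires the model to explicitly output 'The answer is xxx', then prefix_idxs would be the last two tokens of 'The answer is' (the two tokens before xxx). But if the answer has to be extracted from free-form output, the code above cannot be used. (Answer extraction on gsm8k seems to be done in all sorts of ways, and I'm not sure what the best approach is: https://github.com/facebookresearch/llama/issues/325)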
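A minimal sketch of the first case, assuming the prompt forces the literal phrase 'The answer is' before the number. `toy_encode`, `last_two_token_ids`, and `extract_answer` are hypothetical helpers written for this example, not functions from the repository, and the regular expression is only one of many possible extraction rules. With a real tokenizer, the phrase may split into different tokens depending on context, so the resulting ids should be checked against an actual tokenized prompt.

```python
import re
from typing import Callable, Dict, List, Optional

_vocab: Dict[str, int] = {}


def toy_encode(text: str) -> List[int]:
    """Hypothetical whitespace tokenizer, used only so the sketch runs offline."""
    return [_vocab.setdefault(word, len(_vocab)) for word in text.split()]


def last_two_token_ids(encode: Callable[[str], List[int]], prefix: str) -> List[int]:
    """Ids of the two tokens that directly precede the answer."""
    ids = encode(prefix)
    if len(ids) < 2:
        raise ValueError("prefix must tokenize into at least two tokens")
    return ids[-2:]


# Assumed answer format: 'The answer is <number>', optionally with '$' and thousands commas.
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(generation: str) -> Optional[str]:
    """Return the last number following 'The answer is', or None if the phrase is absent."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1].replace(",", "") if matches else None


prefix_idxs = last_two_token_ids(toy_encode, "The answer is")
print(prefix_idxs, extract_answer("She sells 9 eggs at $2 each. The answer is 18."))
```

If the model is not constrained to this phrase, there is no fixed pair of tokens before the answer, which is the situation the comment above describes as not covered by the existing code.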
