Table of Contents
- Preface
- multilingual fairseq-preprocess1
  - Method overview
  - cli_main
  - main
  - _make_all
- multilingual fairseq-preprocess2
  - Method overview
  - generate_split-iwslt14.sh
  - learn_and_encode_spm-iwslt14.sh
  - binarize.sh
- References
Preface
The preprocess code that ships with fairseq only supports binarizing a single language pair, but for [机器翻译] 记一次多语言机器翻译模型的训练 (an earlier post on training a multilingual translation model) I wanted to binarize several language pairs at once and build a shared dictionary along the way.
After talking it over with a senior labmate, there are two ways to get this result: 1. after learning BPE you already have a shared vocabulary; modify it slightly and pass it to binarization as the dictionary; 2. instead of using the BPE vocabulary, binarize twice: the first pass binarizes each language pair separately and yields a separate dictionary per pair, these dictionaries are then merged, and the merged dictionary is passed to the second binarization pass.
Based on my understanding of the fairseq-preprocess pipeline, and with reference to https://github.com/RayeRen/multilingual-kd-pytorch/blob/master/preprocess_universal.py, this post records a simpler, one-step binarization flow for multiple language pairs.
After that, the post also covers the first preprocessing approach described above. (The second one may be filled in later.)
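Until then, here is a minimal sketch of the dictionary-merge step that the second approach would need. It assumes the first binarization pass produced ordinary fairseq dictionary files (one "token count" pair per line); the input paths and the output name dict.shared.txt are placeholders.

# merge several fairseq dictionaries into one shared dictionary:
# sum the counts of identical tokens and sort by frequency
cat first-pass/*/dict.*.txt \
  | awk '{count[$1] += $2} END {for (w in count) print w, count[w]}' \
  | sort -k2,2nr > dict.shared.txt

The merged file can then be handed to the second fairseq-preprocess run via --srcdict/--tgtdict.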
multilingual fairseq-preprocess1
Method overview
This approach works through fairseq_cli/preprocess.py and modifies it into a script that performs a multilingual fairseq-preprocess.
Reading fairseq_cli/preprocess.py shows the following:
If srcdict or tgtdict is provided, the script calls
task.load_dictionary(args.srcdict)
to read the dictionary. The call chain of task.load_dictionary is: [fairseq.tasks.translation.TranslationTask] -> [fairseq.tasks.fairseq_task.FairseqTask.load_dictionary] -> [fairseq.data.dictionary.Dictionary.load] -> [fairseq.data.dictionary.Dictionary.add_from_file]. If no dictionary is provided, the script calls
task.build_dictionary
to build one. The code of fairseq.tasks.fairseq_task.FairseqTask.build_dictionary is:
d = Dictionary()
for filename in filenames:
    Dictionary.add_file_to_dictionary(
        filename, d, tokenizer.tokenize_line, workers
    )
d.finalize(threshold=threshold, nwords=nwords, padding_factor=padding_factor)
return d
So as long as the train data of every language pair is added to filenames, a shared vocabulary is built directly; after that it only remains to binarize every language pair with this vocabulary. The modifications are therefore as follows:
I first copied fairseq_cli/preprocess.py into the current directory and then modified the following three functions:
cli_main
def cli_main():
    parser = options.get_preprocessing_parser()
    # extra argument: --pref is the directory holding the raw
    # {train,valid,test}.{src}-{tgt}.{lang} files of all language pairs
    parser.add_argument('--pref', metavar='FP', default=None, help='data prefix')
    args = parser.parse_args()
    main(args)
main
def main(args):
    # setup some basic things
    utils.import_user_module(args)
    os.makedirs(args.destdir, exist_ok=True)

    logger.addHandler(
        logging.FileHandler(
            filename=os.path.join(args.destdir, "preprocess.log"),
        )
    )
    logger.info(args)

    assert (
        args.dataset_impl != "huffman"
    ), "preprocessing.py doesn't support Huffman yet, use HuffmanCodeBuilder directly."

    # build shared dictionaries
    # target = not args.only_source
    # collect the raw files of every language pair under args.pref,
    # keeping only names of the form {split}.{src}-{tgt}.{lang}(...)
    train_files = glob.glob('{}/train.*-*.*'.format(args.pref))
    train_files = [f for f in train_files if len(f.split('.')) in [3, 4, 5]]
    test_files = glob.glob('{}/test.*-*.*'.format(args.pref))
    test_files = [f for f in test_files if len(f.split('.')) in [3, 4, 5]]
    valid_files = glob.glob('{}/valid.*-*.*'.format(args.pref))
    valid_files = [f for f in valid_files if len(f.split('.')) in [3, 4, 5]]
    # language pairs, e.g. {"ar-en", "de-en", ...}
    lng_pairs = set([f.split('/')[-1].split(".")[1] for f in (train_files + test_files + valid_files)])

    task = tasks.get_task(args.task)
    # build one dictionary from the train files of *all* language pairs
    shared_dictionary = _build_dictionary(
        train_files,
        task=task,
        args=args,
        src=True,
    )

    # save dictionaries
    if args.joined_dictionary:
        shared_dictionary.save(os.path.join(args.destdir, "dict.txt"))
    else:
        for lng_pair in lng_pairs:
            src, tgt = lng_pair.split('-')
            tmp_src_dict_path = os.path.join(args.destdir, f'dict.{src}.txt')
            tmp_tgt_dict_path = os.path.join(args.destdir, f'dict.{tgt}.txt')
            if not os.path.exists(tmp_src_dict_path):
                shared_dictionary.save(tmp_src_dict_path)
            if not os.path.exists(tmp_tgt_dict_path):
                shared_dictionary.save(tmp_tgt_dict_path)

    if args.dict_only:
        return

    # binarize every language pair with the shared dictionary
    for lng_pair in lng_pairs:
        src_and_tgt = lng_pair.split('-')
        if len(src_and_tgt) != 2:
            continue
        src, tgt = src_and_tgt
        print("| building: ", src, tgt)
        args.source_lang = src
        args.target_lang = tgt
        _make_all(src, shared_dictionary, args)
        _make_all(tgt, shared_dictionary, args)

    logger.info("Wrote preprocessed data to {}".format(args.destdir))
_make_all
def _make_all(lang, vocab, args):
    lng_pair = f"{args.source_lang}-{args.target_lang}"
    _make_dataset(  # input prefix, e.g. iwslt14.tokenized/train.en-ar
        vocab, os.path.join(args.pref, f"train.{lng_pair}"), "train", lang, args=args, num_workers=args.workers
    )
    _make_dataset(
        vocab, os.path.join(args.pref, f"valid.{lng_pair}"), "valid", lang, args=args, num_workers=args.workers
    )
    _make_dataset(
        vocab, os.path.join(args.pref, f"test.{lng_pair}"), "test", lang, args=args, num_workers=args.workers
    )
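With these three changes in place, one run binarizes every language pair found under --pref. A rough invocation sketch (the directory names and the worker count are placeholders; apart from --pref, the options are standard fairseq-preprocess options):

python preprocess.py \
    --task translation \
    --pref iwslt14.tokenized \
    --destdir data-bin \
    --joined-dictionary \
    --workers 8

Dropping --joined-dictionary saves the same shared dictionary once per language as dict.{lang}.txt instead of a single dict.txt, matching the if/else in main above.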
multilingual fairseq-preprocess2
Method overview
In this approach, learning BPE already yields a shared vocabulary; after a few modifications, that vocabulary is passed to binarization as the dictionary.
generate_split-iwslt14.sh
The current directory contains the following files:
[screenshot: directory listing]
Running the script below splits the data; afterwards we get the following files, among which train.all is used to learn the SentencePiece model:
[screenshot: contents of tmp/ after the split]
#!/usr/bin/env bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone git://github.com/moses-smt/mosesdecoder.git

###
# just generate train/valid/test data for iwslt14,
# with the same simple preprocessing steps but without tokenization, because the next step is learning spm
###
SCRIPTS=mosesdecoder/scripts
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
tmp=tmp
orig=orig
tgt=en
rm -r $orig
rm -r $tmp
mkdir -p $orig $tmp

for src in ar de es fa he it nl pl;
do
lang=$src-en
echo "pre-processing train data..."
for l in $src $tgt;
do
if [[ ! -f $src-en.tgz ]];
then
wget https://wit3.fbk.eu/archive/2014-01//texts/$src/en/$src-en.tgz
fi
cd $orig
tar zxvf ../$src-en.tgz
cd ..

f=train.tags.$lang.$l
cat $orig/$lang/$f | \
grep -v '<url>' | \
grep -v '<talkid>' | \
grep -v '<keywords>' | \
grep -v '<speaker>' | \
sed -e 's/<title>//g' | \
sed -e 's/<\/title>//g' | \
sed -e 's/<description>//g' | \
sed -e 's/<\/description>//g' > $tmp/$f
echo ""
done
for l in $src $tgt;
do
perl $LC < $tmp/train.tags.$lang.$l > $tmp/train.$lang.$l
rm $tmp/train.tags.$lang.$l
done
echo "pre-processing valid/test data..."
for l in $src $tgt;
do
for o in `ls $orig/$lang/IWSLT14.TED*.$l.xml`;
do
fname=${o##*/}
f=$tmp/${fname%.*}
echo $o $f
grep '<seg id' $o | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $LC > $f
echo ""
done
done

echo "creating train, valid, test..."
for l in $src $tgt;
do
mv $tmp/train.$src-$tgt.$l $tmp/train-valid.$src-$tgt.$l
awk '{if (NR%23 == 0) print $0;}' $tmp/train-valid.$src-$tgt.$l > $tmp/valid.en-$src.$l
awk '{if (NR%23 != 0) print $0;}' $tmp/train-valid.$src-$tgt.$l > $tmp/train.en-$src.$l
rm $tmp/train-valid.$src-$tgt.$l
cat $tmp/IWSLT14.TED.dev2010.$src-$tgt.$l \
$tmp/IWSLT14.TEDX.dev2012.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2010.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2011.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2012.$src-$tgt.$l \
> $tmp/test.en-$src.$l
rm $tmp/IWSLT14.TED*.$src-$tgt.$l
done

TRAIN=$tmp/train.all
for l in $src $tgt;
do
cat $tmp/train.en-$src.$l >> $TRAIN
done
done

echo "counting..."
for src in ar de es fa he it nl pl;
do
for split in train valid test;
do
for l in $src $tgt;
do
wc -l $tmp/$split.en-$src.$l
done
done
done

echo "done"
learn_and_encode_spm-iwslt14.sh
Learn the SentencePiece model and apply it; this produces the following files, which are then used for binarization:
[screenshot: contents of bpe/ after encoding]
#!/usr/bin/env bash
echo 'Cloning fairseq repository...'
git clone git@github.com:facebookresearch/fairseq.git
# learn bpe
bpe=bpe
tmp=tmp
tgt=en
SCRIPTS=mosesdecoder/scripts
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
rm -r $bpe
mkdir -p $bpe

python -u fairseq/scripts/spm_train.py \
--input=$tmp/train.all \
--model_prefix=spm.bpe \
--vocab_size=30000 \
--character_coverage=1.0 \
--model_type=bpe \
--num_threads=45 \
--shuffle_input_sentence

# apply bpe
for split in train valid test;
do
for src in ar de es fa he it nl pl;
do
echo ${split} en-${src}
python fairseq/scripts/spm_encode.py \
--model spm.bpe.model \
--output_format=piece \
--inputs ${tmp}/${split}.en-${src}.${src} ${tmp}/${split}.en-${src}.en \
--outputs ${bpe}/${split}.en-${src}.bpe.unclean.${src} ${bpe}/${split}.en-${src}.bpe.unclean.en
# restrict length ratio
perl $CLEAN -ratio 1.5 ${bpe}/${split}.en-${src}.bpe.unclean ${src} en ${bpe}/${split}.en-${src}.bpe 1 256
rm ${bpe}/${split}.en-${src}.bpe.unclean.*
done
done
binarize.sh
#!/usr/bin/env bash
# create share dict
path=data-bin
rm -r $path
mkdir -p $path

cut -f1 spm.bpe.vocab | tail -n +4 | sed "s/$/ 100/g" > $path/dict.txt
#for lang in ar de es fa he it nl pl en;
#do
#cp $path/dict.txt $path/dict.${lang}.txt
#done

for split in train valid test;
do
for src in ar de es fa he it nl pl;
do
echo ${split} en-${src}
fairseq-preprocess \
--source-lang $src --target-lang en \
--trainpref bpe/train.en-${src}.bpe \
--validpref bpe/valid.en-${src}.bpe \
--testpref bpe/test.en-${src}.bpe \
--destdir $path \
--srcdict $path/dict.txt \
--tgtdict $path/dict.txt
done
done
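A note on the "cut -f1 spm.bpe.vocab | tail -n +4 | sed" line above: spm.bpe.vocab contains one "token<TAB>log-probability" entry per line, and its first three rows are <unk>, <s> and </s>, which fairseq adds to a dictionary by itself. The pipeline therefore keeps only the token column, drops those three rows, and appends a dummy count of 100 so that every line matches fairseq's "token count" dictionary format. A quick sanity check (the line count assumes the 30000-token vocabulary trained above):

head -n 4 spm.bpe.vocab      # <unk>, <s>, </s>, then the first real BPE token
head -n 1 data-bin/dict.txt  # the same first real token, now as "token 100"
wc -l data-bin/dict.txt      # 29997 lines: vocab_size (30000) minus the 3 special symbols

The commented-out loop that copies dict.txt to dict.${lang}.txt is only needed if a later stage expects per-language dictionary files (fairseq's multilingual_translation task, for example, looks for dict.{lang}.txt); fairseq-preprocess itself does not need the copies here because --srcdict and --tgtdict are passed explicitly.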
References
https://github.com/RayeRen/multilingual-kd-pytorch/blob/master/data/iwslt/raw/prepare-iwslt14.sh
https://github.com/facebookresearch/fairseq/issues/2110#issue-614837309
https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100