Tricks

Basics

NLP Tricks

  • https://zhuanlan.zhihu.com/p/549605526

  • https://github.com/michaelzhouy/Linux/blob/main/04-kaggle/03-tricks.md

  • https://zhuanlan.zhihu.com/p/537304957

  • https://github.com/zhengyanzhao1997/NLP-model

Tasks

  • Additional tasks

  • Auxiliary task: UDA https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/161100

Data

  • Data augmentation

    • Back-translation (e.g. with NLLB)

    • Random mask (see the sketch after this list)

    • Pseudo labels (hard or soft)

    • Augment with other samples: for each text A, retrieve the most similar text B, then concatenate B (and its similarity score) with A

    • UDA (Unsupervised Data Augmentation)

  • Data synthesis
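
A minimal sketch of the random-mask augmentation above, assuming a Hugging Face tokenizer; the 15% mask probability is an illustrative choice, not from the original notes.

```python
import random

from transformers import AutoTokenizer


def random_mask(text, tokenizer, mask_prob=0.15):
    """Randomly replace a fraction of tokens with the mask token."""
    tokens = tokenizer.tokenize(text)
    masked = [
        tokenizer.mask_token if random.random() < mask_prob else tok
        for tok in tokens
    ]
    return tokenizer.convert_tokens_to_string(masked)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(random_mask("Data augmentation makes small datasets go further.", tokenizer))
```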

Model

  • pooling (see the sketch after this list)

    • cls

    • mean

    • attention

  • multi sample dropout

  • layer re-init

  • custom head

    • lstm

    • cnn

  • Chinese models: NEZHA / ERNIE / chinese-roberta-wwm

    • https://lonepatient.top/2021/01/20/awesome-pretrained-chinese-nlp-models.html

  • Multilingual

    • Multilingual-BERT, XLM-R, InfoXLM

    • 'bert-base-multilingual-uncased'

  • Add a convolution after the first transformer block

    • This trick is often used in token classification and span prediction tasks
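
A sketch of the mean and attention pooling variants listed above, applied to an encoder's last hidden state; the module names and masking scheme are generic, not tied to any specific solution linked here.

```python
import torch
import torch.nn as nn


class MeanPooling(nn.Module):
    """Average token embeddings, ignoring padded positions."""

    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()         # (B, T, 1)
        summed = (last_hidden_state * mask).sum(dim=1)      # (B, H)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts


class AttentionPooling(nn.Module):
    """Learn per-token weights and take a weighted average."""

    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state, attention_mask):
        scores = self.attention(last_hidden_state)          # (B, T, 1)
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
        weights = torch.softmax(scores, dim=1)
        return (weights * last_hidden_state).sum(dim=1)     # (B, H)
```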

Training

  • Continued pretraining

  • layer-wise learning rate decay

  • Adversarial training (see the FGM sketch after this list)

    • AWP

    • FGM

  • Distillation

  • LoRA

  • swa/ema

  • Speeding up training

    • Cache tokenizer outputs to reduce GPU idle time
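
A common FGM pattern for the adversarial-training bullet above; this is a generic sketch (not taken from any linked solution), and emb_name / epsilon are assumptions to adjust per model.

```python
import torch


class FGM:
    """Fast Gradient Method: add an adversarial perturbation to the embedding matrix."""

    def __init__(self, model, emb_name="word_embeddings", epsilon=1.0):
        self.model = model
        self.emb_name = emb_name      # substring identifying the embedding parameters
        self.epsilon = epsilon        # perturbation radius
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}


# Typical use inside a training loop (sketch):
#   loss = model(**batch).loss
#   loss.backward()                 # gradients on clean inputs
#   fgm.attack()                    # perturb embeddings along the gradient
#   model(**batch).loss.backward()  # accumulate adversarial gradients
#   fgm.restore()                   # undo the perturbation
#   optimizer.step(); optimizer.zero_grad()
```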

Text Classification / Regression

  • Tricks for classification here can largely be shared with image and tabular classification tasks

  • Pretraining

  • Choose a pretraining objective that suits the downstream task, e.g. random mask, whole word mask, or n-gram mask (see the continued-pretraining sketch after this list)

  • https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/330356

  • https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-itpt

    • CLRP https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune-fixed-minor-issues

  • https://www.kaggle.com/competitions/AI4Code/discussion/335294

  • Fine-tuning: https://zhuanlan.zhihu.com/p/386603816

  • Evolution of approaches: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert

  • https://zhuanlan.zhihu.com/p/183852900

  • https://zhuanlan.zhihu.com/p/337212893

  • https://zhuanlan.zhihu.com/p/109992475

  • https://github.com/lonePatient/Bert-Multi-Label-Text-Classification

  • https://github.com/NavePnow/Google-BERT-on-fake_or_real-news-dataset

  • https://github.com/songyouwei/ABSA-PyTorch

  • https://github.com/zhoujx4/NLP-Series-text-cls

  • https://github.com/jsksxs360/How-to-use-Transformers
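
A minimal continued-pretraining (MLM) sketch for the masking objectives mentioned above, built on the Hugging Face Trainer; whole word masking would swap DataCollatorForLanguageModeling for DataCollatorForWholeWordMask. The checkpoint name, corpus path, and hyperparameters are illustrative.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "roberta-base"                       # assumption: any masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Assumption: the task corpus has been dumped to one text file, one example per line.
raw = load_dataset("text", data_files={"train": "corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Random-token masking; adjust mlm_probability or swap the collator for other objectives.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="itpt", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

# Reload this checkpoint as the starting point for downstream fine-tuning.
model.save_pretrained("itpt")
tokenizer.save_pretrained("itpt")
```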

NER / Information Extraction

Pipeline

  • token classification (see the sketch after this list)

  • span classification

  • global pointer

  • mrc

  • llm
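
Token classification is the most common baseline among the options above; a minimal sketch with Hugging Face's AutoModelForTokenClassification (the label set and checkpoint are illustrative, and the classification head is untrained here).

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]    # assumption: toy BIO label set
model_name = "bert-base-cased"                        # assumption: any encoder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

text = "Alice works at Acme Corp."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                      # (1, T, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print(list(zip(tokens, [labels[i] for i in pred_ids])))
```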

Information Retrieval and RAG

Sample selection

  • Hard negative mining

Training schemes (see the contrastive sketch after this list)

  • pointwise

  • pairwise

  • listwise
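
A sketch of contrastive training with in-batch negatives plus one mined hard negative per query, covering the pair/list styles above; the temperature and batch layout are assumptions, and the bi-encoder that produces the embeddings is omitted.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE over the positive, the other in-batch positives, and mined hard negatives.

    query_emb:    (B, H) query embeddings
    pos_emb:      (B, H) embeddings of each query's relevant passage
    hard_neg_emb: (B, H) embeddings of one mined hard negative per query
    """
    query_emb = F.normalize(query_emb, dim=-1)
    candidates = F.normalize(torch.cat([pos_emb, hard_neg_emb], dim=0), dim=-1)  # (2B, H)
    scores = query_emb @ candidates.T / temperature                              # (B, 2B)
    # The relevant passage for query i sits in column i; everything else is a negative.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(scores, targets)
```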

QA

  • When one text requires multiple tokenizer() calls, watch out for the extra special tokens that each call adds, e.g. strip the duplicated EOS/SEP tokens when concatenating the pieces (see the sketch below)
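
A sketch of the special-token issue: tokenize the pieces with add_special_tokens=False and wrap the joined sequence with a single set of special tokens; bert-base-uncased is only used for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
chunks = ["first part of a long document", "second part of the same document"]

# Naive concatenation keeps one [CLS]/[SEP] pair per chunk.
naive = sum((tokenizer(c)["input_ids"] for c in chunks), [])

# Tokenize without special tokens, then wrap the joined sequence exactly once.
body = sum((tokenizer(c, add_special_tokens=False)["input_ids"] for c in chunks), [])
clean = tokenizer.build_inputs_with_special_tokens(body)

print(tokenizer.convert_ids_to_tokens(naive))   # stray [SEP] [CLS] in the middle
print(tokenizer.convert_ids_to_tokens(clean))   # one [CLS] at the start, one [SEP] at the end
```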

Reading

LLM Notebooks

Source-code reading

  • https://github.com/bojone/bert4keras

  • https://github.com/fastnlp/fastNLP

  • https://github.com/huggingface/transformers

  • https://github.com/xv44586/toolkit4nlp

Good pipelines

  • https://github.com/thuwyh/Jigsaw-Unintended-Bias-in-Toxicity-Classification

  • https://github.com/GuanshuoXu/Jigsaw-Rate-Severity-of-Toxic-Comments

  • https://github.com/mathislucka/kaggle_clrp_1st_place_solution

Blog

Tasks

  • https://github.com/yoheikikuta/bert-japanese

  • https://github.com/zhedongzheng/tensorflow-nlp

  • https://github.com/jasoncao11/nlp-notebook

  • https://github.com/jsksxs360/How-to-use-Transformers

Courses

  • https://github.com/suhara/cis6930-fall2021

  • llm: https://github.com/mlabonne/llm-course

  • https://learn.microsoft.com/en-us/training/modules/fundamentals-generative-ai/3-language%20models

Distillation

  • https://github.com/qiangsiwei/bert_distill

  • https://zhuanlan.zhihu.com/p/273378905

  • https://github.com/xv44586/Knowledge-Distillation-NLP

  • https://zhuanlan.zhihu.com/p/93287223

  • https://zhuanlan.zhihu.com/p/92166184

  • https://github.com/Syencil/mobile-yolov5-pruning-distillation

  • CCF competition: automotive-industry user opinion topic and sentiment recognition

    • https://github.com/yilifzf/BDCI_Car_2018

  • AI Challenger 2018

    • https://github.com/xueyouluo/fsauor2018

  • CCKS address relevance task

    • https://github.com/wodejiafeiyu/ccks2021-track3-top1

Text Matching

  • https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb

  • https://github.com/pengming617/text_matching

  • https://zhuanlan.zhihu.com/p/348093157

  • https://www.zhihu.com/question/426631698/answer/1598777190

  • https://github.com/cjymz886/sentence-similarity

  • https://github.com/yanqiangmiffy/sentence-similarity

  • https://github.com/UKPLab/sentence-transformers

  • word2vec: https://www.zhihu.com/question/29978268/answer/86906345

  • https://github.com/zhaogaofeng611/TextMatch

  • https://github.com/MachineLP/TextMatch

NER

  • https://github.com/crownpku/Information-Extraction-Chinese

  • https://github.com/dengxc1220/bert4keras_ner_demo/blob/master/0216bert4keras_Demo.ipynb

  • https://github.com/DLLXW/data-science-competition/tree/main/heywhale/gaiic2022

  • https://zhuanlan.zhihu.com/p/152463745?from_voters_page=true

  • https://github.com/BaberMuyu/2020CCF-NER

BMES

  • https://github.com/jackhuntcn/BIENDATA_BMES_top12

  • https://github.com/wbchief/2022_GAIIC_Task2_5st

  • https://github.com/taishan1994/classical_chinese_extraction

References

  • https://github.com/abhishekkrthakur/bert-sentiment

  • https://blog.csdn.net/weixin_45839693

  • https://github.com/NielsRogge/Transformers-Tutorials

  • https://github.com/mathislucka/kaggle_clrp_1st_place_solution

  • https://github.com/TingFree/NLPer-Arsenal

  • https://www.cnblogs.com/gogoSandy/

  • https://github.com/zhaogaofeng611/TextMatch

  • Dialogue systems: https://zhuanlan.zhihu.com/p/358001553

  • https://github.com/fighting41love/funNLP

  • Classification: https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/326998

  • https://github.com/UKPLab/sentence-transformers

  • https://zhuanlan.zhihu.com/p/371198818

  • FAISS: https://github.com/DunZhang/DFPassageRetrieve

  • https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle

  • https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently

  • https://github.com/graykode/nlp-tutorial

  • https://github.com/ManuelAngel99/PLMpapers

  • https://github.com/SWHL/AI-Competition-Collections

A simple NLP starter to complement structured data

  • https://www.kaggle.com/code/saurabhbagchi/nlp-starter-on-blogpost-dataset/notebook

Resources

  • https://github.com/graykode/nlp-tutorial

  • https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/discussion/228227

  • https://github.com/zhpmatrix/nlp-competitions-list-review

  • https://zhuanlan.zhihu.com/p/166608727

  • https://github.com/yym6472?tab=stars

  • https://zhuanlan.zhihu.com/p/371198818

  • https://github.com/NielsRogge/Transformers-Tutorials

  • https://zhuanlan.zhihu.com/p/33901181

  • https://github.com/lonePatient

  • https://github.com/HuipengXu/oppo-end2end/tree/master/src

  • https://github.com/oleg-yaroshevskiy/quest_qa_labeling

  • https://github.com/abhishekkrthakur/long-text-token-classification

  • https://github.com/suicao/tweet-extraction

  • https://github.com/PaddlePaddle/PaddleNLP

  • https://github.com/affjljoo3581/Feedback-Prize-Competition

  • https://github.com/yanqiangmiffy/transformers-tutorial

  • Natural Language Processing with Transformers (book)

  • https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/307488

Knowledge Graph
