Tricks
Basics
Video: Hung-yi Lee's Transformer lecture
Various downstream tasks: text classification, NER, relation extraction, reading comprehension, text matching, knowledge graphs
https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners
NLP Tricks
https://zhuanlan.zhihu.com/p/549605526
https://github.com/michaelzhouy/Linux/blob/main/04-kaggle/03-tricks.md
https://zhuanlan.zhihu.com/p/537304957
https://github.com/zhengyanzhao1997/NLP-model
Tasks
Additional tasks
Auxiliary task: UDA https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/161100
Data
Data augmentation
Back-translation, e.g. with NLLB (see the sketch after this list)
Random mask
Pseudo-labels (hard or soft)
Augmentation via other samples: retrieve the most similar text B for a given text A, then concatenate B and the A/B similarity score with A
UDA (Unsupervised Data Augmentation)
Data synthesis
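A minimal back-translation sketch with NLLB-200 via HuggingFace transformers. The checkpoint name and language codes follow NLLB conventions; the English/Chinese round trip and helper names are illustrative assumptions, not a fixed recipe:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NLLB-200 distilled checkpoint; language codes follow NLLB conventions
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start in the target language
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def back_translate(text):
    # en -> zh -> en; the round-trip paraphrase becomes an augmented sample
    pivot = translate(text, "eng_Latn", "zho_Hans")
    return translate(pivot, "zho_Hans", "eng_Latn")
```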
Model
pooling (see the sketch after this list)
cls
mean
attention
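A minimal sketch of the mean and attention pooling variants over the encoder's last hidden states, assuming a BERT-style output and an attention mask; names are illustrative:

```python
import torch
import torch.nn as nn

def mean_pooling(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padded positions
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1e-9)

class AttentionPooling(nn.Module):
    # Learn a scalar weight per token, then take the weighted sum
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state, attention_mask):
        weights = self.score(last_hidden_state).squeeze(-1)
        weights = weights.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(weights, dim=-1).unsqueeze(-1)
        return (last_hidden_state * weights).sum(dim=1)
```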
multi sample dropout
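Multi-sample dropout averages the head output over several dropout masks of the same features. A minimal sketch, assuming a pooled feature vector; class and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    # Run the same classifier over several dropout samples and average
    def __init__(self, hidden_size, num_labels, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        logits = [self.classifier(d(features)) for d in self.dropouts]
        return torch.stack(logits).mean(dim=0)
```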
layer re-init
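Re-initializing the top encoder layers before fine-tuning often stabilizes training. A minimal sketch, assuming a BERT-style model exposing model.encoder.layer and a config with initializer_range:

```python
import torch.nn as nn

def reinit_top_layers(model, n_layers=2):
    # Reset the weights of the last n encoder layers to their init distribution
    for layer in model.encoder.layer[-n_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.weight.data.fill_(1.0)
                module.bias.data.zero_()
```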
custom head (see the LSTM/CNN sketch after this list)
lstm
cnn
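A minimal sketch of an LSTM head and a CNN head over token-level hidden states; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class LSTMHead(nn.Module):
    # BiLSTM over the encoder's token representations
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):            # (B, T, H)
        out, _ = self.lstm(hidden_states)
        return self.fc(out)                      # per-token logits

class CNNHead(nn.Module):
    # 1D convolution over the sequence dimension
    def __init__(self, hidden_size, num_labels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              padding=kernel_size // 2)
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):            # (B, T, H)
        x = self.conv(hidden_states.transpose(1, 2)).transpose(1, 2)
        return self.fc(torch.relu(x))
```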
Chinese models: NEZHA / ERNIE / chinese-roberta-wwm
https://lonepatient.top/2021/01/20/awesome-pretrained-chinese-nlp-models.html
Multilingual
Multilingual-Bert,XLM-R,InfoXLM
'bert-base-multilingual-uncased'
Add a convolution after the first transformer block
This trick is often used in token classification and span prediction tasks (see the sketch below)
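One simple way to approximate this without surgery on the encoder internals is to tap the first block's output via output_hidden_states and mix a convolved version of it back into the final representation. A hypothetical sketch; the checkpoint, label count, and combination rule are all assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ConvAugmentedEncoder(nn.Module):
    # Apply a 1D conv to the output of the first transformer block and
    # fuse it with the last layer before a token-level head.
    def __init__(self, name="bert-base-uncased", num_labels=9):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        h = self.backbone.config.hidden_size
        self.conv = nn.Conv1d(h, h, kernel_size=3, padding=1)
        self.head = nn.Linear(h, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        first_block = out.hidden_states[1]        # output of block 1
        x = self.conv(first_block.transpose(1, 2)).transpose(1, 2)
        x = torch.relu(x) + out.last_hidden_state # simple additive fusion
        return self.head(x)                       # per-token logits
```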
Training
Continued pretraining (domain-/task-adaptive)
layer-wise learning rate decay
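A minimal sketch of layer-wise learning-rate decay as optimizer parameter groups, assuming a BERT-style backbone with embeddings and encoder.layer attributes; the base LR and decay factor are illustrative:

```python
import torch

def llrd_param_groups(backbone, head, base_lr=2e-5, decay=0.9):
    # The head and the top encoder layer keep base_lr; each layer
    # below gets base_lr multiplied by a further factor of `decay`.
    groups = [{"params": head.parameters(), "lr": base_lr}]
    layers = [backbone.embeddings] + list(backbone.encoder.layer)
    lr = base_lr
    for layer in reversed(layers):
        groups.append({"params": layer.parameters(), "lr": lr})
        lr *= decay
    return groups

# optimizer = torch.optim.AdamW(llrd_param_groups(model.bert, model.classifier))
```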
Adversarial training (see the FGM sketch after this list)
AWP
FGM
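A minimal FGM sketch: after the normal backward pass, perturb the word embeddings along the normalized gradient, run a second forward/backward, then restore. The emb_name pattern matches BERT-style embedding parameter names; the usage comments assume a generic training loop:

```python
import torch

class FGM:
    # Fast Gradient Method on the embedding layer
    def __init__(self, model, eps=1.0, emb_name="word_embeddings"):
        self.model, self.eps, self.emb_name = model, eps, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.eps * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                param.data = self.backup[name]
        self.backup = {}

# Usage per training step:
#   loss.backward()       # normal gradients
#   fgm.attack()          # perturb embeddings
#   loss_adv = model(**batch).loss
#   loss_adv.backward()   # accumulate adversarial gradients
#   fgm.restore()         # undo the perturbation
#   optimizer.step(); optimizer.zero_grad()
```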
Distillation
LoRA
swa/ema
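A minimal EMA sketch: keep an exponential moving average of trainable weights during training, swap it in for validation, then restore; the decay value is illustrative (SWA is available separately in torch.optim.swa_utils):

```python
class EMA:
    # Exponential moving average of weights; swap in for evaluation
    def __init__(self, model, decay=0.999):
        self.model, self.decay = model, decay
        self.shadow = {n: p.data.clone()
                       for n, p in model.named_parameters() if p.requires_grad}
        self.backup = {}

    def update(self):  # call after each optimizer.step()
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.shadow[n].mul_(self.decay).add_(p.data, alpha=1 - self.decay)

    def apply_shadow(self):  # call before validation
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.backup[n] = p.data
                p.data = self.shadow[n].clone()

    def restore(self):  # call after validation to resume training
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                p.data = self.backup[n]
        self.backup = {}
```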
Speed up training
Cache tokenizer outputs to disk to cut GPU idle time (see the sketch below)
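One way to do this is to tokenize the whole dataset once up front with the datasets library and persist it; the file name and "text" column here are hypothetical:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize once with multiple workers, then persist to disk; later runs
# load the cached arrays instead of re-tokenizing while the GPU waits.
ds = load_dataset("csv", data_files="train.csv")["train"]
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
            batched=True, num_proc=4, remove_columns=["text"])
ds.save_to_disk("train_tokenized")

# Reload in subsequent runs:
# from datasets import load_from_disk
# ds = load_from_disk("train_tokenized")
```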
Text classification / regression
Many classification tricks transfer across modalities; combine them with tricks from image and tabular classification tasks
Pretraining
Choose a pretraining objective suited to the downstream task, e.g. random masking, whole word masking, or n-gram masking (see the collator sketch after this list)
https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/330356
https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-itpt
CLRP https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune-fixed-minor-issues
https://www.kaggle.com/competitions/AI4Code/discussion/335294
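A minimal sketch of switching the masking strategy via the built-in transformers data collators; random token masking and whole-word masking ship with the library, while n-gram masking usually needs a custom collator:

```python
from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          DataCollatorForWholeWordMask)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

# Random (sub)token masking, standard BERT-style MLM
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Whole word masking: all sub-tokens of a word are masked together
wwm_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15)
```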
Fine-tuning https://zhuanlan.zhihu.com/p/386603816
Evolution: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
https://zhuanlan.zhihu.com/p/183852900
https://zhuanlan.zhihu.com/p/337212893
https://zhuanlan.zhihu.com/p/109992475
https://github.com/lonePatient/Bert-Multi-Label-Text-Classification
https://github.com/NavePnow/Google-BERT-on-fake_or_real-news-dataset
https://github.com/songyouwei/ABSA-PyTorch
https://github.com/zhoujx4/NLP-Series-text-cls
https://github.com/jsksxs360/How-to-use-Transformers
NER / information extraction
pipeline
token classification (see the baseline sketch after this list)
span classification
global pointer
mrc
llm
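A minimal token-classification baseline for NER with HuggingFace transformers; the checkpoint, label count, and example text are illustrative:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=9)   # e.g. BIO tags for 4 entity types + O

text = "张三在北京工作"                   # placeholder example sentence
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, num_labels)
pred_tags = logits.argmax(dim=-1)        # per-token tag ids
```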
Information retrieval and RAG
Sample selection
Hard negative mining
Training schemes (see the pairwise sketch after this list)
pointwise
pairwise
listwise
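A minimal pairwise training sketch with sentence-transformers. MultipleNegativesRankingLoss treats the other in-batch passages as negatives, and mined hard negatives can be appended as extra texts per example; the model name and example pairs are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    # (query, positive) pairs; a mined hard negative can be added as a third text
    InputExample(texts=["what is nlp", "NLP studies language with computers"]),
    InputExample(texts=["capital of france", "Paris is the capital of France"]),
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```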
QA
When a single text needs multiple tokenizer() calls, watch out for the extra special tokens: e.g. strip the EOS token from each piece before aggregating
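A minimal sketch of this pitfall: when chunks are tokenized separately and later concatenated, either disable special tokens for the inner pieces or strip them before joining; the checkpoint and chunk texts are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

chunk_a, chunk_b = "first half of a long document", "second half"

# Option 1: tokenize chunks without CLS/SEP/EOS, add them once at the end
ids = (tok(chunk_a, add_special_tokens=False)["input_ids"]
       + tok(chunk_b, add_special_tokens=False)["input_ids"])
input_ids = tok.build_inputs_with_special_tokens(ids)

# Option 2: strip the boundary tokens from each piece before concatenating
a = tok(chunk_a)["input_ids"]
b = tok(chunk_b)["input_ids"]
joined = a[:-1] + b[1:]   # drop A's trailing </s> and B's leading <s>
```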
Reading
LLM notebooks
Source-code reading
https://github.com/bojone/bert4keras
https://github.com/fastnlp/fastNLP
https://github.com/huggingface/transformers
https://github.com/xv44586/toolkit4nlp
Good pipeline
https://github.com/thuwyh/Jigsaw-Unintended-Bias-in-Toxicity-Classification
https://github.com/GuanshuoXu/Jigsaw-Rate-Severity-of-Toxic-Comments
https://github.com/mathislucka/kaggle_clrp_1st_place_solution
Blog
https://zhuanlan.zhihu.com/p/416002644
https://zhuanlan.zhihu.com/p/109992475
Adding extra SEP tokens, and utilizing transformer representations efficiently: https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
Preprocessing: https://www.kaggle.com/code/longtng/nlp-preprocessing-feature-extraction-methods-a-z
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/discussion/372782
Tasks
https://github.com/yoheikikuta/bert-japanese
https://github.com/zhedongzheng/tensorflow-nlp
https://github.com/jasoncao11/nlp-notebook
https://github.com/jsksxs360/How-to-use-Transformers
Courses
https://github.com/suhara/cis6930-fall2021
llm: https://github.com/mlabonne/llm-course
https://learn.microsoft.com/en-us/training/modules/fundamentals-generative-ai/3-language%20models
Distillation
https://github.com/qiangsiwei/bert_distill
https://zhuanlan.zhihu.com/p/273378905
https://github.com/xv44586/Knowledge-Distillation-NLP
https://zhuanlan.zhihu.com/p/93287223
https://zhuanlan.zhihu.com/p/92166184
https://github.com/Syencil/mobile-yolov5-pruning-distillation
CCF competition: topic and sentiment classification of user opinions in the automotive industry
https://github.com/yilifzf/BDCI_Car_2018
AI Challenger 2018
https://github.com/xueyouluo/fsauor2018
CCKS address relevance task
https://github.com/wodejiafeiyu/ccks2021-track3-top1
Text matching
https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb
https://github.com/pengming617/text_matching
https://zhuanlan.zhihu.com/p/348093157
https://www.zhihu.com/question/426631698/answer/1598777190
https://github.com/cjymz886/sentence-similarity
https://github.com/yanqiangmiffy/sentence-similarity
https://github.com/UKPLab/sentence-transformers
word2vec: https://www.zhihu.com/question/29978268/answer/86906345
https://github.com/zhaogaofeng611/TextMatch
https://github.com/MachineLP/TextMatch
NER
https://github.com/crownpku/Information-Extraction-Chinese
https://github.com/dengxc1220/bert4keras_ner_demo/blob/master/0216bert4keras_Demo.ipynb
https://github.com/DLLXW/data-science-competition/tree/main/heywhale/gaiic2022
https://zhuanlan.zhihu.com/p/152463745?from_voters_page=true
https://github.com/BaberMuyu/2020CCF-NER
BMES
https://github.com/jackhuntcn/BIENDATA_BMES_top12
https://github.com/wbchief/2022_GAIIC_Task2_5st
https://github.com/taishan1994/classical_chinese_extraction
References
https://github.com/abhishekkrthakur/bert-sentiment
https://blog.csdn.net/weixin_45839693
https://github.com/NielsRogge/Transformers-Tutorials
https://github.com/mathislucka/kaggle_clrp_1st_place_solution
https://github.com/TingFree/NLPer-Arsenal
https://www.cnblogs.com/gogoSandy/
https://github.com/zhaogaofeng611/TextMatch
Dialogue systems: https://zhuanlan.zhihu.com/p/358001553
https://github.com/fighting41love/funNLP
Classification: https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/326998
https://github.com/UKPLab/sentence-transformers
https://zhuanlan.zhihu.com/p/371198818
FAISS: https://github.com/DunZhang/DFPassageRetrieve
https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
https://github.com/graykode/nlp-tutorial
https://github.com/ManuelAngel99/PLMpapers
https://github.com/SWHL/AI-Competition-Collections
A simple NLP approach to boost structured-data tasks
https://www.kaggle.com/code/saurabhbagchi/nlp-starter-on-blogpost-dataset/notebook
Resources:
https://github.com/graykode/nlp-tutorial
https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/discussion/228227
https://github.com/zhpmatrix/nlp-competitions-list-review
https://zhuanlan.zhihu.com/p/166608727
https://github.com/yym6472?tab=stars
https://zhuanlan.zhihu.com/p/371198818
https://github.com/NielsRogge/Transformers-Tutorials
https://zhuanlan.zhihu.com/p/33901181
https://github.com/lonePatient
https://github.com/HuipengXu/oppo-end2end/tree/master/src
https://github.com/oleg-yaroshevskiy/quest_qa_labeling
https://github.com/abhishekkrthakur/long-text-token-classification
https://github.com/suicao/tweet-extraction
https://github.com/PaddlePaddle/PaddleNLP
https://github.com/affjljoo3581/Feedback-Prize-Competition
https://github.com/yanqiangmiffy/transformers-tutorial
Natural Language Processing with Transformers
https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/307488
Knowledge graphs
https://github.com/zhpmatrix/nlp-competitions-list-review
https://github.com/jackhuntcn/BIENDATA_BMES_top12
https://github.com/percent4?tab=stars