Tricks
Basics
Video: Hung-yi Lee's Transformer lecture
Various downstream tasks: text classification, NER, relation extraction, reading comprehension, text matching, knowledge graphs
https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners
NLP Tricks
https://zhuanlan.zhihu.com/p/549605526
https://github.com/michaelzhouy/Linux/blob/main/04-kaggle/03-tricks.md
https://zhuanlan.zhihu.com/p/537304957
https://github.com/zhengyanzhao1997/NLP-model
Tasks
Additional tasks
Auxiliary task: UDA https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/161100
Data
Data augmentation
Back-translation, e.g. with NLLB (see the sketch after this list)
Random mask
Pseudo-labels (hard or soft)
Augmentation via other samples: retrieve the most similar text B for a given text A, then concatenate B and the A/B similarity score with A
UDA (Unsupervised Data Augmentation)
Data synthesis
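A minimal back-translation sketch with NLLB-200 via HuggingFace transformers. The checkpoint name and language codes follow NLLB conventions; the English/Chinese round trip and helper names are illustrative assumptions, not a fixed recipe:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NLLB-200 distilled checkpoint; language codes follow NLLB conventions
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start in the target language
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

def back_translate(text):
    # en -> zh -> en; the round-trip paraphrase becomes an augmented sample
    pivot = translate(text, "eng_Latn", "zho_Hans")
    return translate(pivot, "zho_Hans", "eng_Latn")
```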
Model
pooling (see the sketch after this list)
cls
mean
attention
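A minimal sketch of the mean and attention pooling variants over the encoder's last hidden states, assuming a BERT-style output and an attention mask; names are illustrative:

```python
import torch
import torch.nn as nn

def mean_pooling(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padded positions
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1e-9)

class AttentionPooling(nn.Module):
    # Learn a scalar weight per token, then take the weighted sum
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state, attention_mask):
        weights = self.score(last_hidden_state).squeeze(-1)
        weights = weights.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(weights, dim=-1).unsqueeze(-1)
        return (last_hidden_state * weights).sum(dim=1)
```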
multi sample dropout
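Multi-sample dropout averages the head output over several dropout masks of the same features. A minimal sketch, assuming a pooled feature vector; class and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    # Run the same classifier over several dropout samples and average
    def __init__(self, hidden_size, num_labels, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        logits = [self.classifier(d(features)) for d in self.dropouts]
        return torch.stack(logits).mean(dim=0)
```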
layer re-init
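Re-initializing the top encoder layers before fine-tuning often stabilizes training. A minimal sketch, assuming a BERT-style model exposing model.encoder.layer and a config with initializer_range:

```python
import torch.nn as nn

def reinit_top_layers(model, n_layers=2):
    # Reset the weights of the last n encoder layers to their init distribution
    for layer in model.encoder.layer[-n_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.weight.data.fill_(1.0)
                module.bias.data.zero_()
```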
custom head (see the LSTM/CNN sketch after this list)
lstm
cnn
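A minimal sketch of an LSTM head and a CNN head over token-level hidden states; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class LSTMHead(nn.Module):
    # BiLSTM over the encoder's token representations
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):            # (B, T, H)
        out, _ = self.lstm(hidden_states)
        return self.fc(out)                      # per-token logits

class CNNHead(nn.Module):
    # 1D convolution over the sequence dimension
    def __init__(self, hidden_size, num_labels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              padding=kernel_size // 2)
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):            # (B, T, H)
        x = self.conv(hidden_states.transpose(1, 2)).transpose(1, 2)
        return self.fc(torch.relu(x))
```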
Chinese models: NEZHA / ERNIE / chinese-roberta-wwm
https://lonepatient.top/2021/01/20/awesome-pretrained-chinese-nlp-models.html
Multilingual
Multilingual-Bert,XLM-R,InfoXLM
'bert-base-multilingual-uncased'
Add a convolution after the first transformer block
This trick is often used in token classification and span prediction tasks (see the sketch below)
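One simple way to approximate this without surgery on the encoder internals is to tap the first block's output via output_hidden_states and mix a convolved version of it back into the final representation. A hypothetical sketch; the checkpoint, label count, and combination rule are all assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ConvAugmentedEncoder(nn.Module):
    # Apply a 1D conv to the output of the first transformer block and
    # fuse it with the last layer before a token-level head.
    def __init__(self, name="bert-base-uncased", num_labels=9):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        h = self.backbone.config.hidden_size
        self.conv = nn.Conv1d(h, h, kernel_size=3, padding=1)
        self.head = nn.Linear(h, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        first_block = out.hidden_states[1]        # output of block 1
        x = self.conv(first_block.transpose(1, 2)).transpose(1, 2)
        x = torch.relu(x) + out.last_hidden_state # simple additive fusion
        return self.head(x)                       # per-token logits
```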
Training
Continued pretraining (domain-/task-adaptive)
layer-wise learning rate decay
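A minimal sketch of layer-wise learning-rate decay as optimizer parameter groups, assuming a BERT-style backbone with embeddings and encoder.layer attributes; the base LR and decay factor are illustrative:

```python
import torch

def llrd_param_groups(backbone, head, base_lr=2e-5, decay=0.9):
    # The head and the top encoder layer keep base_lr; each layer
    # below gets base_lr multiplied by a further factor of `decay`.
    groups = [{"params": head.parameters(), "lr": base_lr}]
    layers = [backbone.embeddings] + list(backbone.encoder.layer)
    lr = base_lr
    for layer in reversed(layers):
        groups.append({"params": layer.parameters(), "lr": lr})
        lr *= decay
    return groups

# optimizer = torch.optim.AdamW(llrd_param_groups(model.bert, model.classifier))
```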
Adversarial training (see the FGM sketch after this list)
AWP
FGM
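A minimal FGM sketch: after the normal backward pass, perturb the word embeddings along the normalized gradient, run a second forward/backward, then restore. The emb_name pattern matches BERT-style embedding parameter names; the usage comments assume a generic training loop:

```python
import torch

class FGM:
    # Fast Gradient Method on the embedding layer
    def __init__(self, model, eps=1.0, emb_name="word_embeddings"):
        self.model, self.eps, self.emb_name = model, eps, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.eps * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name:
                param.data = self.backup[name]
        self.backup = {}

# Usage per training step:
#   loss.backward()       # normal gradients
#   fgm.attack()          # perturb embeddings
#   loss_adv = model(**batch).loss
#   loss_adv.backward()   # accumulate adversarial gradients
#   fgm.restore()         # undo the perturbation
#   optimizer.step(); optimizer.zero_grad()
```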
Distillation
LoRA
swa/ema
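A minimal EMA sketch: keep an exponential moving average of trainable weights during training, swap it in for validation, then restore; the decay value is illustrative (SWA is available separately in torch.optim.swa_utils):

```python
class EMA:
    # Exponential moving average of weights; swap in for evaluation
    def __init__(self, model, decay=0.999):
        self.model, self.decay = model, decay
        self.shadow = {n: p.data.clone()
                       for n, p in model.named_parameters() if p.requires_grad}
        self.backup = {}

    def update(self):  # call after each optimizer.step()
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.shadow[n].mul_(self.decay).add_(p.data, alpha=1 - self.decay)

    def apply_shadow(self):  # call before validation
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.backup[n] = p.data
                p.data = self.shadow[n].clone()

    def restore(self):  # call after validation to resume training
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                p.data = self.backup[n]
        self.backup = {}
```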
Speed up training
Cache tokenizer outputs to disk to cut GPU idle time (see the sketch below)
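One way to do this is to tokenize the whole dataset once up front with the datasets library and persist it; the file name and "text" column here are hypothetical:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize once with multiple workers, then persist to disk; later runs
# load the cached arrays instead of re-tokenizing while the GPU waits.
ds = load_dataset("csv", data_files="train.csv")["train"]
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
            batched=True, num_proc=4, remove_columns=["text"])
ds.save_to_disk("train_tokenized")

# Reload in subsequent runs:
# from datasets import load_from_disk
# ds = load_from_disk("train_tokenized")
```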
Text classification / regression
Many classification tricks transfer across modalities; combine them with tricks from image and tabular classification tasks
Pretraining
Choose a pretraining objective suited to the downstream task, e.g. random masking, whole word masking, or n-gram masking (see the collator sketch after this list)
https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/330356
https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-itpt
CLRP https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune-fixed-minor-issues
https://www.kaggle.com/competitions/AI4Code/discussion/335294
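A minimal sketch of switching the masking strategy via the built-in transformers data collators; random token masking and whole-word masking ship with the library, while n-gram masking usually needs a custom collator:

```python
from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          DataCollatorForWholeWordMask)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

# Random (sub)token masking, standard BERT-style MLM
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Whole word masking: all sub-tokens of a word are masked together
wwm_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15)
```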
Fine-tuning https://zhuanlan.zhihu.com/p/386603816
Evolution: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
https://zhuanlan.zhihu.com/p/183852900
https://zhuanlan.zhihu.com/p/337212893
https://zhuanlan.zhihu.com/p/109992475
https://github.com/lonePatient/Bert-Multi-Label-Text-Classification
https://github.com/NavePnow/Google-BERT-on-fake_or_real-news-dataset
https://github.com/songyouwei/ABSA-PyTorch
https://github.com/zhoujx4/NLP-Series-text-cls
https://github.com/jsksxs360/How-to-use-Transformers
NER / information extraction
pipeline
token classification (see the baseline sketch after this list)
span classification
global pointer
mrc
llm
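A minimal token-classification baseline for NER with HuggingFace transformers; the checkpoint, label count, and example text are illustrative:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=9)   # e.g. BIO tags for 4 entity types + O

text = "张三在北京工作"                   # placeholder example sentence
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, num_labels)
pred_tags = logits.argmax(dim=-1)        # per-token tag ids
```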
Information retrieval and RAG
Sample selection
Hard negative mining
Training schemes (see the pairwise sketch after this list)
pointwise
pairwise
listwise
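A minimal pairwise training sketch with sentence-transformers. MultipleNegativesRankingLoss treats the other in-batch passages as negatives, and mined hard negatives can be appended as extra texts per example; the model name and example pairs are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [
    # (query, positive) pairs; a mined hard negative can be added as a third text
    InputExample(texts=["what is nlp", "NLP studies language with computers"]),
    InputExample(texts=["capital of france", "Paris is the capital of France"]),
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```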
QA
When a single text needs multiple tokenizer() calls, watch out for the extra special tokens: e.g. strip the EOS token from each piece before aggregating
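A minimal sketch of this pitfall: when chunks are tokenized separately and later concatenated, either disable special tokens for the inner pieces or strip them before joining; the checkpoint and chunk texts are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

chunk_a, chunk_b = "first half of a long document", "second half"

# Option 1: tokenize chunks without CLS/SEP/EOS, add them once at the end
ids = (tok(chunk_a, add_special_tokens=False)["input_ids"]
       + tok(chunk_b, add_special_tokens=False)["input_ids"])
input_ids = tok.build_inputs_with_special_tokens(ids)

# Option 2: strip the boundary tokens from each piece before concatenating
a = tok(chunk_a)["input_ids"]
b = tok(chunk_b)["input_ids"]
joined = a[:-1] + b[1:]   # drop A's trailing </s> and B's leading <s>
```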
Reading
LLM notebooks
Source-code reading
https://github.com/bojone/bert4keras
https://github.com/fastnlp/fastNLP
https://github.com/huggingface/transformers
https://github.com/xv44586/toolkit4nlp
Good pipeline
https://github.com/thuwyh/Jigsaw-Unintended-Bias-in-Toxicity-Classification
https://github.com/GuanshuoXu/Jigsaw-Rate-Severity-of-Toxic-Comments
https://github.com/mathislucka/kaggle_clrp_1st_place_solution
Blog
https://zhuanlan.zhihu.com/p/416002644
https://zhuanlan.zhihu.com/p/109992475
Adding extra SEP tokens, and utilizing transformer representations efficiently: https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
Preprocessing: https://www.kaggle.com/code/longtng/nlp-preprocessing-feature-extraction-methods-a-z
https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations/discussion/372782
Tasks
https://github.com/yoheikikuta/bert-japanese
https://github.com/zhedongzheng/tensorflow-nlp
https://github.com/jasoncao11/nlp-notebook
https://github.com/jsksxs360/How-to-use-Transformers
Courses
https://github.com/suhara/cis6930-fall2021
llm: https://github.com/mlabonne/llm-course
https://learn.microsoft.com/en-us/training/modules/fundamentals-generative-ai/3-language%20models
Distillation
https://github.com/qiangsiwei/bert_distill
https://zhuanlan.zhihu.com/p/273378905
https://github.com/xv44586/Knowledge-Distillation-NLP
https://zhuanlan.zhihu.com/p/93287223
https://zhuanlan.zhihu.com/p/92166184
https://github.com/Syencil/mobile-yolov5-pruning-distillation
CCF competition: topic and sentiment classification of user opinions in the automotive industry
https://github.com/yilifzf/BDCI_Car_2018
AI Challenger 2018
https://github.com/xueyouluo/fsauor2018
CCKS address relevance task
https://github.com/wodejiafeiyu/ccks2021-track3-top1
Text matching
https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb
https://github.com/pengming617/text_matching
https://zhuanlan.zhihu.com/p/348093157
https://www.zhihu.com/question/426631698/answer/1598777190
https://github.com/cjymz886/sentence-similarity
https://github.com/yanqiangmiffy/sentence-similarity
https://github.com/UKPLab/sentence-transformers
word2vec: https://www.zhihu.com/question/29978268/answer/86906345
https://github.com/zhaogaofeng611/TextMatch
https://github.com/MachineLP/TextMatch
NER
https://github.com/crownpku/Information-Extraction-Chinese
https://github.com/dengxc1220/bert4keras_ner_demo/blob/master/0216bert4keras_Demo.ipynb
https://github.com/DLLXW/data-science-competition/tree/main/heywhale/gaiic2022
https://zhuanlan.zhihu.com/p/152463745?from_voters_page=true
https://github.com/BaberMuyu/2020CCF-NER
BMES
https://github.com/jackhuntcn/BIENDATA_BMES_top12
https://github.com/wbchief/2022_GAIIC_Task2_5st
https://github.com/taishan1994/classical_chinese_extraction
References
https://github.com/abhishekkrthakur/bert-sentiment
https://blog.csdn.net/weixin_45839693
https://github.com/NielsRogge/Transformers-Tutorials
https://github.com/mathislucka/kaggle_clrp_1st_place_solution
https://github.com/TingFree/NLPer-Arsenal
https://www.cnblogs.com/gogoSandy/
https://github.com/zhaogaofeng611/TextMatch
Dialogue systems: https://zhuanlan.zhihu.com/p/358001553
https://github.com/fighting41love/funNLP
Classification: https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/326998
https://github.com/UKPLab/sentence-transformers
https://zhuanlan.zhihu.com/p/371198818
FAISS: https://github.com/DunZhang/DFPassageRetrieve
https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
https://github.com/graykode/nlp-tutorial
https://github.com/ManuelAngel99/PLMpapers
https://github.com/SWHL/AI-Competition-Collections
A simple NLP approach to boost structured-data tasks
https://www.kaggle.com/code/saurabhbagchi/nlp-starter-on-blogpost-dataset/notebook
Resources:
https://github.com/graykode/nlp-tutorial
https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/discussion/228227
https://github.com/zhpmatrix/nlp-competitions-list-review
https://zhuanlan.zhihu.com/p/166608727
https://github.com/yym6472?tab=stars
https://zhuanlan.zhihu.com/p/371198818
https://github.com/NielsRogge/Transformers-Tutorials
https://zhuanlan.zhihu.com/p/33901181
https://github.com/lonePatient
https://github.com/HuipengXu/oppo-end2end/tree/master/src
https://github.com/oleg-yaroshevskiy/quest_qa_labeling
https://github.com/abhishekkrthakur/long-text-token-classification
https://github.com/suicao/tweet-extraction
https://github.com/PaddlePaddle/PaddleNLP
https://github.com/affjljoo3581/Feedback-Prize-Competition
https://github.com/yanqiangmiffy/transformers-tutorial
Natural Language Processing with Transformers
https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/307488
Knowledge graphs
https://github.com/zhpmatrix/nlp-competitions-list-review
https://github.com/jackhuntcn/BIENDATA_BMES_top12
https://github.com/percent4?tab=stars