{"id":4741,"date":"2025-09-24T11:06:17","date_gmt":"2025-09-24T11:06:17","guid":{"rendered":"https:\/\/palmer-consulting.com\/transform-vs-older-nlp-models\/"},"modified":"2025-09-24T11:06:17","modified_gmt":"2025-09-24T11:06:17","slug":"transform-vs-older-nlp-models","status":"publish","type":"post","link":"https:\/\/palmer-consulting.com\/en\/transform-vs-older-nlp-models\/","title":{"rendered":"Transform vs. older NLP models"},"content":{"rendered":"<h2 data-start=\"214\" data-end=\"262\">What Transformer fundamentally changes<\/h2>\n<ul data-start=\"263\" data-end=\"1254\">\n<li data-start=\"263\" data-end=\"454\">\n<p data-start=\"265\" data-end=\"454\"><strong data-start=\"265\" data-end=\"322\">Central mechanism:<em data-start=\"303\" data-end=\"319\">self-attention<\/em><\/strong><br data-start=\"322\" data-end=\"325\">\u2192 The model &#8220;looks&#8221; at <strong data-start=\"349\" data-end=\"379\">all the words in parallel<\/strong> and learns which relationships are important, even <strong data-start=\"432\" data-end=\"453\">at long distance<\/strong>.<\/p>\n<\/li>\n<li data-start=\"455\" data-end=\"798\">\n<p data-start=\"457\" data-end=\"798\"><strong data-start=\"457\" data-end=\"484\">Massive parallelization<\/strong>: no strictly sequential processing as in <strong data-start=\"543\" data-end=\"612\">Recurrent <em data-start=\"582\" data-end=\"609\">Neural<\/em> Networks (RNN<\/strong> ) \u2192 <strong data-start=\"615\" data-end=\"652\">much faster training<\/strong> on <strong data-start=\"657\" data-end=\"726\">Graphics Processing <em data-start=\"696\" data-end=\"723\">Units<\/em>(GPU)<\/strong> and <strong data-start=\"730\" data-end=\"797\">Tensor Processing <em data-start=\"769\" data-end=\"794\">Units<\/em>(TPU<\/strong>).<\/p>\n<\/li>\n<li data-start=\"799\" data-end=\"943\">\n<p data-start=\"801\" data-end=\"943\"><strong data-start=\"801\" data-end=\"818\">Long context<\/strong>: handles <strong data-start=\"829\" data-end=\"863\">large contextual windows<\/strong> (thousands of <em data-start=\"881\" data-end=\"889\">tokens<\/em>), where RNN and variants lose remote memory.<\/p>\n<\/li>\n<li data-start=\"944\" data-end=\"1114\">\n<p data-start=\"946\" data-end=\"1114\"><strong data-start=\"946\" data-end=\"971\">Scale (scalability)<\/strong>: <strong data-start=\"977\" data-end=\"992\">scales<\/strong> very well (parameters, data, computation) \u2192 hence modern <strong data-start=\"1044\" data-end=\"1104\">Large Language <em data-start=\"1078\" data-end=\"1101\">Models<\/em>(LLMs)<\/strong>.<\/p>\n<\/li>\n<li data-start=\"1115\" data-end=\"1254\">\n<p data-start=\"1117\" data-end=\"1254\"><strong data-start=\"1117\" data-end=\"1132\">Flexibility<\/strong>: extends to <strong data-start=\"1146\" data-end=\"1160\">multimodal<\/strong> (text, image, audio), in-context learning <em data-start=\"1187\" data-end=\"1212\"><strong data-start=\"1188\" data-end=\"1211\">in-context learning<\/strong><\/em> and efficient <strong data-start=\"1219\" data-end=\"1244\">refinement\/fine-tuning<\/strong>.<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"1256\" data-end=\"1300\"><strong data-start=\"1256\" data-end=\"1300\">Limits still true on the Transformer side<\/strong><\/p>\n<ul data-start=\"1301\" data-end=\"1647\">\n<li data-start=\"1301\" data-end=\"1404\">\n<p data-start=\"1303\" data-end=\"1404\"><strong data-start=\"1303\" data-end=\"1323\">Quadratic cost<\/strong> with context length (&#8220;classic&#8221; attention) \u2192 high memory and computation.<\/p>\n<\/li>\n<li 
data-start=\"1405\" data-end=\"1481\">\n<p data-start=\"1407\" data-end=\"1481\"><strong data-start=\"1407\" data-end=\"1451\">High data and computation requirements<\/strong> for very large models.<\/p>\n<\/li>\n<li data-start=\"1482\" data-end=\"1647\">\n<p data-start=\"1484\" data-end=\"1647\"><strong data-start=\"1484\" data-end=\"1519\">Less local inductive bias<\/strong> than <strong data-start=\"1528\" data-end=\"1602\">Convolutional <em data-start=\"1568\" data-end=\"1599\">Neural<\/em> Networks (CNN),<\/strong> which naturally capture local patterns.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"1649\" data-end=\"1652\">\n<h2 data-start=\"1654\" data-end=\"1701\">Older models (and what they did)<\/h2>\n<h3 data-start=\"1703\" data-end=\"1726\">Sequential networks<\/h3>\n<ul data-start=\"1727\" data-end=\"2332\">\n<li data-start=\"1727\" data-end=\"1898\">\n<p data-start=\"1729\" data-end=\"1898\"><strong data-start=\"1729\" data-end=\"1797\">RNN &#8211; Recurrent Neural Networks<\/strong>: <strong data-start=\"1809\" data-end=\"1824\">word-by-word<\/strong> processing; difficulty with long-term memory<em data-start=\"1865\" data-end=\"1896\">(vanishing\/exploding gradients<\/em>).<\/p>\n<\/li>\n<li data-start=\"1899\" data-end=\"2064\">\n<p data-start=\"1901\" data-end=\"2064\"><strong data-start=\"1901\" data-end=\"1934\">LSTM &#8211; Long Short-Term Memory<\/strong>: adds <strong data-start=\"1980\" data-end=\"1990\">gates<\/strong> for better memory \u2192 long the state of the art in translation and speech.<\/p>\n<\/li>\n<li data-start=\"2065\" data-end=\"2161\">\n<p data-start=\"2067\" data-end=\"2161\"><strong data-start=\"2067\" data-end=\"2097\">GRU &#8211; Gated Recurrent Unit<\/strong>: a <strong data-start=\"2142\" data-end=\"2152\">lighter<\/strong> variant of the LSTM.<\/p>\n<\/li>\n<li data-start=\"2162\" data-end=\"2332\">\n<p data-start=\"2164\" data-end=\"2332\"><strong data-start=\"2164\" data-end=\"2213\">Seq2Seq &#8211; Sequence-to-Sequence with attention<\/strong> (Bahdanau\/Luong): the first big leap in translation;<strong data-start=\"2270\" data-end=\"2283\">attention<\/strong> is a <strong data-start=\"2293\" data-end=\"2303\">module<\/strong>, not the entire architecture.<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"2334\" data-end=\"2366\">Convolutions for language<\/h3>\n<ul data-start=\"2367\" data-end=\"2646\">\n<li data-start=\"2367\" data-end=\"2573\">\n<p data-start=\"2369\" data-end=\"2573\"><strong data-start=\"2369\" data-end=\"2453\">CNN\/ConvS2S &#8211; Convolutional Neural Networks \/ Convolutional Sequence-to-Sequence<\/strong>: <strong data-start=\"2472\" data-end=\"2486\">locally<\/strong> parallelizable, good on <strong data-start=\"2501\" data-end=\"2518\">local patterns<\/strong>, less at ease with <strong data-start=\"2544\" data-end=\"2572\">very long dependencies<\/strong>.<\/p>\n<\/li>\n<li data-start=\"2574\" data-end=\"2646\">\n<p data-start=\"2576\" data-end=\"2646\"><strong data-start=\"2577\" data-end=\"2588\">(WaveNet<\/strong> for audio: generative convolutional architecture).<\/p>\n<\/li>\n<\/ul>\n<h3 data-start=\"2648\" data-end=\"2684\">Pre-deep&#8221; statistical methods<\/h3>\n<ul data-start=\"2685\" data-end=\"3133\">\n<li data-start=\"2685\" data-end=\"2745\">\n<p data-start=\"2687\" data-end=\"2745\"><strong data-start=\"2687\" data-end=\"2708\">n-gram models<\/strong> (counting language models),<\/p>\n<\/li>\n<li data-start=\"2746\" data-end=\"2839\">\n<p data-start=\"2748\" data-end=\"2839\"><strong data-start=\"2748\" 
data-end=\"2778\">HMM &#8211; Hidden Markov Models<\/strong> for sequence labeling,<\/p>\n<\/li>\n<li data-start=\"2840\" data-end=\"2942\">\n<p data-start=\"2842\" data-end=\"2942\"><strong data-start=\"2842\" data-end=\"2877\">CRF &#8211; Conditional Random Fields<\/strong> for structured labeling,<\/p>\n<\/li>\n<li data-start=\"2943\" data-end=\"3133\">\n<p data-start=\"2945\" data-end=\"3133\"><strong data-start=\"2945\" data-end=\"2991\">PCFG &#8211; Probabilistic Context-Free Grammars<\/strong>.<br data-start=\"3033\" data-end=\"3036\">\u2192 Little <strong data-start=\"3045\" data-end=\"3073\">semantic understanding<\/strong>, strong <strong data-start=\"3081\" data-end=\"3109\"> <em data-start=\"3097\" data-end=\"3107\">feature<\/em> engineering<\/strong>, limited performance.<\/p>\n<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2480 aligncenter\" src=\"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-1024x766.png\" alt=\"\" width=\"769\" height=\"575\" srcset=\"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-1024x766.png 1024w, https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-300x224.png 300w, https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-768x574.png 768w, https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-1536x1148.png 1536w, https:\/\/palmer-consulting.com\/wp-content\/uploads\/2025\/09\/matrice_axes_transformer_vs_anciens-2048x1531.png 2048w\" sizes=\"auto, (max-width: 769px) 100vw, 769px\" \/><\/p>\n<hr data-start=\"3135\" data-end=\"3138\">\n<h2 data-start=\"3140\" data-end=\"3174\">Key differences (overview)<\/h2>\n<div class=\"_tableContainer_1rjym_1\">\n<div class=\"group _tableWrapper_1rjym_13 flex w-fit flex-col-reverse\" tabindex=\"-1\">\n<table class=\"w-fit min-w-(--thread-content-width)\" data-start=\"3176\" data-end=\"3961\">\n<thead data-start=\"3176\" data-end=\"3270\">\n<tr data-start=\"3176\" data-end=\"3270\">\n<th data-start=\"3176\" data-end=\"3188\" data-col-size=\"sm\">Dimension<\/th>\n<th data-start=\"3188\" data-end=\"3206\" data-col-size=\"sm\"><strong data-start=\"3190\" data-end=\"3205\">Transformer<\/strong><\/th>\n<th data-start=\"3206\" data-end=\"3225\" data-col-size=\"sm\"><strong data-start=\"3208\" data-end=\"3224\">RNN\/LSTM\/GRU<\/strong><\/th>\n<th data-start=\"3225\" data-end=\"3243\" data-col-size=\"sm\"><strong data-start=\"3227\" data-end=\"3242\">CNN\/ConvS2S<\/strong><\/th>\n<th data-start=\"3243\" data-end=\"3270\" data-col-size=\"sm\"><strong data-start=\"3245\" data-end=\"3268\">n-gram\/HMM\/CRF\/PCFG<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody data-start=\"3293\" data-end=\"3961\">\n<tr data-start=\"3293\" data-end=\"3428\">\n<td data-start=\"3293\" data-end=\"3310\" data-col-size=\"sm\"><strong data-start=\"3295\" data-end=\"3309\">Processing<\/strong><\/td>\n<td data-start=\"3310\" data-end=\"3339\" data-col-size=\"sm\">Parallel (self-attention)<\/td>\n<td data-start=\"3339\" data-end=\"3375\" data-col-size=\"sm\">Sequential (recurrent hidden state)<\/td>\n<td data-start=\"3375\" data-end=\"3403\" data-col-size=\"sm\">Local parallel (filters)<\/td>\n<td data-start=\"3403\" data-end=\"3428\" data-col-size=\"sm\">Counting\/statistics<\/td>\n<\/tr>\n<tr data-start=\"3429\" data-end=\"3516\">\n<td data-start=\"3429\" data-end=\"3455\" 
data-col-size=\"sm\"><strong data-start=\"3431\" data-end=\"3454\">Long outbuildings<\/strong><\/td>\n<td data-start=\"3455\" data-end=\"3469\" data-col-size=\"sm\">Excellent<\/td>\n<td data-start=\"3469\" data-end=\"3494\" data-col-size=\"sm\">Difficult (gradients)<\/td>\n<td data-start=\"3494\" data-end=\"3505\" data-col-size=\"sm\">Medium<\/td>\n<td data-start=\"3505\" data-end=\"3516\" data-col-size=\"sm\">Weak<\/td>\n<\/tr>\n<tr data-start=\"3517\" data-end=\"3603\">\n<td data-start=\"3517\" data-end=\"3546\" data-col-size=\"sm\"><strong data-start=\"3519\" data-end=\"3545\">Drive speed<\/strong><\/td>\n<td data-start=\"3546\" data-end=\"3573\" data-col-size=\"sm\">High (GPU\/TPU-friendly)<\/td>\n<td data-start=\"3573\" data-end=\"3586\" data-col-size=\"sm\">Slower<\/td>\n<td data-start=\"3586\" data-end=\"3594\" data-col-size=\"sm\">High<\/td>\n<td data-start=\"3594\" data-end=\"3603\" data-col-size=\"sm\">High<\/td>\n<\/tr>\n<tr data-start=\"3604\" data-end=\"3692\">\n<td data-start=\"3604\" data-end=\"3624\" data-col-size=\"sm\"><strong data-start=\"3606\" data-end=\"3623\">Long context<\/strong><\/td>\n<td data-start=\"3624\" data-end=\"3653\" data-col-size=\"sm\">Large window (\u2191 <em data-start=\"3643\" data-end=\"3651\">tokens<\/em>)<\/td>\n<td data-start=\"3653\" data-end=\"3662\" data-col-size=\"sm\">Limited<\/td>\n<td data-start=\"3662\" data-end=\"3677\" data-col-size=\"sm\">Limited-medium<\/td>\n<td data-start=\"3677\" data-end=\"3692\" data-col-size=\"sm\">Very limited<\/td>\n<\/tr>\n<tr data-start=\"3693\" data-end=\"3755\">\n<td data-start=\"3693\" data-end=\"3715\" data-col-size=\"sm\"><strong data-start=\"3695\" data-end=\"3714\">LLM Scalability<\/strong><\/td>\n<td data-start=\"3715\" data-end=\"3728\" data-col-size=\"sm\">Very good<\/td>\n<td data-start=\"3728\" data-end=\"3738\" data-col-size=\"sm\">Limited<\/td>\n<td data-start=\"3738\" data-end=\"3748\" data-col-size=\"sm\">Average<\/td>\n<td data-start=\"3748\" data-end=\"3755\" data-col-size=\"sm\">N\/A<\/td>\n<\/tr>\n<tr data-start=\"3756\" data-end=\"3823\">\n<td data-start=\"3756\" data-end=\"3779\" data-col-size=\"sm\"><strong data-start=\"3758\" data-end=\"3778\">Data requirements<\/strong><\/td>\n<td data-start=\"3779\" data-end=\"3789\" data-col-size=\"sm\">High<\/td>\n<td data-start=\"3789\" data-end=\"3800\" data-col-size=\"sm\">Lower<\/td>\n<td data-start=\"3800\" data-end=\"3811\" data-col-size=\"sm\">Low<\/td>\n<td data-start=\"3811\" data-end=\"3823\" data-col-size=\"sm\">Low<\/td>\n<\/tr>\n<tr data-start=\"3824\" data-end=\"3899\">\n<td data-start=\"3824\" data-end=\"3851\" data-col-size=\"sm\"><strong data-start=\"3826\" data-end=\"3850\">Memory\/compute cost<\/strong><\/td>\n<td data-start=\"3851\" data-end=\"3871\" data-col-size=\"sm\">High (attention)<\/td>\n<td data-start=\"3871\" data-end=\"3880\" data-col-size=\"sm\">Moderate<\/td>\n<td data-start=\"3880\" data-end=\"3889\" data-col-size=\"sm\">Moderate<\/td>\n<td data-start=\"3889\" data-end=\"3899\" data-col-size=\"sm\">Low<\/td>\n<\/tr>\n<tr data-start=\"3900\" data-end=\"3961\">\n<td data-start=\"3900\" data-end=\"3929\" data-col-size=\"sm\"><strong data-start=\"3902\" data-end=\"3928\">Local inductive bias<\/strong><\/td>\n<td data-start=\"3929\" data-end=\"3944\" data-col-size=\"sm\">Weaker<\/td>\n<td data-start=\"3944\" data-end=\"3948\" data-col-size=\"sm\">&#8211;<\/td>\n<td data-start=\"3948\" data-end=\"3956\" data-col-size=\"sm\">Stronger<\/td>\n<td data-start=\"3956\" data-end=\"3961\" 
data-col-size=\"sm\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<hr data-start=\"3963\" data-end=\"3966\">\n<h2 data-start=\"3968\" data-end=\"4007\">Or do we still prefer the old ones?<\/h2>\n<ul data-start=\"4008\" data-end=\"4356\">\n<li data-start=\"4008\" data-end=\"4121\">\n<p data-start=\"4010\" data-end=\"4121\"><strong data-start=\"4010\" data-end=\"4044\">Strong resource constraints<\/strong> (embedded\/edge, small datasets) \u2192 <strong data-start=\"4087\" data-end=\"4099\">GRU\/LSTM<\/strong> remain relevant.<\/p>\n<\/li>\n<li data-start=\"4122\" data-end=\"4242\">\n<p data-start=\"4124\" data-end=\"4242\"><strong data-start=\"4124\" data-end=\"4151\">Dominant local patterns<\/strong> (small sequences, regular patterns) \u2192 <strong data-start=\"4194\" data-end=\"4209\">CNN\/ConvS2S<\/strong> efficient, simple and fast.<\/p>\n<\/li>\n<li data-start=\"4243\" data-end=\"4356\">\n<p data-start=\"4245\" data-end=\"4356\"><strong data-start=\"4245\" data-end=\"4283\">Historical labeling pipelines<\/strong> (little data, need for interpretability) \u2192 <strong data-start=\"4330\" data-end=\"4341\">CRF\/HMM<\/strong> still useful.<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"5407\" data-end=\"5449\">\n","protected":false},"excerpt":{"rendered":"<p>What Transformer fundamentally changes Central mechanism:self-attention\u2192 The model &#8220;looks&#8221; at all the words in parallel and learns which relationships are important, even at long distance. Massive parallelization: no strictly sequential processing as in Recurrent Neural Networks (RNN ) \u2192 much faster training on Graphics Processing Units(GPU) and Tensor Processing Units(TPU). Long context: handles large contextual [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[78],"tags":[],"class_list":["post-4741","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Transform vs. older NLP models | Palmer<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/palmer-consulting.com\/en\/transform-vs-older-nlp-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Transform vs. older NLP models | Palmer\" \/>\n<meta property=\"og:description\" content=\"What Transformer fundamentally changes Central mechanism:self-attention\u2192 The model &#8220;looks&#8221; at all the words in parallel and learns which relationships are important, even at long distance. Massive parallelization: no strictly sequential processing as in Recurrent Neural Networks (RNN ) \u2192 much faster training on Graphics Processing Units(GPU) and Tensor Processing Units(TPU). 