{"id":4861,"date":"2025-10-19T21:03:24","date_gmt":"2025-10-19T21:03:24","guid":{"rendered":"https:\/\/palmer-consulting.com\/multimodal-ai\/"},"modified":"2026-04-16T15:09:09","modified_gmt":"2026-04-16T15:09:09","slug":"multimodal-ai","status":"publish","type":"post","link":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/","title":{"rendered":"Multimodal AI"},"content":{"rendered":"<h1 data-start=\"354\" data-end=\"446\"><strong data-start=\"356\" data-end=\"446\">Multimodal AI: when artificial intelligence sees, listens and understands as we do<\/strong><\/h1>\n<p data-start=\"448\" data-end=\"944\">Artificial intelligence (AI) has entered a new era. After dominating the processing of text or images in isolation, modern systems are opening up to the understanding of several types of information simultaneously: text, sound, image, video, even sensor signals. This convergence has a name: <strong data-start=\"774\" data-end=\"794\">multimodal AI<\/strong>.<br data-start=\"795\" data-end=\"798\">It symbolizes a giant step towards a more natural, more human and more useful AI, capable of interpreting the world in the same way as we do.  <\/p>\n<hr data-start=\"946\" data-end=\"949\">\n<h2 data-start=\"951\" data-end=\"993\"><strong data-start=\"954\" data-end=\"993\">1. 
What is multimodal AI?<\/strong><\/h2>\n<p data-start=\"995\" data-end=\"1326\">A <strong data-start=\"999\" data-end=\"1011\">modality<\/strong> is a type of data that a system perceives or processes: text, image, audio, video, or sensor data.<br data-start=\"1123\" data-end=\"1126\">Until recently, most AI models were <strong data-start=\"1170\" data-end=\"1183\">unimodal<\/strong>: a language model handled only text, a vision model handled only images, a speech model handled only sound.<\/p>\n<p data-start=\"1328\" data-end=\"1600\">Multimodal AI, on the other hand, is capable of <strong data-start=\"1367\" data-end=\"1438\">understanding, integrating and producing several modalities<\/strong> at <strong data-start=\"1367\" data-end=\"1438\">once<\/strong>.<br data-start=\"1439\" data-end=\"1442\">In other words, it can read a text, analyze an image, listen to a sound and cross-reference this information to produce a more complete and coherent response.<\/p>\n<p data-start=\"1602\" data-end=\"1846\">For example, a multimodal assistant can look at a photo of a dish, read a recipe, listen to a spoken instruction and then explain how to reproduce the dish. This ability to merge perceptions is at the heart of the concept of multimodality. <\/p>\n<hr data-start=\"1848\" data-end=\"1851\">\n<h2 data-start=\"1853\" data-end=\"1902\"><strong data-start=\"1856\" data-end=\"1902\">2. 
How does multimodal AI work?<\/strong><\/h2>\n<p data-start=\"1904\" data-end=\"2043\">Multimodal architectures combine several specialized subsystems, called <strong data-start=\"1990\" data-end=\"2003\">encoders<\/strong>, each designed for a specific type of data.<\/p>\n<ul data-start=\"2044\" data-end=\"2247\">\n<li data-start=\"2044\" data-end=\"2112\">\n<p data-start=\"2046\" data-end=\"2112\">A text encoder transforms words into numerical vectors.<\/p>\n<\/li>\n<li data-start=\"2113\" data-end=\"2187\">\n<p data-start=\"2115\" data-end=\"2187\">An image encoder converts pixels into visual representations.<\/p>\n<\/li>\n<li data-start=\"2188\" data-end=\"2247\">\n<p data-start=\"2190\" data-end=\"2247\">An audio encoder extracts the sound characteristics.<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"2249\" data-end=\"2565\">These representations are then <strong data-start=\"2282\" data-end=\"2318\">merged into a common space<\/strong>, where the model learns to establish links between the different types of information. This alignment stage is crucial: it enables the model to understand that a word, image or sound can refer to the same entity or concept. 
<\/p>\n<p data-start=\"2567\" data-end=\"2649\">Once this fusion has been achieved, the AI can perform complex tasks such as:<\/p>\n<ul data-start=\"2650\" data-end=\"2827\">\n<li data-start=\"2650\" data-end=\"2686\">\n<p data-start=\"2652\" data-end=\"2686\">Describing a picture in words.<\/p>\n<\/li>\n<li data-start=\"2687\" data-end=\"2736\">\n<p data-start=\"2689\" data-end=\"2736\">Answering a question about a photo.<\/p>\n<\/li>\n<li data-start=\"2737\" data-end=\"2779\">\n<p data-start=\"2739\" data-end=\"2779\">Generating an image from text.<\/p>\n<\/li>\n<li data-start=\"2780\" data-end=\"2827\">\n<p data-start=\"2782\" data-end=\"2827\">Understanding a video and producing a summary.<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"2829\" data-end=\"2970\"><strong data-start=\"2832\" data-end=\"2858\">Multimodal generation<\/strong> can even switch from one modality to another, for example transforming text into an image or speech into text.<\/p>\n<hr data-start=\"2972\" data-end=\"2975\">\n<h2 data-start=\"2977\" data-end=\"3020\"><strong data-start=\"2980\" data-end=\"3020\">3. The benefits of multimodal AI<\/strong><\/h2>\n<h3 data-start=\"3022\" data-end=\"3082\"><strong data-start=\"3026\" data-end=\"3080\">A more human-like understanding<\/strong><\/h3>\n<p data-start=\"3083\" data-end=\"3389\">Multimodal AI mimics our natural way of perceiving. By combining sight, hearing and language, it can better grasp the overall context of a situation.<br data-start=\"3246\" data-end=\"3249\">Where an isolated text may be ambiguous, or an image insufficient, the combination of the two often gives a finer, more reliable interpretation. 
<\/p>\n<h3 data-start=\"3391\" data-end=\"3454\"><strong data-start=\"3395\" data-end=\"3452\">Enhanced performance and superior robustness<\/strong><\/h3>\n<p data-start=\"3455\" data-end=\"3820\">A multimodal model is often more accurate, as it can <strong data-start=\"3513\" data-end=\"3570\">compensate for the weaknesses of one modality with another<\/strong>.<br data-start=\"3571\" data-end=\"3574\">If an image is blurred, the associated text helps to understand it. If the text is incomplete, the video provides the missing clues.<br data-start=\"3703\" data-end=\"3706\">This makes these systems particularly effective in real-life environments, which are often noisy or imperfect. <\/p>\n<h3 data-start=\"3822\" data-end=\"3886\"><strong data-start=\"3826\" data-end=\"3884\">More natural interactions with users<\/strong><\/h3>\n<p data-start=\"3887\" data-end=\"4193\">One of the greatest benefits of multimodality is the <strong data-start=\"3943\" data-end=\"3969\">fluidity of interaction<\/strong>.<br data-start=\"3970\" data-end=\"3973\">The user can speak, show, write, point &#8211; and the AI understands it all.<br data-start=\"4055\" data-end=\"4058\">This approach makes virtual assistants, robots and AI interfaces much more intuitive and closer to human behavior.<\/p>\n<h3 data-start=\"4195\" data-end=\"4235\"><strong data-start=\"4199\" data-end=\"4233\">Versatile use<\/strong><\/h3>\n<p data-start=\"4236\" data-end=\"4563\">Multimodal models are <strong data-start=\"4265\" data-end=\"4281\">cross-disciplinary<\/strong>: they apply to healthcare, robotics, security, design, education, marketing and even autonomous driving.<br data-start=\"4420\" data-end=\"4423\">They are no longer limited to a single domain, but can be adapted to different contexts thanks to their sensory integration capacity.<\/p>\n<hr data-start=\"4565\" data-end=\"4568\">\n<h2 data-start=\"4570\" data-end=\"4606\"><strong data-start=\"4573\" data-end=\"4606\">4. 
Main use cases<\/strong><\/h2>\n<h3 data-start=\"4608\" data-end=\"4645\"><strong data-start=\"4612\" data-end=\"4643\">Health and medical diagnostics<\/strong><\/h3>\n<p data-start=\"4646\" data-end=\"4830\">Multimodal AI can combine medical images (MRI, CT) with physician reports and patient data to produce a more accurate, personalized analysis.<\/p>\n<h3 data-start=\"4832\" data-end=\"4863\"><strong data-start=\"4836\" data-end=\"4861\">Search and security<\/strong><\/h3>\n<p data-start=\"4864\" data-end=\"5070\">When searching for images or videos, multimodal AI can interpret a natural-language query such as: &#8220;Show me all the videos of a person wearing a red hard hat on a building site&#8221;.<\/p>\n<h3 data-start=\"5072\" data-end=\"5114\"><strong data-start=\"5076\" data-end=\"5112\">Robotics and autonomous vehicles<\/strong><\/h3>\n<p data-start=\"5115\" data-end=\"5322\">Intelligent cars and robots use multiple sensory streams: cameras, radar, microphones, GPS. Multimodal AI fuses this data so that the system can understand its environment and act in real time. <\/p>\n<h3 data-start=\"5324\" data-end=\"5360\"><strong data-start=\"5328\" data-end=\"5358\">Customer service and sales<\/strong><\/h3>\n<p data-start=\"5361\" data-end=\"5537\">A multimodal chatbot can interpret a photo of a damaged product, read the user&#8217;s complaint and respond in a contextualized way, combining vision and text.<\/p>\n<h3 data-start=\"5539\" data-end=\"5575\"><strong data-start=\"5543\" data-end=\"5573\">Creation and entertainment<\/strong><\/h3>\n<p data-start=\"5576\" data-end=\"5788\">Models capable of switching from text to image or from sound to video are revolutionizing artistic creation, advertising and film. They enable multimedia content to be generated from a simple idea. <\/p>\n<hr data-start=\"5790\" data-end=\"5793\">\n<h2 data-start=\"5795\" data-end=\"5834\"><strong data-start=\"5798\" data-end=\"5834\">5. 
The challenges of multimodality<\/strong><\/h2>\n<p data-start=\"5836\" data-end=\"5934\">Despite its potential, multimodal AI poses many technical, ethical and economic challenges.<\/p>\n<h3 data-start=\"5936\" data-end=\"5970\"><strong data-start=\"5940\" data-end=\"5968\">Technological complexity<\/strong><\/h3>\n<p data-start=\"5971\" data-end=\"6151\">Merging multiple modalities requires more sophisticated architectures, large amounts of aligned data and tight synchronization between information flows.<\/p>\n<h3 data-start=\"6153\" data-end=\"6200\"><strong data-start=\"6157\" data-end=\"6198\">Massive need for data and computation<\/strong><\/h3>\n<p data-start=\"6201\" data-end=\"6405\">Training a multimodal model requires millions of examples combining text, image and sound.<br data-start=\"6291\" data-end=\"6294\">These datasets are expensive to produce and clean, and training on them requires considerable computing power.<\/p>\n<h3 data-start=\"6407\" data-end=\"6455\"><strong data-start=\"6411\" data-end=\"6453\">Alignment and consistency problems<\/strong><\/h3>\n<p data-start=\"6456\" data-end=\"6649\">Ensuring that the model correctly understands the correspondence between text and image (for example, that &#8220;a dog&#8221; corresponds to the dog depicted in the image) remains a major challenge.<\/p>\n<h3 data-start=\"6651\" data-end=\"6697\"><strong data-start=\"6655\" data-end=\"6695\">Ethical and governance issues<\/strong><\/h3>\n<p data-start=\"6698\" data-end=\"6955\">Multimodal models often process personal data: faces, voices, documents.<br data-start=\"6793\" data-end=\"6796\">This raises issues of privacy, bias and liability.<br data-start=\"6873\" data-end=\"6876\">Clear governance and control mechanisms are therefore essential.<\/p>\n<h3 data-start=\"6957\" data-end=\"6988\"><strong data-start=\"6961\" data-end=\"6986\">Limited explainability<\/strong><\/h3>\n<p data-start=\"6989\" data-end=\"7212\">As with large language models, multimodality makes 
the explanation of model decisions even more complex.<br data-start=\"7112\" data-end=\"7115\">It is difficult to trace why a model produced a particular interpretation or image.<\/p>\n<hr data-start=\"7214\" data-end=\"7217\">\n<h2 data-start=\"7219\" data-end=\"7272\"><strong data-start=\"7222\" data-end=\"7272\">6. Comparison: unimodal AI vs. multimodal AI<\/strong><\/h2>\n<div class=\"_tableContainer_1rjym_1\">\n<div class=\"group _tableWrapper_1rjym_13 flex w-fit flex-col-reverse\" tabindex=\"-1\">\n<table class=\"w-fit min-w-(--thread-content-width)\" data-start=\"7274\" data-end=\"7925\">\n<thead data-start=\"7274\" data-end=\"7329\">\n<tr data-start=\"7274\" data-end=\"7329\">\n<th data-start=\"7274\" data-end=\"7288\" data-col-size=\"sm\"><strong data-start=\"7276\" data-end=\"7287\">Criterion<\/strong><\/th>\n<th data-start=\"7288\" data-end=\"7307\" data-col-size=\"sm\"><strong data-start=\"7290\" data-end=\"7306\">Unimodal AI<\/strong><\/th>\n<th data-start=\"7307\" data-end=\"7329\" data-col-size=\"md\"><strong data-start=\"7309\" data-end=\"7327\">Multimodal AI<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody data-start=\"7387\" data-end=\"7925\">\n<tr data-start=\"7387\" data-end=\"7505\">\n<td data-start=\"7387\" data-end=\"7414\" data-col-size=\"sm\">Type of data processed<\/td>\n<td data-col-size=\"sm\" data-start=\"7414\" data-end=\"7455\">Single modality (text, image or sound)<\/td>\n<td data-col-size=\"md\" data-start=\"7455\" data-end=\"7505\">Several modalities (text, image, sound, video)<\/td>\n<\/tr>\n<tr data-start=\"7506\" data-end=\"7572\">\n<td data-start=\"7506\" data-end=\"7534\" data-col-size=\"sm\">Context understanding<\/td>\n<td data-col-size=\"sm\" data-start=\"7534\" data-end=\"7544\">Limited<\/td>\n<td data-col-size=\"md\" data-start=\"7544\" data-end=\"7572\">Deep and contextual<\/td>\n<\/tr>\n<tr data-start=\"7573\" data-end=\"7653\">\n<td data-start=\"7573\" data-end=\"7586\" data-col-size=\"sm\">Robustness<\/td>\n<td 
data-col-size=\"sm\" data-start=\"7586\" data-end=\"7609\">Low when inputs are noisy<\/td>\n<td data-col-size=\"md\" data-start=\"7609\" data-end=\"7653\">Higher, thanks to redundancy across sources<\/td>\n<\/tr>\n<tr data-start=\"7654\" data-end=\"7742\">\n<td data-start=\"7654\" data-end=\"7680\" data-col-size=\"sm\">User interaction<\/td>\n<td data-col-size=\"sm\" data-start=\"7680\" data-end=\"7717\">Restricted to a single input mode<\/td>\n<td data-col-size=\"md\" data-start=\"7717\" data-end=\"7742\">Natural, across several input modes<\/td>\n<\/tr>\n<tr data-start=\"7743\" data-end=\"7791\">\n<td data-start=\"7743\" data-end=\"7766\" data-col-size=\"sm\">Technical complexity<\/td>\n<td data-col-size=\"sm\" data-start=\"7766\" data-end=\"7776\">Medium<\/td>\n<td data-col-size=\"md\" data-start=\"7776\" data-end=\"7791\">Very high<\/td>\n<\/tr>\n<tr data-start=\"7792\" data-end=\"7839\">\n<td data-start=\"7792\" data-end=\"7812\" data-col-size=\"sm\">Data requirements<\/td>\n<td data-col-size=\"sm\" data-start=\"7812\" data-end=\"7821\">Moderate<\/td>\n<td data-col-size=\"md\" data-start=\"7821\" data-end=\"7839\">Very high<\/td>\n<\/tr>\n<tr data-start=\"7840\" data-end=\"7878\">\n<td data-start=\"7840\" data-end=\"7854\" data-col-size=\"sm\">Versatility<\/td>\n<td data-col-size=\"sm\" data-start=\"7854\" data-end=\"7864\">Limited<\/td>\n<td data-col-size=\"md\" data-start=\"7864\" data-end=\"7878\">Broad<\/td>\n<\/tr>\n<tr data-start=\"7879\" data-end=\"7925\">\n<td data-start=\"7879\" data-end=\"7894\" data-col-size=\"sm\">Applications<\/td>\n<td data-col-size=\"sm\" data-start=\"7894\" data-end=\"7908\">Specific<\/td>\n<td data-col-size=\"md\" data-start=\"7908\" data-end=\"7925\">Cross-domain<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p data-start=\"7927\" data-end=\"8094\">This comparison suggests that multimodal AI is a logical next step in the evolution of artificial intelligence, at the cost of increased complexity.<\/p>\n<hr data-start=\"8096\" 
data-end=\"8099\">\n<h2 data-start=\"8101\" data-end=\"8152\"><strong data-start=\"8104\" data-end=\"8152\">7. Why multimodal AI is strategic<\/strong><\/h2>\n<h3 data-start=\"8154\" data-end=\"8189\"><strong data-start=\"8158\" data-end=\"8187\">Towards more general AI<\/strong><\/h3>\n<p data-start=\"8190\" data-end=\"8436\">Multimodality is often seen as a key step towards what is known as <strong data-start=\"8252\" data-end=\"8296\">artificial general intelligence (AGI)<\/strong>.<br data-start=\"8297\" data-end=\"8300\">A system capable of perceiving, understanding and acting across multiple types of data comes closer to human cognitive functioning.<\/p>\n<h3 data-start=\"8438\" data-end=\"8491\"><strong data-start=\"8442\" data-end=\"8489\">A lever for innovation<\/strong><\/h3>\n<p data-start=\"8492\" data-end=\"8798\">Companies can exploit multimodality to create richer experiences: combined data analysis, immersive marketing, interactive assistants, autonomous production robots.<br data-start=\"8688\" data-end=\"8691\">It can be a major competitive advantage for players capable of integrating it into their processes.<\/p>\n<h3 data-start=\"8800\" data-end=\"8848\"><strong data-start=\"8804\" data-end=\"8846\">The challenge of technological sovereignty<\/strong><\/h3>\n<p data-start=\"8849\" data-end=\"9184\">Mastering multimodality means mastering future human-machine interfaces.<br data-start=\"8930\" data-end=\"8933\">The major technological powers are investing massively in this field to avoid dependence on foreign systems.<br data-start=\"9064\" data-end=\"9067\">Europe, and France in particular, are seeking to catch up by developing their own multimodal models.<\/p>\n<h3 data-start=\"9186\" data-end=\"9228\"><strong data-start=\"9190\" data-end=\"9226\">A step forward for accessibility<\/strong><\/h3>\n<p data-start=\"9229\" data-end=\"9325\">Multimodal AI opens up new possibilities for people with disabilities:<\/p>\n<ul data-start=\"9326\" data-end=\"9550\">\n<li 
data-start=\"9326\" data-end=\"9368\">\n<p data-start=\"9328\" data-end=\"9368\">Reading images for the visually impaired.<\/p>\n<\/li>\n<li data-start=\"9369\" data-end=\"9403\">\n<p data-start=\"9371\" data-end=\"9403\">Instant voice translation.<\/p>\n<\/li>\n<li data-start=\"9404\" data-end=\"9550\">\n<p data-start=\"9406\" data-end=\"9550\">Gestural and visual interaction for the hearing impaired.<br data-start=\"9474\" data-end=\"9477\">It brings technology and people closer together in the most inclusive sense.<\/p>\n<\/li>\n<\/ul>\n<hr data-start=\"9552\" data-end=\"9555\">\n<h2 data-start=\"9557\" data-end=\"9592\"><strong data-start=\"9560\" data-end=\"9592\">8. Future prospects<\/strong><\/h2>\n<p data-start=\"9594\" data-end=\"9777\">The current evolution of multimodal models is moving towards an even deeper integration between perception, reasoning and action.<br data-start=\"9732\" data-end=\"9735\">Several strong trends are emerging:<\/p>\n<ul data-start=\"9779\" data-end=\"10474\">\n<li data-start=\"9779\" data-end=\"9906\">\n<p data-start=\"9781\" data-end=\"9906\"><strong data-start=\"9781\" data-end=\"9809\">Giant foundation models<\/strong> capable of processing text, image, sound, video and actions in a single representation space.<\/p>\n<\/li>\n<li data-start=\"9907\" data-end=\"10069\">\n<p data-start=\"9909\" data-end=\"10069\"><strong data-start=\"9909\" data-end=\"9925\">Embedded AI<\/strong>: miniaturization and deployment of multimodal models on mobile devices or connected objects, for local and private processing.<\/p>\n<\/li>\n<li data-start=\"10070\" data-end=\"10234\">\n<p data-start=\"10072\" data-end=\"10234\"><strong data-start=\"10072\" data-end=\"10094\">Multimodal agents<\/strong>: assistants capable not only of understanding, but also of actively interacting with their environment (speech, movement, vision).<\/p>\n<\/li>\n<li data-start=\"10235\" data-end=\"10351\">\n<p data-start=\"10237\" data-end=\"10351\"><strong 
data-start=\"10237\" data-end=\"10266\">Content automation<\/strong>: create videos, podcasts and visuals from a simple text prompt.<\/p>\n<\/li>\n<li data-start=\"10352\" data-end=\"10474\">\n<p data-start=\"10354\" data-end=\"10474\"><strong data-start=\"10354\" data-end=\"10379\">Regulation and ethics<\/strong>: developing legal frameworks to guarantee transparency and control over usage.<\/p>\n<\/li>\n<\/ul>\n<p data-start=\"10476\" data-end=\"10609\">These developments herald a fusion between the fields of vision, language and robotics, towards a truly cognitive AI.<\/p>\n<hr data-start=\"10611\" data-end=\"10614\">\n<h2 data-start=\"10616\" data-end=\"10665\"><strong data-start=\"10619\" data-end=\"10665\">9. Conclusion: a sensory revolution<\/strong><\/h2>\n<p data-start=\"10667\" data-end=\"11056\"><strong data-start=\"10669\" data-end=\"10687\">Multimodal AI<\/strong> doesn&#8217;t just improve technical performance: it profoundly changes the nature of interaction between man and machine.<br data-start=\"10829\" data-end=\"10832\">By integrating text, image, sound and video, it enables artificial intelligence to achieve a holistic understanding of the world and create more natural, relevant and powerful experiences.<\/p>\n<p data-start=\"11058\" data-end=\"11300\">This approach opens up a new chapter for innovation, productivity and creativity.<br data-start=\"11151\" data-end=\"11154\">But it also imposes new responsibilities: protecting privacy, guaranteeing transparency and mastering technological complexity.<\/p>\n<p data-start=\"11302\" data-end=\"11501\">Multimodal AI is not just a technical evolution. It&#8217;s a <strong data-start=\"11374\" data-end=\"11400\">sensory revolution<\/strong>, redefining the way we conceive, use and live with artificial intelligence. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Multimodal AI: when artificial intelligence sees, listens and understands as we do Artificial intelligence (AI) has entered a new era. 
After dominating the processing of text or images in isolation, modern systems are opening up to the understanding of several types of information simultaneously: text, sound, image, video, even sensor signals. This convergence has a [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[78],"tags":[],"class_list":["post-4861","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Multimodal AI | Palmer<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Multimodal AI | Palmer\" \/>\n<meta property=\"og:description\" content=\"Multimodal AI: when artificial intelligence sees, listens and understands as we do Artificial intelligence (AI) has entered a new era. After dominating the processing of text or images in isolation, modern systems are opening up to the understanding of several types of information simultaneously: text, sound, image, video, even sensor signals. 
This convergence has a [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/\" \/>\n<meta property=\"og:site_name\" content=\"Palmer\" \/>\n<meta property=\"article:published_time\" content=\"2025-10-19T21:03:24+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-04-16T15:09:09+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/09\/social-graph-palmer.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"675\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Laurent Zennadi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Laurent Zennadi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/\"},\"author\":{\"name\":\"Laurent Zennadi\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/person\\\/7ea52877fd35814d1d2f8e6e03daa3ed\"},\"headline\":\"Multimodal AI\",\"datePublished\":\"2025-10-19T21:03:24+00:00\",\"dateModified\":\"2026-04-16T15:09:09+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/\"},\"wordCount\":1370,\"publisher\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\"},\"articleSection\":[\"Artificial 
intelligence\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/\",\"name\":\"Multimodal AI | Palmer\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#website\"},\"datePublished\":\"2025-10-19T21:03:24+00:00\",\"dateModified\":\"2026-04-16T15:09:09+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/multimodal-ai\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Multimodal AI\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/\",\"name\":\"Palmer\",\"description\":\"Evolve at the speed of 
change\",\"publisher\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#organization\",\"name\":\"Palmer\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/palmer-consulting.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Palmer_Logo_Full_PenBlue_1x1-2.jpg\",\"contentUrl\":\"https:\\\/\\\/palmer-consulting.com\\\/wp-content\\\/uploads\\\/2023\\\/08\\\/Palmer_Logo_Full_PenBlue_1x1-2.jpg\",\"width\":480,\"height\":480,\"caption\":\"Palmer\"},\"image\":{\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/company\\\/palmer-consulting\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/palmer-consulting.com\\\/en\\\/#\\\/schema\\\/person\\\/7ea52877fd35814d1d2f8e6e03daa3ed\",\"name\":\"Laurent Zennadi\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g\",\"caption\":\"Laurent Zennadi\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Multimodal AI | Palmer","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/","og_locale":"en_US","og_type":"article","og_title":"Multimodal AI | Palmer","og_description":"Multimodal AI: when artificial intelligence sees, listens and understands as we do Artificial intelligence (AI) has entered a new era. After dominating the processing of text or images in isolation, modern systems are opening up to the understanding of several types of information simultaneously: text, sound, image, video, even sensor signals. This convergence has a [&hellip;]","og_url":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/","og_site_name":"Palmer","article_published_time":"2025-10-19T21:03:24+00:00","article_modified_time":"2026-04-16T15:09:09+00:00","og_image":[{"width":1200,"height":675,"url":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/09\/social-graph-palmer.png","type":"image\/png"}],"author":"Laurent Zennadi","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Laurent Zennadi","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/#article","isPartOf":{"@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/"},"author":{"name":"Laurent Zennadi","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/person\/7ea52877fd35814d1d2f8e6e03daa3ed"},"headline":"Multimodal AI","datePublished":"2025-10-19T21:03:24+00:00","dateModified":"2026-04-16T15:09:09+00:00","mainEntityOfPage":{"@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/"},"wordCount":1370,"publisher":{"@id":"https:\/\/palmer-consulting.com\/en\/#organization"},"articleSection":["Artificial intelligence"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/","url":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/","name":"Multimodal AI | Palmer","isPartOf":{"@id":"https:\/\/palmer-consulting.com\/en\/#website"},"datePublished":"2025-10-19T21:03:24+00:00","dateModified":"2026-04-16T15:09:09+00:00","breadcrumb":{"@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/palmer-consulting.com\/en\/multimodal-ai\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/palmer-consulting.com\/en\/multimodal-ai\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/palmer-consulting.com\/en\/home\/"},{"@type":"ListItem","position":2,"name":"Multimodal AI"}]},{"@type":"WebSite","@id":"https:\/\/palmer-consulting.com\/en\/#website","url":"https:\/\/palmer-consulting.com\/en\/","name":"Palmer","description":"Evolve at the speed of 
change","publisher":{"@id":"https:\/\/palmer-consulting.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/palmer-consulting.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/palmer-consulting.com\/en\/#organization","name":"Palmer","url":"https:\/\/palmer-consulting.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/08\/Palmer_Logo_Full_PenBlue_1x1-2.jpg","contentUrl":"https:\/\/palmer-consulting.com\/wp-content\/uploads\/2023\/08\/Palmer_Logo_Full_PenBlue_1x1-2.jpg","width":480,"height":480,"caption":"Palmer"},"image":{"@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.linkedin.com\/company\/palmer-consulting\/"]},{"@type":"Person","@id":"https:\/\/palmer-consulting.com\/en\/#\/schema\/person\/7ea52877fd35814d1d2f8e6e03daa3ed","name":"Laurent Zennadi","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/110e8a99f01ca2c88c3d23656103640dc17e08eac86e26d0617937a6846b4007?s=96&d=mm&r=g","caption":"Laurent 
Zennadi"}}]}},"_links":{"self":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts\/4861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/comments?post=4861"}],"version-history":[{"count":1,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts\/4861\/revisions"}],"predecessor-version":[{"id":6424,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/posts\/4861\/revisions\/6424"}],"wp:attachment":[{"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/media?parent=4861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/categories?post=4861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/palmer-consulting.com\/en\/wp-json\/wp\/v2\/tags?post=4861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}