There are already more than 400,000 models to choose from. Which one do you choose?
Generative AI-powered applications (such as the chatbot ChatGPT) use LLMs (“Large Language Models”) to do their magic. LLMs make it possible to instruct and interact with systems, software and platforms using natural language, a capability that builds on NLP (“Natural Language Processing”). Similarly, image generators (such as DALL-E or Stable Diffusion) create images from text prompts. Most current image generators are built on diffusion models, although GANs (“Generative Adversarial Networks”) are another well-known family of image-generating models.
The first applications to become hugely popular were chatbots. They launched in late 2022 and early 2023 and functioned as general-purpose conversational agents, revolutionizing access to knowledge and intelligence. This availability of on-demand, human-level intelligence at scale will have profound effects and be an opportunity equalizer for everybody, particularly the disadvantaged, so it is crucial that we all learn how to use this technology.
Current versions have added support for embedding content, calling functions, plug-ins, controlling output and additional modalities such as images, audio and video. Popular LLMs and GANs are:
- Original – OpenAI’s ChatGPT
- Proprietary Chatbots – OpenAI ChatGPT, Google Bard, Anthropic Claude
- Open Source – HuggingChat, Mistral AI, IBM WatsonX
- Longest context length (memory) – Anthropic’s Claude
- Fewer guardrails – xAI’s Grok (requires X Premium+ membership)
- Coding – GitHub Copilot, Amazon CodeWhisperer, Google Duet AI for Developers, JetBrains AI Assistant
- LLM Hosting / API Access – Google Vertex AI, Amazon Bedrock, Azure AI
- Image Generation – Adobe Firefly, Stable Diffusion, DALL-E 3
- Education – Khan Academy Khanmigo
- Internet Research – Perplexity AI
- Multi-vendor – Poe
- Bot Creators – Poe
Proprietary or Open Source?
Large Language Models (LLMs) and Generative Adversarial Networks (GANs) may be either proprietary or open source. A popular repository of more than 400,000 open-source models is Hugging Face. Yes, you read that right: more than 400,000!
Because training large and powerful models requires substantial resources, the most capable LLMs and GANs with consumer-facing interfaces are typically proprietary. With interest in the open-source community growing rapidly, this is likely to change over time.
I suspect that in the coming years, both will be optimized further, hardware will become more powerful, and LLMs and GANs will run on our own devices for privacy reasons. They will become swappable, so that in addition to general-purpose LLMs, domain-specific ones can be selected. Some guardrails will be part of the (curated) training data, and some will be optional.
Popular chatbots (such as ChatGPT) use proprietary LLMs that are pre-trained on a curated dataset, offer a pre-defined context length (history) and have guardrails in place to reduce the risk of the AI fabricating information (hallucinating) or revealing sensitive information. Additionally, license restrictions govern how the application (and generated content) can be used.
Open-source models can be self-hosted, either on your own device or in the cloud. The choice of LLM or GAN typically depends on the following factors:
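To make the idea of a fixed context length concrete, here is a minimal sketch of how an application might drop older messages so that a conversation still fits the window. The function names are hypothetical, and counting whitespace-separated words is only a rough stand-in for a real tokenizer, which would count differently.

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list[str], context_length: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the window."""
    kept_newest_first = []
    used = 0
    # Walk backwards so the newest messages are considered first.
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > context_length:
            break  # this message and everything older is dropped
        kept_newest_first.append(message)
        used += cost
    # Restore chronological order before returning.
    return kept_newest_first[::-1]


history = [
    "Hello, who are you?",
    "I am a helpful assistant.",
    "Summarize the history of the bicycle in one line.",
]
print(trim_history(history, context_length=15))
```

With a window of 15 "tokens", only the two most recent messages survive, which is why a chatbot can appear to forget the start of a long conversation.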
- Pre-trained (curated) dataset
- Size of the LLM or GAN (typically huge)
- Availability of Infrastructure
- Parameter size, Configuration and Response Time
- Context Length (history)
- License and Usage Restrictions
- Guardrails and Filters
- Pricing (of hardware / hosting)
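Several of these factors are linked: the parameter count and the numeric precision largely determine how much memory the hardware needs. The sketch below is a back-of-the-envelope estimate under the assumption that each parameter is stored at a fixed precision. It covers the weights only; actually running a model needs additional memory on top of this. The function name and the 7-billion-parameter example are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(parameters: float, precision: str = "float16") -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * BYTES_PER_PARAMETER[precision] / 1e9


# Example: a hypothetical model with 7 billion parameters.
for precision in BYTES_PER_PARAMETER:
    size = weight_memory_gb(7e9, precision)
    print(f"7B parameters at {precision}: {size:.1f} GB")
```

The same model shrinks from roughly 28 GB at 32-bit precision to roughly 3.5 GB at 4-bit precision, which is one reason reduced-precision variants are popular for self-hosting.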
Modalities
LLMs (Large Language Models) are very powerful. They power current AI applications and are trained and optimized for a certain use case. There are general-purpose models, called foundation models, and fine-tuned, domain-specific models. Generally, a model is optimized for one of the following modalities, though it is likely that in the future all powerful models will be multi-modal.
Language
The language modality focuses on understanding, generating, and manipulating written text. It enables computers to analyze, interpret, and generate language, facilitating effective communication and information exchange with users through natural language.
- Classification of data / Sentiment Analysis
- Summarization of data
- Extraction of information
- Text editing, completion and generation
- Translation
- Code generation
- Conversational agents / Chatbots
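With a general-purpose LLM, many of the tasks above differ mainly in the instruction that is sent to the model. The sketch below shows one way to build such instructions from templates. The template wording, the task names and the `build_prompt` helper are all hypothetical, and sending the prompt to a model is deliberately left out because that step depends on the provider.

```python
# Hypothetical prompt templates; the wording is illustrative only.
TASK_TEMPLATES = {
    "sentiment": (
        "Classify the sentiment of the following text as positive, "
        "negative or neutral.\n\nText: {text}"
    ),
    "summarize": "Summarize the following text in one sentence.\n\nText: {text}",
    "translate": (
        "Translate the following text into {target_language}.\n\nText: {text}"
    ),
}


def build_prompt(task: str, text: str, **options: str) -> str:
    """Fill in the template for the given task with the user's text."""
    template = TASK_TEMPLATES[task]
    return template.format(text=text, **options)


print(build_prompt("translate", "Good morning", target_language="Dutch"))
```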
Code
Although similar to the text modality, “Code” is so specific, and used by so many millions of people, that it deserves its own heading. Many AI-powered code assistants now exist and are integrated into IDEs through extensions. They offer tremendous benefits (Read: 11 ways Generative AI is used in Coding).
Vendor-specific code assistants are trained on the vendor’s own cloud infrastructure environment, libraries and dependencies, and offer clear advantages if you use those. Popular code assistants are:
- GitHub Copilot
- Amazon CodeWhisperer
- JetBrains AI Assistant
- Google Duet AI for Developers
- Open Source – Tabnine
Vision
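For vision-related tasks, a Generative Adversarial Network pairs a generator network with a discriminator network: the generator creates images and the discriminator judges whether they look real, which pushes the generator toward realistic results. Other model types are typically used for recognition-style vision tasks. Common tasks are: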
- Text to Image Synthesis (generate images based on textual descriptions)
- Image to Text (upload an image and receive a textual description of what’s in the image)
- Object Recognition: Identifying and classifying objects within an image.
- Facial Recognition: Recognizing faces
- Anomaly Detection: Detecting anomalies or outliers in images.
- Optical Character Recognition (OCR): Recognizing and extracting text from images, including handwritten text.
- Image Segmentation: Classifying and segmenting different regions or objects within an image.
- Image to Image Translation (e.g. summer landscape to winter landscape)
- Style Transfer (maintain content, but change drawing style)
- Image Inpainting (Filling in missing or corrupted parts of an image).
- Super Resolution / Upscaling: Enhancing the resolution and details of low-resolution images.
Content creators and marketing professionals make heavy use of “Text to Image Synthesis” for royalty-free image generation for marketing materials, “Image Inpainting” for embedding logos in marketing material and “Super Resolution” for upscaling and enhancing media assets. Currently popular image generation tools include Adobe Firefly, Stable Diffusion and DALL-E 3.
Video
While there is overlap between tasks in the vision (images) and video modalities, video introduces temporal dynamics due to the changing nature of frames over time. This temporal aspect enables additional tasks specific to the video modality, including:
- Action Recognition: Recognizing and categorizing human actions or activities in a video sequence.
- Video Summarization: Generating a concise summary or representation of a longer video sequence
- Video Object Tracking: Tracking the movement of specific objects or regions of interest across consecutive frames in a video
- Video Captioning: Generating textual descriptions or captions that describe the content of a video
- Video-based Human Pose Estimation: Estimating the 2D or 3D poses of humans within a video sequence
- Video-based Emotion Recognition: Recognizing emotions
- Video-based Event Detection: Identifying and localizing specific events or activities within a video
- Video-based Object Detection and Tracking: Detecting and tracking objects of interest within a video sequence
- Video-based Foreground Segmentation: Separating the foreground objects from the background
- Video-based Depth Estimation: Estimating depth or 3D information from video sequences
These tasks leverage the temporal information present in video data, allowing for richer analysis and understanding of events and facilitating applications in areas like surveillance, entertainment, and robotics.
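The temporal aspect can be illustrated with simple frame differencing, a classical technique rather than a generative model: comparing each frame with the previous one reveals where something changed. The sketch below assumes tiny grayscale frames stored as nested lists; the function names and the threshold are illustrative.

```python
# A grayscale frame: rows of pixel intensities in the range 0-255.
Frame = list[list[int]]


def mean_abs_difference(previous: Frame, current: Frame) -> float:
    """Average absolute per-pixel change between two frames of equal size."""
    total = 0
    count = 0
    for prev_row, cur_row in zip(previous, current):
        for prev_pixel, cur_pixel in zip(prev_row, cur_row):
            total += abs(cur_pixel - prev_pixel)
            count += 1
    return total / count


def motion_frames(frames: list[Frame], threshold: float) -> list[int]:
    """Return indices of frames that differ noticeably from their predecessor."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_difference(frames[i - 1], frames[i]) > threshold
    ]


still = [[10, 10], [10, 10]]
moved = [[10, 200], [10, 10]]  # one pixel changed brightness
video = [still, still, moved, moved]
print(motion_frames(video, threshold=5.0))  # prints [2]
```

A single image carries no such signal; it only appears when frames are compared over time, which is what distinguishes the video tasks listed above from their still-image counterparts.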
Audio / Speech
The audio modality encompasses a wide range of tasks and applications related to sound and speech. From recognizing spoken words and transcribing audio content to identifying speakers and generating music, the audio modality plays a crucial role in communication, entertainment, and information processing.
- Speech Recognition: Converting spoken language into written text (dictation, transcription)
- Speaker Recognition: Identifying and verifying the identity of a person based on their voice characteristics
- Speech-to-Speech Translation: Converting spoken language in one language into spoken language in another language, providing real-time translation capabilities
- Text-to-Speech (TTS) Conversion: Converting written text into spoken words using synthetic voices
- Audiobook Narration: Generating high-quality speech for narrating audiobooks or other long-form textual content
- Voice Cloning: Creating a synthetic voice that mimics the speech characteristics of a specific individual.
- Emotion-based Speech Synthesis: Modifying synthesized speech to convey specific emotional states
- Accent Conversion: Transforming the accent or dialect of synthesized speech to match a desired accent or regional variation.
- Multilingual Speech Synthesis: Generating speech in multiple languages using a single speech synthesis model
- Expressive Speech Synthesis: Adding expressiveness, emphasis, or prosodic variations to synthesized speech to make it sound more natural and engaging
- Music Generation: Creating new musical compositions
- Music Classification: Categorizing and classifying music based on genre, mood, instruments, etc.
- Audio Event Detection: Detecting and classifying specific sounds (alarms, footsteps, etc.)
- Audio Source Separation: Separating individual sound sources from an audio mixture.
- Audio Sentiment Analysis: Analyzing and determining the emotional content or sentiment expressed in audio recordings
- Audio Super Resolution: Enhancing the quality and fidelity of audio signals by increasing their resolution or restoring missing details
Multi-Modal
Multi-modal models combine different types of information from language/text, vision, video, and audio to mimic how humans learn and interact with the world. Just as we humans rely on multiple senses and forms of communication to understand and engage with others, multi-modal approaches allow computers and technology to do the same. By integrating various modalities, we can achieve a more comprehensive understanding of information and create more natural and immersive experiences.