SAM Sees Everything

Meta's SAM 2 revolutionizes object segmentation, OpenAI enhances ChatGPT with voice and expands GPT-4o token output, while Runway's Gen-3 Alpha transforms video generation.

It has been an interesting week for machine learning across vision, image, audio, and video.

Meta takes center stage with SAM 2, a leap forward in real-time object segmentation that could reshape everything from creative industries to autonomous systems. OpenAI continues to push the envelope, not only by giving voice to ChatGPT but also by extending GPT-4o's maximum output to 64,000 tokens. Meanwhile, Runway's Gen-3 Alpha is painting a vivid picture of the future of video generation, blending artistry with cutting-edge AI.

As we unpack these releases, we'll explore how they could reshape our digital experiences, unlock new creative possibilities, and move the AI landscape forward.

Stuff you should know

Meta Unveils SAM 2 For Real-Time Object Segmentation

Meta has launched the Segment Anything Model 2 (SAM 2), an AI model designed for real-time object segmentation in images and videos. Users can segment any object with a prompt such as a few clicks, a bounding box, or a mask, and Meta reports better efficiency and accuracy than its predecessor. SAM 2 features zero-shot generalization, enabling it to adapt to new visual domains without additional training, which makes it versatile for applications in creative industries, scientific research, and autonomous systems. SAM 2 also includes a memory mechanism that tracks objects across video frames, improving its performance in dynamic scenes. Meta has also released the SA-V dataset, containing around 51,000 real-world videos and over 600,000 spatio-temporal masks, to support developers in training and testing segmentation models. By open-sourcing SAM 2 under an Apache 2.0 license, Meta aims to foster innovation and collaboration within the AI community.

OpenAI Introduces ChatGPT's Advanced Voice Mode

OpenAI has begun rolling out its advanced voice mode for ChatGPT, which had been delayed from its originally planned June release. The feature is currently available to a small group of ChatGPT Plus subscribers. The voice, which has drawn comparisons to Scarlett Johansson's character in the film "Her," was tested extensively to improve the model's ability to detect and refuse certain types of content. OpenAI also implemented new filters to prevent the generation of copyrighted audio, addressing concerns about safety and content integrity raised by external testers during development.

The advanced voice mode aims to make interaction with ChatGPT feel more natural and engaging. Despite the excitement surrounding its release, the company faced scrutiny over the voice's resemblance to Johansson, leading to inquiries about how it was created. OpenAI's careful approach to this rollout reflects its commitment to safety and compliance with content regulations while enhancing the capabilities of its AI products.

Runway's Latest Image To Video Generation Model

Gen-3 Alpha is Runway's latest foundation model for video generation, showcasing significant advancements over its predecessor, Gen-2. It utilizes a new infrastructure for large-scale multimodal training, resulting in enhanced fidelity, consistency, and motion. The model supports various creative tools, including Text to Video, Image to Video, and Text to Image, while also improving features like Motion Brush and Advanced Camera Controls. A standout feature is its ability to generate photorealistic human characters with diverse actions and emotions, facilitating richer storytelling.

Additionally, Gen-3 Alpha incorporates new safeguards, including an advanced visual moderation system and compliance with C2PA provenance standards, ensuring content integrity. Developed through collaboration among artists, engineers, and researchers, this model aims to meet the artistic and narrative needs of various industries with customizable options, making it a powerful tool for creators.

GPT-4o Now Supports 64,000-Token Outputs

OpenAI has introduced an experimental version of GPT-4o that supports a maximum output of 64,000 tokens per request. The capability is aimed at use cases that require longer text completions, such as detailed reports, extensive narratives, and complex data analysis. Participants in the alpha program can access it by using the model named 'gpt-4o-64k-output-alpha'. The longer completions come at a higher cost, however, with pricing set at $6.00 per million input tokens and $18.00 per million output tokens. This pricing reflects the increased computational resources required to generate such long outputs. OpenAI hopes the extended output capability will enhance user experiences and broaden the applications of its AI models.
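To make the pricing concrete, here is a minimal sketch of the cost arithmetic, assuming only the two per-million-token prices quoted above. The function name and the example token counts are hypothetical, chosen purely for illustration, and this is not an official billing calculator.

```python
# Prices quoted above for the alpha model, in USD per million tokens.
INPUT_PRICE_PER_MILLION = 6.00
OUTPUT_PRICE_PER_MILLION = 18.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: a 2,000-token prompt with a full 64,000-token completion.
cost = estimate_cost(2_000, 64_000)
print(f"${cost:.2f}")  # prints "$1.16"
```

At these rates, a single maximum-length completion costs a little over a dollar, almost all of it from the output tokens.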

Hand Picked Video

In today's video, we dive into Gen-3 Alpha’s new Image to Video feature.

Top AI Products from this week 

  • Olly - Amplify your social presence in days, not months

  • GitStart AI Ticket Studio - Write engineering-ready tickets with ease. Ticket Studio’s AI understands your codebase and gathers requirements to create well-scoped tickets within minutes.

  • Jamie - Get human-quality meeting summaries after each meeting. Thanks to the no-bot approach, Jamie works with any meeting platform and even for offline conversations.

  • Outlit - We help sales teams close more deals by accelerating their contract review process with our AI agents. Outlit can identify risk, negotiate prices, and check compliance so sales teams can close deals without waiting on legal.

  • Last24.ai - Last24 is an AI search engine that helps you understand today's news fast. It searches the internet, picks the important news you need, and summarizes the key points in a beautiful mindmap.

  • Team-U - Connect with startup enthusiasts through spontaneous live interactions. Find mentors, talents, and partners. Share ideas, seek advice. Expand your network, uncover opportunities. Secure platform for spontaneous business networking and unexpected connections.

  • CopyFrog.AI - Transform your content strategy with CopyFrog AI! Generate high-quality, engaging content tailored to your audience. Enjoy text creation, image generation, ad crafting, and more under one subscription. Experience the future of content creation today!

  • JotMe - Meetings with language barriers require extensive follow-ups for non-native speakers. JotMe solves this by translating 77 languages in real-time and providing meeting notes in 10 languages, ensuring everyone understands their next steps and key points.

This week in AI

  • Shutterstock & Getty Images Unveil AI Upgrades - Shutterstock launches Generative 3D beta, allowing quick 3D asset creation from prompts. Getty Images doubles AI image generation speed, adds advanced controls and fine-tuning, all powered by NVIDIA's Edify multimodal generative AI architecture.

  • Canva Launches Leonardo AI for Enhanced Design - Canva introduces Leonardo AI, a generative tool that lets users create images from text prompts, streamlining the design process. Integrated within the platform, it offers customizable templates, enhancing creativity for all users.

  • Meta Launches AI Studio for Custom AI Characters - Meta's AI Studio allows users to create personalized AI characters based on their interests, enhancing engagement on platforms like Instagram, Messenger, and WhatsApp. Currently in US beta, it offers creators a new way to connect with their audience.

  • Perplexity Launches Publishers Program to Support Media - Perplexity has introduced the Publishers Program to support media organizations and online creators. Partners like TIME and Der Spiegel will benefit from revenue sharing, access to APIs for custom answer engines, and free Enterprise Pro access for employees, promoting collaboration and enhancing audience engagement.