Superstream

Software Development

Palo Alto, California · 1,710 followers

We help companies control, optimize, and secure their streaming infrastructure using AI Workforce.

About us

Superstream AI-Workforce helps companies of all sizes boost data engineering productivity and offloads three main infrastructure challenges in the modern streaming stack: workload optimization, control, and security, starting with Kafka of all flavors.

Website
https://superstream.ai
Industry
Software Development
Company size
11-50 employees
Headquarters
Palo Alto, California
Type
Privately held
Founded
2022


Posts

  • Superstream:

    Apache Amoro – Iceberg, but easier? Working with Apache Iceberg can be painful: complex configurations, manual optimizations, and tricky metadata management. Enter Apache Amoro (still incubating), a Lakehouse management system built on open data lake formats that works with compute engines including Flink, Spark, and Trino.
    - Unified Metadata Catalog – Provides a unified catalog service for all compute engines, and can also be combined with existing metadata services.
    - Modular Optimizations – An open framework for creating different optimizers, such as file compaction, deduplication, and sorting.
    - Data Freshness – Measuring data freshness is crucially important for data developers, analysts, and administrators. Amoro addresses this by adopting the watermark concept from stream computing to gauge table freshness (a conceptual sketch follows this post).
    Amoro brings pluggable, self-managed features to the Lakehouse for an out-of-the-box data warehouse experience, and helps data platforms and products build infra-decoupled, stream-and-batch-fused, lake-native architectures. Worth watching! https://hubs.li/Q03b10YR0
    Follow us for more data engineering trends, tools, and knowledge.
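    The watermark idea is easy to picture in isolation. A conceptual sketch in plain Python (this is not Amoro's actual API, just the freshness arithmetic it describes):

    ```python
    # Conceptual sketch only (not Amoro's API): a table's watermark is the
    # newest event time already committed; freshness lag is how far that
    # watermark trails the current clock.
    from datetime import datetime, timedelta, timezone

    def freshness_lag(committed_event_times: list[datetime]) -> timedelta:
        watermark = max(committed_event_times)          # table watermark
        return datetime.now(timezone.utc) - watermark   # how stale the table is

    # Example: two commits, the latest carrying events from 3 minutes ago.
    commits = [
        datetime.now(timezone.utc) - timedelta(minutes=12),
        datetime.now(timezone.utc) - timedelta(minutes=3),
    ]
    print(f"table freshness lag: {freshness_lag(commits)}")
    ```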

  • Superstream:

    Revolutionizing AI training: smarter, faster, and more affordable! Training large language models (LLMs) like ChatGPT or LLaMA to follow instructions well is a costly and complex process. Traditionally, this involves:
    - Expensive human annotations – High-quality datasets require thousands of human-labeled examples, which are slow and costly to produce.
    - Dependence on proprietary AI models – Many AI teams use GPT-4 to generate synthetic training data, but this creates licensing risks and high costs.
    - Forgetting – AI models often struggle to learn new tasks without forgetting old knowledge.
    Introducing LAB (Large-scale Alignment for Chatbots), a breakthrough approach from the MIT-IBM Watson AI Lab and IBM Research designed to make AI training more scalable, efficient, and cost-effective. How LAB solves the problem:
    - Synthetic data generation, the smart way – LAB uses a taxonomy-driven approach to create highly diverse, high-quality instruction datasets without expensive human labeling or reliance on proprietary AI (a toy sketch follows this post).
    - Multi-phase tuning for smarter AI – LAB prevents catastrophic forgetting by structuring training into phases, ensuring that new knowledge is added without erasing prior learning.
    - Lower costs, higher performance – Instead of expensive GPT-4-generated data, LAB leverages the open-source Mixtral model to create training datasets at a fraction of the cost.
    The impact: with LAB, AI teams can train and align LLMs faster and cheaper, making powerful LLMs more accessible to companies and researchers. LAB-aligned models have already shown state-of-the-art performance, competing with models trained on expensive human-labeled or GPT-4-generated data. This approach democratizes AI, allowing more developers to fine-tune powerful models without breaking the bank. Could this be the key to scaling AI training efficiently while keeping costs low? Let's discuss!
    #AI #MachineLearning #LLMs #AITraining #Chatbots #IBMResearch #OpenSourceAI #SyntheticData #AIInnovation
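    To make the taxonomy-driven idea concrete, here is a toy sketch of walking a skill taxonomy to produce seed prompts for a teacher model; the taxonomy, template, and names are invented for illustration, not taken from the LAB paper:

    ```python
    # Toy sketch (invented for illustration, not the LAB pipeline itself):
    # walk a small skill taxonomy and emit seed prompts that a teacher model
    # (e.g., Mixtral) could expand into synthetic instruction-tuning examples.
    TAXONOMY = {
        "writing": {
            "summarization": ["news articles", "meeting notes"],
            "editing": ["grammar fixes", "tone adjustment"],
        },
        "reasoning": {
            "math": ["arithmetic word problems"],
            "logic": ["syllogisms"],
        },
    }

    def seed_prompts(taxonomy, path=()):
        """Yield (skill_path, prompt) pairs covering every leaf of the taxonomy."""
        for node, children in taxonomy.items():
            if isinstance(children, dict):
                yield from seed_prompts(children, path + (node,))
            else:
                for topic in children:
                    skill = " > ".join(path + (node,))
                    yield skill, (
                        f"Generate 5 diverse instruction-response pairs exercising "
                        f"the skill '{skill}' on the topic '{topic}'."
                    )

    for skill, prompt in seed_prompts(TAXONOMY):
        print(f"[{skill}] {prompt}")  # in practice: send each prompt to the teacher model
    ```

    Covering every leaf of the taxonomy is what drives the diversity claim: no skill branch gets skipped, and each branch can be expanded independently.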

  • Superstream:

    Is Apache Iceberg also the future of databases? If you haven't completely understood the reason behind its creation, this might help.
    Data lakes can get messy. Large files, complicated partition strategies, endless schema changes: it's a lot. Engineers need a table format that's robust and built for modern data needs.
    What is it?
    - A high-performance table format for massive datasets.
    - It separates logical table operations from the physical data layout, decoupling compute engines from storage.
    - Designed to handle petabyte-scale data with minimal overhead.
    Key features:
    - Hidden partitioning – Iceberg handles partition pruning under the hood. No more manual partition logic.
    - Schema evolution – Add or drop columns without rewriting entire tables.
    - Time travel – Query older snapshots of data for debugging or historical analysis.
    - ACID transactions – Ensure data consistency in distributed environments.
    - Multi-engine support – Spark, Flink, Trino, and more. You pick the engine you love. (A PySpark taste of schema evolution and time travel follows this post.)
    Why should you care? The obvious reasons:
    - Speed: Faster queries thanks to advanced partition pruning and metadata.
    - Reliability: Data integrity through atomic operations.
    - Flexibility: Adapts to changing data structures seamlessly.
    The less obvious reason: no vendor lock-in! One engine might be great for one type of job, and another better for a different one. That is the beauty of decoupling.
    Like the content? Leave a like and follow us to learn more!
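    A minimal PySpark taste of two of those features, assuming Spark 3.3+ with the matching Iceberg runtime jar on the classpath; the catalog name, warehouse path, and table are placeholders:

    ```python
    # Minimal sketch: schema evolution and time travel on an Iceberg table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
    spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")

    # Schema evolution: add a column without rewriting existing data files.
    spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")
    spark.sql("INSERT INTO local.db.events VALUES (2, current_timestamp(), 'US')")

    # Time travel: list snapshots, then query the table as of the first one.
    first = spark.sql(
        "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at"
    ).first()
    spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {first['snapshot_id']}").show()
    ```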

  • Superstream:

    Apache Iceberg vs. Delta Lake vs. Hudi: key differences. When to choose:
    - Iceberg: Cost-effectiveness. Best for large-scale analytics, multi-engine support, and simpler partition handling.
    - Delta Lake: Closed ecosystem. Great if you're deeply tied to the Spark or Databricks ecosystems.
    - Hudi: Perfect for real-time data ingestion and upserts (e.g., CDC logs).
    Pro tip: evaluate your current pipelines, query engines, and update patterns. Each format shines in different areas. Full comparison in the image below.
    Like the content? Follow us for more!

  • Superstream reposted Sveta Gimpelson (Co-Founder & CDO at Superstream | Keep calm and stream data):

    Inference is probably data streaming's next big use case. Here is why:
    What is it in the first place? When we talk about inference in the context of LLM reasoning, we're referring to the process by which a trained LLM uses its learned parameters to generate outputs; in other words, applying the trained model on the fly rather than training it.
    Training vs. inference:
    - Training is when the model "learns" from large data sets. During training, the model adjusts its internal parameters (weights) based on patterns in text.
    - Inference is what happens after training, when the model is presented with new text inputs (e.g., a prompt or a conversation) and must generate the most likely next tokens or words.
    Prompt + context = model output. During inference, you provide a prompt. The model processes your prompt and the context you've given (including any conversation history, system instructions, etc.). It then generates an output, token by token, each time selecting the most probable next token based on its learned distribution over all possible sequences.
    Why does data streaming make sense for inference?
    - Ongoing data flow: Many modern applications produce a continuous flow of data (e.g., clickstream logs, IoT sensor readings). Traditional batch processing would collect the data and only process it at set intervals. Streaming lets you process these events (and generate predictions) the moment they happen (a minimal sketch follows this post).
    - Dynamic model inputs: Instead of waiting for a batch of data to build up, each new piece of data can be fed to a model for inference immediately, enabling adaptive or continuously updated insights.
    - Load management: Streaming platforms can help balance load across multiple inference endpoints or microservices, ensuring you don't overwhelm the model server with sudden spikes in data.
    - Freshness: Real-time updates reduce the risk of taking wrong actions based on outdated data.
    - Smaller models: Potentially, you can train models on partial (much smaller) data sets and build outputs through real-time inference over in-depth, specifically collected knowledge.
    The powerful combination of Kafka & Flink can, and probably will, help unlock that. Superstream Confluent Kafka
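    A minimal sketch of the streaming-inference loop, using the confluent-kafka Python client; the broker address, topic names, group id, and run_inference() are hypothetical placeholders:

    ```python
    # Minimal sketch: stream events from Kafka into a model for real-time
    # inference and write predictions back out. Broker address, topic names,
    # group id, and run_inference() are hypothetical placeholders.
    import json
    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "inference-workers",
        "auto.offset.reset": "earliest",
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    consumer.subscribe(["events"])

    def run_inference(event: dict) -> dict:
        # Placeholder: call your model server (e.g., an LLM endpoint) here.
        return {"input": event, "prediction": "..."}

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            result = run_inference(json.loads(msg.value()))  # predict as data arrives
            producer.produce("predictions", json.dumps(result).encode())
            producer.poll(0)  # serve delivery callbacks without blocking
    finally:
        consumer.close()
        producer.flush()
    ```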

  • Superstream:

    You are paying at least 43% more than you should to Confluent! With data streaming at the heart of modern applications, Confluent can be a game-changer, but it often comes with hidden costs that sneak up on your bottom line. Here's where dollars go to waste:
    - Forgotten resources: Who has the time or visibility to clean up idle topics/partitions?
    - Long-running, idle connectors: Burning money while doing nothing.
    - Unused schemas: You deleted some topics, but did you remember to remove their schemas?
    - Transfer: Your devs swore compression was enabled; turns out, they never flipped the switch (see the sketch after this post).
    The good news? With proper oversight and automatic optimization, you can reclaim those wasted dollars, and fast. Let's connect and make your streaming platform more cost-efficient, so you can refocus resources on what really matters: driving innovation and delivering value to your customers.
    #Confluent #Kafka #DataStreaming #CostOptimization #DevOps
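    On the transfer point: the compression "switch" is a single producer setting. A minimal sketch with the confluent-kafka Python client (broker address and topic are placeholders):

    ```python
    # Minimal sketch: producer-side compression is one config line with the
    # confluent-kafka client. Broker address and topic are placeholders.
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "compression.type": "lz4",  # gzip, snappy, lz4, or zstd; omit it and batches ship uncompressed
        "linger.ms": 50,            # a small linger lets batches fill, improving the compression ratio
    })

    for i in range(1000):
        producer.produce("clickstream", f'{{"event_id": {i}}}'.encode())
    producer.flush()
    ```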

  • Superstream reposted Idan Asulin (Co-Founder & CTO at Superstream, formerly Memphis.dev):

    Why do partitions significantly impact CPU utilization in Kafka? Long post ahead.
    1. I/O and log management
    * Each partition is written to a separate log, which requires disk I/O and index updates.
    * The broker must handle the overhead of appending messages to log segments and maintaining offset indexes.
    2. Replication overhead
    * If a broker is the leader for a partition, it must replicate data to follower brokers.
    * This replication involves network transfers, checksums, and additional I/O, further increasing CPU usage.
    3. Compression and decompression
    * Kafka messages are often compressed (e.g., with GZIP, Snappy, LZ4).
    * The leader broker compresses data for efficient storage, while followers and consumers may decompress messages. These operations are CPU-intensive.
    4. Metadata and controller tasks
    * Kafka needs to maintain metadata (partition assignments, leader/follower status) and coordinate rebalances.
    * As the partition count increases, the metadata grows and requires more frequent updates, leading to additional CPU overhead.
    5. Concurrency and networking
    * Each partition can receive parallel read and write requests from multiple producers and consumers.
    * Handling many concurrent socket connections and requests also increases CPU usage.
    6. Garbage collection (JVM-specific)
    * Kafka runs on the JVM, so more active partitions lead to more in-memory data structures and a higher garbage collection (GC) load.
    * Frequent GC cycles can spike CPU usage.
    (A quick way to inspect how partition leadership is spread across brokers follows this post.)
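    Since leadership drives most of the per-broker work described above, a hedged sketch for inspecting the spread with the confluent-kafka Python client (broker address is a placeholder):

    ```python
    # Hedged sketch: count partition leaders per broker as a rough proxy for
    # per-broker partition load. Broker address is a placeholder.
    from collections import Counter
    from confluent_kafka.admin import AdminClient

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    metadata = admin.list_topics(timeout=10)

    leaders = Counter()
    for topic in metadata.topics.values():
        for partition in topic.partitions.values():
            leaders[partition.leader] += 1  # id of the broker leading this partition

    for broker_id, count in sorted(leaders.items()):
        print(f"broker {broker_id}: leader for {count} partitions")
    ```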

  • Superstream reposted Idan Asulin (Co-Founder & CTO at Superstream, formerly Memphis.dev):

    5 best practices for getting the most out of your AWS MSK. If you're using AWS MSK, adopting best practices can make a huge difference. Here are 5 tips to get the most out of your MSK clusters:
    1. Right-size your brokers: Monitor usage and scale brokers dynamically to optimize costs.
    2. Fetch from the nearest replicas: This might be a bit challenging, but it can cut your transfer costs by up to 80% (a consumer sketch follows this post).
    3. Partition smarter: Balance data across partitions to avoid bottlenecks. It won't happen automatically.
    4. Monitor with CloudWatch: Set up alarms for critical metrics like disk usage, throughput, and lag. AWS will cover you to a certain degree, but not from bad management.
    5. Tiered storage: Reduce local retention and shift as much data as you can to S3.
    These small tweaks can make a big impact on costs and reliability! What's your top tip for managing AWS MSK? Let's share knowledge below!
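    Tip 2 relies on KIP-392 follower fetching. A minimal consumer sketch with the confluent-kafka Python client; the cluster must also set replica.selector.class=org.apache.kafka.common.replication.RackAwareReplicaSelector broker-side, and the endpoint, topic, and AZ id here are placeholders:

    ```python
    # Minimal sketch of nearest-replica fetching (KIP-392): the consumer
    # advertises its availability zone via client.rack, and the broker steers
    # fetches to an in-AZ replica when one exists, avoiding cross-AZ transfer
    # charges. Endpoint, topic, and AZ id are placeholders.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092",
        "group.id": "az-local-readers",
        "client.rack": "use1-az1",  # must match this consumer's AZ id
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["my-topic"])

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        print(msg.value())
    ```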
