Superstream

Software Development

Palo Alto, California · 1,710 followers

We help companies control, optimize, and secure their streaming infrastructure using AI Workforce.

About us

Superstream AI-Workforce helps companies of all sizes boost data engineering productivity and offloads three main infrastructure challenges in the modern streaming stack: workload optimization, control, and security, starting with Kafka of all flavors.

Website
https://superstream.ai
Industry
Software Development
Company size
11-50 employees
Headquarters
Palo Alto, California
Type
Privately held
Founded
2022


Posts

  • Superstream:

    Apache Amoro – Iceberg, but easier? Working with Apache Iceberg can be painful: complex configurations, manual optimizations, and tricky metadata management. Enter Apache Amoro (still incubating), a Lakehouse management system built on open data lake formats that works with compute engines including Flink, Spark, and Trino.
    - Unified Metadata Catalog – Provides a unified catalog service for all compute engines, and can also be combined with existing metadata services.
    - Modular Optimizations – An open framework for creating different optimizers, such as file compaction, deduplication, and sorting.
    - Data Freshness – Measuring data freshness is crucially important for data developers, analysts, and administrators. Amoro addresses this by adopting the watermark concept from stream computing to gauge table freshness (a conceptual sketch follows this post).
    Amoro brings pluggable, self-managed features to the Lakehouse for an out-of-the-box data warehouse experience, and helps data platforms and products build infra-decoupled, stream-and-batch-fused, lake-native architectures. Worth watching! https://hubs.li/Q03b10YR0
    Follow us for more data engineering trends, tools, and knowledge.
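    The watermark idea is easy to picture in isolation. A conceptual sketch in plain Python (this is not Amoro's actual API, just the freshness arithmetic it describes):

    ```python
    # Conceptual sketch only (not Amoro's API): a table's watermark is the
    # newest event time already committed; freshness lag is how far that
    # watermark trails the current clock.
    from datetime import datetime, timedelta, timezone

    def freshness_lag(committed_event_times: list[datetime]) -> timedelta:
        watermark = max(committed_event_times)          # table watermark
        return datetime.now(timezone.utc) - watermark   # how stale the table is

    # Example: two commits, the latest carrying events from 3 minutes ago.
    commits = [
        datetime.now(timezone.utc) - timedelta(minutes=12),
        datetime.now(timezone.utc) - timedelta(minutes=3),
    ]
    print(f"table freshness lag: {freshness_lag(commits)}")
    ```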

  • Superstream:

    Revolutionizing AI training: smarter, faster, and more affordable! Training large language models (LLMs) like ChatGPT or LLaMA to follow instructions well is a costly and complex process. Traditionally, this involves:
    - Expensive human annotations – High-quality datasets require thousands of human-labeled examples, which are slow and costly to produce.
    - Dependence on proprietary AI models – Many AI teams use GPT-4 to generate synthetic training data, but this creates licensing risks and high costs.
    - Forgetting – AI models often struggle to learn new tasks without forgetting old knowledge.
    Introducing LAB (Large-scale Alignment for Chatbots), a breakthrough approach from the MIT-IBM Watson AI Lab and IBM Research designed to make AI training more scalable, efficient, and cost-effective. How LAB solves the problem:
    - Synthetic data generation, the smart way – LAB uses a taxonomy-driven approach to create highly diverse, high-quality instruction datasets without expensive human labeling or reliance on proprietary AI (a toy sketch follows this post).
    - Multi-phase tuning for smarter AI – LAB prevents catastrophic forgetting by structuring training into phases, ensuring that new knowledge is added without erasing prior learning.
    - Lower costs, higher performance – Instead of expensive GPT-4-generated data, LAB leverages the open-source Mixtral model to create training datasets at a fraction of the cost.
    The impact: with LAB, AI teams can train and align LLMs faster and cheaper, making powerful LLMs more accessible to companies and researchers. LAB-aligned models have already shown state-of-the-art performance, competing with models trained on expensive human-labeled or GPT-4-generated data. This approach democratizes AI, allowing more developers to fine-tune powerful models without breaking the bank. Could this be the key to scaling AI training efficiently while keeping costs low? Let's discuss!
    #AI #MachineLearning #LLMs #AITraining #Chatbots #IBMResearch #OpenSourceAI #SyntheticData #AIInnovation
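    To make the taxonomy-driven idea concrete, here is a toy sketch of walking a skill taxonomy to produce seed prompts for a teacher model; the taxonomy, template, and names are invented for illustration, not taken from the LAB paper:

    ```python
    # Toy sketch (invented for illustration, not the LAB pipeline itself):
    # walk a small skill taxonomy and emit seed prompts that a teacher model
    # (e.g., Mixtral) could expand into synthetic instruction-tuning examples.
    TAXONOMY = {
        "writing": {
            "summarization": ["news articles", "meeting notes"],
            "editing": ["grammar fixes", "tone adjustment"],
        },
        "reasoning": {
            "math": ["arithmetic word problems"],
            "logic": ["syllogisms"],
        },
    }

    def seed_prompts(taxonomy, path=()):
        """Yield (skill_path, prompt) pairs covering every leaf of the taxonomy."""
        for node, children in taxonomy.items():
            if isinstance(children, dict):
                yield from seed_prompts(children, path + (node,))
            else:
                for topic in children:
                    skill = " > ".join(path + (node,))
                    yield skill, (
                        f"Generate 5 diverse instruction-response pairs exercising "
                        f"the skill '{skill}' on the topic '{topic}'."
                    )

    for skill, prompt in seed_prompts(TAXONOMY):
        print(f"[{skill}] {prompt}")  # in practice: send each prompt to the teacher model
    ```

    Covering every leaf of the taxonomy is what drives the diversity claim: no skill branch gets skipped, and each branch can be expanded independently.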

  • Superstream:

    Is Apache Iceberg also the future of databases? If you haven't completely understood the reason behind its creation, this might help.
    Data lakes can get messy. Large files, complicated partition strategies, endless schema changes: it's a lot. Engineers need a table format that's robust and built for modern data needs.
    What is it?
    - A high-performance table format for massive datasets.
    - It separates logical table operations from the physical data layout, decoupling compute engines from storage.
    - Designed to handle petabyte-scale data with minimal overhead.
    Key features:
    - Hidden partitioning – Iceberg handles partition pruning under the hood. No more manual partition logic.
    - Schema evolution – Add or drop columns without rewriting entire tables.
    - Time travel – Query older snapshots of data for debugging or historical analysis.
    - ACID transactions – Ensure data consistency in distributed environments.
    - Multi-engine support – Spark, Flink, Trino, and more. You pick the engine you love. (A PySpark taste of schema evolution and time travel follows this post.)
    Why should you care? The obvious reasons:
    - Speed: Faster queries thanks to advanced partition pruning and metadata.
    - Reliability: Data integrity through atomic operations.
    - Flexibility: Adapts to changing data structures seamlessly.
    The less obvious reason: no vendor lock-in! One engine might be great for one type of job, and another better for a different one. That is the beauty of decoupling.
    Like the content? Leave a like and follow us to learn more!
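    A minimal PySpark taste of two of those features, assuming Spark 3.3+ with the matching Iceberg runtime jar on the classpath; the catalog name, warehouse path, and table are placeholders:

    ```python
    # Minimal sketch: schema evolution and time travel on an Iceberg table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
    spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")

    # Schema evolution: add a column without rewriting existing data files.
    spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")
    spark.sql("INSERT INTO local.db.events VALUES (2, current_timestamp(), 'US')")

    # Time travel: list snapshots, then query the table as of the first one.
    first = spark.sql(
        "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at"
    ).first()
    spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {first['snapshot_id']}").show()
    ```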

  • Superstream:

    Apache Iceberg vs. Delta Lake vs. Hudi: key differences. When to choose:
    - Iceberg: Cost-effectiveness. Best for large-scale analytics, multi-engine support, and simpler partition handling.
    - Delta Lake: Closed ecosystem. Great if you're deeply tied to the Spark or Databricks ecosystems.
    - Hudi: Perfect for real-time data ingestion and upserts (e.g., CDC logs).
    Pro tip: evaluate your current pipelines, query engines, and update patterns. Each format shines in different areas. Full comparison in the image below.
    Like the content? Follow us for more!

  • Superstream reposted Sveta Gimpelson (Co-Founder & CDO at Superstream | Keep calm and stream data):

    Inference is probably data streaming's next big use case. Here is why:
    What is it in the first place? When we talk about inference in the context of LLM reasoning, we're referring to the process by which a trained LLM uses its learned parameters to generate outputs; in other words, applying the trained model on the fly rather than training it.
    Training vs. inference:
    - Training is when the model "learns" from large data sets. During training, the model adjusts its internal parameters (weights) based on patterns in text.
    - Inference is what happens after training, when the model is presented with new text inputs (e.g., a prompt or a conversation) and must generate the most likely next tokens or words.
    Prompt + context = model output. During inference, you provide a prompt. The model processes your prompt and the context you've given (including any conversation history, system instructions, etc.). It then generates an output, token by token, each time selecting the most probable next token based on its learned distribution over all possible sequences.
    Why does data streaming make sense for inference?
    - Ongoing data flow: Many modern applications produce a continuous flow of data (e.g., clickstream logs, IoT sensor readings). Traditional batch processing would collect the data and only process it at set intervals. Streaming lets you process these events (and generate predictions) the moment they happen (a minimal sketch follows this post).
    - Dynamic model inputs: Instead of waiting for a batch of data to build up, each new piece of data can be fed to a model for inference immediately, enabling adaptive or continuously updated insights.
    - Load management: Streaming platforms can help balance load across multiple inference endpoints or microservices, ensuring you don't overwhelm the model server with sudden spikes in data.
    - Freshness: Real-time updates reduce the risk of taking wrong actions based on outdated data.
    - Smaller models: Potentially, you can train models on partial (much smaller) data sets and build outputs through real-time inference over in-depth, specifically collected knowledge.
    The powerful combination of Kafka & Flink can, and probably will, help unlock that. Superstream Confluent Kafka
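    A minimal sketch of the streaming-inference loop, using the confluent-kafka Python client; the broker address, topic names, group id, and run_inference() are hypothetical placeholders:

    ```python
    # Minimal sketch: stream events from Kafka into a model for real-time
    # inference and write predictions back out. Broker address, topic names,
    # group id, and run_inference() are hypothetical placeholders.
    import json
    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "inference-workers",
        "auto.offset.reset": "earliest",
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    consumer.subscribe(["events"])

    def run_inference(event: dict) -> dict:
        # Placeholder: call your model server (e.g., an LLM endpoint) here.
        return {"input": event, "prediction": "..."}

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            result = run_inference(json.loads(msg.value()))  # predict as data arrives
            producer.produce("predictions", json.dumps(result).encode())
            producer.poll(0)  # serve delivery callbacks without blocking
    finally:
        consumer.close()
        producer.flush()
    ```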

  • Superstream:

    You are paying at least 43% more than you should to Confluent! With data streaming at the heart of modern applications, Confluent can be a game-changer, but it often comes with hidden costs that sneak up on your bottom line. Here's where dollars go to waste:
    - Forgotten resources: Who has the time or visibility to clean up idle topics/partitions?
    - Long-running, idle connectors: Burning money while doing nothing.
    - Unused schemas: You deleted some topics, but did you remember to remove their schemas?
    - Transfer: Your devs swore compression was enabled; turns out, they never flipped the switch (see the sketch after this post).
    The good news? With proper oversight and automatic optimization, you can reclaim those wasted dollars, and fast. Let's connect and make your streaming platform more cost-efficient, so you can refocus resources on what really matters: driving innovation and delivering value to your customers.
    #Confluent #Kafka #DataStreaming #CostOptimization #DevOps
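    On the transfer point: the compression "switch" is a single producer setting. A minimal sketch with the confluent-kafka Python client (broker address and topic are placeholders):

    ```python
    # Minimal sketch: producer-side compression is one config line with the
    # confluent-kafka client. Broker address and topic are placeholders.
    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "compression.type": "lz4",  # gzip, snappy, lz4, or zstd; omit it and batches ship uncompressed
        "linger.ms": 50,            # a small linger lets batches fill, improving the compression ratio
    })

    for i in range(1000):
        producer.produce("clickstream", f'{{"event_id": {i}}}'.encode())
    producer.flush()
    ```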

  • Superstream reposted Idan Asulin (Co-Founder & CTO at Superstream, formerly Memphis.dev):

    Why do partitions significantly impact CPU utilization in Kafka? Long post ahead.
    1. I/O and log management
    * Each partition is written to a separate log, which requires disk I/O and index updates.
    * The broker must handle the overhead of appending messages to log segments and maintaining offset indexes.
    2. Replication overhead
    * If a broker is the leader for a partition, it must replicate data to follower brokers.
    * This replication involves network transfers, checksums, and additional I/O, further increasing CPU usage.
    3. Compression and decompression
    * Kafka messages are often compressed (e.g., with GZIP, Snappy, LZ4).
    * The leader broker compresses data for efficient storage, while followers and consumers may decompress messages. These operations are CPU-intensive.
    4. Metadata and controller tasks
    * Kafka needs to maintain metadata (partition assignments, leader/follower status) and coordinate rebalances.
    * As the partition count increases, the metadata grows and requires more frequent updates, leading to additional CPU overhead.
    5. Concurrency and networking
    * Each partition can receive parallel read and write requests from multiple producers and consumers.
    * Handling many concurrent socket connections and requests also increases CPU usage.
    6. Garbage collection (JVM-specific)
    * Kafka runs on the JVM, so more active partitions lead to more in-memory data structures and a higher garbage collection (GC) load.
    * Frequent GC cycles can spike CPU usage.
    (A quick way to inspect how partition leadership is spread across brokers follows this post.)
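    Since leadership drives most of the per-broker work described above, a hedged sketch for inspecting the spread with the confluent-kafka Python client (broker address is a placeholder):

    ```python
    # Hedged sketch: count partition leaders per broker as a rough proxy for
    # per-broker partition load. Broker address is a placeholder.
    from collections import Counter
    from confluent_kafka.admin import AdminClient

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    metadata = admin.list_topics(timeout=10)

    leaders = Counter()
    for topic in metadata.topics.values():
        for partition in topic.partitions.values():
            leaders[partition.leader] += 1  # id of the broker leading this partition

    for broker_id, count in sorted(leaders.items()):
        print(f"broker {broker_id}: leader for {count} partitions")
    ```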

  • Superstream reposted Idan Asulin (Co-Founder & CTO at Superstream, formerly Memphis.dev):

    5 best practices for getting the most out of your AWS MSK. If you're using AWS MSK, adopting best practices can make a huge difference. Here are 5 tips to get the most out of your MSK clusters:
    1. Right-size your brokers: Monitor usage and scale brokers dynamically to optimize costs.
    2. Fetch from the nearest replicas: This might be a bit challenging, but it can cut your transfer costs by up to 80% (a consumer sketch follows this post).
    3. Partition smarter: Balance data across partitions to avoid bottlenecks. It won't happen automatically.
    4. Monitor with CloudWatch: Set up alarms for critical metrics like disk usage, throughput, and lag. AWS will cover you to a certain degree, but not from bad management.
    5. Tiered storage: Reduce local retention and shift as much data as you can to S3.
    These small tweaks can make a big impact on costs and reliability! What's your top tip for managing AWS MSK? Let's share knowledge below!
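    Tip 2 relies on KIP-392 follower fetching. A minimal consumer sketch with the confluent-kafka Python client; the cluster must also set replica.selector.class=org.apache.kafka.common.replication.RackAwareReplicaSelector broker-side, and the endpoint, topic, and AZ id here are placeholders:

    ```python
    # Minimal sketch of nearest-replica fetching (KIP-392): the consumer
    # advertises its availability zone via client.rack, and the broker steers
    # fetches to an in-AZ replica when one exists, avoiding cross-AZ transfer
    # charges. Endpoint, topic, and AZ id are placeholders.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092",
        "group.id": "az-local-readers",
        "client.rack": "use1-az1",  # must match this consumer's AZ id
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["my-topic"])

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        print(msg.value())
    ```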
