IEEE Big Data 2024 Quick Note: Mixed Feelings about AI

I am grateful for the opportunity to join IEEE Big Data and learn about so much amazing research and so many new ideas. Here are my quick notes from the meeting. Regrettably, I could not attend every presentation because multiple sessions ran in parallel. Please check out the full schedule to find work that interests you.

1, Keynote Speech I: Generative Information Retrieval and E-commerce

Problem: what is the state of the art (system and user interface) in e-commerce?

  • Ad hoc algorithms do not scale, e.g., for the search “cal red wine 2019 and 40$”
  • Product Knowledge graph is bigger than the knowledge graph
  • Knowledge on the web
  • Presentation is also knowledge
  • E-commerce search has a system bottleneck. Many results that match users’ preferences fail to come through; how can 100 results cover the preferences of millions of users?
  • Personalization: embeddings (for both user and products) based solutions that are hard to explain
  • Adopting Gen-AI for E-commerce
  • Information retrieval is still ruled by Predictive ML
  • Enable every component to communicate in natural language
  • Unifying heterogeneous data to text (unstructured data)
  • Evaluation is about whether customers think the results are relevant (NDCG?)
  • Normalized Discounted Cumulative Gain (NDCG) is a metric that measures ranking quality: relevant results placed higher in the list contribute more to the score
  • LLM-powered IR vs. classic IR vs. model-based IR?
  • Future of e-commerce: the craving for a new experience; the fusion of physical and digital shopping (drone delivery, AR/VR)
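
A minimal sketch of how NDCG can be computed, assuming graded relevance labels listed in the order the system ranked the results. The function names `dcg` and `ndcg` and the log2 discount are my own choices for illustration (the keynote did not show a formula); other gain and discount variants exist.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at 0-based rank r is divided by log2(r + 2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ranked = list(relevances)[:k] if k is not None else list(relevances)
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    # With no relevant result at all, define the score as 0 instead of dividing by zero.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that already lists the most relevant items first scores 1.0; pushing a relevant item down the list lowers the score.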

2, Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap

Problem: How to infer users’ intention to improve product search?

  • Understanding purchase intention (PI), somewhat like clustering, in retrieval for shopping?
  • DistilBERT pre-trained on Wikipedia data is used to embed queries and products
  • Tested on the Amazon and FashionGen datasets for product search
  • Real-time performance is still a big problem

3, SHRINK: Data Compression by Semantic Extraction and Residuals Encoding

Problem: how to compress data by extracting semantic information from it?

  • Semantic (base, e.g., variance/mean) extraction: angle-based PLA (piecewise linear approximation)?
  • Construct segments by Shrinking Code
  • Residual encoding: very sparse
  • For lossless compression: derive candidate lines with varying precision or slope?
  • Has applications on edge devices with limited storage space
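
To make the "base plus residuals" idea concrete, here is a rough sketch under my own assumptions: a generic greedy PLA that keeps a shrinking range of feasible slopes per segment, followed by integer residuals so that integer data round-trips losslessly. It is not the SHRINK algorithm itself; `fit_segments`, `compress`, `decompress`, and `max_err` are hypothetical names.

```python
def fit_segments(values, max_err):
    """Greedy PLA: each segment is a line through its first point.
    The range [lo, hi] of feasible slopes shrinks as points are added."""
    segments = []  # tuples of (start_index, length, base_value, slope)
    i, n = 0, len(values)
    while i < n:
        base = values[i]
        lo, hi = float("-inf"), float("inf")
        j = i + 1
        while j < n:
            dx = j - i
            new_lo = max(lo, (values[j] - max_err - base) / dx)
            new_hi = min(hi, (values[j] + max_err - base) / dx)
            if new_lo > new_hi:  # no single slope fits all points: close the segment
                break
            lo, hi = new_lo, new_hi
            j += 1
        slope = 0.0 if j == i + 1 else (lo + hi) / 2
        segments.append((i, j - i, base, slope))
        i = j
    return segments

def predict(segment, k):
    """Rounded value of the segment's line at global index k."""
    start, _, base, slope = segment
    return round(base + slope * (k - start))

def compress(values, max_err=1):
    """Return the line segments (the 'base') and small integer residuals."""
    segments = fit_segments(values, max_err)
    residuals = [values[k] - predict(seg, k)
                 for seg in segments for k in range(seg[0], seg[0] + seg[1])]
    return segments, residuals

def decompress(segments, residuals):
    """Rebuild the original integers from the lines plus residuals."""
    return [predict(seg, k) + residuals[k]
            for seg in segments for k in range(seg[0], seg[0] + seg[1])]
```

Because the lines absorb the trend, most residuals are zero or close to it, which is what makes them cheap to encode.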

4, On Modeling Adaptive Index Management as Adversarial Search

Problem: how to improve a DB index for incoming (predicted) queries?

  • Database cracking is an adaptive indexing technique: the system incrementally reorganizes ("cracks") a column into smaller pieces as a side effect of the queries it processes, so frequently queried ranges become faster to access without building a full index upfront
  • Split column data into pieces tracked by an AVL tree
  • Adversarial search is a field of artificial intelligence (AI) that involves multiple entities with competing goals, where each player's strategy depends on their opponent's moves
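
A small sketch of database cracking on a single column, under my own assumptions: each range query partitions only the piece of the column that contains its bounds and remembers the split positions. The class `CrackedColumn` is hypothetical, and a sorted list with `bisect` stands in for the AVL tree mentioned in the talk. The adversarial-search part of the paper is not modeled here.

```python
import bisect

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)
        self.pivots = []     # sorted pivot values seen so far
        self.positions = []  # positions[i]: index of the first element >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing `pivot`; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked at this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]
```

Repeating a query with the same bounds no longer touches the data: both split positions are found in the pivot list, and the answer is a contiguous slice.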

5, Keynote Speech II: Improving Semi-Supervised Learning with Pseudo-Margins

Problem: how to label data and identify mislabeled data?

  • MarginMatch: SSL approach to improving pseudo-labeled data quality by monitoring the model’s training dynamics on unlabeled data; margin: the difference between the logit of the assigned label and the largest other logit.
  • Use predictions from earlier training iterations to calculate the margin, then aggregate them over time (e.g., average or exponential moving average).
  • MarginMatch vs. FixMatch vs. FlexMatch on CIFAR data
  • Future work: mismatch between the labeled and unlabeled data distributions
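
The margin idea can be sketched as follows. This is my own simplified reading of the talk, not the exact MarginMatch formulation: `ema_margin`, `keep_pseudo_label`, the `decay` value, and the `threshold` are assumptions for illustration.

```python
def margin(logits, assigned):
    """Logit of the assigned class minus the largest logit among the other classes."""
    others = [z for i, z in enumerate(logits) if i != assigned]
    return logits[assigned] - max(others)

def ema_margin(logit_history, assigned, decay=0.9):
    """Exponential moving average of the margin over successive training iterations."""
    avg = None
    for logits in logit_history:
        m = margin(logits, assigned)
        avg = m if avg is None else decay * avg + (1 - decay) * m
    return avg

def keep_pseudo_label(logit_history, assigned, threshold=0.0, decay=0.9):
    """Keep the pseudo-label only if its averaged margin stays above the threshold."""
    return ema_margin(logit_history, assigned, decay) > threshold
```

An example whose pseudo-label keeps flipping during training ends up with a low or negative averaged margin and is filtered out, even if the latest prediction looks confident.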

6, NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation

Problem: how to reduce end-to-end time of Gradient Descent?

  • A Nyström approximation is a technique used to create a low-rank approximation of a large matrix, typically a kernel matrix in machine learning, by selecting a smaller subset of its columns and using them to represent the overall structure of the matrix
  • Tested on the CIFAR-100 dataset vs. SGD
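
A tiny pure-Python sketch of the generic Nyström approximation K ≈ C W⁻¹ Cᵀ, where C holds the selected columns of K and W is the block of K at the selected rows and columns. The helper names and the choice of landmark columns are assumptions; the sketch uses a plain inverse (so W must be invertible) where practical code would use a pseudo-inverse, and it does not show how NysAct applies the approximation inside its preconditioner.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (fine for small landmark blocks)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("landmark block is singular; pick other columns")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def nystrom(K, landmarks):
    """Approximate the symmetric PSD matrix K from the columns listed in `landmarks`."""
    C = [[row[j] for j in landmarks] for row in K]
    W = [[K[i][j] for j in landmarks] for i in landmarks]
    return matmul(matmul(C, inverse(W)), transpose(C))
```

When K has rank r and the r chosen columns are linearly independent, the reconstruction is exact; otherwise it is a low-rank approximation whose quality depends on which columns are sampled.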

7, Performance Characterization of Expert Router for Scalable LLM Inference

Problem: how to route LLM prompts to the right model among many?

  • Different Llama 3 models with quantized and non-quantized weights, under up to 1,000 concurrent users.
  • Convert incoming prompts to vectors via TF-IDF (no word embeddings are considered); train a classifier on the data via K-means?

8, Data Augmentations to support Speculative Reasoning in LLMs

Problem: how to connect (orchestrate) textual intelligence reports?

  • Large volumes of text reports (intelligence reports with names, locations, events)
  • LLMs fail: they can summarize but fail to uncover implicit connections
  • Build a dynamic evidence tree (DET) -- first argument
  • Data condensation via LLM – second argument
  • LLM-based search and retrieval – 3rd argument
  • Steering speculative reasoning is challenging
  • Bigger models like GPT-4 didn’t perform better?
  • Key takeaway: LLMs still lack creative analysis
  • 100s of documents are used (dynamic incoming data?)

9, A Study of Foundation Models for Large-scale Time-series Forecasting

Problem: will time series prediction for a dataset benefit from training on multiple datasets?

  • TSDiff model: an unconditional diffusion model for time series. A diffusion model is a generative model that progressively adds random noise to training data and learns to reverse that process, so it can generate new samples (images, text, or here time series) starting from noise
  • Data from Solar, Elec, Traffic, Uber
  • Train on one dataset (or a combination) and test on others
  • On large-scale data, it doesn’t show improved results
  • Training on multiple datasets does not improve the performance of the model on a single dataset
  • Future work: will the same conclusion hold in other domains?
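
The "add noise, then learn to reverse it" idea can be illustrated with the forward (noising) half of a DDPM-style diffusion process on a 1-D series. This is a generic sketch, not TSDiff itself: the noise schedule `betas` and the helper names are assumptions, and the learned reverse model is left out.

```python
import math
import random

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): the share of the original signal kept after each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_series(x0, t, betas, rng=random):
    """Sample x_t from q(x_t | x_0) in closed form; also return the noise that was added."""
    ab = alpha_bar(betas)[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps
```

In the usual setup a network is trained to predict `eps` from `xt` and `t`; generation then starts from pure noise and removes the predicted noise step by step.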

10, Zero-shot LLM-guided Counterfactual Generation: A Case Study on NLP Model Evaluation

Problem: how to produce counterfactual data with an LLM (potentially to test/explain LLMs)?

  • Classifier: DistilBERT on IMDB/AG News/SNLI data; GPT model

11, StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation

Problem: how to use an LLM to recover (generate) the prompt?

  • Recover the prompt that was used to transform a given text; LLMs are commonly used to rewrite text or make stylistic changes to it
  • Direct inference with LLM, jailbreak, LLM fine-tuning
  • Transcripts from YouTube videos (~10K)
  • Llama-2-8B instruct and Mistral-7B-Instruct are used
  • Future work: the data covers only English, and the dataset is small

12, An Overview of the Data-Loader Landscape: Comparative Performance Analysis

Problem: how does data ingestion perform across different sources, hardware, etc.?

  • PyTorch is used
  • The network is still slow at transferring data for training

13, Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection

Problem: how to reduce the time of training Hierarchical Contrastive Learning?

  • Contrastive learning in self-supervised learning (SSL) trains a model to learn representations by comparing different "views" of the same data point (positive pairs) against other, dissimilar data points (negative pairs). A contrastive loss pulls positives together in the embedding space and pushes negatives apart, without requiring labels
  • Sampling for each epoch?
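
A minimal sketch of one common contrastive loss (InfoNCE) for a single anchor, to show what "pull positives together, push negatives apart" means numerically. The temperature value and the function names are assumptions; the paper's hierarchical, multi-resolution loss is not reproduced here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log of the softmax weight given to the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]
```

The loss is small when the anchor is much closer to its positive view than to any negative, and grows when a negative is the closer one.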

14, SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems

Problem: how to make decisions without moving data in a distributed setting?

  • Scenario generation (demand forecasting, power grid forecasting, and other multi-stakeholder systems)
  • Split learning (model is split) and collaborative learning
  • Datasets: USAID, ACES, DEMOND, RENEWABLE
  • Shows strong performance on 1125 edge nodes

15, OL4TeX: Adaptive Online Learning for Text Classification under Distribution Shifts

Problem: how to deal with data shift for online text classification?

  • Frequency/Semantic/Vocabulary indicator – Distribution Shift Indicators
  • KNN and Random Forest are the classification algorithms
  • Compared with LLM and RNN

16, Fine-grained Graph-based Anomaly Detection on Vehicle Controller Area Networks

Problem: how to detect an attack on a car network?

  • Cars have ECUs (electronic control units) that talk to each other (1 MB of memory)
  • ECUs are vulnerable (can be hacked)
  • Deep learning methods (e.g., LSTM) cannot work here (no GPU and limited memory size)
  • Using graph kernels: G-IDCS; byte-thresholding method (DAGA in a real ECU)
  • Car Hacking Dataset (CHD): 4 attacks, DoS, Fuzzing

17, RYAN: A tool for explaining and visually analyzing the evolution of Relaxed Functional Dependencies

Problem: how to track changes in data?

  • RFD (relaxed functional dependency) tree
  • RYAN is built to plot different RFD graphs
  • An LLM can be called to explain them
  • User study: 25 users tried RYAN

18, DynamicFL: Federated Learning with Dynamic Communication Resource Allocation

Problem: how to improve accuracy of federated learning?

  • Use Communication heterogeneity to solve statistical heterogeneity
  • ResNet-18 on CIFAR-10: shows a gap between FedAvg and FedSGD
  • DynamicFL: high-frequency group?
  • Communication interval is a major factor to consider?

19, Keynote Speech III: Designing Text Embeddings for the Future

Problem: how to make text embedding work in both directions (text to embedding and back)?

  • AI is driving the rise of vector databases
  • Embedding as data systems
  • Future of embedding: inspect contents, customize domains, migrate provider?
  • Invert text embedding (hard) – conditional generation, generated text has different embedding
  • vec2text: Iterative generation
  • Will all text have the same embedding in future?
  • How to understand the embedding data?
  • Case Study: search for customer documents at a credit card company
  • Embedding similarity between the texts of two purchase transactions (OpenAI embeddings) is high
  • Visa card is domain-specific: contextual training (classic embedding does not have context)
  • Visa card may not appear in training: contextual architecture (e.g., card number)

Lots of inspiring posters about predicting cash flow, finding healthy food, mining offshore wind energy, and fuel consumption of cars with or without AC (too many to list here)

20, Multi-task Recommendation in Marketplace via Knowledge Attentive Graph Convolutional Network with Adaptive Contrastive Learning

Problem: how to recommend products by considering 3rd party sellers on Walmart?

  • Knowledge Graph and Graph Neural Network
  • Knowledge graph: items, user, sellers
  • User-item and user-seller bipartite graph
  • KGGCN: knowledge attentive graph (attention is used to distinguish the impact of edges)
  • Trained with LightGCN
  • ACL: adaptive contrastive loss: correlation between two bipartite graphs
  • Walmart and Taobao datasets

21, Optimal Transport for Efficient, Unsupervised Anomaly Detection on Industrial Data

Problem: how to perform anomaly detection on industrial data in maintenance?

  • Condition-based maintenance (CBM): time-consuming, requires specialist knowledge, and lacks data
  • Automated CBM pipeline: knowledge graph
  • Optimal Transport: source and target distribution – Wasserstein Distance
  • OT map is used to monitor the data (no labels) and detect anomalies
  • Dynamic threshold adjustment based on moving moments?
  • Case study: industrial shipping, 9-year-old, 330 oil tankers
  • Also published at a NeurIPS 2024 workshop
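
A simplified sketch of distance-based monitoring with optimal transport, assuming 1-D sensor readings and equal-size samples, where the Wasserstein-1 distance reduces to the mean gap between sorted values. The function names and the fixed `threshold` are assumptions; the paper's OT map, multivariate data, and dynamic threshold are not reproduced.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def flag_anomalies(reference, windows, threshold):
    """Flag each window whose distance to the healthy reference sample exceeds the threshold."""
    return [wasserstein_1d(reference, w) > threshold for w in windows]
```

No labels are needed: a window is flagged only because its distribution has moved away from the reference collected under normal operation.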

22, GRAINRec: Graph and Attention Integrated Approach for Real-Time Session-Based Item Recommendations

Problem: how to do session-based item recommendation?

  • Capture context of each session: like the relationship between items added to buy (milk → egg →)
  • Session embedding and item embedding to build a matrix: nearest neighbor matrix
  • Dataset from Target

23, Exploring Query Understanding for Amazon Product Search

Problem: how to understand queries (types) for caching, product ranking, and query segmentation?

  • Q2PT framework: takes a search query from the user (e.g., nike shoes) and checks whether it is in the cache
  • Customer click history between queries and ASINs (as a matrix)
  • Multiple languages and locales
  • Extract features to perform brand matching

24, CryptoPulse: Short-Term Cryptocurrency Forecasting with Dual-Prediction and Cross-Correlated Market Indicators

Problem: how to predict Cryptocurrency price with real-time data and sentiment analysis?

  • Predict the next day’s closing price with 7 indicators and sentiment analysis (LLM)
  • Sentiment as a scaling factor
  • Crypto news → LLM prompt (few-shot learning → think-tank) → linear function



Last Words: Why Title It "Mixed Feelings about AI"?

Almost everyone is talking about AI, which is encouraging. However, large language models (LLMs) are not as intelligent as they might seem. They function more like tools for quantifying text through embeddings. Moreover, most fields still require a combination of AI and non-AI solutions.

