IEEE Big Data 2024 Quick Note: Mixed Feelings about AI

Bin Dong

Big Data + HPC + NoAI (Not Only AI); Book Author; GSoC'23 Mentor, Tech committee in SC, SSDBM, etc.

Published: December 19, 2024

I am grateful for the opportunity to attend IEEE Big Data and learn about so much amazing research and so many new ideas. Here are my quick notes from the meeting. Regretfully, I could not attend every presentation because multiple sessions ran in parallel. Please check the full schedule to find the work that interests you.

1, Keynote Speech I: Generative Information Retrieval and E-commerce

Problem: what is the state of the art (system and user interface) in e-commerce?

Ad hoc algorithms do not scale, e.g., for the search “cal red wine 2019 and 40$”
Product Knowledge graph is bigger than the knowledge graph
Knowledge on the web
Presentation is also knowledge
E-commerce search has a system bottleneck. Many results that match users’ preferences fail to come through; how can 100 results cover the preferences of millions of users?
Personalization: embeddings (for both user and products) based solutions that are hard to explain
Adopting Gen-AI for E-commerce
Information retrieval is still ruled by Predictive ML
Enable every component to communicate in natural language
Unifying heterogeneous data into text (unstructured data)
Evaluation is about whether customers think the results are relevant (NDCG?)
Normalized Discounted Cumulative Gain (NDCG) is a metric used to measure the quality of ranking algorithms
LLM-powered IR vs. classic IR vs. model-based IR
Future of e-commerce: the craving for a new experience; the fusion of physical and digital shopping (drone delivery, AR/VR)
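
Since NDCG came up as the evaluation metric, here is a minimal sketch of how it is commonly computed. This is the textbook formulation with a log2 discount and graded relevance used directly as gain; the function names are my own, and the keynote did not say which NDCG variant is used in practice.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """DCG of the ranking as given, normalized by the DCG of the ideal ordering."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the results in the order the search engine returned them.
score = ndcg([3, 2, 3, 0, 1], k=5)
```

A perfectly ordered result list scores 1.0, and pushing relevant items further down the list lowers the score.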

2, Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap

Problem: How to infer users’ intention to improve product search?

Understanding purchase intention (PI) (somewhat like clustering) in data retrieval for shopping?
DistilBERT pre-trained on Wikipedia data is used to embed queries and products
Tested on the Amazon/FashionGen datasets for product search
Real-time is a big problem now

3, SHRINK: Data Compression by Semantic Extraction and Residuals Encoding

Problem: how to compress data by extracting semantic information from it?

Semantic (base, e.g., variance/mean) extraction: angle-based PLA (piecewise linear approximation)?
Construct segments with a shrinking cone
Residual encoding: the residuals are very sparse
For lossless compression: derive candidate lines with varying precision or slope?
Has applications on edge devices with limited storage space
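
To make the "base plus residuals" idea concrete, here is a minimal sketch: a greedy piecewise linear approximation with an error bound `eps` serves as the base, and the small leftover residuals are stored separately. This is a generic shrinking-cone style PLA that I wrote for illustration, not the SHRINK algorithm itself; all function names and the sample data are assumptions.

```python
def shrinking_cone_segments(values, eps):
    """Greedy PLA over x = 0, 1, 2, ...; returns (start_index, start_value, slope) triples."""
    if not values:
        return []
    segments = []
    start, lo, hi = 0, float("-inf"), float("inf")
    for i in range(1, len(values)):
        dx = i - start
        # Slopes that keep the line from the segment origin within eps of point i.
        new_lo = max(lo, (values[i] - eps - values[start]) / dx)
        new_hi = min(hi, (values[i] + eps - values[start]) / dx)
        if new_lo > new_hi:
            # The cone is empty: close the segment and start a new one at point i.
            segments.append((start, values[start], (lo + hi) / 2))
            start, lo, hi = i, float("-inf"), float("inf")
        else:
            lo, hi = new_lo, new_hi
    slope = 0.0 if lo == float("-inf") else (lo + hi) / 2
    segments.append((start, values[start], slope))
    return segments

def predict(segments, n):
    """Evaluate the piecewise linear base at indices 0..n-1."""
    out = []
    for idx, (start, base, slope) in enumerate(segments):
        end = segments[idx + 1][0] if idx + 1 < len(segments) else n
        out.extend(base + slope * (i - start) for i in range(start, end))
    return out

def compress(values, eps):
    """Return the base (segments) and the residuals left after subtracting it."""
    segments = shrinking_cone_segments(values, eps)
    residuals = [v - p for v, p in zip(values, predict(segments, len(values)))]
    return segments, residuals

def decompress(segments, residuals):
    """Base plus residuals reconstructs the original series."""
    return [p + r for p, r in zip(predict(segments, len(residuals)), residuals)]

series = [0.0, 1.0, 2.1, 2.9, 10.0, 9.0, 8.2]
segments, residuals = compress(series, eps=0.5)
restored = decompress(segments, residuals)
```

Every residual is bounded by `eps` in magnitude, so the residuals can be quantized or entropy-coded cheaply; dropping them gives a lossy version with a known error bound.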

4, On Modeling Adaptive Index Management as Adversarial Search

Problem: how to improve a DB index for incoming (predicted) queries?

DB cracking is a database optimization technique in which the system dynamically reorganizes its data based on the queries it receives. It "cracks" the data into smaller, more accessible pieces and thereby builds adaptive indexes tailored to the workload, which improves query performance, particularly for frequently accessed data subsets.
Split the data in a column into an AVL tree
Adversarial search is a field of artificial intelligence (AI) that involves multiple entities with competing goals, where each player's strategy depends on the opponent's moves.
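
The cracking idea above can be sketched as follows: each range query partitions only the piece of the column that contains its bounds and remembers where the crack landed, so later queries touch less data. This is a simplified illustration under my own assumptions; in particular, it keeps the crack positions in a sorted list via `bisect` instead of the AVL tree mentioned in the talk, and the class and method names are hypothetical.

```python
import bisect

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)  # cracker column: a copy that gets reorganized
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: first index holding a value >= pivots[i]

    def crack(self, pivot):
        """Partition the piece containing `pivot`; return the first index with value >= pivot."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked here, nothing to move
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking the column as a side effect."""
        start = self.crack(low)
        end = self.crack(high)
        return self.data[start:end]

col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
first = col.range_query(5, 13)
second = col.range_query(2, 8)
```

After the two queries the column is split into five pieces, and a repeated query on the same bounds is answered by a slice without moving any data.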

5, Keynote Speech II: Improving Semi-Supervised Learning with Pseudo-Margins

Problem: how to label data and identify mislabeled data?

MarginMatch: an SSL approach to improving pseudo-labeled data quality by monitoring the model’s training dynamics on unlabeled data; Margin: the difference between the logit of the assigned label and the largest other logit.
Use the predictions from earlier training iterations to calculate the margin, and aggregate it over iterations (e.g., average or exponential moving average).
MarginMatch vs. FixMatch vs. FlexMatch on CIFAR data
Future work: mismatch between the labeled and unlabeled data distributions
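
As a rough illustration of the margin idea, the sketch below computes a pseudo-margin from one set of logits and smooths it over training iterations with an exponential moving average. It is my own simplified reading of the talk, not the MarginMatch implementation; the function names, the `decay` value, and the sample logits are assumptions.

```python
def pseudo_margin(logits, assigned):
    """Logit of the assigned (pseudo-)label minus the largest logit among the other classes."""
    others = [z for c, z in enumerate(logits) if c != assigned]
    return logits[assigned] - max(others)

def ema_margin(logit_history, assigned, decay=0.9):
    """Exponential moving average of the pseudo-margin across training iterations."""
    avg = 0.0
    for logits in logit_history:
        avg = decay * avg + (1 - decay) * pseudo_margin(logits, assigned)
    return avg

# Logits for one unlabeled example at three successive iterations (3 classes).
history = [[0.2, 0.1, 0.0], [1.0, 0.4, -0.2], [2.0, 0.5, -1.0]]
confident = ema_margin(history, assigned=0)   # margin grows over time
doubtful = ema_margin(history, assigned=1)    # margin stays negative
```

An example whose smoothed margin stays low or negative is a candidate for being dropped from the pseudo-labeled set, even if a single iteration happened to be confident about it.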

6, NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation

Problem: how to reduce end-to-end time of Gradient Descent?

A Nyström approximation is a technique used to create a low-rank approximation of a large matrix, typically a kernel matrix in machine learning, by selecting a smaller subset of its columns and using them to represent the overall structure of the matrix.
Tested on the CIFAR-100 dataset vs. SGD
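
For reference, the standard Nyström form can be written as below. This is generic textbook notation, not necessarily the notation of the NysAct paper: K is an n × n symmetric positive semidefinite matrix and S is a set of m ≪ n sampled column indices.

```latex
K \;\approx\; \tilde{K} \;=\; C \, W^{\dagger} C^{\top},
\qquad
C = K_{:,S} \in \mathbb{R}^{n \times m},
\qquad
W = K_{S,S} \in \mathbb{R}^{m \times m}
```

Only C and the small matrix W need to be stored, so memory drops from O(n²) to O(nm), which is what makes this kind of approximation attractive for preconditioning.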

7, Performance Characterization of Expert Router for Scalable LLM Inference

Problem: how to route LLM prompts to the right model of many?

Evaluates different Llama 3 models with quantized and non-quantized weights under up to 1,000 concurrent users.
Convert incoming prompts to vectors (via TF-IDF; no word embedding is considered); train a classifier on the data via K-means?
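
A toy version of TF-IDF based routing is sketched below. To stay dependency-free it routes each prompt to the nearest per-expert centroid built from a few labeled example prompts; that is a stand-in I chose for illustration, not the K-means step from the talk, and the expert names, example prompts, and function names are all hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every token in the tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf(doc, idf):
    """L2-normalized sparse TF-IDF vector; tokens unseen during fitting are ignored."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for tok, w in vec.items():
            total[tok] = total.get(tok, 0.0) + w
    return {tok: w / len(vectors) for tok, w in total.items()}

def route(prompt, idf, centroids):
    """Pick the expert whose centroid has the largest dot product with the prompt vector."""
    vec = tfidf(prompt.lower().split(), idf)
    def score(name):
        return sum(w * centroids[name].get(tok, 0.0) for tok, w in vec.items())
    return max(centroids, key=score)

examples = {
    "code": ["write a python function to sort a list", "fix the bug in this python code"],
    "math": ["solve the integral of x squared", "prove the sum of two even numbers is even"],
}
tokenized = {name: [p.split() for p in prompts] for name, prompts in examples.items()}
idf = fit_idf([doc for docs in tokenized.values() for doc in docs])
centroids = {name: centroid([tfidf(d, idf) for d in docs]) for name, docs in tokenized.items()}
expert = route("please fix my python function", idf, centroids)
```

Because TF-IDF only counts tokens, routing is cheap enough to sit in front of every request, at the cost of missing prompts that share meaning but not vocabulary with the examples.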

8, Data Augmentations to support Speculative Reasoning in LLMs

Problem: how to connect (orchestrate) text intelligence reports?

Large volumes of text reports (intelligence reports with names, locations, events)
LLMs fail: they can summarize but fail to uncover implicit connections
Build a dynamic evidence tree (DET): first augmentation
Data condensation via LLM: second augmentation
LLM-based search and retrieval: third augmentation
Steering speculative reasoning is challenging
Bigger models like GPT-4 didn’t perform better?
Key takeaway: LLMs still fall short in creative analysis
Hundreds of documents are used (what about dynamically incoming data?)

9, A Study of Foundation Models for Large-scale Time-series Forecasting

Problem: will time series prediction for a dataset benefit from training on multiple datasets?

TSDiff model: an unconditional diffusion model for time series. A diffusion model is a type of generative AI model that progressively adds random noise to existing data and then learns to reverse this process to generate high-quality outputs such as images or text.
Data from Solar, Electricity, Traffic, Uber
Train on one dataset (or a combination) and test on the others
On large-scale data, it doesn’t show improved results
Training on multiple datasets does not improve the performance of the model on a single dataset.
Future work: will the same conclusion hold in other domains?
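
The forward (noising) half of the diffusion process described above can be sketched in a few lines. This uses the standard DDPM closed form for sampling a noised series directly from the clean one; it is not TSDiff's code, and the linear beta schedule, the function names, and the toy series are my own assumptions.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the fraction of signal kept after each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_series(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar = alpha_bars(betas)[t]
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0) for x in x0]

steps = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]  # linear schedule
clean = [math.sin(0.3 * i) for i in range(24)]                          # toy time series
rng = random.Random(0)
slightly_noisy = noise_series(clean, 0, betas, rng)
mostly_noise = noise_series(clean, steps - 1, betas, rng)
```

Training then amounts to teaching a network to predict the added noise at each step, so that the process can be run in reverse to generate new series.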

10, Zero-shot LLM-guided Counterfactual Generation: A Case Study on NLP Model Evaluation

Problem: how to produce counterfactual data with an LLM (potentially to test or explain an LLM)?

Classifier: DistilBERT on IMDB, AG News, and SNLI data; GPT model

11, StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation

Problem: how to use an LLM to recover (generate) a prompt?

LLMs are commonly used to rewrite or make stylistic changes to text. The task is to recover the LLM prompt that was used to transform a given text.

Direct inference with an LLM, jailbreak, LLM fine-tuning
Transcripts from YouTube videos (~10K)
Llama-2-8B instruct and Mistral-7B-Instruct are used
Future work: the data focuses only on English and is also small

12, An Overview of the Data-Loader Landscape: Comparative Performance Analysis

Problem: how does data ingestion perform across different sources, hardware, etc.?

PyTorch is used
The network is still slow at transferring data for training

13, Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection

Problem: how to reduce the time of training Hierarchical Contrastive Learning?

Contrastive learning in self-supervised learning (SSL) is a technique where a model learns representations by comparing different "views" of the same data point (positive pairs) against dissimilar data points (negative pairs). A contrastive loss pulls similar samples closer together in the embedding space and pushes dissimilar ones further apart, all without requiring explicit labels.
Sampling for each epoch?
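
The contrastive loss described above is often written in an InfoNCE-style form. The sketch below is that generic form for a single anchor, not the hierarchical loss from the paper; the function names and the temperature value are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log of the positive pair's share of the exponentiated, temperature-scaled similarities."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# The loss is small when the positive view is close to the anchor and negatives are far.
good = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [1.0, 0.0]])
```

In a hierarchical setting a loss like this would be evaluated at several temporal resolutions, which is where the training cost the paper targets comes from.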

14, SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems

Problem: how to make decisions without moving data in a distributed setting?

Scenario generation (demand forecasting, power grid forecasting, and other multi-stakeholder systems)
Split learning (model is split) and collaborative learning
Datasets: USAID, ACES, DEMOND, RENEWABLE
Shows strong performance on 1,125 edge nodes

15, OL4TeX: Adaptive Online Learning for Text Classification under Distribution Shifts

Problem: how to deal with data shift for online text classification?

Distribution shift indicators: frequency, semantic, and vocabulary
KNN and random forest are the classification algorithms
Compared with LLMs and RNNs

16, Fine-grained Graph-based Anomaly Detection on Vehicle Controller Area Networks

Problem: how to detect an attack on a car network?

A car has ECUs (electronic control units) that talk to each other (1 MB of memory)
ECUs are vulnerable (can be hacked)
Deep learning methods (LSTM) cannot work here (no GPU and limited memory size)
Using a graph kernel: G-IDCS; byte-thresholding method (DAGA in real ECU)
Car Hacking Dataset (CHD): 4 attacks, DoS, Fuzzing

17, RYAN: A tool for explaining and visually analyzing the evolution of Relaxed Functional Dependencies

Problem: how to track change of data?

RFD (relaxed functional dependency) tree
RYAN is built to plot different RFD graphs
An LLM can be called to explain
User study: 25 users tried RYAN

18, DynamicFL: Federated Learning with Dynamic Communication Resource Allocation

Problem: how to improve accuracy of federated learning?

Use communication heterogeneity to address statistical heterogeneity
ResNet-18 on CIFAR-10: shows the gap between FedAvg and FedSGD
DynamicFL: high-frequency group?
The communication interval is a major factor to consider
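
For context on why the communication interval matters, here is a minimal sketch of the aggregation step in federated averaging: the server averages client parameters weighted by how many samples each client holds. In FedSGD this exchange happens after every local step, while FedAvg lets clients run several local steps first. The function name and the numbers are illustrative assumptions, not DynamicFL code.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients with flattened parameter vectors; the second holds three times more data.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Longer intervals save bandwidth but let clients with different data distributions drift apart, which is the statistical heterogeneity the talk refers to.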

19, Keynote Speech III: Designing Text Embeddings for the Future

Problem: how to build bidirectional text embeddings?

AI drives the rise of vector DBs
Embeddings as data systems
Future of embeddings: inspect contents, customize domains, migrate providers?
Inverting text embeddings is hard: conditional generation, but the generated text has a different embedding
vec2text: iterative generation
Will all text have the same embedding in the future?
How to understand the embedding data?
Case study: search for customer documents at a credit card company
Embedding similarity (embeddings from OpenAI) is high between the texts of two purchases
Visa card is domain-specific: contextual training (classic embeddings do not have context)
Visa card may not appear in training: contextual architecture (e.g., card number)

Lots of inspiring posters about predicting cash flow, finding healthy food, mining offshore wind energy, and fuel consumption of cars with or without AC (many can’t be put here)

20, Multi-task Recommendation in Marketplace via Knowledge Attentive Graph Convolutional Network with Adaptive Contrastive Learning

Problem: how to recommend products by considering 3rd party sellers on Walmart?

Knowledge Graph and Graph Neural Network
Knowledge graph: items, user, sellers
User-item and user-seller bipartite graph
KGGCN: knowledge attentive graph (attention is used to distinguish the impact of edges)
Trained with LightGCN
ACL: adaptive contrastive loss: correlation between two bipartite graphs
Walmart and Taobao datasets

21, Optimal Transport for Efficient, Unsupervised Anomaly Detection on Industrial Data

Problem: how to perform anomaly detection on industrial data in maintenance?

Condition-based maintenance (CBM): time-consuming, requires specialist knowledge, and lacks data
Automated CBM pipeline: knowledge graph
Optimal Transport (OT): source and target distributions, Wasserstein distance
The OT map is used to monitor the data (no labels) and detect anomalies
Dynamic threshold adjustment based on moving moments?
Case study: industrial shipping, 9-year-old, 330 oil tanker
Also published at a NeurIPS 2024 workshop
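
To give a feel for the Wasserstein distance used here, the sketch below computes the 1-D case for two equally sized samples, where it reduces to the mean absolute difference between sorted values. This is only the simplest special case, not the paper's OT pipeline; the function name and the sensor readings are made up for illustration.

```python
def wasserstein_1d(a, b):
    """W1 distance between two equal-sized 1-D samples: mean gap between sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

reference = [70.1, 69.8, 70.4, 70.0, 69.9]   # hypothetical healthy sensor window
current = [73.2, 72.9, 73.5, 73.0, 73.1]     # hypothetical recent window
drift = wasserstein_1d(reference, current)
```

Comparing a recent window against a reference window this way needs no labels: a distance above some threshold flags the window as anomalous.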

22, GRAINRec: Graph and Attention Integrated Approach for Real-Time Session-Based Item Recommendations

Problem: how to do session-based item recommendation?

Capture the context of each session, like the relationship between items added to buy (milk → egg →)
Session embedding and item embedding to build a matrix: nearest neighbor matrix
Dataset from Target

23, Exploring Query Understanding for Amazon Product Search

Problem: how to understand queries (types) for caching, product ranking, and query segmentation?

Q2PT framework: takes a search query from the user (e.g., nike shoes) and checks if it is in the cache
Customer click history as a matrix between queries and ASINs
Multiple languages and multiple locales
Extract features to perform brand matching

24, CryptoPulse: Short-Term Cryptocurrency Forecasting with Dual-Prediction and Cross-Correlated Market Indicators

Problem: how to predict Cryptocurrency price with real-time data and sentiment analysis?

Predict the next day’s closing price with 7 indicators and sentiment analysis (LLM)
Sentiment as a scaling factor
Crypto news → LLM prompt (few-shot learning → think-tank) → linear function

Last Words: Why Title It "Mixed Feelings about AI"?

Almost everyone is talking about AI, which is encouraging. However, large language models (LLMs) are not as intelligent as they might seem. They function more like tools for quantifying text through embeddings. Moreover, most fields still require a combination of AI and non-AI solutions.

Thilanka Munasinghe

Lead Research Specialist at RPI: Quantum Machine Learning (QML) and Knowledge Graphs Applications Researcher at Institute for Data Applications and Exploration (IDEA) at RPI

2 months ago

Nicely elaborated the summary of the IEEE BigData 2024 Conference, it was a pleasure to get to know you during the conference. See you next year

Suman Bharti

Graduate Research Assistant @ Kennesaw State University | PhD CS student

2 months ago

Inspiring Bin Dong

Wei Zhang, PhD

CS Researcher @LBNL, HPC + Big Data + AI | Ex-Oracle | Ex-LBNL | Ex-Weibo

2 months ago

Amazing Dr. Dong!!!
