SMT007 Magazine

SMT007-Oct2025

Issue link: https://iconnect007.uberflip.com/i/1539960


SMT Perspectives & Prospects

Artificial Intelligence Part 6: Data Module 1

by Dr. Jennie S. Hwang, H-Technologies Group

Data is one of the six pillars of AI infrastructure and is critical to the performance of artificial intelligence (AI) models. AI data, essential to both the training and inference of generative AI models, refers to the datasets used to train, validate, and test AI models. Training data gives models a frame of reference: it establishes a baseline against which pre-trained models can compare new data when making predictions or generating new content.

There are three primary types of data: structured, unstructured, and vector. Structured data is highly organized, as in a database, which makes it easier for algorithms to learn patterns. Traditional machine learning (ML) tasks, such as regression or classification for predicting sales numbers, fall into this category. Unstructured data lacks a predefined format and is widely used in deep learning applications, such as natural language processing, computer vision, and speech recognition. Examples include text, audio, video, and images. Vector data and embeddings are high-dimensional numerical representations of such data and are commonly used in tasks like similarity search, semantic search, clustering, and recommendation systems.

At its core, model output is directly shaped by the input data. Size and quality are the two most important characteristics of data.

Quality and Size of Data

The overall volume of data continues to rise. By 2030, experts predict worldwide data will grow to over 660 zettabytes¹. The size of the data is one thing; the data's quality is another. What data is to be collected and used?
Data needs to represent the entity it describes completely and accurately, with nothing missing from the dataset; to be consistent, without contradictions; and to be valid, with integrity and uniqueness, free of duplicate or overlapping data. Specificity, adaptability, and diversity are further required characteristics. Cleaning and sorting data are prerequisites for data preparation, and tools such as AWS SageMaker Data Wrangler can facilitate this.
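
The quality criteria above (completeness, consistency, uniqueness) can be sketched as simple automated checks. This is a minimal illustration, not a description of how SageMaker Data Wrangler or any other tool works. The function name quality_report, the field names, and the sample records are all hypothetical and invented for this example.

```python
from collections import Counter


def quality_report(records, required_fields, key_field):
    """Count basic quality problems in a list of dict records.

    Hypothetical helper for illustration only.
    """
    # Completeness: a required field that is absent, None, or an empty string.
    missing = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) in (None, "")
    )

    # Uniqueness: records whose key repeats an earlier record's key.
    key_counts = Counter(rec.get(key_field) for rec in records)
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)

    # Consistency: a repeated key whose contents differ from the first
    # record seen with that key, i.e. a contradiction.
    first_seen = {}
    conflicts = 0
    for rec in records:
        key = rec.get(key_field)
        if key in first_seen and first_seen[key] != rec:
            conflicts += 1
        first_seen.setdefault(key, rec)

    return {
        "missing_values": missing,
        "duplicate_keys": duplicates,
        "conflicting_records": conflicts,
    }


# Invented sample data: a small component list with deliberate defects.
records = [
    {"part_id": "R101", "value_ohms": 1000, "package": "0603"},
    {"part_id": "R102", "value_ohms": None, "package": "0603"},  # missing value
    {"part_id": "R101", "value_ohms": 1000, "package": "0603"},  # exact duplicate
    {"part_id": "R103", "value_ohms": 470, "package": ""},       # missing package
    {"part_id": "R103", "value_ohms": 4700, "package": "0402"},  # contradicts R103
]

report = quality_report(
    records,
    required_fields=["part_id", "value_ohms", "package"],
    key_field="part_id",
)
print(report)
```

Counting problems rather than silently dropping records keeps the cleaning decision with whoever owns the data. Whether a conflicting record should be corrected, merged, or removed depends on the application.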
