SMT007 Magazine

10 SMT007 MAGAZINE I OCTOBER 2025

A model trained on massive but nonspecific datasets may not be useful. The goal is to create smart data, characterized by accuracy, completeness, consistency, integrity, and uniqueness. From a usefulness perspective, more isn't always better: a high volume of data may yield diminishing returns and increase compute costs. Haphazardly loading data into a large language model (LLM) can exacerbate the problem, leading to overwhelming complexity and a lack of confidence in shared data. Access-based data collaboration, rather than copy-based integration, mitigates data inconsistencies by eliminating duplicated or overlapping data.

For example, high-quality data is imperative in the automotive industry when developing autonomous vehicle (AV) algorithms. Datasets for AV algorithms typically feature data captured from autonomous vehicles' LiDAR and camera systems to improve object detection and motion prediction. This application calls for a stringent six nines (99.9999%) of reliability.

Ensuring privacy and security is another requirement, one that may have business consequences.

Data Infrastructure

Building an effective AI system requires not only the raw data but also the infrastructure to collect, store, prepare, transport, process, and analyze data. This includes data collection systems that can gather data from various sources (sensors, user interactions, public datasets). Also included are storage systems that manage, store, and transport data: robust databases and data lakes capable of handling large volumes of structured and unstructured data, plus the Time-to-Live parameter that specifies how long temporary or transient data is retained in a system before being deleted (e.g., training logs; a robot completing a task within a set time).

Data management and governance are additional crucial components of data infrastructure that ensure data quality, compliance, and traceability.
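As a rough illustration of the Time-to-Live parameter mentioned above, the following minimal sketch keeps transient entries (such as training logs) only until their TTL elapses. The class name `TTLStore`, its methods, and the explicit `now` argument are hypothetical and chosen for illustration; real databases and data lakes expose TTL through their own configuration.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TTLStore:
    """Toy in-memory store (hypothetical) whose entries expire after a TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        # Maps key -> (value, expiry timestamp in seconds).
        self._items: Dict[str, Tuple[Any, float]] = {}

    def put(self, key: str, value: Any, now: Optional[float] = None) -> None:
        """Store a value and stamp it with its expiry time."""
        now = time.monotonic() if now is None else now
        self._items[key] = (value, now + self.ttl_seconds)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        """Return the value, or None if it is missing or has expired."""
        now = time.monotonic() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._items[key]  # lazily delete expired data on access
            return None
        return value

    def purge(self, now: Optional[float] = None) -> int:
        """Delete every expired entry and return how many were removed."""
        now = time.monotonic() if now is None else now
        expired = [k for k, (_, exp) in self._items.items() if now >= exp]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Passing `now` explicitly makes the retention behavior easy to check without waiting for real time to pass; in normal use it can be omitted and the monotonic clock is used instead.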
Boundaries of Training Data

In considering the boundaries of training data, the first three fundamental questions are:

1. What data to include: Topics, languages, time periods, and sources.
2. What to exclude: Irrelevant, harmful, or low-quality content.
3. How much data: Volume and diversity to achieve the desired performance.

The overall goal of defining boundaries for LLM training data is to establish the purpose, scope, domains, and limits of data volume, aligning the datasets with the model's goals, resources, and capabilities.

Data Preparation

When working on an AI model, data preparation takes a significant portion of the total time. Preparing data to achieve clean and relevant datasets is a critical process for the performance and accuracy of AI models. To prepare data effectively, follow these key steps:

1. Define the AI model's objective.
2. Determine the data attributes and format required to achieve the objective.
3. Collect raw data with relevance and diversity.
4. Study the collected data to understand its content and quality, and spot anomalies, missing values, inconsistencies, and errors.
5. Clean and standardize data by removing inconsistencies, errors, and duplicates.
6. Convert the data into a suitable format for analysis.
7. Divide datasets into training, validation, and test sets. A rule of thumb is a 70-15-15 split.

SMT PERSPECTIVES & PROSPECTS
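Steps 5 and 7 of the data preparation list above (cleaning and splitting) can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the function names `clean` and `split_70_15_15`, the dictionary record layout, and the choice to standardize by trimming whitespace and lowercasing are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]


def clean(records: Sequence[Record], required: Sequence[str]) -> List[Dict[str, str]]:
    """Drop records missing a required field, standardize values, remove duplicates."""
    seen = set()
    cleaned: List[Dict[str, str]] = []
    for rec in records:
        # Skip records where any required field is absent or blank.
        if any(rec.get(f) is None or not str(rec[f]).strip() for f in required):
            continue
        # Assumed standardization: trim whitespace and lowercase every value.
        norm = {k: str(v).strip().lower() for k, v in rec.items() if v is not None}
        key = tuple(sorted(norm.items()))
        if key in seen:  # exact duplicate after standardization
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned


def split_70_15_15(items: Sequence, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle deterministically, then split into train/validation/test (70-15-15)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Shuffling before splitting keeps each subset representative of the whole, and the fixed seed makes the split reproducible. Any items left over from integer rounding fall into the test set.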
