SMT007 Magazine

SMT007-Apr2026

Issue link: https://iconnect007.uberflip.com/i/1544155


For example, OpenAI's o1 generates several candidate responses to each question and analyzes them to find the best one. Alibaba's QwQ-32B is a reasoning model designed to solve complex problems with only 32 billion parameters.

The DeepSeek R1 model burst onto the AI scene in early 2025, catching the industry off guard. What made it such a jaw-dropping moment? The model was trained at a fraction of the cost, is open source, and does not require high-end chips. It is also significantly smaller, using substantially fewer parameters than ChatGPT-4 or -5: DeepSeek has 671 billion parameters, compared to ChatGPT-4's estimated one trillion-plus parameters, or ChatGPT-5's tens of trillions. DeepSeek takes an iterative refinement approach, while ChatGPT draws on a vast, diverse corpus backed by massive computational resources.

Using the technique known as "distillation," DeepSeek consumes much less computing power by training smaller, more efficient models from larger "teacher" models. This means strong performance can be extracted without training from scratch at massive scale every time. The approach begins with a small, high-quality, curated dataset used as seed data to train a classifier model. That model, in turn, retrieves similar documents from larger raw datasets and weeds out duplicates and low-quality data through data filtering and data preparation. By training models on high-quality, curated datasets rather than on massive amounts of raw data (e.g., from the internet), DeepSeek improves training data efficiency by focusing on smart data use. It also makes heavy use of synthetic data and leverages reinforcement learning, in which the model essentially teaches itself to reason better through trial and error, reducing dependence on expensive human-labeled data and yielding significant cost savings.
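The core idea of distillation can be sketched in a few lines. The toy example below (not DeepSeek's actual training code; the logits and temperature values are illustrative assumptions) shows how a teacher's output logits are softened into a probability distribution that a student model is trained to match, which is what lets a smaller model absorb a larger model's behavior:

```python
import math

def softmax(logits, temperature=1.0):
    # Soften logits: a higher temperature spreads probability mass,
    # exposing the teacher's knowledge about near-miss classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's; real pipelines typically also mix in a hard-label term.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A student whose logits track the teacher's full distribution incurs a
# lower loss than one that only gets the top class right.
teacher = [4.0, 1.5, 0.2]
close_student = [3.8, 1.6, 0.1]
crude_student = [9.0, -3.0, -3.0]
assert distillation_loss(teacher, close_student) < distillation_loss(teacher, crude_student)
```

In a full training loop, the student's weights would be updated by gradient descent on this loss over many batches of teacher outputs.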
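The seed-data curation step described above can also be sketched. In this minimal, hypothetical example (the documents, scoring method, and threshold are all invented for illustration), a trivially simple "classifier" is built from a curated seed set, then used to filter a raw corpus, with duplicates removed by content hashing:

```python
import hashlib

# Hypothetical curated seed documents; a real pipeline would use
# thousands of vetted texts and a trained classifier, not word overlap.
SEED_DOCS = [
    "reflow soldering profile optimization for fine-pitch components",
    "stencil printing process control and solder paste inspection",
    "x-ray inspection of bga solder joints",
]

def seed_vocabulary(docs):
    # Build the seed set's vocabulary, our stand-in for a trained model.
    vocab = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab

def quality_score(doc, vocab):
    # Score a document by the fraction of its words seen in the seed set.
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def curate(raw_docs, vocab, threshold=0.4):
    # Keep documents that are not duplicates (by content hash) and
    # that the classifier scores as sufficiently similar to the seed.
    seen, kept = set(), []
    for doc in raw_docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate, weed out
        seen.add(digest)
        if quality_score(doc, vocab) >= threshold:
            kept.append(doc)
    return kept

raw = [
    "solder paste inspection after stencil printing",
    "solder paste inspection after stencil printing",  # duplicate
    "win a free cruise now click here",                # low quality
]
vocab = seed_vocabulary(SEED_DOCS)
assert curate(raw, vocab) == ["solder paste inspection after stencil printing"]
```

Production systems use learned classifiers and fuzzy near-duplicate detection, but the shape of the pipeline, seed, score, filter, deduplicate, is the same.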
Overall, the approach of artificially synthesizing data to supplement real-world data has recently seen significant growth. The goal is to do more with less.

Data Governance

Data governance is a framework of policies, processes, roles, accountability, and standards that ensures data is accurate, consistent, secure, and used responsibly across an organization. It defines who owns data, how it is managed, and who can access it, enabling compliance, trust, and better decision-making.

AI models can only be as valuable as the quality, trustworthiness, and accuracy of the data used to train and fine-tune them. Industry-specific datasets and up-to-date data are crucial. To ensure data quality, data governance policies and procedures should be established with the following areas in mind:

• Identifying both internal and external datasets.
• Determining performance-specific acceptance criteria before deployment. For example, an AI component failure probability of one in 10,000 computations may be acceptable for a customer chatbot but not in a self-driving vehicle.
• Building the technical infrastructure and gathering, cleaning, moving, storing, and delivering the data to AI systems at the right time and at the optimal speed. To this end, leading techniques such as RAG are often leveraged.
• For enterprise agentic AI with embedded agents, ensuring a shared understanding of intent and limits among multiple agents acting across systems.

Retrieval-augmented Generation (RAG)

RAG consists of a retriever component and a generator component, which use NLP techniques to combine an LLM with external knowledge
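The two-component structure of RAG can be illustrated with a minimal sketch. This is not any particular framework's API; the keyword-overlap retriever below is a deliberately simple stand-in for the dense vector search real systems use, and the generator side is represented by prompt construction:

```python
def retrieve(query, corpus, k=2):
    # Retriever: rank documents by word overlap with the query.
    # Production retrievers use embeddings and a vector database.
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, passages):
    # Generator side: ground the LLM by prepending retrieved passages,
    # so the answer draws on external knowledge, not parameters alone.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Reflow ovens use multiple heating zones to follow a thermal profile.",
    "Solder paste is applied through a stencil before component placement.",
    "The cafeteria menu changes weekly.",
]
query = "How is solder paste applied?"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
assert "stencil" in prompt  # the relevant passage was retrieved
```

The final prompt would then be sent to the LLM, which answers from the supplied context rather than from whatever it memorized during training.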
