
Hey AI, Research Theory-Based Conjectures in Deep Learning: An Evolving Landscape

March 14, 2025 · research · doasaisay

1. Introduction: The Quest for a Theoretical Understanding of Deep Learning

The advent of deep learning has ushered in an era of unprecedented progress across a multitude of domains, demonstrating remarkable capabilities in tasks ranging from complex pattern recognition in images and intricate sequence modeling in natural language to the facilitation of scientific breakthroughs 1. This surge in practical applications, however, initially outpaced the development of a comprehensive theoretical framework capable of fully explaining the underlying mechanisms that contribute to its success 11. In the early stages, while deep learning systems were achieving state-of-the-art results, the theoretical underpinnings remained largely unexplored, leaving a significant gap between the empirical observations and a formal understanding of these powerful models. As noted in one study, compared to classical machine learning, deep learning initially suffered from a lack of robust theoretical backing, particularly in providing proofs of its observed effectiveness and in establishing a consensus on how even to frame the fundamental questions 11. This theoretical void spurred a significant research effort aimed at unraveling the mysteries behind deep learning’s capabilities.

This report aims to explore some of the most prominent theory-based conjectures that have been proposed to elucidate the success of deep learning. It will delve into the core ideas behind these conjectures, analyze the evidence that has emerged over the years in support or against them, and assess how the scientific community’s understanding of these theoretical explanations has evolved as research has progressed. The objective is to provide a comprehensive overview of the current theoretical landscape of deep learning, highlighting the key areas of inquiry and the ongoing efforts to build a solid foundation for this transformative field.

The theoretical investigation into deep learning has primarily focused on addressing several fundamental questions that arise from its empirical success. These key areas of inquiry include understanding why deep networks generalize so well to unseen data despite often being vastly overparameterized 11, why training these deep architectures is practically feasible using gradient-based methods despite the highly non-convex nature of their loss functions 11, what class of functions deep networks can effectively represent and what efficiency gains increasing network depth offers 15, and what inherent mechanisms within the training process prevent overfitting even in the absence of explicit regularization techniques 15. These fundamental questions form the basis upon which various theoretical conjectures have been built, each attempting to shed light on a specific aspect of the deep learning phenomenon.

2. The Generalization Mystery: Why Do Deep Networks Perform So Well?

A central puzzle in deep learning theory revolves around the remarkable ability of these models to generalize to unseen data, even when they possess an enormous number of parameters, often far exceeding the number of training examples 14. This characteristic is a defining feature of modern deep learning and stands in stark contrast to the predictions of classical statistical learning theory. Traditional theory suggests that a model with such a high capacity should readily overfit the training data, memorizing not only the underlying patterns but also the inherent noise, leading to poor performance on new, unseen data 11. The fact that deep learning models often defy this expectation and exhibit strong generalization capabilities has been labeled a “mystery” 18 and presents a significant challenge to traditional learning paradigms.

Early attempts to theoretically explain this generalization ability often relied on classical measures of model complexity, such as the VC dimension and Rademacher complexity 14. These measures aim to quantify the capacity of a model class and provide bounds on the generalization error based on this capacity and the size of the training data. However, when applied to deep neural networks, these traditional measures often yield generalization bounds that are either too loose to be informative or even entirely vacuous 14. For instance, one study explicitly noted that these bounds can be “so large that they are meaningless” in the context of deep networks 14. This failure of conventional statistical learning theory to adequately explain the generalization of deep learning underscored the necessity for new theoretical frameworks and conjectures that could better account for the unique properties of these models, especially their tendency towards overparameterization.
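To see why such bounds become vacuous, consider a generic uniform-convergence bound of the form below (constants and the precise capacity measure are left abstract here, since the exact formulas vary across the cited works):

\[
\text{test error} \;\le\; \text{training error} \;+\; c\,\sqrt{\frac{d \log n + \log(1/\delta)}{n}},
\]

where d is a capacity measure such as the VC dimension, n is the number of training examples, and the bound holds with probability at least 1 − δ. For a modern network whose capacity grows with a parameter count in the tens of millions, trained on a dataset with n on the order of tens of thousands to a few million examples, the square-root term exceeds 1, while the classification error it is supposed to bound can never exceed 1 by definition; the inequality is then true but uninformative.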

3. Implicit Regularization: The Unseen Hand in Deep Learning Optimization

In response to the generalization mystery, a prominent conjecture has emerged suggesting that the process of training deep neural networks using gradient-based optimization algorithms like Stochastic Gradient Descent (SGD) inherently introduces a form of regularization 15. This “implicit regularization” is hypothesized to guide the learning process towards solutions that not only fit the training data well but also generalize effectively to unseen data, even without the explicit addition of regularization terms to the loss function. The idea is that the dynamics of the optimization algorithm itself impose a bias on the type of solutions that are found, favoring those with desirable properties for generalization.

Over the years, several potential mechanisms have been proposed to explain how this implicit regularization might occur. One early conjecture suggests that gradient descent on certain loss functions, such as the exponential loss, might implicitly favor solutions with smaller weight norms 15. Smaller weight norms are often associated with simpler models that are less prone to overfitting. For example, one study indicated that the critical points reached by gradient descent correspond to “minimum norm infima of the loss” 15. Furthermore, in the context of linear models, it has been theoretically shown that gradient flow on the squared error loss, when initialized at zero, converges to the solution with the minimum L2 norm among those that fit the training data 20. This suggests that the optimization process itself may have an inherent bias towards simpler solutions.
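A minimal sketch of that result, assuming an underdetermined linear least-squares problem with gradient descent initialized at zero (the sizes, learning rate, and iteration count below are illustrative choices, not taken from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # fewer examples than parameters (overparameterized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on the squared error, initialized at zero.
w = np.zeros(d)
lr = 1e-2
for _ in range(20000):
    grad = X.T @ (X @ w - y) / n
    w -= lr * grad

# Minimum-L2-norm interpolating solution, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w - y))                   # ~0: data are fit
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

Because the iterates never leave the row space of X when started at zero, gradient descent can only converge to the interpolating solution of smallest L2 norm, which is exactly what the pseudoinverse computes.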

Another line of research has focused on the behavior of deep linear networks, particularly in the context of matrix factorization for tasks like matrix completion. These studies have revealed an implicit bias of gradient descent towards finding low-rank solutions 18. Low-rank solutions are often considered simpler and more generalizable in problems involving matrix data. One study explicitly stated that adding depth to a matrix factorization model enhances the implicit tendency towards low-rank solutions 18, while another suggested that “minimization of rank” might be a more accurate way to understand implicit regularization in this setting than focusing on the norms of the weight matrices 28.
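The reported low-rank tendency can be illustrated with a small matrix-completion experiment in the spirit of these studies; the depth, sizes, initialization scale, and learning rate below are assumptions made for this sketch (and may need tuning), not the exact setup of the cited papers:

```python
import torch

torch.manual_seed(0)
n, rank = 30, 2
# A low-rank ground-truth matrix, with roughly half of its entries observed.
M = torch.randn(n, rank) @ torch.randn(rank, n)
mask = torch.rand(n, n) < 0.5

depth = 3  # product of three factors; depth 2 corresponds to ordinary matrix factorization
factors = [torch.nn.Parameter(1e-2 * torch.randn(n, n)) for _ in range(depth)]
opt = torch.optim.SGD(factors, lr=0.2)

for step in range(20000):
    W = factors[0]
    for F in factors[1:]:
        W = W @ F
    loss = ((W - M)[mask] ** 2).mean()     # fit only the observed entries
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect the spectrum of the learned end-to-end matrix: with a small initialization
# and added depth, the trailing singular values tend to stay much smaller than the
# leading ones, which is the low-rank bias described above.
print(torch.linalg.svdvals(W.detach()))
```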

Initially, it was also hypothesized that the inherent noise in SGD, especially when using small batch sizes, might act as a form of implicit regularization by preventing the optimization from converging to “sharp minima” in the loss landscape, which were thought to generalize poorly 18. However, subsequent research has indicated that deterministic gradient-based algorithms can also achieve good generalization 18, and some empirical evidence even suggests that stochastic training might not offer particular benefits for generalization 20, thus questioning the primary role of SGD noise in this phenomenon.

Beyond these initial ideas, a multitude of other potential mechanisms for implicit regularization have been proposed. These include geometric properties of the loss function that might favor smoother solutions 30, the possibility of learned networks performing almost linear interpolation between data points due to the non-linearity of ReLU activations 32, the concept of “implicit self-regularization” potentially related to Tikhonov regularization 33, a dynamic theory focusing on saddle point escaping in deep low-rank matrix factorization 34, and the implicit bias towards minimum L2-norm solutions in linear models trained with gradient flow 20. The sheer variety of these proposed mechanisms highlights the complexity of understanding how optimization in deep learning leads to generalization.

The understanding of implicit regularization has evolved considerably over the years. Early research often drew parallels with explicit regularization techniques, focusing on the minimization of norms 18. However, later studies, particularly in the context of matrix factorization, have challenged this view, suggesting that rank minimization might be a more relevant concept in some cases 18. There is a growing consensus that implicit regularization is likely data-dependent and might not be universally captured by simple mathematical norms 18. More recent research has increasingly focused on the dynamics of the optimization process itself and the role of specific architectural choices in shaping these implicit regularization effects 31. This shift in focus reflects a more nuanced understanding of how the interplay between the learning algorithm, the network structure, and the data contributes to the generalization capabilities of deep learning models.

4. Gradient Coherence: Finding Harmony in Training Signals

A more recent and actively investigated conjecture aimed at explaining the generalization prowess of deep learning is the Coherent Gradients Hypothesis (CGH) 16. This hypothesis proposes that the ability of deep networks trained with gradient descent to generalize well stems from the similarity, or “coherence,” observed in the gradients computed from similar training examples. The central idea is that during training, the overall gradient update will be more pronounced in directions that lead to a reduction in loss across multiple similar data points, thereby facilitating the learning of features that are generalizable rather than specific to individual training instances.

The CGH posits that training on real-world datasets inherently exhibits a higher degree of gradient coherence than training on datasets with random labels 16. This difference in coherence is proposed as a key reason why deep networks can achieve strong generalization on real data but tend to overfit when trained on randomly labeled data, which lacks meaningful underlying structure. Empirical studies have been conducted to validate this by measuring gradient coherence on real datasets, such as ImageNet, and comparing it to the coherence observed when training on versions of the same dataset with randomized labels 37. These studies generally report a significantly higher level of gradient coherence in the training process with real data. For instance, one study found substantially higher coherence when training a ResNet50 model on real ImageNet compared to a randomly labeled version of the same dataset 44.
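As a rough illustration of the kind of measurement involved, and not the exact coherence metric of the cited studies, one can compute per-example gradients for a batch and average their pairwise cosine similarities; the toy model, data, and label constructions below are assumptions made for this sketch:

```python
import torch

def mean_pairwise_cosine(grads):
    """Average cosine similarity over all pairs of per-example gradient vectors."""
    G = torch.nn.functional.normalize(torch.stack(grads), dim=1)
    sims = G @ G.T
    n = G.shape[0]
    return (sims.sum() - n) / (n * (n - 1))   # exclude the diagonal (self-similarity)

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(128, 20)
y_real = (x[:, 0] > 0).long()            # labels carry structure shared across examples
y_random = torch.randint(0, 2, (128,))   # labels carry no structure

for name, y in [("structured labels", y_real), ("random labels", y_random)]:
    grads = []
    for i in range(len(x)):
        model.zero_grad()
        loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    print(name, "mean pairwise cosine:", float(mean_pairwise_cosine(grads)))
```

On the structured labels the average similarity typically comes out higher than on the random ones, echoing in miniature the qualitative contrast the hypothesis relies on.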

Furthermore, research has explored the impact of manipulating the gradients during training to either enhance or suppress coherence. Techniques that aim to suppress weak gradient directions (those that do not align well with the gradients from other examples in the mini-batch) have shown promising results in terms of improved generalization performance 16. These findings lend further support to the CGH by suggesting that emphasizing coherent gradient updates can lead to better learning outcomes. For example, one study introduced a “median of means” approach to filter out weak gradient directions and observed a significant reduction in overfitting 42. The rationale behind this is that strong, coherent gradient directions are more stable with respect to the inclusion or exclusion of individual training examples, which is a desirable property for generalization 43.
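A simplified sketch of that style of aggregation, inspired by but not identical to the cited procedure (the group count, model, and data are placeholders):

```python
import torch

def median_of_means_gradient(model, loss_fn, x, y, num_groups=4):
    """Split the mini-batch into groups, average gradients within each group,
    then take a coordinate-wise median across groups. Coordinates on which the
    groups disagree (weak, incoherent directions) are damped by the median."""
    group_grads = []
    for xs, ys in zip(x.chunk(num_groups), y.chunk(num_groups)):
        model.zero_grad()
        loss_fn(model(xs), ys).backward()
        group_grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    return torch.stack(group_grads).median(dim=0).values

# Usage: compute the robust gradient and apply it manually as a plain SGD step.
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
g = median_of_means_gradient(model, loss_fn, x, y)

lr, offset = 0.1, 0
with torch.no_grad():
    for p in model.parameters():
        p -= lr * g[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
```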

As research on the CGH has progressed, efforts have been made to develop more refined and computationally efficient metrics for quantifying gradient coherence 45. Traditional methods of calculating coherence can be computationally demanding, especially with large mini-batches. A more recent study proposed a new metric called “m-coherence” that aims to be more interpretable and computationally less expensive than existing measures 45. Interestingly, some research has observed that gradient coherence can increase during the initial phases of training even when using randomly labeled data 45. This suggests that some degree of coherence might arise as a natural consequence of the optimization process itself, independent of the underlying structure in the data, prompting further investigation into the origins and implications of gradient coherence.

However, not all findings align perfectly with a straightforward interpretation of the CGH. One study reported evidence of “oscillating gradients” during the training of deep networks, where gradients in consecutive iterations showed a high negative correlation 47. The study found that a significant portion of the training loss reduction occurred through these oscillating gradients, which might seem at odds with the idea of consistently aligned gradients driving the learning process. This observation suggests that the dynamics of gradient descent in deep learning might be more complex than a simple reliance on coherent gradients. Additionally, research in other areas, such as reinforcement learning, has shown that the relationship between concepts related to implicit regularization (like effective rank) and performance can be sensitive to hyperparameters 48, indicating that simplistic assumptions about the direct link between theoretical concepts and empirical outcomes might be misleading. While not directly contradicting the CGH, these findings highlight the need for a nuanced understanding of the various factors influencing generalization in deep learning.
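A simple way to look for this behavior in one's own runs is to track the cosine similarity between successive gradients; the ill-conditioned quadratic below only illustrates the measurement itself (and how a large step size produces anti-correlated consecutive gradients along sharp directions), not the deep-network phenomenon reported in the study:

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.randn(50))
h = torch.linspace(1.0, 10.0, 50)        # a range of curvatures (an ill-conditioned quadratic)
opt = torch.optim.SGD([w], lr=0.19)      # large enough that the sharpest directions oscillate
prev = None

for step in range(200):
    loss = 0.5 * (h * w ** 2).sum()
    opt.zero_grad()
    loss.backward()
    g = w.grad.detach().clone()
    if prev is not None and step % 50 == 0:
        cos = torch.nn.functional.cosine_similarity(g, prev, dim=0)
        print(f"step {step:3d}: cos(previous gradient, current gradient) = {cos:+.2f}")
    prev = g
    opt.step()
```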

5. The Role of Flat Minima in Generalization

Another influential conjecture in deep learning theory centers around the geometry of the loss landscape and its relationship to the generalization ability of trained models. Specifically, the flat minima conjecture proposes that deep neural networks trained with stochastic optimizers tend to converge to “flat” minima in the loss landscape, and that these flat minima are associated with better generalization performance compared to “sharp” minima 18. A flat minimum is characterized by a broad region in the parameter space where the loss remains relatively low, suggesting that the learned solution is less sensitive to small changes or perturbations in the model’s weights.

The initial appeal of this conjecture stemmed from the intuitive idea that models residing in flat minima would be more robust to noise in the training data or slight differences between the training and test distributions 50. If the loss function is flat around a solution, then small perturbations to the parameters are less likely to cause a significant increase in the loss, implying a more stable and generalizable model. Empirical studies have provided some support for this by showing correlations between training regimes that tend to find flat minima (like SGD with small batch sizes) and improved generalization 50. Furthermore, algorithms designed to explicitly seek out flat regions in the loss landscape have often demonstrated promising results in terms of generalization 27.
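One crude but common way to probe this robustness is to measure how much the loss rises when the trained weights are perturbed by random noise of a fixed scale; the helper below is a generic sketch in which the model, data, and noise scale are placeholders rather than a setup from the cited works:

```python
import torch

@torch.no_grad()
def sharpness_estimate(model, loss_fn, x, y, sigma=0.01, num_samples=20):
    """Average increase in loss under isotropic Gaussian weight perturbations.
    Small values suggest a 'flat' region; large values suggest a 'sharp' one."""
    base = loss_fn(model(x), y).item()
    params = list(model.parameters())
    originals = [p.clone() for p in params]
    increases = []
    for _ in range(num_samples):
        for p, p0 in zip(params, originals):
            p.copy_(p0 + sigma * torch.randn_like(p0))
        increases.append(loss_fn(model(x), y).item() - base)
    for p, p0 in zip(params, originals):   # restore the trained weights
        p.copy_(p0)
    return sum(increases) / len(increases)

# Example usage on a toy model and random data.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
print(sharpness_estimate(model, torch.nn.CrossEntropyLoss(), x, y))
```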

However, the flat minima conjecture has also faced significant challenges and has been refined by subsequent research. A key criticism is that the notion of “flatness” can be highly sensitive to the specific parameterization used to represent the neural network 50. A minimum that appears flat under one parameterization might appear sharp under another, even if both parameterizations represent the same underlying function. This raises questions about whether flatness, as typically defined, is a fundamental property that directly guarantees generalization. Some research has even argued that sharp minima can also generalize well 53.
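The counterargument can be seen already in a two-layer ReLU network f(x) = W_2 σ(W_1 x): because the ReLU σ is positively homogeneous, for any α > 0

\[
W_2\,\sigma(W_1 x) \;=\; \left(\tfrac{1}{\alpha} W_2\right)\sigma\!\left(\alpha\, W_1 x\right),
\]

so the rescaled parameters compute exactly the same function and achieve exactly the same loss on every input, yet choosing α very large or very small changes the curvature of the loss with respect to the weights arbitrarily. A minimum that looks flat in one parameterization can therefore be made to look sharp in an equivalent one 53.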

More recent work has explored alternative ways to define and measure the “flatness” of minima, often using concepts like the trace of the Hessian matrix as a measure of average curvature 40. Studies in specific contexts, such as low-rank matrix recovery, have shown that flat minima (as measured by the Hessian trace) can exactly recover the ground truth under certain statistical assumptions 40. This suggests that while the initial, simple view of flat minima might be problematic, more refined measures of flatness could still hold relevance in understanding generalization for certain types of problems. The relationship between flatness and generalization might also be more intricate, potentially depending on factors like the volume of the surrounding function space 50.
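The trace of the Hessian can be estimated without ever forming the Hessian, for example with Hutchinson's estimator built from Hessian-vector products; the sketch below assumes, for simplicity, a loss defined on a single flattened parameter vector:

```python
import torch

def hessian_trace_estimate(loss_fn, w, num_samples=50):
    """Hutchinson estimator: E[v^T H v] = tr(H) for Rademacher-distributed v."""
    estimates = []
    for _ in range(num_samples):
        v = (torch.randint(0, 2, w.shape) * 2 - 1).to(w.dtype)    # entries in {-1, +1}
        loss = loss_fn(w)
        (grad,) = torch.autograd.grad(loss, w, create_graph=True)
        (hv,) = torch.autograd.grad(grad @ v, w)                  # Hessian-vector product
        estimates.append((v @ hv).item())
    return sum(estimates) / len(estimates)

# Toy check on a quadratic 0.5 * w^T A w, whose Hessian is A (so the trace is A.trace()).
torch.manual_seed(0)
A = torch.randn(30, 30)
A = A @ A.T
w = torch.randn(30, requires_grad=True)
print("estimate:", hessian_trace_estimate(lambda w: 0.5 * w @ A @ w, w))
print("exact   :", A.trace().item())
```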

The current understanding is that while the flatness of minima might play a role in generalization, it is likely not the sole determining factor and the relationship is more complex than initially conceived. Ongoing research continues to investigate more robust measures of flatness and their connection to other aspects of the loss landscape and the training process.

6. Overparameterization and the Double Descent Phenomenon

Classical statistical learning theory suggests that there is a trade-off between the bias and variance of a model. As model complexity increases, the bias decreases (the model can fit the training data better), but the variance increases (the model becomes more sensitive to the specific training data and might not generalize well). This typically leads to a U-shaped curve for the test error as a function of model complexity. However, in deep learning, empirical observations have revealed a surprising phenomenon known as “double descent” 23.

In double descent, as the model complexity (e.g., the number of parameters) increases, the test error initially decreases, as expected. However, after reaching a peak around the point where the model can just perfectly fit the training data (the interpolation threshold), the test error surprisingly starts to decrease again as the model becomes even more overparameterized. This second descent in the test error, occurring when the number of parameters far exceeds the number of data points, contradicts the traditional intuition that more parameters should always lead to overfitting.
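The shape of the curve can be reproduced in a minimal setting using fixed random nonlinear features and a minimum-norm least-squares fit of the output layer; the feature map, sizes, and noise level here are illustrative assumptions rather than a setup from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 100, 1000, 20, 0.3

def make_data(n):
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + noise * rng.standard_normal(n)   # a simple 1-D signal plus noise
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

for width in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    W = rng.standard_normal((d, width)) / np.sqrt(d)        # fixed random first layer
    feats = lambda X: np.maximum(X @ W, 0.0)                # random ReLU features
    # Minimum-norm least-squares fit of the second layer (via the pseudoinverse).
    coef = np.linalg.pinv(feats(X_train)) @ y_train
    err = np.mean((feats(X_test) @ coef - y_test) ** 2)
    print(f"width {width:5d}  test MSE {err:.3f}")
# The test error typically peaks near width = n_train (the interpolation threshold)
# and falls again as the width grows well beyond it.
```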

Theoretical efforts to explain double descent have explored various factors. One perspective suggests that in the overparameterized regime, there are many models that can fit the training data perfectly, and gradient descent might implicitly select those that have better generalization properties, such as having a lower norm 22. Other theories involve the interplay between the model size and the effective complexity of the learning task. Some work has used random matrix theory to analyze the behavior of the empirical feature covariance matrix in relation to the Neural Network Gaussian Process (NNGP) kernel to explain this phenomenon 58. Additionally, research has investigated the role of smoothness and weighted optimization in the context of overparameterization and generalization error, often motivated by the double descent observation 23.

The double descent phenomenon has significant implications for our understanding of generalization in deep learning. It suggests that very large, overparameterized models can indeed achieve excellent generalization, challenging the traditional view that limiting model complexity is always necessary to prevent overfitting. Understanding the mechanisms behind double descent could provide valuable insights into how to effectively design and train deep learning models, potentially suggesting that exploring very large models might be a viable strategy for achieving state-of-the-art performance.

7. Information Bottleneck: A Theoretical Framework for Deep Learning?

The Information Bottleneck (IB) principle, rooted in information theory, offers a different perspective on understanding learning in deep neural networks 11. The IB principle suggests that a good representation of the input data should retain as much information as possible about the relevant output variable while discarding irrelevant information. In the context of deep learning, this translates to the idea that hidden layers in a neural network should learn to compress the input data into a representation that is informative about the task at hand but discards noise and other task-irrelevant details.
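Formally, for an input X, a target Y, and a learned representation T, the IB principle seeks a stochastic encoding p(t|x) that minimizes the objective

\[
\mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y),
\]

where I(X;T) measures how much information about the input the representation retains, I(T;Y) measures how predictive the representation remains of the target, and the multiplier β sets the trade-off between compression and relevance.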

Recent research has explored the IB framework as a potential theoretical foundation for deep learning. This includes using information-theoretic measures like mutual information to analyze the training process, employing the IB principle as a guide for designing loss functions, and investigating the “information plane” to visualize how neural networks learn relevant information and forget irrelevant information during training 62. A significant area of focus has been on trying to establish rigorous learning theory that links the information bottleneck to generalization error bounds 64. The hope is that by controlling the flow of information through the network, we can theoretically guarantee good generalization.

While the IB framework provides an intuitive and appealing way to think about learning in deep networks, establishing tight and practically useful generalization bounds based on this principle has proven to be challenging 38. Many current information-theoretic bounds are still quite loose or rely on assumptions that might not always hold. However, some studies have shown empirical correlations between the degree to which a neural network exhibits an information bottleneck (i.e., compresses the input while retaining information about the output) and its generalization performance 64. These findings suggest that the principles captured by the IB framework might indeed be important for understanding why deep learning models generalize well.

8. The Lottery Ticket Hypothesis: Uncovering Trainable Sparse Subnetworks

The Lottery Ticket Hypothesis (LTH) presents a rather surprising and intriguing conjecture about the nature of learning in deep neural networks 67. The LTH proposes that within a randomly initialized, dense neural network, there exists a sparse subnetwork (a “winning ticket”) that, when trained in isolation from the same initial weights, can achieve a test accuracy comparable to or even better than that of the original, dense network, and in a similar number of training iterations. This suggests that the initial weights of these winning subnetworks are particularly fortuitous or well-suited for learning the task.

Empirical studies have provided considerable support for the LTH, consistently finding such winning tickets across various network architectures and datasets 69. These studies typically involve training a dense network, pruning a significant portion of its weights based on magnitude, and then retraining the remaining sparse subnetwork using its original initial weights. The finding that these sparse subnetworks can often match or even exceed the performance of the original dense network is quite remarkable.
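A schematic version of that train-prune-rewind procedure is sketched below; the model, data, single pruning round, pruning fraction, and training budget are all placeholder assumptions (real lottery-ticket experiments typically prune iteratively over many rounds):

```python
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(20, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2))
init_state = copy.deepcopy(model.state_dict())        # remember the original initialization
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
loss_fn = torch.nn.CrossEntropyLoss()

def apply_masks(model, masks):
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p *= masks[name]

def train(model, masks=None, steps=500, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        if masks:                                      # keep pruned weights at zero
            apply_masks(model, masks)

# 1. Train the dense network.
train(model)

# 2. Prune the smallest-magnitude weights (keep the largest 20% in each weight matrix).
masks = {}
for name, p in model.named_parameters():
    if p.dim() == 2:                                   # prune weight matrices, not biases
        cutoff = p.abs().flatten().kthvalue(int(0.8 * p.numel())).values
        masks[name] = (p.abs() > cutoff).float()

# 3. Rewind the surviving weights to their initial values and retrain the sparse subnetwork.
model.load_state_dict(init_state)
apply_masks(model, masks)
train(model, masks)
```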

Theoretical research has begun to explore the reasons behind the success of lottery tickets. Some work suggests that these subnetworks might reside in regions of the loss landscape that are more favorable for optimization 68. There is also debate about whether the key to the success of winning tickets lies primarily in the specific initial weights or in the structure of the sparse subnetwork itself. Some evidence suggests that while the original initialization is beneficial, the subnetwork structure might be the more critical factor 68.

The LTH has significant implications for the efficiency of deep learning. If we can reliably identify these winning tickets early on, it could lead to much faster and more resource-efficient training of smaller networks. However, efficiently finding these tickets, especially in very large-scale models, remains a significant challenge.

9. Other Emerging Conjectures and Evolving Perspectives

Beyond the major conjectures discussed, research continues to explore other aspects of deep learning theory. The theoretical understanding of the approximation power of deep networks has shown that they can approximate a wide range of functions, sometimes with an exponential advantage over shallow networks, particularly for compositional functions 15. This suggests that depth plays a crucial role in the representational efficiency of these models.
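A standard illustration of this compositional advantage, following the general argument in the cited work, considers a function with a binary-tree structure, for example

\[
f(x_1,\dots,x_8) \;=\; h_3\big(h_{21}\big(h_{11}(x_1,x_2),\,h_{12}(x_3,x_4)\big),\;h_{22}\big(h_{13}(x_5,x_6),\,h_{14}(x_7,x_8)\big)\big),
\]

built from simple two-argument constituent functions. A deep network whose architecture mirrors this tree needs only a number of units that grows roughly linearly with the input dimension, whereas a generic shallow approximator of the same function can require a number of units that grows exponentially with the dimension 15.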

The dynamics of optimization in deep learning also remain a central area of investigation 11. Researchers continue to explore why gradient-based methods work so effectively despite the non-convex nature of the loss landscape. Theories related to the geometry of the loss landscape, such as the existence of flat minima and continuous low-loss paths, as well as frameworks like the Neural Tangent Kernel (NTK) theory and Mean-Field theory, provide different lenses through which to analyze the optimization process. The impact of hyperparameters like learning rate and batch size also remains an active area of research.
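To give one concrete example of these lenses, the Neural Tangent Kernel associated with a network f_θ is

\[
\Theta(x, x') \;=\; \big\langle \nabla_\theta f_\theta(x),\, \nabla_\theta f_\theta(x') \big\rangle,
\]

and in the infinite-width limit this kernel remains essentially fixed during training, so gradient descent on the network behaves like kernel regression with Θ, turning a non-convex training problem into one that can be analyzed with linear tools.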

10. Conclusion: The Evolving Landscape of Deep Learning Theory

The theoretical understanding of deep learning has progressed significantly since its resurgence. While the field initially saw empirical advancements outpacing theoretical explanations, substantial research efforts have been made to develop a more robust foundation. Conjectures surrounding implicit regularization have gained considerable traction, suggesting that the optimization process itself plays a crucial role in promoting generalization. The Coherent Gradients Hypothesis offers an intuitive explanation based on the alignment of gradients from similar examples. The flat minima conjecture, while facing nuances related to parameterization, continues to inspire research into the role of loss landscape geometry. The double descent phenomenon has challenged traditional views on model complexity and generalization, opening new avenues for theoretical exploration. The Information Bottleneck principle provides a promising information-theoretic framework for understanding representation learning. Finally, the Lottery Ticket Hypothesis offers intriguing insights into the potential for highly sparse and efficient deep learning models.

Despite the progress, many open questions and challenges remain. A complete and unified theory of deep learning generalization is still elusive. The precise mechanisms underlying implicit regularization need further clarification. The origins and full implications of gradient coherence require deeper investigation. The connection between flatness of minima and generalization is more complex than initially thought. Scaling theoretical results to the massive models used in practice remains a significant hurdle. Understanding the interplay between architecture, optimization, and generalization is a key area for future research. Moreover, the theoretical foundations of emerging areas like self-supervised learning and the impressive capabilities of large language models are just beginning to be explored.

Future research in deep learning theory will likely focus on developing more comprehensive and unified frameworks that can address the intricate interplay of factors contributing to the success of these models. Continued investigation into the roles of data, architecture, and optimization will be crucial. Bridging the gap between theoretical findings and practical applications, and developing theories that can explain the behavior of increasingly complex deep learning models, will be essential for the continued advancement of the field.

Table: Evolution of Key Conjectures in Deep Learning Theory

Implicit Regularization
Initial formulation: Gradient-based optimization inherently guides learning towards generalizable solutions without explicit regularization.
Key supporting evidence over time: Observation of good generalization in overparameterized networks without explicit regularization 15; theoretical work on minimum norm solutions 15; empirical evidence of a low-rank bias in matrix factorization 18.
Key challenges or nuances over time: Difficulty in characterizing the effect with simple norms 28; questioning of the role of SGD noise 18; data dependence of the phenomenon 18.
Current standing: Broadly accepted as a key factor, but the specific mechanisms are still being elucidated.

Coherent Gradients Hypothesis
Initial formulation: Generalization is driven by the similarity of gradients from similar training examples.
Key supporting evidence over time: Empirical findings of higher coherence on real data than on random data 37; improved generalization when weak gradient directions are suppressed 42.
Key challenges or nuances over time: Observation of increasing coherence even with random labels 45; evidence of effective learning with oscillating gradients 47; sensitivity to hyperparameters 48.
Current standing: Promising explanation, but its origins and full implications are still under investigation.

Flat Minima Conjecture
Initial formulation: Deep networks tend to find flat minima in the loss landscape, which generalize better than sharp minima.
Key supporting evidence over time: Intuitive explanation of robustness to perturbations 50; empirical correlation between flat minima and generalization 50; success of algorithms that seek flat minima 27.
Key challenges or nuances over time: Sensitivity of flatness to parameterization 50; evidence that sharp minima can also generalize 53; potential role of function space volume 50.
Current standing: Refined understanding; flatness may play a role, but it is not the sole determinant and depends on the measure used.

Double Descent
Initial formulation: Test error decreases, increases at the interpolation threshold, and then decreases again with increasing overparameterization.
Key supporting evidence over time: Consistent empirical observations across various models and datasets 60; initial theoretical explanations involving properties of linear regression and overparameterization 59.
Key challenges or nuances over time: Requires going beyond the traditional bias-variance trade-off; exact mechanisms are still under investigation.
Current standing: Well-documented phenomenon that challenges classical theory and highlights the benefits of extreme overparameterization.

Information Bottleneck
Initial formulation: Good representations are compressed but retain maximal information about the relevant output.
Key supporting evidence over time: Intuitive framework for understanding representation learning 62; empirical correlations between bottleneck behavior and generalization 64; use in analyzing optimization and designing cost functions 62.
Key challenges or nuances over time: Difficulty in deriving tight and practical generalization bounds 38; counterexamples to some initial conjectures 65.
Current standing: Promising theoretical lens, but further work is needed to establish it as a fully rigorous theory of deep learning generalization.

Lottery Ticket Hypothesis
Initial formulation: Dense networks contain sparse subnetworks (“winning tickets”) that can achieve comparable performance when trained in isolation.
Key supporting evidence over time: Consistent empirical findings of high-performing sparse subnetworks 69; initial theoretical justifications based on loss landscape geometry 68.
Key challenges or nuances over time: Debate on the relative importance of initial weights versus subnetwork structure 68; challenges in efficiently identifying winning tickets in large networks 67.
Current standing: Intriguing hypothesis with significant empirical support and potential for improving efficiency; ongoing research into the underlying mechanisms and efficient identification methods.

Works cited

1. Theoretical Perspectives on Deep Learning Methods in Inverse Problems - NSF Public Access Repository, accessed March 14, 2025, https://par.nsf.gov/servlets/purl/10432882
2. The murmuration conjecture: finding new maths with AI, accessed March 14, 2025, https://plus.maths.org/content/murmuration-conjecture-finding-new-maths-ai
3. Top Deep Learning Interview Questions and Answers (2024) - GeeksforGeeks, accessed March 14, 2025, https://www.geeksforgeeks.org/deep-learning-interview-questions/
4. The Top 20 Deep Learning Interview Questions and Answers - DataCamp, accessed March 14, 2025, https://www.datacamp.com/blog/the-top-20-deep-learning-interview-questions-and-answers
5. A Comprehensive Review of Deep Learning: Architectures, Recent Advances, and Applications - MDPI, accessed March 14, 2025, https://www.mdpi.com/2078-2489/15/12/755
6. An Introductory Review of Deep Learning for Prediction Models With Big Data - PMC, accessed March 14, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC7861305/
7. Book Review: Deep Learning - PMC, accessed March 14, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC5116548/
8. Deep Learning, Theory and Foundation: A Brief Review, accessed March 14, 2025, https://ijisrt.com/assets/upload/files/IJISRT21NOV399.pdf
9. Recent advances of deep learning in psychiatric disorders - PMC, accessed March 14, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC8982596/
10. (PDF) Recent advances in deep learning models: a systematic literature review, accessed March 14, 2025, https://www.researchgate.net/publication/370253630_Recent_advances_in_deep_learning_models_a_systematic_literature_review
11. Is there conjectures in deep learning theory? - Mathematics Stack Exchange, accessed March 14, 2025, https://math.stackexchange.com/questions/2623846/is-there-conjectures-in-deep-learning-theory
12. Is there actually a lack of fundamental theory on deep learning? - AI Stack Exchange, accessed March 14, 2025, https://ai.stackexchange.com/questions/2996/is-there-actually-a-lack-of-fundamental-theory-on-deep-learning
13. AI’s mysterious ‘black box’ problem, explained | University of Michigan-Dearborn, accessed March 14, 2025, https://umdearborn.edu/news/ais-mysterious-black-box-problem-explained
14. Theory of Deep Learning: Generalization - Desh Raj, accessed March 14, 2025, https://desh2608.github.io/2018-07-27-deep-learning-theory-2/
15. Theoretical issues in deep networks - PNAS, accessed March 14, 2025, https://www.pnas.org/doi/10.1073/pnas.1907369117
16. [2203.10036] On the Generalization Mystery in Deep Learning - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2203.10036
17. lis.csail.mit.edu, accessed March 14, 2025, https://lis.csail.mit.edu/pubs/kawaguchi-techreport18.pdf
18. Implicit Regularization in Deep Matrix Factorization - OpenReview, accessed March 14, 2025, https://openreview.net/attachment?id=B1eI3EHxUr&name=pdf
19. [1709.01953] Implicit Regularization in Deep Learning - arXiv, accessed March 14, 2025, https://arxiv.org/abs/1709.01953
20. Chapter 7 - Implicit regularization - Chinmay Hegde, accessed March 14, 2025, https://chinmayhegde.github.io/fodl/generalization01/
21. How Neural Networks Escape Perils of Overparameterization …, accessed March 14, 2025, https://brain.harvard.edu/hbi_news/how-neural-networks-escape-perils-of-overparameterization/
22. [D] Why does overparameterization and reparameterization result in a better model? : r/MachineLearning - Reddit, accessed March 14, 2025, https://www.reddit.com/r/MachineLearning/comments/1elvkz6/d_why_does_overparameterization_and/
23. Overparameterization and Generalization Error: Weighted Trigonometric Interpolation | SIAM Journal on Mathematics of Data Science, accessed March 14, 2025, https://epubs.siam.org/doi/abs/10.1137/21M1390955
24. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers - NIPS papers, accessed March 14, 2025, http://papers.neurips.cc/paper/8847-learning-and-generalization-in-overparameterized-neural-networks-going-beyond-two-layers.pdf
25. [2006.08495] Overparameterization and generalization error: weighted trigonometric interpolation - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2006.08495
26. Optimization for deep learning: an overview - NC State ISE, accessed March 14, 2025, https://ise.ncsu.edu/wp-content/uploads/sites/9/2020/08/Optimization-for-deep-learning.pdf
27. Shaping the learning landscape in neural networks around wide flat minima - PNAS, accessed March 14, 2025, https://www.pnas.org/doi/abs/10.1073/pnas.1908636117?doi=10.1073/pnas.1908636117
28. Implicit Regularization in Deep Learning May Not Be Explainable by Norms - NIPS papers, accessed March 14, 2025, https://proceedings.neurips.cc/paper/2020/file/f21e255f89e0f258accbe4e984eef486-Paper.pdf
29. Efficiency Calibration of Implicit Regularization in Deep Networks via Self-paced Curriculum-Driven Singular Value Selection | IJCAI, accessed March 14, 2025, https://www.ijcai.org/proceedings/2024/499
30. Implicit Regularization in Deep Learning - ResearchGate, accessed March 14, 2025, https://www.researchgate.net/publication/319534368_Implicit_Regularization_in_Deep_Learning
31. A Primer on Implicit Regularization | Daniel Gissin’s Blog, accessed March 14, 2025, https://dsgissin.github.io/blog/2020/03/09/implicit_regularization.html
32. A Mechanism of Implicit Regularization in Deep Learning - OpenReview, accessed March 14, 2025, https://openreview.net/forum?id=HJx0U64FwS
33. Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning, accessed March 14, 2025, https://jmlr.org/papers/volume22/20-410/20-410.pdf
34. A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix Factorization - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2212.14150
35. [2102.09972] Implicit Regularization in Tensor Factorization - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2102.09972
36. Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization | Request PDF - ResearchGate, accessed March 14, 2025, https://www.researchgate.net/publication/339497882_Coherent_Gradients_An_Approach_to_Understanding_Generalization_in_Gradient_Descent-based_Optimization
37. An Investigation of the Interactions of Gradient Coherence and Network Pruning in Neural Networks - BYU ScholarsArchive, accessed March 14, 2025, https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=11354&context=etd
38. Understanding the Generalization Ability of Deep Learning Algorithms: A Kernelized Rényi’s Entropy Perspective - IJCAI, accessed March 14, 2025, https://www.ijcai.org/proceedings/2023/0405.pdf
39. Recent advances in deep learning theory - Fengxiang He, accessed March 14, 2025, https://fengxianghe.github.io/paper/he2020recent.pdf
40. Flat minima generalize for low-rank matrix recovery - Oxford Academic, accessed March 14, 2025, https://academic.oup.com/imaiai/article-pdf/13/2/iaae009/57366653/iaae009.pdf
41. [2203.03756] Flat minima generalize for low-rank matrix recovery - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2203.03756
42. Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale - NASA/ADS, accessed March 14, 2025, https://ui.adsabs.harvard.edu/abs/2020arXiv200307422Z/abstract
43. Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization - OpenReview, accessed March 14, 2025, https://openreview.net/pdf?id=ryeFY0EFwS
44. Towards a Simple Explanation of the Generalization Mystery in Deep Learning - Satrajit Chatterjee - blif.org, accessed March 14, 2025, https://blif.org/~satrajit/cg/
45. Making Coherence Out of Nothing At All: Measuring Evolution of Gradient Alignment, accessed March 14, 2025, https://openreview.net/forum?id=xsx58rmaW2p
46. Making Coherence Out of Nothing At All: Measuring the Evolution of Gradient Alignment, accessed March 14, 2025, https://www.semanticscholar.org/paper/Making-Coherence-Out-of-Nothing-At-All%3A-Measuring-Chatterjee-Zielinski/121b11260583626820b9e7b9416c6adbc71f887e
47. Exploring Gradient Oscillation in Deep Neural Network Training, accessed March 14, 2025, https://www.sci.utah.edu/~beiwang/publications/Mysterious_BeiWang_2023.pdf
48. An empirical study of implicit regularization in deep offline RL - OpenReview, accessed March 14, 2025, https://openreview.net/forum?id=HFfJWx60IT&noteId=bJSY2gjyie
49. Flat minima and generalization in deep learning: a case study in low rank matrix recovery, accessed March 14, 2025, https://datascience.ucsd.edu/event/flat-minima-and-generalization-in-deep-learning-a-case-study-in-low-rank-matrix-recovery/
50. What can flatness teach us: understanding generalisation in Deep Neural Networks, accessed March 14, 2025, https://towardsdatascience.com/what-can-flatness-teach-us-understanding-generalisation-in-deep-neural-networks-a7d66f69cb5c/
51. Connection between Flatness and Generalization | Tuan-Anh Bui, accessed March 14, 2025, https://tuananhbui89.github.io/blog/2024/sharpness/
52. The Generalization Mystery: Sharp vs Flat Minima - inFERENCe, accessed March 14, 2025, https://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/
53. Sharp Minima Can Generalize For Deep Nets - arXiv, accessed March 14, 2025, https://arxiv.org/pdf/1703.04933
54. (PDF) Unveiling the Structure of Wide Flat Minima in Neural Networks - ResearchGate, accessed March 14, 2025, https://www.researchgate.net/publication/357411556_Unveiling_the_Structure_of_Wide_Flat_Minima_in_Neural_Networks
55. Maryam Fazel (University of Washington) – Flat Minima and Generalization in Learning: the Case of Low-rank Matrix Recovery - Data Science Institute, accessed March 14, 2025, https://datascience.uchicago.edu/events/maryam-fazel-washington-flat-minima/
56. Flat Minima Generalize for Low-rank Matrix Recovery, accessed March 14, 2025, https://ifds.info/wp-content/uploads/2023/08/Flat_min_2023.pdf
57. [1703.04933] Sharp Minima Can Generalize For Deep Nets - arXiv, accessed March 14, 2025, https://arxiv.org/abs/1703.04933
58. Double-Descent Curves in Neural Networks: A New Perspective Using Gaussian Processes | Proceedings of the AAAI Conference on Artificial Intelligence, accessed March 14, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/29071
59. [2303.14151] Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle - arXiv, accessed March 14, 2025, https://arxiv.org/abs/2303.14151
60. Double descent - Wikipedia, accessed March 14, 2025, https://en.wikipedia.org/wiki/Double_descent
61. Double descent: understanding deep learning’s curve - Telnyx, accessed March 14, 2025, https://telnyx.com/learn-ai/double-descent-deep-learning
62. Information Bottleneck: Theory and Applications in Deep Learning - PMC, accessed March 14, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC7764901/
63. Information bottleneck method - Wikipedia, accessed March 14, 2025, https://en.wikipedia.org/wiki/Information_bottleneck_method
64. Deep Learning & Information Bottleneck - GitHub, accessed March 14, 2025, https://github.com/xu-ji/information-bottleneck
65. How Does Information Bottleneck Help Deep Learning?, accessed March 14, 2025, https://proceedings.mlr.press/v202/kawaguchi23a/kawaguchi23a.pdf
66. Special Issue : Information Bottleneck: Theory and Applications in Deep Learning - MDPI, accessed March 14, 2025, https://www.mdpi.com/journal/entropy/special_issues/information_theoretic_computational_intelligence
67. A Survey of Lottery Ticket Hypothesis - arXiv, accessed March 14, 2025, https://arxiv.org/html/2403.04861v1
68. Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks - NIPS papers, accessed March 14, 2025, https://proceedings.neurips.cc/paper_files/paper/2021/file/15f99f2165aa8c86c9dface16fefd281-Paper.pdf
69. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks - SciSpace, accessed March 14, 2025, https://scispace.com/papers/the-lottery-ticket-hypothesis-finding-sparse-trainable-1s8pjdso3i
70. Proving the Lottery Ticket Hypothesis: Pruning is All You Need - Proceedings of Machine Learning Research, accessed March 14, 2025, http://proceedings.mlr.press/v119/malach20a/malach20a.pdf
71. A simple introduction to MIT’s Lottery Ticket Hypothesis | by Devansh - Medium, accessed March 14, 2025, https://machine-learning-made-simple.medium.com/a-simple-introduction-to-mits-lottery-ticket-hypothesis-4a8404481e26
72. A Survey of Optimization Methods for Training DL Models: Theoretical Perspective on Convergence and Generalization | OpenReview, accessed March 14, 2025, https://openreview.net/forum?id=TDujguk7NG
73. [PDF] Recent advances in deep learning theory - Semantic Scholar, accessed March 14, 2025, https://www.semanticscholar.org/paper/Recent-advances-in-deep-learning-theory-He-Tao/c80934b0d1708be2eee8c2b5a0385fea34af9e51