ABSTRACT: A key hurdle for neural networks learning from sequential data is ensuring good signal propagation over long timescales while keeping the system expressive enough to implement complex computations. The brain has evolved to tackle this problem at different scales, and deriving architectural inductive biases from these strategies can help design better AI systems.
In this talk, I will present two examples of such inductive biases for recurrent neural networks, with and without self-attention. In the first, we propose a novel connectivity structure based on "hidden feed-forward" features, using an efficient parametrization of connectivity matrices based on the Schur decomposition. In the second, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the vanishing-gradient problem when capturing long-term dependencies.
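As a rough illustration of the first idea, the sketch below builds a small recurrent connectivity matrix in Schur form, W = Q T Q^T, with Q orthogonal and T upper triangular. It is a minimal sketch under stated assumptions, not the parametrization used in the talk: the eigenvalues are taken to be real, Q is a single Householder reflection, and all function and variable names are hypothetical. Under one plausible reading, the diagonal of T sets the eigenvalues of W, while the strictly upper-triangular entries act as feed-forward couplings that are hidden behind the change of basis Q.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def householder(v):
    """Orthogonal reflection Q = I - 2 v v^T / (v^T v)."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]

def schur_connectivity(v, diag, feedforward):
    """Assemble W = Q T Q^T (hypothetical helper, for illustration only).

    diag:        real eigenvalues placed on the diagonal of T
    feedforward: couplings; only entries strictly above the diagonal are used
    v:           vector defining the orthogonal basis Q (assumption)
    """
    n = len(diag)
    T = [[diag[i] if i == j else (feedforward[i][j] if j > i else 0.0)
          for j in range(n)] for i in range(n)]
    Q = householder(v)
    W = matmul(matmul(Q, T), transpose(Q))
    return W, Q, T

# Illustrative values only: three units with eigenvalue 0.9 and a feed-forward chain.
diag = [0.9, 0.9, 0.9]
feedforward = [[0.0, 0.5, 0.2],
               [0.0, 0.0, 0.5],
               [0.0, 0.0, 0.0]]
W, Q, T = schur_connectivity([1.0, 2.0, 2.0], diag, feedforward)

# Rotating W back into the Schur basis recovers the triangular structure.
recovered = matmul(matmul(transpose(Q), W), Q)
max_err = max(abs(recovered[i][j] - T[i][j]) for i in range(3) for j in range(3))
print("max reconstruction error:", max_err)
```

In the neuron basis W looks like a dense matrix, but in the basis given by Q it is a chain of units with controlled eigenvalues and one-directional couplings. In a trainable model, Q and T would be learned parameters rather than fixed values.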
Contact firstname.lastname@example.org for the password.