The growth of computational resources in the natural sciences motivates the use of machine learning for automated scientific discovery. However, unstructured empirical datasets are often high dimensional, unlabeled, and imbalanced. Discarding irrelevant (i.e., noisy and information-poor) features is therefore essential for the automated discovery of governing parameters in scientific environments. To address this challenge, we present Gaussian Stochastic Gates (STG), which rely on a probabilistic relaxation of the L0 norm (the number of selected features). By applying the stochastic gates to a neural network's input layer, we derive a flexible, fully differentiable model that simultaneously identifies the most relevant features and learns complex nonlinear models. The STG neural network outperforms state-of-the-art feature selection methods, both in predictive power and in its ability to identify the correct subset of informative features. The model has been applied successfully to important biological tasks, such as fitting a Cox proportional hazards model and differential expression analysis on data from HIV and melanoma patients. Next, using a linear model, we provide a theoretical basis for optimizing the STG objective with small batches (i.e., stochastic gradient descent, SGD). In particular, we present an approximation bound for estimating an unknown signal from noisy observations. Finally, we develop an extension of the STG model for unsupervised feature selection. The new model is trained to select features that are highly correlated with the leading eigenvectors of a gated graph Laplacian. The gating mechanism allows us to re-evaluate the Laplacian for different subsets of features and to unmask informative structures buried by nuisance features. We demonstrate that the proposed approach outperforms several unsupervised feature selection baselines.
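
To make the gating idea concrete, the following is a minimal sketch of one way a Gaussian stochastic gate can be written. It is not taken from the abstract itself: it assumes each gate is a clipped, mean-shifted Gaussian, z_d = clip(mu_d + eps_d, 0, 1) with eps_d ~ N(0, sigma^2), so that the expected number of open gates is the sum of Phi(mu_d / sigma), where Phi is the standard normal CDF. The function names, the values of `mu` and `sigma`, and the example input are all hypothetical and chosen only for illustration.

```python
import math
import random


def gate_sample(mu, sigma, rng):
    """Draw one relaxed gate per feature: clip(mu_d + eps_d, 0, 1).

    Assumption: eps_d ~ N(0, sigma^2), independently for each feature.
    """
    return [min(1.0, max(0.0, m + rng.gauss(0.0, sigma))) for m in mu]


def expected_active_gates(mu, sigma):
    """Relaxed L0 penalty: sum over features of P(mu_d + eps_d > 0).

    For Gaussian noise this probability equals Phi(mu_d / sigma).
    """
    return sum(
        0.5 * (1.0 + math.erf(m / (sigma * math.sqrt(2.0)))) for m in mu
    )


def gated_input(x, z):
    """Multiply each input feature by its gate before the first layer."""
    return [xi * zi for xi, zi in zip(x, z)]


rng = random.Random(0)       # fixed seed so the sketch is reproducible
mu = [0.5, 0.5, -2.0]        # hypothetical gate parameters, one per feature
sigma = 0.5                  # hypothetical noise scale
x = [1.0, -3.0, 2.0]         # hypothetical input sample

z = gate_sample(mu, sigma, rng)
x_gated = gated_input(x, z)
penalty = expected_active_gates(mu, sigma)
```

In a full model under these assumptions, `mu` would be learned jointly with the network weights by gradient descent, and `penalty` would be added to the training loss with a regularization weight, so that gates of uninformative features are pushed toward zero.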