In classic model-based inversion, we learn to live with non-unique solutions (i.e., there is more than one model that fits the data under criteria such as minimum mean squared error). One safeguard is calibration (e.g., comparing seismic-inverted model properties against those from well logs or drill bits) to determine whether the resulting model makes sense.
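To make the non-uniqueness concrete, here is a minimal numpy sketch, not any particular seismic workflow: a toy underdetermined linear problem in which two very different models reproduce the same data exactly, because anything in the null space of the forward operator is invisible to the data.

```python
import numpy as np

# Toy forward operator: 3 data points, 5 model parameters (underdetermined).
rng = np.random.default_rng(0)
G = rng.normal(size=(3, 5))
m_true = rng.normal(size=5)
d_obs = G @ m_true

# Minimum-norm solution via the pseudo-inverse.
m1 = np.linalg.pinv(G) @ d_obs

# Add any vector from the null space of G: the data fit is unchanged.
null_space = np.linalg.svd(G)[2][3:].T          # basis for null(G)
m2 = m1 + null_space @ np.array([5.0, -3.0])    # a very different model

print(np.allclose(G @ m1, d_obs), np.allclose(G @ m2, d_obs))  # True True
print(np.linalg.norm(m1 - m2))   # yet the two models differ substantially
```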
Likewise, we can have different neural network configurations (e.g., number of hidden layers, number of neurons within each layer, even different activation functions) that meet the minimum mean squared error or whatever distance criterion we choose. Again, the safeguard against non-uniqueness is calibration, where first principles come in to differentiate the appropriate model from the less desirable ones. Because of the architecture (hidden layers), I am not aware of an easy, readily understood visual (e.g., crossplots) to tell how one configuration performs better than another at the intermediate steps. The default is to QC the final output, the difference between predicted and actual, which shows decreasing errors with an increasing number of iterations or training runs. What if we dig deeper?
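As an illustration only (a toy regression in scikit-learn, not a real seismic dataset; the configuration names are hypothetical stand-ins), two quite different architectures can reach a comparably small training MSE, so the final-error QC alone cannot tell them apart:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Toy regression problem standing in for a rock-property prediction task.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

configs = {
    "shallow_tanh": dict(hidden_layer_sizes=(32,), activation="tanh"),
    "deep_relu":    dict(hidden_layer_sizes=(16, 16, 16), activation="relu"),
}

for name, cfg in configs.items():
    net = MLPRegressor(max_iter=2000, random_state=0, **cfg)
    net.fit(X, y)
    mse = mean_squared_error(y, net.predict(X))
    # Both configurations reach a comparably small training MSE,
    # yet they are different models: the final error alone does not
    # distinguish them.
    print(f"{name:12s}  training MSE = {mse:.4f}")
```

The loss_curve_ attribute of each fitted network is the familiar decreasing-error-with-iterations QC mentioned above, and it, too, can look equally healthy for both configurations.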
A recent paper (June 2017) attempted to answer two related fundamental questions numerically. To paraphrase:
Q1: What distinguishes a good solution (model) from a bad one?
Q2: Why do randomly initialized weights converge to a better solution?
https://arxiv.org/pdf/1706.10239.pdf
P.S. In seismic data processing, recall that 5D interpolation performs best with randomized input; the same goes for initializing neural network weights with random numbers.
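On that last point, a small sketch of what Q2 looks like in practice (reusing the toy data from the previous example; scikit-learn again, purely illustrative): five different random weight initializations end up at visibly different weights, yet all reach a similarly small final loss.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Same toy data as before (a hypothetical stand-in for an attribute-to-property task).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

final_losses, first_layer_weights = [], []
for seed in range(5):
    # random_state controls the random initialization of the weights.
    net = MLPRegressor(hidden_layer_sizes=(16, 16), activation="relu",
                       max_iter=2000, random_state=seed)
    net.fit(X, y)
    final_losses.append(net.loss_)
    first_layer_weights.append(net.coefs_[0].ravel())

# Different random starts land at different weight vectors...
spreads = [np.linalg.norm(a - b) for a in first_layer_weights for b in first_layer_weights]
print("max weight difference:", max(spreads))
# ...yet the final losses are all similarly small.
print("final losses:", np.round(final_losses, 4))
```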
Food for thought - just as we use what we know about geology to specify the model representation in the classic model-based approach, how about sharing your experience or use cases of applying domain knowledge and first principles in training neural networks?