Machine Learning causing science crisis?
 2 Replies
Patrick Ng
Basic Member
Posts:151
18 Feb 2019 09:27 PM
    "Machine-learning techniques used by thousands of scientists to analyse data are producing results that are misleading and often completely wrong."

    Link: https://www.bbc.com/news/...environment-47267081

    According to Professor Allen of Rice University, Houston, “answers they come up with are likely to be inaccurate or wrong because the software is identifying patterns that exist only in that data set and not the real world.”

    Crisis? Not quite.

    Perspective - as I understand it, her observations are closely associated with the machine-learning problem of overfitting vs. generalization: when training is done on a dataset, we get a really good fit, but when that ML model is applied to a new, unseen dataset, we get poor predictions (i.e., the model does not generalize well).
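
    A minimal sketch of that overfit-then-fail pattern, in Python with scikit-learn (the data and the degree-15 polynomial are hypothetical, chosen only to force the effect):

```python
# Overfitting vs. generalization: near-perfect fit on training data,
# poor prediction on unseen data drawn the same way. Synthetic example.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)  # noisy "measurements"
x_train, y_train = x[:20].reshape(-1, 1), y[:20]
x_test, y_test = x[20:].reshape(-1, 1), y[20:]

# Degree 15 is far too flexible for 20 points: it memorizes the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(x_train)))  # ~1.0
print("test  R^2:", r2_score(y_test, model.predict(x_test)))    # much worse
```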

    That is certainly a concern and a known challenge. It is not unlike solving an inverse problem from a set of data measurements: we may get stuck in one of the local minima and never find the global minimum solution. Under certain conditions there are techniques to get around this, each with associated trade-offs.
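
    For the inverse-problem analogy, a toy sketch (the misfit function is made up): a plain gradient-based search stalls in a local minimum, while a global method such as SciPy's basinhopping has a better chance, at extra cost:

```python
# Toy non-convex "data misfit": local search from m=0 stops at a local
# minimum near m=-0.52; basinhopping finds the global one near m=1.57.
import numpy as np
from scipy.optimize import minimize, basinhopping

def misfit(m):
    return np.sin(3 * m[0]) + 0.1 * (m[0] - 2.0) ** 2

local = minimize(misfit, x0=[0.0])
best = basinhopping(misfit, x0=[0.0], niter=200)

print("local search :", local.x, local.fun)
print("basinhopping :", best.x, best.fun)
```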

    Interpretation? It is a matter of learning with machine, and therefore not yet a crisis in oil & gas.

    A better approach? Instead of experimenting with different ML algorithms, maybe we experiment with science-based ML models (e.g., analytical expressions): learn from the underlying physics (including, if necessary, what data to acquire) and calibrate with "strong" data. With compounding growth of knowledge over a short period of time, we can avert the crisis.
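
    As a sketch of what I mean (the decline-curve form and all numbers are hypothetical), calibrating an analytical physics expression with data rather than training a black box:

```python
# Fit the parameters of a physics-based analytical model, here an
# exponential decline curve q(t) = q0 * exp(-d * t), to noisy data.
import numpy as np
from scipy.optimize import curve_fit

def decline(t, q0, d):
    return q0 * np.exp(-d * t)

rng = np.random.default_rng(1)
t = np.linspace(0, 24, 25)                    # months (synthetic)
q = decline(t, 1000.0, 0.08) * (1 + rng.normal(0, 0.05, t.size))

(q0_hat, d_hat), _ = curve_fit(decline, t, q, p0=[800.0, 0.05])
print(f"calibrated q0 = {q0_hat:.0f}, d = {d_hat:.3f}")
# Because the model's form encodes the physics, it extrapolates beyond
# the training window far more sensibly than a pure pattern-matcher.
```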

    Your thoughts and/or experience?
    David Baker
    New Member
    Posts:1
    19 Feb 2019 01:33 PM
    Professor Allen makes a good point, and so do you, Patrick… But we explorationists have been dealing with this “crisis” ever since we first tried to predict the chance of drilling a dry hole. There are several factors we must understand when applying these techniques.

    Physics is one. We are not predicting tweets. There may be a human-nature model at work when predicting sales of the next sneaker based on sentiment in tweets after the big game, but our models are generally predicting some physical phenomenon and so must also respect the physics that may constrain the event. Many of the “off the shelf” models we may employ know nothing of the underlying physics that generated the data. We must either fashion our features in such a way that the physical nature of the problem is respected (not so easy with the canned models) or, more likely, simply understand that the model may not be as good as the data we use to test it suggests. Anyone who has tried to drill a 1-in-6 Minnelusa shot in the Powder knows well that their model may not seem quite right after the first five dry holes! We must understand and accept the risk of using any given model, be it geologic or machine learning.
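
    A quick back-of-envelope check on that 1-in-6 example (a minimal sketch; the independence assumption is mine): even if the model is exactly right, five straight dry holes are not that surprising:

```python
# If each well independently succeeds with probability 1/6, the chance
# of five consecutive dry holes is (5/6)^5, about 40 percent, so five
# misses alone are weak evidence that the model is wrong.
p_success = 1 / 6
print(f"P(5 dry holes in a row) = {(1 - p_success) ** 5:.2f}")  # 0.40
```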

    Coverage is another. Though bias may be a better term, I use coverage, as bias has many different connotations: there is human bias when evaluating results, and there is model bias versus variance in model quality. I mean coverage in the sense of data coverage. Here again we geoscientists know the paradox well. We may look at a basin and try to predict the probability of a new field; in the example above, how did I come up with the 1-in-6 number? Many statistical and machine learning models make assumptions about the data distribution. Suppose I have a box and am told there are red balls, blue balls and green balls in it, and I am to design an experiment to determine how many red balls there are relative to the blue ones. I might randomly select a small set from the box and estimate the distribution, and so might a machine learning model. Unlike the ball box, which I know holds a finite number of balls of known types, not only do I not know how many fields there are, I have no idea of their different sizes, yet I venture a guess at the probability of drilling a dry hole. I must make an “estimate” of a probability, the dry-hole risk, because I don’t have samples (wells) that I know cover the whole population.
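
    A small simulation of the ball-box experiment (all counts hypothetical) shows how much a small draw can misstate the ratio; the exploration version is worse because the population itself is unknown:

```python
# Estimate the red:blue ratio from a random draw of 10 balls out of a
# box whose true contents are 30 red, 60 blue, 10 green.
import numpy as np

rng = np.random.default_rng(2)
box = np.array(["red"] * 30 + ["blue"] * 60 + ["green"] * 10)

for trial in range(3):
    sample = rng.choice(box, size=10, replace=False)
    print("draw", trial, ":",
          int(np.sum(sample == "red")), "red vs",
          int(np.sum(sample == "blue")), "blue  (true 30 vs 60)")
```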

    Professor Allen speaks of reproducibility. A machine learning model can generally only predict what it has seen. Unlike the geologist, who can dream: if we are predicting a distribution from, say, a bimodal dataset and have only sampled one of the modes, the model will most likely be wrong when our unseen data come from the other mode. Our original samples did not give enough coverage for us to understand all future outcomes. If another researcher samples the other mode, the result will not reproduce.

    This leads to the lots-of-data-and-better-outcomes paradox. It is generally believed that the more data we have, the better the model's outcome. But as we see above, if we do not sample the whole distribution, no amount of data will ever help predict the other hump if we never sample it. This goes hand in hand with the second part of the paradox: more data does not change the outcome (see the sketch below).
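
    A sketch of both points at once (synthetic, bimodal data; the regime change at x=5 is my own contrivance): a model trained on one mode fails on the other, and piling on more data from the same mode does not help:

```python
# Train on samples around x=2 (mode A), test on samples around x=8
# (mode B), where the underlying relationship changes regime at x=5.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)

def make_mode(center, n):
    x = rng.normal(center, 0.5, n).reshape(-1, 1)
    y = np.where(x[:, 0] < 5, 2 * x[:, 0], 40 - 2 * x[:, 0])
    return x, y

x_b, y_b = make_mode(8.0, 200)        # the unseen "other hump"

for n in (50, 5000):                  # 100x more mode-A data changes nothing
    x_a, y_a = make_mode(2.0, n)
    model = LinearRegression().fit(x_a, y_a)
    print(f"n={n:5d}  MAE on mode B:",
          round(mean_absolute_error(y_b, model.predict(x_b)), 1))
```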

    This could also be called the seismic paradox. In the Hardeman Basin of north-central Texas, finding the elusive yet quite prolific Chappel pinnacle reefs is very hard. Early on, geologists did not do so well, until single-shot seismic: if you hit the side of a reef, a few shots might tell you in which direction the top of the reef lay, and success rates got a little better. So too with 2D: a reef or two might be spotted on a line, and success improved a little. Then came 3D data. The premise was that if we shot 3D all over the basin, we would find even more reefs. And the paradox: more data did not make the success rate jump. The fallacy was, and is, that shooting more data did not make more reefs grow! God only made so many, and no amount of acquired data will change that. Most of the reefs had been identified long before 3D.

    One must know the waters into which one is dipping one's toe: know the constraints of the physics, know that you are sampling the whole population, and know that oversampling will not change the outcome.
    Patrick Ng
    Basic Member
    Posts:151
    20 Feb 2019 01:55 PM
    Thanks for David’s post, in particular for alerting us to the “lots-of-data and better outcomes paradox”.

    As I understand it, one of the tenets of ML is that given enough data (think Big), a simple ML algorithm (say, a single-layer neural network) can outperform a more conventional approach (like regression).
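
    A rough way to test that tenet (synthetic data with scikit-learn; the network and dataset sizes are my own choices): compare a single-hidden-layer network with plain linear regression as the training set grows:

```python
# On a nonlinear target, linear regression plateaus while even a small
# single-hidden-layer network keeps improving as data grows.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)

def sample(n):
    x = rng.uniform(-3, 3, (n, 1))
    return x, np.sin(x[:, 0]) + rng.normal(0, 0.1, n)

x_test, y_test = sample(1000)
for n in (50, 500, 5000):
    x_tr, y_tr = sample(n)
    lin = LinearRegression().fit(x_tr, y_tr)
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=0).fit(x_tr, y_tr)
    print(f"n={n:4d}  linear R^2={r2_score(y_test, lin.predict(x_test)):.2f}"
          f"  net R^2={r2_score(y_test, net.predict(x_test)):.2f}")
```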

    Keep in mind that Mother Nature doesn’t play dice, and the underlying principles of physics govern what we eventually learn through the drill bit. Therefore, a useful rule of thumb is ...

    "Learning with machine beats ML alone".