Q&A: Why Must You Know How to Handle Data Manually Before Doing It Automatically?

Cross-posted from here.

I heard that before you can handle data automatically, you have to know how to handle it manually.

Why is it impossible to just find some machine learning algorithm to solve the problem?


In theory, it is possible to solve a problem by throwing an ML algorithm at it without knowing how to solve it manually.

But in practice, your ML solution usually doesn’t work right out of the box, and you have to debug it.

To debug, you need to be able to identify differences between what it’s doing and what it “ought” to be doing, which means you need to understand the problem and its solution well enough to run sanity checks on your model and the underlying data.
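As a rough illustration of what such sanity checks might look like, here is a minimal sketch in plain Python. The toy inputs, labels, and predictions are made up for illustration, and the helper names (`majority_baseline_accuracy`, `accuracy`, `disagreements`) are hypothetical rather than taken from any library. The idea is to compare the model against a trivial "always guess the most common label" baseline and to pull out the cases where the model disagrees with the labels, so a person who understands the problem can inspect them by hand.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy you would get by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def disagreements(inputs, predictions, labels):
    """Cases where the model and the label differ, for manual review."""
    return [(x, p, y) for x, p, y in zip(inputs, predictions, labels) if p != y]


# Hypothetical toy data: names of images, their labels, and a model's output.
inputs = ["img_a", "img_b", "img_c", "img_d", "img_e"]
labels = ["cat", "cat", "cat", "dog", "cat"]
predictions = ["cat", "cat", "cat", "cat", "cat"]

baseline = majority_baseline_accuracy(labels)
model_acc = accuracy(predictions, labels)

# A model that does no better than the trivial baseline deserves a closer look.
suspicious = model_acc <= baseline
to_review = disagreements(inputs, predictions, labels)
```

Here the model's accuracy looks respectable in isolation, but it merely matches the baseline because it predicts "cat" for everything. Spotting that, and judging whether each entry in `to_review` is a model error or a labeling error, requires knowing how to do the task yourself.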

Additionally, a common solution approach involves compiling a data set of human decisions and using ML to interpolate those decisions, i.e. predict what decision a human would make on cases not explicitly included in the data set.

For instance, when creating an image classifier, the most common solution approach is to

  1. gather a data set of images,
  2. manually label the images with classifications (or pay other people to do the manual labeling if it's a ton of work or requires domain expertise that you don't have), and then
  3. train an ML model to predict what classification a human would manually assign, given any arbitrary image input.

Step 2 above requires that humans be able to perform the task manually (even if they can’t explain exactly how they do it).
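The three steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a realistic image classifier: each "image" is reduced to a made-up two-number feature vector, the labels are invented, and the model is a simple nearest-neighbor lookup (chosen here only because it makes the "interpolate human decisions" idea easy to see). The names `images`, `human_labels`, and `predict` are hypothetical.

```python
import math

# Step 1: gather a data set. Each "image" is a stand-in feature vector,
# e.g. (average brightness, edge density). These numbers are made up.
images = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

# Step 2: labels that a person assigned manually, one per image.
human_labels = ["sky", "sky", "forest", "forest"]


# Step 3: "train" a model that predicts the label a human would assign.
# A 1-nearest-neighbor model simply copies the human label of the most
# similar example it has seen.
def predict(features, train_x=images, train_y=human_labels):
    nearest = min(
        range(len(train_x)),
        key=lambda i: math.dist(features, train_x[i]),
    )
    return train_y[nearest]


# An image that was never labeled by a human.
new_image = (0.7, 0.3)
predicted = predict(new_image)
```

Every prediction this model makes traces back to a label from step 2. If nobody could perform the labeling by hand, there would be nothing for the model to interpolate.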