HOW TO WRITE A GREAT MLHC PAPER

Clinical Abstracts

Clinical abstracts must be written by a clinician first author who will present the work. Clinicians are broadly defined as those with a doctor of medicine (M.D.), nursing (R.N.), or pharmacy (PharmD) degree, or other allied health professionals, who spend a portion of their time in direct patient care. Any submission that does not meet these criteria will be rejected. Beyond that, we are interested in open problems, datasets, preliminary analyses, and demonstrations, as long as they are well motivated with respect to machine learning in the context of healthcare.

Software

We will prioritize software that contributes to the community. Considerations include ease of access (e.g., open source) and level of testing and readiness. Demonstrations of proprietary or closed systems may be more appropriate for the clinical abstract track (see above).

Full Papers 

This document is intended to provide guidance -- both for authors and reviewers -- about what MLHC is looking for in papers. Our goal is to highlight papers that provide generalizable insights about machine learning in the context of health. Your paper should teach us something new, not just wow us with numbers.

1. Blind your paper.

This sounds basic, but we’re seeing a large number of submissions with author names -- either in the author block or the headers. Going forward, this will be grounds for automatic rejection. Don’t include your info! (Note: we understand that sometimes the cohort description makes it relatively easy to guess the authors or institutions. We’re not talking about that here; we’re talking about actually including your names as authors.)

2. Make sure the insights we should gain about machine learning in the context of healthcare are clear to both clinicians and computational scientists.

What would a clinician tweet to their friends about your paper?  A computational researcher? We can promise you that “Using 5 nodes in an LSTM rather than 4 yields a 0.03 gain in AUC on a never-previously-analyzed earwax dataset, statistically significant based on the bootstrap” isn’t it!  We want to feel like we’ve learned something new from reading your paper that we can apply to our own research and practice.

The insight must be related to machine learning in the context of healthcare, but could come about in many different ways: 

  • A novel machine learning method that addresses a need in analyzing health data (e.g., ways to handle missingness or improve calibration in order to enable certain kinds of predictions).  The method doesn’t have to be exclusively useful for healthcare, but it should address a real healthcare need.

  • A replication study that tests existing methods, or small advances on existing methods, in a health context and provides insight into why some methods work better than others and under what circumstances different methods are appropriate.

  • An evaluation of a machine learning deployment or pilot study in a health setting, which provides insights about what worked well and what did not.

Specifically, being able to make slightly, or even substantially, better predictions on an existing dataset without explanation is not sufficient.  It is also insufficient to have been the first to apply existing methods to your clinical problem without a clear explanation of why those methods were useful for those data or may be useful for other data. A machine learning researcher is going to care about what they can take from your paper to guide their future ML work.

Similarly, a large number of papers have failed to meet our bar because the clinical reviewer could not identify the clinical relevance of the work.  Sometimes this was because the problem was a pure machine learning problem with nothing specific to the healthcare context. Much more often, the authors failed to describe, in terms relevant to a clinician audience, how the subproblem they were solving would help solve an important clinical problem. Applying a machine learning method to a clinical dataset is not sufficient to establish clinical relevance.  Neither is describing a method for handling missing data. Understanding how to make a certain kind of prediction or decision that might impact care, in the presence of missing data, might make clinicians excited…

Starting in 2020, you will be required to list your generalizable insights -- things we should learn from reading the article -- at the end of your introduction.  Use this opportunity to set the stage for what we should get excited about!

3. Substantiate your insights.

Quality of Technical Description.  Among papers that did claim to teach us something, the most common failure mode was insufficient description.  Recall that MLHC has computational as well as clinical reviewers; while it is essential that the main ideas of the problem and method, as well as the analysis and discussion, are presented in a way that clinicians can understand, it is also essential that there is sufficient technical detail for a computational reviewer to feel confident about the quality of the work.  Thus, perhaps in appendices, include your technical details -- how high-dimensional were your feature sets?  What kinds of optimization methods were used? How were hyperparameters tuned? It’s fine to reference other papers for details if certain parts are standard, but make absolutely clear what you did!
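
As a purely illustrative sketch of the level of detail we mean -- the dataset, model family, search space, validation scheme, metric, and seed below are hypothetical stand-ins, not a prescribed recipe -- an appendix snippet like the following lets a computational reviewer see exactly how hyperparameters were tuned:

    # Hypothetical appendix-style tuning description: state the model family, the full
    # search space, the validation scheme, the selection metric, and the random seed.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)  # stand-in for a real cohort

    search = GridSearchCV(
        estimator=LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},                      # report the whole grid, not just the winner
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # and the validation scheme
        scoring="roc_auc",                                             # and the selection metric
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))

The point is not the particular library or model; it is that another researcher could rerun your tuning from what you wrote.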

Relatedly, the paper must include sufficient detail for the clinician to feel confident about the quality of the evaluation.  That includes details about how the cohort (including cases and controls) was chosen, pre-processing details such as how missing data and censoring were handled, and choices of evaluation measures.  (Recall that our clinical reviewers come from all areas of medicine and may not be experts in the specialty of your work.)

Evaluation.  Finally, there was a group of papers that seemed to be missing convincingly strong baselines and evaluation.  Certain areas of machine learning are fairly crowded: a lot of work has been done on making predictions with (and of) missing data, and on using CNNs to interpret images.  In such cases, it is essential that the authors are aware of and cite this earlier work, compare to it, and convince us that they are not reinventing the wheel.

More broadly, we emphasize that there are several ways toward high-quality evaluation:

  • Theorems: You may be able to prove that a certain method will perform well given certain assumptions.  Be sure to discuss the reasonableness and limitations of the assumptions.

  • Synthetic data experiments and illustrative examples: Suppose that the authors claim that their method or a set of methods does really well when the data have some property X.  That claim can be at least empirically demonstrated by creating datasets in which the level of property X varies independently of the datasets’ other properties; the experiments should be designed to demonstrate that it is indeed property X that causes the difference in performance (see the sketch after this list).

  • Ablation studies:  Suppose that the authors claim that algorithms with property X do better than algorithms without it.  Then they can support that claim by running various algorithms with and without the new property X.  Make sure the algorithms are otherwise on equal footing (parameter choices, tuning budget, etc.).
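
To make the last two bullets concrete, here is a minimal sketch -- every dataset, number, and design choice is illustrative, not taken from any submission -- in which the hypothetical property X is the missingness rate, varied in isolation, and the ablation toggles a single component (the imputation strategy) while holding everything else fixed:

    # Illustrative synthetic-data experiment plus ablation: vary one property of the data
    # (the missingness rate) in isolation, and toggle one component (the imputer) only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)

    for missing_rate in [0.0, 0.2, 0.4, 0.6]:                 # property X, varied on its own
        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        mask = rng.random(X.shape) < missing_rate
        X_miss = np.where(mask, np.nan, X)
        X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, test_size=0.3, random_state=0)

        for strategy in ["mean", "constant"]:                 # ablation: imputation choice only
            model = make_pipeline(
                SimpleImputer(strategy=strategy, fill_value=0.0),
                LogisticRegression(max_iter=1000),            # identical model and settings in both arms
            )
            model.fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            print(f"missing_rate={missing_rate:.1f}  imputer={strategy:<8s}  AUC={auc:.3f}")

If the two arms diverge as the missingness rate grows but match at zero, that is evidence the claimed property -- and not some confound -- is doing the work.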

Some combination of detailed post-hoc analyses, qualitative studies, and pilots can also provide insight into how and when the innovation helps solve a real problem.  We all know that health data is extremely messy. Your evaluation should convince us that you are not just fitting to the noise.
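
One simple way to make the "not just fitting to the noise" case quantitative -- a minimal sketch assuming only predicted scores and labels on a held-out test set; the function name and toy data are ours, not a required procedure -- is to report a bootstrap confidence interval around your headline metric rather than a single point estimate:

    # Percentile bootstrap CI for held-out AUROC, so readers can judge whether an
    # apparent gain is larger than the noise in the evaluation itself.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), len(y_true))   # resample test cases with replacement
            if len(np.unique(y_true[idx])) < 2:               # skip resamples with a single class
                continue
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
        lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return roc_auc_score(y_true, y_score), (lo, hi)

    # Toy labels and scores standing in for a real held-out set.
    y_true = np.tile([0, 1, 0, 1, 1, 0, 1, 0, 1, 0], 20)
    y_score = np.clip(y_true * 0.6 + np.random.default_rng(1).normal(0.3, 0.25, y_true.size), 0, 1)
    point, (lo, hi) = bootstrap_auc_ci(y_true, y_score)
    print(f"AUROC={point:.3f} (95% CI {lo:.3f}-{hi:.3f})")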

Limitations and Discussion.  No method is perfect.  Telling us when and why your approach should -- and should not -- work, substantiated with evidence such as the above, will boost your credibility!

Starting in 2020, reviewers and ACs will explicitly be coached to determine whether the main body of your paper substantiates the insights claimed in the introduction.