Beginner to Data Scientist in 25 months: Every Project and What I Learnt


Learning to code as a full-time doctor 🩺

At the start of 2018, I had recently graduated from medical school. I knew basic web development and not much more.

Since mid-2020, I’ve been working full-time as a data scientist, analysing Big Data with Spark, SQL and Python (for a company with $300 million annual revenue).

I built my skills alongside a full-time job as a doctor, by doing courses and projects in my free moments.

Here I’m going to share every course and project that took me from zero to a full-time technical role in just over 2 years.

📋 What’s the use of a project list?

This isn’t a prescriptive list. Nor is it a ‘how-to guide’.

Instead, I hope to:

  1. Give inspiration for your own project ideas
  2. Give insight into the journey of building data science + coding skills

I’ve shared project details, timelines and the rough amount of time spent on each. Of course, rate of progress depends on many factors (the amount of time you can/want to put in is key 🔑). I did this alongside a fairly demanding full-time job, so I obviously could have made faster progress. But it was high on my priority list outside of work, so it could also have been slower.

🎡 Themes I noticed

Making this list helped me identify two themes:

  1. I started with online courses but shifted to projects as soon as possible. (I’m a big believer that projects are where the real learning happens.)
  2. The projects got cooler and more complicated as time progressed. While each project builds your skills, it also helps you identify (i) what’s possible and (ii) what’s interesting.

🛣 THE JOURNEY: FROM BEGINNER TO NOW

(1) The Javascript Road Trips 1, 2 and 3 — CodeAcademy [COURSE]

  • Date Completed: Jan — Feb 2018
  • Time Committed: ~30 hours (roughly 2 evenings per week for 4-6 weeks)
  • Link: Here (now on Pluralsight).

When starting off, I had no idea where to begin. I wasn’t even sure what coding language to learn.

So I googled something along the lines of “Best coding language to learn 2018” and Javascript and Python seemed the most consistent top answers.

I decided to start with Javascript, looked for a course online and found this one.

Main takeaway(s):

  • It’s more important to start learning than to start learning the “right” language. In retrospect, I should have started with Python (I have used it way more since). But this course taught me so much about object-oriented programming, which is relevant for many languages, so it absolutely wasn’t a waste of time.
  • Projects help you learn and remember. Looking back, the main principles that I still remember from this course are those I applied to later projects. A lot of the knowledge slipped out of my brain (“use it or lose it”).

🐍 (2) Python for Data Science — DataCamp [COURSE]

  • Date Completed: March 2018
  • Time Committed: ~30 hours (roughly 2 evenings per week for 4–6 weeks)
  • Link: Courses 1, 2 and 4 from this learning track.

This was a great course for introducing the basics of coding for data science: numpy, pandas and matplotlib, which I still use heavily to this day.

Main takeaway(s):

  • When learning a new coding language or library, it’s important to build a reference of the functions you learn. I’ve lost count of the number of times I forgot and re-remembered (via Google) how to index into pandas DataFrames. The moment you have somewhere to refer to each time (whether that’s your own notes or a ‘CheatSheet’ like this), life gets a whole lot easier.
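
As an illustration of the kind of note worth keeping, here is a minimal pandas indexing reference. The DataFrame, its column names and its values are made up for this sketch, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame(
    {"age": [34, 71, 58], "drug": ["A", "B", "A"]},
    index=["p1", "p2", "p3"],
)

ages = df["age"]                  # one column, returned as a Series
row_by_label = df.loc["p2"]       # one row, selected by index label
row_by_position = df.iloc[0]      # one row, selected by integer position
cell = df.loc["p3", "age"]        # single value: row label + column name
over_50 = df[df["age"] > 50]      # boolean mask keeps only matching rows
subset = df.loc[df["drug"] == "A", ["age"]]  # filter rows and pick columns in one step
```

The main thing to remember: `.loc` works with labels, `.iloc` works with integer positions, and plain square brackets select columns or apply a boolean mask.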

💬 (3) Text Analysis Tools [PROJECT]

  • Date Completed: April 2018
  • Time Committed: ~20 hours (roughly 2 evenings per week for 4 weeks)
  • Link: I shared what I can on GitHub here.

While working part-time at a health-tech start-up in south London, I had the idea to make tools for analysing text. The company provides home care and they had a lot of unstructured text reports from visits.

I wrote code to perform various basic functions, such as (i) splitting long reports into individual sentences, (ii) counting the frequency of words to understand common topics and (iii) creating a .txt file that could be used for a language model (a later project).
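
A rough sketch of what those three functions can look like, using only the Python standard library. This is not the original project code: the function names and the sample reports are invented for illustration, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def split_sentences(report):
    """Split a report into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def word_frequencies(sentences):
    """Count lower-cased words across all sentences."""
    words = []
    for sentence in sentences:
        words.extend(re.findall(r"[a-z']+", sentence.lower()))
    return Counter(words)


def write_corpus(sentences, path):
    """Write one sentence per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences))


# Hypothetical example reports
reports = [
    "Client ate well. Client seemed tired!",
    "Medication given. Client slept.",
]
sentences = [s for report in reports for s in split_sentences(report)]
freq = word_frequencies(sentences)
print(freq.most_common(1))  # [('client', 3)]
```

Calling `write_corpus(sentences, "corpus.txt")` would then produce the .txt file. A rule this simple will trip over abbreviations like "Dr." so real reports would need something more careful.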

Main takeaway(s):

  • Get started with projects as soon as possible (no matter how small or unimportant). Looking back, this was a really simple project that I could probably do now in less than 10 minutes. At the time, it took me much, much longer. But in the struggle of figuring out how to properly import a .csv file, define functions and execute loops, I learnt so much that I couldn’t have learnt in a course environment.

🖥 (4) Machine Learning by Andrew Ng — Coursera [COURSE]

  • Date Completed: April — May 2018
  • Time Committed: ~40–50 hours (roughly 2 x 2½ hour sessions per week for videos, plus 1 long session on the programming assignment for ~2 months)
  • Link: Here.

This is the course where so many MLers begin and a common first window into the world of MOOCs.

The course design is great: videos with text transcripts, regular check-in questions, and useful quizzes and assignments at the end.

Main takeaway(s):

  • If you want to get started with machine learning, commit to finishing this course. It won’t be a waste of time. If you hate it, machine learning may not be for you. (I struggled with some of the maths at times — push through.)
  • Doing programming assignments helps you understand the topics. They say you don’t truly understand something until you have taught it. I’d say until you have programmed it.
  • There’s a lot of machine learning that this course doesn’t cover. There’s no mention of tree-based approaches, graphical models, Bayesian inference and lots more good stuff.

🪔 (5) Deep Learning by Andrew Ng — Coursera [COURSE]

  • Date Completed: June — July 2018
  • Time Committed: ~15-20 hours (in free-moments — commute, lunchtime, etc — for around 2 months. Only watched videos, didn’t do coding exercises.)
  • Link: Here.

Main takeaway(s):

  • Sometimes it’s more efficient to just watch the videos (and not do the programming assignments). It depends how much time you have and how deep you want to go. I learnt a lot without programming anything here.

👟 (6) Physical Health Monitoring in Community Mental Health Trust [PROJECT]

  • Date Completed: Aug — Sept 2018
  • Time Committed: ~10 hours (roughly 2 x 2.5h sessions per week for ~2 weeks)
  • Link: not able to share

While undertaking a psychiatry placement in South London, I noticed that the team were finding it hard to track and maintain the different physical health monitoring requirements of different patients. Patients on a particular drug would need blood tests or physical examinations at a certain frequency, but this varied a lot between patients.

I found that the key patient details could be exported to a .CSV file which I could put into Excel to analyse. I still wasn’t overly confident with Python at this stage, so I opted to use combinations of some simple Excel algorithms to process the data and highlight the next action required.

After we implemented this system, the percentage of patients receiving optimal monitoring improved from 22% to 71%.
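
The real version was built in Excel, but the underlying logic translates directly to Python. Here is a minimal sketch of the same idea, assuming a CSV export with one row per patient check. The column names, the rows, the 14-day warning window and the fixed "today" date are all invented for illustration.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical export: column names and rows are made up for this sketch
EXPORT = """patient_id,check,last_done,interval_days
P001,blood test,2018-06-01,90
P002,physical exam,2018-08-20,180
P003,blood test,2018-07-10,60
"""


def next_action(last_done, interval_days, today, warn_days=14):
    """Return a status flag and the date the next check is due."""
    due = last_done + timedelta(days=interval_days)
    if due < today:
        status = "OVERDUE"
    elif due <= today + timedelta(days=warn_days):
        status = "DUE SOON"
    else:
        status = "OK"
    return status, due


today = date(2018, 9, 1)  # fixed so the example is reproducible
results = {}
for row in csv.DictReader(io.StringIO(EXPORT)):
    last_done = date.fromisoformat(row["last_done"])
    status, due = next_action(last_done, int(row["interval_days"]), today)
    results[row["patient_id"]] = (status, due)
    print(row["patient_id"], row["check"], status, due.isoformat())
```

With a real export you would read from the file instead of the inline string and use `date.today()`.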

Main takeaway(s):

  • You don’t always have to write complex code to make a difference. I didn’t write a line of code, but this was the most impactful data science-type thing I’d done at that point. What problem in your work-place could you solve with what you currently know (or could learn)?

💊 (7) Predicting drug response using epigenetics [PROJECT]

  • Date Completed: November 2018
  • Time Committed: ~14 hours (one full weekend - created at the Cambridge Cancer Genomics Hackathon)

The response to anti-cancer medication varies widely. Before starting treatment, it’s often unclear who will respond well and who won’t. Completing a course of ineffective drugs wastes valuable time and can cause side effects.

In a team of 5, we tried to build a tool that could predict who would respond based on epigenetic changes after initial administration of the drug. We built a proof-of-concept algorithm (and won an award at the Hackathon).

However, this can’t be applied in the real-world yet because we simply don’t have the data. It’s not routine for epigenetic data to be collected before, during and after drug treatment. Hopefully in the future, though!

Main takeaway(s):

  • Hackathons are a great way to work on interesting problems and level-up skills. The technical mentors and other team members taught me so much, and I used scikit-learn for the first time.


📚 (8) A language model based on carer reports [PROJECT]

  • Date Completed: Dec 2018 — Jan 2019
  • Time Committed: ~10–15 hours (roughly 2 sessions per week for 4-6 weeks)
  • Link: On my GitHub page here.

This was a continuation of earlier work for the health-tech start-up I was working for (Project 3).

To better understand the contents of written carer reports, I decided to build a language model (which uses the examples you show to write ‘fake’ alternative reports). This was partly educational and partly to better understand the types of things coming up in the reports.

Main takeaway(s):

  • Most of the battle is getting data into the right form. This was my first time making a language model and my first time using Keras. Almost all my time was spent figuring out how to process the data to feed it into the model. Almost no time was spent on coding the model itself (which Keras makes really easy).

📋 (9) Preventing medication mistakes with ‘The Pill Detector’ [PROJECT]

The idea behind this project was: People often make mistakes when taking their medication, such as confusing different pills. This is particularly a problem in the elderly.

We wanted to make a device to reduce this. We made a tool which takes a picture of a pill, classifies the medication and then checks against a patient’s medical record to see if it’s right to take. We felt this could be a helpful ‘last minute check’ before taking the pill itself.

Main takeaway(s):

  • If your coding is not strong, you can still add value with domain-specific insight. While I helped with some of the code, others on the team were much stronger. What was most useful for the team was my medical insight.

🚑 (10) Preventing clinical deterioration in the elderly [PROJECT]

  • Date Completed: Aug 2018 — Sept 2019
  • Time Committed: ~200+ hours (not only writing code, working roughly 1 working day per week)
  • Links: NHSx report / Media report / unable to share model (commercially sensitive)

While working for Cera Care (a healthtech company in London), we built an algorithmic platform for predicting clinical deterioration. This was a big factor in the company’s subsequent £54 million funding round.

This was a fairly hefty project, and the programming was only a fraction of the overall work.

Main learnings and reflections:

  • Data structure is really, really important. A machine learning model is only as good as the data it’s trained on. For companies that utilise data science, their value is largely determined by the quality of the data they have. A huge part of this project was to improve the data structure; only a very small part was training the actual model.

👨‍💻 (11) A database of “social prescribing” services [PROJECT]

  • Date Completed: April — July 2018
  • Time Committed: ~20–25 hours (On average, an evening or two per week for a few months)
  • Links: GitHub

Social prescribing is when a doctor ‘prescribes’ a social activity, like a dance class, social meet-up or other event that “focuses on improved quality of life and emotional wellbeing”.

However, it’s hard to keep track of services. A GP and I set about making a web-facing database that could keep track.

I created an initial skeleton using the Django framework, but ultimately didn’t have enough time to commit for it to take off. Also, somebody else working on the same idea received a lot of money and started building it at scale, so we sidelined the project.

Main learnings and reflections:

  • Accountability is really helpful. It can be the difference between a completed project and dormant code. In our case, once the end-goal dissipated, so did the energy.

👨‍🏫 (12) Educational coding exercises: “Coding Medical Applications” [PROJECT]

  • Date Completed: June — Sept 2019
  • Time Committed: ~15–20 hours (2 or 3 evenings per coding exercise — four exercises in total)
  • Link: Code available here and here / Video series here

I decided to make some coding exercises specifically applied to healthcare, to encourage people with medical backgrounds to learn to code. I ended up making several:

  1. How to code a medical calculator for SIRS: blog / video
  2. How to code a neural network to predict hospital attendance: blog / video
  3. Diagnosing breast cancer with AI: blog / video
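
To give a flavour of the first exercise, here is a minimal sketch of a SIRS calculator. It is not the code from the blog or video, and the function names are made up for this illustration. It is also simplified: it uses the standard thresholds for temperature, heart rate, respiratory rate and white cell count, but leaves out the alternative PaCO2 and band-form criteria.

```python
def sirs_criteria_met(temp_c, heart_rate, resp_rate, wbc):
    """Count how many of the four SIRS criteria are met.

    temp_c: temperature in degrees Celsius
    heart_rate: beats per minute
    resp_rate: breaths per minute
    wbc: white cell count in 10^9 cells per litre
    """
    criteria = [
        temp_c > 38.0 or temp_c < 36.0,
        heart_rate > 90,
        resp_rate > 20,
        wbc > 12.0 or wbc < 4.0,
    ]
    return sum(criteria)


def has_sirs(temp_c, heart_rate, resp_rate, wbc):
    """SIRS is present when two or more criteria are met."""
    return sirs_criteria_met(temp_c, heart_rate, resp_rate, wbc) >= 2


print(has_sirs(38.5, 110, 18, 9.0))  # True: fever and tachycardia
print(has_sirs(37.0, 80, 16, 7.0))   # False: no criteria met
```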

Main learnings and reflections:

  • Teaching is an amazing way to consolidate learning. Doing this was a helpful prompt to re-visit core principles and really reinforce the basics.

(13) Predicting Loan Non-Repayment [PROJECT]

  • Date Completed: Oct 2019
  • Time Committed: ~10–15 hours (a couple of full days over the course of a couple of weeks)
  • Link: Code on GitHub

This project was set as an intra-university challenge at UCL, to secure some consulting work. In the end, I decided not to take on the work but still completed the full project.

The idea was to predict who wouldn’t make their loan payments based on a wide range of input variables.

Main learnings and reflections:

  • Get feedback. I shared this code with a few different technical people, and the feedback was invaluable. I learn so much every time I ask people to review my code.

🛠 (14) Predicting depth of faults in ‘heat exchanger’ tubes [PROJECT]

  • Date Completed: March — April 2020
  • Time Committed: ~150 hours (not only coding, but also making presentations and having team meetings. Worked 9–5, 5 days a week, for 5 weeks)
  • Link: Code on GitHub

When coronavirus came along and we went into lockdown, I was keen to get some practical coding experience remotely. Thankfully, I was accepted onto the S2DS remote program.

I worked in a team of 4 to build a machine learning model that predicted depth of faults in ‘heat exchanger’ tubes and achieved a good RMSE score.
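
For anyone unfamiliar with it, RMSE (root-mean-square error) is the square root of the average squared difference between predicted and true values, so it is in the same units as the thing you are predicting. A minimal sketch with made-up numbers (not the project's data or code):

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fault depths: true values vs. model predictions
true_depths = [2.0, 4.0]
predicted_depths = [1.0, 5.0]
print(rmse(true_depths, predicted_depths))  # 1.0
```

Lower is better, and a perfect model scores 0.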

The S2DS program is great — I’d highly recommend it.

Main learnings and reflections:

  • The fastest way to learn is full-time. There’s really no substitute for immersing yourself in a project. I learnt more in 5 days of this project than I did in 5 spaced-out days spent on previous projects.
  • Having technical mentors at hand is a game-changer. When you get stuck, having someone to message saves a huge amount of time.

🖼 (15) AI-generated Art [PROJECT]

  • Date Completed: May — June 2020
  • Time Committed: ~5–10 hours (a couple of evenings a week for around a month)
  • Link: Blog

I used the AI technique “style transfer” to take personal photos and give them an artistic style. I adapted existing code from GitHub.

Main learnings and reflections:

  • Having fun makes everything easier. This project was genuinely exciting and I felt I could embrace my creative side. It was also cool to decorate my house with the photos that I generated.

🔍 (16) Automating my job search with Python [PROJECT]

  • Date Completed: May 2020
  • Time Committed: ~10–15 hours (a couple of evenings for a couple of weeks)
  • Link: Code on GitHub / Blog

I got bored while searching for new jobs, so I wrote code that would automate the process.

Main learnings and reflections:

  • Scratch your own itch. This idea was just a small solution to a genuine problem I was having.
  • Share what you make. I’ve had people reach out and say this code was helpful. This is a really cool feeling. You have to share your code to make this possible (of course!)

📹 (17) I created my own YouTube algorithm (to stop me wasting time) [PROJECT]

  • Date Completed: June — Nov 2020
  • Time Committed: ~40 hours (around 10–15 hours a week for a few weeks, then a pause, then a weekend to wrap it up)
  • Link: Code on GitHub / Blog

I felt the YouTube algorithm was hit-and-miss (at best), so I coded an alternative.

The blog write-up ended up ranking first on Medium, which was pretty cool. I also had several people reach out and build on top of the code.

Main learnings and reflections:

  • Think about ‘tech’ that you don’t think works well. How could you improve it?

🥂 The end of the beginning: starting my first full-time data science job

After all of the above, I landed my first full-time data science job. I’m loving it! I learn a lot every day and work on exciting real-world problems.

I’ve written my best code since starting this job (but unfortunately can’t share it here). And of course, I’m still learning!

I hope some of these projects have given you ideas and inspiration for your coding journey.

Chris