You may be a student or a working professional contemplating a career change and asking yourself: How do I become a data scientist? Is it worth it? While there is no one-size-fits-all answer to these questions, I will discuss the prerequisites for becoming a data scientist and the aspects of the job that should interest anyone asking them.
How to become a data scientist – overview
If you are new to the data science field and google something like “skills needed to be a data scientist”, you will likely get a list including everything from hypothesis testing and OLS regression to NLP and neural network models. Data science is a wide field, and what data scientists do depends on their background, specialization, and where they work. Defining one specific skill set needed to become a data scientist is impossible, and navigating the field can feel ambiguous for a beginner. If you’re new to data science, focus on a few key skills that let you showcase your abilities. Don’t worry about specialized skills right away unless you’re pursuing a PhD in a particular area, such as NLP or image recognition. Instead, start by building a foundation through projects that develop your skills and prepare you for your first job. In this article, I’ll discuss the skills you should focus on and how to acquire them.
How to become a data scientist – key skills
Statistics – the must-haves
While you don’t need a PhD in statistics to become a data scientist, you do need to grasp basic statistical concepts that will allow you to understand what the data tells you.
- Probability and random variables: you need to understand what a random variable is and how that concept relates to testing statistical hypotheses with data.
- Hypothesis testing: this is a component of most modeling approaches. You need to know not only how to perform a statistical test mechanically but also what it means intuitively in different situations and under different assumptions.
- Descriptive statistics: understanding measures like mean, median, mode, variance, and standard deviation and when to use each one is essential for summarizing and analyzing data.
- Linear regression: understand this modeling approach on a deep level; it is a fundamental technique in its own right and a building block for more advanced methods (see the short Python sketch after this list).
- Basic statistical distributions: Familiarity with common distributions like Normal, Binomial, and Poisson allows you to model different data types accurately and helps you gain a better statistical intuition.
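To make these ideas concrete, here is a minimal Python sketch touching on descriptive statistics, a hypothesis test, and OLS regression. The `scipy` and `statsmodels` calls are standard, but the data is simulated purely for illustration:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulated data: y depends linearly on x plus normal noise.
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)

# Descriptive statistics.
print(f"mean={y.mean():.2f}, median={np.median(y):.2f}, sd={y.std(ddof=1):.2f}")

# Hypothesis test: is the mean of y different from zero?
t_stat, p_value = stats.ttest_1samp(y, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Ordinary least squares regression of y on x.
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.summary())
```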
Where do I begin if I am new to statistics?
If you’re a student and can take statistics courses at your university, I encourage you to do so. These days, however, statistics can be learned just as well from online resources. Even if you’re a student, those resources, many of which are free, can complement your formal university education. I am a huge fan of well-made videos that explain statistical concepts, such as the ones in this YouTube playlist. If you’re looking for an online course in statistics, browse Udemy, LinkedIn Learning, or Coursera, and pick a course that combines theory with practical implementation in Python, such as this one. Why Python? Read on to find out.
Coding – to be able to implement stuff
Python
You’ll need to know a coding language that allows you to manipulate data. Opinions differ on which language to learn first, with R and Python being the primary candidates. In my view, there is no question that Python is the better choice: it is a general-purpose language that can be used for data manipulation, web development, automation, machine learning, and more, giving you flexibility across diverse tasks. R, on the other hand, is designed specifically for statistics and data analysis. If you end up working on specialized econometric models, for example, you may find areas where a pre-programmed estimator exists in R but not in Python, and in those cases you may want to switch to R for that specific task. But if you’re starting out and don’t know how to program, focus on Python first. Python is also well suited for automation and model deployment and is more widely used in industry than R.
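As a taste of what data manipulation in Python looks like, here is a minimal pandas sketch; the file name and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical file and column names, used purely for illustration.
df = pd.read_csv("customers.csv")

# A first look at the data and its descriptive statistics.
print(df.head())
print(df.describe())

# Group-level summary: average monthly spend per country, highest first.
avg_spend = (
    df.groupby("country")["monthly_spend"]
      .mean()
      .sort_values(ascending=False)
)
print(avg_spend)
```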
SQL
SQL, or some variation of it, is what you’ll use to query data in structured databases. You don’t need advanced SQL to begin with; you’ll learn most of what you need on the job. When learning SQL, don’t obsess over which flavor you’re learning: standard SQL is the foundation for the other dialects, and the syntax differences are small and easy to google when needed. For example, Google BigQuery and Snowflake, cloud-based data warehouses that can store enormous amounts of data, each use their own SQL dialect. There are slight syntax variations between them, as the snippet below shows for extracting the year from the current date, but the key query structure is usually the same. You don’t need to memorize these differences; you’ll look them up as needed.
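Here is one way to write that query in each dialect; both are standard, documented syntax, though each warehouse accepts other variants as well:

```sql
-- BigQuery
SELECT EXTRACT(YEAR FROM CURRENT_DATE()) AS current_year;

-- Snowflake
SELECT YEAR(CURRENT_DATE()) AS current_year;
```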
Big data
Big data is a term that gets thrown around a lot, and while there is no strict definition, it typically refers to datasets so large or complex that traditional data processing software can’t deal with them. It’s not just that the data doesn’t fit in memory; its variety, velocity, and complexity matter too. This is where Apache Spark, and specifically its Python API, PySpark, becomes relevant.
Apache Spark is an open-source framework for fast, general-purpose distributed computing. PySpark, its Python API, brings the power and simplicity of Python to Spark, making it accessible to data scientists who already program in Python.
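To give a flavor of what that looks like in practice, here is a minimal PySpark sketch; the file and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would run on a cluster.
spark = SparkSession.builder.appName("sales_summary").getOrCreate()

# Hypothetical file and column names, used purely for illustration.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The same group-and-aggregate logic you would write in pandas,
# but executed in parallel across the cluster's workers.
summary = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"),
           F.count("*").alias("n_orders"))
)
summary.show()
```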
While not a must-have skill when you’re starting out in data science, knowing PySpark gives you an edge over fellow job applicants, and learning it is a good investment: you are very likely to need it sooner or later in your career.
Data visualization is mandatory
Data visualization is not an add-on but a fundamental skill that allows you to tell a story and make an impact. Knowing which charts to use for which purpose, and how to format and label them properly, is not self-evident. You’d be surprised how many of the charts floating around big tech companies are of horrible quality; if you learn to make impactful graphs, you will likely stand out from the competition. At best, a bad chart is merely hard to read; at worst, it is downright misleading. Watch this TED talk for inspiration on what good and bad data visualization look like.
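As a small illustration of the basics, here is a minimal matplotlib sketch that gives a chart a title, labeled axes with units, and a y-axis starting at zero; the numbers are invented purely for illustration:

```python
import matplotlib.pyplot as plt

# Invented numbers, used purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12.1, 13.4, 12.9, 15.2, 16.8, 17.5]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue, Jan-Jun (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (USD millions)")
ax.set_ylim(bottom=0)  # start the y-axis at zero so the trend isn't exaggerated
fig.tight_layout()
plt.show()
```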
Communication
Clear written and verbal communication is proof of clear thinking. In interview loops, it quickly becomes apparent whether a candidate can organize their thoughts and convey them coherently, which is substantially more important than immediately giving the right answer. Just as you actively learn technical skills, you should proactively work on becoming a good communicator. I believe this is an area where most of us can improve. Think of a time you listened to a proficient orator and storyteller and how they conveyed an idea. Would you like to be able to do the same? There are resources like this TED talk; use them to get inspired.
As part of your training to become a data scientist, I recommend producing sample projects using freely available data and publishing them on your GitHub and/or your website. Those projects should include the “science” (summary statistics, regressions, etc.) and a description of the scientific process: What research question did you choose? How is it relevant to a business or policy problem? What method did you choose to answer your question, and why? What were the results? What do you recommend based on those results? As you can see, in any data science project, plenty of questions must be answered using the written word. This is where you show that you can write a well-formatted document with a thought-through structure. If you include mathematical notation, I suggest you use LaTeX, as in the example below. Learn more about LaTeX here.
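For example, the simple linear regression model from earlier and its OLS estimator are easy to typeset cleanly in LaTeX:

```latex
% A simple linear regression model and its OLS estimator
\begin{equation}
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i ,
  \qquad
  \hat{\beta} = (X^\top X)^{-1} X^\top y
\end{equation}
```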
Conscientiousness and taking ownership
When conducting interviews, employers prioritize candidates who possess the necessary skills, are trustworthy, and can handle tasks independently. Hiring someone who requires constant supervision defeats the purpose of hiring them in the first place. Therefore, emphasize your sense of responsibility, your proactive attitude, and your ability to deliver reliable, credible results. Read my interview guide for a longer discussion on preparing for interviews.
Do you need a degree to become a data scientist?
I am going to separate the discussion into PhD vs. other degrees. For certain specialized, research-oriented data science roles, you need a PhD (though there are always individuals who are exceptions to the rule). For more general entry-level data science roles, a degree helps but is not necessary; a degree is neither a requirement nor a guarantee of a candidate’s competence. In fact, if someone comes to me with no degree and demonstrates a high level of proficiency, I am impressed not only by their knowledge itself but also by the discipline and drive it took to learn it without the help of a formal university structure. That said, a degree certainly helps you get into the recruiting funnel and hopefully teaches you something useful. If you are pursuing a degree, be proactive and make the most of the resources your educational institution offers.
Do you still have questions? Feel free to reach out to me directly.