Data Science Curriculum
mohcine’s super blog
By Mohcine madkour, Sun 09 April 2017, in category Data science learning
Earning your way to a career in data science
This page lays out a detailed 18-week curriculum for acquiring the skills needed to become a useful, practical data scientist. It is not meant as a replacement for university Masters and Ph.D. programs in machine learning, statistics, and computer science, which take a more theoretical and abstract approach to their respective disciplines. You will not be magically transformed into one of the world’s premier data scientists, nor will you be able to do rigorous proofs of machine learning concepts.
This curriculum is focused on making an individual as practical, useful, and hireable as possible to businesses that have use cases for advanced analytic intelligence. After completing the course, you should be highly valuable to almost any business that values data-driven decision making, and you should have a good foundation for exploring many other data science topics in greater depth.
Approach to Learning
The main component of knowledge acquisition will be a time-tested source, the book. Textbooks give you a thorough, detailed, and organized aggregation of the knowledge you are trying to acquire. The dozens of online data science courses launched in the last few years by Coursera, edX, Udacity, Udemy, and others have greatly expanded the ability of any individual to quickly acquire data science skills. Many of these courses suggest books as companion sources, but most don’t enforce the reading of the material. Generally, a video lecture can only capture a small fraction of what is written in a textbook.
This is not to disparage online courses: I have taken over a dozen of them myself and have learned a great deal. But a good textbook can widen the base of your knowledge considerably when read in conjunction with most online video-lecture-based courses. There will be a big focus on reading textbooks and answering the questions at the end of each chapter.
Of course, with data science being a very hands-on discipline, each topic in the curriculum will have programming assignments designed to implement theory and application taught in the textbooks.
A Generalized Order of Learning
The list below summarizes the section above as verbs, in the order in which I suggest acquiring the skills of the curriculum. This is just my suggestion, but it is how I would teach the material.
- Reading - New material is read first, before any lecture. For most material, a good book exists that covers it.
- Listening - Listening can refer to the more traditional lecture, where only the most important or difficult concepts of the book are rehashed. This time could also be used as a question-and-answer session instead of a lecture. Online videos can be watched in place of a lecture. Blog posts and online forum discussions can also be good.
- Doing - Doing refers to attempting problem sets and programming assignments on your own with limited help. Struggling without help is a great way to learn.
- Collaborating - Working with others on problem sets or programming assignments is invaluable to the learning process. Seeing solutions from instructors or peers can be very helpful.
- Remembering - Covering a large number of topics in a short amount of time is a recipe for quickly forgetting. Techniques such as spaced repetition can help you remember and reinforce previous concepts.
Week 0: Prerequisite Knowledge
Books
Before getting started on the main curriculum, some minimum assumptions are made: that you have some background in programming, some background in statistics, and at the very least can do basic algebra.
Python
Good data science requires knowledge of at least one programming language, and it is better to know one language very well than many only marginally. This is similar to natural language, where speaking one language fluently is better than speaking several at the level of a five-year-old. After mastering one programming language, it’s usually fairly easy to translate code to another, as the concepts of programming do not change drastically from one language to the next.
The Python programming language is an excellent choice for learning data science. It is general purpose (can handle nearly any task), high level (beginner friendly, with low-level details such as memory management handled for you), open source (free to see the source code and usually free to use), has an excellent community (help is just a Google search away), and has many friendly data science libraries already built (batteries included).
There are dozens of books and online courses available to begin learning Python. The short book Think Python (freely available) is a solid introduction to the language. It will be supplemented by an abundance (50+) of short exercises and some smaller data inspecting/cleansing assignments that use only the standard library (not the third-party data science libraries).
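To give a feel for what a standard-library-only cleansing exercise might look like, here is a minimal sketch. The data, the column names, and the `clean_rows` helper are all made up for illustration and are not taken from Think Python or from the actual assignments.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data: stray whitespace, inconsistent casing, one missing age.
RAW_CSV = """name,city,age
 Alice ,houston,34
BOB,Houston ,
carol, AUSTIN,29
"""


def clean_rows(text):
    """Parse CSV text and normalize each field using only the standard library."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        age = row["age"].strip()
        cleaned.append({
            "name": row["name"].strip().title(),
            "city": row["city"].strip().title(),
            # Keep a missing age as None rather than guessing a value.
            "age": int(age) if age else None,
        })
    return cleaned


rows = clean_rows(RAW_CSV)
known_ages = [r["age"] for r in rows if r["age"] is not None]
print(rows)
print("mean of known ages:", mean(known_ages))
```

Doing this by hand once makes it much easier to appreciate what the data science libraries automate later in the curriculum.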
Statistics
The complement to computer skills for data scientists is math/statistics skills. To ease the student into statistics (for those who have forgotten it or never taken a formal class), we start with a nearly formula-less book called Statistics by David Freedman. The book is good for building an intuition about how statisticians think about problems, and it reads almost like a novel in that there is very little math to be done. The first five parts (18 chapters) cover the core probability and statistics material that forms the foundation of entry-level stat books.
Week 1: Software Development and Advanced Python
Books
After covering the basics of Python in Think Python, the student should be ready for a more advanced understanding of the language.
- Day 1: Review of introductory Python with focus on data structures
- Day 2: Software development lifecycle with focus on debugging and testing
- Day 3: Overview of Classes and Object Oriented Programming
- Day 4: Python Data Model, special methods and the Standard Library
- Day 5: Multithreading and Multiprocessing
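As a small taste of the Day 4 material, the sketch below shows how special methods let a user-defined class work with built-in Python behavior. The `Deck` class is a hypothetical example of mine, not something from a particular book.

```python
class Deck:
    """A tiny container that plugs into built-in behavior via special methods."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):
        # Called by len(deck).
        return len(self._cards)

    def __getitem__(self, position):
        # Called by deck[i]; also enables iteration, slicing, and `in`.
        return self._cards[position]

    def __repr__(self):
        # Called by repr(deck) and by the interactive prompt.
        return f"Deck({self._cards!r})"


deck = Deck(["A", "K", "Q", "J"])
print(len(deck))
print(deck[0], deck[-1])
print("Q" in deck)
print(sorted(deck))
```

Only three methods are defined, yet `len`, indexing, membership tests, and `sorted` all work, because Python falls back on `__getitem__` when it needs to iterate.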
Week 2: Data Wrangling
After gaining a firm understanding of the core concepts of Python in Weeks 0 and 1, we take a deep dive into Python’s data exploration libraries. The Pandas library is phenomenal for nearly all kinds of data wrangling tasks. In addition to Pandas, we take a thorough look at Python’s excellent visualization libraries, matplotlib and seaborn.
Books
- Day 1: Introduction to Series and DataFrames
- Day 2: Split-Apply-Combine and Tidy Data
- Day 3: Matplotlib and Seaborn
- Day 4: Time Series and miscellaneous Pandas functionality
- Day 5: Data Science mock interview Assignment
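The split-apply-combine idea from Day 2 can be sketched without any third-party library. The version below deliberately uses only the standard library so that each step is visible; the `sales` records are invented for illustration. In Pandas the same pattern is expressed with `DataFrame.groupby` followed by an aggregation such as `.mean()`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; in Pandas these would be rows of a DataFrame.
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 300},
    {"region": "west", "amount": 50},
]

# Split: bucket the amounts by region.
groups = defaultdict(list)
for record in sales:
    groups[record["region"]].append(record["amount"])

# Apply: compute a summary within each group.
# Combine: collect the per-group results into one structure.
mean_by_region = {region: mean(amounts) for region, amounts in groups.items()}
print(mean_by_region)
```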
Week 3: Probability
Books
Introduction to Probability - Also free online at probabilitycourse.com.
Although many data science jobs don’t involve calculating probabilities by hand, the subject underlies nearly all data science tasks. A good understanding of probability makes many machine learning techniques much easier to comprehend.
- Day 1: Basic Discrete and Continuous Probability
- Day 2: Conditional Probability, Bayes’ Theorem and counting methods
- Day 3: Random Variables - Expected Value and Variance
- Day 4: Discrete and Continuous Distributions
- Day 5: Joint Distributions
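As a small worked example of the Day 2 material, the sketch below applies Bayes’ theorem, P(A|B) = P(B|A) P(A) / P(B), to a diagnostic test. The prevalence, sensitivity, and false positive rate are made-up numbers chosen for illustration, and `posterior` is a hypothetical helper name.

```python
from fractions import Fraction


def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1 - prior)
    # The denominator is P(positive), expanded by the law of total probability.
    return true_positive / (true_positive + false_positive)


# Assumed numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(Fraction(1, 100), Fraction(95, 100), Fraction(5, 100))
print(p, float(p))
```

With these assumed numbers, a positive result still leaves the condition fairly unlikely (about 16%), because false positives from the large healthy group outnumber true positives from the small affected group.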
Week 4: Statistics
Books
- Day 1: Sampling Distributions and the Central Limit Theorem
- Day 2: Hypothesis Testing, confidence intervals, p-values, types of errors
- Day 3: Hypothesis Testing, confidence intervals, p-values, types of errors
- Day 4: Experimental Design and ANOVA
- Day 5: Case Study
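The Central Limit Theorem from Day 1 is easy to check by simulation. The following is a minimal sketch, assuming a uniform(0, 1) population and arbitrarily chosen values for the sample size and number of samples.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 30     # assumed size of each sample
NUM_SAMPLES = 2000   # assumed number of repeated samples

# Draw many samples from a uniform(0, 1) population and record each sample mean.
sample_means = [
    mean([random.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

# Theory: the population has mean 1/2 and standard deviation sqrt(1/12),
# so the sample means should center on 1/2 with spread sqrt(1/12) / sqrt(n).
expected_spread = (1 / 12) ** 0.5 / SAMPLE_SIZE ** 0.5

print("mean of sample means:", round(mean(sample_means), 3))
print("spread of sample means:", round(stdev(sample_means), 4))
print("spread predicted by theory:", round(expected_spread, 4))
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the underlying population is flat.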
Week 5: Databases and SQL
Statistics courses are generally taught with numbers that masquerade as data. Data in the wild is something completely different. The world’s data is held in databases, and until recently most of it was held in relational databases. Understanding the basics of relational database design is extremely important for a data scientist, as is communicating with data modelers and engineers. And accessing data through the (mostly) simple Structured Query Language, SQL, is an absolute necessity for becoming a data scientist.
Books
Either MySQL or PostgreSQL book
- Day 1: Introduction to Databases, relational databases, ER Modeling
- Day 2: Introduction to SQL (with MySQL or PostgreSQL), the different subcomponents of SQL and basic SELECT statements
- Day 3: Advanced SQL
- Day 4: Even more advanced SQL
- Day 5: Building a data warehouse in the cloud
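Here is a minimal sketch of the kind of SQL covered early in the week: a basic SELECT with a filter, then a JOIN with GROUP BY. Note that it swaps in SQLite through Python’s built-in `sqlite3` module so it runs with no server; the curriculum itself uses MySQL or PostgreSQL, whose dialects differ in places. The tables and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (customer_id, total) VALUES (1, 20.0), (1, 35.5), (2, 12.0);
""")

# Basic SELECT with a filter; the ? placeholder keeps values out of the SQL string.
big_orders = conn.execute(
    "SELECT id, total FROM orders WHERE total > ? ORDER BY total", (15,)
).fetchall()

# JOIN plus GROUP BY: total spend per customer.
spend = conn.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY spend DESC
""").fetchall()

print(big_orders)
print(spend)
conn.close()
```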
Week 6: Linear Models
Books
Introduction to Linear Regression Analysis
- Day 1: Linear Regression and correlation
- Day 2: Multiple Linear Regression, variable transformation and model building
- Day 3: Regression Diagnostics, Residual Analysis, Regularization
- Day 4: Classification with Logistic Regression
- Day 5: Generalized Linear Models
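To make the Day 1 topic concrete, the sketch below fits a simple linear regression by ordinary least squares and computes the Pearson correlation, using only the standard library. The data points are made up to roughly follow y = 2x + 1. In practice a library would do this for you; the point here is to see the formulas.

```python
from math import sqrt
from statistics import mean

# Hypothetical data that roughly follows y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

slope = s_xy / s_xx                 # least-squares estimate of the slope
intercept = y_bar - slope * x_bar   # the fitted line passes through (x_bar, y_bar)
r = s_xy / sqrt(s_xx * s_yy)        # Pearson correlation coefficient

print("slope:", round(slope, 3))
print("intercept:", round(intercept, 3))
print("correlation:", round(r, 4))
```

The slope and the correlation share the numerator `s_xy`, which is why the two topics are usually taught together.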
Break Week
Week 7: Nonlinear Models
Books
- Day 1: Linear and nonlinear Discriminant Analysis
- Day 2: K-Nearest Neighbors and model validation
- Day 3: Support Vector Machines
- Day 4: Decision Trees
- Day 5: Random Forests and Gradient Boosted Trees
Week 8: Dimensionality Reduction and Unsupervised Learning
- Day 1: Curse of Dimensionality and PCA
- Day 2: K-means and hierarchical clustering
- Day 3: One-class SVM
- Day 4: Expectation Maximization
- Day 5: Graph-based learning
Week 9: Specialty Topics
- Day 1: Bag of Words model and Naive Bayes for Text Classification
- Day 2: Matrix Decomposition Methods for Topic Discovery
- Day 3: NLP Project
- Day 4: Recommendation Systems
- Day 5: Recommendation Systems
Week 10: Hadoop Ecosystem
- Day 1: Linux
- Day 2: Hadoop and MapReduce
- Day 3: Cloud computing - AWS or Google Cloud
- Day 4: Spark
- Day 5: HBase
Week 11: Neural Networks
- Day 1: Neural Networks
- Day 2: Convolutional and Recurrent Nets
- Day 3: TensorFlow
- Day 4: Deep Learning
- Day 5: Autoencoders and Restricted Boltzmann Machines
Week 12: Web Development
- Day 1: Basic HTML/CSS
- Day 2: JavaScript
- Day 3: jQuery
- Day 4: D3
- Day 5: App building
Break Week
Weeks 13 - 15: Capstone Project
Week 16
Review
Week 17
Hundreds of Interview questions and beginning of job search
Week 18
Interview Feedback