Description
In this course, you will learn:
- How to manipulate data using vectorization. The course also covers some common errors and how to avoid them.
- How to use many high-performance built-in functions in Python and pandas. Because pandas can consume a lot of memory, Miki provides useful memory-saving advice. The course also shows how to serialize data using SQL and HDF5.
- How to speed up your code with Numba and Cython. Alternative DataFrames can also help you speed up your code, and Miki walks you through some possibilities.
- Which additional resources you can use to learn more.
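As a small taste of the vectorization topic, here is a minimal sketch that contrasts a row-by-row loop with a vectorized column operation and then filters rows with boolean indexing. The sample data and the variable names (`df`, `loop_total`, `vector_total`, `expensive`) are made up for illustration and are not taken from the course material.

```python
import pandas as pd

# Hypothetical sample data, not taken from the course.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1, 2, 3, 4],
})

# Row-by-row loop: easy to read, but typically slow on large frames.
loop_total = [row.price * row.quantity for row in df.itertuples()]

# Vectorized: a single operation applied to whole columns at once.
vector_total = df["price"] * df["quantity"]

# Boolean indexing: keep only the rows where the condition holds.
expensive = df[df["price"] > 15]

print(vector_total.tolist())
print(expensive)
```

Both approaches produce the same totals. On a frame this small the speed difference is not measurable, so treat the example as a demonstration of the style, not as a benchmark.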
Syllabus:
1. Overview
- Why performance matters
- Setting goals
- Measuring performance
- Profiling
2. Vectorization
- What is vectorization?
- Boolean indexing
- Understanding ufuncs
3. Common Mistakes
- The limitations of appending
- The limitations of object dtype
- The limitations of row iteration
- Understanding the isin function
- Parsing time once
4. pandas Performance
- Using built-in functions
- Understanding eval and query
- Understanding the join function
5. Saving Memory
- Why memory is important
- Measuring memory
- Loading parts of data
- Categorical data
6. Fast Serialization
- Various formats and why not CSV
- Optimizing with SQL
- Optimizing with HDF5
7. Numba and Cython
- What is Numba?
- Using Numba
- What's Cython?
- Writing Cython code
- Compiling Cython
- %%cython magic
8. Alternative DataFrames
- Using Dask
- Using Vaex