Description
You will learn the fundamental limitations of using technology to protect privacy, as well as the emerging codes of conduct that will guide the behaviour of data scientists. You will also learn about the importance of reproducibility in data science and how the commercial cloud can help support reproducible research even when experiments involve massive datasets, complex computational infrastructures, or both.
Learning Objectives
By the end of this course, you will be able to:
- Design and critique visualisations.
- Describe the current state of privacy, ethics, and governance in the context of big data and data science.
- Make use of cloud computing to analyse large datasets in a repeatable manner.
Syllabus
1. Visualization
- Introduction: What and Why
- Introduction: Motivating Examples
- Data Types: Definitions
- Mapping Data Types to Visual Attributes
- Data Types Exercise
- Data Types and Visual Mappings Exercises
- Data Dimensions
- Effective Visual Encoding
- Effective Visual Encoding Exercise
- Design Criteria for Visual Encoding
- The Eye is not a Camera
- Preattentive Processing
- Estimating Magnitude
- Evaluating Visualizations
2. Privacy and Ethics
- Motivation: Barrow Alcohol Study
- Barrow Study Problems
- Reifying Ethics: Codes of Conduct
- ASA Code of Conduct: Responsibilities to Stakeholders
- Other Codes of Conduct
- Examples of Codified Rules: HIPAA
- Privacy Guarantees: First Attempts
- Examples of Privacy Leaks
- Formalizing the Privacy Problem
- Differential Privacy Defined
- Global Sensitivity
- Laplacian Noise
- Adding Laplacian Noise and Proving Differential Privacy
- Weaknesses of Differential Privacy
3. Reproducibility and Cloud Computing
- Reproducibility and Data Science
- Reproducibility Gold Standard
- Anecdote: The Ocean Appliance
- Code + Data + Environment
- Cloud Computing Introduction
- Cloud Computing History
- Code + Data + Environment + Platform
- Cloud Computing for Reproducible Research
- Advantages of Virtualization for Reproducibility
- Complex Virtualization Scenarios
- Shared Laboratories
- Economies of Scale
- Provisioning for Peak Load
- Elasticity and Price Reductions
- Server Costs vs. Power Costs
- Reproducibility for Big Data
- Counter-Arguments and Summary