Instructor | Subhadeep Sarkar | Email: subhadeep@brandeis.edu |
Lectures | Timings: Tue & Fri: 12:45 PM – 2:05 PM | Location: Gerstenzang: 121 |
Student hours | Timings: Tue & Fri: 2:15 PM – 3:15 PM | Location: Volen 259 |
Recitation [1] | Timings: Fri: 4:00 PM – 5:00 PM [2] | Location: Volen 259 [3] |
[1] Recitations are optional for this class and will be used as additional student hours to discuss project progress and updates.
[2] Students are encouraged to email the instructor to set up an appointment if they want to attend a recitation session.
[3] The recitations will be held in Volen 259 (instead of Gerstenzang: 121).
Course Description
Data is everywhere. As scientists, users, and citizens we are both generating and exploiting large, ever-growing, diverse sets of data. For several applications – ranging from scientific discovery to business analysis, governance, and everyday activities – we are directly using and indirectly affecting hundreds of data systems! The big challenge is to turn data into useful knowledge, and to do so quickly, in order to increase the impact of the new insights.
Achieving these goals comes with a number of technical challenges. How can we exploit continuously evolving hardware (storage, computation, network)? How can we collect all incoming data efficiently? How can we query dynamic collections that keep accumulating incoming data? How can we parallelize query processing from one core to a few (scale-up), and then to thousands (scale-out)? What are the needs of evolving workloads (hybrid transactional/analytical processing, graph analytics, Internet-of-Things, micro-payments, monitoring)?
In this course, we will discuss how to design data systems that can address these challenges. We will look in detail at the two driving forces behind innovation in data systems, hardware and workloads, and we will discuss recent and future trends in both. We will use examples from several data management areas, including relational systems, distributed database systems, key-value stores, NewSQL and NoSQL systems, data systems for machine learning (and machine learning for data systems), interactive analytics, and data management as a service. In a quickly moving industry and research landscape, the ability to design such systems is essential.
Prerequisites
COSI 127b – Database Management Systems. A working knowledge of C/C++ and Java or Python programming and a fundamental understanding of data structures and algorithms are required. Please see the instructor if you are not sure about the level of your preparation.
Learning Goals
Students who successfully complete all components of this course will be able to demonstrate the following by the end of the semester.
1. Familiarity with the history and evolution of NoSQL data systems design over the past decades.
2. Understanding of the challenges and tradeoffs associated with large-scale data management and analysis.
3. Knowledge of the key design principles of state-of-the-art NoSQL systems.
4. Ability to interact with and tune modern key-value stores.
5. Understanding of how large-scale commercial data systems work and how to optimize such systems for a given workload and performance target.
6. Ability to program, experiment with, and evaluate complex, scalable data systems, and to reason about their performance.
7. Ability to read and interpret state-of-the-art research papers and to critique them constructively.
Topics
In this class, we will cover data systems design principles from the following angles.
- What affects new data systems designs (data and applications, emerging hardware, and new workloads)
- Traditional data systems for modern hardware
- Distributed database systems
- NoSQL, NewSQL, and key-value stores
- Hardware-conscious systems design