Due Dates
Assigned Due Fri 04/06 Mon 04/30 0. Objectives
After completing the project, students are expected to be able to
- Explain in general how a database is implemented;
- Implement a subset of database functions such as a simplified SFW (Select-From-Where) clause;
- Explain how different data structures can be used to implement a database;
- Select proper data structures and algorithms for applications such as a database.
1. Overview
This is a team project of 2-3 students. Please let the instructor know if you'd prefer solo.
You will implement a simplified SFW clause in a database using a high level programming language such as Python. That is, when a user issues a database command such as
select * from books where pubdate > 1990;
you are to search through the set of data you mention and return the information as requested. In this case, all the books that are published after the year 1990.A database system such as sqlite3 has its own user interface, or set of commands through which a user can interact with the system to accomplish tasks such as importing data, reorganizing data, imposing constraints, or querying the data to find useful information. These database functionality are implemented using some lower (compared to the database language) level programming languages such as C, Java, or Python. Database systems are typically implemented using a data structure such as B+ tree that supports efficient search and storage. In this project, you are given a implementation of a B+ and B tree. Your tasks are to understand how this version of B+ tree implementation works, then use it to realize the functionality you want, that is, to support a simplified SFW operation. While the given version of B+ tree is in Python, you can certainly search and use other versions of B+ tree in Python or other programming languages such as C or Java.
2. A Quick Review of B+ Tree and The Given Implementation of B+ Tree
A B+ tree is a N-ary search tree with a variable but often large number of children per node. A B+ tree has the following properties.
- A B+ tree is a search tree that meets the general relation of value(keys in left children) < value(key in node) < value(keys in right children);
- The height of a B+ tree is balanced;
- Each interior node is an index node, that is, all interior node contain keys only, not the data associated with the key;
- All data are stored at the leaf level (or linked through the leaf nodes);
- All data nodes are connected at the leaf level to support a quick traversal.
Here is a diagram from Wikipedia to illustrate the concept of a B+ tree.
![]()
Figure: A Sample B+ Tree
By Grundprinzip - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=10758840
The above figure shows a simple B+ tree example linking the keys 1-7 to data values d1-d7. The linked list (red) allows rapid in-order traversal. This particular tree's branching factor is 4.
Because implementing data structures such as B+ tree is not the focus of our course, you are given a complete implementation of B+ tree in Python. The code can be downloaded from this link at the course website, or you can download it directly from the original website.
3. Your Tasks
Your overall task is to implement a simplified SFW SQL clause. That is, when a user issues an SFW command, your program should go through the data you have and return the datasets that meet the specified criteria. We say it will be simplified in the sense that we limit the functionality of SFW to have simple condition of one Boolean expression only. We also do not implement any nested features or aggregate functions in our base implementation. Some of these features can be implemented as a challenge. (See description later.)
Here are some specific points you'd want to pay attention.
- You can limit the SFW to a simple form of select [items] from [table] where [condition].
- The items above should be attribute names separated by a comma. You may assume table and condition are both single, e.g., select name,major from students where classyear=2018 in which the attributes of name and major are selected from a single table students with a single condition classyear=2018.
- Though the condition is single, your implementation must support all logical conditions including
=, !=, >, >=, <, <=;
. You may assume the comparison is always against a constant value, e.g., classyear>2015 or major='csci'. The constant is always on the right side of the comparison.- You can also make the assumptions that the items and the condition do not contain any white spaces. This will make the parsing a bit simpler.
- You may want to read the given B+ tree code and understand how it works before starting your own task. Write some simple code to test various cases.
- There are many different ways to implement the SFW clause on top of a given B+ tree code. Try to do it the way you feel comfortable, yet reasonably efficient (the bottom line is that it shouldn't be huge amount of coding).
- You do not need to implement elaborate user interfaces. Your program should load the data, then allow the user type in SFW queries from the command line that your program should return the appropriate data in a readable output format.
While your implementation should work for any data, we provide a sample data set here. Download this CSV file that contains the top 100 books by Newsweek. The data has four columns, Rank, Title, Author, and Year. The meaning of these attributes are self-explained. The following are some sample queries that your program should support. Assume the table in the database is called books.
select * from books where author='Ralph Ellison';
select * from books where author!='Ralph Ellison';
select title,author from books where year>1900;
select title from books where author='Jane Austen';
3.1 Challenges
Your base implementation should support the queries listed above and the ones similar to these queries. You are asked to choose and implement a challenge task that goes beyond the base implementation. You and your partner can decide what to do. Pick something interesting and challenge, yet within your reach. We will leave a 10 percent of the project grade for this part. Here are some examples for your reference. Note that the difficult levels in these tasks vary widely.
- Free format query: The SFW statements in the base implementation have very limited freedom in format, e.g., no space within the condition, to make the implementation a bit easier. An updated implementation would allow any format, as long as the syntax is correct.
- Aggregate functions: You may choose to implement some interesting and useful aggregate functions such as sum, max, min, or count.
- Multi-condition SFW: The base implementation only supports a single condition. You can try to support conditions with multiple Boolean expressions, e.g., year=1900 and author='Jane Austen'.
- Multi-index, single data block: The current implementation of the B+ tree has one set of index. In another word, if you'd search by title, you'd need a B+ tree built using the title as the key; if you'd search by author, you'd need another B+ tree built using the author as the key. You can try to build a multi-index, single data set B+ tree. That is, there is only one set of data blocks, multiple index can point to this data set.
4. Submission
You are asked to submit all program and data files in a single folder under your git repo tree. In addition to programs and data, submit a README file of a page or two describing your experiences, successes and challenges.
Submit everything through git. Make sure the submission is complete. The key is that when downloaded from git, the files should be self-contained that they run directly without requiring other files.