I finally got my programming assignment for my databases class yesterday. We have been wondering in class when this was going to show up, and what it would pertain to. Now we know… and I have a feeling that there is a good portion of the class that is going to be hopelessly screwed.
Why? Because this is a REAL programming assignment. That’s why.
You see, these days in college level courses, at least at the University of Florida, we have been getting things like “make a program that takes these numbers and gives the average, mean, and standard deviation.” Of course, all of these programs are written in Java.
This is different. With this assignment we are doing an exploration of search algorithms based on different indexing techniques. The first, our baseline, is simply a sequential search. I am still mulling over how exactly I am going to do it, as there are a couple of tweaks that I can perform even with a sequential search to get the best performance. After that we are to perform a kd-tree search on the data (the data is two-dimensional points, so a d-tree type indexing makes sense in this case) and then a Vkd-tree, which I have yet to actually figure out the specifics of.
We have a month to do this assignment, and I have already started. Let me be the first to say that those that wait on working on this assignment are going to be screwed, because I am seeing issues with it already. I have sent three emails to the professor asking questions about the assignment (not for help, mind you, but about the specification) because certain things about it are not completely clear to me yet, such as the range on the points, whether or not they can be real numbers or if they are strictly integers, and just how many points there can be in a database file in the first place.
All of these things matter, because if I write the program to use short integers and then the point values exceed 216 we get a buffer overflow. Same goes for if I want to store all of the points in memory, as I have a finite amount of space and need to define the number of points I am going to load. The sample database has 2 million points in it, which blew up a normal integer array size but survived in a short integer array. If the number of points is going to be larger than this, then I will have to go back to stuffing everything onto disk instead. Ah well.
I have, however, decided that the output of all three methodologies are going to be output to disk first, sorted, then displayed. In this way I can get the same output and ordering regardless of search methodology, and I shouldn’t have to worry about sorting the original data first.