Hashing
In this lab we will be continue to study ways to process a large amount of data. We have two constraints we must meet.
- The data is too large to store in memory or in just one file
- We are processing so many lookups that the binary search is too slow (even if we could maintain the data sorted).
Because of this, we will be working on a modified hashing solution
Objectives
- Understanding and implement hashing
- Learn to work with random access files
- Adapt a hashing routine to try to achieve O(1.3) with at most 35% wasted space when working with a very large set of data
Assignment
Link for github classroom:
https://classroom.github.com/a/m4-RJf6S
Phase 1:
- Design a hashing function that creates two numbers. The first number will be used to specify the file. The second number will be used to indicate which record in that file.
- Choose a collision handling function to work with your hashing function.
- You will have access to SellItem.java. Make sure you understand every component. Design two methods (one static), both called hash that will receive an id (or use the id of the item) and hash it. They should return an array with the first item indicated which file and the second item indicated the record in that file.
- Design DataFile.java. This class should only have static methods, constants, and one class variable.
- Class variable: The class variable should count the number of reads/writes. It needs to be able to be reset and the value reported back.
- Constants
- BLOCKSIZE is the number of records that can be stored in a given file. This should be set to 20,000.
- INDEXSIZE is the number of different files allowed. This should be set to 4,000.
- BASENAME is the main part of each data file with the number of which data file appended to the end of it. It should be set to "selldata.".
- Methods. You will need additional methods to find and save a given sell item. Because this class only has class field variables, all methods should be class methods.
- Serious time should be spent designing the testing plan. You need a couple of small examples to use to test and debug your code Then you need medium size ones. Finally you should have large test cases that you can use to see if you have achieve the benchmarks you wanted.
- Turn in the software design report as a pdf by the due date specified on Canvas.
Phase 2:
- Implement your designs
- Test and debug your code
- Run your code on your large data sets. If you have not achieved the benchmarks, adapt your hashing functions and/or collision handling routines until you do. Document the different methods you use in a second part of your testing report. If you have tried a total of three different hashing functions each with two different collision handling methods and have still not achieved the benchmarks, you may stop. Just be sure to talk about it in your SDR.
Submission instructions for Phase 2
- Complete the SDR.
- Combine your SDR and all of your .java and testing files into one zip folder
- Submit the zip folder on Canvas by the due data specific there.
Restrictions:
- You may not use any Internet resources (other than to look up Java syntax).
- You may use code from this course or last semesters.