Prediction:
The goal is to predict a cuisine from its ingredients. Some of the strongest geographic and cultural associations are tied to a region’s local foods. Every cuisine has a few special ingredients that set it apart from the others, or maybe not. I would like the machine to tell me which cuisine I can make by looking at some ingredients.
Description of the data:
I found the dataset on www.kaggle.com. The training data has 39774 instances in JSON format. I wrote a Python script to convert it into CSV format so that it can be imported into Weka.
Below is a screenshot of the dataset in CSV format:
Originally it had 3 attributes:
- id
- cuisine
- ingredients
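The conversion step can be sketched roughly as follows. This is not the original script: it assumes the JSON file is a list of objects with the keys id, cuisine, and ingredients, that ingredients is a list of strings, and that joining them with ";" into one CSV field is acceptable. The sample records are made up for illustration.

```python
import csv
import io
import json


def recipes_to_csv(records, out_file, sep=";"):
    """Write one CSV row per recipe; the ingredient list is joined into one field."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "cuisine", "ingredients"])
    for rec in records:
        writer.writerow([rec["id"], rec["cuisine"], sep.join(rec["ingredients"])])


# Tiny made-up sample standing in for the real training file (assumed layout).
sample_json = """[
  {"id": 1, "cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes"]},
  {"id": 2, "cuisine": "indian", "ingredients": ["cumin", "garlic", "rice"]}
]"""

buffer = io.StringIO()
recipes_to_csv(json.loads(sample_json), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For the real data, the in-memory buffer would be replaced by a file opened with `open(path, "w", newline="")`.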
Baseline Performance:
To get some baseline performance, I used a bag-of-words approach. Using Python, I created a bag-of-ingredients in which each distinct ingredient is stored only once. I then created another CSV file with the attributes id and cuisine plus one attribute per ingredient, which ended up with 6726 attributes. The Python script builds this bag-of-ingredients representation for 1000 instances, assigning ‘1’ if the ingredient appears in that instance and ‘0’ if it does not.
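A minimal sketch of the bag-of-ingredients encoding described above, under the assumption that each recipe is a dictionary with id, cuisine, and a list of ingredients. The function names and the sample recipes are hypothetical and only illustrate the idea.

```python
def build_vocabulary(recipes):
    """Collect each distinct ingredient exactly once, in sorted order."""
    return sorted({ing for rec in recipes for ing in rec["ingredients"]})


def to_binary_rows(recipes, vocabulary):
    """One row per recipe: id, cuisine, then 1 or 0 for each vocabulary ingredient."""
    rows = []
    for rec in recipes:
        present = set(rec["ingredients"])
        flags = [1 if ing in present else 0 for ing in vocabulary]
        rows.append([rec["id"], rec["cuisine"]] + flags)
    return rows


# Made-up recipes for illustration only.
recipes = [
    {"id": 1, "cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes"]},
    {"id": 2, "cuisine": "indian", "ingredients": ["cumin", "garlic", "rice"]},
]

vocabulary = build_vocabulary(recipes)
header = ["id", "cuisine"] + vocabulary
rows = to_binary_rows(recipes, vocabulary)

print(header)
for row in rows:
    print(row)
```

With the full data, the header would contain one column per distinct ingredient, which is where the large attribute count comes from.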
Screenshot of the new CSV file:
After importing it into Weka, I ran the ZeroR classifier with 10-fold cross-validation. Below is a screenshot of the statistics:
It classified very few instances correctly, with a kappa of 0, and the instances were classified as Italian cuisine. This is because ZeroR ignores the ingredients and always predicts the most frequent class in the training data, and the number of Italian instances in the data set was far greater than that of any other cuisine.
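To see why a majority-class predictor ends up with a kappa of 0, here is a small sketch in plain Python. It is not Weka's implementation: the helper names are hypothetical, the labels are made up, and it evaluates on the same labels it was fitted on rather than using cross-validation.

```python
from collections import Counter


def zero_r_predict(train_labels, n_predictions):
    """Ignore all attributes and always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_predictions


def cohen_kappa(actual, predicted):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if expected == 1:
        return 0.0  # degenerate case: only one class present
    return (observed - expected) / (1 - expected)


# Made-up label distribution in which Italian is the largest class.
labels = ["italian"] * 5 + ["mexican"] * 3 + ["indian"] * 2
predictions = zero_r_predict(labels, len(labels))
accuracy = sum(a == p for a, p in zip(labels, predictions)) / len(labels)
kappa = cohen_kappa(labels, predictions)

print(accuracy, kappa)
```

Because every prediction is the same class, the observed agreement equals the agreement expected by chance, so the numerator of kappa is zero no matter how large the majority class is.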
Then I ran the J48 algorithm. A screenshot is attached below:
This gave much better accuracy than ZeroR, although fewer than 50% of the instances were classified correctly and the kappa was 0.396. I would like to keep this as my baseline performance and improve the kappa over the semester.
Plan for improving the result:
- Run the algorithms on more than 1000 instances
- Clean up the data
- Remove not-so-influential attributes
- Apply algorithms better suited to this problem
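One possible way to start on the attribute-removal item is to drop ingredients that occur in very few recipes. The sketch below is only an assumption about how that could be done, not a method decided on in this report: the function name and the min_recipes threshold are hypothetical, and frequency is a crude proxy for influence, since a rare ingredient can still be highly characteristic of one cuisine.

```python
from collections import Counter


def frequent_ingredients(recipes, min_recipes=2):
    """Return the ingredients that appear in at least `min_recipes` recipes."""
    # set() ensures an ingredient is counted once per recipe.
    counts = Counter(ing for rec in recipes for ing in set(rec["ingredients"]))
    return sorted(ing for ing, n in counts.items() if n >= min_recipes)


# Made-up recipes for illustration only.
recipes = [
    {"id": 1, "cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes"]},
    {"id": 2, "cuisine": "indian", "ingredients": ["cumin", "garlic", "rice"]},
    {"id": 3, "cuisine": "italian", "ingredients": ["olive oil", "basil", "garlic"]},
]

kept = frequent_ingredients(recipes, min_recipes=2)
print(kept)
```

Only the kept ingredients would then become columns in the CSV file, which shrinks the attribute count before the data is loaded into Weka.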