Predict Future Sales
Final project for "How to win a data science competition" Coursera course
The dataset is a time series of daily sales data from 1C Company, a Russian software firm. The goal of the competition is to predict the total sales of every product in every store for the next month.
- items = items.csv
- shops = shops.csv
- categories = item_categories.csv
- train = sales_train.csv
- test = test.csv
ID - an Id that represents a (Shop, Item) tuple within the test set
shop_id - unique identifier of a shop
item_id - unique identifier of a product
item_category_id - unique identifier of item category
item_cnt_day - number of products sold. We are predicting a monthly amount of this measure
item_price - current price of an item
date - date in format dd/mm/yyyy
date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
item_name - name of item
shop_name - name of shop
item_category_name - name of item category
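As a small worked example of the date_block_num definition above, the month number can be derived from a parsed date. This is only a sketch: the helper name `month_index` is hypothetical, and the parsing format assumes the dd/mm/yyyy layout stated in the field list (adjust the separator if the files differ).

```python
import pandas as pd

def month_index(dates: pd.Series) -> pd.Series:
    """Map dates to consecutive month numbers, with January 2013 as 0."""
    return (dates.dt.year - 2013) * 12 + (dates.dt.month - 1)

# Toy dates written in the dd/mm/yyyy format described in the field list.
raw = pd.Series(["01/01/2013", "15/02/2013", "31/10/2015"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
blocks = month_index(parsed)
print(blocks.tolist())  # [0, 1, 33]
```

The last value matches the statement that October 2015 is month 33.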
1. Remove outliers
- Remove outliers from the item_cnt_day and item_price attributes
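A minimal sketch of the outlier step, assuming simple upper cut-offs on both attributes. The threshold values and the function name `remove_outliers` are hypothetical placeholders; suitable limits would come from inspecting the actual distributions (for example with box plots).

```python
import pandas as pd

# Hypothetical cut-offs; choose real ones after inspecting the data.
MAX_ITEM_CNT_DAY = 1000
MAX_ITEM_PRICE = 100_000

def remove_outliers(train: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose item_cnt_day and item_price are below the cut-offs."""
    mask = (train["item_cnt_day"] < MAX_ITEM_CNT_DAY) & (
        train["item_price"] < MAX_ITEM_PRICE
    )
    return train.loc[mask].reset_index(drop=True)

# Toy rows: the second has an extreme price, the third an extreme count.
toy = pd.DataFrame({
    "item_cnt_day": [1, 2, 2500, 1],
    "item_price": [299.0, 150_000.0, 10.0, 499.0],
})
cleaned = remove_outliers(toy)
print(len(cleaned))  # 2
```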
2. Clean training data
Some shops appear more than once under slightly different names although they are the same store at the same location (possibly reopened). Replace their shop_id values with a single id.
(e.g. “Zhukovsky st. Chkalov 39m?” and “Zhukovsky st. Chkalov 39m²”)
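The shop_id replacement can be sketched with a small mapping from each duplicate id to the id that should be kept. The mapping values below are hypothetical and only illustrate the mechanism; the real pairs have to be read off shops.csv. The same mapping would be applied to both the train and test sets.

```python
import pandas as pd

# Hypothetical mapping: duplicate shop_id -> shop_id to keep.
DUPLICATE_SHOPS = {10: 11}

def unify_shop_ids(df: pd.DataFrame, mapping=DUPLICATE_SHOPS) -> pd.DataFrame:
    """Return a copy of df in which duplicate shop ids are replaced by the kept id."""
    out = df.copy()
    out["shop_id"] = out["shop_id"].replace(mapping)
    return out

toy_train = pd.DataFrame({"shop_id": [10, 11, 5], "item_id": [1, 1, 2]})
unified = unify_shop_ids(toy_train)
print(unified["shop_id"].tolist())  # [11, 11, 5]
```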
Encode each shop’s city and category as unique numerical codes
Clean the item category data by splitting each category name on ‘-’: the part before the ‘-’ is the type, and the part after it is the subtype
Encode the category attributes as numerical codes
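The category clean-up might look like the sketch below. It assumes that a category name without a ‘-’ uses its type as its subtype, and it uses `pd.factorize` for the numerical encoding (a label encoder would work equally well). The toy category names and the intermediate `type`/`subtype` column names are made up for illustration.

```python
import pandas as pd

def encode_categories(categories: pd.DataFrame) -> pd.DataFrame:
    """Split item_category_name on the first '-' and encode both parts as integers."""
    out = categories.copy()
    parts = out["item_category_name"].str.split("-", n=1)
    out["type"] = parts.str[0].str.strip()
    # With no '-', the list has one element, so the subtype equals the type.
    out["subtype"] = parts.str[-1].str.strip()
    out["item_category_type_code"] = pd.factorize(out["type"])[0]
    out["item_category_sub_type_code"] = pd.factorize(out["subtype"])[0]
    return out

# Hypothetical category names, only to show the split.
toy_categories = pd.DataFrame({
    "item_category_id": [0, 1, 2, 3],
    "item_category_name": ["Games - PS4", "Games - XBOX", "Books - Audiobooks", "Delivery"],
})
encoded = encode_categories(toy_categories)
print(encoded["item_category_type_code"].tolist())  # [0, 0, 1, 2]
```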
1. Explore the prediction constraints
- Test set attributes
- Competition’s evaluation method
2. Transform the training set by generating every combination of shop_id, item_id and date_block_num
3. For every combination of date_block_num, shop_id and item_id, sum item_cnt_day to obtain the item count per month (item_cnt_month)
4. Append the test set to the training set, setting its date_block_num to 34 (the month to predict)
5. Merge all dimension tables
At this point the variable matrix contains all the attributes needed to predict item_cnt_month. It is the result of combining all the processed dimension data: shops, items, and item categories.
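The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's exact code: the function names `build_matrix` and `add_dimensions` are hypothetical, combinations are generated per month from the shops and items seen in that month, and pairs without sales are assumed to have a monthly count of 0. The toy tables are invented.

```python
from itertools import product

import pandas as pd

KEYS = ["date_block_num", "shop_id", "item_id"]

def build_matrix(train: pd.DataFrame, test: pd.DataFrame, test_block: int = 34) -> pd.DataFrame:
    # Step 2: every (shop, item) combination seen in each month.
    rows = []
    for block, month in train.groupby("date_block_num"):
        rows.extend(product([block], month["shop_id"].unique(), month["item_id"].unique()))
    matrix = pd.DataFrame(rows, columns=KEYS).astype("int64")

    # Step 3: sum the daily counts into item_cnt_month.
    monthly = (
        train.groupby(KEYS, as_index=False)["item_cnt_day"].sum()
        .rename(columns={"item_cnt_day": "item_cnt_month"})
    )
    matrix = matrix.merge(monthly, on=KEYS, how="left")
    matrix["item_cnt_month"] = matrix["item_cnt_month"].fillna(0)  # assumption: no sales -> 0

    # Step 4: append the test pairs as month 34; their target stays NaN.
    future = test[["shop_id", "item_id"]].assign(date_block_num=test_block)
    return pd.concat([matrix, future[KEYS]], ignore_index=True)

def add_dimensions(matrix, shops, items, categories):
    # Step 5: attach the processed dimension tables.
    return (
        matrix.merge(shops, on="shop_id", how="left")
        .merge(items, on="item_id", how="left")
        .merge(categories, on="item_category_id", how="left")
    )

train = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [1, 1, 2, 1],
    "item_id": [10, 10, 11, 10],
    "item_cnt_day": [1, 2, 1, 3],
})
test = pd.DataFrame({"shop_id": [1, 2], "item_id": [10, 11]})
shops = pd.DataFrame({"shop_id": [1, 2], "shop_city_code": [0, 1]})
items = pd.DataFrame({"item_id": [10, 11], "item_category_id": [5, 6]})
categories = pd.DataFrame({"item_category_id": [5, 6], "item_category_type_code": [0, 1]})

matrix = build_matrix(train, test)
full = add_dimensions(matrix, shops, items, categories)
print(full.shape)  # (7, 7)
```

Filling the training rows with 0 before appending the test rows keeps the month-34 targets as NaN, which makes them easy to separate again later.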
Basics of feature engineering
- Feature engineering is the process of transforming the given data into a form that is easier for a model to interpret.
- The goal of feature engineering is to provide strong and ideally simple relationships between the new input features and the output feature for the supervised learning algorithm to model.
Lag Feature (Target Lag)
- Predict the value at the next time (t+1) given the value at the previous time (t-1)
- Shifting the dataset by 1 creates the t-1 column and adds a NaN (unknown) value for the first row. The unshifted time series represents t+1
- For example, if the first 3 values of a temperature dataset are 20.7, 17.9, and 18.8, the shifted t-1 column begins NaN, 20.7, 17.9.
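The shift described above can be reproduced with pandas on the three temperature values, followed by a sketch of the same idea for a monthly sales table. The helper `add_lag` and the toy `sales` table are hypothetical; the helper assumes one row per (date_block_num, shop_id, item_id) and builds the lag by moving the month number forward and merging back.

```python
import pandas as pd

# Plain shift on the temperature example: the first t-1 value is unknown (NaN).
temps = pd.DataFrame({"t+1": [20.7, 17.9, 18.8]})
temps["t-1"] = temps["t+1"].shift(1)
print(temps["t-1"].tolist())  # [nan, 20.7, 17.9]

KEYS = ["date_block_num", "shop_id", "item_id"]

def add_lag(matrix: pd.DataFrame, lags, col: str) -> pd.DataFrame:
    """Add col_lag_k columns holding the value of col from k months earlier."""
    out = matrix
    for lag in lags:
        shifted = matrix[KEYS + [col]].copy()
        shifted["date_block_num"] += lag  # month m's value lines up with month m + lag
        shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
        out = out.merge(shifted, on=KEYS, how="left")
    return out

sales = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [1, 1, 1],
    "item_id": [10, 10, 10],
    "item_cnt_month": [3.0, 5.0, 2.0],
})
lagged = add_lag(sales, [1], "item_cnt_month")
```

Merging on the shifted month, rather than calling `shift` directly, keeps the lag correct even when a shop and item pair is missing from some months.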
Combining different attributes to form new features
- date_block_num, item_category_type_code -> date_cat_avg_item_cnt
- date_block_num, shop_id, item_category_id -> date_shop_cat_avg_item_cnt
- date_block_num, shop_id, item_category_type_code -> date_shop_type_avg_item_cnt
- date_block_num, shop_id, item_category_sub_type_code -> date_shop_subtype_avg_item_cnt
- date_block_num, item_category_sub_type_code -> date_subtype_avg_item_cnt
- date_block_num, shop_city -> date_city_avg_item_cnt
- date_block_num, item_id, shop_city -> date_item_city_avg_item_cnt
- date_item_city_avg_item_cnt -> date_type_avg_item_cnt
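Each of these features can be read as the mean of item_cnt_month within the listed group of attributes. A minimal sketch, assuming that interpretation: the helper `add_group_mean` is hypothetical, and the toy table only contains the columns needed for one of the features.

```python
import pandas as pd

def add_group_mean(matrix: pd.DataFrame, group_cols, name: str,
                   target: str = "item_cnt_month") -> pd.DataFrame:
    """Attach the mean of target within each group as a new column called name."""
    means = (
        matrix.groupby(group_cols, as_index=False)[target].mean()
        .rename(columns={target: name})
    )
    return matrix.merge(means, on=group_cols, how="left")

toy_matrix = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [1, 1, 2, 1],
    "item_category_id": [5, 5, 5, 6],
    "item_cnt_month": [2.0, 4.0, 1.0, 3.0],
})
with_avg = add_group_mean(
    toy_matrix,
    ["date_block_num", "shop_id", "item_category_id"],
    "date_shop_cat_avg_item_cnt",
)
print(with_avg["date_shop_cat_avg_item_cnt"].tolist())  # [3.0, 3.0, 1.0, 3.0]
```

Because the target is unknown for the month being predicted, averages like these are typically lagged before they are used as model inputs.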
Finding price trend