Sep 2017 - Dez 2017
Language: Python3
Designed and developed supervised Machine Learning models for predicting rainfall at a particular location. The project was carried out by 2 awesome teammates and myself, in hopes of getting a good grade on Prof. Sameer Singh’s course of Machine Learning.
Courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, our data consisted of satellite-based measurements of temperature at particular locations across the globe (infrared imaging) and information about clouds (such as area and average temperature). Each data point corresponded to a location on the globe (identified by its latitude, longitude and elevation) and was labeled with CHRS’s belief of whether that particular location will admit rain.
We tried the following models:
Random Forests turned out to be the model scoring highest in validation AUC, with scores above 0.79. We used hold-out validation with a training fraction of 80% of the data.
Experiments were carried out using scikit-learn
’s implementations of Random
Forests (both RandomForestRegressor
and ExtraTreesRegressor
) and we
experimented different configurations of the following parameters (see code at
the end):
max_depth
. The maximum depth on all the decision trees. If None
, the
depth is unrestrictedmin_samples_split
. The minimum number of samples to split a node (e.g. minParent)min_samples_leaf
. The minimum number of samples to form a leafmax_features
. The maximum size of the subsample of features considered in splittingn_estimators
. The number of decision trees generatedbootstrap
. Either True or False. In scikit-learn
, the subsample of the data (drawn with or without replacement) will always have the size as the data itselfIn the end, an unlimited max_depth
together with small sizes of feature
subsampling (max_features
=2) and a large number of decision trees
(n_estimators
=300) turned out to be a very good configuration of params. The
remaining parameters revealed little influence in validation AUC. See below for
a plot of training and validation AUC’s varying max_features
and
n_estimators
, with and without bootstrap
.
RandomForestRegressor
with
bootstrap
, min_samples_split
=min_samples_leaf
=5 and max_depth
=None
.
RandomForestRegressor
without
bootstrap
, min_samples_split
=min_samples_leaf
=5 and max_depth
=None
.
$~$
The following plots are similar to the ones below, but using
ExtraTreesRegressor
instead.
ExtraTreesRegressor
with
bootstrap
, min_samples_split
=min_samples_leaf
=5 and max_depth
=None
.
ExtraTreesRegressor
without
bootstrap
, min_samples_split
=min_samples_leaf
=5 and max_depth
=None
.
# Consistent behavior
np.random.seed(0)
def loadData(filename):
""" Load data from binary cache if possible for efficiency. """
f = os.path.splitext(filename)[0] + '.npy'
if os.path.isfile(f):
D = np.load(f) # faster than genfromtxt
else:
D = np.genfromtxt(filename, delimiter = None)
np.save(f, D)
return D
def gen_params(**params_ranges):
params_ranges = {k: [(k, v) for v in params_ranges[k]] for k in params_ranges}
return map(dict, itertools.product(*params_ranges.values()))
if __name__ == '__main__':
timestamp = str(int(time.time()))
# Prepare output folder for results
date = datetime.fromtimestamp(time.time()).strftime('%m-%d_%H-%M-%S')
# Data Loading
X = loadData('data/X_train.txt')
Y = loadData('data/Y_train.txt')
X, Y = ml.shuffleData(X,Y)
m, n = X.shape
Xtr, Xva, Ytr, Yva = ml.splitData(X, Y)
Xt, Yt = Xtr, Ytr
max_depth = [None]
min_samples_split = [10]
min_samples_leaf = [10]
max_features = [2]
n_estimators = [100]
bootstrap = [True]
type = ['RF']
params_ranges = {p: eval(p) for p in ['max_depth',
'min_samples_split',
'min_samples_leaf',
'max_features',
'n_estimators',
'bootstrap',
'type']}
results = []
for params in gen_params(**params_ranges):
t = params.pop('type')
if t == 'RF':
RF = RandomForestRegressor(n_jobs = -1, random_state = 0, **params)
else:
RF = ExtraTreesRegressor(n_jobs = -1, random_state = 0, **params)
RF.fit(Xt, Yt)
params['AUCt'] = roc_auc_score(Yt, RF.predict(Xt))
params['AUCv'] = roc_auc_score(Yva, RF.predict(Xva))
params['type'] = t
results.append(params)
if saveResults:
with open('experiments/' + timestamp + '.json', 'w') as f:
json.dump(results, f, indent = 2)