Predicting the success of a crowdfunding campaign
on Kickstarter
before the campaign is launched
Project percentage per country
Outcome distribution
Feature importance from a random forest classifier
over the full set of numerical features
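A feature-importance ranking of this kind can be sketched roughly as follows. This is a minimal example on synthetic data, assuming scikit-learn. The column names (`goal`, `duration_days`, `backers_count`) are hypothetical stand-ins, with `backers_count` playing the role of a variable that is only known after launch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the numerical features (hypothetical column names).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "goal": rng.lognormal(8, 1, n),
    "duration_days": rng.integers(7, 61, n),
    "backers_count": rng.poisson(30, n),  # only known after the campaign starts
})
# Toy target, deliberately driven by the after-launch variable.
y = (X["backers_count"] > 30).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A variable that dominates the ranking but is unavailable before launch leaks the outcome, which is why such columns have to be dropped for a pre-launch predictor.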
"Posterior" variables (in red) are dropped
Remaining numerical features:
TF-IDF has been applied to:
from project blurbs
Models over trigrams were aborted because of excessive computing time (more than 4 days on an m4.4xlarge instance on AWS)
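The effect of the n-gram range on the size of the feature space can be illustrated with a small sketch. It assumes scikit-learn's `TfidfVectorizer`; the blurbs are invented examples, the lemmatization step is omitted, and the choice of unigrams plus bigrams as the retained range is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical blurbs standing in for the real project descriptions.
blurbs = [
    "A board game about building cities",
    "A documentary film about city gardens",
    "Handmade leather wallets built to last",
]

# ngram_range=(1, 2) keeps unigrams and bigrams.
uni_bi = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X_text = uni_bi.fit_transform(blurbs)  # sparse matrix, one row per blurb

# ngram_range=(1, 3) adds trigrams and therefore a larger vocabulary.
uni_bi_tri = TfidfVectorizer(ngram_range=(1, 3), stop_words="english").fit(blurbs)

print(X_text.shape)
print(len(uni_bi.vocabulary_), len(uni_bi_tri.vocabulary_))
```

On a real corpus the vocabulary grows much faster with trigrams than in this toy case, which is consistent with the long fitting times reported above.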
Feature type  Grid-search-tuned classifiers

Numerical 

Text 

Numeric better than text
Best models: Random Forest and AdaBoost on numeric features
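Grid-search tuning of the two best model families could look like the sketch below. It assumes scikit-learn's `GridSearchCV`, runs on synthetic data from `make_classification`, and uses small hypothetical parameter grids; the grids actually searched are not given in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the numerical feature matrix.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical, deliberately small parameter grids.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
    "adaboost": (
        AdaBoostClassifier(random_state=0),
        {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 3-fold cross-validation on the training split, scored by ROC AUC.
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_auc": search.best_score_,
        "test_accuracy": search.best_estimator_.score(X_test, y_test),
    }

for name, res in results.items():
    print(name, res)
```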
83,000 features from TF-IDF on lemmatized text
and
173 from 10 dummified numerical variables
How to combine them in a
balanced predictor?
PCA:
Applied a Random Forest to the top 200 Principal Components from the text features combined with the 173 numerical features.
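The combination step can be sketched as follows, under stated assumptions. `TruncatedSVD` is swapped in for PCA here because it works directly on sparse TF-IDF matrices; it is not necessarily the decomposition used in the slides. The data, the vocabulary, and the small number of components are all invented for the example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

rng = np.random.default_rng(0)

# Toy corpus: 200 "blurbs" of 5 words drawn from a 10-word vocabulary.
vocab = ["game", "film", "music", "album", "book",
         "design", "art", "app", "comic", "food"]
blurbs = [" ".join(rng.choice(vocab, size=5)) for _ in range(200)]

# Stand-in for the dummified numerical features (4 columns instead of 173).
X_num = rng.normal(size=(200, 4))
y = (X_num[:, 0] > 0).astype(int)  # toy target

X_text = TfidfVectorizer().fit_transform(blurbs)  # sparse TF-IDF matrix

# Keep the leading components of the text features (5 here, 200 in the slides).
svd = TruncatedSVD(n_components=5, random_state=0)
X_text_reduced = svd.fit_transform(X_text)
print("share of text variance kept:", svd.explained_variance_ratio_.sum())

# Stack reduced text features next to the numerical ones, then fit the forest.
X_combined = np.hstack([X_text_reduced, X_num])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_combined, y)
```

In a real evaluation the vectorizer and the decomposition would be fitted on the training split only and then applied to the test split.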
ENSEMBLE:
Took the sample-by-sample majority vote of the basic classifiers introduced previously, on both text and numeric features.
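A sample-by-sample majority vote over binary predictions can be sketched in a few lines. The function name `majority_vote` and the tie-breaking rule are assumptions made for the example; the prediction values are invented.

```python
import numpy as np


def majority_vote(predictions):
    """Majority vote per sample over binary (0/1) predictions.

    predictions: array-like of shape (n_classifiers, n_samples).
    Ties, possible with an even number of classifiers, go to class 0.
    """
    predictions = np.asarray(predictions)
    votes_for_success = predictions.sum(axis=0)
    return (votes_for_success > predictions.shape[0] / 2).astype(int)


# One row per classifier, one column per sample.
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
ensembled = majority_vote(preds)
print(ensembled)  # [1 0 1 1]
```

Voting on stored predictions, rather than wrapping the models in a single voting estimator, lets classifiers trained on different inputs (text or numeric) take part in the same vote.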
Comparison against
best on numeric (AdaBoost) and best on text (Random Forest)
Feature type  Model  AUC  Accuracy  Precision  Recall
Numerical
Text
Combined
Ensembled
Model  AUC  Accuracy on train  Accuracy  Precision  Recall
The improvement from the PCA and ensembled classifiers was limited, for different reasons:
PCA case: the top 200 components capture "only" 8% of the total variance of the text features. The remaining 92% is left out of the model.
Ensembled case: the best models used in the ensembled classifier make the same mistakes: 89% of the errors in the training set and 90% of the errors in the test set occur on the same samples. The ensembled classifier cannot do any better on them.
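One way to measure how much two classifiers' errors coincide is sketched below. The slides do not define the overlap measure, so the definition used here (samples misclassified by both models, divided by samples misclassified by at least one) is an assumption, as are the function name and the toy labels.

```python
def shared_error_rate(y_true, pred_a, pred_b):
    """Share of misclassified samples that both models get wrong.

    Computed as |errors_a & errors_b| / |errors_a | errors_b|.
    Returns 0.0 when neither model makes any error.
    """
    errors_a = {i for i, (t, p) in enumerate(zip(y_true, pred_a)) if t != p}
    errors_b = {i for i, (t, p) in enumerate(zip(y_true, pred_b)) if t != p}
    union = errors_a | errors_b
    if not union:
        return 0.0
    return len(errors_a & errors_b) / len(union)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_a = [1, 1, 0, 1, 0, 1, 1, 0]  # wrong on samples 1, 2, 5
pred_b = [1, 1, 0, 1, 0, 0, 0, 0]  # wrong on samples 1, 2, 6
overlap = shared_error_rate(y_true, pred_a, pred_b)
print(overlap)  # 0.5
```

When this share is close to 1, a majority vote has little room to correct any single model, because the voters are wrong together.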
A hint for the next steps: take a closer look at the samples that most classifiers misclassify.
Among the 3 best, equivalently performing classifiers, PCA(text) + Num is the recommended one: AdaBoost is much slower to fit, while Ensembled is slower still, since it needs all the other classifiers to be fitted first.
Num+PCA(txt) is the recommended classifier