Kickstarter

Predicting crowdfunding projects' outcomes

Mauro Gentile @ Metis, NYC

Objective

Predicting the success of a crowdfunding campaign
in Kickstarter
before the campaign is launched

Outline

  1. Dataset introduction
  2. Predicting models:
    • Basic
    • Combined
    • Ensembled
  3. Final thoughts

Dataset introduction

  • 231,000 projects, Oct. 2015 - Jan. 2018 (credits: WebRobots)
  • 10 numerical or quantifiable features
  • 1 text feature: project blurbs
  • Balanced target

Project percentage per country

Outcome distribution

Assessment of numerical feature importance

Feature importances from a random forest classifier
trained on the full set of numerical features

"Posterior" variables (in red) are dropped
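This importance screening can be reproduced with scikit-learn's `feature_importances_` attribute; a minimal sketch on synthetic data (the matrix shapes and signal structure here are illustrative, not the actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the numerical feature matrix: only the first
# two columns actually carry signal about the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = clf.feature_importances_  # one weight per feature, summing to 1

# Rank features: low-ranked (or "posterior") columns are candidates to drop.
ranking = np.argsort(importances)[::-1]
```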

Remaining numerical features:

  • Goal (target amount)
  • Staff pick
  • Genre/subgenre (dummified)
  • Year and month
  • Campaign duration
  • Set-up duration
  • Blurb length
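"Dummified" here means expanding each categorical column into one-hot indicator columns, e.g. with pandas (the column names and values below are made-up examples, not the real schema):

```python
import pandas as pd

df = pd.DataFrame({
    "goal": [5000, 12000, 800],
    "genre": ["games", "film", "games"],
    "month": [3, 11, 7],
})

# One-hot encode the categorical column; numeric columns pass through,
# so 10 raw variables can expand to 173 model features.
X = pd.get_dummies(df, columns=["genre"])
```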

Text feature

Tf-idf was applied to:

  • ~20k unique lemmatized uni-grams
  • ~83k unique lemmatized bi-grams
  • ~120k unique lemmatized tri-grams

from project blurbs

Models over tri-grams were abandoned due to excessive computing time (>4 days on an m4.4xlarge AWS instance)

Basic models applied

Feature type        Grid-search-tuned classifiers

Numerical
  • AdaBoost
  • Decision Tree
  • Random Forest
  • Logistic Regression
  • Logistic Regression (normalized)
  • Naive Bayes
  • kNN
  • kNN (normalized)

Text
  • Logistic Regression
  • Naive Bayes
  • Random Forest

Basic models results

Numeric features outperform text features
Best models: Random Forest and AdaBoost on numeric features

COMBINING NUMERICAL AND TEXT FEATURES

83,000 features from tf-idf on lemmatized text
and
173 from 10 dummified numerical variables

How to combine them in a
balanced predictor?

EXPERIMENTED OPTIONS

  • PCA:

    Applied a Random Forest to the top 200 Principal Components from the text features combined with the 173 numerical features.

  • ENSEMBLE:

    Took the sample-by-sample majority vote of the basic classifiers previously introduced, from both text and numeric features.
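Both options can be sketched in a few lines (shapes, densities, and the toy predictions below are illustrative; on a sparse tf-idf matrix the "PCA" step is typically done with TruncatedSVD):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# --- Option 1: PCA on text features, concatenated with numerical dummies ---
X_text = sparse_random(300, 1000, density=0.01, random_state=0, format="csr")  # stand-in tf-idf
X_num = np.random.default_rng(0).normal(size=(300, 173))                       # stand-in dummies
svd = TruncatedSVD(n_components=200, random_state=0)
X_combined = np.hstack([svd.fit_transform(X_text), X_num])  # 300 x 373

# --- Option 2: sample-by-sample majority vote over base classifiers ---
preds = np.array([[1, 0, 1, 1],    # predictions of classifier A
                  [1, 1, 0, 1],    # classifier B
                  [0, 0, 1, 1]])   # classifier C
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
```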

COMBINED MODELS' RESULTS

Comparison against
the best numeric model (AdaBoost) and the best text model (Random Forest)

Results

F. type     Model                              AUC   Accuracy  Precision  Recall
Numerical   AdaBoost                           0.94  0.85      0.84       0.85
Numerical   Random Forest                      0.93  0.84      0.85       0.84
Numerical   kNN (normalized)                   0.91  0.82      0.82       0.80
Numerical   kNN                                0.87  0.76      0.75       0.75
Numerical   Decision Tree                      0.84  0.75      0.76       0.72
Numerical   Logistic Regression (normalized)   0.83  0.74      0.77       0.67
Numerical   Logistic Regression                0.83  0.74      0.77       0.66
Numerical   Naive Bayes                        0.77  0.71      0.73       0.67
Text        Random Forest                      0.90  0.80      0.81       0.77
Text        Logistic Regression                0.83  0.78      0.78       0.77
Text        Naive Bayes                        0.79  0.71      0.72       0.68
Combined    Num+PCA(txt)                       0.95  0.86      0.89       0.83
Ensembled   Ensembled                          0.94  0.87      0.87       0.83

Results

Model                      AUC   Acc. (train)  Accuracy  Precision  Recall
AdaBoost                   0.94  0.83          0.85      0.84       0.85
Random Forest              0.93  0.84          0.84      0.85       0.84
kNN (normalized)           0.91  0.80          0.82      0.82       0.80
Random Forest (txt)        0.90  0.78          0.80      0.81       0.77
Logistic Regression (txt)  0.83  0.76          0.78      0.78       0.77
Num+PCA(txt)               0.95  0.85          0.86      0.89       0.83
Ensembled                  0.94  0.91          0.87      0.87       0.83

RESULTS' REVIEW

The improvement from the PCA and ensembled classifiers was limited, for different reasons:


  • PCA case: the top 200 components capture only 8% of the total variance of the text features; the remaining 92% stays out of the model

  • Ensembled case: the best models used in the ensembled classifier make the same mistakes: 89% of the errors in the training set and 90% of the errors in the test set occur on the same samples, so the ensembled classifier cannot do any better on them

A hint for the next steps: take a closer look at the samples where most classifiers make classification errors
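The error-overlap figure can be computed directly from the prediction vectors; a toy sketch (the arrays below are made-up, not the real predictions):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
pred_a = np.array([1, 1, 0, 1, 1, 1])  # e.g. one base classifier
pred_b = np.array([0, 1, 0, 1, 0, 1])  # e.g. another base classifier

err_a = pred_a != y_true   # samples classifier A gets wrong
err_b = pred_b != y_true   # samples classifier B gets wrong
shared = err_a & err_b     # samples both get wrong

# Fraction of A's errors that B also makes: when this is high (~0.9 in
# the deck), a majority vote over the two cannot fix those samples.
overlap = shared.sum() / err_a.sum()
```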

A thought on timing

Among the three best and comparably performing classifiers, Num+PCA(txt) is the recommended one: AdaBoost is much slower to fit, while the ensemble is even worse, since it needs all the other classifiers to be fit first

Num+PCA(txt) is the recommended classifier

Thank You

Questions?