dynamic rnn vs rnn
Out-of-bag (OOB) error in random forests
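As a minimal sketch of the OOB idea (assuming scikit-learn is available): each tree is evaluated on the bootstrap samples it did not see during training, which yields a validation estimate without a held-out set.

```python
# Sketch: estimating generalization error from the out-of-bag (OOB)
# samples of a random forest, using scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True scores each tree on the bootstrap samples it did NOT
# see during training, giving a free validation estimate.
clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                             bootstrap=True, random_state=0)
clf.fit(X, y)

oob_error = 1.0 - clf.oob_score_   # OOB error = 1 - OOB accuracy
print(f"OOB error: {oob_error:.3f}")
```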
Understanding the Bias-Variance Tradeoff
Maximum Entropy models — entropy, conditional entropy, joint entropy, relative entropy, mutual information and the relationships among them, and the maximum entropy model.
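The relationships among these quantities can be checked numerically. A minimal sketch, using a made-up joint distribution p(x, y) over two binary variables:

```python
# Entropy, joint entropy, conditional entropy and mutual information
# computed from a small (hypothetical) joint distribution p(x, y).
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# marginals
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_xy = H(p_xy)             # joint entropy H(X, Y)
H_x, H_y = H(p_x), H(p_y)
H_y_given_x = H_xy - H_x   # chain rule: H(Y|X) = H(X, Y) - H(X)
I_xy = H_x + H_y - H_xy    # mutual information I(X; Y)

print(H_x, H_y_given_x, I_xy)
```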
It was proposed by Microsoft in the paper Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-regularized Objectives at ICML 2007. You can also find the source code and an executable runner via this link.
This model is optimized by a method similar to L-BFGS, but it can achieve a sparse model with an L1 regularizer. I recommend you try this model and compare it with the other models you are using on your dataset, for four reasons:

- It's fast, especially when the dataset is huge.
- It can generate state-of-the-art prediction results on most datasets.
- It's stable, and there are few parameters that need to be tuned. Actually, I find that only the regularization parameters impact performance noticeably.
- It's sparse, which is very important for big datasets and real products. (Of course, the sparsity comes from the L1 regularizer, not from the specific optimization method.)

One problem is that it's more challenging to implement yourself, so you need to spend some time to make it support incremental updates or online learning.
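The sparsity point can be illustrated without OWLQN itself. The sketch below uses scikit-learn's liblinear solver (an assumption for illustration, not the OWLQN implementation the post refers to) only to show that L1 regularization zeros out most weights while L2 keeps them all nonzero:

```python
# L1 vs L2 sparsity on a logistic regression. This uses scikit-learn's
# liblinear solver, NOT OWL-QN; the point is only the sparsity effect.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("nonzero weights, L1:", np.count_nonzero(l1.coef_))  # few survive
print("nonzero weights, L2:", np.count_nonzero(l2.coef_))  # essentially all
```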
It was proposed by Google in the paper Ad Click Prediction: a View from the Trenches in 2013. I tried it on my dataset, and this implementation generates prediction performance similar to OWLQN. It's quicker than OWLQN for training, and it's also sparse. One advantage is that it's very easy to implement, and it supports incremental updates naturally. One pain point for me is that this model has 3-4 parameters that need to be chosen, and most of them impact the prediction performance noticeably.
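Since ease of implementation is one of its selling points, here is a minimal sketch of the FTRL-Proximal per-coordinate update for logistic loss, following the parameter names (alpha, beta, lambda1, lambda2) in the Google paper; the toy data stream below is made up:

```python
# Minimal FTRL-Proximal sketch for logistic loss with sparse binary
# features (x is a list of active feature indices).
import math
from collections import defaultdict

class FTRL:
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-coordinate z_i
        self.n = defaultdict(float)  # per-coordinate sum of g_i^2

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:        # L1 shrinkage -> exact zeros
            return 0.0
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        s = sum(self._weight(i) for i in x)
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):          # y in {0, 1}
        p = self.predict(x)
        g = p - y                    # log-loss gradient for binary features
        for i in x:
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

model = FTRL()
for _ in range(100):                 # toy stream: feature 0 => click
    model.update([0], 1)
    model.update([1], 0)
print(model.predict([0]), model.predict([1]))
```

Note how weights are never stored directly: they are recomputed lazily from z and n, which is what makes the incremental/online updates natural.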
This model was also proposed by Microsoft, in ICML 2009.
The biggest difference from the implementations above is that it's Bayesian. AdPredictor is used to predict the CTR of sponsored search ads on Bing, and on my dataset it also achieved prediction performance comparable to OWLQN and FTRL. AdPredictor models the weight of each feature with a Gaussian distribution, so it naturally supports online learning. The prediction for each sample is also a Gaussian distribution, which can be used to handle the exploration-exploitation problem. See more details of this model in another post.
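A minimal sketch of the prediction step just described: since each active weight is Gaussian, the linear score is Gaussian too, and the click probability is a probit of that score. The feature posteriors below are made-up numbers, and beta is the model's noise parameter from the paper:

```python
# AdPredictor-style prediction: Gaussian weights -> Gaussian score ->
# probit click probability. Posteriors here are hypothetical.
import math

def probit(t):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

weights = [(0.8, 0.2), (-0.3, 0.1)]   # (mean, variance) per active feature
beta = 1.0                            # noise parameter

mean = sum(mu for mu, _ in weights)   # score mean
var = sum(v for _, v in weights)      # score variance

p_click = probit(mean / math.sqrt(var + beta ** 2))
print(p_click)
```

The score variance is what supports exploration: two ads with the same mean but different variance get the same point prediction, yet the higher-variance one carries more information to gain by showing it.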
Maximum Entropy Model Tutorial Reading

This section lists some recommended papers for your further reference.
1. Maximum Entropy Approach to Natural Language Processing [Berger et al., 1996] A must-read paper on applying the maxent technique to Natural Language Processing. It describes maxent in detail and presents an incremental feature selection algorithm for incrementally constructing a maxent model, as well as several examples in statistical Machine Translation.
2. Inducing Features of Random Fields [Della Pietra et al., 1997] Another must-read paper on maxent. It deals with a more general framework, Random Fields, and proposes an Improved Iterative Scaling algorithm for estimating the parameters of Random Fields. This paper gives the theoretical background of Random Fields (and hence the maxent model). A greedy field induction method is presented to automatically construct a detailed random field from a set of atomic features, and a word morphology application for English is developed.
3. Adaptive Statistical Language Modeling: A Maximum Entropy Approach [Rosenfeld, 1996] This paper applied the ME technique to the statistical language modeling task. More specifically, it built a conditional Maximum Entropy model that incorporated traditional N-gram, distant N-gram, and trigger pair features. A significant perplexity reduction over the baseline trigram model was reported. Later, Rosenfeld and his group proposed a Whole Sentence Exponential Model that overcomes the computation bottleneck of the conditional ME model.
4. Maximum Entropy Models For Natural Language Ambiguity Resolution [Ratnaparkhi, 1998] This dissertation discusses the application of the maxent model to various natural language disambiguation tasks in detail. Several problems are attacked within the ME framework: sentence boundary detection, part-of-speech tagging, shallow parsing, and text categorization. A comparison with other machine learning techniques (Naive Bayes, Transformation-Based Learning, Decision Trees, etc.) is given.
5. The Improved Iterative Scaling Algorithm: A Gentle Introduction [Berger, 1997] This paper describes the IIS algorithm in detail. The description is easier to understand than that of [Della Pietra et al., 1997], which involves heavier mathematical notation.
6. Stochastic Attribute-Value Grammars [Abney, 1997] Abney applied the Improved Iterative Scaling algorithm to parameter estimation of Attribute-Value grammars, which cannot be correctly handled by the ERF method (though that method works for PCFGs). Random Fields is the model of choice here, with general Metropolis-Hastings sampling used to calculate feature expectations under the newly constructed model.
7. A comparison of algorithms for maximum entropy parameter estimation [Malouf, 2003] Four iterative parameter estimation algorithms were compared on several NLP tasks. L-BFGS was observed to be the most effective parameter estimation method for the Maximum Entropy model, much better than IIS and GIS. [Wallach, 2002] reported similar results for parameter estimation of Conditional Random Fields.