#paper
Project Interpolation
- classical statistics
- bias/variance trade-off
- model complexity
- overfitting
- general principles:
- you have a model, where you can tweak its model complexity: classic case of this is knn
- I actually think that knn is problematic (though, I would have said the same thing about those nonparametric kernel regression methods)
- essentially, increasing the model “complexity” here basically corresponds to making the predictions more “local” (smaller k)
- in other words, complexity here can mean many things (maybe?)
- perhaps the essence of the problem is the ability to have multiple solutions
- though, again, that’s refuted by the kernel regression example, right?
- intuition: you don’t want to go overboard with the model complexity, because at that point, you’ve overfit to the “training data”
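- to make the “more local = more complex” point concrete, here’s a minimal kNN-regression sketch (NumPy only, synthetic data; all names are my own, not from any library): with k = 1 the fit interpolates the training labels exactly (zero training error), while a large k smooths them out

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict by averaging the y-values of the k nearest training points."""
    dists = np.abs(x_train[None, :] - x_query[:, None])  # (n_query, n_train)
    idx = np.argsort(dists, axis=1)[:, :k]               # indices of k nearest
    return y_train[idx].mean(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)   # noisy smooth function

# k = 1: maximally "local" -- each training point is its own nearest
# neighbor, so the training labels are reproduced exactly
train_err_k1 = np.mean((knn_predict(x, y, x, 1) - y) ** 2)

# k = 15: much less local -- the fit is a smoother, with nonzero training error
train_err_k15 = np.mean((knn_predict(x, y, x, 15) - y) ** 2)
```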
- ah right, there’s this other notion, of training/test and generalization:
- this is a more difficult concept to pin down. basically, we think that the data comes from some “distribution” that’s going to be variable
- if you could assume that the data you’re given is always correct, then perhaps you don’t mind too much about overfitting
- though, even then, you still worry about overfitting in the sense that you don’t want to overfit on the training data (even aside from the noise problem)
- let’s think of some simulations:
- if you have noise, then you don’t really want to be overfitting
- but even if you don’t have noise, you can still be worried about overfitting (is that true?)
- that is, we usually like to assume some sort of smoothness to the function, which is really just an aesthetic (or an appeal to Occam’s Razor)
- but perhaps the act of overfitting to the noise-less case will produce something a little too jagged
- so we have training and test data, and the goal is to do better on the test data. even without looking at the test data, the intuition is that if you’ve gotten zero training error, then you’re probably doing something wrong.
- one might think that if there is noise, then necessarily this must be true, since you’ve fit to the noise
- but it seems like this depends on the flexibility of the model. that is, the model might be fitting to the noise, but it might be doing it in a very “weird” way, in that it localizes the effect of this training point
- and if there is no noise, then you’re probably fine.
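- a toy version of these simulations (polynomial regression on synthetic data; the degrees, sample size, and noise level are arbitrary choices of mine, not from any paper): fit a low-degree and a maximal-degree polynomial to noisy samples of a smooth function, and compare training vs. test error

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
x_train = np.linspace(-1, 1, n)
x_test = np.linspace(-1, 1, 200)

def true_f(x):
    return np.sin(np.pi * x)

sigma = 0.3  # noise level
y_train = true_f(x_train) + sigma * rng.normal(size=n)

def fit_and_eval(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE vs. true f)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)
    return train_mse, test_mse

train_lo, test_lo = fit_and_eval(3)    # simple model
train_hi, test_hi = fit_and_eval(11)   # degree n-1: interpolates the noisy points
```

With this setup the degree-11 fit drives the training error to (numerically) zero by fitting the noise, and its test error blows up between and beyond the training points, while the cubic stays close to the true function.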
- theory
- as we talk about in class, you can basically perform this decomposition of the expected squared prediction error (at a point \(x\)) (wiki), \(\mathbb{E}[(y - \hat{f}(x))^2] = \sigma^2 + \mathrm{bias}^2(\hat{f}(x)) + \mathrm{var}(\hat{f}(x))\), in terms of
- irreducible error (the noise in the data) \(\sigma^2\)
- squared bias, \(\mathrm{bias}^2(\hat{f}(x))\)
- variance, \(\mathrm{var}(\hat{f}(x))\)
- and so what I’m claiming is that things are very different depending on the \(x\) that we’re considering
- for \(x\) in the neighborhood of training points
- we’re going to have low bias and high variance
- but for \(x\) outside of those neighborhoods
- we’re going to have the “optimal” trade-off (maybe?)
- in this way, the decomposition doesn’t need to be thrown away - simply that the decomposition is now adaptive
- this allows you to get zero training error while still achieving a good bias/variance trade-off away from the training points
- alternative: paper (reviews) speculates that this is actually not a trade-off #paper
- as you increase the width of neural networks, you get both reduction in variance and bias
- I guess there isn’t any reason why these things must be opposed?
- in principle, if your data is pretty straightforward, for instance, a wider model could plausibly reduce bias and variance at the same time
- something weird about all this is that, in my view, the irreducible error should somehow come into play here. as in, the generalizability of the model should be a function of the irreducible error?
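- a quick Monte Carlo sanity check of the decomposition itself (kNN regression at a single query point \(x_0\); all constants here are arbitrary choices of mine): resampling the training set many times lets us estimate bias and variance at \(x_0\) directly, and verify that together with \(\sigma^2\) they account for the expected prediction error

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.5        # sqrt of the irreducible error
n, k, M = 20, 3, 2000
x0 = 0.37          # the query point we decompose at

preds = np.empty(M)
for m in range(M):
    # fresh training set each round: the "variance" comes from this resampling
    x = rng.uniform(0, 1, n)
    y = true_f(x) + sigma * rng.normal(size=n)
    nearest = np.argsort(np.abs(x - x0))[:k]
    preds[m] = y[nearest].mean()            # k-NN prediction at x0

bias2 = (preds.mean() - true_f(x0)) ** 2    # squared bias at x0
var = preds.var()                           # variance at x0

# expected squared prediction error at x0, against fresh noisy targets;
# should match sigma^2 + bias^2 + variance up to Monte Carlo error
mse = np.mean((preds - (true_f(x0) + sigma * rng.normal(size=M))) ** 2)
```

Varying \(x_0\) (near vs. far from typical training points) in this sketch is one way to probe the “adaptive decomposition” claim above.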
- classification vs regression (and the margin):
Backlinks
- [[next-steps-for-interpolation]]
- Let’s record the kinds of experimental data that we are interested in for [[project-interpolation]].
- [[matrix-completion-experiments]]
- this relates (nicely) to [[project-interpolation]]. here, we are dealing with depth of 2, and so one could argue that we’re in the regime where we’re just able to interpolate, but not well enough, so you can’t get nice generalization.
- but it feels like matrix completion is a very particular problem, and since we’re restricting ourselves to just linear neural networks, I don’t expect to see the same kind of double-dipping behaviour.
- [[master-paper-list]]
- [[project-interpolation]]
- [[does-learning-require-memorization]]
- Here’s an interesting relation to #differential_privacy. One of the motivations for this paper is that DP models (DP implies you can’t memorize) fail to achieve SOTA results on the same problems as these memorizing solutions. If you look at how these DP models fail, you see that they fail on exactly the class of problems proposed here, i.e. they cannot memorize the tail of the mixture distribution. This is definitely something to keep in mind for [[project-interpolation]].
Project Misinformation
Goal: detect disinformation
Resource: Awesome list
- context: twitter/facebook/social media.
- data:
- the text/content: you have a tweet that is basically a headline of an article, for instance
- source: where is the headline from? (e.g. nytimes)
- user covariates: demographic information of the sharer
- information cascade: this is catch-all phrase for everything that happens with the sharing of the post
- what is the sharing like? grass-roots or shared by influencers
- what are the responses like (content in the retweets, say)
- demographic of the sharers/clustering?
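- a sketch of how these covariates might be bundled for an experiment (stdlib dataclasses; every field name here is purely illustrative, not tied to any real dataset or API):

```python
from dataclasses import dataclass, field

# Hypothetical containers for the covariates listed above -- names are
# my own illustration, not from Twitter/Facebook data schemas.

@dataclass
class Cascade:
    retweet_texts: list = field(default_factory=list)           # response/retweet content
    sharer_follower_counts: list = field(default_factory=list)  # grass-roots vs. influencer

    def influencer_driven(self, threshold: int = 100_000) -> bool:
        """Crude proxy: did any sharer have a very large following?"""
        return any(c >= threshold for c in self.sharer_follower_counts)

@dataclass
class Post:
    text: str             # the tweet/headline content
    source: str           # where the headline is from, e.g. "nytimes.com"
    user_covariates: dict  # demographic info of the original sharer
    cascade: Cascade      # everything that happens with the sharing

post = Post(
    text="Example headline",
    source="example.com",
    user_covariates={"region": "unknown"},
    cascade=Cascade(
        retweet_texts=["is this real?"],
        sharer_follower_counts=[120, 250_000],
    ),
)
```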
- I make the distinction between mis- and dis-information because I think disinformation is the much more pernicious problem: what sets disinformation apart is not the politics, but the self-aware, deliberate nature of the falsehood.
Literature Review
- #paper paper on misinformation with neural network
- they use something known as “cascade model”, which takes advantage of the twitter architecture to capture responses/retweets, and use the content of the retweets to help classify the truthiness of the original tweet
- #idea there must be something here that allows you to merge ideas of nlp/tweet responses/the underlying social network
- #paper Bias Misperceived: The Role of Partisanship and Misinformation in YouTube Comment Moderation
- this is a little different, but it also deals with partisanship: there’s a dataset that has partisanship scores for websites (which then gets linked to YouTube videos in some weird way)
- thesis: is there political bias in terms of youtube comment censorship
- #paper survey on misinformation 👍
- covariates:
- source
- content
- lots of “descriptive” results on the traits of fake news headlines (longer titles, more capitalized words)
- user response (on social media) (cascade)
- propagation structure
- methods:
- cue/feature, which is basically the pre-NLP era way of doing linguistic analysis
- lie detection: linguistic cues of deception
- deep learning based methods: this is what we want to target
- #paper FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media
- supposedly, this paper shows that these kinds of methods have bad prediction scores (on the new dataset)
- feedback-based (covariates/secondary information)
- propagation
- temporal
- response text
- response users
- something that we haven’t even talked about, is intervention: what kinds of methods are available to combat these bad actors.