Friday, March 14, 2014

A Discriminative Latent Variable Model for Online Clustering (ICML 2014)

This paper addresses the problem of coreference resolution in which denotative noun phrases (mentions) that refer to the same underlying entity are clustered together. For example:
[American President]1 [Bill Clinton]1 has been invited by the [Russian President]2, [Vladimir Putin]2, to visit [Russia]3. [President Clinton]1 said [he]1 looks forward to [his]1 visit.
The paper's main objective is to treat this as an online clustering problem, since the mentions arrive in a natural left-to-right order. This matches the linguistic intuition about how humans resolve coreference as they read through a text.

They then present their procedure, the Latent Left-Linking Model (L3M). L3M assumes that, since the data arrive in order, it is sufficient to consider only the previous items when clustering a given item. L3M makes three modeling assumptions:

  1. Left-Linking. An item i can only link (probabilistically) to items before it.
  2. Independence of Left-links. The probability of an item i linking to an antecedent item j is independent of the links made by any other item i'.
  3. Probabilistic Left-link. For an item set d, the probability of an item i ≥ 1 linking to an antecedent item j (0 ≤ j < i) is $$Pr[j \leftarrow i; d,{\bf w}] = \frac{\exp({\bf w}_{ij}/\gamma)}{Z_i(d,{\bf w},\gamma)}$$ where ${\bf w}_{ij}$ is the pairwise compatibility score between items $i$ and $j$, computed by applying the learned weight vector ${\bf w}$ to pairwise features indicative of their similarity (such as cosine similarity), $\gamma$ is a temperature parameter, and $Z_i$ is the normalization factor. Essentially, $Pr[j \leftarrow i; d,{\bf w}]$ is $\exp({\bf w}_{ij}/\gamma)$ normalized so that the probabilities over all candidate antecedents of $i$ sum to one (a minimal code sketch of this computation follows the list).
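Here is a minimal Python sketch of this left-link distribution (my own illustration, not code from the paper). It assumes a hypothetical pairwise feature function `phi(i, j)` and a learned weight vector `w`, and treats item 0 as the dummy antecedent whose selection corresponds to starting a new cluster.

```python
import numpy as np

def link_probabilities(i, phi, w, gamma=1.0):
    """Pr[j <- i] for all antecedents j = 0, ..., i-1 (j = 0 is the dummy item).

    phi(i, j) is assumed to return the pairwise feature vector for items i and j;
    w is the learned weight vector and gamma is the temperature.
    """
    scores = np.array([np.dot(w, phi(i, j)) / gamma for j in range(i)])
    scores -= scores.max()                 # subtract the max for numerical stability
    unnormalized = np.exp(scores)
    return unnormalized / unnormalized.sum()   # division by the normalizer Z_i
```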
This model is similar to the distance-dependent Chinese Restaurant Process (CRP) for generative clustering. L3M is differentiated from the CRP in that it is the first supervised discriminative model to generalize the idea of using arbitrary pairwise features with learned weights.

The goal of the model is to cluster the items. An item is clustered into the cluster to which it has the highest probability of belonging (a greedy approach). If an item has the highest probability of linking to the dummy item, a new cluster is formed with that item. The probability that an item $i$ joins a previously formed cluster $c$ is the sum of the probabilities of $i$ linking to the items in $c$: $$Pr[c \odot i; d,{\bf w}] = \sum_{j \in c,\, 0 \le j < i} Pr[j \leftarrow i; d, {\bf w}]$$ The weight vector ${\bf w}$ is learned by minimizing the regularized negative log-likelihood of the data. Casting the problem as a latent tree over the left-links, the log-likelihood loss can be minimized with stochastic gradient descent to find ${\bf w}$.
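To illustrate the greedy inference described above, here is a rough sketch (again my own illustration under the stated assumptions, not the authors' implementation) that reuses `link_probabilities` from the earlier snippet: each incoming item joins the existing cluster with the largest summed link probability, or starts a new cluster when the dummy item receives the most probability mass.

```python
def greedy_online_cluster(n_items, phi, w, gamma=1.0):
    """Cluster items 1..n_items left to right.

    Item i joins the existing cluster c maximizing
    Pr[c (.) i] = sum_{j in c} Pr[j <- i], unless linking to the
    dummy item 0 is more probable, in which case it starts a new cluster.
    """
    clusters = []                                    # each cluster is a list of item indices
    for i in range(1, n_items + 1):
        probs = link_probabilities(i, phi, w, gamma)
        cluster_scores = [sum(probs[j] for j in c) for c in clusters]
        new_cluster_score = probs[0]                 # probability mass on the dummy item
        if not clusters or new_cluster_score >= max(cluster_scores):
            clusters.append([i])                     # start a new cluster with item i
        else:
            best = max(range(len(clusters)), key=lambda k: cluster_scores[k])
            clusters[best].append(i)                 # join the most probable existing cluster
    return clusters
```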

I am left with a few questions; namely, how is ${\bf w}$ initialized? At the end of the paper, the authors randomly initialize ${\bf w}$ to show that their algorithm is robust to the non-convexity of the problem, but it wasn't clear to me how ${\bf w}$ was initialized in the earlier experiments.
