When you have a long document, there's no intern around to assist you with the tedious task of identifying specific clauses, and you don't know where to start, that's the moment when you most feel the need for some automation in the process.
After all, it's the age where they automatically read your car's plate to send you a speeding ticket. Why do you still have to manually annotate text documents?!
You would need something that can automatically annotate the document based on what you are interested in. For this, you need an easy way to define your interests in a form the system understands.
Beagle.ai also felt the need for this and decided to solve it for contract review. We chose to formulate this inside a framework that is familiar to most people today: tags. Automatically tagging units of text (clauses in a contract in our case) will be the target output in this post.
It's easy to define inside this framework what a task looks like. You start by showing the system examples of good clauses. Once it starts to get the hang of the problem, it also gives suggestions, so you can continue to train it in a more collaborative way.
As soon as a certain level of maturity is reached, you can now rely on the system to automatically do the task, just as if you had that personal assistant you always wished for.
The system makes a binary decision whether a clause deserves a tag T or not and it has to base this decision on previous experience. In machine learning terms, this translates into a binary classifier.
We also want to feed it training data one sample at a time, which forks the problem into two possible options: use an online classifier, or constantly retrain an offline one. Better performance on small training sets, speed, and flexibility tip the balance towards the online option.
To apply supervised learning, both positive and negative samples are needed. But as a user, I don't want to start defining what's not a T, I only want to mark the valid Ts. This scenario is usually referred to as Positive-Unlabeled (PU) learning, where instead of positives and negatives, you get positives and unknowns.
To overcome this and not go further into the PU realm (which is a more difficult problem that requires more data), negative samples have to be inferred from the document. This proves to be tricky.
Negative sample inference
The first and most obvious rule is that a positive sample is not a negative sample. This said --Thank you Captain Obvious-- we have to pick from the clauses not yet tagged as T and do our best to predict which are true negatives.
One helpful observation is that a document is most often read top-down, so if the user has just tagged the 15th clause in the current document, the first 14 clauses have a higher probability of having already been considered and not tagged, which makes them good negative-sample candidates.
An even stronger indication in this direction is the user having interacted with a clause before labeling the current one as T. Interaction can mean anything that tells us the user read the clause and did not mark it as T: tagging it with some other label, adding a comment, or even simply clicking on it.
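The two heuristics above can be sketched as a small helper. The names (`infer_negatives`, `interacted_ids`) and the representation of interactions are our assumptions for illustration, not Beagle's actual code:

```python
# Illustrative sketch of negative-sample inference: clauses appearing
# before the one just tagged, plus clauses the user interacted with,
# are treated as negative candidates for tag T.

def infer_negatives(clauses, tagged_index, interacted_ids=()):
    """Return ids of clauses assumed to be negatives for tag T.

    clauses        -- ordered list of clause ids in the document
    tagged_index   -- index of the clause the user just tagged as T
    interacted_ids -- ids the user touched (other tags, comments, clicks)
    """
    negatives = set()
    # Top-down reading: earlier clauses were likely seen and skipped.
    negatives.update(clauses[:tagged_index])
    # Any interaction that did not end in a T tag is a stronger signal.
    negatives.update(i for i in interacted_ids if i != clauses[tagged_index])
    return sorted(negatives)

# Tagging the 16th clause (index 15) makes the first 15 candidates,
# plus clause 17, which the user commented on earlier.
doc = list(range(20))
print(infer_negatives(doc, 15, interacted_ids=[17]))
```

In practice one might weight these candidates differently (an explicit interaction is a stronger negative signal than mere position), but a flat set keeps the sketch simple.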
A point worth mentioning: there will be conflicts. No heuristic method for inferring negative samples is bullet-proof, so a sample already recorded as negative will sooner or later be added as a positive. This is not critical, the classifier won't throw an error and blow up, but the confusion introduced by the contradiction can have a visible performance impact when the training set is small. And a small training set is the usual use case here.
The easiest way to resolve such conflicts is to make the negative sample set consistent again and retrain a new classifier from scratch. To even detect a conflict, we need to keep the already-seen samples. This may be unwanted if the user has special privacy needs, so methods to anonymize the data (e.g. hashing) can be applied.
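One way the conflict detection and hashing-based anonymization could be sketched. All names here (`TrainingStore`, `anonymize`) are illustrative, not Beagle's implementation:

```python
# Sketch: store only hashes of seen clauses, detect positive/negative
# conflicts, and let the positive label win. The caller retrains the
# classifier from scratch whenever a conflict is reported.
import hashlib

def anonymize(clause):
    """Keep a privacy-preserving fingerprint instead of the raw text."""
    return hashlib.sha256(clause.encode("utf-8")).hexdigest()

class TrainingStore:
    def __init__(self):
        self.positives = set()  # hashes of clauses tagged T
        self.negatives = set()  # hashes of inferred negatives

    def add_positive(self, clause):
        h = anonymize(clause)
        conflict = h in self.negatives
        self.negatives.discard(h)  # resolve: the explicit tag wins
        self.positives.add(h)
        return conflict  # True means: retrain from scratch

    def add_negative(self, clause):
        h = anonymize(clause)
        if h not in self.positives:  # never demote a confirmed positive
            self.negatives.add(h)

store = TrainingStore()
store.add_negative("governed by the laws of England")
needs_retrain = store.add_positive("governed by the laws of England")
print(needs_retrain)  # → True
```

Hashing means the raw clause text never needs to be stored, at the cost of not being able to reuse the old samples' features when retraining.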
For evaluation purposes in this article, we used a small dataset of 458 manually annotated clauses. It contains both positive and negative samples of jurisdiction clauses in a contract.
A positive sample for this jurisdiction dataset is:
This EULA shall be governed by and construed in accordance with the laws of England.
Negative samples can be both the trivial clause that has nothing to do with laws and jurisdiction, and the more complex case where the clause states something "uninteresting" about jurisdiction:
Termination of Web Site Development and/or Web Site Hosting.
This Agreement will not be governed by the conflict of law rules of any jurisdiction.
The set is unbalanced, with a bias towards negative samples, which is acceptable considering the prior probability of a clause being a jurisdiction clause (a whole contract may contain only ~3 clauses that state jurisdiction).
The following learning curve graph shows the behaviour desirable for an online classifier: a quick performance increase after a small number of samples, followed by a plateau. (The green line is the F-score, red is precision, and blue is recall.)
Consistent with the learning curve, individual tests show the steep performance increase within the first samples seen.
After training on 1 positive sample and 3 negative ones, the F-score goes to 68%. Giving it 3 positives and 6 negatives takes it to 74% F-score.
At 8 positive and 19 negative samples, which in a practical scenario means tagging 8 clauses (the inference described above taking care of the negatives), the F-score is over 80%. The classifier at this point is young, but mature enough to be useful.
From this point on, the learner can start giving suggestions, which aid the training process by turning the needle-in-a-haystack problem into an "isn't this a needle?" question.
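A suggestion step could be sketched as ranking untagged clauses by the classifier's positive-class probability and surfacing only the most confident ones. This toy example uses scikit-learn's `MultinomialNB` for brevity; the training data, threshold, and names are placeholders:

```python
# Sketch: turn detection into suggestions by ranking clauses on
# P(tag T | clause) and keeping only confident candidates.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def suggest(clf, vectorizer, clauses, top_k=3, threshold=0.5):
    """Return up to top_k (clause, probability) pairs above threshold."""
    probs = clf.predict_proba(vectorizer.transform(clauses))[:, 1]
    ranked = sorted(zip(clauses, probs), key=lambda item: -item[1])
    return [(c, p) for c, p in ranked[:top_k] if p >= threshold]

# Tiny placeholder training set: jurisdiction vs. everything else.
train = ["governed by the laws of England",
         "construed under the laws of Ontario",
         "termination of web site hosting",
         "payment due within thirty days"]
labels = [1, 1, 0, 0]
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train), labels)
```

The user then only has to confirm or reject each suggestion, which is a much lighter interaction than scanning the whole document.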
Try Beagle 30 days for free. Use Promo code BEAGLE30