nlp - Parsing bulk text with Hadoop: best practices for generating keys
I have a 'big' set of line-delimited full sentences that I'm processing with Hadoop. I have developed a mapper that applies some of my favorite NLP techniques to it. I am mapping several different techniques over the original set of sentences, and my goal during the reduce phase is to gather the results into groups such that all members of a group share the same original sentence.
I think using the whole sentence as a key is a bad idea. I also suspected that generating a hash value of each sentence might not work because of the limited number of possible keys (possibly an unjustified belief).
Can anyone recommend the best ideas / practices for generating a unique key for each sentence? Ideally, I would like to preserve the order. However, this is not a main requirement.
Standard hashing should work fine. Most hash algorithms have a value space far larger than the number of sentences you are likely to be working with, so the probability of a collision will still be very low.
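To make that concrete, here is a minimal sketch in Python, assuming the mapper can be written as a script (for example with Hadoop Streaming) and emits tab-separated lines. The names sentence_key, map_sentence, and collision_probability are hypothetical helpers made up for illustration, and SHA-1 is just one possible choice of hash; it is used here only to get a fixed-length key, not for security. The last function gives a rough birthday-bound estimate of the collision probability for n sentences and a hash of a given bit width.

```python
import hashlib
import math


def sentence_key(sentence: str) -> str:
    """Return a fixed-length key: the hex SHA-1 digest of the sentence's UTF-8 bytes."""
    # Assumption: surrounding whitespace is not significant, so the same
    # sentence read with or without a trailing newline gets the same key.
    normalized = sentence.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def map_sentence(sentence: str, technique: str, result: str) -> str:
    """Format one mapper output line as key <TAB> technique <TAB> result."""
    return "\t".join((sentence_key(sentence), technique, result))


def collision_probability(num_sentences: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_sentences * (num_sentences - 1) / 2.0 ** (hash_bits + 1)
    # -expm1(x) equals 1 - exp(x) and stays accurate when x is tiny.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # Two techniques applied to the same sentence share one key,
    # so the reducer sees them in the same group.
    print(map_sentence("The cat sat on the mat.", "pos_tags", "DT NN VBD IN DT NN"))
    print(map_sentence("The cat sat on the mat.", "token_count", "6"))
    # Rough collision estimates for one billion sentences.
    print(collision_probability(10**9, 160))
    print(collision_probability(10**9, 32))
```

Under this approximation, a 160-bit hash over a billion sentences gives a collision probability that is negligible, while a 32-bit hash over the same input would almost certainly collide, so the width of the hash matters more than the specific algorithm. If order also matters, the mapper could carry each sentence's position in the input along as part of the value, since the hash itself does not preserve it.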