nlp - Parsing bulk text with Hadoop: best practices for generating keys -

- September 15, 2014

I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I have developed a mapper that applies some of my favorite NLP techniques to it. I am mapping several different techniques over the original set of sentences, and my goal during the reduce phase is to collect the results into groups such that all members of a group share the same original sentence.

I feel that using the whole sentence as a key is a bad idea. I also suspected that generating a hash value of the sentence might not work because of the limited number of possible keys, though that belief may be unfounded.

Can anyone recommend the best ideas or practices for creating a unique key for each sentence? Ideally, I would like to preserve the order, but this is not a main requirement.


Standard hashing should work well. Most hash algorithms have a value space far larger than the number of sentences you are likely to work with, so the probability of a collision will still be very low.
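As a rough illustration of the hashing suggestion, here is a minimal sketch in Python (usable, for example, from a Hadoop Streaming mapper). The function names `sentence_key` and `collision_probability` are made up for this example, and the choice of SHA-1 and of stripping surrounding whitespace are assumptions, not something the answer prescribes. The second function uses the standard birthday-bound approximation to estimate how likely any collision is for a given number of sentences and hash width.

```python
import hashlib
import math


def sentence_key(sentence: str) -> str:
    """Return a fixed-length hex key derived from the sentence text.

    Assumption: surrounding whitespace (e.g. the trailing newline) is not
    part of the sentence, so it is stripped before hashing.
    """
    normalized = sentence.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def collision_probability(num_sentences: int, hash_bits: int) -> float:
    """Approximate the chance of at least one collision (birthday bound).

    p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)), assuming hash values are
    uniformly distributed.
    """
    exponent = -num_sentences * (num_sentences - 1) / 2.0 ** (hash_bits + 1)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)


if __name__ == "__main__":
    line = "The quick brown fox jumps over the lazy dog.\n"
    # Emit "key<TAB>value", the record shape Hadoop Streaming expects.
    print(f"{sentence_key(line)}\tsome-nlp-result")
    print(collision_probability(10**9, 160))
```

Every mapper that processes the same sentence produces the same key, so the reducer receives all results for that sentence together. Note that a hash key does not preserve the original sentence order; if order matters, the position of the sentence would have to be carried along separately, for instance as part of the value.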
