
eCapture Hash


Sharmarke Afgab

Hello everyone,

 

I wanted to share part of an email thread between Chuck Holtzworth (Sr. SWAT Engineer) and me discussing hashing. I feel this topic is not discussed often or in much detail, and Chuck explains our hashing logic in a way we can all understand. Please take a moment to read the thread below.

Edited by Michael Atkin

By: Chuck Holtzworth

 

In most projects, MD5 hash values are calculated on the file itself. For reliable de-duplication of emails, however, the hash must be calculated on the information contained in the email rather than on the file. There are several reasons for this. The simplest is that when an email is saved out of its container (PST, NSF, etc.), the resulting file contains information that changes from one save to the next, so the same email would produce a different hash value each time it was saved out. Email hashing uses UTC time to ensure proper de-duplication across time zones.

When an email is discovered within Ipro eCapture, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated and the resulting text is hashed. Select from the following email fields to generate the hash value:

 

• Subject
• From/Author
• Attachment Count
• Body Whitespace: From the Body Whitespace drop-down list, select either Include (default) or Remove. Whitespace in the e-mail body could cause slight differences between the same e-mails, which could result in different hashes being generated. Remove - removes all whitespace between lines of text in the e-mail body prior to hashing. Include - keeps the whitespace.
• E-mail Date: The following message types use the specified date values: Outlook: Sent Date, IBM (formerly Lotus) Notes: Posted Date, RFC822: Date, and GroupWise: Delivered Date.
• Attachment Names
• Recipients
• CC
• BCC
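
The field-based hashing described above can be sketched roughly as follows. This is only an illustration in Python, not eCapture's actual implementation: the function name `email_hash`, the dictionary keys, the `|` separator, the field order, and the exact whitespace rule (here, all whitespace is stripped from the body) are assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timedelta, timezone

def email_hash(email, fields, remove_body_whitespace=False):
    """Concatenate the selected field values and return their MD5 hex digest.

    Hypothetical sketch: key names, separator, and ordering are assumptions.
    """
    parts = []
    for name in fields:
        value = email.get(name, "")
        if isinstance(value, datetime):
            # Normalize to UTC so the same message hashes identically
            # regardless of the time zone it was collected in.
            value = value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        elif isinstance(value, (list, tuple)):
            value = ";".join(str(v) for v in value)
        elif name == "body" and remove_body_whitespace:
            # Assumed reading of the "Remove" option: drop all whitespace.
            value = re.sub(r"\s+", "", value)
        parts.append(str(value))
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

eastern = timezone(timedelta(hours=-5))
sent_copy = {
    "subject": "Status",
    "from": "a@example.com",
    "date": datetime(2020, 1, 15, 9, 30, tzinfo=eastern),
    "body": "Line one\r\n\r\nLine two",
}
received_copy = {
    "subject": "Status",
    "from": "a@example.com",
    "date": datetime(2020, 1, 15, 14, 30, tzinfo=timezone.utc),
    "body": "Line one\nLine two",
}
fields = ["subject", "from", "date", "body"]

same_with_remove = (email_hash(sent_copy, fields, True)
                    == email_hash(received_copy, fields, True))
same_with_include = (email_hash(sent_copy, fields)
                     == email_hash(received_copy, fields))
```

In this example the two copies differ only in body whitespace and in the time zone of the date, so they match when whitespace is removed and do not match when it is included.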

 

 

 

Select either Creation Date or Last Modification Date as the alternate date. The selected value is used when calculating the MD5 hash if the normal E-mail Date value is not present, which commonly occurs for draft messages that have not been sent.

Start Time is always used if it exists.
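
As a rough sketch of the date selection just described, assuming hypothetical key names (`start_time`, `email_date`, `creation_date`, `last_modified_date`) and reading "always used" as "takes precedence", which is my interpretation rather than something the thread states outright:

```python
from datetime import datetime, timezone

ALTERNATES = ("creation_date", "last_modified_date")

def pick_hash_date(email, alternate="creation_date"):
    """Return the date value that would feed the hash (hypothetical sketch)."""
    if alternate not in ALTERNATES:
        raise ValueError("alternate must be one of %r" % (ALTERNATES,))
    # Assumed precedence: Start Time wins whenever it is present.
    if email.get("start_time") is not None:
        return email["start_time"]
    # Normal case: the message-type-specific E-mail Date.
    if email.get("email_date") is not None:
        return email["email_date"]
    # Fallback, e.g. for unsent drafts: the user-selected alternate date.
    return email.get(alternate)

created = datetime(2020, 1, 10, 8, 0, tzinfo=timezone.utc)
modified = datetime(2020, 1, 12, 8, 0, tzinfo=timezone.utc)
draft = {"creation_date": created, "last_modified_date": modified}
draft_date = pick_hash_date(draft, alternate="last_modified_date")
```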

 

 

 

DataExtract/Processing:

At the time of de-duplication, a comparison is made on the MD5 hash value for parents and standalone documents. If a match is found on a parent email, a family hash is generated using the MD5 values for the entire family. These values are concatenated and hashed, resulting in an MD5 family hash. The resulting family hash is then used for the de-duplication comparison. The comparison set for de-duplication consists of the documents found in the DataExtractResults\Results tables within the client database, subject to the de-duplication settings specified via Flex Processor.

 

 

 

• The Family hash value is stored in the ClientDatabase.Items.FamilyHash field.
• Family hash is generated in SQL code.
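
To illustrate the idea of a family hash, here is a minimal Python sketch. The real logic lives in SQL, as noted above, so this is not that code: the function name `family_hash`, the member ordering (parent first, then attachments), and hashing the concatenated hex strings as ASCII are all assumptions.

```python
import hashlib

def family_hash(member_md5s):
    """Concatenate the family's MD5 hex digests and MD5 the result.

    Hypothetical sketch: order and encoding are assumptions. Order matters,
    so both copies of a family must list their members the same way.
    """
    combined = "".join(h.lower() for h in member_md5s)
    return hashlib.md5(combined.encode("ascii")).hexdigest()

parent = hashlib.md5(b"parent email fields").hexdigest()
attachment_a = hashlib.md5(b"attachment A bytes").hexdigest()
attachment_b = hashlib.md5(b"attachment B bytes").hexdigest()

family_one = family_hash([parent, attachment_a])
family_two = family_hash([parent, attachment_b])
```

Here the two parents have identical hashes, but the families differ because the attachments differ, so the family hashes do not match.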

 

 

 

Streaming Discovery:

Family hash is generated for all email parents where the Items.EmailAttachmentCount field is > 0. Family hash is always generated during Streaming Discovery just before an export set is created, either by export interval or by manually exporting the documents. The MD5 values are concatenated and hashed, resulting in an MD5 family hash. The resulting family hash is then used for the de-duplication comparison. The comparison set for de-duplication consists of the documents found in the ExportSetSelection table within the client database.

 

 

 

• The Family hash value is stored in the ClientDatabase.Items.FamilyHash field.
• Family hash is generated using the stored procedure named 'CalculateFamilyHash', found in each client database.
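
The comparison step itself can be pictured with a small Python sketch. This is only an illustration of de-duplicating against a comparison set, not the stored procedure: the function name `deduplicate`, the `(item_id, hash)` tuples, and the in-memory set standing in for the comparison table are all hypothetical.

```python
def deduplicate(items, comparison_set=None):
    """Split items into unique and duplicate IDs by their de-duplication hash.

    items: iterable of (item_id, dedup_hash) pairs, where dedup_hash is the
    family hash for a parent with attachments, or the item's own hash
    otherwise (assumed input shape).
    comparison_set: hashes already present, e.g. from a prior export set.
    """
    seen = set(comparison_set or ())
    unique, duplicates = [], []
    for item_id, dedup_hash in items:
        if dedup_hash in seen:
            duplicates.append(item_id)
        else:
            seen.add(dedup_hash)
            unique.append(item_id)
    return unique, duplicates

unique_ids, duplicate_ids = deduplicate(
    [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, "ccc")],
    comparison_set={"ccc"},
)
```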


As you know, for non-email files we generate the hash from the file's binary content. Each file is made up of blocks of data, and we use that data to generate the hash for the file.

Emails, however, are a different animal. If I send an email to Chuck, there will be two copies of that email: one in my "Sent" folder and one in Chuck's inbox. If both Chuck and I are custodians in a case, our data will be collected and those two emails should be considered duplicates. However, emails have something called an "Internet header", which records the email servers the message passed through on its way to the recipient. In this example, the email in Chuck's inbox contains information that is not in the copy in my "Sent" folder. Because of this little-known Internet header information, the two copies will not produce the same hash under the standard file-level hashing method. That is why we hash specific email fields and the body instead.
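
To make the Internet header point concrete, here is a small Python sketch using the standard library `email` module. The addresses, server names, and the choice of Subject/From/To/body as the hashed fields are made up for the example, and this is not how eCapture itself reads messages.

```python
import hashlib
from email import message_from_string

# The copy as it might sit in the sender's "Sent" folder.
sent_copy = (
    "From: sharmarke@example.com\r\n"
    "To: chuck@example.com\r\n"
    "Subject: Hashing\r\n"
    "\r\n"
    "See the thread below.\r\n"
)
# The recipient's copy carries an extra routing header added in transit.
received_copy = (
    "Received: from mail.example.com by mx.example.com; "
    "Wed, 15 Jan 2020 14:30:00 +0000\r\n"
    + sent_copy
)

def file_md5(raw):
    """Hash the whole message, headers included (file-level hashing)."""
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def field_md5(raw):
    """Hash only selected fields and the body (hypothetical field choice)."""
    msg = message_from_string(raw)
    text = "|".join([msg["Subject"] or "", msg["From"] or "",
                     msg["To"] or "", msg.get_payload()])
    return hashlib.md5(text.encode("utf-8")).hexdigest()

files_match = file_md5(sent_copy) == file_md5(received_copy)
fields_match = field_md5(sent_copy) == field_md5(received_copy)
```

The file-level hashes differ because of the extra `Received` header, while the field-level hashes agree.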

 
