Leaderboard

Popular Content

Showing content with the highest reputation on 03/19/2018 in all areas

  1. By: Chuck Holtzworth

     In most cases (projects), MD5 hash values are calculated on the file itself. For more reliable de-duplication of emails, though, the hash must be calculated on the information contained within the email and not on the file itself. There are many reasons for this; the simplest is that when an email is saved out of its container (PST, NSF, etc.), the file that is created contains information that would change the hash value each time the same email is saved out. Hashing of e-mails uses UTC time to ensure proper de-duplication across time zones.

     When an email is discovered within Ipro eCapture, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated and the resulting text is hashed. Select from the following email fields to generate the hash value:

     • Subject
     • From/Author
     • Attachment Count
     • Body Whitespace: From the Body Whitespace drop-down list, select either Include (default) or Remove. Whitespace in the e-mail body could cause slight differences between otherwise identical e-mails, which could result in different hashes being generated.
       Remove - removes all whitespace between lines of text in the e-mail body prior to hashing.
       Include - keeps the whitespace.
     • E-mail Date: The following message types use the specified date values:
       Outlook: Sent Date
       IBM (formerly Lotus) Notes: Posted Date
       RFC822: Date
       GroupWise: Delivered Date
     • Attachment Names
     • Recipients
     • CC
     • BCC

     Select from either Creation Date or Last Modification Date. The selected value is used when calculating the MD5 hash if the normal E-mail Date value is not present. This commonly occurs for Draft messages that have not been sent. Start Time is always used if it exists.

     DataExtract/Processing: A comparison is made on the SHA1 hash value for parents and standalone documents at the time of de-duplication. If a match is found on a parent email, a family hash is generated using the MD5 values for the entire family. These are concatenated and hashed, resulting in an MD5 family hash. The resulting family hash is then used for de-duplication comparison. The comparison set of documents for de-duplication consists of those found within the DataExtractResults\Results tables within the client database, and depends on the de-duplication settings specified via Flex Processor.

     • The family hash value is stored in the ClientDatabase.Items.FamilyHash field.
     • The family hash is generated in SQL code.

     Streaming Discovery: A family hash is generated for all email parents where the Items.EmailAttachmentCount field is > 1. The family hash is always generated during Streaming Discovery just before an export set is created, either by export interval or by manually exporting the documents. The MD5 values are concatenated and hashed, resulting in a SHA1 family hash. The resulting family hash is then used for de-duplication comparison. The comparison set of documents for de-duplication consists of those found within the ExportSetSelection table within the client database.

     • The family hash value is stored in the ClientDatabase.Items.FamilyHash field.
     • The family hash is generated using the stored procedure named 'CalculateFamilyHash' found in each client database.
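For anyone who wants to see the idea in code, here is a rough Python sketch of the two steps described in the post: hashing the concatenated values of the selected email fields, and hashing the concatenated MD5 values of a family. It is not eCapture's actual implementation. The function names, field names, concatenation order, date format, text encoding, and the exact meaning of "Remove" for body whitespace are all assumptions made for illustration; the post does not specify them.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def email_hash(fields, selected, remove_body_whitespace=False):
    """MD5 of the concatenated values of the selected fields (hypothetical layout)."""
    parts = []
    for name in selected:
        value = fields.get(name, "")
        if name == "body" and remove_body_whitespace:
            # Assumption: "Remove" is modelled here as dropping every
            # whitespace character from the body before hashing.
            value = "".join(value.split())
        elif isinstance(value, datetime):
            # Convert to UTC so the same email hashes identically
            # regardless of the time zone it was collected in.
            value = value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        elif isinstance(value, (list, tuple)):
            # Assumption: multi-valued fields are joined with ";".
            value = ";".join(value)
        parts.append(str(value))
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def family_hash(member_md5s, algorithm="md5"):
    """Hash the concatenated MD5 values of a family (parent first, by assumption)."""
    joined = "".join(member_md5s).encode("ascii")
    return hashlib.new(algorithm, joined).hexdigest()


# The same message seen in two time zones, with different body whitespace.
eastern = timezone(timedelta(hours=-5))
copy_a = {
    "subject": "Q1 report",
    "from": "a@example.com",
    "body": "Hello\n\nWorld",
    "email_date": datetime(2018, 3, 19, 9, 0, tzinfo=eastern),
}
copy_b = dict(
    copy_a,
    body="Hello\nWorld",
    email_date=datetime(2018, 3, 19, 14, 0, tzinfo=timezone.utc),
)
chosen = ["subject", "from", "body", "email_date"]

# With whitespace removed, the two copies produce the same hash.
same = email_hash(copy_a, chosen, True) == email_hash(copy_b, chosen, True)
print(same)  # True
```

The `algorithm` parameter is there only because the post mentions an MD5 family hash for DataExtract/Processing and a SHA1 family hash for Streaming Discovery; the sketch does not claim to match either stored value.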
    1 point