GT Malware Unsolicited Email Daily Feed
This dataset contains a daily feed of unsolicited email produced by the Georgia Tech Information Security Center's malware analysis system. Supplemental metadata included with the feed associates each message with a specific suspect Windows executable, which is run in a sterile, isolated environment, with controlled access to the Internet, for a short period of time. Network activity comprising each sample's generation of unsolicited email is recorded and made available in raw (packet capture, or PCAP) and plaintext (mbox and CSV) formats.
The CSV file, which contains a small subset of the information present in the PCAP and mbox file sets, is named according to the date on which the corresponding set of executables was processed. Each entry in the CSV file is a 4-tuple that provides the executable's MD5 hash, the message sender (From:) address, a recipient (To:) address, and the subject (Subject:) of a given message. Note that the CSV file provides at most one recipient for a given message, even if the To: field contains multiple addresses.
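As an illustration of the 4-tuple layout described above, the following Python sketch reads entries from a CSV file. It is only a sketch under stated assumptions: the column order (MD5, From:, To:, Subject:) follows the order given in the description, and the file is assumed to be comma-delimited with standard quoting and no header row. The `FeedEntry` and `read_entries` names and the sample row are hypothetical and not part of the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple


class FeedEntry(NamedTuple):
    """One CSV entry; field order is assumed from the dataset description."""
    md5: str        # MD5 hash of the executable that generated the message
    sender: str     # From: address
    recipient: str  # a single To: address (at most one is provided)
    subject: str    # Subject: of the message


def read_entries(handle) -> Iterator[FeedEntry]:
    """Yield one FeedEntry per well-formed row, skipping rows without 4 fields."""
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # assumption: malformed rows are ignored rather than repaired
        yield FeedEntry(*(field.strip() for field in row))


# Made-up sample row for demonstration only; not taken from the feed.
sample = io.StringIO(
    'd41d8cd98f00b204e9800998ecf8427e,a@example.com,b@example.org,"Hello, world"\n'
)
entries = list(read_entries(sample))
```

To read a real file, pass a handle opened with `open(path, newline="")` instead of the in-memory sample.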
This dataset is structured as a set of archives, each corresponding to a single day of sample processing and the unsolicited email collected from it. Each archive decompresses to a top-level folder containing a CSV file, a PCAP subdirectory, and an mbox subdirectory for that day. The files in the PCAP and mbox subdirectories are named according to the MD5 hash of the sample that produced the corresponding messages.
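Because the PCAP and mbox files are named after the MD5 of the sample, the files for one sample can be located from its hash alone. The Python sketch below shows one way to do this on an already-extracted archive. The subdirectory names `pcap` and `mbox`, the file extensions, the folder name, and the `files_for_sample` helper are all assumptions made for illustration; the description does not specify exact names, so the lookup matches any file whose name starts with the hash.

```python
import tempfile
from pathlib import Path


def files_for_sample(day_dir: Path, md5: str) -> dict:
    """Map each subdirectory name to the files whose names start with md5."""
    found = {}
    for sub in ("pcap", "mbox"):  # assumed subdirectory names
        folder = day_dir / sub
        found[sub] = sorted(folder.glob(md5 + "*")) if folder.is_dir() else []
    return found


# Build a throwaway folder that mimics the described layout (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    day = Path(tmp) / "2016-01-01"
    md5 = "0123456789abcdef0123456789abcdef"
    for sub, ext in (("pcap", ".pcap"), ("mbox", ".mbox")):
        (day / sub).mkdir(parents=True)
        (day / sub / (md5 + ext)).touch()
    result = files_for_sample(day, md5)
    names = {sub: [p.name for p in paths] for sub, paths in result.items()}
```

The hash used for the lookup can be taken from the first field of an entry in that day's CSV file.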
Note that the SMTP/MSA_SMTP redirection mechanism used to implement this feed is fully transparent. Thus, while an examination of the dataset's PCAP files may suggest that a given sample is able to interact with an Internet mail exchange, such outbound traffic is transparently redirected to a high-performance spamtrap operated by the Georgia Tech Information Security Center.
This dataset is the subject of ongoing measurement and data collection, so it grows continuously. Researchers who are granted access will be able to download updates for a period of one year after their request.
Size is growing as more data is collected