The most common raw format of network traffic data is the PCAP format, which is not supported by many widely used machine learning and data mining tools and platforms. Hence, as shown in Figure 1, several steps are needed to prepare the data for processing. These steps are explained in detail in the following subsections.

Fig. 1: Preprocessing Pipeline

A. Obtaining the PCAP Data

The data is usually captured in PCAP format using network traffic analysers such as Wireshark. According to the Wireshark documentation, each PCAP file begins with a header that stores some global information. After that, the file contains record(s) for captured packets. These records are organised so that each captured packet has its own packet record, as shown in Figure 2.

Fig. 2: Contents of PCAP File

B. From PCAP to Plain Text

According to the Wireshark documentation, the packet data stored in a PCAP file is not necessarily the complete packet as it appeared on the network. This is because the PCAP file might store only part of each packet (the length of this part is usually predefined to be larger than the largest possible packet, so that no packet is trimmed). For this reason, it is highly recommended to use specialised tools that understand the structure of PCAP files. Therefore, to transform PCAP data into a textual format, it is advised to use the freely available tool FlowMeter [7]. This is a Java package that reads a directory containing one or more PCAP files and transforms them into Comma Separated Value (CSV) files. It analyses the contents of the PCAP files and generates several attributes (features) such as Source Port, Destination Port, Protocol, Flow Duration, Flow Bytes per second and Flow Packets per second. FlowMeter generates 26 features in total, and their full description can be found in [7].
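
To make the file layout described above more concrete (a global header followed by one record per captured packet), the following minimal Python sketch walks through a classic PCAP file using only the standard library. It is illustrative only and is not part of FlowMeter or of the pipeline proposed here; for real data, specialised tools remain the recommended route. The field layout follows the classic libpcap format (24-byte global header, 16-byte record header). The names `parse_pcap` and `build_demo_capture` are hypothetical, and the sketch assumes the microsecond-resolution variant of the format, so nanosecond captures and pcapng files are rejected.

```python
import io
import struct

GLOBAL_HEADER_LEN = 24    # bytes in the file-level (global) header
RECORD_HEADER_LEN = 16    # bytes in each per-packet record header
PCAP_MAGIC = 0xA1B2C3D4   # classic PCAP, microsecond timestamps


def parse_pcap(stream):
    """Return (header, records) for a classic PCAP byte stream.

    Hypothetical helper for illustration; not a FlowMeter or Wireshark API.
    """
    raw = stream.read(GLOBAL_HEADER_LEN)
    if len(raw) < GLOBAL_HEADER_LEN:
        raise ValueError("input too short for a PCAP global header")

    # The magic number reveals the byte order used by the writer.
    for endian in ("<", ">"):
        if struct.unpack(endian + "I", raw[:4])[0] == PCAP_MAGIC:
            break
    else:
        raise ValueError("not a classic microsecond-resolution PCAP file")

    (_magic, major, minor, _thiszone, _sigfigs,
     snaplen, linktype) = struct.unpack(endian + "IHHiIII", raw)
    header = {"version": (major, minor), "snaplen": snaplen,
              "linktype": linktype}

    records = []
    while True:
        rec = stream.read(RECORD_HEADER_LEN)
        if len(rec) < RECORD_HEADER_LEN:
            break  # end of file (or a cut-off trailing record)
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(endian + "IIII", rec)
        data = stream.read(incl_len)  # only incl_len bytes were saved
        records.append({
            "timestamp": ts_sec + ts_usec / 1_000_000,
            "incl_len": incl_len,             # bytes stored in the file
            "orig_len": orig_len,             # bytes seen on the network
            "truncated": incl_len < orig_len,
            "data": data,
        })
    return header, records


def build_demo_capture():
    """Build a tiny synthetic capture in memory (made-up demo data)."""
    header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 4, 1)  # snaplen = 4
    full = struct.pack("<IIII", 1, 0, 3, 3) + b"abc"          # stored in full
    cut = struct.pack("<IIII", 2, 500000, 4, 10) + b"defg"    # 10 bytes cut to 4
    return header + full + cut


if __name__ == "__main__":
    hdr, recs = parse_pcap(io.BytesIO(build_demo_capture()))
    print(hdr)
    for r in recs:
        print(r["timestamp"], r["incl_len"], r["orig_len"], r["truncated"])
```

The `truncated` flag marks records whose stored length is smaller than the length observed on the network, which is the case where the file holds only part of a packet.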