# Real-world networking datasets

## Dataset repos or tools

* Finding datasets: [New Google Dataset search](https://toolbox.google.com/datasetsearch)
* Mendeley Data: https://data.mendeley.com/datasets
* Kaggle Data: https://www.kaggle.com/datasets
* *Google's M-Lab networking performance data sets: https://www.measurementlab.net/data/*

## Service/VNF-related data

* SNDZoo: https://sndzoo.github.io/
* Datasets from the University of Catalonia: http://knowledgedefinednetworking.org/

## Traffic

### Traffic traces

* Traffic over time (in minutes, hours, or days) from ~2005 of a private ISP. Only time + traffic size, no source/destination or any other info. https://datamarket.com/data/list/?q=cat:ecd%20provider:tsdl
* Facebook traffic traces (access via FB group): https://research.fb.com/data-sharing-on-traffic-pattern-inside-facebooks-datacenter-network/
* North American backbone network Abilene: data from 24 weeks of 5-minute averages (2004), 12 routers (12x12 matrices): http://www.maths.adelaide.edu.au/matthew.roughan/project/traffic_matrix/ (160 MB)
* SNDlib: Library with real-world network topologies (most popular: Abilene) and sometimes traffic/service demands: http://sndlib.zib.de/
  * Under `Library > Dynamic traffic` there are realistic traffic traces that change over time, for example a traffic matrix every 5 min over 6 months for the Abilene network.
  * Also a traffic matrix every 15 min for the larger GÉANT network.
* [UMassTraceRepository](http://traces.cs.umass.edu/index.php/Network/Network)
* CRAWDAD: Real wireless data (124 datasets). Download only for registered users (free). https://crawdad.org/about.html
* Internet Traffic Archive: Useful, real-world traces. http://ita.ee.lbl.gov/html/traces.html
* MAWI Working Group Traffic Archive: [Packet traces from WIDE backbone](http://mawi.wide.ad.jp/mawi/)
* Huge traces from Google: 12000+ machines, measured over 1 month, compressed size ~41 GB (smaller traces also available). https://github.com/google/cluster-data
* IP network traffic labeled with different apps: https://www.kaggle.com/jsrojas/ip-network-traffic-flows-labeled-with-87-apps/home
* Summarized network traffic where computers are gradually compromised by a botnet: https://www.kaggle.com/crawford/computer-network-traffic/home
* CAIDA packet-level traffic traces from 2016: https://www.caida.org/data/passive/passive_2016_dataset.xml
* `tcpreplay` simple traffic traces (`smallFlows.pcap` and `bigFlows.pcap`) for traffic generation: https://tcpreplay.appneta.com/wiki/captures.html

### Traffic generators

* TRex generates stateful and stateless traffic. A Python API is provided. https://github.com/cisco-system-traffic-generator/trex-core
* Intel's traffic testbed for the Data Plane Development Kit (DPDK), using TRex: https://software.intel.com/en-us/articles/build-your-own-dpdk-traffic-generator
* [MoonGen](https://github.com/emmericp/MoonGen)
* How to generate NetFlow data from PCAP traces (more available): https://stackoverflow.com/a/34792376/2745116

## Networks

### Network topologies

* SNDlib: http://sndlib.zib.de/home.action
* TopologyZoo: http://topology-zoo.org/

### Network delays

* Historical (hourly) ping measurements for a large-scale network: https://wondernetwork.com/pings
* Datasets from the University of Catalonia: http://knowledgedefinednetworking.org/

## Other

### Workload

* Alibaba traces with batch workloads on thousands of machines (from 2017 and 2018): https://github.com/alibaba/clusterdata