This is a simple implementation of zerocopy sendfile(2) for use with TUX.
It works with acenic, 3c59x and SunHME.

[001015]

* tcp_fragment() is ready.
* Optional zero copy from sendmsg() (MSG_NOCOPY)

[001017]

* Nothing new, checkpoint before merge to more recent kernel.
* Some technical bits though.

[001018]

* merged to the latest kernel tree.
* ugly bug in tcp_fragment(), introduced in previous patch.
* BPF and packet socket receive our local paged frames without copy.
* NAT updates checksum on CHECKSUM_HW frames correctly.
* mapping kiobuf is more or less sane.
  [ Ough, it is not only suboptimal, it is just wrong.
    Seems, it would be better not to use this function... ]

[ Hmm... I see no more holes (modulo netfilter). It is just nice, really. ]

[001022]

* Two bad bugs (pskb_copy(), tcp_sendpages())
* 3c59x joins community of "good" devices. Nice device.
  No documentation was required, a couple of comments in source code
  were enough. 8)
* Various fixes from Ingo, particularly CONFIG_HIGHMEM compiles now.
* skb_clone() (from Ingo) is unrolled. This did not help though.
  For now skb_clone() is almost champion in profiles, only ace_interrupt
  beats it. In fact, it was clear that it would not help.

I feel very uncomfortable when CPU is not used for all 100% and TCP does
not saturate link. Though people commonly believe that it is _good_,
it signals nothing but some bug or misunderstanding.
If you are idle --- work faster!

Apparently, something with latency... tcpdumps look perfect though. Hmm...

[001025]

* [BUG in 2.4] Defragmenter did not set timestamp on reassembled packets.
* SG on eepro100
* SG on Tulip
* RX checksumming on eepro100 (the first "true" CHECKSUM_HW,
  working for IPv6 8))
* HW checksumming engine is sanitized and "documented" in comment
  in include/linux/skbuff.h
* IPv6 uses CHECKSUM_HW on received packets correctly,
  including fragmented frames.

NOTE: well, all the devices, which I have here, are SGized.
Tulip (SG), eepro100(SG), 3c59x(SG+CSUM), acenic(SG+CSUM).

[001031]

* merged to main kernel of today.
* acenic.c: optional TX ring, based in host memory.
* eepro100.c: use dynamic TBDs with 82558/9.
* cleanup checksumming: new primitives:
  csum_add, csum_sub, csum_block_add, csum_block_sub
  All the shit scattered over code is encapsulated there.
* check for high DMA (far from being perfect)
* do high memory skb page accesses under softirq, not cli.
* icmp.c, tunnels understand CHECKSUM_HW

Not related to zerocopy:

* [TCP] Increase minimal ATO to 40msec. sshd is so damn slow that it is
  _never_ able to reply within 10msec. Sigh.
* [IPv6] 6to4 in SIT. Roger Venning and Nate Thompson

[001101]

* merged to main kernel of today (final test10)
* SunHME joins family of devices. Seems, it is the most clever
  member of the family. 8) By Dave.
* Another patch from Dave: some crap in map_user_kiobuf is cleaned out.
* netsyms.c fixes, again from Dave.

[001102]

* some bugs are found by Dave: that one in pskb_copy_bits() is really bad,
  if it was hit.

[001105]

* tcp_sendmsg() is forced to page output on SG devices. (by Dave)
* skb frag list is moved to shared part of skb. (by Dave)
* Some changes in TCP and packet socket to adapt to new situation.
* all bits of zerocopy from vm and async signaling are withdrawn.
* changes to the devices, which are not able to checksum, are withdrawn.

Not related to zerocopy:

* Sending MSG_OOB is rewritten to match RFCs and BSD.
  Match to BSD is still not full, see comment above tcp_xmit_probe_skb().

[001105]

* Paged tcp_sendmsg(). Some fix by Dave. I hope Dave still had fresh
  brains after skiing, I still did not understand this. 8)
  I will verify this tonight.

Not related to zerocopy:

* [IPv6] Two bugs found by Aki M. Laukanen
  - Check that destination of packet with rthdr is not multicast.
  - double free of skb in two places in IPv6 reassembler.
* [IPv6,IP] Serious protocol inconsistency, noticed by Aki M.
  Laukanen

  Fragment queues must expire not later than MSL after arrival of
  the first packet.

  Danger! IPv4 is changed _too_, so that it starts to contradict RFC791.
  Some people may complain, but we MUST NOT follow RFC in this point,
  otherwise MSL limit is evidently violated. In fact, RFC was
  self-consistent because it assumes that routers decrease TTL each
  second, so that delayed fragments have NOW+ttl constant.
  This requirement on routers is obsoleted, so that it automatically
  obsoletes reassembly procedure described in RFC.
* Yes, forgot to say, funny assert is added to sock_wait_on_wmem().
  Some people insist that this can happen, let's check. 8)

[001110]

* Newer paged tcp_sendmsg() using per socket cache page. By Dave.
* Fix to SunHME. By Dave.
* Reset SG flag for IPv6, if device is able to csum only IPv4.

Not related to zerocopy:

* Bug in urgent data transmission.
* Missing lock in dev_load(). (By Denis Lunev)
* Large fix for big bug: TCP option parser does not affect socket state.
  Mostly trivial, SYN-SENT is the only place, which looks mmm...
  aesthetically dubious. 8)
* Byproduct: another bug in tstamps: zero echoed tstamp is legal
  and means "ignore me".
* Small change in debugging window shrinking: eliminate known winscale
  case not to frighten people.

NOTE: still not merged to kernel of today.

[001113]

* Lazy defragmentation. Funny, code became simpler after this...
  - IPv6 is still not updated, only IPv4.
  - NFS is still not updated. (Easy, but requires new function,
    which works like skb_copy_and_csum_datagram(), but with
    kmap() -> kmap_atomic(), memcpy_touser() -> memcpy())
    Tomorrow.

Not related to zerocopy:

* tcp_v4_rebuild_header() (problem is pinned down by "Andrey V. Shytov")
* divide by zero bug in ndisc.c (by Gilles Berger Sabbatel)

[001116]

* NFS is supposed to be updated. Client does single copy.
* Input IP path understands paged/fragmented skbs.
  Try to do ftp (sendfiling) on loopback. 8)
  - IPv6 is still not updated.
Not related to zerocopy:

* rtt is calculated not by the last acked packet, but the earlier one.
* tcp_input has hole allowing to cancel backoff even if
  not retransmitted data was acked.
  [ After some discussion with Andrey Gurtov. He proposed to relax
    Karn algo, to be honest. 8) Well, and result is more strong. ]

[001118]

* Bug fixes from Dave (NFS, netsyms)
* __pskb_pull_tail was still buggy.
* IPv6 is updated modulo exthdr parser

[001119]

* IPv6 extension headers. Now defragmenter generates good fragmented skbs.
  [ Sweaty work, yes. ]

[001120]

* MSG_MORE is implemented.
* Results of audit of IP/IPv6 icmp paths are committed.
* netsyms are _right_ now. Patch from Dave, checked,
  IPv6 is loaded as module.

[001123]

* Minor API change: skb_copy_bits() uses "standard" bcopy()-like
  order of arguments: noticed and advised by Dave.
* skb->rx_dev is removed. It was pure bloat of sk_buff.
* one more audit pass: a few misprints are found; some comments are added.
* Wow! Proxy arp has not been threaded in softnet! 8)
  This sad fact was discovered while removing rx_dev.
* One more SMP bug in error queue processing.

Well, byproduct effects of doing new things are not less interesting
sometimes. 8)

[001125]

* upgraded to current kernel.
* old bug in tcp_transmit_skb(): order of args in between() was wrong.
* by the way, invented some workaround for problem with urgent data
  and closed window. It looks a bit unusual, but seems to be even
  better than trick used by BSD.

[001202]

* Bug fixes, except for a few which intersect with zerocopy,
  are flushed to main kernel. Thanks to Dave, I cowardly hid from this jab.
* Sync to kernel of today (yesterday, to be more exact).
* Nothing more.

[001207]

**** vm zerocopy hacks are remerged back. Patch *-vmhacks is prepared. ****

* Wrong argument order to udp_v6_mcast_next().
  By yoshfuji@ecei.tohoku.ac.JP (YOSHIFUJI Hideaki)
* Wicked argument order of udp_v4_mcast_next() is changed to reasonable one.
  Motivated by the fix above.
* Return ENOTCONN in inet_recvmsg() if socket is not bound.
  Made after discussion with Andi.
* Zero sin_zero in ip_recv_err(). By Andi.
* Suppressing message for win. shrink with wscale did not suppress
  anything. 8)
* Better estimate for initial rcvmss to handle marginal cases better.
  1. For low mtu: add as hint outgoing mtu (if it is too low,
     it will be fixed later automatically)
  2. Clamp rcvmss with half of initial rcv_wnd to handle case of extremely
     low rcvbufs, when we could advertise identically zero window
     from the very beginning and forever.
* Try to withdraw CWR after undo.
* Time to repair RTO estimator at last. Our current algo turned out to be
  pure disgrace and we cannot leave it in its current state.
  Let's experiment a bit. Actually, the last resort to which we can
  always fall back in stable kernels is evident. It is BSD scheme,
  of course. It is so bogus that it will work even though it is simply
  logically inconsistent. 8) For now we still may try better
  alternatives. See README.rto.

[001211]

* Merge to current kernel, some things percolated to main tree.

[001212]

* Fix to sunhme from Dave.
* Some mud (nothing essential) present in the previous snapshot is removed.
* Merge to vger of today.

Alexey Kuznetsov.
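
Illustration for the checksum primitives named in the [001031] entry and
for the NAT note in [001018]. This is only a userspace sketch of how
csum_add/csum_sub style helpers let a checksum be patched after one word
of a packet is rewritten, instead of being recomputed over the whole
frame. It is not the code from the patch. The helpers csum_words(),
csum_replace_word() and csum_demo(), and the sample packet words, are
hypothetical and exist only for this example; csum_block_add and
csum_block_sub are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement addition of two 32-bit partial sums
 * (the carry out of bit 31 is wrapped around into bit 0). */
static uint32_t csum_add(uint32_t csum, uint32_t addend)
{
    csum += addend;
    return csum + (csum < addend);
}

/* Subtraction is addition of the ones' complement of the addend. */
static uint32_t csum_sub(uint32_t csum, uint32_t addend)
{
    return csum_add(csum, ~addend);
}

/* Fold a 32-bit partial sum into the final 16-bit Internet checksum. */
static uint16_t csum_fold(uint32_t csum)
{
    csum = (csum & 0xffff) + (csum >> 16);
    csum = (csum & 0xffff) + (csum >> 16);
    return (uint16_t)~csum;
}

/* Hypothetical helper: partial sum over an array of 16-bit words. */
static uint32_t csum_words(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = csum_add(sum, words[i]);
    return sum;
}

/* Hypothetical helper: overwrite words[idx] and patch the partial sum
 * by removing the old word and adding the new one. */
static uint32_t csum_replace_word(uint32_t sum, uint16_t *words,
                                  size_t idx, uint16_t new_word)
{
    sum = csum_sub(sum, words[idx]);
    sum = csum_add(sum, new_word);
    words[idx] = new_word;
    return sum;
}

/* Hypothetical demo: rewrite one word (as an address translator might)
 * and check that the patched sum agrees with a full recomputation. */
static uint16_t csum_demo(void)
{
    uint16_t pkt[] = { 0x4500, 0x0054, 0xc0a8, 0x0001, 0xc0a8, 0x00c7 };
    size_t n = sizeof(pkt) / sizeof(pkt[0]);
    uint32_t sum = csum_words(pkt, n);

    sum = csum_replace_word(sum, pkt, 3, 0x0a01);
    assert(csum_fold(sum) == csum_fold(csum_words(pkt, n)));
    return csum_fold(sum);
}
```

Calling csum_demo() from a small test program exercises the update path.
The partial sum stays 32 bits wide and is folded only at the end, so a
single modified word costs two additions regardless of frame length.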