From: Martin J. <ga...@wl...> - 2004-12-10 16:24:22
On Thu, 2004-12-02 at 19:23, Robert Olsson wrote:
> Hello!
>
> Below is little patch to clean skb at xmit. It's old jungle trick Jamal
> and I used w. tulip. Note we can now even decrease the size of TX ring.

Just a small unimportant note.

> --- drivers/net/e1000/e1000_main.c.orig	2004-12-01 13:59:36.000000000 +0100
> +++ drivers/net/e1000/e1000_main.c	2004-12-02 20:37:40.000000000 +0100
> @@ -1820,6 +1820,10 @@
>  		return NETDEV_TX_LOCKED;
>  	}
>
> +
> +	if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
> +		e1000_clean_tx_ring(adapter);
> +
>  	/* need: count + 2 desc gap to keep tail from touching
>  	 * head, otherwise try next time */
>  	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {

This patch is pretty broken. I doubt you want to call e1000_clean_tx_ring();
I think you want some variant of e1000_clean_tx_irq() :)
e1000_clean_tx_irq() takes adapter->tx_lock, which e1000_xmit_frame() also
does, so it will need some modification. And it should use E1000_DESC_UNUSED,
as Scott pointed out.

--
/Martin
From: Ray L. <ra...@ma...> - 2004-12-08 23:36:55
hello martin

On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
>
> Here's the patch, not much more tested (it still gives some transmit
> timeouts since it's scotts patch + prefetching and delayed TDT updating).
> And it's not cleaned up, but hey, that's development :)
>
> The delayed TDT updating was a test and currently it delays the first tx'd
> packet after a timerrun 1ms.
>
> Would be interesting to see what other people get with this thing.
> Lennert?

well, i'm brand new to gig ethernet, but i have access to some nice
hardware right now, so i decided to give your patch a try. this is the
average tx pps of 10 pktgen runs for each packet size:

60  1187589.1
64   601805.4
68  1115029.3
72   593096.4
76  1097761.1
80   587125.4
84  1098045.2
88   588159.1
92  1072124.8
96   582510.3
100 1008056.8
104  577898.0
108  946974.0
112  573719.2
116  892871.0
120  573072.5
124  844608.3
128  563685.7

any idea why the packet rates are cut in half for every other line?

pktgen is running with eth0 bound to CPU0 on this box:

NexGate NSA 2040G
Dual Xeon 3.06 GHz, HT enabled
1 GB PC3200 DDR SDRAM
Dual 82544EI
 - on PCI-X 64 bit 133 MHz bus
 - behind P64H2 bridge
 - on hub channel D of E7501 chipset

thanks

--
Ray L <ra...@ma...>
From: jamal <ha...@cy...> - 2004-12-06 13:11:51
Hopefully someone will beat me to testing to see if our forwarding
capacity now goes up with this new recipe.

cheers,
jamal

On Mon, 2004-12-06 at 07:30, Martin Josefsson wrote:
> On Mon, 6 Dec 2004, Lennert Buytenhek wrote:
>
> > > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > > Which of your (or Martins) boards has PCI-X?
> >
> > I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
> > was running at 64/133 PCI-X.
>
> I don't have any motherboards with PCI-X so no :)
> I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop
> adapter) at 32/66, both are able to send at wirespeed.
>
> /Martin
From: Martin J. <ga...@wl...> - 2004-12-06 12:30:51
On Mon, 6 Dec 2004, Lennert Buytenhek wrote:
> > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > Which of your (or Martins) boards has PCI-X?
>
> I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
> was running at 64/133 PCI-X.

I don't have any motherboards with PCI-X so no :)
I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop
adapter) at 32/66, both are able to send at wirespeed.

/Martin
From: Lennert B. <bu...@wa...> - 2004-12-06 12:23:05
On Mon, Dec 06, 2004 at 07:20:43AM -0500, jamal wrote:
> > > Someone correct me if i am wrong - but does it appear as if all these
> > > changes are only useful on PCI but not PCI-X?
> >
> > They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
> > get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
> >
> > Martin gets the ~1Mpps number with just the tx rework, and even more
> > with TXDMAC=0 added in as well.
>
> Right, but so far when i scan the results all i see is PCI not PCI-X.
> Which of your (or Martins) boards has PCI-X?

I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
was running at 64/133 PCI-X.

--L
From: jamal <ha...@cy...> - 2004-12-06 12:20:55
On Mon, 2004-12-06 at 07:11, Lennert Buytenhek wrote:
> On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:
>
> > Someone correct me if i am wrong - but does it appear as if all these
> > changes are only useful on PCI but not PCI-X?
>
> They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
> get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
>
> Martin gets the ~1Mpps number with just the tx rework, and even more
> with TXDMAC=0 added in as well.

Right, but so far when i scan the results all i see is PCI not PCI-X.
Which of your (or Martins) boards has PCI-X?

cheers,
jamal
From: Lennert B. <bu...@wa...> - 2004-12-06 12:11:54
On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:
> Someone correct me if i am wrong - but does it appear as if all these
> changes are only useful on PCI but not PCI-X?

They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.

Martin gets the ~1Mpps number with just the tx rework, and even more
with TXDMAC=0 added in as well.

--L
From: jamal <ha...@cy...> - 2004-12-06 11:33:24
On Sun, 2004-12-05 at 12:54, Martin Josefsson wrote:
> On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
>
> > I've tested all packet sizes now, and delayed TDT updating once per jiffy
> > (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> > byte packets, and is hardly worth it for bigger packets.
>
> Maybe we can't see any real gains here now, I wonder if it has any effect
> if you have lots of nics on the same bus. I mean, in theory it saves a
> whole lot of traffic on the bus.

This sounds like really exciting stuff happening here over the weekend.
Scott, you had to leave Intel before giving us this tip? ;->

Someone correct me if i am wrong - but does it appear as if all these
changes are only useful on PCI but not PCI-X?

cheers,
jamal
From: Scott F. <sf...@po...> - 2004-12-06 01:20:53
On Sun, 2004-12-05 at 13:25, Lennert Buytenhek wrote:
> What your patch does is (correct me if I'm wrong):
> - Masking TXDW, effectively preventing it from delivering TXdone ints.
> - Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
>   the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
>   Delay Value'. What exactly does this?

A descriptor with IDE, when written back, starts the Tx delay timers
countdown. Never setting IDE means the Tx delay timers never expire.

> - Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
>   command field. Is this the equivalent of the TXdone interrupt?
>
> Just exactly which bit avoids the descriptor writeback?

As the name implies, Report Status (RS) instructs the controller to
indicate the status of the descriptor by doing a write-back (DMA) to the
descriptor memory. The only status we care about is the "done" indicator.
By reading TDH (Tx head), we can figure out where hardware is without
reading the status of each descriptor. Since we don't need status, we can
turn off RS.

> I'm also a bit worried that only freeing packets 1ms later will mess up
> socket accounting and such. Any ideas on that?

Well the timer solution is less than ideal, and any protocols that are
sensitive to getting Tx resources returned by the driver as quickly as
possible are not going to be happy. I don't know if 1ms is quick enough.
You could eliminate the timer by doing the cleanup first thing in
xmit_frame, but then you have two problems: 1) you might end up reading
TDH for each send, and that's going to be expensive; 2) calls to
xmit_frame might stop, leaving uncleaned work until xmit_frame is called
again.

-scott
From: Lennert B. <bu...@wa...> - 2004-12-05 21:26:02
On Sun, Dec 05, 2004 at 01:12:22PM -0800, Scott Feldman wrote:
> Would Martin or Lennert run these test for a longer duration so we can
> get some data, maybe adding in Rx. It could be that removing the Tx
> interrupts and descriptor write-backs, prefetching may be ok. I don't
> know. Intel?

What your patch does is (correct me if I'm wrong):
- Masking TXDW, effectively preventing it from delivering TXdone ints.
- Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
  the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
  Delay Value'. What exactly does this?
- Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
  command field. Is this the equivalent of the TXdone interrupt?

Just exactly which bit avoids the descriptor writeback?

I'm also a bit worried that only freeing packets 1ms later will mess up
socket accounting and such. Any ideas on that?

> Also, wouldn't it be great if someone wrote a document capturing all of
> the accumulated knowledge for future generations?

I'll volunteer for that.

--L
From: Scott F. <sf...@po...> - 2004-12-05 21:09:51
On Sun, 2004-12-05 at 07:03, Martin Josefsson wrote:
> BUT if I use the above + prefetching I get this:
>
> 60 1483890

Ok, proof that we can get to 1.4Mpps! That's the good news. The bad news
is that prefetching is potentially buggy, as pointed out in the freebsd
note. Buggy as in the controller may hang. Sorry, I don't have details on
what conditions are necessary to cause a hang.

Would Martin or Lennert run these tests for a longer duration so we can
get some data, maybe adding in Rx. It could be that with the Tx
interrupts and descriptor write-backs removed, prefetching may be ok. I
don't know. Intel?

Also, wouldn't it be great if someone wrote a document capturing all of
the accumulated knowledge for future generations?

-scott
From: Richard S. <ri...@sa...> - 2004-12-05 18:38:36
Martin Josefsson wrote:
> On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
>
> > > Just tested the 82540EM at 32/33 and it's a big difference.
> > >
> > > 60 350229
> > > 64 247037
> > > 68 219643
>
> [snip]
>
> > With or without prefetching? My 82540 in 32/33 mode gets on baseline
> > 2.6.9:
>
> With, will test without. I've always suspected that the 32bit bus on this
> motherboard is a bit slow.

Hi Martin,

The southbridge used on these boards has a known but poorly publicised
bug that results in a maximum throughput of around 25MB/s. I can dig out
links to a discussion of a couple of years ago, but certainly my
experience with a SCSI RAID bore this out - 24MB/s with the adapter in
the 32/33 bus and 100MB/s in the 64/66 bus.

Regards,
Richard
From: Lennert B. <bu...@wa...> - 2004-12-05 18:14:32
On Sun, Dec 05, 2004 at 06:38:05PM +0100, Martin Josefsson wrote:
> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 369 clocks
> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 11 clocks
> e1000: MMIO read took 477 clocks
> e1000: MMIO read took 316 clocks

Interesting. On a 1667MHz CPU, this is around ~0.28us per MMIO read in
the worst case. On my hardware (dual Xeon 2.4GHz), the best case I've
ever seen was ~0.83us. This alone can make a hell of a difference, esp.
for 60B packets.

--L
From: Lennert B. <bu...@wa...> - 2004-12-05 17:58:20
On Sun, Dec 05, 2004 at 05:48:34PM +0100, Martin Josefsson wrote:
> I tried using both ports on the 82546GB nic.
>
>        delay      nodelay
> 1CPU   1.95 Mpps  1.76 Mpps
> 2CPU   1.60 Mpps  1.44 Mpps

I get:

       delay    nodelay
1CPU   1837356  1837330
2CPU   2035060  1947424

So in your case using 2 CPUs degrades performance, in my case it
increases it. And TDT delaying/coalescing only improves performance when
using 2 CPUs, and even then only slightly (and only for <= 62B packets.)

--L
From: Martin J. <ga...@wl...> - 2004-12-05 17:54:14
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> I've tested all packet sizes now, and delayed TDT updating once per jiffy
> (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> byte packets, and is hardly worth it for bigger packets.

Maybe we can't see any real gains here now, I wonder if it has any effect
if you have lots of nics on the same bus. I mean, in theory it saves a
whole lot of traffic on the bus.

/Martin
From: Lennert B. <bu...@wa...> - 2004-12-05 17:51:42
On Sun, Dec 05, 2004 at 06:44:01PM +0100, Lennert Buytenhek wrote:
> On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
>
> > The delayed TDT updating was a test and currently it delays the first tx'd
> > packet after a timerrun 1ms.
> >
> > Would be interesting to see what other people get with this thing.
> > Lennert?
>
> I took Scott's notxints patch, added the prefetch bits and moved the
> TDT updating to e1000_clean_tx as you did.
>
> Slightly better than before, but not much:

I've tested all packet sizes now, and delayed TDT updating once per jiffy
(instead of once per packet) indeed gives about 25kpps more on 60,61,62
byte packets, and is hardly worth it for bigger packets.

--L
From: Lennert B. <bu...@wa...> - 2004-12-05 17:44:05
On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
> The delayed TDT updating was a test and currently it delays the first tx'd
> packet after a timerrun 1ms.
>
> Would be interesting to see what other people get with this thing.
> Lennert?

I took Scott's notxints patch, added the prefetch bits and moved the
TDT updating to e1000_clean_tx as you did.

Slightly better than before, but not much:

60 1070157
61 1066610
62 1062088
63  991447
64  991546
65  991537
66  991449
67  990857
68  989882
69  991347

Regular TDT updating:

60 1037469
61 1038425
62 1037393
63  993143
64  992156
65  993137
66  992203
67  992165
68  992185
69  988249

--L
From: Martin J. <ga...@wl...> - 2004-12-05 17:38:11
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> > -#define E1000_READ_REG(a, reg) ( \
> > -	readl((a)->hw_addr + \
> > -	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> > +#define E1000_READ_REG(a, reg) ({ \
> > +	unsigned long s, e, d, v; \
> > +\
> > +	(a)->mmio_reads++; \
> > +	rdtsc(s, d); \
> > +	v = readl((a)->hw_addr + \
> > +	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> > +	rdtsc(e, d); \
> > +	e -= s; \
> > +	printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> > +	printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> > +	dump_stack(); \
> > +	v; \
> > +})
> >
> > You might want to disable the stack dump of course.
>
> Will test this in a while.

It gives pretty varied results. This is during a pktgen run. The machine
is an Athlon MP 2000+ which operates at 1667 MHz.

e1000: MMIO read took 481 clocks
e1000: MMIO read took 369 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 477 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 332 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 372 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 388 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 485 clocks
e1000: MMIO read took 317 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 409 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 334 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 505 clocks
e1000: MMIO read took 359 clocks
e1000: MMIO read took 484 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 464 clocks
e1000: MMIO read took 504 clocks

/Martin
From: Martin J. <ga...@wl...> - 2004-12-05 17:12:03
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> > Just tested the 82540EM at 32/33 and it's a big difference.
> >
> > 60 350229
> > 64 247037
> > 68 219643

[snip]

> With or without prefetching? My 82540 in 32/33 mode gets on baseline
> 2.6.9:

With, will test without. I've always suspected that the 32bit bus on this
motherboard is a bit slow.

> Your lspci output seems to suggest there is another PCI bridge in
> between (00:10.0)

Yes, it sits between the 32bit and the 64bit bus.

> Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
> 'Memory Controller Hub' and P64H2 the PCI-X bridge chip.

I don't have PCI-X (unless 64/66 counts as PCI-X, which I highly doubt).

> > I have no idea how expensive an MMIO read is on this machine, do you have
> > an relatively easy way to find out?
>
> A dirty way, yes ;-) Open up e1000_osdep.h and do:
>
> -#define E1000_READ_REG(a, reg) ( \
> -	readl((a)->hw_addr + \
> -	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> +#define E1000_READ_REG(a, reg) ({ \
> +	unsigned long s, e, d, v; \
> +\
> +	(a)->mmio_reads++; \
> +	rdtsc(s, d); \
> +	v = readl((a)->hw_addr + \
> +	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> +	rdtsc(e, d); \
> +	e -= s; \
> +	printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> +	printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> +	dump_stack(); \
> +	v; \
> +})
>
> You might want to disable the stack dump of course.

Will test this in a while.

/Martin
From: Martin J. <ga...@wl...> - 2004-12-05 17:01:48
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> I removed the delayed TDT updating and gave it a go again (this is scott +
> prefetching):
>
> 60 1486193
> 64 1267639
> 68 1259682

Yet another mail, I hope you are using a NAPI-enabled MUA :)

This time I tried vanilla + prefetch and it gave pretty nice performance
as well:

60  1308047
64  1076044
68  1079377
72  1058993
76  1055708
80  1025659
84  1024692
88  1024236
92  1024510
96  1012853
100 1007925
104  976500
108  947061
112  919169
116  892804
120  868084
124  844609
128  822381

Large gap between 60 and 64 byte, maybe the prefetching only prefetches
32 bytes at a time?

As a reference, here's a completely vanilla e1000 driver:

60  860931
64  772949
68  754738
72  754200
76  756093
80  756398
84  742111
88  738120
92  740426
96  739720
100 722322
104 729287
108 719312
112 723171
116 705551
120 704843
124 704622
128 665863

/Martin
From: Lennert B. <bu...@wa...> - 2004-12-05 17:00:15
On Sun, Dec 05, 2004 at 04:30:47PM +0100, Martin Josefsson wrote:
> > I verified that I get the same results on a small whimpy 82540EM
> > that runs at 32/66 as well. Just about to see what I get at 32/33
> > with that card.
>
> Just tested the 82540EM at 32/33 and it's a big difference.
>
> 60 350229
> 64 247037
> 68 219643
> 72 218205
> 76 216786
> 80 215386
> 84 214003
> 88 212638
> 92 211291
> 96 210004
> 100 208647
> 104 182461
> 108 181468
> 112 180453
> 116 179482
> 120 185472
> 124 188336
> 128 153743

With or without prefetching? My 82540 in 32/33 mode gets on baseline
2.6.9:

60 431967
61 431311
62 431927
63 427827
64 427482

And with Scott's notxints patch:

60 514496
61 514493
62 514754
63 504629
64 504123

> Sorry, forgot to answer your other questions, I'm a bit excited at the
> moment :)

Makes sense :)

> The 64/66 bus on this motherboard is directly connected to the
> northbridge.

Your lspci output seems to suggest there is another PCI bridge in
between (00:10.0)

Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
'Memory Controller Hub' and P64H2 the PCI-X bridge chip.

> I have no idea how expensive an MMIO read is on this machine, do you have
> an relatively easy way to find out?

A dirty way, yes ;-) Open up e1000_osdep.h and do:

-#define E1000_READ_REG(a, reg) ( \
-	readl((a)->hw_addr + \
-	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
+#define E1000_READ_REG(a, reg) ({ \
+	unsigned long s, e, d, v; \
+\
+	(a)->mmio_reads++; \
+	rdtsc(s, d); \
+	v = readl((a)->hw_addr + \
+	(((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
+	rdtsc(e, d); \
+	e -= s; \
+	printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
+	printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
+	dump_stack(); \
+	v; \
+})

You might want to disable the stack dump of course.

--L
From: Martin J. <ga...@wl...> - 2004-12-05 16:48:42
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> The delayed TDT updating was a test and currently it delays the first tx'd
> packet after a timerrun 1ms.

I removed the delayed TDT updating and gave it a go again (this is scott +
prefetching):

60  1486193
64  1267639
68  1259682
72  1243997
76  1243989
80  1153608
84  1123813
88  1115047
92  1076636
96  1040792
100 1007252
104  975806
108  946263
112  918456
116  892227
120  867477
124  844052
128  821858

It gives slightly different results: 60 byte is ok, but then it falls a
lot down to 64 byte, and the curve seems a bit flatter. This should be
the same driver that Lennert got 1.03Mpps with. I get 1.03Mpps without
prefetching.

I tried using both ports on the 82546GB nic.

       delay      nodelay
1CPU   1.95 Mpps  1.76 Mpps
2CPU   1.60 Mpps  1.44 Mpps

All tests performed on an SMP kernel; the above mention of 1CPU vs 2CPU
just means how the two nics were bound to the cpus. And there's no
tx-interrupts at all due to Scott's patch.

/Martin
From: Martin J. <ga...@wl...> - 2004-12-05 15:42:42
|
On Sun, 5 Dec 2004, Martin Josefsson wrote: [snip] > BUT if I use the above + prefetching I get this: > > 60 1483890 [snip] > This is on one port of a 82546GB > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and > the nic is located in a 64/66 slot. > > I won't post any patch until I've tested some more and cleaned up a few > things. > > BTW, I also get some transmit timouts with Scotts patch sometimes, not > often but it does happen. Here's the patch, not much more tested (it still gives some transmit timeouts since it's scotts patch + prefetching and delayed TDT updating). And it's not cleaned up, but hey, that's development :) The delayed TDT updating was a test and currently it delays the first tx'd packet after a timerrun 1ms. Would be interesting to see what other people get with this thing. Lennert? diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h 2004-12-04 18:16:53.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h 2004-12-05 15:12:25.000000000 +0100 @@ -101,7 +101,7 @@ struct e1000_adapter; #define E1000_MAX_INTR 10 /* TX/RX descriptor defines */ -#define E1000_DEFAULT_TXD 256 +#define E1000_DEFAULT_TXD 4096 #define E1000_MAX_TXD 256 #define E1000_MIN_TXD 80 #define E1000_MAX_82544_TXD 4096 @@ -187,6 +187,7 @@ struct e1000_desc_ring { /* board specific private data structure */ struct e1000_adapter { + struct timer_list tx_cleanup_timer; struct timer_list tx_fifo_stall_timer; struct timer_list watchdog_timer; struct timer_list phy_info_timer; @@ -222,6 +223,7 @@ struct e1000_adapter { uint32_t tx_fifo_size; atomic_t tx_fifo_stall; boolean_t pcix_82544; + boolean_t tx_cleanup_scheduled; /* RX */ struct e1000_desc_ring rx_ring; diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h 
linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h 2004-12-04 18:16:53.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h 2004-12-05 15:37:50.000000000 +0100 @@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e /* This defines the bits that are set in the Interrupt Mask * Set/Read Register. Each bit is documented below: * o RXT0 = Receiver Timer Interrupt (ring 0) - * o TXDW = Transmit Descriptor Written Back * o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0) * o RXSEQ = Receive Sequence Error * o LSC = Link Status Change */ #define IMS_ENABLE_MASK ( \ E1000_IMS_RXT0 | \ - E1000_IMS_TXDW | \ E1000_IMS_RXDMT0 | \ E1000_IMS_RXSEQ | \ E1000_IMS_LSC) diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c --- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c 2004-12-05 14:59:19.000000000 +0100 +++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c 2004-12-05 15:40:11.000000000 +0100 @@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi static void e1000_irq_disable(struct e1000_adapter *adapter); static void e1000_irq_enable(struct e1000_adapter *adapter); static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs); -static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter); +static void e1000_clean_tx(unsigned long data); #ifdef CONFIG_E1000_NAPI static int e1000_clean(struct net_device *netdev, int *budget); static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter, @@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter e1000_irq_disable(adapter); free_irq(adapter->pdev->irq, netdev); + del_timer_sync(&adapter->tx_cleanup_timer); del_timer_sync(&adapter->tx_fifo_stall_timer); del_timer_sync(&adapter->watchdog_timer); del_timer_sync(&adapter->phy_info_timer); @@ -522,6 +523,10 @@ e1000_probe(struct pci_dev *pdev, 
 	e1000_get_bus_info(&adapter->hw);
 
+	init_timer(&adapter->tx_cleanup_timer);
+	adapter->tx_cleanup_timer.function = &e1000_clean_tx;
+	adapter->tx_cleanup_timer.data = (unsigned long) adapter;
+
 	init_timer(&adapter->tx_fifo_stall_timer);
 	adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
@@ -882,19 +887,16 @@ e1000_configure_tx(struct e1000_adapter
 	e1000_config_collision_dist(&adapter->hw);
 
 	/* Setup Transmit Descriptor Settings for eop descriptor */
-	adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP |
+	adapter->txd_cmd = E1000_TXD_CMD_EOP |
 		E1000_TXD_CMD_IFCS;
 
-	if(adapter->hw.mac_type < e1000_82543)
-		adapter->txd_cmd |= E1000_TXD_CMD_RPS;
-	else
-		adapter->txd_cmd |= E1000_TXD_CMD_RS;
-
 	/* Cache if we're 82544 running in PCI-X because we'll
 	 * need this to apply a workaround later in the send path. */
 	if(adapter->hw.mac_type == e1000_82544 &&
 	   adapter->hw.bus_type == e1000_bus_type_pcix)
 		adapter->pcix_82544 = 1;
+
+	E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);
 }
 
 /**
@@ -1707,7 +1709,7 @@ e1000_tx_queue(struct e1000_adapter *ada
 	wmb();
 
 	tx_ring->next_to_use = i;
-	E1000_WRITE_REG(&adapter->hw, TDT, i);
+	/* E1000_WRITE_REG(&adapter->hw, TDT, i); */
 }
 
 /**
@@ -1809,6 +1811,11 @@ e1000_xmit_frame(struct sk_buff *skb, st
 		return NETDEV_TX_LOCKED;
 	}
 
+	if(!adapter->tx_cleanup_scheduled) {
+		adapter->tx_cleanup_scheduled = TRUE;
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	}
+
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
 	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
@@ -1845,6 +1852,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	netdev->trans_start = jiffies;
 
 	spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
 	return NETDEV_TX_OK;
 }
 
@@ -2140,8 +2148,7 @@ e1000_intr(int irq, void *data, struct p
 	}
 #else
 	for(i = 0; i < E1000_MAX_INTR; i++)
-		if(unlikely(!e1000_clean_rx_irq(adapter) &
-		   !e1000_clean_tx_irq(adapter)))
+		if(unlikely(!e1000_clean_rx_irq(adapter)))
 			break;
 #endif
 
@@ -2159,18 +2166,15 @@ e1000_clean(struct net_device *netdev, i
 {
 	struct e1000_adapter *adapter = netdev->priv;
 	int work_to_do = min(*budget, netdev->quota);
-	int tx_cleaned;
 	int work_done = 0;
 
-	tx_cleaned = e1000_clean_tx_irq(adapter);
 	e1000_clean_rx_irq(adapter, &work_done, work_to_do);
 
 	*budget -= work_done;
 	netdev->quota -= work_done;
 
-	/* if no Rx and Tx cleanup work was done, exit the polling mode */
-	if(!tx_cleaned || (work_done < work_to_do) ||
-	   !netif_running(netdev)) {
+	/* if no Rx cleanup work was done, exit the polling mode */
+	if((work_done < work_to_do) || !netif_running(netdev)) {
 		netif_rx_complete(netdev);
 		e1000_irq_enable(adapter);
 		return 0;
@@ -2181,66 +2185,76 @@ e1000_clean(struct net_device *netdev, i
 #endif
 
 /**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
+ * e1000_clean_tx - Reclaim resources after transmit completes
+ * @data: timer callback data (board private structure)
 **/
-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter)
+static void
+e1000_clean_tx(unsigned long data)
 {
+	struct e1000_adapter *adapter = (struct e1000_adapter *)data;
 	struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
-	struct e1000_tx_desc *tx_desc, *eop_desc;
 	struct e1000_buffer *buffer_info;
-	unsigned int i, eop;
-	boolean_t cleaned = FALSE;
+	unsigned int i, next;
+	int size = 0, count = 0;
+	uint32_t tx_head;
 
-	i = tx_ring->next_to_clean;
-	eop = tx_ring->buffer_info[i].next_to_watch;
-	eop_desc = E1000_TX_DESC(*tx_ring, eop);
+	spin_lock(&adapter->tx_lock);
 
-	while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
-		for(cleaned = FALSE; !cleaned; ) {
-			tx_desc = E1000_TX_DESC(*tx_ring, i);
-			buffer_info = &tx_ring->buffer_info[i];
+	E1000_WRITE_REG(&adapter->hw, TDT, tx_ring->next_to_use);
 
-			if(likely(buffer_info->dma)) {
-				pci_unmap_page(pdev,
-					       buffer_info->dma,
-					       buffer_info->length,
-					       PCI_DMA_TODEVICE);
-				buffer_info->dma = 0;
-			}
+	tx_head = E1000_READ_REG(&adapter->hw, TDH);
 
-			if(buffer_info->skb) {
-				dev_kfree_skb_any(buffer_info->skb);
-				buffer_info->skb = NULL;
-			}
+	i = next = tx_ring->next_to_clean;
 
-			tx_desc->buffer_addr = 0;
-			tx_desc->lower.data = 0;
-			tx_desc->upper.data = 0;
+	while(i != tx_head) {
+		size++;
+		if(i == tx_ring->buffer_info[next].next_to_watch) {
+			count += size;
+			size = 0;
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+			next = i;
+		} else {
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+		}
+	}
 
-			cleaned = (i == eop);
-			if(unlikely(++i == tx_ring->count)) i = 0;
+	i = tx_ring->next_to_clean;
+	while(count--) {
+		buffer_info = &tx_ring->buffer_info[i];
+
+		if(likely(buffer_info->dma)) {
+			pci_unmap_page(pdev,
+				       buffer_info->dma,
+				       buffer_info->length,
+				       PCI_DMA_TODEVICE);
+			buffer_info->dma = 0;
 		}
-
-		eop = tx_ring->buffer_info[i].next_to_watch;
-		eop_desc = E1000_TX_DESC(*tx_ring, eop);
+
+		if(buffer_info->skb) {
+			dev_kfree_skb_any(buffer_info->skb);
+			buffer_info->skb = NULL;
+		}
+
+		if(unlikely(++i == tx_ring->count))
+			i = 0;
 	}
 
 	tx_ring->next_to_clean = i;
 
-	spin_lock(&adapter->tx_lock);
+	if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count)
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	else
+		adapter->tx_cleanup_scheduled = FALSE;
 
-	if(unlikely(cleaned && netif_queue_stopped(netdev) &&
-	    netif_carrier_ok(netdev)))
+	if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev)))
		netif_wake_queue(netdev);
 
 	spin_unlock(&adapter->tx_lock);
-
-	return cleaned;
 }
 
 /**

/Martin
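[Editor's note] The patch above stops reclaiming TX descriptors from the interrupt path: instead, a timer callback reads the hardware head pointer (TDH) and counts only descriptors belonging to fully completed packets (those whose EOP descriptor, recorded in next_to_watch, has been consumed), handling ring wrap-around. A minimal userspace sketch of that counting walk, using a stand-in ring struct rather than the real driver types:

```c
#include <assert.h>

#define RING_SIZE 8

/* Stand-in for the driver's tx_ring; next_to_watch[first_desc] holds
 * the index of that packet's last (EOP) descriptor, as in the patch. */
struct fake_ring {
	unsigned int next_to_clean;
	unsigned int count;                     /* descriptors in the ring */
	unsigned int next_to_watch[RING_SIZE];
};

/* How many descriptors between next_to_clean and the hardware head are
 * safe to clean -- only whole packets whose EOP descriptor the hardware
 * has already moved past, same two-pointer walk as e1000_clean_tx(). */
static int count_cleanable(const struct fake_ring *ring, unsigned int tx_head)
{
	unsigned int i = ring->next_to_clean;
	unsigned int next = i;              /* first descriptor of current packet */
	int size = 0, count = 0;

	while (i != tx_head) {
		size++;
		if (i == ring->next_to_watch[next]) {
			/* reached this packet's EOP descriptor: the whole
			 * packet is done, fold its descriptors into count */
			count += size;
			size = 0;
			if (++i == ring->count)
				i = 0;      /* wrap around the ring */
			next = i;
		} else {
			if (++i == ring->count)
				i = 0;
		}
	}
	/* any trailing partial packet (size > 0) is deliberately ignored */
	return count;
}
```

For example, with two queued packets occupying descriptors 0-1 and 2-4, a head of 5 lets both be cleaned (5 descriptors), while a head of 4 only lets the first one go (2 descriptors), since the second packet's EOP descriptor has not been consumed yet.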
From: Martin J. <ga...@wl...> - 2004-12-05 15:30:52
On Sun, 5 Dec 2004, Martin Josefsson wrote:

> > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > > the nic is located in a 64/66 slot.
> >
> > Hmmm. Funny you get this number even on 64/66. How many PCI bridges
> > between the CPUs and the NIC? Any idea how many cycles an MMIO read on
> > your hardware is?
>
> I verified that I get the same results on a small whimpy 82540EM that runs
> at 32/66 as well. Just about to see what I get at 32/33 with that card.

Just tested the 82540EM at 32/33 and it's a big difference.

 60 350229
 64 247037
 68 219643
 72 218205
 76 216786
 80 215386
 84 214003
 88 212638
 92 211291
 96 210004
100 208647
104 182461
108 181468
112 180453
116 179482
120 185472
124 188336
128 153743

Sorry, forgot to answer your other questions, I'm a bit excited at the
moment :)

The 64/66 bus on this motherboard is directly connected to the northbridge.

Here's the lspci output with the 82546GB nic attached to the 64/66 bus and
the 82540EM nic connected to the 32/33 bus that hangs off the southbridge:

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00:08.0 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:08.1 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
01:05.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 5598/6326 (rev 0b)
02:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
02:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
02:08.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)

And lspci -t:

-[00]-+-00.0
      +-01.0-[01]----05.0
      +-07.0
      +-07.1
      +-07.3
      +-08.0
      +-08.1
      \-10.0-[02]--+-05.0
                   +-06.0
                   \-08.0

I have no idea how expensive an MMIO read is on this machine, do you have
a relatively easy way to find out?

/Martin
From: Martin J. <ga...@wl...> - 2004-12-05 15:19:37
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:

> > 60 1483890
> >
> > [snip]
> >
> > Which is pretty nice :)
>
> Not just that, it's also wire speed GigE. Damn. Now we all have to go
> and upgrade to 10GbE cards, and I don't think my girlfriend would give me
> one of those for christmas.

Yes it is, and it's lovely to see.

You have to nerdify her so she sees the need for geeky hardware enough to
give you what you need :)

> > This is on one port of a 82546GB
> >
> > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > the nic is located in a 64/66 slot.
>
> Hmmm. Funny you get this number even on 64/66. How many PCI bridges
> between the CPUs and the NIC? Any idea how many cycles an MMIO read on
> your hardware is?

I verified that I get the same results on a small whimpy 82540EM that runs
at 32/66 as well. Just about to see what I get at 32/33 with that card.

/Martin
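[Editor's note] The "wire speed GigE" remark quoted above checks out arithmetically: a minimum-size Ethernet frame occupies 64 bytes on the wire plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap, 84 bytes (672 bit times) per packet, so gigabit line rate caps out just above the measured 1483890 packets/s. A quick sketch of the arithmetic:

```c
/* Theoretical packet-per-second ceiling for minimum-size frames on
 * gigabit Ethernet. */
#include <assert.h>

static long gige_min_frame_pps(void)
{
	const long line_rate_bps = 1000000000L; /* 1 Gbit/s */
	const long bytes_on_wire = 64 + 8 + 12; /* frame + preamble/SFD + IFG */

	return line_rate_bps / (bytes_on_wire * 8);
}
```

This gives 1488095 pps, so the measured 1483890 pps at a 60-byte payload is about 99.7% of line rate.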