Most Efficient Large Language Models for AI PC

This page is regularly updated to help you identify the best-performing LLMs on the Intel® Core™ Ultra processor family and AI PCs. The data was collected with OpenVINO 2024.4 and is current as of 20 November 2024.

The tables below list the key performance indicators for inference on built-in GPUs.
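
To make the measured workload concrete, here is a minimal sketch (not the benchmark script behind these tables) of running one of the listed models on the built-in GPU with the OpenVINO GenAI API. The model directory name is a hypothetical placeholder for a model already exported to OpenVINO IR format.

```python
# Minimal sketch of the measured workload: greedy text generation on the
# built-in GPU with OpenVINO GenAI. The model directory is a hypothetical
# placeholder for a model already exported to OpenVINO IR format.
import openvino_genai as ov_genai

model_dir = "tiny-llama-1.1b-chat-int4-ov"     # hypothetical local IR directory
pipe = ov_genai.LLMPipeline(model_dir, "GPU")  # "GPU" selects the integrated GPU

print(pipe.generate("What is an AI PC?", max_new_tokens=128))
```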

| Topology | Precision | Input Size | max rss memory (MB) | 1st latency (ms) | 2nd latency (ms) | 2nd tok/sec |
|---|---|---|---|---|---|---|
| opt-125m-gptq | INT4-MIXED | 32 | 833.1 | 15.6 | 3.9 | 256.4 |
| opt-125m-gptq | INT4-MIXED | 1024 | 955.9 | 553.8 | 4.8 | 208.3 |
| bloomz-560m | INT4-MIXED | 32 | 1457.5 | 48.5 | 11.1 | 90.1 |
| qwen2-0.5b | INT4-MIXED | 32 | 1167.8 | 95.7 | 11.5 | 87.0 |
| qwen2-0.5b | INT4-MIXED | 1024 | 1266 | 2330.3 | 12.7 | 78.7 |
| qwen2-0.5b | INT8-CW | 32 | 1496.3 | 90.5 | 12.8 | 78.1 |
| bloomz-560m | INT8-CW | 32 | 1724.2 | 84 | 13.9 | 71.9 |
| qwen2-0.5b | INT8-CW | 1024 | 1593 | 2370.7 | 14 | 71.4 |
| bloomz-560m | INT4-MIXED | 1024 | 1691 | 2005.3 | 15.2 | 65.8 |
| qwen2-0.5b | FP16 | 32 | 2989.8 | 94.6 | 15.9 | 62.9 |
| bloomz-560m | INT8-CW | 1024 | 1941 | 2343.4 | 16.1 | 62.1 |
| qwen2-0.5b | FP16 | 1024 | 3088.1 | 2376.8 | 17.4 | 57.5 |
| bloomz-560m | FP16 | 32 | 3857 | 86.7 | 17.5 | 57.1 |
| bloomz-560m | FP16 | 1024 | 4085.6 | 2373.4 | 19.8 | 50.5 |
| tiny-llama-1.1b-chat | INT4-MIXED | 32 | 1738.9 | 237.4 | 20 | 50.0 |
| tiny-llama-1.1b-chat | INT8-CW | 32 | 2471.2 | 224.6 | 22.6 | 44.2 |
| tiny-llama-1.1b-chat | INT4-MIXED | 1024 | 1929.3 | 5993 | 22.7 | 44.1 |
| tiny-llama-1.1b-chat | INT8-CW | 1024 | 2661.8 | 6238.8 | 25.2 | 39.7 |
| qwen2-1.5b | INT4-MIXED | 32 | 2429 | 312.8 | 28.4 | 35.2 |
| tiny-llama-1.1b-chat | FP16 | 32 | 4834.9 | 231.7 | 28.9 | 34.6 |
| tiny-llama-1.1b-chat | FP16 | 1024 | 5023.2 | 6191.5 | 31.7 | 31.5 |
| qwen2-1.5b | INT4-MIXED | 1024 | 2600.3 | 7597.3 | 31.8 | 31.4 |
| stablelm-3b-4e1t | INT4-MIXED | 32 | 3982.1 | 348.4 | 32.1 | 31.2 |
| qwen2-1.5b | INT8-CW | 32 | 3619 | 301 | 32.7 | 30.6 |
| qwen2-1.5b | INT8-CW | 1024 | 3790.3 | 7990.5 | 34.6 | 28.9 |
| stablelm-3b-4e1t | INT4-MIXED | 1023 | 4455.4 | 11963.2 | 39.2 | 25.5 |
| minicpm-1b-sft | INT4-MIXED | 31 | 5815.4 | 214.3 | 40.1 | 24.9 |
| qwen2-1.5b | FP16 | 32 | 7582.3 | 304.4 | 42.2 | 23.7 |
| minicpm-1b-sft | INT8-CW | 31 | 6609.6 | 210.6 | 43.3 | 23.1 |
| qwen2-1.5b | FP16 | 1024 | 7753.4 | 7915.3 | 44.2 | 22.6 |
| gemma-2b-it | INT4-MIXED | 32 | 3728.2 | 523 | 46.2 | 21.6 |
| stable-zephyr-3b-dpo | INT4-MIXED | 32 | 3689.3 | 656.5 | 47.4 | 21.1 |
| gemma-2b-it | INT4-MIXED | 1024 | 4207.3 | 11867.9 | 47.5 | 21.1 |
| minicpm-1b-sft | FP16 | 31 | 8999.8 | 222.2 | 49.1 | 20.4 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 32 | 3448.1 | 1028.9 | 49.6 | 20.2 |
| dolly-v2-3b | INT4-MIXED | 32 | 3448.4 | 714.8 | 49.9 | 20.0 |
| gemma-2b-it | INT8-CW | 32 | 5423.2 | 488.8 | 51 | 19.6 |
| gemma-2b-it | INT8-CW | 1024 | 5902.7 | 12434.4 | 52.3 | 19.1 |
| stable-zephyr-3b-dpo | INT8-CW | 32 | 5630.3 | 694.5 | 54.4 | 18.4 |
| phi-2 | INT4-MIXED | 32 | 3732.9 | 723.2 | 54.5 | 18.3 |
| phi-2 | INT8-CW | 32 | 5600.4 | 747 | 55.7 | 18.0 |
| dolly-v2-3b | INT8-CW | 32 | 5589.7 | 1009.8 | 55.9 | 17.9 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 32 | 5590.1 | 698.9 | 55.9 | 17.9 |
| stablelm-3b-4e1t | INT8-CW | 32 | 5630.1 | 660.7 | 56.1 | 17.8 |
| dolly-v2-3b | INT4-MIXED | 1024 | 3984.5 | 15502.8 | 56.5 | 17.7 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 1023 | 3915.6 | 15363.9 | 56.6 | 17.7 |
| llama-2-7b-gptq | INT4-MIXED | 32 | 8618.5 | 782.9 | 56.9 | 17.6 |
| phi-2 | INT4-MIXED | 1024 | 4251.3 | 15317 | 61 | 16.4 |
| phi-2 | INT8-CW | 1024 | 6119.4 | 15886.6 | 62 | 16.1 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 1023 | 6056.9 | 15984.9 | 62.2 | 16.1 |
| dolly-v2-3b | INT8-CW | 1024 | 6124.9 | 16099.7 | 62.5 | 16.0 |
| stablelm-3b-4e1t | INT8-CW | 1023 | 6097.1 | 16206.9 | 62.5 | 16.0 |
| gemma-2b-it | FP16 | 32 | 12208.2 | 501.4 | 65.5 | 15.3 |
| llama-3-8b | INT4-MIXED | 33 | 8741.2 | 869 | 65.7 | 15.2 |
| llama-2-7b-gptq | INT4-MIXED | 1024 | 9468.1 | 26350.7 | 66.1 | 15.1 |
| qwen-7b-chat-gptq | INT4-MIXED | 32 | 8561 | 773.7 | 67 | 14.9 |
| gemma-2b-it | FP16 | 1024 | 12687.8 | 12168.7 | 67.1 | 14.9 |
| mistral-7b-v0.1 | INT4-MIXED | 32 | 8588.7 | 1020.6 | 67.4 | 14.8 |
| llama-2-7b-chat-hf | INT4-MIXED | 32 | 8626.8 | 1100 | 69.4 | 14.4 |
| phi-2 | FP16 | 32 | 11385.9 | 693.8 | 70.2 | 14.2 |
| dolly-v2-3b | FP16 | 32 | 11359 | 688.5 | 70.5 | 14.2 |
| stable-zephyr-3b-dpo | FP16 | 32 | 11432.9 | 648.5 | 70.6 | 14.2 |
| red-pajama-incite-chat-3b-v1 | FP16 | 32 | 11364 | 692.4 | 70.7 | 14.1 |
| stablelm-3b-4e1t | FP16 | 32 | 11432.6 | 649 | 71.1 | 14.1 |
| llama-3-8b | INT4-MIXED | 1025 | 9254.8 | 29700.3 | 71.9 | 13.9 |
| mistral-7b-v0.1 | INT4-MIXED | 1024 | 9121.9 | 29492.9 | 73.3 | 13.6 |
| phi-3-mini-4k-instruct | INT8-CW | 32 | 7646.1 | 952.6 | 75.7 | 13.2 |
| qwen-7b-chat-gptq | INT4-MIXED | 1024 | 10458.7 | 29022.2 | 75.9 | 13.2 |
| zephyr-7b-beta | INT4-MIXED | 32 | 9217.5 | 1196.6 | 76.2 | 13.1 |
| phi-2 | FP16 | 1024 | 11902.2 | 15868 | 77 | 13.0 |
| dolly-v2-3b | FP16 | 1024 | 11892.5 | 15987.1 | 77.1 | 13.0 |
| baichuan2-7b-chat | INT4-MIXED | 32 | 9440.3 | 1118.1 | 77.3 | 12.9 |
| red-pajama-incite-chat-3b-v1 | FP16 | 1023 | 11829.1 | 16008.7 | 77.3 | 12.9 |
| stablelm-3b-4e1t | FP16 | 1023 | 11897.5 | 16030 | 77.7 | 12.9 |
| phi-3-mini-4k-instruct | INT4-MIXED | 32 | 4961.9 | 968.8 | 78.2 | 12.8 |
| llama-2-7b-chat-hf | INT4-MIXED | 1024 | 9478.1 | 28958.6 | 78.6 | 12.7 |
| zephyr-7b-beta | INT4-MIXED | 1024 | 9764.2 | 30982 | 82.3 | 12.2 |
| phi-3-mini-4k-instruct | INT8-CW | 1024 | 8255.7 | 23200.5 | 83.1 | 12.0 |
| phi-3-mini-4k-instruct | INT4-MIXED | 1024 | 5570.2 | 22277.1 | 85.7 | 11.7 |
| baichuan2-7b-chat | INT4-MIXED | 1024 | 10305.2 | 29010 | 86.4 | 11.6 |
| phi-3-mini-4k-instruct | FP16 | 32 | 15292.6 | 934.7 | 96.4 | 10.4 |
| qwen-7b-chat | INT4-MIXED | 32 | 10964.7 | 1413 | 97.8 | 10.2 |
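
The Precision column refers to the weight format of the exported model: INT4-MIXED is 4-bit mixed-precision weight compression (part of the weights stays at higher precision), INT8-CW is 8-bit channel-wise weight compression, and FP16 keeps half-precision weights. The following is a hedged sketch of producing such variants with Optimum Intel; the compression parameters shown are illustrative assumptions, not the exact settings used for these tables.

```python
# Hedged sketch: exporting a model to the weight formats listed in the Precision
# column with Optimum Intel. The compression parameters are illustrative
# assumptions, not the exact configuration used to produce the tables.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# INT4 mixed-precision weight compression (part of the weights kept at 8 bit).
int4_cfg = OVWeightQuantizationConfig(bits=4, ratio=0.8, group_size=128)  # assumed values
OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=int4_cfg
).save_pretrained("tiny-llama-1.1b-chat-int4")

# INT8 channel-wise weight compression.
OVModelForCausalLM.from_pretrained(
    model_id, export=True, load_in_8bit=True
).save_pretrained("tiny-llama-1.1b-chat-int8")
```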

| Topology | Precision | Input Size | max rss memory (MB) | 1st latency (ms) | 2nd latency (ms) | 2nd tok/sec |
|---|---|---|---|---|---|---|
| opt-125m-gptq | INT4-MIXED | 32 | 1150.2 | 35.1 | 8.2 | 122.0 |
| opt-125m-gptq | INT4-MIXED | 1024 | 1228 | 67 | 8.2 | 122.0 |
| qwen2-0.5b | INT4-MIXED | 1024 | 1596.2 | 83.6 | 14.4 | 69.4 |
| qwen2-0.5b | INT4-MIXED | 32 | 1675.6 | 63.6 | 14.9 | 67.1 |
| qwen2-0.5b | INT8-CW | 32 | 1857.5 | 56.9 | 15 | 66.7 |
| qwen2-0.5b | INT8-CW | 1024 | 1663.5 | 87 | 15 | 66.7 |
| bloomz-560m | INT8-CW | 32 | 1761.1 | 62.4 | 15.1 | 66.2 |
| tiny-llama-1.1b-chat | INT4-MIXED | 1024 | 1687.9 | 158.7 | 15.3 | 65.4 |
| bloomz-560m | INT4-MIXED | 32 | 1894.2 | 40.1 | 15.4 | 64.9 |
| tiny-llama-1.1b-chat | INT4-MIXED | 32 | 1833 | 74.5 | 15.7 | 63.7 |
| bloomz-560m | INT8-CW | 1024 | 1689.2 | 146.2 | 15.8 | 63.3 |
| bloomz-560m | INT4-MIXED | 1024 | 1791 | 150.1 | 16.4 | 61.0 |
| tiny-llama-1.1b-chat | INT8-CW | 32 | 2132.3 | 35.6 | 18.1 | 55.2 |
| bloomz-560m | FP16 | 32 | 2395 | 36 | 18.4 | 54.3 |
| tiny-llama-1.1b-chat | INT8-CW | 1024 | 1986.4 | 149.3 | 19.2 | 52.1 |
| bloomz-560m | FP16 | 1024 | 2344.4 | 157.4 | 19.3 | 51.8 |
| qwen2-1.5b | INT4-MIXED | 1024 | 2175.1 | 184.9 | 20.4 | 49.0 |
| qwen2-1.5b | INT4-MIXED | 32 | 2066.2 | 94.9 | 20.6 | 48.5 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 32 | 2599.8 | 118.1 | 25 | 40.0 |
| qwen2-1.5b | INT8-CW | 32 | 2377.4 | 83.3 | 25.1 | 39.8 |
| qwen2-1.5b | INT8-CW | 1024 | 2483.3 | 189.6 | 25.3 | 39.5 |
| gemma-2b-it | INT4-MIXED | 32 | 2594.3 | 181.4 | 26.1 | 38.3 |
| phi-2 | INT4-MIXED | 32 | 2912.4 | 77.7 | 26.8 | 37.3 |
| gemma-2b-it | INT4-MIXED | 1024 | 2594.4 | 248.2 | 26.9 | 37.2 |
| dolly-v2-3b | INT4-MIXED | 32 | 2610.3 | 141.3 | 27 | 37.0 |
| stable-zephyr-3b-dpo | INT4-MIXED | 32 | 2956.2 | 149.2 | 27.4 | 36.5 |
| minicpm-1b-sft | INT4-MIXED | 31 | 2625.8 | 159.2 | 28.1 | 35.6 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 1023 | 3069.7 | 413.5 | 28.2 | 35.5 |
| minicpm-1b-sft | INT8-CW | 31 | 2868.2 | 74.1 | 28.9 | 34.6 |
| dolly-v2-3b | INT4-MIXED | 1024 | 3081.5 | 386 | 29.4 | 34.0 |
| phi-2 | INT4-MIXED | 1024 | 3136.2 | 340 | 29.6 | 33.8 |
| stablelm-3b-4e1t | INT4-MIXED | 32 | 3035.9 | 150.5 | 30.6 | 32.7 |
| phi-3-mini-4k-instruct | INT4-MIXED | 32 | 3373.2 | 57.9 | 32.6 | 30.7 |
| stablelm-3b-4e1t | INT4-MIXED | 1023 | 3296.5 | 456.2 | 34.4 | 29.1 |
| phi-3-mini-4k-instruct | INT4-MIXED | 1024 | 3707.1 | 432 | 36.1 | 27.7 |
| gemma-2b-it | INT8-CW | 32 | 3370.5 | 203.8 | 36.6 | 27.3 |
| minicpm-1b-sft | FP16 | 31 | 3679.6 | 80.6 | 36.9 | 27.1 |
| gemma-2b-it | INT8-CW | 1024 | 3503.2 | 258.5 | 37.9 | 26.4 |
| dolly-v2-3b | INT8-CW | 32 | 3893.3 | 142.9 | 39.4 | 25.4 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 32 | 3760.7 | 117.2 | 39.4 | 25.4 |
| phi-2 | INT8-CW | 32 | 3765.6 | 121 | 39.7 | 25.2 |
| stablelm-3b-4e1t | INT8-CW | 32 | 3641.2 | 123 | 39.9 | 25.1 |
| stable-zephyr-3b-dpo | INT8-CW | 32 | 3743.3 | 120.1 | 39.9 | 25.1 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 1023 | 4083.1 | 422.9 | 41.9 | 23.9 |
| dolly-v2-3b | INT8-CW | 1024 | 4211.5 | 384.1 | 42.2 | 23.7 |
| phi-2 | INT8-CW | 1024 | 4096.8 | 367.2 | 42.5 | 23.5 |
| stablelm-3b-4e1t | INT8-CW | 1023 | 4086.6 | 459.9 | 43.5 | 23.0 |
| llama-2-7b-gptq | INT4-MIXED | 32 | 4754.8 | 75.1 | 46.2 | 21.6 |
| codegen25-7b | INT4-MIXED | 32 | 4738.5 | 74.9 | 46.9 | 21.3 |
| gpt-j-6b | INT4-MIXED | 32 | 4506.5 | 221.4 | 47.3 | 21.1 |
| decilm-7b-instruct | INT4-MIXED | 36 | 4794.9 | 199.3 | 48.5 | 20.6 |
| qwen-7b-chat-gptq | INT4-MIXED | 32 | 5615.8 | 100.5 | 49.8 | 20.1 |
| falcon-7b-instruct | INT4-MIXED | 32 | 4738 | 79.9 | 50.7 | 19.7 |
| phi-3-mini-4k-instruct | INT8-CW | 32 | 4589.9 | 83 | 50.8 | 19.7 |
| llama-2-7b-gptq | INT4-MIXED | 1024 | 5246 | 640 | 52.1 | 19.2 |
| llama-3-8b | INT4-MIXED | 33 | 5475.8 | 114.7 | 52.2 | 19.2 |
| codegen25-7b | INT4-MIXED | 1024 | 5241.9 | 643.7 | 52.5 | 19.0 |
| mistral-7b-v0.1 | INT4-MIXED | 32 | 5015.3 | 94.6 | 52.6 | 19.0 |
| qwen2-7b | INT4-MIXED | 32 | 5330.7 | 86.3 | 52.7 | 19.0 |
| gpt-j-6b | INT4-MIXED | 1024 | 4926.5 | 867.2 | 53.2 | 18.8 |
| llama-2-7b-chat-hf | INT4-MIXED | 32 | 5100.7 | 78.7 | 54.2 | 18.5 |
| llama-3-8b | INT4-MIXED | 33 | 5527.1 | 114.9 | 54.3 | 18.4 |
| phi-3-mini-4k-instruct | INT8-CW | 1024 | 4959.2 | 450.6 | 54.6 | 18.3 |
| falcon-7b-instruct | INT4-MIXED | 1024 | 4863.4 | 660.5 | 54.9 | 18.2 |
| qwen2-7b | INT4-MIXED | 1024 | 5375.4 | 659.8 | 55.4 | 18.1 |
| mistral-7b-v0.1 | INT4-MIXED | 1024 | 5286.8 | 662.8 | 55.6 | 18.0 |
| llama-3-8b | INT4-MIXED | 1025 | 5601 | 992.5 | 56.1 | 17.8 |
| llama-3-8b | INT4-MIXED | 1025 | 5646.8 | 1047.1 | 56.7 | 17.6 |
| baichuan2-7b-chat | INT4-MIXED | 32 | 5913.7 | 86.5 | 57.2 | 17.5 |
| zephyr-7b-beta | INT4-MIXED | 32 | 5339.7 | 88.5 | 58.2 | 17.2 |
| qwen-7b-chat-gptq | INT4-MIXED | 1024 | 6315.8 | 664.2 | 60.1 | 16.6 |
| glm-4-9b-chat | INT4-MIXED | 32 | 6349.7 | 86.5 | 60.5 | 16.5 |
| llama-2-7b-chat-hf | INT4-MIXED | 1024 | 5592.7 | 856.8 | 60.9 | 16.4 |
| zephyr-7b-beta | INT4-MIXED | 1024 | 5459.1 | 898.6 | 61.6 | 16.2 |
| baichuan2-7b-chat | INT4-MIXED | 1024 | 6410.3 | 942.2 | 63.5 | 15.7 |
| gemma-7b-it | INT4-MIXED | 32 | 5816.3 | 104.5 | 63.5 | 15.7 |
| glm-4-9b-chat | INT4-MIXED | 1024 | 6368.8 | 1128.2 | 63.8 | 15.7 |
| llama-3.1-8b | INT4-MIXED | 32 | 6315.3 | 97.4 | 65 | 15.4 |
| llama-3.1-8b | INT4-MIXED | 1024 | 6421.8 | 902.9 | 68.2 | 14.7 |
| gemma-7b-it | INT4-MIXED | 1024 | 6233.2 | 1052.7 | 68.7 | 14.6 |
| qwen-7b-chat | INT4-MIXED | 32 | 7320.5 | 132.3 | 68.8 | 14.5 |
| red-pajama-incite-chat-3b-v1 | FP16 | 32 | 6318.9 | 79.2 | 70.7 | 14.1 |
| phi-2 | FP16 | 32 | 6330.2 | 83.2 | 70.8 | 14.1 |
| dolly-v2-3b | FP16 | 32 | 6327.2 | 92.7 | 71.9 | 13.9 |
| stable-zephyr-3b-dpo | FP16 | 32 | 6356.4 | 79.8 | 72.2 | 13.9 |
| stablelm-3b-4e1t | FP16 | 32 | 6261.9 | 74.6 | 72.6 | 13.8 |
| phi-2 | FP16 | 1024 | 6654.4 | 379.3 | 73.9 | 13.5 |
| red-pajama-incite-chat-3b-v1 | FP16 | 1023 | 6640.3 | 442.6 | 74.4 | 13.4 |
| dolly-v2-3b | FP16 | 1024 | 6653.9 | 441.9 | 74.9 | 13.4 |
| qwen-7b-chat | INT4-MIXED | 1024 | 7814.1 | 909.4 | 75.5 | 13.2 |
| stablelm-3b-4e1t | FP16 | 1023 | 6575.3 | 449.5 | 75.8 | 13.2 |
| falcon-7b-instruct | INT8-CW | 32 | 7487.6 | 109.4 | 84.3 | 11.9 |
| gpt-j-6b | INT8-CW | 32 | 6918.7 | 185.3 | 85.3 | 11.7 |
| llama-2-7b-chat-hf | INT8-CW | 32 | 7494.7 | 110.6 | 87.9 | 11.4 |
| qwen2-7b | INT8-CW | 32 | 8177.7 | 117.8 | 88.2 | 11.3 |
| falcon-7b-instruct | INT8-CW | 1024 | 7621.2 | 675.4 | 88.3 | 11.3 |
| codegen25-7b | INT8-CW | 32 | 7582.1 | 114.6 | 89 | 11.2 |
| qwen2-7b | INT8-CW | 1024 | 8226.2 | 842 | 90.4 | 11.1 |
| gpt-j-6b | INT8-CW | 1024 | 7353.1 | 1093.9 | 90.8 | 11.0 |
| phi-3-medium-4k-instruct | INT4-MIXED | 38 | 8184.1 | 270.2 | 90.8 | 11.0 |
| qwen-7b-chat | INT8-CW | 32 | 9223.8 | 138.4 | 91.3 | 11.0 |
| baichuan2-7b-chat | INT8-CW | 32 | 8188.4 | 122.9 | 91.8 | 10.9 |
| phi-3-mini-4k-instruct | FP16 | 32 | 8311.5 | 98.2 | 92 | 10.9 |
| llama-2-7b-chat-hf | INT8-CW | 1024 | 7984.3 | 874.9 | 92.8 | 10.8 |
| mistral-7b-v0.1 | INT8-CW | 32 | 7908.6 | 116.3 | 93.1 | 10.7 |
| baichuan2-13b-chat | INT4-MIXED | 32 | 10016.5 | 165.7 | 93.2 | 10.7 |
| zephyr-7b-beta | INT8-CW | 32 | 7812.6 | 117 | 93.4 | 10.7 |
| codegen25-7b | INT8-CW | 1024 | 8074.3 | 870.2 | 94 | 10.6 |
| decilm-7b-instruct | INT8-CW | 36 | 7885.2 | 181.4 | 94.9 | 10.5 |
| mistral-7b-v0.1 | INT8-CW | 1024 | 8023.7 | 906.4 | 95.7 | 10.4 |
| zephyr-7b-beta | INT8-CW | 1024 | 7930.8 | 915.2 | 96.3 | 10.4 |
| phi-3-medium-4k-instruct | INT4-MIXED | 1061 | 8384.5 | 2225.7 | 96.7 | 10.3 |
| baichuan2-7b-chat | INT8-CW | 1024 | 8678.3 | 956.7 | 96.8 | 10.3 |
| llama-3.1-8b | INT8-CW | 32 | 8615.4 | 121.6 | 97.7 | 10.2 |
| llama-3-8b | INT8-CW | 33 | 8615.1 | 131.3 | 97.7 | 10.2 |
| phi-3-mini-4k-instruct | FP16 | 1024 | 8695.2 | 509 | 99.9 | 10.0 |

| Topology | Precision | Input Size | max rss memory (MB) | 1st latency (ms) | 2nd latency (ms) | 2nd tok/sec |
|---|---|---|---|---|---|---|
| opt-125m-gptq | INT4-MIXED | 32 | 1116 | 25.8 | 8.1 | 123.5 |
| opt-125m-gptq | INT4-MIXED | 1024 | 1187.1 | 75.2 | 8.2 | 122.0 |
| qwen2-0.5b | INT4-MIXED | 32 | 1587.4 | 45.1 | 15.4 | 64.9 |
| qwen2-0.5b | INT4-MIXED | 1024 | 1587.8 | 228.2 | 15.6 | 64.1 |
| tiny-llama-1.1b-chat | INT4-MIXED | 32 | 1704.2 | 42.4 | 17.6 | 56.8 |
| tiny-llama-1.1b-chat | INT4-MIXED | 1024 | 1616.3 | 489.2 | 18.9 | 52.9 |
| qwen2-0.5b | INT8-CW | 32 | 1477.3 | 51.5 | 20.2 | 49.5 |
| qwen2-0.5b | INT8-CW | 1024 | 1592 | 263.7 | 20.6 | 48.5 |
| tiny-llama-1.1b-chat | INT8-CW | 32 | 1855.6 | 60.2 | 20.7 | 48.3 |
| tiny-llama-1.1b-chat | INT8-CW | 1024 | 1992.6 | 618.2 | 21.7 | 46.1 |
| qwen2-1.5b | INT4-MIXED | 32 | 2024.2 | 59.6 | 23.1 | 43.3 |
| bloomz-560m | FP16 | 1024 | 2773.1 | 647.8 | 23.8 | 42.0 |
| qwen2-1.5b | INT4-MIXED | 1024 | 2177.7 | 577.4 | 23.8 | 42.0 |
| bloomz-560m | FP16 | 32 | 2582.7 | 44.2 | 25.1 | 39.8 |
| dolly-v2-3b | INT4-MIXED | 32 | 2507.9 | 79.8 | 29.4 | 34.0 |
| phi-2 | INT4-MIXED | 32 | 2568.9 | 74.6 | 29.7 | 33.7 |
| qwen2-1.5b | INT8-CW | 32 | 2577.3 | 81.6 | 30.5 | 32.8 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 32 | 2489.4 | 69.9 | 30.5 | 32.8 |
| minicpm-1b-sft | INT4-MIXED | 31 | 2442.1 | 84.7 | 31 | 32.3 |
| qwen2-1.5b | INT8-CW | 1024 | 2739.8 | 773.3 | 31.2 | 32.1 |
| gemma-2b-it | INT4-MIXED | 32 | 2998.2 | 103.5 | 31.4 | 31.8 |
| dolly-v2-3b | INT4-MIXED | 1024 | 2508.1 | 1396.6 | 32 | 31.3 |
| gemma-2b-it | INT4-MIXED | 1024 | 3171.5 | 822.3 | 32.2 | 31.1 |
| phi-2 | INT4-MIXED | 1024 | 2940.5 | 1395.3 | 32.2 | 31.1 |
| red-pajama-incite-chat-3b-v1 | INT4-MIXED | 1023 | 2489.6 | 1435.5 | 33.1 | 30.2 |
| minicpm-1b-sft | INT8-CW | 31 | 2818.6 | 86.9 | 33.4 | 29.9 |
| stable-zephyr-3b-dpo | INT4-MIXED | 32 | 2638.2 | 87.4 | 33.8 | 29.6 |
| stablelm-3b-4e1t | INT4-MIXED | 32 | 2750.5 | 89.4 | 35.6 | 28.1 |
| stablelm-3b-4e1t | INT4-MIXED | 1023 | 3115.5 | 1473.1 | 38.1 | 26.2 |
| phi-3-mini-4k-instruct | INT4-MIXED | 32 | 3039.1 | 109.2 | 40.4 | 24.8 |
| phi-2 | INT8-CW | 32 | 3599.7 | 107.5 | 42.1 | 23.8 |
| gemma-2b-it | INT8-CW | 32 | 3845.4 | 111.3 | 42.2 | 23.7 |
| dolly-v2-3b | INT8-CW | 32 | 3596.4 | 110.1 | 42.5 | 23.5 |
| gemma-2b-it | INT8-CW | 1024 | 3844.6 | 1183 | 43 | 23.3 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 32 | 3590 | 111 | 43.3 | 23.1 |
| phi-3-mini-4k-instruct | INT4-MIXED | 1024 | 3467.6 | 1721.6 | 43.5 | 23.0 |
| stablelm-3b-4e1t | INT8-CW | 32 | 3582.8 | 111 | 44.3 | 22.6 |
| stable-zephyr-3b-dpo | INT8-CW | 32 | 3607.2 | 110.2 | 44.5 | 22.5 |
| phi-2 | INT8-CW | 1024 | 3982 | 1508 | 44.6 | 22.4 |
| dolly-v2-3b | INT8-CW | 1024 | 3596.5 | 1529.1 | 44.9 | 22.3 |
| minicpm-1b-sft | FP16 | 31 | 3769.9 | 84 | 45.4 | 22.0 |
| red-pajama-incite-chat-3b-v1 | INT8-CW | 1023 | 3952 | 2064.5 | 45.7 | 21.9 |
| stablelm-3b-4e1t | INT8-CW | 1023 | 3934.5 | 2286.3 | 46.8 | 21.4 |
| gpt-j-6b | INT4-MIXED | 32 | 4443.5 | 159.3 | 56.7 | 17.6 |
| phi-3-mini-4k-instruct | INT8-CW | 32 | 4545 | 117.1 | 57.6 | 17.4 |
| phi-3-mini-4k-instruct | INT8-CW | 1024 | 4810.4 | 2068.8 | 60.5 | 16.5 |
| gpt-j-6b | INT4-MIXED | 1024 | 4746.4 | 2397 | 60.6 | 16.5 |
| falcon-7b-instruct | INT4-MIXED | 32 | 5014 | 203.7 | 61.3 | 16.3 |
| qwen2-7b | INT4-MIXED | 32 | 5269.4 | 203.8 | 62.3 | 16.1 |
| codegen25-7b | INT4-MIXED | 32 | 4641.1 | 170.6 | 63.5 | 15.7 |
| llama-2-7b-gptq | INT4-MIXED | 32 | 4597.3 | 172.1 | 63.5 | 15.7 |
| falcon-7b-instruct | INT4-MIXED | 1024 | 5230.6 | 2695.3 | 63.6 | 15.7 |
| qwen2-7b | INT4-MIXED | 1024 | 5370.8 | 2505.9 | 63.9 | 15.6 |
| decilm-7b-instruct | INT4-MIXED | 36 | 4614.2 | 301.1 | 65.3 | 15.3 |
| codegen25-7b | INT4-MIXED | 1024 | 4641.9 | 2629.6 | 67.4 | 14.8 |
| llama-2-7b-gptq | INT4-MIXED | 1024 | 4928.1 | 2584.3 | 67.6 | 14.8 |
| mistral-7b-v0.1 | INT4-MIXED | 32 | 4928.5 | 180.9 | 69.2 | 14.5 |
| llama-2-7b-chat-hf | INT4-MIXED | 32 | 4985.7 | 160.3 | 69.5 | 14.4 |
| qwen-7b-chat-gptq | INT4-MIXED | 32 | 5426.7 | 188.3 | 69.5 | 14.4 |
| llama-3-8b | INT4-MIXED | 33 | 5473.4 | 285.7 | 70 | 14.3 |
| flan-t5-xxl | INT4-MIXED | 33 | 19293.8 | 211.7 | 70.1 | 14.3 |
| llama-3-8b | INT4-MIXED | 33 | 5389.2 | 281 | 70.8 | 14.1 |
| mistral-7b-v0.1 | INT4-MIXED | 1024 | 5225.4 | 2713.3 | 71.8 | 13.9 |
| zephyr-7b-beta | INT4-MIXED | 32 | 5306.1 | 177.9 | 72.1 | 13.9 |
| llama-3-8b | INT4-MIXED | 1025 | 5615.2 | 2937.8 | 72.4 | 13.8 |
| llama-3-8b | INT4-MIXED | 1025 | 5531.7 | 2815.4 | 73.2 | 13.7 |
| llama-2-7b-chat-hf | INT4-MIXED | 1024 | 5319.5 | 2736.2 | 73.6 | 13.6 |
| phi-2 | FP16 | 32 | 6197 | 104.6 | 74.7 | 13.4 |
| zephyr-7b-beta | INT4-MIXED | 1024 | 5306.4 | 2802.3 | 74.7 | 13.4 |
| qwen-7b-chat-gptq | INT4-MIXED | 1024 | 5934.9 | 2606.9 | 75 | 13.3 |
| dolly-v2-3b | FP16 | 32 | 6195.1 | 105.3 | 75.3 | 13.3 |
| baichuan2-7b-chat | INT4-MIXED | 32 | 5837.9 | 188.5 | 76.8 | 13.0 |
| red-pajama-incite-chat-3b-v1 | FP16 | 32 | 6178.6 | 118 | 76.8 | 13.0 |
| gemma-7b-it | INT4-MIXED | 32 | 6495.9 | 230.6 | 77 | 13.0 |
| stablelm-3b-4e1t | FP16 | 32 | 6174.2 | 105.9 | 77.1 | 13.0 |
| stable-zephyr-3b-dpo | FP16 | 32 | 6217.8 | 107.9 | 77.2 | 13.0 |
| glm-4-9b-chat | INT4-MIXED | 32 | 6333.4 | 225 | 77.3 | 12.9 |
| phi-2 | FP16 | 1024 | 6411.5 | 2065.2 | 77.3 | 12.9 |
| dolly-v2-3b | FP16 | 1024 | 6410.1 | 2075 | 77.7 | 12.9 |
| llama-3.1-8b | INT4-MIXED | 32 | 6324.6 | 182.2 | 78.8 | 12.7 |
| red-pajama-incite-chat-3b-v1 | FP16 | 1023 | 6394.2 | 2752.4 | 79.2 | 12.6 |
| stablelm-3b-4e1t | FP16 | 1023 | 6386.9 | 2953.3 | 79.5 | 12.6 |
| glm-4-9b-chat | INT4-MIXED | 1024 | 6439.5 | 3282.2 | 80 | 12.5 |
| baichuan2-7b-chat | INT4-MIXED | 1024 | 6174.1 | 2752.6 | 80.6 | 12.4 |
| gemma-7b-it | INT4-MIXED | 1024 | 6795.4 | 3118.3 | 80.6 | 12.4 |
| llama-3.1-8b | INT4-MIXED | 1024 | 6324.8 | 2865.7 | 81.3 | 12.3 |
| gpt-j-6b | INT8-CW | 32 | 6793.2 | 167.6 | 85 | 11.8 |
| qwen-7b-chat | INT4-MIXED | 32 | 7274.8 | 168.8 | 85.2 | 11.7 |
| gpt-j-6b | INT8-CW | 1024 | 6793.3 | 2668.4 | 88.8 | 11.3 |
| qwen-7b-chat | INT4-MIXED | 1024 | 7610.3 | 2991.9 | 90.6 | 11.0 |
| flan-t5-xxl | INT4-MIXED | 1139 | 23514 | 540.8 | 94.9 | 10.5 |
| falcon-7b-instruct | INT8-CW | 32 | 7764.1 | 181.3 | 95.5 | 10.5 |
| llama-2-7b-chat-hf | INT8-CW | 32 | 7330.9 | 172 | 96.1 | 10.4 |
| falcon-7b-instruct | INT8-CW | 1024 | 7987.4 | 3072.8 | 98.1 | 10.2 |
| qwen2-7b | INT8-CW | 32 | 8175.3 | 211.3 | 99.6 | 10.0 |

All models listed here were tested with the following parameters:

  • Framework: PyTorch

  • Beam: 1

  • Batch size: 1
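
The latency columns follow the usual definitions: 1st latency is the time to the first generated token, and 2nd latency is the average time per subsequent token (2nd tok/sec is its reciprocal). A minimal sketch of approximating such figures with an OpenVINO GenAI streamer callback, under the same greedy, batch-1 settings, follows; it is not the tooling used to produce the tables, and the model directory is a hypothetical placeholder.

```python
# Minimal sketch (assumptions: OpenVINO GenAI streamer callback, greedy decoding,
# batch size 1, hypothetical model directory) of how first-token and
# subsequent-token latency could be approximated. Not the official benchmark.
import time
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("tiny-llama-1.1b-chat-int4-ov", "GPU")

token_times = []

def streamer(subword: str) -> bool:
    token_times.append(time.perf_counter())  # timestamp every streamed chunk
    return False                              # False = keep generating

start = time.perf_counter()
pipe.generate("Explain what an AI PC is.", max_new_tokens=64, streamer=streamer)

first_ms = (token_times[0] - start) * 1000
second_ms = (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000
print(f"1st latency: {first_ms:.1f} ms, 2nd latency: {second_ms:.1f} ms "
      f"({1000 / second_ms:.1f} tok/s)")
```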