Three Conferences in Half a Month: The AI Battle Spreads to the Smartphone Sector
default / 2021-11-15
What New Changes Have Emerged in Mobile AI?
Compared with two years ago, smartphone manufacturers have largely stopped pouring money into large-parameter foundation models and are focusing instead on on-device multimodal models.
With user bases in the hundreds of millions, smartphone manufacturers have always been pioneers in exploring AI terminals.
Recently, ahead of their new smartphone launches, vivo, OPPO, and Honor held developer conferences in quick succession. AI was the most talked-about topic, and each manufacturer showcased an updated AI strategy and its own focus for applying model capabilities.
The questions on the outside world's mind: as the smart devices people use most in daily life, how far has AI on domestic smartphones come? What application scenarios exist, and what challenges remain to be solved?
AI Smartphones Enter the Era of On-Device Multimodality
Two years ago, mobile AI applications were concentrated on text processing, such as multi-turn conversation, summary generation, and copy continuation, all of which relied on cloud-side large models. A notable change this year is that, with the emergence of multimodal on-device models, a large number of image- and speech-processing scenarios have become practical.
vivo showcased 18 on-device intelligence applications, such as ID card and certificate recognition, automatic file-name filling, and an on-device UI Agent. Users can create notes in the memo app or log itemized expenses in the wallet with a single sentence. Compared with simply setting an alarm, these tasks involve more complex interaction logic and require intent recognition and autonomous planning.
OPPO focused on demonstrating two core features: One-Click Screen Query and One-Click Flash Note. Powered by a multimodal large model, One-Click Screen Query not only lets the AI understand on-screen content in real time but also lets users point the camera at real-world scenes and talk about them by voice. One-Click Flash Note automatically extracts key information and categorizes it. For example, when you make a purchase via WeChat Pay, the phone records the transaction in your bill without any manual operation; it can also complete bill entry by scanning a receipt. Pickup codes and similar information are displayed as small cards on the phone for real-time reminders.
Honor revealed that its smartphones can automatically execute over 3,000 scenarios covering clothing, food, accommodation, travel, and shopping, sparing users the hassle of frequent cross-app switching. For instance, one-click price comparison not only compares prices and adds products to the shopping cart but also collects coupons for you; one-click ride-hailing lets you ask the AI by voice to summon a ride directly. Tasks that previously required hopping between apps can now be completed with a single AI command.
"Judging from popular large models and agent products, we already have the technology to understand the physical world, or to accelerate the fusion of the physical and digital worlds," said Zhang Chong, General Manager of Honor's MagicOS AI Product Department. Objectively speaking, for smartphone manufacturers the digital world contains both naturally generated data and production data, which can be used to fine-tune models so they better understand what users need in a given context.
However, in the view of a mobile AI technology expert, "there is a certain mismatch between the progress of AI technology and user needs. Users’ most frequent AI use scenario is image processing, but for this generation of technology, language models have matured first." The expert predicts that image processing will likely reach a very high level of maturity next year.
Smartphone manufacturers' large models have gone through roughly three stages. Two years ago, both vivo and OPPO released full-size language models with parameters ranging from hundreds of millions to over 100 billion. A year ago, the industry's focus shifted from language models to multimodal fields such as speech and images, with greater emphasis on on-device deployment, accelerating the arrival of large models on smartphones. This year marks the third stage.

Several trends stand out this year. First, on-device models are concentrated around the lightweight 3B parameter class, with multimodality built on top of large language models.
For example, in July this year, Honor released MagicGUI, a 7B multimodal perception large model. vivo simultaneously launched BlueLM-2.5-3B, a 3B multimodal reasoning model that integrates language, vision, and logical reasoning capabilities on the device. In October, OPPO unveiled AndesVL, an on-device multimodal large model offering four size options ranging from 0.6B to 4B parameters. In addition to general multimodal recognition, understanding, and reasoning capabilities, it also features GUI capabilities and multilingual support.
The industry has rapidly reduced model size and memory overhead through low-bit hybrid quantization schemes and on-device LoRA training solutions, accelerating the deployment of on-device multimodal large models.
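For intuition about why bit width dominates the memory budget, here is a back-of-the-envelope sketch in Python. The arithmetic covers weights only and is purely illustrative; real hybrid quantization schemes mix bit widths per layer and add overhead for activations and the KV cache.

```python
# Rough memory footprint of model weights at different precisions.
# Illustrative arithmetic only; production schemes mix bit widths per
# layer and also need room for activations and runtime buffers.

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30

for params, label in [(7e9, "7B"), (3e9, "3B")]:
    for bits in (16, 8, 4):
        print(f"{label} model @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")

# A 7B model at 16-bit needs ~13 GB just for weights, while a 3B model
# quantized to 4 bits fits in ~1.4 GB -- small enough for a phone's RAM.
```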
An industry insider told Digital Intelligence Frontline that today's 3B models can already match the performance of earlier 8B models. Moreover, tasks that previously required stitching together multiple visual expert models and language models can now be handled by one model spanning several sizes and modalities, yielding higher recognition rates. For instance, vivo adopted a "1+N" architecture, in which multimodality, language, and logical reasoning share a single base model, each capability backed by its own LoRA adapter, enabling one model to support more than ten business scenarios.
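A minimal sketch of the "1+N" idea as described above: one frozen base layer shared by several lightweight LoRA adapters, one per capability. The layer sizes, rank, and task names here are illustrative assumptions, not vivo's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen shared linear layer plus a small trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # shared base stays frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# One shared base, N adapters: adding a task costs only the tiny
# A and B matrices instead of a whole new model.
base = nn.Linear(512, 512)
adapters = {task: LoRALinear(base) for task in ("summarize", "ocr", "plan")}
out = adapters["ocr"](torch.randn(1, 512))       # route by task
```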
Second, on-device models now support a deep-thinking mode for reasoning. Smartphones can perform complex reasoning locally, just as in the cloud, significantly improving accuracy on complex problems.
Third, the introduction of GUI Agent models allows AI to proactively control smartphone interfaces to complete tasks. Its essence is to simulate human operations on smartphones (such as tapping and swiping) without relying on rules, fixed scripts, or special APIs provided by app developers. This enables smartphone agents to operate third-party applications seamlessly.
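Conceptually, such a GUI agent runs a perceive-decide-act loop. The sketch below illustrates the idea; every function and field name is a hypothetical placeholder, not any vendor's API.

```python
# A minimal GUI-agent loop: perceive the screen, let a multimodal model
# propose the next UI action, then inject the gesture like a human finger.
# All names here are hypothetical placeholders, not a real vendor API.
import time

def run_gui_agent(goal: str, model, device, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        screenshot = device.capture_screen()              # perceive
        action = model.propose_action(goal, screenshot)   # decide
        if action.kind == "done":
            return True
        if action.kind == "tap":
            device.inject_tap(action.x, action.y)         # act
        elif action.kind == "swipe":
            device.inject_swipe(action.x0, action.y0, action.x1, action.y1)
        time.sleep(0.5)                                   # let the UI settle
    return False
```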
What Challenges Does the Implementation of On-Device Models Face?
Today's mobile AI assistants typically call different models for different tasks, including the manufacturers' own self-distilled models and high-quality external cloud-side large model services accessed via APIs. Alibaba's Tongyi Qianwen and ByteDance's Doubao are two of the most widely integrated services among smartphone manufacturers.
However, a smartphone industry insider told Digital Intelligence Frontline that there are many complications in calling external models: "Whether it’s Doubao or Alibaba, the APIs provided to smartphone manufacturers are not the same as their latest internal versions—they are at least 3 to 6 months behind." The insider added that within cloud service providers, the teams responsible for selling cloud services and developing models are separate.
Cloud providers package their internal capabilities into commercial products for sale, but model developers are also concerned that if smartphone manufacturers optimize using their own data, the results may outperform the original models. "It’s not that we don’t want to integrate them; it’s that they are reluctant to provide full access."
Compared with two years ago, smartphone manufacturers have largely stopped pouring money into large-parameter foundation models and are focusing instead on on-device multimodal models.
A mobile AI expert told Digital Intelligence Frontline that cloud-side models have been compressed substantially through MoE (mixture-of-experts) architectures, but on-device models are limited by chip performance, currently ranging from 2B to 5B parameters, roughly equivalent to the 32B-70B models of 2023. "Model providers pursue the upper limit of intelligence, while terminal manufacturers focus on compressing models for on-device deployment," the expert explained. "We don't do 0-to-1 foundation model training; small on-device models are essentially distillations of large cloud-side models."
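The distillation the expert describes typically follows the standard recipe: train the small student to match the softened output distribution of the large teacher. A minimal sketch, assuming a classification-style setup rather than any particular vendor's pipeline:

```python
# Standard logit distillation (Hinton et al.): the small on-device
# "student" learns from the softened outputs of the large cloud-side
# "teacher", blended with the ordinary hard-label loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients after softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```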
"Cloud-side capabilities are relatively easy to build now," said Zhou Wei, Dean of vivo AI Research Institute. "The real challenge lies in on-device capabilities."

Zhou Wei revealed that vivo developed 13B and 7B on-device models last year and found that only the 7B model was basically usable, and even then the experience was not ideal: it consumed too much memory, nearly 4GB of RAM. Over the past year, vivo has therefore focused on 3B on-device multimodal models. Today, the 3B model's text summarization reaches 97%-98% of the quality of cloud-side large models, which "is more than sufficient."
However, this does not mean smartphone manufacturers have abandoned large-parameter models; rather, they are dividing up the capabilities. "If most manufacturers are already addressing a problem, we choose to cooperate with them," a technical expert told Digital Intelligence Frontline. For example, smartphone makers will no longer iterate on models that merely expand world knowledge; instead, they focus on understanding the multi-dimensional data on the device and pursuing personalized intelligence.
Therefore, although most smartphone manufacturers currently adopt a cloud-device collaboration approach, the core effort clearly remains optimizing on-device models.
On one hand, every API call to a cloud-side large model costs money, and round-trip latency hurts user experience. On the other hand, privacy concerns limit what user data cloud-side models can use. On-device models, by contrast, incur no per-call cost beyond the demand for higher-performance chips and more storage, and processing data locally is more private and secure. These factors are the key drivers behind deploying large models on smartphones.
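In practice, cloud-device collaboration often boils down to a routing policy along exactly these axes: privacy first, then capability versus cost. A minimal sketch, with the thresholds and fields as illustrative assumptions:

```python
# A toy device-cloud router: prefer the free, private on-device model
# and fall back to the cloud only when a task exceeds its capability.
# The complexity score and 0.6 threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Task:
    complexity: float             # 0..1, estimated difficulty
    contains_private_data: bool

def route(task: Task) -> str:
    if task.contains_private_data:
        return "on_device"        # privacy: data never leaves the phone
    if task.complexity < 0.6:
        return "on_device"        # no per-call cost, lower latency
    return "cloud"                # pay per call for the hard cases
```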
The AI boom has brought some "sweet troubles" to smartphone manufacturers. With their massive user bases, frequent calls to cloud-side model services result in substantial cost expenditures. A mobile AI expert told Digital Intelligence Frontline that using an ASR model for real-time transcription and translation on smartphones costs up to 2 yuan per hour, which hardware manufacturers must bear.
In fact, apart from some chat products from the major tech companies, many professional AI tools on the market, such as PPT generation and in-depth research reports, already charge for use, and more are exploring paid models.
Furthermore, an industry insider lamented to Digital Intelligence Frontline that cloud service providers have little incentive to invest in on-device models, "because they mainly sell MaaS (Model-as-a-Service)". That leaves smartphone manufacturers to tackle the challenges of on-device models themselves.
However, a current problem is the lack of "hit" AI applications. Users’ perception of AI remains limited, and chip manufacturers are adopting a wait-and-see attitude.
"Chip manufacturers have been approaching us to find more flagship scenarios on smartphones," the insider said. Currently, the latest flagship chips from Qualcomm Snapdragon and MediaTek Dimensity already boast AI computing power of 100 TOPS. Chip manufacturers want to sell chips with stronger computing power, but without sufficient supporting applications, higher computing power translates to higher chip prices, which will ultimately impact sales.
The Agent Ecosystem Has Just Taken Off
Noticeably, current automated tasks such as "one-sentence photo editing," "one-sentence Wi-Fi connection," and "one-sentence expense tracking" are largely confined to manufacturers' first-party applications, such as notes and photo albums.
However, most of users’ usage scenarios involve third-party applications: "85% of usage time is spent on services provided by developers." This means the participation of leading internet companies remains a crucial link.
Zhou Wei mentioned that when mobile AI agents perform tasks today, they can only operate the manufacturer's own functions. To work across applications, terminal manufacturers and internet companies still need complex negotiations over security and authorization standards. "As terminal manufacturers, we must actively push for industry standards, and recognize that it will take several years for AI technology to mature from its current state."

As single agents evolve toward multi-agent collaboration, smartphone manufacturers—beyond launching their own agent applications—are actively building agent ecosystems.
For example, vivo has refined high-frequency, reusable system capabilities into a universal system-level agent, packaging features like screen perception and task planning into a "Universal Control Module Suite" for direct access by ecosystem partners. Through its agent development platform, vivo also provides a variety of on-device AI development capabilities to help partners build rich, scenario-specific agents.
OPPO, meanwhile, regards its agent ecosystem framework as one of the three core technical pillars of OPPO AI. This framework not only serves as the central platform for OPPO agents’ cross-device collaboration but also holds the key to upgrading AI agents from single-step execution to complex task planning and multi-device linkage.
Honor has launched a system-level MCP (Model Context Protocol) architecture, which now connects over 80% of high-frequency scenarios at the system level and integrates more than 4,000 ecosystem MCP services and agents. Beyond the software ecosystem, leveraging its geographic advantage in Shenzhen, Honor aims to build an AI hardware ecosystem to enable cross-device collaboration among agents.
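Honor's system-level implementation is proprietary, but the open Model Context Protocol gives a flavor of how a capability is exposed to agents. A minimal sketch using the open-source `mcp` Python SDK, with a hypothetical ride-hailing tool:

```python
# Exposing a capability over MCP with the open-source `mcp` Python SDK.
# The ride-hailing tool is a hypothetical stub; Honor's system-level MCP
# is proprietary and may differ from this open-protocol sketch.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ride-hailing-demo")

@mcp.tool()
def call_ride(pickup: str, destination: str) -> str:
    """Request a ride from pickup to destination (stub for illustration)."""
    return f"Ride booked from {pickup} to {destination}"

if __name__ == "__main__":
    mcp.run()   # agents discover and invoke call_ride via the protocol
```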
Compared with other terminal products, smartphone manufacturers have inherent advantages in building agent ecosystems. They possess massive cross-application, cross-scenario multimodal data, and smartphones can connect with other terminal devices to act as intelligent hubs.
Today, some internet companies have already reaped the benefits. For instance, Ant Group has established strategic cooperation with almost all major smartphone manufacturers, integrating its agent services into their ecosystems. vivo revealed that the traffic share of Ant’s AI health agent AQ in the health scenarios of Blue Heart Xiao V has tripled since the beginning of the year.
However, for most app developers, the agent ecosystem involves dilemmas around traffic distribution and data permissions. Many app manufacturers worry that system-level agents directly serving end-users will undermine the value of their apps. Additionally, as user data is currently controlled by individual apps, enterprises are concerned about whether they need to share this data for system-level agents to execute tasks.
Currently, the common industry practice is to develop GUI large models, a more moderate solution. In essence, it involves no direct agent-to-agent interaction; instead, the AI reproduces human operations on the phone's interface. Users still log into their own accounts, key steps require user confirmation, and the mobile agent merely acts as the user.
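That "agent acts as the user" pattern can be sketched as a confirmation gate around sensitive steps. The set of sensitive actions and the object fields below are illustrative assumptions:

```python
# Gate sensitive steps (payment, login, sending) on explicit user
# confirmation so the agent never exceeds what the user would do.
# The action kinds and helper signatures are illustrative assumptions.

SENSITIVE_KINDS = {"pay", "login", "send_message", "place_order"}

def execute_with_confirmation(action, device, ask_user) -> bool:
    if action.kind in SENSITIVE_KINDS:
        if not ask_user(f"Allow the assistant to '{action.kind}'?"):
            return False          # user keeps final say at key nodes
    device.perform(action)        # otherwise act on the user's behalf
    return True
```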
Zhou Wei of vivo echoed the views of many smartphone manufacturers: "First, those willing to cooperate with us can sit down and discuss joint initiatives. Second, as the AI era arrives, whether brand-new market positions and influence are needed is a question time will answer."