Beyond the benchmarks: how DeepSeek-R1 and o1 perform in real-world tasks


DeepSeek-R1 has certainly created a lot of excitement and concern, especially for its OpenAI rival, o1. So we put them through a side-by-side comparison on a few simple data analysis and market research tasks.

To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond the benchmarks and see whether the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right pieces of data and performing simple jobs that would otherwise require substantial manual effort.

Both models are impressive but make mistakes when prompts lack specificity. o1 is slightly better at reasoning tasks, but R1's transparency gives it an advantage in the (inevitable) cases where it makes mistakes.

Here is a breakdown of a few of our experiments and links to the Perplexity pages where you can review the results yourself.

Calculating investment returns from the web

Our first test gauged whether the models could calculate returns on investment (ROI). We considered a scenario where a user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.

To accomplish this task, the model would have to pull Mag 7 price information for the first day of each month, split the monthly investment evenly across the stocks ($20 per stock), sum them up and calculate the portfolio value according to the value of the stocks on the current date.
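To make the arithmetic concrete, here is a minimal Python sketch of the calculation we were asking for. The price data is left as input parameters rather than real figures, since in the experiment the models had to retrieve it themselves; the ticker symbols and function names are our own illustration, not anything the models produced.

```python
# Sketch of the dollar-cost-averaging calculation: $140 invested on the
# first day of each month, split evenly across seven stocks ($20 each),
# then valued at the current prices.

MONTHLY_INVESTMENT = 140.0
TICKERS = ["GOOGL", "AMZN", "AAPL", "META", "MSFT", "NVDA", "TSLA"]
PER_STOCK = MONTHLY_INVESTMENT / len(TICKERS)  # $20 per stock per month

def portfolio_value(monthly_prices: dict[str, list[float]],
                    current_prices: dict[str, float]) -> float:
    """monthly_prices[ticker]: first-of-month prices, Jan-Dec 2024;
    current_prices[ticker]: the latest recorded price."""
    total = 0.0
    for ticker in TICKERS:
        # $20 buys PER_STOCK / price shares each month; sum the shares accumulated.
        shares = sum(PER_STOCK / price for price in monthly_prices[ticker])
        total += shares * current_prices[ticker]
    return total

def roi(final_value: float, months: int = 12) -> float:
    """Return on the total amount invested over the period."""
    invested = MONTHLY_INVESTMENT * months
    return (final_value - invested) / invested
```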

Both models failed at this task. o1 returned a list of stock prices for January 2024 and January 2025 along with a formula to calculate the portfolio value. However, it failed to calculate the correct values and basically said that there would be no ROI. R1, for its part, made the mistake of only investing in January 2024 and calculating the returns for January 2025.

o1's reasoning trace does not provide enough information

What was interesting, however, was the models' reasoning process. While o1 did not provide much detail on how it had reached its results, R1's reasoning trace showed that it did not have the right information because Perplexity's retrieval engine had failed to obtain the monthly stock price data (many retrieval-augmented generation applications fail not because of a lack of model ability but because of poor retrieval). This proved to be an important bit of feedback that led us to the next experiment.

R1's reasoning trace reveals that it is missing information

Reasoning over file content

We decided to run the same experiment as before, but instead of asking the model to retrieve the information from the web, we provided it in a text file. To do this, we copy-pasted monthly stock data for each stock from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table that contained the price for the first day of each month from January to December 2024, along with the last recorded price. The data was not cleaned, both to reduce the manual effort and to test whether the model could pick the right parts out of the data.
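For reference, this is roughly the parsing work the pasted file implied: pulling the price rows out of an uncleaned Yahoo! Finance HTML table. The file path and column names below are assumptions for illustration, and Yahoo's actual table layout may differ slightly.

```python
import pandas as pd

def monthly_closes(html_path: str) -> pd.Series:
    """Extract a date-indexed series of prices from a pasted Yahoo! Finance table."""
    with open(html_path, encoding="utf-8") as f:
        table = pd.read_html(f)[0]  # first <table> found in the file
    # Non-price rows (e.g. a "10:1 Stock Split" notice) coerce to NaN and are dropped;
    # thousands separators are stripped before conversion.
    table["Close"] = pd.to_numeric(
        table["Close"].astype(str).str.replace(",", ""), errors="coerce"
    )
    table = table.dropna(subset=["Close"])
    table["Date"] = pd.to_datetime(table["Date"])
    return table.set_index("Date")["Close"].sort_index()
```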

Again, both models failed to provide the right answer. o1 seemed to have extracted the data from the file, but suggested the calculation be done manually in a tool like Excel. Its reasoning trace was very vague and did not contain any useful information for troubleshooting the model. R1 also failed to provide an answer, but its reasoning trace contained a lot of useful information.

For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the right information. It had also been able to do the month-by-month calculation of the investments, sum them up and calculate the final value according to the latest stock price in the table. However, that final value remained stuck in its reasoning chain and never made it into the final answer. The model had also gotten confused by a row in the Nvidia table that marked the company's 10:1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio.
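As a rough sketch of the split-handling step that confused R1: if the prices in the table are not split-adjusted, shares bought before the split date have to be scaled by the split ratio before being valued at post-split prices. (Yahoo's historical prices are typically already split-adjusted, in which case the split row simply needs to be ignored rather than read as a price; the data structure below is our own illustration.)

```python
from datetime import date

SPLIT_DATE = date(2024, 6, 10)  # Nvidia's 10:1 stock split
SPLIT_RATIO = 10

def nvda_shares(purchases: list[tuple[date, float, float]]) -> float:
    """purchases: (purchase_date, dollars_invested, price_paid_per_share),
    with prices as they appeared at the time, i.e. not split-adjusted."""
    total = 0.0
    for bought_on, dollars, price in purchases:
        shares = dollars / price
        if bought_on < SPLIT_DATE:
            shares *= SPLIT_RATIO  # each pre-split share became 10 post-split shares
        total += shares
    return total  # multiply by the current price to value the position
```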

R1 hid the results in its reasoning trace along with information about where it went wrong

Again, the real differentiator was not the result itself, but the ability to investigate how the model arrived at its answer. In this case, R1 provided us with a better experience, allowing us to understand the model's limitations and how we could reformulate our prompt and format our data to get better results in the future.

Comparing data over the web

Another experiment we ran required the model to compare the stats of four leading NBA centers and determine which one had the best improvement in field goal percentage (FG%) from the 2022/2023 season to the 2023/2024 season. This task required the model to do multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who had only just entered the league as a rookie in 2023.

The retrieval for this prompt was much easier, since player stats are widely reported on the web and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it's Giannis, in case you were curious), although depending on the sources they used, their figures were a bit different. However, they did not realize that Wemby did not qualify for the comparison and gathered other stats from his time in the European league.
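The underlying comparison is simple once the right data is in hand; the subtlety is excluding a player whose 2022/23 numbers come from a different league. Here is a minimal sketch, using placeholder data structures of our own rather than real figures:

```python
def best_fg_improvement(stats: dict[str, dict[str, float | None]]) -> str:
    """stats[player] maps season labels to NBA FG% values; a missing or None
    2022-23 entry means the player had no NBA season that year."""
    improvements = {
        player: seasons["2023-24"] - seasons["2022-23"]
        for player, seasons in stats.items()
        if seasons.get("2022-23") is not None  # excludes the rookie/Euroleague case
    }
    # Return the player with the largest season-over-season FG% improvement.
    return max(improvements, key=improvements.get)
```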

In its answer, R1 provided a better breakdown of the results, with a comparison table along with links to the sources it used. The added context allowed us to correct the prompt. After we modified the prompt to specify that we were looking for FG% from NBA seasons, the model correctly ruled Wemby out of the results.

Adding a single word to the prompt made the difference in the result. This is something a human would implicitly know. Be as specific as possible in your prompt, and try to include information that a human would implicitly assume.

Final verdict

Reasoning models are powerful tools, but they still have a ways to go before they can be fully trusted with tasks, especially as other components of large language model (LLM) applications continue to evolve. From our experiments, both o1 and R1 can still make basic mistakes. Despite showing impressive results, they still need a bit of hand-holding to give accurate answers.

Ideally, a reasoning model should be able to tell the user when it lacks the information for the task. Alternatively, the model's reasoning trace should be able to guide users to better understand errors and correct their prompts to increase the accuracy and stability of the model's responses. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI's upcoming o3 series, will provide users with more visibility and control.
